Research·May 5, 2026·4 min read

Creatives keep telling us the same thing about AI: every output looks the same.

Across 12 models and 5 domains, evaluators in the Human Creativity Benchmark kept circling the same complaint: the work is technically fine, it just all looks the same.

Contra Labs

Working creatives across our research keep reaching for the same words to describe what they liked: alive, dynamic, distinctive, real.

When AI work has clear failures (broken hierarchy, unreadable type, visible artifacts), evaluators agree, and the judgment is straightforward. Beyond that point, taste comes into play: evaluators stop scoring against standards and start scoring against feeling.

Convergence and divergence as two interacting signals. Convergence rises as work approaches production. Divergence stays present where the question shifts to taste.

The agreement gap

We measured this directly. Kendall's W, a coefficient of concordance that runs from 0 (no agreement among evaluators) to 1 (complete agreement), tracks the transition. In ad images, agreement on prompt adherence is high and agreement on visual appeal is much lower. In brand assets, the gap is wider still. The same evaluators, looking at the same outputs, agree where the criteria are objective and disagree where the criteria are personal.
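For readers who want to see how such an agreement figure can be computed, here is a minimal sketch of Kendall's W with the standard correction for tied scores. The function names and the two small score tables are hypothetical and invented for illustration; they are not data from the benchmark, and this is not necessarily the exact pipeline used in the study.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores from 1..n (lowest score = rank 1); ties share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every item tied with the one at `pos`.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """ratings[j][i] is the score evaluator j gave to output i."""
    m = len(ratings)        # number of evaluators
    n = len(ratings[0])     # number of outputs
    ranked = [average_ranks(row) for row in ratings]
    totals = [sum(r[i] for r in ranked) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie correction: sum of (t^3 - t) over each evaluator's groups of tied scores.
    ties = sum(sum(c**3 - c for c in Counter(row).values()) for row in ratings)
    denom = m**2 * (n**3 - n) - m * ties
    return 12 * s / denom if denom else float("nan")


# Hypothetical 1-5 scores: three evaluators, four outputs (not benchmark data).
adherence = [[5, 4, 2, 1], [5, 3, 2, 1], [4, 5, 2, 1]]
appeal = [[5, 1, 3, 2], [1, 4, 2, 5], [3, 5, 1, 2]]

print(f"prompt adherence W = {kendalls_w(adherence):.2f}")
print(f"visual appeal W    = {kendalls_w(appeal):.2f}")
```

In this toy data the evaluators order the outputs almost identically on adherence and almost arbitrarily on appeal, so the first W lands near 1 and the second near 0.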

Same evaluators, same outputs. Agreement is high on objective criteria, much lower on subjective ones.

One brand designer evaluating four AI-generated brand visuals put it this way:

Honestly, I feel like all four images could be used as brand visuals. What made me choose some over others was the sense of life: some felt more dynamic, realistic, and human.

That sentence describes the entire problem with current AI evaluation.

What averaging destroys

Most benchmarks treat evaluator disagreement as noise. Adjudicate, vote, average it out. That works when there's a ground truth, but not for creative work. Mood, conceptual risk, aesthetic direction: the dimensions creatives care about most are precisely the dimensions where professionals legitimately disagree.


Models trained against averaged judgments collapse toward safe defaults. Multiple models given the same brief produce similar work. That sameness is the predictable output of evaluation systems that flatten taste into a single quality score.

Two signals, not one score

The fix is structural: treat convergence and divergence as separate signals. Convergence captures best practices that models can and should learn (typography, CTA placement, hierarchy). Divergence captures the steerability that creative work depends on. Optimizing for one doesn't guarantee the other, because a model can be technically excellent and creatively flat.
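One way to picture the difference is to report two numbers per criterion instead of one. The sketch below is an illustration under stated assumptions, not the benchmark's method: it uses the mean score as a stand-in for the convergence signal and the standard deviation across evaluators as a rough stand-in for divergence. The function name `two_signals` and the scores are hypothetical.

```python
from statistics import mean, pstdev


def two_signals(scores_by_criterion):
    """Return {criterion: (consensus, spread)} for one output.

    consensus = mean score across evaluators (what averaging keeps)
    spread    = standard deviation across evaluators (what averaging discards)
    """
    return {c: (mean(v), pstdev(v)) for c, v in scores_by_criterion.items()}


# Hypothetical 1-5 scores from four evaluators for a single ad image.
output_scores = {
    "hierarchy": [3, 4, 3, 4],  # objective criterion: evaluators stay close
    "mood": [1, 5, 3, 5],       # taste criterion: evaluators split
}

signals = two_signals(output_scores)
for criterion, (consensus, spread) in signals.items():
    print(f"{criterion}: consensus={consensus:.2f} spread={spread:.2f}")
```

Both criteria average to the same score here, so a single quality number would treat them as identical. Only the spread shows that evaluators agree on hierarchy and split on mood.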

Best-practice fit and steerability as orthogonal axes. Models cluster by where they earn their advantage: strong defaults, strong steerability, or one without the other.

If you're building creative tools, this is a product decision before it's a technical one.


Connecting with the missing signal: taste

Contra connects top creative minds with AI teams training models to understand taste. This is expert input, not crowd labor. It's the creative layer powering the next generation of AI.

Designers

Writers

Marketers

Engineers

Social Media Experts

Video Editors & Animators

Music & Audio Engineers

1.5M+

creative experts

400+

Skills and tools represented

$250M+

verified expert earnings
