Working creatives across our research keep reaching for the same words to describe what they liked: alive, dynamic, distinctive, real.
When AI work has clear failures (broken hierarchy, unreadable type, visible artifacts), evaluators agree. It's straightforward. Beyond that, taste comes into play. Evaluators stop scoring against standards and start scoring against feeling.

The agreement gap
We measured this directly. Kendall's W (a measure of evaluator agreement) tracks the transition. In ad images, agreement on prompt adherence is high. Agreement on visual appeal is much lower. In brand assets, the gap is wider still. The same evaluators, looking at the same outputs, agree where the criteria are objective and disagree where the criteria are personal.

One brand designer evaluating four AI-generated brand visuals put it this way:
Honestly, I feel like all four images could be used as brand visuals. What made me choose some over others was the sense of life: some felt more dynamic, realistic, and human.
That sentence describes the entire problem with current AI evaluation.
What averaging destroys
Most benchmarks treat evaluator disagreement as noise. Adjudicate, vote, average it out. That works when there's a ground truth, but not for creative work. Mood, conceptual risk, aesthetic direction: the dimensions creatives care about most are precisely the dimensions where professionals legitimately disagree.

Models trained against averaged judgments collapse toward safe defaults. Multiple models given the same brief produce similar work. It's the predictable output of evaluation systems that flatten taste into a single quality score.
Two signals, not one score
The fix is structural: treat convergence and divergence as separate signals. Convergence captures best practices that models can and should learn (typography, CTA placement, hierarchy). Divergence captures the steerability that creative work depends on. Optimizing on one doesn't guarantee the other, because a model can be technically excellent and creatively flat.

If you're building creative tools, this is a product decision before it's a technical one.

