Benchmark·April 22, 2026·6 min read

Veo 3.1 is the "Creative Director" model. Use it early, but hand off before refinement.

61% win rate at ideation. 39% at refinement. The clearest model profile in our video evaluation.

01Dominates ideation: 61% win rate. Highest scores of any model in any phase.
02Collapses in refinement: 39% win rate, last place. Below the 3.0 neutral midpoint.
03Only model in the study that degraded monotonically across every metric.
04It keeps generating instead of iterating. Use it early, hand off before refinement.

Contra Labs

Research

Contra Labs ran Google Veo 3.1 through every phase of ad video production: ideation, mockup, refinement. The data produced the clearest model profile we've recorded.

Peak at ideation

At ideation, Google DeepMind's Veo 3.1 posted scalars of 3.81 / 3.81 / 3.78 across quality, creativity, and brand fit. Those are the highest scores of any model across any phase in our entire experiment. Win rate: 61%. Given an open brief and maximum creative latitude, it generated with range, tonal confidence, and visual variety. At a stage where breadth matters more than precision, it delivered exactly what that stage demands.

One flag even at its peak: Motion & Blur criticism showed up in roughly 27% of evaluations. Even the best phase for Veo carried a persistent motion quality weakness. Worth noting for video-heavy briefs.

Veo 3.1 monotonic performance decline across creative phases.

Cliff at refinement

By refinement, those scalars had dropped below 3.0, the neutral midpoint where outputs tip from useful to net-negative. Win rate: 39%. Last place. Evaluators were calling the outputs counterproductive.

Veo 3.1 theme-level strengths vs weaknesses across phases.

What makes this unusual is the shape of the decline. Veo 3.1 is the only model in the experiment that degraded monotonically across every single metric as phases progressed. Grok Imagine, Kling AI, and ByteDance's Seedance 2.0 each improved in at least some dimensions by refinement. Veo got worse across all of them.

Why: it keeps generating

The qualitative data points to the mechanism. Veo keeps generating. When asked to iterate on an existing cut, it introduces new visual artifacts and unwanted transitions rather than preserving what already worked. It reads an iteration prompt as a generation prompt. The more you ask it to tighten, the more it creates.

The properties that make a model strong at open-ended generation: novelty, creative latitude, output diversity. Those same properties work against constrained iteration, where the job is to hold a direction stable and sand down details toward production readiness.

The practical implication

Veo 3.1 performs like a creative director. You want it at the start of a project, generating raw concepts and setting visual direction. By mockup and refinement, hand off to a model built for iteration.

Knowing which model to use at each phase is now a real creative skill.

How we ran this study → Methodology

Continue reading3 studies

All research

The world's leading independent

human data & creative evaluation lab

Request partnership

Creative Human Data

Benchmark

Research

Datasets

Jobs

The world's leading independent

human data & creative evaluation lab

Request partnership

Creative Human Data

Research

Datasets

Jobs

The world's leading independent

human data & creative evaluation lab

Request partnership

Creative Human Data

Benchmark

Research

Datasets

Jobs

The world's leading independent

human data & creative evaluation lab

Request partnership

Creative Human Data

Research

Datasets

Jobs