By mid-2026, AI video had turned from a demo category into a production-tool race, with four frontier models launching in close succession: Seedance 2.0, Veo 3.1, Grok Imagine, and Adobe Firefly Video. Seedance 2.0 arrived with the most drama: within days of its release in China, Axios reported a Disney cease-and-desist letter. Veo 3.1 led with native, synchronized audio generation, paired with better camera and narrative control. Grok Imagine claimed many #1 leaderboard spots, while Adobe Firefly Video trained exclusively on licensed content and focused on commercial use.
To test which model generates the best professional-grade outputs, we recruited 12 professional video editors to compare 40 videos from the four models, also scoring them on prompt adherence, visual aesthetics, motion, and camera control. Our results showed that:
- Seedance 2.0 led almost every scalar dimension, but only narrowly. Grok Imagine and Veo 3.1 closely matched Seedance 2.0's performance, while Adobe Firefly Video trailed by a large margin.
- The top three models each performed best on different prompt types. Seedance 2.0 scored highest on dynamic cinematic motion, Grok Imagine on human subjects, and Veo 3.1 on low-motion product and mood shots.
- All models still showed physical glitches and realism issues. Even the strongest outputs exhibited various motion, continuity, and realism failures.
- Image references improved visual aesthetics, but reduced controllability. To prioritize the visual look and feel of a video, consider adding reference images to your prompts.
We evaluated four video models across 10 prompts designed to test prompt adherence, motion quality, camera shot quality, and visual aesthetics. Evaluators flagged physical glitches and realism issues in each output, then rated it on four scalar dimensions: prompt adherence, motion quality, camera shot quality, and visual aesthetics. Finally, they compared the outputs for each prompt in a tournament-style pairwise ranking.
Seedance 2.0 finished first overall, but the margin was narrow, and the best-performing model changed by prompt type. The expert reviews showed where current video generation models do well and where they still lag.
Seedance 2.0, Grok Imagine, and Veo 3.1 finished in a near-tie
Seedance 2.0 had the highest average score in the study, but it was not a dominant winner. Across 10 prompts, Seedance 2.0 won 4. Grok Imagine won 3. Veo 3.1 won 3. Adobe Firefly Video won 0.

That makes Seedance 2.0 the aggregate leader, but not a model that clearly separated from the rest of the field. Its advantage over Grok Imagine was only 0.07 points on a 5-point scale, and the prompt-level split among the top three models (4, 3, and 3 wins) was just as close.

The average score can make the race look cleaner than it was. Seedance 2.0, Grok Imagine, and Veo 3.1 finished within a fraction of a point of each other, so the top of the leaderboard reads as a near-tie rather than a decisive win.
Adobe Firefly Video trailed the field across every dimension
The dimension-level scalar results were more favorable to Seedance 2.0. It led in almost all measured dimensions; the only exception was Motion Quality, where it trailed Veo 3.1 by 0.02 points.

Meanwhile, Adobe Firefly Video trailed by a large margin in all categories. It did not win any of the 10 prompts and consistently lagged behind the other models. In this dataset, Adobe Firefly Video was not competitive with the top three frontier video models.

The ranking distribution tells a similar story. Seedance 2.0 and Grok Imagine reached nearly identical average win rates through opposite behaviors. Seedance 2.0 was the high-ceiling, more polarizing pick: it earned the most 1st-place finishes (4 of 10 prompts) and never finished last, but landed in 3rd place half the time, with little in between. Grok Imagine was the more consistent top-half performer, ranking 1st or 2nd in 80% of prompts, though it occasionally fell to 4th (2 of 10 prompts).
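The rank counts quoted above can be turned into a rough summary with a few lines of Python. This is a minimal sketch, not the study's actual computation. The full rank counts are reconstructed from the figures in the text (the 2nd-place count for Seedance 2.0 and the 1st/2nd split for Grok Imagine are inferred), the function names are hypothetical, and it assumes each prompt's ranking came from a four-way round robin, so a model ranked r on a prompt beat 4 - r opponents there.

```python
from fractions import Fraction

# Rank counts over the 10 prompts, as {rank: number_of_prompts}.
# Assumption: reconstructed from the figures quoted in the text, not raw data.
RANK_COUNTS = {
    "Seedance 2.0": {1: 4, 2: 1, 3: 5, 4: 0},
    "Grok Imagine": {1: 3, 2: 5, 3: 0, 4: 2},
}


def mean_rank(counts):
    """Average finishing position across all prompts."""
    total = sum(counts.values())
    return Fraction(sum(rank * n for rank, n in counts.items()), total)


def implied_win_rate(counts, n_models=4):
    """Share of head-to-head matchups won, assuming rank r beats n_models - r rivals."""
    total = sum(counts.values())
    wins = sum((n_models - rank) * n for rank, n in counts.items())
    return Fraction(wins, total * (n_models - 1))


for model, counts in RANK_COUNTS.items():
    print(model, float(mean_rank(counts)), round(float(implied_win_rate(counts)), 3))
```

Under these assumptions both models land on a mean rank of 2.1 and an implied win rate of 19/30 (about 63%), despite very different rank distributions. The study's reported win rates may have been computed differently, so they need not match this reconstruction exactly.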
Image references improved visual aesthetics, but reduced controllability
Adding reference images to prompts improved visual aesthetics, but reduced controllability. Average visual aesthetics increased from 3.04 to 3.47, a gain of 0.43 points. But prompt adherence fell from 3.07 to 2.67, a drop of 0.40 points. The reference image helped the outputs look better, but it also made the models less responsive to the written prompt.
| Dimension | Text only | Text + image | Change |
|---|---|---|---|
| Visual Aesthetics | 3.04 | 3.47 | +0.43 |
| Prompt Adherence | 3.07 | 2.67 | -0.40 |
That tradeoff showed up at the model level too. Three of four models got worse. Seedance 2.0, the text-to-video leader, fell the most, dropping by 0.72. The only model that improved was Adobe Firefly Video, which gained 0.57. The reference image appears to improve surface-level aesthetics while making the model harder to steer.

For Adobe Firefly Video, adding images to the prompt noticeably improved the Elo rating of the model's output. This suggests image references can raise output quality for a model that is further from the frontier.
All models still suffered from physical glitches and realism issues
All four models produced physical glitches and realism issues, even in their strongest outputs. When asked to flag these issues, most participants found multiple problems in every generated video.
The two clips below were among the best-rated in the study, yet evaluators still flagged multiple issues in each. For the first clip, the comments included (representative, not comprehensive):
- Camera movement (0:03-0:08): 'slight upward camera movement, which would not happen in a news programme either.'
- Hands (0:04-0:06): 'The pen on his right hand glitches and distorts as he moves his hands.'
- Body and edges (0:03-0:05): 'Very slight edge issues with his shoulders, but nothing someone like me (a VFX sup) would notice.'
The second clip, despite topping its prompt, drew the same kind of physics complaints (representative examples, not comprehensive):
- Duplicate vehicle (0:01.967): 'On the left-hand side, a duplicate black car right behind the original one.'
- Traffic (full video): 'All cars are going at the same time, causing wrecks;' 'No visible signal or lights changing for traffic.'
- Vehicle glitch (0:01-0:07): 'A black car entering from the left side of the intersection appears unusually elongated.'
- Pedestrian glitch (0:06-0:08): 'several instances where pedestrians on the sidewalks merge into one another.'
This was the shared failure mode across the benchmark. Even though the models have improved on overall scene, mood, and visual style, they still break down in the details that make a video physically believable. Physical realism remains a major weakness even for the best models. If you are a creative professional generating videos with AI, make sure you plan for manual VFX cleanup on hands and props, or keep clips short to limit drift.
Without a clear winner, the models showed specialization
Because the top three models were close, the clearer pattern is specialization by prompt type. Seedance 2.0 performed best on dynamic, high-energy cinematic shots. Veo 3.1 and Grok Imagine performed well on static or slower-moving shots. Grok Imagine specialized in human details, while Veo 3.1 worked well for product shots and object showcases.
Specifically, Seedance 2.0 was strongest when the prompt required dynamic action or cinematic motion. It performed best on the alley chase, night market, and wizard prompts, and was the least-bad option on traffic.
Grok Imagine stood out on human performance, performing best on the barista prompt. Its strength in this dataset was in scenes centered on people's facial expressions and body movement.
Veo 3.1 performed best on lower-motion, mood-driven, product-oriented prompts. It won the mug, sneaker, and candle prompts, where lighting, object presentation, and atmosphere mattered more than high-energy motion.
Notably, Veo 3.1 was the lone exception to Seedance 2.0's lead. It received a slightly higher scalar rating than Seedance 2.0 on Motion Quality and tied it on Visual Aesthetics. That complicates the dimension-level story: the three frontier models performed closer than the aggregate leaderboard makes it look.
The main takeaway is that the best model depended on the type of video being generated. Seedance 2.0 had the strongest average, but the prompt-level winners showed that model choice should change by workflow pattern.
Conclusion
Seedance 2.0 was the strongest overall model in this video head-to-head study, but it was not a runaway winner.
It had the highest average score and led in almost all measured dimensions. But across 10 prompts, it won only 4. Grok Imagine and Veo 3.1 each won 3. Seedance 2.0's margin over Grok Imagine was only 0.07 points, and its head-to-head pattern against Veo 3.1 was effectively split.
The more useful conclusion is that the models specialize. Seedance 2.0 was strongest for high-energy cinematic motion. Grok Imagine was strongest for human performance and identity-heavy scenes. Veo 3.1 was strongest for low-motion product, mood, and lighting shots. Adobe Firefly Video trailed the field in this dataset.
The benchmark also shows that AI video is still constrained by physical glitches, realism issues, and controllability tradeoffs. Image references improved aesthetics, but reduced prompt adherence.
The takeaway: Seedance 2.0, Grok Imagine, and Veo 3.1 are effectively neck and neck, and the best model depends on the shot.
Methodology
Twelve professional video editors sourced from the Contra Labs network evaluated 40 AI-generated video outputs across 10 prompts. The prompt set was designed to cover a range of creative video use cases: dynamic cinematic motion, human performance, product and object shots, mood-driven scenes, stylized visual effects, and physically complex scenes where temporal consistency and realism are difficult to maintain.
The study compared four video generation models: Seedance 2.0, Grok Imagine, Veo 3.1, and Adobe Firefly Video. Each prompt was run through all four models, producing four outputs per prompt. Five prompts were text-to-video tasks, generated from a written prompt only. Five prompts were image-to-video tasks, generated from a reference image plus a written prompt. For the image-to-video condition, reference images were generated and iterated using ChatGPT Image until they met the quality bar needed for the evaluation. Outputs were presented blind as Output A-D, so evaluators judged the videos without seeing which model produced each clip.
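One way to set up the blind Output A-D presentation is to shuffle the model order independently for each prompt and keep the answer key away from evaluators. The sketch below is illustrative only. The study does not say how labels were assigned, so the per-prompt shuffle, the fixed seed, and the `blind_assignment` helper are all assumptions.

```python
import random

MODELS = ["Seedance 2.0", "Grok Imagine", "Veo 3.1", "Adobe Firefly Video"]
LABELS = ["A", "B", "C", "D"]


def blind_assignment(prompt_ids, seed=0):
    """Return {prompt_id: {label: model}} with a fresh shuffle per prompt.

    Hypothetical helper: a seeded RNG keeps the answer key reproducible.
    """
    rng = random.Random(seed)
    key = {}
    for prompt_id in prompt_ids:
        shuffled = MODELS[:]  # copy so the master list stays untouched
        rng.shuffle(shuffled)
        key[prompt_id] = dict(zip(LABELS, shuffled))
    return key


answer_key = blind_assignment(range(1, 11), seed=42)
print(answer_key[1])  # label-to-model mapping for the first prompt
```

Shuffling per prompt, rather than once for the whole study, keeps evaluators from learning that a given letter always maps to the same model.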
The prompt set spanned several creative use cases. Three examples:
| Sample prompt | What it tested |
|---|---|
| A white ceramic mug tips over on a polished marble kitchen counter and spills hot black coffee, the liquid spreading and dripping over the edge as steam rises. Soft morning light from a window on the left. Static medium shot at counter height, shallow depth of field, warm natural color grade. | Liquid physics, steam, object interaction, lighting, and static composition. |
| A slow crane shot rising from street level up over a crowded Asian night market at dusk, revealing rows of glowing red lanterns and food stalls with steam, people walking below. Begins as a low eye-level shot and ends on a high wide establishing shot. Saturated neon color palette, rich contrast. | Smooth crane movement, shot-scale transition, crowd consistency, lighting, and composition. |
| A robed wizard raises one hand and conjures a glowing orb of magical energy that swells and crackles with arcs of blue-and-violet light, streams of glowing particles and embers swirling outward and trailing through the air as wisps of smoke curl around the hand. A dark stone chamber lit only by the spell's shifting glow. Stylized, painterly fantasy look with dramatic rim lighting, rich saturated color, slow-motion energy. Static medium shot, eye level. | Particle effects, stylized lighting, color consistency, motion, and fantasy visual coherence. |
Evaluators completed three tasks for each prompt. First, they identified visible glitches and physical realism issues, describing what was wrong and what should change. These annotations captured artifacts such as warping, flicker, hallucinated details, object deformation, unnatural motion, camera problems, prompt mismatches, and failures of physical plausibility. Second, evaluators rated each output individually on four scalar dimensions using a shared 1-5 rubric: Prompt / Reference Adherence, Motion Quality, Camera & Shot Quality, and Visual Aesthetics. Third, evaluators completed a blinded pairwise tournament across the four outputs for the same prompt, choosing the stronger output in each head-to-head comparison.
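To make the third task concrete: if the tournament is a full round robin (an assumption, since the exact bracket format is not specified here), four outputs yield six head-to-head comparisons per prompt, and the winners of those comparisons imply a ranking. The helper names below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def round_robin_pairs(labels):
    """Every unordered head-to-head matchup among the given outputs."""
    return list(combinations(labels, 2))


def rank_by_wins(labels, winners):
    """Order outputs by number of matchups won; ties break alphabetically."""
    wins = Counter({label: 0 for label in labels})
    wins.update(winners)  # each entry in `winners` adds one win
    return sorted(labels, key=lambda label: (-wins[label], label))


labels = ["A", "B", "C", "D"]
pairs = round_robin_pairs(labels)
print(len(pairs))  # number of matchups one evaluator judges per prompt

# Illustrative outcome, not study data: A beats everyone, B beats C and D, C beats D.
winners = ["A", "A", "A", "B", "B", "C"]
print(rank_by_wins(labels, winners))
```

In practice, evaluators can disagree or produce cycles (A beats B, B beats C, C beats A), so a win count like this gives an ordering but not necessarily a clean one.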
We aggregated the results across prompts to compare average model performance, prompt-level winners, dimension-level scores, ranking distribution, and derived pairwise win rates. Qualitative comments were used to interpret why outputs succeeded or failed, especially around motion stability, physical realism, visual polish, prompt adherence, and production-readiness.
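Pairwise outcomes can also be converted into ratings, such as the Elo rating mentioned in the image-reference results. The following is a minimal sketch of the standard Elo update, not the study's scoring code. The starting rating of 1000, the K-factor of 32, and the function names are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_from_matches(models, matches, start=1000.0, k=32.0):
    """Apply one Elo update per (winner, loser) pair, in order.

    Hypothetical helper: `start` and `k` are illustrative defaults.
    """
    ratings = {model: start for model in models}
    for winner, loser in matches:
        expected_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - expected_win)  # bigger upsets move ratings further
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


# Illustrative matches only, not study data.
models = ["Model X", "Model Y"]
print(elo_from_matches(models, [("Model X", "Model Y")]))
```

Sequential Elo depends on match order, so a benchmark with a small number of comparisons may prefer averaging over shuffled orders or fitting a Bradley-Terry model instead.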
Limitations
This study is limited to 10 prompts, four models, and 12 professional video editors, so the results should be read as directional rather than definitive. Model rankings may change with different prompts, different model versions, different generation settings, longer or shorter clips, or a different evaluator panel.
The comparison reflects first-pass output quality under the study setup, not each model's ceiling under extensive prompt iteration, manual selection, editing, or post-production. The prompt set intentionally covered several creative use cases, but it does not represent the full range of video generation workflows, such as long-form narrative continuity, multi-shot editing, brand-safe production pipelines, audio-first generation, or complex client revision cycles.
The evaluation also emphasizes visible video quality rather than every production constraint. Editors judged prompt / reference adherence, motion, camera work, visual aesthetics, and physical glitches, but the study did not separately score factors such as audio quality, editability, latency, cost, licensing, safety controls, or downstream integration into professional tools.
Evaluator judgment is inherently subjective. The shared rubric helped anchor scoring, but professional taste, tolerance for artifacts, and expectations for production-ready video can vary. We therefore place more weight on patterns that repeated across prompts, models, and evaluator comments than on any single score or individual matchup.
