Benchmark · May 6, 2026 · 4 min read

Seedream 5.0 Lite swept the field on product detail shots.

A blind head-to-head against the leading image models from Google, OpenAI, and Black Forest Labs, evaluated by professional creatives.

  1. Seedream 5.0 Lite won 63.9 percent of all blind head-to-head matchups. The next-best model trailed at 52.8 percent.
  2. It led every scalar category measured: lighting, color, product detail, prompt adherence, and production-readiness.
  3. Evaluators picked it for one specific quality: fidelity to the reference, not visual flourish.
  4. It lost on one prompt (MacBook), where evaluators cited weaker compositional focus.
Contra Labs
Research

In March 2026, we ran a blind pairwise evaluation of the four leading image models on a single job: product detail shots from a hero reference. The field: Seedream 5.0 Lite (ByteDance), Gemini 3 Pro Image Preview (Google), GPT Image 1.5 (OpenAI), and FLUX.2 [max] (Black Forest Labs).

The result

Seedream 5.0 Lite won 63.9% of all head-to-head matchups. Gemini 3 Pro followed at 52.8%. GPT Image 1.5 (44.4%) and FLUX.2 [max] (38.9%) both finished below the 50% line, losing more matchups than they won.

Pairwise win rates across all blind comparisons (March 2026). Seedream 5.0 Lite won the majority of head-to-head matchups against Gemini 3 Pro, GPT Image 1.5, and FLUX.2 [max].
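
A quick consistency check on these figures, as a minimal sketch: every matchup has exactly one winner, so if each model appeared in the same number of matchups (an assumption; the study design suggests it, but it isn't stated here), the four win rates should average 50%. The dictionary below only restates the published numbers.

```python
# Published pairwise win rates (percent), copied from the results above.
win_rates = {
    "Seedream 5.0 Lite": 63.9,
    "Gemini 3 Pro": 52.8,
    "GPT Image 1.5": 44.4,
    "FLUX.2 [max]": 38.9,
}

# With one winner per matchup and equal appearances per model (assumed),
# the win rates must average 50%.
mean_rate = sum(win_rates.values()) / len(win_rates)

# Margin of each model relative to the 50% break-even line.
margins = {model: round(rate - 50.0, 1) for model, rate in win_rates.items()}

print(f"mean win rate: {mean_rate:.1f}%")
for model, margin in margins.items():
    print(f"{model}: {margin:+.1f} points vs. break-even")
```

The published rates pass the check, and the margins make the split visible: two models above break-even, two below.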

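In Bradley-Terry Elo, Seedream landed at 1567 and Gemini 3 Pro followed at 1545. GPT Image 1.5 (1446) and FLUX.2 [max] (1442) were virtually tied at the bottom. The result is a clean top tier and bottom tier, separated by roughly 100 to 125 points depending on which pair you compare (1545 vs. 1446 at the narrowest, 1567 vs. 1442 at the widest).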

Bradley-Terry Elo derived from pairwise rankings (March 2026). Seedream (1567) and Gemini 3 Pro (1545) at the top; GPT Image 1.5 (1446) and FLUX.2 [max] (1442) virtually tied at the bottom.
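
For readers who want to see how ratings like these can be produced, here is a minimal sketch of one common approach: fit Bradley-Terry strengths with the classic MM (Zermelo) update, then map them onto an Elo-style scale. The win matrix is hypothetical, and the 1500 center and 400-point logistic scale are assumptions borrowed from conventional Elo. None of this is confirmed as the procedure behind the numbers above.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM (Zermelo) update.

    wins[i][j] is the number of times model i beat model j.
    Every model needs at least one win for the fit to be well defined.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        # Fix the arbitrary scale: keep the geometric mean of strengths at 1.
        log_mean = sum(math.log(x) for x in new_p) / n
        p = [x / math.exp(log_mean) for x in new_p]
    return p


def to_elo(strengths, center=1500.0, scale=400.0):
    """Map strengths to an Elo-style scale (center and scale are assumed)."""
    return [center + scale * math.log10(s) for s in strengths]


def expected_win_prob(elo_a, elo_b, scale=400.0):
    """Win probability for A over B implied by an Elo-style rating gap."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


# Hypothetical win counts for four anonymized models (12 matchups per pair).
hypothetical_wins = [
    [0, 7, 8, 9],
    [5, 0, 7, 8],
    [4, 5, 0, 6],
    [3, 4, 6, 0],
]
elo = to_elo(fit_bradley_terry(hypothetical_wins))

# Applied to the published ratings (assuming the 400-point scale).
p_top_pair = expected_win_prob(1567, 1545)  # Seedream vs. Gemini 3 Pro
p_extremes = expected_win_prob(1567, 1442)  # Seedream vs. FLUX.2 [max]
```

Under that assumed scale, the 22-point gap at the top implies roughly a 53% head-to-head win probability for Seedream over Gemini 3 Pro, while the 125-point gap to FLUX.2 [max] implies roughly 67%.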

How we ran it

Professional creatives evaluated outputs from all four models, blind, against a real product reference (jewelry, leather goods, electronics, audio). Each evaluator ranked the outputs head-to-head and rated them across five scalar categories: lighting and shadows, color consistency, product detail handling, prompt adherence, and production-readiness. Models were the latest publicly available versions in March 2026.

Evaluators saw no model names or watermarks, so brand bias was removed at the point of judgment.
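
The ranking step can be sketched as follows, assuming each evaluator's ranking of the four outputs is expanded into six pairwise matchups, one per pair of models. That expansion is an assumption, and the rankings and labels below are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rankings, best to worst, using anonymized labels.
rankings = [
    ["C", "A", "B", "D"],
    ["A", "C", "D", "B"],
    ["C", "B", "A", "D"],
]


def pairwise_win_rates(rankings):
    """Expand each ranking into pairwise matchups and return win rates."""
    wins = Counter()
    appearances = Counter()
    for ranking in rankings:
        # combinations() preserves list order, so `better` always
        # sits above `worse` in the evaluator's ranking.
        for better, worse in combinations(ranking, 2):
            wins[better] += 1
            appearances[better] += 1
            appearances[worse] += 1
    return {model: wins[model] / appearances[model] for model in appearances}


rates = pairwise_win_rates(rankings)
for model, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"Model {model}: {rate:.1%}")
```

Because each ranking of four outputs yields six matchups and every model appears in three of them, win rates computed this way always average 50% across the field.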

Seedream led every category we measured

Professional creatives rated each output across all five scalar categories. Seedream led all five.

Its top marks were in lighting and shadows (4.25 / 5) and color consistency (4.17 / 5). The widest gap between Seedream and the lowest-scoring model was in color consistency (1.09 points). The narrowest was in lighting and shadows (0.58). Even where the field tightened, Seedream stayed on top.
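
The gaps also pin down the floor of the field in those two categories. A small sketch of the arithmetic, using only the figures quoted above (since the published numbers are rounded, the implied values are approximate):

```python
# Seedream's mean ratings (out of 5) and its gap to the lowest-scoring
# model, both as quoted in the text.
seedream_score = {"lighting and shadows": 4.25, "color consistency": 4.17}
gap_to_lowest = {"lighting and shadows": 0.58, "color consistency": 1.09}

# Implied mean rating of the lowest-scoring model in each category.
implied_lowest = {
    category: round(seedream_score[category] - gap_to_lowest[category], 2)
    for category in seedream_score
}

for category, score in implied_lowest.items():
    print(f"{category}: lowest model at about {score:.2f} / 5")
```

This puts the lowest-scoring model at about 3.67 on lighting and shadows and about 3.08 on color consistency.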

Why evaluators picked it: fidelity to the reference

The pattern in the rationales was consistent. Evaluators didn't pick Seedream because it was the prettiest. They picked it because it preserved the source.

Creative director Joao Paulo Bastos, on a handbag prompt:

"Model C keeps the neutral background closer to the original, preserves the leather texture more accurately, and the clasp hardware reads truer to the reference image. That consistency with the input is what pushed it to first place for me, since the whole point of this task is evaluating how well the models maintain details from the source."
Joao Paulo Bastos, creative director

Seedream 5.0 Lite output, handbag prompt. Selected for fidelity to the reference over a more visually striking alternative.

Brand director Anna Gudvin, on a jewelry prompt where Seedream avoided embellishing details that weren't in the source:

"The selected image maintained the original design without introducing additional milgrain detailing that was not present in the reference, preserving product integrity. It also replicated the lighting most accurately, with soft, diffused highlights and coherent reflections consistent with the hero image."
Anna Gudvin, brand director

Seedream 5.0 Lite output, jewelry prompt. Preserved the original design without adding milgrain detailing not present in the reference.
Themes evaluators cited most often as reasons for picking Seedream over Gemini 3 Pro, GPT Image 1.5, and FLUX.2 [max] (March 2026). Fidelity to the reference dominated the rationales, with production-readiness and lighting accuracy close behind.

For product detail shots specifically, this is the right thing to optimize for. The output isn't a new image. It's a faithful zoom on an existing one. Inventing details, however tasteful, is a hard fail.

Where it lost

Seedream wasn't perfect. On a MacBook prompt, it ranked last. The issue was compositional focus.

"Model C felt the least focused overall, with less control over where the viewer's attention goes."
Joao Paulo Bastos, creative director

Seedream 5.0 Lite output, MacBook prompt. The only prompt where Seedream ranked last; evaluators cited weaker compositional focus.

This is worth flagging because the same evaluator ranked Seedream first on the handbag prompt. The model is strong, but compositional control on tech products with hard edges and screens remains an open weakness.

What this means

For catalog work, PDP imagery, and any use case where the output has to read as a continuation of the hero shot, Seedream 5.0 Lite is the strongest model we've tested. Its lead comes from a specific quality (fidelity to the reference) that matters more for product shots than aesthetic flourish does.

How we ran this study → Methodology

Connecting with the missing signal: taste

Contra connects top creative minds with AI teams training models to understand taste. This is expert input, not crowd labor. It's the creative layer powering the next generation of AI.

Designers, Writers, Marketers, Engineers, Social Media Experts, Video Editors & Animators, Music & Audio Engineers

1.5M+ creative experts
400+ skills and tools represented
$250M+ verified expert earnings
