Benchmark · May 13, 2026 · 7 min read

The image-model leaderboard flips by brief.

Four frontier image models, six brand campaigns, ranked blind by working creatives. GPT Image 2 wins the aggregate. Every other model owns a category.

01 GPT Image 2 wins the aggregate by every measure: 61.1% vs Seedream, 64.8% vs Gemini, 66.7% vs FLUX.
02 The leaderboard flips by brief. Gemini wins product photography, FLUX wins digital banking, Seedream wins music posters.
03 GPT's 77.8% win rate on luxury skincare is the single most dominant category-model combination in the dataset.
04 Every model nets negative on typography sentiment: GPT -6, Gemini -5, Seedream -7, FLUX -9.
05 Failure modes diverge: FLUX and Seedream omit, Gemini hallucinates, GPT distorts. Model choice is also a QA choice.

Contra Labs

Research

We ran four frontier image generation models through a blind creative tournament, scored by a small panel of vetted working creatives. The setup is narrow by design: it trades breadth for a deep read across six brand campaigns, and the same eval shape can be scaled up against any specific brief. The task: produce brand-campaign visuals for six fictional brands spanning luxury skincare, digital banking, product photography, alpine footwear, music event posters, and fashion lookbooks.

Four models, six brand campaigns. Outputs as evaluators saw them, anonymized side-by-side.

Evaluators ranked every output from 1st to 4th without knowing which model made what.

The models: OpenAI's GPT Image 2, Google DeepMind's Gemini 3.1 Flash Image Preview, Black Forest Labs' FLUX.2 [pro], and BytePlus's Seedream 5.0 Lite. Identical prompts, same criteria.

The aggregate leaderboard

A note on scale before the numbers: this is a focused expert eval, which is why we lead with Elo and pairwise rates. Treat the percentages as directional signal.

Bradley-Terry Elo across the four models. GPT Image 2 leads by 79 points; the other three sit within 85 points of each other.

GPT Image 2 wins the aggregate by every measure: 61.1% against Seedream, 64.8% against Gemini, 66.7% against FLUX. Every opponent, every aggregate metric.

The 79-point Elo gap between 1st and 2nd is meaningful. On the standard Elo scale used in chess, an 80-point gap corresponds to roughly a 61/39 win expectation, which is almost exactly what we observed in GPT's closest head-to-head (61.1% against Seedream). The other three models sit within 85 Elo points of each other, a far tighter race for the rest of the field.
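
The win expectation quoted above comes from the standard Elo logistic curve. A minimal sketch, assuming the conventional 400-point scale (the study does not state which scale its Bradley-Terry Elo uses, and the function name is ours):

```python
def elo_win_expectation(gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated side, given a rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))

# Gaps mentioned in the text: the 79-point lead, the 80-point chess
# reference, and the 85-point spread among the other three models.
for gap in (79, 80, 85):
    print(f"{gap}-point gap -> {elo_win_expectation(gap):.1%}")
```

Under that assumption, a 79 or 80 point gap lands at about 61%, in line with the pairwise rates reported here.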

The ranking distribution tells the story more vividly

Ranking distribution by model. How often each model finished 1st, 2nd, 3rd, or 4th across the panel.

Across the panel's rankings, GPT finished first 40.7% of the time and second 27.8%, with the remaining 31.5% split between 3rd and 4th. FLUX is the most polarizing model in the field: a high first-place rate (22.2%), but also the highest 4th-place rate. FLUX swings for the fences and often misses.

Gemini clusters safely in the middle. Rarely first (14.8%), but also rarely catastrophic. Seedream spreads the most evenly, the most predictable model in the eval.
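
The distribution above is a simple tally of finishing positions. A minimal sketch of that tally, using made-up placements rather than the study's data (the `placements` list and the `rank_distribution` helper are hypothetical):

```python
from collections import Counter

def rank_distribution(placements: list[int], positions: int = 4) -> dict[int, float]:
    """Share of rankings in which a model finished at each position (1 = best)."""
    counts = Counter(placements)
    total = len(placements)
    return {pos: counts.get(pos, 0) / total for pos in range(1, positions + 1)}

# Hypothetical placements for one model across ten rankings, not study data.
placements = [1, 1, 2, 4, 1, 3, 2, 1, 4, 2]
dist = rank_distribution(placements)
for pos, share in dist.items():
    print(f"finished #{pos}: {share:.1%}")
```

A polarizing model shows weight at both ends of this distribution; a safe one piles up in positions 2 and 3.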

Failure mode #1: brand fidelity by brief

Here's the first measurable failure mode for any brand-bounded creative workflow. The leaderboard flattens, and occasionally inverts, when you break results down by campaign type. No single model holds the bar across every brief. The category-level data reveals genuine specialization and a real gap for any workflow betting on one model.

Winning model by brief. Average rank (lower is better) across six brand campaigns. The leader flips four ways.

ASAGI (luxury skincare): GPT Image 2 (1.44). Polish, color harmony, elevated aesthetic.
Studio Citrine (fashion lookbook): GPT Image 2 (1.67). Same playbook.
Plunge Artesian Pops (product photo): Gemini 3.1 Flash (1.78). Realistic lighting, object clarity.
Kinetix (digital banking): FLUX.2 [pro] (1.67). Clean UI compositions, corporate visual language.
Sonic Collective (music poster): Seedream 5.0 Lite (2.0). Bold graphic energy.
Ridge (alpine footwear editorial): Gemini and Seedream tie (2.22). Tightest race in the eval.

Winning output from each brand brief. Four different models, six different jobs.

GPT's 77.8% win rate on luxury skincare is the single most dominant category-model combination in the dataset. In a sample this size, that's a strong signal even if the exact percentage will move with more evaluators. But FLUX hits 55.6% on digital banking and Gemini hits 55.6% on product photography. The other models genuinely own their territory.

If you're choosing a model for a specific campaign type, the overall leaderboard can mislead you. A category-matched model can outperform the overall leader by a meaningful margin.
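
A minimal sketch of how a per-brief leader can be read off raw rankings, assuming each row is one evaluator's placement of one model on one brief. The `leaders_by_brief` helper, the brief labels, and the model names A and B are hypothetical, and the rows are made up rather than taken from the study:

```python
from collections import defaultdict

def leaders_by_brief(rows):
    """rows: iterable of (brief, model, rank). Returns {brief: (best_avg, [models])}."""
    ranks = defaultdict(list)
    for brief, model, rank in rows:
        ranks[(brief, model)].append(rank)

    # Average rank per model within each brief (lower is better).
    averages = defaultdict(dict)
    for (brief, model), values in ranks.items():
        averages[brief][model] = sum(values) / len(values)

    result = {}
    for brief, by_model in averages.items():
        best = min(by_model.values())
        # Keep every model at the best average so ties are reported, not hidden.
        winners = sorted(m for m, avg in by_model.items() if abs(avg - best) < 1e-9)
        result[brief] = (best, winners)
    return result

# Hypothetical rankings, not study data.
rows = [
    ("skincare", "A", 1), ("skincare", "B", 2),
    ("skincare", "A", 1), ("skincare", "B", 2),
    ("skincare", "A", 2), ("skincare", "B", 1),
    ("footwear", "A", 1), ("footwear", "B", 2),
    ("footwear", "A", 2), ("footwear", "B", 1),
]
result = leaders_by_brief(rows)
for brief, (best, winners) in result.items():
    print(f"{brief}: {', '.join(winners)} ({best:.2f})")
```

Returning a list of winners instead of a single name matters for cases like the Ridge brief, where two models share the best average rank.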

The theme profiles explain why the niches exist

Evaluators tagged strengths and weaknesses across ten visual dimensions: detail, background, product placement, lighting, brand alignment, texture, color palette, typography, realism, composition. The per-model profiles are distinct.

Seedream 5.0 Lite strengths and weaknesses by visual dimension.

Seedream 5.0 Lite: the most balanced profile. No dramatic peaks or valleys. Maps directly to its even ranking distribution. Seedream is the model you pick when you need consistency and can't afford a disaster. It rarely delivers the "wow" moment that GPT or FLUX produces on a good day.

Gemini 3.1 Flash strengths and weaknesses by visual dimension.

Gemini 3.1 Flash: clusters toward the middle on most dimensions, which explains its "safe but unspectacular" ranking distribution. It doesn't catastrophically fail on any axis. It also rarely produces the standout strength that pushes an image to 1st place. Product placement is its quiet advantage: the reason it wins product photography.

FLUX.2 [pro] strengths and weaknesses by visual dimension.

FLUX.2 [pro]: almost the inverse of the even profiles above. Background integration is its biggest weakness (8 weaknesses). It posts respectable numbers on lighting (6 strengths) and detail (4). The polarization in its rankings makes sense through this lens. When FLUX nails the composition, the detail work is striking. When the background falls apart, the whole image collapses.

GPT Image 2 strengths and weaknesses by visual dimension.

GPT Image 2: the realism and composition machine. 14 strength mentions for realism against 6 weaknesses. 15 for composition against 5. Color palette (12 strengths) and typography (10) round out a profile built for polished, premium-feeling work. Relative weak spot: backgrounds (7 weaknesses vs 6 strengths). Environments occasionally fall flat against the subject.
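
One way to read these counts is as net sentiment per dimension. A minimal sketch, assuming net sentiment is simply strength mentions minus weakness mentions (the article implies this but does not define it). It uses only the three GPT Image 2 dimensions for which both counts are given above; the helper name is ours:

```python
# (strengths, weaknesses) mention counts for GPT Image 2, as reported above.
gpt_mentions = {
    "realism": (14, 6),
    "composition": (15, 5),
    "background": (6, 7),
}

def net_sentiment(mentions: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Strength mentions minus weakness mentions, per dimension."""
    return {dim: strengths - weaknesses for dim, (strengths, weaknesses) in mentions.items()}

net = net_sentiment(gpt_mentions)
for dim, score in sorted(net.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{dim}: {score:+d}")
# composition: +10
# realism: +8
# background: -1
```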

Failure mode #2: typography across the board

Typography is the weak spot every model shares. Each one nets negative in the cross-model analysis: GPT -6, Gemini -5, Seedream -7, FLUX -9. GPT is the only model with net-positive typography sentiment when measured against itself. The rest struggle.

Typography mentions by model. Strengths in green, weaknesses in red. Every model nets negative.

Across all four, headlines come in too thin, body copy gets crammed, font choices read as "default", and lockups invent text the brief never asked for.

The failure modes diverge

About 20% of weakness entries carried a structured prompt-adherence tag. The patterns split cleanly by model:

FLUX.2 [pro] and Seedream 5.0 Lite skew toward omissions. Requested elements missing.
Gemini 3.1 Flash skews toward hallucinations. Extra elements added that were never requested.
GPT Image 2 skews toward inaccuracies. Right element, wrong rendering.

Prompt-adherence issue types by model. Omissions, hallucinations, and inaccuracies split cleanly across the field.

Model choice is also a QA choice: pick the model whose typical failure your review process is best at catching. If it reliably catches missing assets, run FLUX or Seedream. If it catches fabrications, Gemini works. If it catches distorted execution, GPT works.

The verdict

Four models, six brand campaigns, every output ranked blind. The verdict is clear at the top: GPT Image 2 is the best general-purpose image generation model for brand-campaign work right now. It wins the aggregate stats, the Elo, and every head-to-head against every competitor.

Best overall stops short of best everywhere. The category data shows model selection should be brief-dependent:

Luxury, mood, aesthetic-led brand work, including fashion lookbooks: GPT Image 2.
Clean product photography: Gemini 3.1 Flash.
Banking, financial systems, structured campaigns: FLUX.2 [pro].
Music posters and graphic-heavy work: Seedream 5.0 Lite.
Editorial footwear: Gemini or Seedream, with GPT close.
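
Brief-dependent selection can be sketched as a lookup with a fallback. The mapping below restates the per-brief winners reported earlier; the key labels, the `pick_models` helper, and the choice to fall back to the aggregate leader for unlisted briefs are our assumptions, not part of the study:

```python
# Category winners as reported in the per-brief results; keys are illustrative labels.
MODEL_BY_BRIEF = {
    "luxury skincare": ["GPT Image 2"],
    "fashion lookbook": ["GPT Image 2"],
    "product photography": ["Gemini 3.1 Flash"],
    "digital banking": ["FLUX.2 [pro]"],
    "music poster": ["Seedream 5.0 Lite"],
    "alpine footwear": ["Gemini 3.1 Flash", "Seedream 5.0 Lite"],  # reported tie
}
OVERALL_LEADER = "GPT Image 2"

def pick_models(brief_type: str) -> list[str]:
    """Return the category leader(s), falling back to the aggregate leader."""
    return MODEL_BY_BRIEF.get(brief_type.strip().lower(), [OVERALL_LEADER])

print(pick_models("Digital Banking"))  # ['FLUX.2 [pro]']
print(pick_models("automotive"))       # ['GPT Image 2'] (fallback for an unlisted brief)
```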

The smartest creative workflow in 2026 picks the right model for the right job.

Typography is on you regardless.

Methodology

The evaluation ran in two phases. In the first, evaluators worked through pairwise comparisons. For each prompt, every model was matched against every other model, giving six pairs per prompt. Evaluators saw two anonymized outputs side by side, picked a winner, rated their confidence, and wrote out individual strengths and weaknesses for each output in free text.
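
The six pairs follow from matching four models all-against-all. A quick sketch of that enumeration (the variable names are ours):

```python
from itertools import combinations

MODELS = [
    "GPT Image 2",
    "Gemini 3.1 Flash Image Preview",
    "FLUX.2 [pro]",
    "Seedream 5.0 Lite",
]

# Every unordered pairing of two distinct models: C(4, 2) = 6.
pairs = list(combinations(MODELS, 2))
for left, right in pairs:
    print(f"{left}  vs  {right}")
print(len(pairs))  # 6
```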

In the second phase, evaluators took the same prompts and used GPT Image 2's conversational editing to try to make outputs production-ready. They were given three edit rounds, then offered a choice: keep refining in GPT or switch to any tool they preferred.

All models were accessed via their respective APIs and given the same prompt: a single generation per prompt, with no cherry-picking and no iterative refinement in the ranking phase. Aggregate metrics (average rank, pairwise win rate, and Bradley-Terry Elo) were all computed from the full pairwise matrix. The visual dimension profiles and prompt-adherence classifications were extracted from the free-text responses during analysis. Category-level breakdowns use the same metrics sliced by brand campaign.
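
For readers unfamiliar with Bradley-Terry Elo, here is a minimal sketch of one common way to fit it from a pairwise win matrix: the classic iterative (minorization-maximization) update, followed by a conversion to an Elo-style scale. This is not the study's actual pipeline. The function names, the 400-point scale, the 1000-point base, and the win counts are all assumptions for illustration, and the fit assumes every model has at least one win:

```python
import math

def fit_bradley_terry(wins, iters=200):
    """wins[i][j] = number of times model i beat model j. Returns strengths."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom)
        # Rescale so the geometric mean is 1; only strength ratios are identified.
        log_mean = sum(math.log(x) for x in new_p) / n
        p = [x / math.exp(log_mean) for x in new_p]
    return p

def to_elo(strengths, base=1000.0, scale=400.0):
    """Map strengths to an Elo-style scale where ratings average to `base`."""
    return [base + scale * math.log10(s) for s in strengths]

# Hypothetical win counts for four placeholder models, not study data.
wins = [
    [0, 12, 13, 14],
    [8, 0, 11, 10],
    [7, 9, 0, 11],
    [6, 10, 9, 0],
]
ratings = to_elo(fit_bradley_terry(wins))
print([round(r) for r in ratings])
```

Because only rating differences carry meaning, the base value is arbitrary; what matters is the gap between models.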

The panel is small by design: vetted working creatives with production experience, not crowd-sourced raters. Treat exact percentages as directional signal.

Context

Per OpenAI, GPT Image 2 brings greater control and performs better with small text, iconography, UI elements, dense compositions, and subtle stylistic constraints. These claims guided both the choice of categories and the crafting of the individual prompts.

We selected six campaign types that stress different ends of the visual spectrum: product photography for object precision, music posters for typographic density, digital banking for clean UI compositions, luxury skincare for subtle control, fashion lookbooks for editorial composition, and alpine footwear for object physics and realism. The prompts were crafted to hit the dimensions OpenAI highlighted, then applied identically to all four models. If the claims hold, GPT should separate on exactly these tasks. If the other models have edges of their own, the category breakdown should find them.

Connecting with the missing signal: taste

Contra connects top creative minds with AI teams training models to understand taste. This is expert input, not crowd labor. It's the creative layer powering the next generation of AI.

Designers

Writers

Marketers

Engineers

Social Media Experts

Video Editors & Animators

Music & Audio Engineers

1.5M+

creative experts

400+

Skills and tools represented

$250M+

verified expert earnings

The world's leading independent

human data & creative evaluation lab

Get In Touch
