We ran four frontier image generation models through a blind creative tournament, scored by a small bespoke panel of vetted working creatives. The setup is focused by design: depth of read across six brand campaigns, the same shape of eval we can scale up against any specific brief. The task: produce brand-campaign visuals across six fictional brands spanning luxury skincare, digital banking, product photography, alpine footwear, music event posters, and fashion lookbooks.

Evaluators ranked every output from 1st to 4th without knowing which model made what.
The models: OpenAI's GPT Image 2, Google DeepMind's Gemini 3.1 Flash Image Preview, Black Forest Labs' FLUX.2 [pro], and BytePlus's Seedream 5.0 Lite. Identical prompts, same criteria.
The aggregate leaderboard
A note on scale before the numbers: this is a focused expert eval, which is why we lead with Elo and pairwise rates. Treat the percentages as directional signal.

GPT Image 2 wins the aggregate by every measure: 61.1% against Seedream, 64.8% against Gemini, 66.7% against FLUX. Every opponent, every cut.
The 79-point Elo gap between 1st and 2nd is substantial. An 80-point Elo gap in chess corresponds to roughly a 61/39 win expectation, which is almost exactly what we observed. The other three models sit within 85 Elo points of each other, a far tighter race for the rest of the field.
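To make the Elo comparison concrete, here is a minimal sketch of the standard Elo expected-score formula. The function name is ours, chosen for illustration; the 400-point scale constant is the conventional one from chess ratings.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo logistic curve.

    rating_gap is (stronger rating - weaker rating) in Elo points.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


# Compare the gap reported above (79) with the 80-point reference.
for gap in (79, 80):
    print(f"{gap}-point gap -> {elo_expected_score(gap):.1%} expected win rate")
```

Both gaps land at roughly 61%, in line with the head-to-head rates GPT Image 2 posted.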
The ranking distribution tells the story more vividly

Across the panel's rankings, GPT finished first 40.7% of the time and second 27.8%, with the rest split between 3rd and 4th. FLUX is the most polarizing model in the field: high first-place rate (22.2%), but also the highest 4th-place rate. FLUX swings for the fences and often misses.
Gemini clusters safely in the middle. Rarely first (14.8%), but also rarely catastrophic. Seedream spreads the most evenly, the most predictable model in the eval.
Failure mode #1: brand fidelity by brief
Here's the first measurable failure mode for any brand-bounded creative workflow. The leaderboard flattens, and occasionally inverts, when you break results down by campaign type. No single model holds the bar across every brief. The category-level data reveals genuine specialization and a real gap for any workflow betting on one model.

- ASAGI (luxury skincare): GPT Image 2 (average rank 1.44; lower is better). Polish, color harmony, elevated aesthetic.
- Studio Citrine (fashion lookbook): GPT Image 2 (1.67). Same playbook.
- Plunge Artesian Pops (product photo): Gemini 3.1 Flash (1.78). Realistic lighting, object clarity.
- Kinetix (digital banking): FLUX.2 [pro] (1.67). Clean UI compositions, corporate visual language.
- Sonic Collective (music poster): Seedream 5.0 Lite (2.0). Bold graphic energy.
- Ridge (alpine footwear editorial): Gemini and Seedream tie (2.22). Tightest race in the eval.

GPT's 77.8% win rate on luxury skincare is the single most dominant category-model combination in the dataset. In a sample this size, that's a strong signal even if the exact percentage will move with more evaluators. But FLUX hits 55.6% on digital banking and Gemini hits 55.6% on product photography. The other models genuinely own their territory.
If you're choosing a model for a specific campaign type, the overall leaderboard can mislead you. A category-matched model can outperform the overall leader by a meaningful margin.
The theme profiles explain why the niches exist
Evaluators tagged strengths and weaknesses across ten visual dimensions: detail, background, product placement, lighting, brand alignment, texture, color palette, typography, realism, composition. The per-model profiles are distinct.

Seedream 5.0 Lite: the most balanced profile. No dramatic peaks or valleys. Maps directly to its even ranking distribution. Seedream is the model you pick when you need consistency and can't afford a disaster. It rarely delivers the "wow" moment that GPT or FLUX produces on a good day.

Gemini 3.1 Flash: clusters toward the middle on most dimensions, which explains its "safe but unspectacular" ranking distribution. It doesn't catastrophically fail on any axis. It also rarely produces the standout strength that pushes an image to 1st place. Product placement is its quiet advantage: the reason it wins product photography.

FLUX.2 [pro]: almost the inverse. Background integration is its biggest weakness (8 weaknesses). It posts respectable numbers on lighting (6 strengths) and detail (4). The polarization in its rankings makes sense through this lens. When FLUX nails the composition, the detail work is striking. When the background falls apart, the whole image collapses.

GPT Image 2: the realism and composition machine. 14 strength mentions for realism against 6 weaknesses. 15 for composition against 5. Color palette (12 strengths) and typography (10) round out a profile built for polished, premium-feeling work. Relative weak spot: backgrounds (7 weaknesses vs 6 strengths). Environments occasionally fall flat against the subject.
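One way to read these profiles is as a net score per dimension: strength mentions minus weakness mentions. The sketch below applies that to the GPT Image 2 counts quoted above. Treating net as a simple difference is our assumption about how to summarize the tags, and the helper name is hypothetical.

```python
# (strength mentions, weakness mentions) for GPT Image 2, as quoted in the text.
gpt_mentions = {
    "realism": (14, 6),
    "composition": (15, 5),
    "background": (6, 7),
}


def net_sentiment(mentions: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Net sentiment per dimension: strengths minus weaknesses (assumed definition)."""
    return {dim: strengths - weaknesses for dim, (strengths, weaknesses) in mentions.items()}


gpt_net = net_sentiment(gpt_mentions)
for dim, score in sorted(gpt_net.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{dim:12s} {score:+d}")
```

On these counts, composition and realism come out clearly positive while background dips just below zero, which matches the "relative weak spot" reading.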
Failure mode #2: typography across the board
Every model nets negative on typography in the cross-model analysis: GPT -6, Gemini -5, Seedream -7, FLUX -9. GPT is the only model with net-positive typography sentiment when measured against itself. The rest struggle.

Across all four, headlines come in too thin, body copy gets crammed, font choices read as "default", and lockups invent text the brief never asked for.
The failure modes diverge
About 20% of weakness entries carried a structured prompt-adherence tag. The patterns split cleanly by model:
- FLUX.2 [pro] and Seedream 5.0 Lite skew toward omissions. Requested elements missing.
- Gemini 3.1 Flash skews toward hallucinations. Extra elements added that were never requested.
- GPT Image 2 skews toward inaccuracies. Right element, wrong rendering.

Model choice is also a QA choice. If your review process catches missing assets, run FLUX or Seedream. If it catches fabrications, Gemini works. If it catches distorted execution, GPT works.
The verdict
Four models, six brand campaigns, every output ranked blind. The verdict is clear at the top: GPT Image 2 is the best general-purpose image generation model for brand-campaign work right now. It wins the aggregate stats, the Elo, and every head-to-head against every competitor.
Best overall stops short of best everywhere. The category data shows model selection should be brief-dependent:
- Luxury, mood, aesthetic-led brand work: GPT Image 2.
- Clean product photography: Gemini 3.1 Flash.
- Banking, financial systems, structured campaigns: FLUX.2 [pro].
- Music posters and graphic-heavy work: Seedream 5.0 Lite.
- Editorial fashion or footwear: Gemini or Seedream, with GPT close.
The smartest creative workflow in 2026 picks the right model for the right job.
Typography is on you regardless.
Methodology
The evaluation ran in two phases. In the first, evaluators worked through pairwise comparisons. For each prompt, every model was matched against every other model, giving six pairs per prompt. Evaluators saw two anonymized outputs side by side, picked a winner, rated their confidence, and wrote out individual strengths and weaknesses for each output in free text.
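The six-pairs figure follows from matching four models round-robin. A minimal sketch that enumerates the matchups, with variable names that are illustrative only:

```python
from itertools import combinations

MODELS = [
    "GPT Image 2",
    "Gemini 3.1 Flash Image Preview",
    "FLUX.2 [pro]",
    "Seedream 5.0 Lite",
]

# Every unordered pair of distinct models: 4 choose 2 = 6 matchups per prompt.
pairs = list(combinations(MODELS, 2))

for left, right in pairs:
    print(f"{left}  vs  {right}")
print(f"{len(pairs)} pairs per prompt")
```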
In the second phase, evaluators took the same prompts and used GPT Image 2's conversational editing to try to make outputs production-ready. They were given three edit rounds, then offered a choice: keep refining in GPT or switch to any tool they preferred.
All models were accessed through their respective APIs and given the same prompts: a single generation per prompt, with no cherry-picking and no iterative refinement in the ranking phase. Aggregate metrics (average rank, pairwise win rate, and Bradley-Terry Elo) were all computed from the full pairwise matrix. The visual dimension profiles and prompt-adherence classifications were extracted from the free-text responses during analysis. Category-level breakdowns use the same metrics sliced by brand campaign.
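For readers who want to reproduce this kind of analysis, here is a rough sketch of how Bradley-Terry strengths can be fit from a pairwise win matrix and mapped onto an Elo-style scale. It is not our production code. The win counts are made-up toy numbers for three hypothetical items (A, B, C), the fit uses the standard minorization-maximization update, and centering ratings at 1500 is an arbitrary convention we assume for display.

```python
import math


def fit_bradley_terry(wins: list[list[int]], iters: int = 200) -> list[float]:
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is how many times item i beat item j.
    Returns strengths normalized to sum to 1.
    Assumes every item has at least one win (otherwise its strength goes to 0).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


def to_elo_scale(strengths: list[float], center: float = 1500.0) -> list[float]:
    """Map strengths to Elo-style ratings: 400 * log10(strength), mean-centered."""
    raw = [400.0 * math.log10(s) for s in strengths]
    mean = sum(raw) / len(raw)
    return [center + r - mean for r in raw]


# Toy data only: hypothetical win counts for items A, B, C (not study results).
toy_wins = [
    [0, 6, 7],  # A beat B 6 times, C 7 times
    [3, 0, 5],  # B beat A 3 times, C 5 times
    [2, 4, 0],  # C beat A 2 times, B 4 times
]
strengths = fit_bradley_terry(toy_wins)
ratings = to_elo_scale(strengths)
for name, rating in zip("ABC", ratings):
    print(f"{name}: {rating:.0f}")
```

The log10 mapping is what makes the two views agree: the Bradley-Terry win probability between two items equals the Elo expectation for their rating gap, so gaps on this scale read the same way as chess Elo gaps.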
The panel is small by design: vetted working creatives with production experience, not crowd-sourced raters. Treat exact percentages as directional signal.
Context
Per OpenAI, GPT Image 2 brings greater control and performs better with small text, iconography, UI elements, dense compositions, and subtle stylistic constraints. Those claims were the cornerstone for selecting the categories and crafting the individual prompts.
We selected six campaign types that stress different ends of the visual spectrum: product photography for object precision, music posters for typographic density, digital banking for clean UI compositions, luxury skincare for subtle control, fashion lookbooks for editorial composition, and alpine footwear for object physics and realism. The prompts were crafted to hit the dimensions OpenAI highlighted, then applied identically to all four models. If the claims hold, GPT should separate on exactly these tasks. If the other models have edges of their own, the category breakdown should find them.
