Benchmark·May 14, 2026·7 min read

ChatGPT Images 2.0 won every head-to-head. Here's where it still breaks.

41 sessions, 7 designers, 6 briefs. GPT Image 2 nails the concept, then breaks at production.

  1. GPT Image 2 swept our head-to-head against Seedream 5.0 Lite, Gemini 3.1 Flash, and Flux 2 Pro. Production readiness is where the gap shows.
  2. Typography climbs with editing (33% to 59% "no issues"). Prompt adherence drops the other way. Realism flatlines at 60-65%.
  3. Only 5 of 41 sessions shipped from GPT alone. 10 more were "client-presentable" but still moved to Photoshop, Illustrator, or Figma.

Can a designer actually ship with ChatGPT Images 2.0? OpenAI bills it as designed for production-quality visuals: greater control and better performance with small text, iconography, UI elements, dense compositions, and subtle stylistic constraints. It's also the model that swept our brand-campaign eval against Seedream 5.0 Lite, Gemini 3.1 Flash Image Preview, and Flux 2 Pro. So we pressure-tested the harder claim: production readiness.

GPT Image 2 generations from the production-readiness test.

To test this, we designed six fictional campaign types to stress the visual dimensions OpenAI highlighted in the GPT Image 2 launch: small text and iconography, dense compositions, UI elements, and subtle stylistic constraints, plus realism, texture, and depth of field.

Every session was recorded end-to-end with Rollout, Contra Labs' session-capture tool. Rollout captures the screen, video, audio, mouse trails, clicks, and keyboard input, including everything designers did after they left the tool being evaluated. Full methodology at the end of the article.

Rollout captures the full session: screen, audio, mouse, clicks, and every tool the designer touches after they leave GPT.

The production-readiness gap

Typography "no issues" share climbs from 33% at Edit 1 to 59% at Edit 3.

ChatGPT Images 2.0 doesn't fail evenly across the Rollout sessions. Typography gets better with iteration. Prompt adherence gets worse. You can hear designers doing both in the session audio.

Typography improves with editing, and designers walk through the fix in real time:

Ridge brief across three edits: typography progressively sharpens.
Edit 1: "The small text in general feels not very sharp, but it is still... it's very readable."
Edit 2: "Okay, now going to prompt some changes I wanted to make. So change initial font to original font and background can be a bit better."
Edit 3: "Typography, now I think we are good, but still maybe on the logo. Um, well, it's the same as before. I would say no issues."
Kinetix brief across three edits: prompt adherence drifts away from the brief.

By Edit 3, the same designer has gone from saying "not very sharp" to "no issues, that looks good." Typography is fixable on iteration.

Prompt adherence gets worse.

Edit 1: "I like that. And the focus on just the credit card, as mentioned in the prompt, is pretty spot on. She is, like, kind of posing. It's cool. That's a cool image. I like that."
Edit 2: "What could I edit? Make the background more of a soft gradient, kind of like colored foggy glass. I don't know why that's the vibe it gave me. So what will it give me?"
Edit 3: "I don't think this was, uh, no, really that good. I don't like the background. It looks low resolution, I did not ask for that. Her, half of her body went somewhere."

By Edit 3, GPT has drifted from "spot on the prompt" to inventing texture and losing anatomy. The longer designers iterate, the more the model lets go of the original brief.

"No issues" share by quality dimension across Edit 1, 2, 3. Prompt adherence and color regress at Edit 3.

Prompt adherence isn't alone in regressing. Color and lighting follow the same arc. Realism and physics is among the worst-performing dimensions at every iteration, hovering around the 60-65% mark. Typography climbs with editing. Prompt adherence regresses. Realism just sits there. It's the dimension designers leave GPT for.

What designers were actually asking for

Change areas requested by edit phase across all 41 sessions.
An ASAGI brief across three phases. Edit 1 refines the composition. Edit 2 grounds the product in the scene. By Edit 3, the design is settled: small last-mile choices, not big redirects.

Two topics never go away:

  • Background is the #1 ask at every phase, 16 mentions total. The same dimension flagged as GPT's relative weak spot in the Rollout.
  • Typography stretches across all three edits, the only major topic designers keep returning to until it's right. That's why typography's "no issues" rate climbs from 33% to 59% with iteration. Designers force the fix themselves.

Where designers go when GPT runs out

Of the 41 sessions by 7 designers, 15 sessions ended with a generation the designer would present to a client. But "presentable to a client" didn't mean "ready to ship," because 10 of those 15 still moved to Photoshop, Illustrator, or Figma before the asset could go out. Only 5 of the sessions produced a shippable asset from GPT alone.

The remaining 26 sessions never reached client-ready at all, and 2 of those handed off to another tool to fix what GPT couldn't.

The pattern lines up with what designers asked for in Edit 3: logos, layout, composition, color. Last-mile polish, and precisely the work GPT couldn't deliver reliably.

Across all 41 sessions, transcripts surface the same five tools:

  • Photoshop (most-mentioned): alignment, selective remove, AI cleanup.
  • Adobe Illustrator: vector polish and structural fixes.
  • Figma: typography, brand elements, final asset assembly.
  • Pinterest and Google Images: visual references, run in parallel.
Activities outside ChatGPT, ranked by mentions across the Rollout.

A Studio Citrine brief made it concrete. After 8 minutes 43 seconds and three iterations inside ChatGPT, the designer landed on a generation they approved: "yes, present to client." But the asset wasn't shippable yet. They moved to Photoshop to remove the background, and finally to Figma for typography and brand assembly, all before the work could go out.

Session 445 timeline: GPT for concept, Photoshop for cleanup, Figma for the shippable asset.

The generative model handled the concept. The pro tools handled everything that needed precision the model couldn't deliver.

The verdict

GPT Image 2 delivers on the easy half of OpenAI's claim. It hits a strong concept fast, handles dense compositions and stylistic constraints, and lets designers iterate without leaving the chat window. Production readiness is where the Rollout exposes the gap. Prompt adherence regresses by Edit 3. Realism flatlines in the low 60s. Background and typography never stop demanding attention. Only 5 of 41 sessions shipped from GPT alone, while the other 10 client-presentable generations still moved through Photoshop, Illustrator, or Figma before they could go out.

The smartest creative workflow in 2026 knows when to hand off. Use ChatGPT Images 2.0 for concept, composition, and the first two rounds of editing, where iteration still pays. By Edit 3, when prompt adherence starts to drift and the asks turn into logo placement, color, and pixel-level fixes, you've left the model's strength zone. Take anatomy and background work to Photoshop. Take typography and brand assembly to Figma. The model gets you to a concept fast, and the pro stack gets you to a shippable asset.

Methodology

To test this, we selected six campaign types to stress the visual dimensions OpenAI highlighted in the GPT Image 2 launch: small text and iconography, dense compositions, UI elements, and subtle stylistic constraints. The six briefs:

  • Studio Citrine: realism and texture.
  • Sonic Collective: typography, iconography, and dense layouts.
  • Ridge: dense layouts and small text.
  • ASAGI: subtle stylistic constraints and brand mood.
  • Kinetix: depth of field and fine grain control.
  • Alpine: realism and texture.

Seven designers were given all six fictional campaign briefs and asked to use GPT Image 2 as their primary tool. Each session followed the same pattern: an initial generation, followed by three rounds of edits (Edit 1, Edit 2, Edit 3). After the third edit, designers could either continue in ChatGPT or move to a tool of their choice.

Sessions were recorded end-to-end using Rollout, Contra Labs' session-capture tool. Rollout records the screen, video, audio, mouse trails, clicks, and keyboard input, including everything designers did after they left ChatGPT.

Each generation was flagged for errors by the evaluators on five dimensions:

  • Typography: incorrect or incoherent copy, broken text formatting, illegible font, lacking contrast, not readable, or no issues.
  • Prompt Adherence: hallucinations, omissions, inaccuracies, or no issues.
  • Realism and Physics: unnatural textures, broken object or body physics, impossible light or shadow directions, or no issues.
  • Modernity and Trend: dated font, dated color scheme, dated imagery, dated graphic asset styles, or no issues.
  • Color and Lighting: inconsistent palette, incorrect shades, clashing colors, or no issues.

During analysis, each prompt was tagged using 16 categories (background, typography, pose, texture, etc.) to surface what designers were trying to change at each phase.
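The per-dimension "no issues" shares reported above fall out of this flagging scheme directly. As a minimal sketch, assuming a hypothetical record shape (the study's actual pipeline and data format are not published), the aggregation looks like:

```python
# Hypothetical sketch: compute the "no issues" share per (dimension, edit phase)
# from evaluator flags. Record shape and field names are assumptions, not the
# study's actual schema. An empty "flags" list stands for "no issues".
from collections import defaultdict

def no_issues_share(records):
    """Return {(dimension, edit): fraction of generations with no flags}."""
    totals = defaultdict(int)
    clean = defaultdict(int)
    for r in records:
        key = (r["dimension"], r["edit"])
        totals[key] += 1           # every flagged generation counts once
        if not r["flags"]:
            clean[key] += 1        # evaluator marked "no issues"
    return {k: clean[k] / totals[k] for k in totals}

# Toy usage: three Typography evaluations at Edit 1, one of them clean.
records = [
    {"edit": 1, "dimension": "Typography", "flags": []},
    {"edit": 1, "dimension": "Typography", "flags": ["lacking contrast"]},
    {"edit": 1, "dimension": "Typography", "flags": ["broken text formatting"]},
]
shares = no_issues_share(records)
```

Repeating this over all 41 sessions and five dimensions yields the kind of per-edit shares charted earlier (e.g. typography at 33% for Edit 1 and 59% for Edit 3).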


Connecting with the missing signal: taste

Contra connects top creative minds with AI teams training models to understand taste. This is expert input, not crowd labor. It's the creative layer powering the next generation of AI.

Designers

Writers

Marketers

Engineers

Social Media Experts

Video Editors & Animators

Music & Audio Engineers

1.5M+ creative experts

400+ skills and tools represented

$250M+ verified expert earnings
