Benchmark · June 3, 2026 · 6 min read

Ideogram v4 won 47.9% of typography matchups.

10 designers, 4 models, 240 images. Spelling is solved. Typographic craft and client-readiness are where Ideogram v4 pulls away.

01 Ideogram v4 won 47.9% of typography matchups, far ahead of Gemini (30.0%), FLUX.2 (15.5%), and Grok (15.0%).
02 Text accuracy is near-tied across the top three (4.54–4.59 of 5). Typographic craft is where Ideogram pulls away (3.92 vs 3.34–3.50).
03 Ideogram is the only model whose typography lands in client-ready territory (3.55 of 5). The rest sit in "needs rework."
04 Gemini led detail accuracy on busy scenes (4.00), but reviewers were slightly more willing to ship Ideogram's detailed work (3.42).
05 Gemini topped stylized work (3.81). Best-scoring prompts across all categories were lean; worst were over-specified with stacked directives.

Contra Labs

Research

Spelling letters correctly is no longer the bottleneck for AI image models. The top three (Ideogram v4, Gemini 3.1 Flash, Grok Imagine 1.0) sit within hundredths of a point on text accuracy. The bottleneck is typographic craft: the font choice, kerning, spacing, and integration that decide whether a designer would actually ship the image. On that dimension, only one of the four lands in client-ready territory.

Ideogram v4 outputs from the typography category, drawn from the 240-image blind evaluation.

Contra Labs ran a blind, head-to-head evaluation of four current AI image models to answer a question every designer is quietly asking: which one can you actually put in front of a client? The four contenders were Ideogram v4, Gemini 3.1 Flash Image Preview (Nano Banana 2), Grok Imagine 1.0, and FLUX.2 [max].

Ten creative professionals with expertise in content, branding, and image editing evaluated 240 images on category-specific technical criteria. This write-up covers the typography, detailed-scene, and stylized-prompt results.

Methodology

Ten working designers sourced from Contra's top-earning talent evaluated three prompt categories (Typography, Highly Detailed Scenes, Dreamy/Stylized) with 20 prompts each. We did an initial round of 10 prompts per category, then sent out an expansion task with another 10. The prompting methodology stayed the same; the same rater group participated in both rounds.

Every prompt ran through all four models. The four outputs were shown to reviewers blind: model identity hidden, order randomized, so brand reputation and screen position couldn't tilt the results.

Reviewers did three things per task. First, pairwise preference: every model went up against every other in a round-robin (six matchups per prompt), which forces a clean ranking of all four. Second, Likert ratings (1–5) on category-specific scales. For typography, that's Text Accuracy and Typographic Craft, plus one universal question that applies to every image: would you use this in real client work? Third, open-ended notes on strengths and weaknesses. With 20 prompts per category and 10 reviewers each, every prompt set yields roughly 200 observations.
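The design arithmetic above can be checked with a short sketch. The model names, category names, and counts come from this write-up; the variable names are illustrative assumptions and not Contra Labs' own code.

```python
from itertools import combinations

# Model and category names as given in the write-up; everything else is illustrative.
MODELS = ["Ideogram v4", "Gemini 3.1 Flash", "Grok Imagine 1.0", "FLUX.2 [max]"]
CATEGORIES = ["Typography", "Highly Detailed Scenes", "Dreamy/Stylized"]
PROMPTS_PER_CATEGORY = 20
REVIEWERS = 10

# Round-robin: every unordered pair of models meets once per prompt.
matchups = list(combinations(MODELS, 2))

# Every prompt runs through every model.
images_total = len(CATEGORIES) * PROMPTS_PER_CATEGORY * len(MODELS)

# One observation per (prompt, reviewer) pair within a category.
observations_per_category = PROMPTS_PER_CATEGORY * REVIEWERS

print(len(matchups))              # 6
print(images_total)               # 240
print(observations_per_category)  # 200
```

Four models produce six unordered pairs, which is where the six matchups per prompt come from, and 3 × 20 × 4 gives the 240 images in the study.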

Typography: Ideogram out front

In head-to-head preference, Ideogram v4 won first place 47.9% of the time across both rounds, far ahead of Gemini (30.0%), FLUX.2 (15.5%), and Grok (15.0%). Ideogram was the only model to clear the 50% mark in a single round (51.1% in round two).

Typography head-to-head: share of 1st-place finishes by model across both rounds.
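This write-up does not spell out how a first-place finish was derived from the six pairwise matchups. One plausible reading, sketched below purely as an assumption, is that the model with the most pairwise wins in a reviewer's task finishes first, with ties credited to every tied model. That reading would be consistent with the four published shares summing to slightly more than 100%. The task data in the sketch is made up for illustration and uses placeholder labels A to D.

```python
from collections import Counter

def first_place_models(pairwise_winners):
    """Return the model(s) with the most pairwise wins in one reviewer task.

    pairwise_winners holds the winner's label for each of the six matchups.
    Ties for the top win count are all returned (an assumption, see lead-in).
    """
    wins = Counter(pairwise_winners)
    top = max(wins.values())
    return {model for model, count in wins.items() if count == top}

def first_place_share(tasks):
    """Share of tasks in which each model finished first (ties count for each)."""
    firsts = Counter()
    for winners in tasks:
        firsts.update(first_place_models(winners))
    return {model: n / len(tasks) for model, n in firsts.items()}

# Two made-up tasks, NOT study data.
tasks = [
    ["A", "A", "A", "B", "B", "C"],  # A wins 3 of 6, so A is first
    ["A", "A", "B", "B", "C", "D"],  # A and B tie on 2 wins, so both are first
]
shares = first_place_share(tasks)
print(shares)

# Published first-place shares from the typography section, in percent.
published = [47.9, 30.0, 15.5, 15.0]
print(round(sum(published), 1))  # 108.4
```

Under this tie rule the shares can add up to more than 100%, as they do in the toy example. Whether the study handled ties this way is not stated.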

The rating scales tell a more nuanced story. On raw Text Accuracy (does the text say exactly what the prompt asked, spelled correctly and legibly) the top three are nearly tied: Ideogram (4.59), Gemini (4.55), and Grok (4.54) are separated by hundredths of a point, with FLUX.2 trailing at 3.30. Getting the letters right is close to a solved problem for the leading models.

Where Ideogram pulls away is Typographic Craft. This dimension covers font choice, kerning, spacing, and how the type integrates with the image. Ideogram scored 3.92, clearly ahead of Gemini (3.50), Grok (3.36), and FLUX.2 (3.34). Most models can spell. Ideogram makes the type look designed.

Typographic Craft averages, 1–5 Likert · Ideogram v4 leads the field by ~0.4 points.

That edge carries into the headline question. On "Would you use this in real client work?" for typography, Ideogram led at 3.55, with Gemini at 2.84, Grok at 2.61, and FLUX.2 at 2.49. For type-driven work, Ideogram is the only model whose average lands in "likely usable" territory.

Client-work gate for typography prompts · Ideogram is the only model whose average crosses into client-ready territory.
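To see how the typography gaps compare across the three scales, the published averages can be dropped into a small table and the leader's margin over the runner-up computed for each. The numbers are the ones reported above; the dictionary layout and the `lead_margin` helper are illustrative assumptions.

```python
# Averages as published in the typography section (1-5 Likert).
typography = {
    "Text Accuracy":     {"Ideogram": 4.59, "Gemini": 4.55, "Grok": 4.54, "FLUX.2": 3.30},
    "Typographic Craft": {"Ideogram": 3.92, "Gemini": 3.50, "Grok": 3.36, "FLUX.2": 3.34},
    "Client-work gate":  {"Ideogram": 3.55, "Gemini": 2.84, "Grok": 2.61, "FLUX.2": 2.49},
}

def lead_margin(scores):
    """Return (leader, runner_up, gap) for one scale, gap rounded to 2 decimals."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (leader, top), (runner_up, second) = ranked[0], ranked[1]
    return leader, runner_up, round(top - second, 2)

margins = {scale: lead_margin(scores) for scale, scores in typography.items()}
for scale, (leader, runner_up, gap) in margins.items():
    print(f"{scale}: {leader} leads {runner_up} by {gap:.2f}")
```

The margin grows from 0.04 on Text Accuracy to 0.42 on Typographic Craft and 0.71 on the client-work gate, which is the pattern the section describes: the models separate on craft and usability, not on spelling.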

Two prompts from the typography set illustrate what separates a client-ready image from one that needs rework: the same brief, the same blind conditions, but a visible gap in how each model handles font choice, spacing, and layout.

Book back cover prompt · Ideogram v4 (4.0 / 5) and Grok Imagine 1.0 (3.1 / 5), with reviewer notes.

Café menu board prompt · Ideogram v4 (4.4 / 5) and FLUX.2 [max] (2.4 / 5), with reviewer notes.

Detailed scenes: a two-horse race at the top

The Highly Detailed Scenes category covers busy prompts with specific counts, attributes, and spatial relationships. Here Gemini sets the pace on the technical scales. On Detail Accuracy (how faithfully each model renders the specific counts, attributes, and positioning a prompt asks for), Gemini led at 4.00, ahead of Grok (3.77), Ideogram (3.71), and FLUX.2 (3.66). On Render Coherence (clean, plausible rendering, free of artifacts and warping), Gemini and Ideogram are effectively neck and neck at 4.18 and 4.15, with FLUX.2 (3.86) and Grok (3.81) a step behind. All four clear a 3.6 average on both scales, so complex scenes are broadly within reach for every model.

Detail Accuracy and Render Coherence on highly detailed scenes · all four models clear 3.6 on both scales.

But the client-work gate reorders things. For detailed scenes, Ideogram edged ahead at 3.42, with Gemini close behind at 3.37, then FLUX.2 (3.01) and Grok (2.82). Gemini is marginally more accurate and coherent, yet reviewers were slightly more willing to ship Ideogram's detailed scenes. "Client-ready" is a judgment about the whole image. The top two are essentially tied and both sit in "usable with minor edits" territory; the bottom two hover around "maybe."

Client-work gate for detailed scenes · Ideogram edges Gemini on usability despite Gemini's accuracy lead.
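The reordering is easier to see when the three detailed-scene scales are ranked side by side. The sketch below uses the averages reported in this section; the `ranking` helper is an illustrative assumption.

```python
# Averages as published for Highly Detailed Scenes (1-5 Likert).
detail_accuracy = {"Gemini": 4.00, "Grok": 3.77, "Ideogram": 3.71, "FLUX.2": 3.66}
render_coherence = {"Gemini": 4.18, "Ideogram": 4.15, "FLUX.2": 3.86, "Grok": 3.81}
client_gate = {"Ideogram": 3.42, "Gemini": 3.37, "FLUX.2": 3.01, "Grok": 2.82}

def ranking(scores):
    """Model names ordered from highest to lowest average."""
    return sorted(scores, key=scores.get, reverse=True)

print(ranking(detail_accuracy))   # ['Gemini', 'Grok', 'Ideogram', 'FLUX.2']
print(ranking(render_coherence))  # ['Gemini', 'Ideogram', 'FLUX.2', 'Grok']
print(ranking(client_gate))       # ['Ideogram', 'Gemini', 'FLUX.2', 'Grok']

# The "all four clear 3.6 on both scales" claim, checked directly.
lowest_technical = min(list(detail_accuracy.values()) + list(render_coherence.values()))
print(lowest_technical > 3.6)     # True
```

Gemini tops both technical scales, but the client-work gate puts Ideogram first by 0.05, so the top two swap places while FLUX.2 and Grok keep the order they had on Render Coherence.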

Stylized work: a different leaderboard

On Style Execution for dreamy/stylized prompts, Gemini 3.1 Flash led at 3.81 overall, followed by Grok (3.62) and FLUX.2 (3.60), with Ideogram last at 3.36. A similar pattern holds for Mood and Evocativeness: Gemini at 3.77, Grok at 3.61, FLUX.2 at 3.45, Ideogram at 3.29.

Ideogram's strength is precision and type. When the brief is mood and aesthetic, the other three models move ahead on the rating scales.

Style Execution averages on dreamy/stylized prompts · Gemini takes the lead Ideogram held in typography.

Stylized work is the most competitive category, and it plays to different strengths than Ideogram's. Where the brief is pure mood and atmosphere, Gemini leads Ideogram by a little under half a point on both rating scales (3.81 vs 3.36 on Style Execution, 3.77 vs 3.29 on Mood and Evocativeness), yet Ideogram's images still drew praise for their aesthetics and emotional pull. Reviewers' main note was that Ideogram occasionally added its own creative flourishes, interpreting a prompt rather than executing it literally. For designers who want a model that nails type and precision, that's a reasonable trade. For open-ended mood-boarding, the field is wide open.

Mood and Evocativeness averages on dreamy/stylized prompts · Gemini leads, Ideogram trails.

Even where Ideogram trailed on the scales, its outputs still landed with reviewers. Two prompts from the stylized set show why: the images read as polished and evocative, even when the model did not fully adhere to the prompt.

Whimsical greenhouse prompt · Ideogram v4 (3.5 / 5), with reviewer note.

Translucent perfume bottle prompt · Ideogram v4 (3.8 / 5), with reviewer notes.

We coded all 400 open-ended comments on stylized prompts into recurring themes. The most common praise was for style execution and aesthetic fit (20% of comments), followed by composition and evocative mood. The most common critique was prompt mismatch and missing details (12.5%), with under-executed mood and composition issues close behind.

Top themes from 400 open-ended reviewer comments on stylized prompts, coded by strength and weakness.
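For a sense of scale, the two published theme shares can be converted into approximate comment counts. This sketch assumes the percentages are shares of all 400 comments, which is how the text reads; the theme labels are shortened for illustration.

```python
TOTAL_COMMENTS = 400  # from the text

# Shares as published; assumed to be fractions of all 400 comments.
theme_share = {
    "praise: style execution and aesthetic fit": 0.20,
    "critique: prompt mismatch and missing details": 0.125,
}

theme_counts = {
    theme: round(share * TOTAL_COMMENTS) for theme, share in theme_share.items()
}
for theme, count in theme_counts.items():
    print(f"{theme}: about {count} comments")
```

Under that assumption the top praise theme corresponds to about 80 comments and the top critique to about 50.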

What separated the best prompts from the worst

That tension between literal instruction and interpretation showed up in the prompts themselves. When we looked back at which prompts produced Ideogram's best- and worst-scoring images in each category, one pattern cut across all three: the best-rated images consistently came from leaner prompts, while the worst-rated outputs tended to come from longer, more directive prompts. The "worst" profiles (red) spike on word count, specificity, and stacked directives like style, camera/lighting, and technical terms. The "best" profiles (green) sit tighter to the center.

What the best prompts emphasized instead varied by category. For typography, top prompts leaned on what actually matters for type: quoted phrases and brand or named copy that spell out exactly what the words should say, plus reference examples. Stylistic directives stayed out of the way. For dreamy/stylized work, the best prompts were shorter and left more room (higher ambiguity per 100 words, anchored by examples), while the worst were over-specified with detail the model then had to reconcile.

One caveat: this is a correlation. Don't read it as a recipe. More demanding briefs are both longer and harder to satisfy, so prompt length and low scores may share a common cause. The practical takeaway is modest: state the words you need precisely, and resist loading every stylistic directive into one prompt.

Linguistic profile of Ideogram's highest- and lowest-scoring prompts across the three categories. Red (worst) spikes on word count, specificity, and stacked directives; green (best) stays leaner.
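The kind of linguistic profile described here can be approximated with a few simple counts. The sketch below is a rough stand-in, not the study's feature extraction: the keyword lists in `DIRECTIVE_TERMS`, the `profile_prompt` function, and both example prompts are hypothetical and were written for this illustration.

```python
import re

# Hypothetical keyword lists; the study's actual feature definitions are not given here.
DIRECTIVE_TERMS = {
    "style": ["style", "aesthetic"],
    "camera_lighting": ["lens", "bokeh", "backlit", "golden hour", "close-up"],
    "technical": ["8k", "photorealistic", "hdr"],
}

def count_term(term, text):
    """Count whole-word (or whole-phrase) occurrences of term in text."""
    return len(re.findall(r"\b" + re.escape(term) + r"\b", text))

def profile_prompt(prompt):
    """Return word count, quoted-phrase count, and directive hits for a prompt."""
    text = prompt.lower()
    words = re.findall(r"[\w'-]+", text)
    quoted = re.findall(r'"([^"]+)"', prompt)
    directive_hits = {
        group: sum(count_term(term, text) for term in terms)
        for group, terms in DIRECTIVE_TERMS.items()
    }
    return {
        "word_count": len(words),
        "quoted_phrases": len(quoted),
        "directive_hits": directive_hits,
        # Number of directive groups that appear at least once.
        "stacked_directive_groups": sum(1 for n in directive_hits.values() if n > 0),
    }

# Made-up prompts for illustration; not taken from the study's prompt set.
lean = 'Cafe menu board that reads "Daily Brew" with the tagline "Fresh since 1998"'
heavy = ("Cafe menu board, vintage style, moody aesthetic, 85mm lens, bokeh, "
         "golden hour, photorealistic, 8k, hdr")

print(profile_prompt(lean))
print(profile_prompt(heavy))
```

The lean example states its copy in two quoted phrases and triggers no directive groups, while the heavy one stacks all three. Counts like these describe a prompt; as the caveat above notes, they do not show that trimming directives causes better scores.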

The bottom line so far

Across the three categories, a clear division of labor emerged. Ideogram v4 is the model to reach for when text is part of the design: posters, packaging, ad creative, anything typographic. Most leading models spell correctly now. Ideogram creates type that designers would actually ship, leading both typographic craft and the client-work gate by a wide margin. It won nearly half of all head-to-head typography matchups (47.9%, well ahead of every rival) and was the only model whose type-driven work landed in genuinely client-ready territory rather than needing rework. Its strength reaches past text, too: Ideogram was the only model to top the client-work gate in two of the three categories, it held its own against Gemini on the toughest, most detailed scenes, and even in stylized work, where it trailed on the scales, its images still drew praise for their aesthetics and emotional pull. Gemini 3.1 Flash is the steadiest all-rounder, the most accurate on busy, literal scenes and the top pick for pure mood and style work. Grok Imagine and FLUX.2 trail the category leader throughout, landing in "usable with edits" territory.

The "would you use this in real client work?" verdict: Ideogram leads typography (3.55) and edges ahead on detailed scenes (3.42). Gemini takes stylized (3.21). No single model wins everything. The right choice depends on whether the brief is built on words, accuracy, or atmosphere. And whatever you reach for, the prompts that won were the lean, precise ones: say exactly what you need and resist burying the model in directives.

Limitations

At 20 prompts × 10 reviewers, each category has roughly 200 observations. The results should be read as directional; the study is not powered for statistical significance.
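A back-of-envelope calculation gives a feel for what "directional" means at this sample size. The write-up does not report the spread of the ratings, so the standard deviation of 1.0 below is purely an assumption, and the formula treats the 200 ratings as independent, which they are not (the same ten reviewers rated every prompt). Real uncertainty is therefore likely wider than this sketch suggests.

```python
import math

def ci_half_width(sd, n, z=1.96):
    """Approximate 95% half-width for the mean of n independent ratings."""
    return z * sd / math.sqrt(n)

ASSUMED_SD = 1.0  # assumption: rating spread is not reported in the write-up
N = 20 * 10       # prompts x reviewers per category, from the text

half_width = ci_half_width(ASSUMED_SD, N)
print(round(half_width, 2))  # 0.14

# Published typography gaps between Ideogram and Gemini, for comparison.
text_accuracy_gap = round(4.59 - 4.55, 2)      # 0.04
typographic_craft_gap = round(3.92 - 3.50, 2)  # 0.42
print(text_accuracy_gap < half_width < typographic_craft_gap)  # True
```

Under these assumptions, a 0.04 gap sits well inside the noise while a 0.42 gap is roughly three times the half-width, which fits the write-up's framing of Text Accuracy as a near-tie and Typographic Craft as a real separation.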

How we ran this study → Methodology


The world's leading independent

human data & creative evaluation lab

Request partnership

Creative Human Data

Benchmark

Research

Datasets

Jobs
