Benchmark · June 2, 2026 · 6 min read

Ideogram v4 won 47.9% of typography matchups.

10 designers, 4 models, 240 images. Spelling is solved. Typographic craft and client-readiness are where Ideogram v4 pulls away.

  1. Ideogram v4 won 47.9% of typography matchups, far ahead of Gemini (30.0%), FLUX.2 (15.5%), and Grok (15.0%).
  2. Text accuracy is near-tied across the top three (4.54–4.59 of 5). Typographic craft is where Ideogram pulls away (3.92 vs 3.34–3.50).
  3. Ideogram is the only model whose typography lands in client-ready territory (3.55 of 5). The rest sit in "needs rework."
  4. Gemini led detail accuracy on busy scenes (4.00), but reviewers were slightly more willing to ship Ideogram's detailed work (3.42).
  5. Gemini topped stylized work (3.81). Best-scoring prompts across all categories were lean; worst were over-specified with stacked directives.
Contra Labs
Research

Spelling letters correctly is no longer the bottleneck for AI image models. The top three (Ideogram v4, Gemini 3.1 Flash, Grok Imagine 1.0) sit within hundredths of a point on text accuracy. The bottleneck is typographic craft: the font choice, kerning, spacing, and integration that decide whether a designer would actually ship the image. On that dimension, only one of the four lands in client-ready territory.

Contra Labs ran a blind, head-to-head evaluation of four current AI image models to answer a question every designer is quietly asking: which one can you actually put in front of a client? The four contenders were Ideogram v4, Gemini 3.1 Flash Image Preview (Nano Banana 2), Grok Imagine 1.0, and FLUX.2 [max].

Ten creative professionals with expertise in content, branding, and image editing evaluated 240 images on category-specific technical criteria. This write-up covers the typography, detailed-scene, and stylized-prompt results.

Methodology

Ten working designers sourced from Contra's top-earning talent evaluated three prompt categories (Typography, Highly Detailed Scenes, Dreamy/Stylized) with 20 prompts each. We did an initial round of 10 prompts per category, then sent out an expansion task with another 10. The prompting methodology stayed the same; the same rater group participated in both rounds.

Every prompt ran through all four models. The four outputs were shown to reviewers blind: model identity hidden, order randomized, so brand reputation and screen position couldn't tilt the results.
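A minimal sketch of how a blind, randomized presentation like the one described above could be set up. The function name, the A–D labels, and the per-reviewer seeding are assumptions made for illustration; the source does not describe the actual tooling.

```python
import random

MODELS = ["Ideogram v4", "Gemini 3.1 Flash", "Grok Imagine 1.0", "FLUX.2 [max]"]

def blind_presentation(prompt_id, reviewer_id, models=MODELS):
    """Return (label, model) pairs in shuffled order for one reviewer and prompt.

    Reviewers would see only the neutral labels; the label-to-model mapping
    is kept aside until scoring.
    """
    # Seeding per (prompt, reviewer) makes each shuffle reproducible for audits.
    # This is an assumption of the sketch, not a detail from the study.
    rng = random.Random(f"{prompt_id}:{reviewer_id}")
    order = list(models)
    rng.shuffle(order)
    labels = ["A", "B", "C", "D"]
    return list(zip(labels, order))

slots = blind_presentation(prompt_id=7, reviewer_id=3)
print([label for label, _ in slots])  # reviewer-facing labels only
```

Shuffling per reviewer and per prompt means no model is tied to a fixed screen position, which is the property the blind setup relies on.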

Reviewers did three things per task. First, pairwise preference: every model went up against every other in a round-robin (six matchups per prompt), which forces a clean ranking of all four. Second, Likert ratings (1–5) on category-specific scales. For typography, that's Text Accuracy and Typographic Craft, plus one universal question that applies to every image: would you use this in real client work? Third, open-ended notes on strengths and weaknesses. With 20 prompts per category and 10 reviewers each, every prompt set yields roughly 200 observations.
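The counts in this section can be checked with a few lines. The sketch below enumerates the round-robin and turns one reviewer's pairwise picks into a ranking; the picks themselves are hypothetical and only illustrate the mechanics.

```python
from itertools import combinations

MODELS = ["Ideogram v4", "Gemini 3.1 Flash", "Grok Imagine 1.0", "FLUX.2 [max]"]
CATEGORIES = ["Typography", "Highly Detailed Scenes", "Dreamy/Stylized"]
PROMPTS_PER_CATEGORY = 20
REVIEWERS = 10

# Every model meets every other model once per prompt.
matchups = list(combinations(MODELS, 2))

def rank_from_picks(picks):
    """Order models by number of pairwise wins for one reviewer on one prompt."""
    wins = {model: 0 for model in MODELS}
    for winner in picks.values():
        wins[winner] += 1
    return sorted(MODELS, key=lambda m: wins[m], reverse=True)

# Hypothetical picks from a reviewer whose preferences are consistent
# (illustrative only, not data from the study).
hypothetical_order = ["Gemini 3.1 Flash", "Ideogram v4", "FLUX.2 [max]", "Grok Imagine 1.0"]
picks = {pair: min(pair, key=hypothetical_order.index) for pair in matchups}

images_total = len(CATEGORIES) * PROMPTS_PER_CATEGORY * len(MODELS)
observations_per_category = PROMPTS_PER_CATEGORY * REVIEWERS

print(len(matchups))               # 6
print(rank_from_picks(picks))      # same order as hypothetical_order
print(images_total)                # 240
print(observations_per_category)   # 200
```

If a reviewer's picks form a cycle (A over B, B over C, C over A), win counts tie and a tie-break rule is needed. The source does not say how such cases were handled.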

Typography: Ideogram out front

In head-to-head preference, Ideogram v4 won first place 47.9% of the time across both rounds, far ahead of Gemini (30.0%), FLUX.2 (15.5%), and Grok (15.0%). Ideogram was the only model to clear the 50% mark in a single round (51.1% in round two).

The rating scales tell a more nuanced story. On raw Text Accuracy (does the text say exactly what the prompt asked, spelled correctly and legibly) the top three are nearly tied: Ideogram (4.59), Gemini (4.55), and Grok (4.54) are separated by hundredths of a point, with FLUX.2 trailing at 3.30. Getting the letters right is close to a solved problem for the leading models.

Where Ideogram pulls away is Typographic Craft. This dimension covers font choice, kerning, spacing, and how the type integrates with the image. Ideogram scored 3.92, a clear gap over Gemini (3.50), Grok (3.36), and FLUX.2 (3.34). Most models can spell. Ideogram makes the type look designed.

That edge carries into the headline question. On "Would you use this in real client work?" for typography, Ideogram led at 3.55, with Gemini at 2.84, Grok at 2.61, and FLUX.2 at 2.49. For type-driven work, Ideogram is the only model whose average lands in "likely usable" territory.
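To make the size of these gaps concrete, the short sketch below recomputes them from the means reported above. The dictionary and function names are ours; the numbers are the ones quoted in this section.

```python
# Mean ratings (1-5) for typography prompts, as reported in this section.
text_accuracy = {"Ideogram v4": 4.59, "Gemini": 4.55, "Grok": 4.54, "FLUX.2": 3.30}
typographic_craft = {"Ideogram v4": 3.92, "Gemini": 3.50, "Grok": 3.36, "FLUX.2": 3.34}
client_work = {"Ideogram v4": 3.55, "Gemini": 2.84, "Grok": 2.61, "FLUX.2": 2.49}

def spread(scores, top_n):
    """Gap between the best and the worst of the top_n scores."""
    ranked = sorted(scores.values(), reverse=True)[:top_n]
    return round(ranked[0] - ranked[-1], 2)

def lead_over_runner_up(scores):
    """Gap between the best score and the second best."""
    return spread(scores, top_n=2)

print(spread(text_accuracy, top_n=3))          # 0.05
print(lead_over_runner_up(typographic_craft))  # 0.42
print(lead_over_runner_up(client_work))        # 0.71
```

The top three are separated by 0.05 on Text Accuracy, while Ideogram's lead over the runner-up is 0.42 on Typographic Craft and 0.71 on the client-work question.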

The qualitative notes tracked the scores. On a photorealistic book back cover with gold-foil serif type on deep green linen, Ideogram v4 scored 4.0 out of 5 and Grok Imagine scored 3.1.

"Consistent alignment contributes to a strong sense of order and professionalism. Title is a strong typographic choice. Body tracking details are off. Overall looks polished." Reviewer note · Ideogram v4
"Text accuracy. Text artifacts in the small italic phrase. The typography lacks contrast and visual interest. Design overall is lacking a professional look." Reviewer note · Grok Imagine 1.0

On a modern café menu board (bold header, three columns of drinks and prices in clean geometric type, small italic footer, bright even daylight), Ideogram landed at 4.4 out of 5; FLUX.2 [max] at 2.4.

"Text is accurate. Typography choices and spacing are correct. The typography supports the concept and provides strong visual communication." Reviewer note · Ideogram v4
"Text is all accurate. The three columns feel evenly spaced and align prices to the right which helps with readability. The overall typography matches the prompt, and the background feels natural, with the warm white requested." Reviewer note · Ideogram v4
"Mistakes in accuracy, design is also a bit outdated." Reviewer note · FLUX.2 [max]
"Text is not accurate. Body typography lacks thoughtful type choice and spacing. The design lacks the refinement expected in professional work." Reviewer note · FLUX.2 [max]

Detailed scenes: a two-horse race at the top

The Highly Detailed Scenes category covers busy prompts with specific counts, attributes, and spatial relationships. Here Gemini sets the pace on the technical scales. On Detail Accuracy (how faithfully each model renders the specific counts, attributes, and positioning a prompt asks for), Gemini led at 4.00, ahead of Grok (3.77), Ideogram (3.71), and FLUX.2 (3.66). On Render Coherence (clean, plausible rendering, free of artifacts and warping), Gemini and Ideogram are effectively neck and neck at 4.18 and 4.15, with FLUX.2 (3.86) and Grok (3.81) a step behind. All four clear a 3.6 average on both scales, so complex scenes are broadly within reach for every model.

But the client-work gate reorders things. For detailed scenes, Ideogram edged ahead at 3.42, with Gemini close behind at 3.37, then FLUX.2 (3.01) and Grok (2.82). Gemini is marginally more accurate and coherent, yet reviewers were slightly more willing to ship Ideogram's detailed scenes. "Client-ready" is a judgment about the whole image. The top two are essentially tied and both sit in "usable with minor edits" territory; the bottom two hover around "maybe."
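The reordering is easy to see when the three scales are ranked side by side. This sketch uses only the means reported for detailed scenes; the helper name is ours.

```python
# Mean ratings (1-5) for Highly Detailed Scenes, as reported in this section.
detail_accuracy = {"Gemini": 4.00, "Grok": 3.77, "Ideogram": 3.71, "FLUX.2": 3.66}
render_coherence = {"Gemini": 4.18, "Ideogram": 4.15, "FLUX.2": 3.86, "Grok": 3.81}
client_work = {"Ideogram": 3.42, "Gemini": 3.37, "FLUX.2": 3.01, "Grok": 2.82}

def ranking(scores):
    """Model names ordered from highest to lowest mean score."""
    return sorted(scores, key=scores.get, reverse=True)

for name, scores in [
    ("Detail Accuracy", detail_accuracy),
    ("Render Coherence", render_coherence),
    ("Client work", client_work),
]:
    print(f"{name}: {ranking(scores)}")

# Every model clears 3.6 on both technical scales.
lowest_technical = min(min(detail_accuracy.values()), min(render_coherence.values()))
print(lowest_technical)  # 3.66
```

Gemini ranks first on both technical scales, but Ideogram moves to first on the client-work question, with a margin of 0.05 over Gemini.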

Stylized work: a different leaderboard

On Style Execution for dreamy/stylized prompts, Gemini 3.1 Flash led at 3.81 overall, followed by Grok (3.62) and FLUX.2 (3.60), with Ideogram last at 3.36. A similar pattern holds for Mood and Evocativeness: Gemini at 3.77, Grok at 3.61, FLUX.2 at 3.45, Ideogram at 3.29.

Ideogram's strength is precision and type. When the brief is mood and aesthetic, the other models close the gap.

Stylized work is the most competitive category, and it plays to different strengths than Ideogram's. Where the brief is pure mood and atmosphere, Gemini leads on the rating scales (3.81 vs 3.36 on Style Execution, 3.77 vs 3.29 on Mood and Evocativeness), yet Ideogram's images still drew praise for their aesthetics and emotional pull. Reviewers' main note was that Ideogram occasionally added its own creative flourishes, interpreting a prompt rather than executing it literally. For designers who want a model that nails type and precision, that's a reasonable trade. For open-ended mood-boarding, the field is wide open.

"This image definitely evokes a whimsical, aspirational feeling; however, it does not follow the prompt completely, with the house on the cloud in the background. One thing I like about the image is the leaves in the foreground of the image, blurred due to depth of field. It adds another whimsical and aesthetic element." Reviewer note · Ideogram v4
"Overall composition is very good. The style is executed really well. The only thing that's a bit overboard is the droplets around the product. Those should be more scattered and less organized. The product itself is well placed. It's seamless. The only issue would be the label. But other than that, every element is well integrated." Reviewer note · Ideogram v4
"Strong luxury fragrance-advertising aesthetic. The mood feels serene and aspirational. The floating pebble bottle is the clear focal point." Reviewer note · Ideogram v4

We coded all 400 open-ended comments on stylized prompts into recurring themes. The most common praise was for style execution and aesthetic fit (20% of comments), followed by composition and evocative mood. The most common critique was prompt mismatch and missing details (12.5%), with under-executed mood and composition issues close behind.
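As a quick sanity check, the reported shares translate into comment counts as follows. The published percentages may be rounded, so the counts are approximate; the variable names are ours.

```python
TOTAL_COMMENTS = 400  # open-ended comments on stylized prompts, per the text

# Shares reported for the most common praise and the most common critique.
reported_shares = {
    "style execution and aesthetic fit (praise)": 0.20,
    "prompt mismatch and missing details (critique)": 0.125,
}

approx_counts = {
    theme: round(TOTAL_COMMENTS * share) for theme, share in reported_shares.items()
}

for theme, count in approx_counts.items():
    print(f"{theme}: about {count} comments")
```

That works out to about 80 comments for the top praise theme and about 50 for the top critique.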

What separated the best prompts from the worst

That tension between literal instruction and interpretation showed up in the prompts themselves. When we looked back at which prompts produced the best and worst scoring images per category for Ideogram, one pattern cut across all three categories: the best-rated images consistently came from leaner prompts, while the worst-rated outputs tended to come from longer, more directive prompts. The "worst" profiles (red) spike on word count, specificity, and stacked directives like style, camera/lighting, and technical terms. The "best" profiles (green) sit tighter to the center.

What the best prompts emphasized instead varied by category. For typography, top prompts leaned on what actually matters for type: quoted phrases and brand or named copy that spell out exactly what the words should say, plus reference examples. Stylistic directives stayed out of the way. For dreamy/stylized work, the best prompts were shorter and left more room (higher ambiguity per 100 words, anchored by examples), while the worst were over-specified with detail the model then had to reconcile.

One caveat: this is a correlation, not a recipe. More demanding briefs are both longer and harder to satisfy, so prompt length and low scores may share a common cause. The safe reading is modest: state the words you need precisely, and resist loading every stylistic directive into one prompt.

The bottom line so far

Across the three categories, a clear division of labor emerged. Ideogram v4 is the model to reach for when text is part of the design: posters, packaging, ad creative, anything typographic. Most leading models spell correctly now. Ideogram creates type designers would actually ship, leading both typographic craft and the client-work gate by a wide margin. It won nearly half of all head-to-head typography matchups (47.9%, well ahead of every rival) and was the only model whose type-driven work landed in genuinely client-ready territory rather than needing rework. Its strength reaches past text, too: Ideogram was the only model to top the client-work gate in two of the three categories, it held its own against Gemini on the toughest, most detailed scenes, and even in stylized work, where it trailed on the scales, its images still drew praise for their aesthetics and emotional pull. Gemini 3.1 Flash is the steadiest all-rounder, the most accurate on busy, literal scenes and the top pick for pure mood and style work. Grok Imagine and FLUX.2 trail the leaders across the board, landing in "usable with edits" territory.

The "would you use this in real client work?" verdict: Ideogram leads typography (3.55) and edges ahead on detailed scenes (3.42). Gemini takes stylized (3.21). No single model wins everything. The right choice depends on whether the brief is built on words, accuracy, or atmosphere. And whatever you reach for, the prompts that won were the lean, precise ones: say exactly what you need and resist burying the model in directives.

Limitations

At 20 prompts × 10 reviewers, each category has roughly 200 observations. Treat the results as directional; the study is not powered for statistical significance.
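For a rough sense of scale, the sketch below computes a worst-case standard error for a mean rating on a 1–5 scale with about 200 observations. It assumes independent observations, which this design does not guarantee because the same ten reviewers rated every prompt, so read it as a yardstick and not as the study's error bars.

```python
import math

N_OBSERVATIONS = 20 * 10          # prompts x reviewers per category
SCALE_MIN, SCALE_MAX = 1, 5       # Likert range used in the study

# Popoviciu's inequality: a variable bounded in [a, b] has SD of at most (b - a) / 2.
max_sd = (SCALE_MAX - SCALE_MIN) / 2

# Standard error of the mean under the (assumed) independence of observations.
worst_case_se = max_sd / math.sqrt(N_OBSERVATIONS)

print(round(worst_case_se, 3))  # 0.141
```

Actual rating spreads are narrower than this worst case, but correlation within each reviewer's ratings pushes uncertainty the other way, which is one reason gaps of a few hundredths of a point should not be over-read.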


Connecting with the missing signal: taste

Contra connects top creative minds with AI teams training models to understand taste. This is expert input, not crowd labor. It's the creative layer powering the next generation of AI.

Designers

Writers

Marketers

Engineers

Social Media Experts

Video Editors & Animators

Music & Audio Engineers

1.5M+

creative experts

400+

Skills and tools represented

$250M+

verified expert earnings
