Spelling letters correctly is no longer the bottleneck for AI image models. The top three (Ideogram v4, Gemini 3.1 Flash, Grok Imagine 1.0) sit within hundredths of a point on text accuracy. The bottleneck is typographic craft: the font choice, kerning, spacing, and integration that decide whether a designer would actually ship the image. On that dimension, only one of the four lands in client-ready territory.
Contra Labs ran a blind, head-to-head evaluation of four current AI image models to answer a question every designer is quietly asking: which one can you actually put in front of a client? The four contenders were Ideogram v4, Gemini 3.1 Flash Image Preview (Nano Banana 2), Grok Imagine 1.0, and FLUX.2 [max].
Ten creative professionals with expertise in content, branding, and image editing evaluated 240 images on category-specific technical criteria. This write-up covers the typography, detailed-scene, and stylized-prompt results.
Methodology
Ten working designers sourced from Contra's top-earning talent evaluated three prompt categories (Typography, Highly Detailed Scenes, Dreamy/Stylized) with 20 prompts each. We ran an initial round of 10 prompts per category, then a second, expansion round with another 10. The prompting methodology stayed the same, and the same rater group participated in both rounds.
Every prompt ran through all four models. The four outputs were shown to reviewers blind: model identity hidden, order randomized, so brand reputation and screen position couldn't tilt the results.
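A minimal sketch of how such a blind presentation could be set up. The function name, the "Image A" labels, and the seed are illustrative assumptions, not the tooling used in the study.

```python
import random

MODELS = ["Ideogram v4", "Gemini 3.1 Flash", "Grok Imagine 1.0", "FLUX.2 [max]"]

def blind_presentation(models, rng):
    """Return neutral labels for the reviewer and a private label -> model key."""
    shuffled = list(models)
    rng.shuffle(shuffled)  # randomize screen position for this prompt
    labels = [f"Image {chr(ord('A') + i)}" for i in range(len(shuffled))]
    key = dict(zip(labels, shuffled))  # stored separately, never shown to reviewers
    return labels, key

rng = random.Random(7)  # arbitrary seed, used only to make the sketch repeatable
labels, key = blind_presentation(MODELS, rng)
print(labels)  # the reviewer sees only these neutral labels
```

Scores are recorded against the neutral labels and joined back to model names through the key only after rating is complete.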
Reviewers did three things per task. First, pairwise preference: every model went up against every other in a round-robin (six matchups per prompt), which forces a clean ranking of all four. Second, Likert ratings (1–5) on category-specific scales. For typography, those are Text Accuracy and Typographic Craft, plus one universal question that applies to every image: would you use this in real client work? Third, open-ended notes on strengths and weaknesses. With 20 prompts per category and 10 reviewers per prompt, every category yields roughly 200 observations.
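The round-robin and the study's counts can be checked with a short sketch. The placeholder models A to D and the reviewer's picks are hypothetical; only the counts (four models, three categories, 20 prompts, 10 reviewers) come from the write-up.

```python
from collections import Counter
from itertools import combinations

MODELS = ["A", "B", "C", "D"]  # placeholders standing in for the four models

# Every model meets every other exactly once: C(4, 2) = 6 matchups per prompt.
pairs = list(combinations(MODELS, 2))

def rank_from_preferences(models, winners):
    """Order models by number of pairwise wins (most wins first)."""
    wins = Counter(winners)
    return sorted(models, key=lambda m: wins[m], reverse=True)

# Hypothetical picks from one reviewer on one prompt, one winner per matchup.
picks = ["A", "A", "A", "B", "B", "C"]
ranking = rank_from_preferences(MODELS, picks)

images = 3 * 20 * 4     # categories x prompts x models
observations = 20 * 10  # prompts x reviewers, per category

print(len(pairs), ranking, images, observations)
```

Win counts give a strict ranking only when a reviewer's preferences are transitive. A cycle (A over B, B over C, C over A) produces tied win counts, which this sketch leaves in their original order.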
Typography: Ideogram out front
In head-to-head preference, Ideogram v4 won first place 47.9% of the time across both rounds, far ahead of Gemini (30.0%), FLUX.2 (15.5%), and Grok (15.0%). Ideogram was the only model to clear the 50% mark in a single round (51.1% in round two).
The rating scales tell a more nuanced story. On raw Text Accuracy (does the text say exactly what the prompt asked, spelled correctly and legibly) the top three are nearly tied: Ideogram (4.59), Gemini (4.55), and Grok (4.54) are separated by hundredths of a point, with FLUX.2 trailing at 3.30. Getting the letters right is close to a solved problem for the leading models.
Where Ideogram pulls away is Typographic Craft, the dimension that covers font choice, kerning, spacing, and how the type integrates with the image. Ideogram scored 3.92, a clear gap over Gemini (3.50), Grok (3.36), and FLUX.2 (3.34). Most models can spell. Ideogram makes the type look designed.
That edge carries into the headline question. On "Would you use this in real client work?" for typography, Ideogram led at 3.55, with Gemini at 2.84, Grok at 2.61, and FLUX.2 at 2.49. For type-driven work, Ideogram is the only model whose average lands in "likely usable" territory.
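The contrast between the three typography scales is easy to see by computing the leader's margin over the runner-up on each. This is a small sketch using the means reported above; the helper name is illustrative.

```python
# Mean scores reported in this write-up (1-5 Likert scales, typography prompts).
typography = {
    "Text Accuracy": {"Ideogram": 4.59, "Gemini": 4.55, "Grok": 4.54, "FLUX.2": 3.30},
    "Typographic Craft": {"Ideogram": 3.92, "Gemini": 3.50, "Grok": 3.36, "FLUX.2": 3.34},
    "Client work": {"Ideogram": 3.55, "Gemini": 2.84, "Grok": 2.61, "FLUX.2": 2.49},
}

def lead_over_runner_up(scores):
    """Gap between the best and second-best mean on one scale."""
    ordered = sorted(scores.values(), reverse=True)
    return round(ordered[0] - ordered[1], 2)

margins = {scale: lead_over_runner_up(s) for scale, s in typography.items()}
for scale, margin in margins.items():
    print(f"{scale}: {margin:.2f}")
```

The margin grows from 0.04 on Text Accuracy to 0.42 on Typographic Craft and 0.71 on the client-work question, which is the pattern the prose describes.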
The qualitative notes tracked the scores. On a photorealistic book back cover with gold-foil serif type on deep green linen, Ideogram v4 scored 4.0 out of 5 and Grok Imagine scored 3.1.
"Consistent alignment contributes to a strong sense of order and professionalism. Title is a strong typographic choice. Body tracking details are off. Overall looks polished." Reviewer note · Ideogram v4
"Text accuracy. Text artifacts in the small italic phrase. The typography lacks contrast and visual interest. Design overall is lacking a professional look." Reviewer note · Grok Imagine 1.0
On a modern café menu board (bold header, three columns of drinks and prices in clean geometric type, small italic footer, bright even daylight), Ideogram landed at 4.4 out of 5; FLUX.2 [max] at 2.4.
"Text is accurate. Typography choices and spacing are correct. The typography supports the concept and provides strong visual communication." Reviewer note · Ideogram v4
"Text is all accurate. The three columns feel evenly spaced and align prices to the right which helps with readability. The overall typography matches the prompt, and the background feels natural, with the warm white requested." Reviewer note · Ideogram v4
"Mistakes in accuracy, design is also a bit outdated." Reviewer note · FLUX.2 [max]
"Text is not accurate. Body typography lacks thoughtful type choice and spacing. The design lacks the refinement expected in professional work." Reviewer note · FLUX.2 [max]
Detailed scenes: a two-horse race at the top
The Highly Detailed Scenes category covers busy prompts with specific counts, attributes, and spatial relationships. Here Gemini sets the pace on the technical scales. On Detail Accuracy (how faithfully each model renders the specific counts, attributes, and positioning a prompt asks for), Gemini led at 4.00, ahead of Grok (3.77), Ideogram (3.71), and FLUX.2 (3.66). On Render Coherence (clean, plausible rendering, free of artifacts and warping), Gemini and Ideogram are effectively neck and neck at 4.18 and 4.15, with FLUX.2 (3.86) and Grok (3.81) a step behind. All four clear a 3.6 average on both scales, so complex scenes are broadly within reach for every model.
But the client-work gate reorders things. For detailed scenes, Ideogram edged ahead at 3.42, with Gemini close behind at 3.37, then FLUX.2 (3.01) and Grok (2.82). Gemini is marginally more accurate and coherent, yet reviewers were slightly more willing to ship Ideogram's detailed scenes. "Client-ready" is a judgment about the whole image. The top two are essentially tied and both sit in "usable with minor edits" territory; the bottom two hover around "maybe."
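Sorting the reported detailed-scene means by scale makes the reordering explicit. This sketch only rearranges numbers already given above; the `ranking` helper is illustrative.

```python
# Mean scores reported in this write-up for the Highly Detailed Scenes category.
detailed = {
    "Detail Accuracy": {"Gemini": 4.00, "Grok": 3.77, "Ideogram": 3.71, "FLUX.2": 3.66},
    "Render Coherence": {"Gemini": 4.18, "Ideogram": 4.15, "FLUX.2": 3.86, "Grok": 3.81},
    "Client work": {"Ideogram": 3.42, "Gemini": 3.37, "FLUX.2": 3.01, "Grok": 2.82},
}

def ranking(scores):
    """Model names ordered from highest to lowest mean."""
    return sorted(scores, key=scores.get, reverse=True)

rankings = {scale: ranking(s) for scale, s in detailed.items()}
for scale, order in rankings.items():
    print(f"{scale}: {' > '.join(order)}")
```

Gemini tops both technical scales, yet Ideogram comes first on the client-work question, and Grok drops from second on Detail Accuracy to last.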
Stylized work: a different leaderboard
On Style Execution for dreamy/stylized prompts, Gemini 3.1 Flash led at 3.81 overall, followed by Grok (3.62) and FLUX.2 (3.60), with Ideogram last at 3.36. A similar pattern holds for Mood and Evocativeness: Gemini at 3.77, Grok at 3.61, FLUX.2 at 3.45, Ideogram at 3.29.
Ideogram's strength is precision and type. When the brief is mood and aesthetic, the other models move ahead.
Stylized work is the most competitive category, and it plays to different strengths than Ideogram's. Where the brief is pure mood and atmosphere, Gemini edges ahead on the rating scales, but the gap is narrow, and Ideogram's images still drew praise for their aesthetics and emotional pull. Reviewers' main note was that Ideogram occasionally added its own creative flourishes, interpreting a prompt rather than executing it literally. For designers who want a model that nails type and precision, that's a reasonable trade. For open-ended mood-boarding, the field is wide open.
"This image definitely evokes a whimsical, aspirational feeling; however, it does not follow the prompt completely, with the house on the cloud in the background. One thing I like about the image is the leaves in the foreground of the image, blurred due to depth of field. It adds another whimsical and aesthetic element." Reviewer note · Ideogram v4
"Overall composition is very good. The style is executed really well. The only thing that's a bit overboard is the droplets around the product. Those should be more scattered and less organized. The product itself is well placed. It's seamless. The only issue would be the label. But other than that, every element is well integrated." Reviewer note · Ideogram v4
"Strong luxury fragrance-advertising aesthetic. The mood feels serene and aspirational. The floating pebble bottle is the clear focal point." Reviewer note · Ideogram v4
We coded all 400 open-ended comments on stylized prompts into recurring themes. The most common praise was for style execution and aesthetic fit (20% of comments), followed by composition and evocative mood. The most common critique was prompt mismatch and missing details (12.5%), with under-executed mood and composition issues close behind.
What separated the best prompts from the worst
That tension between literal instruction and interpretation showed up in the prompts themselves. When we looked back at which prompts produced the best and worst scoring images per category for Ideogram, one pattern cut across all three categories: the best-rated images consistently came from leaner prompts, while the worst-rated outputs tended to come from longer, more directive prompts. The "worst" profiles (red) spike on word count, specificity, and stacked directives like style, camera/lighting, and technical terms. The "best" profiles (green) sit tighter to the center.
What the best prompts emphasized instead varied by category. For typography, top prompts leaned on what actually matters for type: quoted phrases and brand or named copy that spell out exactly what the words should say, plus reference examples. Stylistic directives stayed out of the way. For dreamy/stylized work, the best prompts were shorter and left more room (higher ambiguity per 100 words, anchored by examples), while the worst were over-specified with detail the model then had to reconcile.
One caveat: this is a correlation, not a recipe. More demanding briefs are both longer and harder to satisfy, so prompt length and low scores may share a common cause. The practical advice is modest: state the words you need precisely, and resist loading every stylistic directive into one prompt.
The bottom line so far
Across the three categories, a clear division of labor emerged. Ideogram v4 is the model to reach for when text is part of the design: posters, packaging, ad creative, anything typographic. Most leading models spell correctly now. Ideogram creates type that designers would actually ship, leading both Typographic Craft and the client-work gate by a wide margin. It ranked first in nearly half of typography comparisons (47.9%, well ahead of every rival) and was the only model whose type-driven work landed in client-ready territory rather than needing rework. Its strength reaches past text, too: Ideogram was the only model to top the client-work gate in two of the three categories, and it held its own against Gemini on detailed scenes. Even in stylized work, where it trailed on the scales, its images still drew praise for their aesthetics and emotional pull.
Gemini 3.1 Flash is the steadiest all-rounder, the most accurate on busy, literal scenes and the top pick for pure mood and style work. Grok Imagine and FLUX.2 trail the leaders on the client-work gate in both typography and detailed scenes, hovering around "maybe" on detailed scenes and scoring lower on typography.
The "would you use this in real client work?" verdict: Ideogram leads typography (3.55) and edges ahead on detailed scenes (3.42). Gemini takes stylized (3.21). No single model wins everything. The right choice depends on whether the brief is built on words, accuracy, or atmosphere. And whatever you reach for, the prompts that won were the lean, precise ones: say exactly what you need and resist burying the model in directives.
Limitations
With 20 prompts × 10 reviewers, each category has roughly 200 observations. Treat the results as directional: the study is not powered for statistical significance.
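As a rough illustration of what "directional" means at this sample size, the sketch below computes a simple Wald 95% interval for a share such as Ideogram's 47.9% first-place rate. Two assumptions are ours, not the study's: that the share is measured over about 200 observations, and that those observations are independent. The same ten reviewers rated every prompt, so the true uncertainty is probably wider than this.

```python
import math

def wald_half_width(p, n, z=1.96):
    """Half-width of a normal-approximation (Wald) interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.479  # first-place share reported for Ideogram v4 on typography
n = 200    # assumed denominator: 20 prompts x 10 reviewers

half_width = wald_half_width(p, n)
print(f"{p:.1%} +/- {half_width:.1%}")  # about 6.9 percentage points either way
```

Even under these generous assumptions the interval spans roughly 41% to 55%, so small differences between models on any one scale should not be over-read.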

