CREATIVE ARENA
Methodology
In the Creative Arena, AI models are evaluated on real professional use cases, driven by actual client deliverables from Contra's marketplace and voted on by our global network of 1.5M+ creative professionals.
Overview
Real work.
Real professionals.
The Creative Arena by Contra Labs compares AI models on tasks that mirror real professional use cases, drawn from paid projects commissioned on Contra. We convert anonymized deliverables into prompts, run controlled tournaments with four models at a time, and update overall and per-category Elo ratings after every battle.
Unlike synthetic benchmarks, every prompt originates from an actual client project. And unlike crowd-sourced preference tests, every vote comes from a verified creative professional: designers, developers, and video editors who do this work for a living.
Categories
Professional use cases
We evaluate models across the categories that matter to working creatives. These are the actual deliverables clients commission on Contra, organized by modality.
Image
Code
Video
Evaluation Depth
Three phases of creative work
Within each category, we evaluate models across the phases of the creative process, from first spark to final polish.
Phase 01
Ideation
Generating the initial creative concept from a prompt. Models produce directional output that captures tone, mood, and creative intent.
Goal: Direction, not precision
Phase 02
Mockup
Translating that concept into a structured, composed layout. Models must execute against a clear creative brief with proper hierarchy and composition.
Goal: Execution against a brief
Phase 03
Refinement
Fine-tuning a near-final output with precise edits. Models must demonstrate control, consistency, and attention to production-level detail.
Goal: Polish & production readiness
Data Sourcing & Prompt Generation
From real projects to controlled prompts
Collect Deliverables
We sample deliverables from real, completed paid projects commissioned on Contra's marketplace.
Anonymize & Sanitize
We remove personally identifiable information, trademarks, and client-specific terms that would reveal identity or confidential details.
Category Classification
Each deliverable is run through an LLM-assisted classifier that maps it to one of the Arena categories and creative phases.
Prompt Drafting
From the anonymized deliverable, we generate a prompt that captures the intent, constraints, and style of the original request while remaining generic and safe.
Generation
Each active model generates an output for the prompt: images, code, or video, depending on the category.
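A minimal sketch of this pipeline in Python. The helper names (anonymize, classify, draft_prompt) and their placeholder logic are illustrative assumptions, not Contra's actual tooling:

```python
from dataclasses import dataclass
import re

@dataclass
class ArenaPrompt:
    category: str  # "image", "code", or "video"
    phase: str     # "ideation", "mockup", or "refinement"
    text: str      # anonymized, generic prompt text

def anonymize(brief: str) -> str:
    # Placeholder: real sanitization removes PII, trademarks, and client-specific terms.
    return re.sub(r"\b(Acme Corp|acme\.com)\b", "[client]", brief)

def classify(brief: str) -> tuple[str, str]:
    # Placeholder: the real system uses an LLM-assisted classifier.
    category = "image" if "logo" in brief.lower() else "code"
    return category, "ideation"

def draft_prompt(brief: str, category: str, phase: str) -> str:
    # Placeholder: real drafting preserves intent, constraints, and style.
    return f"[{category}/{phase}] {brief}"

def build_prompt(deliverable_brief: str) -> ArenaPrompt:
    clean = anonymize(deliverable_brief)
    category, phase = classify(clean)
    return ArenaPrompt(category, phase, draft_prompt(clean, category, phase))

print(build_prompt("Logo refresh for Acme Corp, warm palette, geometric mark"))
```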
Tournament Format
4 models. 6 battles. Full ranking.
Each tournament samples four models from the active pool, runs a fixed six-battle bracket, and yields a complete 1st–4th ordering per prompt.
Bracket stages: Initial, Middle, Final.
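The exact six-battle bracket isn't spelled out here; one reading that matches the numbers is a full round-robin, since four models yield exactly six unique pairings. A minimal sketch under that assumption (the tie-breaking rule is also an assumption, not Contra's documented rule):

```python
from itertools import combinations
import random

def run_tournament(models: list[str], judge) -> list[str]:
    """Six battles over four models, returning a 1st-4th ordering.

    `judge(a, b)` stands in for a human vote and returns the winner.
    Ties on win count are broken at random here, purely for illustration.
    """
    assert len(models) == 4
    wins = {m: 0 for m in models}
    for a, b in combinations(models, 2):  # C(4, 2) = 6 battles
        wins[judge(a, b)] += 1
    return sorted(models, key=lambda m: (wins[m], random.random()), reverse=True)

# Example with a toy judge that always prefers the alphabetically earlier model.
ranking = run_tournament(["m1", "m2", "m3", "m4"], judge=lambda a, b: min(a, b))
print(ranking)  # ['m1', 'm2', 'm3', 'm4']
```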
Fairness & Bias Controls
Blind, balanced, audited
Left/Right Randomization
Each battle randomizes side assignment so no model benefits from position bias.
Blind Judging
No model names, vendors, prompts, or metadata are shown to judges. Only the outputs.
Prompt Hygiene
Prompts are anonymized, policy-compliant, and category-consistent before entering the system.
Balanced Exposure
The scheduler ensures broad coverage across models and pairings over time.
Audit Sampling
A subset of matches is reviewed by humans for quality control and consistency checks.
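A minimal sketch of two of these controls, randomized side assignment and exposure-balanced sampling. The counter-based scheduler below is an illustrative assumption, not a description of the production scheduler:

```python
import random
from collections import Counter
from itertools import combinations

exposure = Counter()  # how many battles each model has appeared in

def pick_four(active_pool: list[str]) -> list[str]:
    # Favor the least-exposed models so coverage stays balanced over time.
    return sorted(active_pool, key=lambda m: (exposure[m], random.random()))[:4]

def assign_sides(a: str, b: str) -> tuple[str, str]:
    # Randomize left/right so neither model benefits from position bias.
    return (a, b) if random.random() < 0.5 else (b, a)

pool = ["m1", "m2", "m3", "m4", "m5", "m6"]
sampled = pick_four(pool)
for left, right in (assign_sides(a, b) for a, b in combinations(sampled, 2)):
    exposure[left] += 1
    exposure[right] += 1
    # Judges see only the two anonymous outputs, never names or metadata.
```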
Ratings
Elo scoring
We maintain two kinds of Elo rating for each model: an overall Elo and a separate Elo for each category it competes in. All models start at 1500. After every battle, we apply a standard Elo update.
Starting Elo rating: 1500
Update factor per battle: K = 32
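For reference, a minimal sketch of a standard Elo update with these parameters, assuming the conventional logistic expected-score formula since no variant is specified here:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One battle's rating change: R' = R + K * (S - E), with S = 1 for the winner."""
    expected_winner = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_winner)
    return r_winner + delta, r_loser - delta

# Two fresh models start at 1500; the winner gains exactly K/2 = 16 points.
print(elo_update(1500, 1500))  # (1516.0, 1484.0)
```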
Frequently asked questions
Where do prompts and votes come from?
Every prompt starts from an anonymized deliverable of a real, paid client project on Contra, not synthetic tasks or toy examples. And every vote comes from a verified creative professional from our network of 1.5M+ members across 150+ countries.
How are models chosen for each battle?
Four distinct models are sampled from the active pool. Side assignment (left/right) is randomized every battle to eliminate position bias.
Do ratings update after every battle?
Yes. Elo ratings update after each individual battle. We maintain both overall and per-category ratings so you can see how models perform on specific types of work.
What categories and phases are covered?
We currently evaluate across three modalities: Image (ad design, brand assets, logos), Code (landing pages, desktop apps, UI components), and Video (ad design, product shots). Each is tested across three creative phases: ideation, mockup, and refinement.


