Cyberpunk Alley in Three.js

We gave eight AI models one prompt and one shot: write a complete Three.js scene from scratch — a neon-lit cyberpunk alley at night in the rain — with a locked-down 10-second camera dolly so every render lines up frame for frame. No tools, no retries, no edits. Every scene below is the model’s own code running in real time, from a polished neon corridor to one that never compiles. Which one nails the brief? Your call.

The prompt

Create a complete single-file index.html using Three.js that renders a neon-lit cyberpunk alley at night in light rain — neon signs, wet reflective ground, rain particles, volumetric fog, and a fixed 10-second camera dolly down the alley so every model’s render is comparable.

Run this brief yourself →

The renders

Each scene is the model’s unedited code, recorded in a headless browser with software WebGL. Same prompt, same camera path, same 10 seconds — the only variable is the model that wrote it. Which one nails the brief? Your call.

Claude Sonnet 5

Anthropic’s new mid-tier model, launched 30 June 2026 — its most agentic Sonnet yet, near Opus-class capability at a fraction of the price. Added the day it shipped, under the identical brief and one-shot rule. Ran with adaptive thinking at high effort.

View generation & live render →

Claude Fable 5

Anthropic’s flagship, launched June 2026 as the successor to the Claude 4 line. Built for long-horizon agentic work, with creative coding as a headline strength. Ran with adaptive thinking at high effort.

View generation & live render →

Claude Opus 4.8

The top tier of Anthropic’s Claude 4 family, and the company’s flagship coding model until Fable 5 arrived. The Opus line is the one most big AI coding tools default to. Ran with adaptive thinking at high effort.

View generation & live render →

GPT-5.5

OpenAI’s newest mainline flagship, the successor to GPT-5. Ran at medium reasoning effort, and took the most reasoning of any OpenAI model here.

GPT-5

The previous mainline GPT flagship, successor to GPT-4, kept as the prior-generation reference. Ran at medium reasoning effort.

Gemini 3.1 Pro

Google’s flagship Gemini, the long-context generalist of the lineup. Re-run with an explicit 32,768-token thinking budget — raised from an earlier 8,192-token cap so its reasoning headroom matches the rest of the field. On this brief it stayed lean regardless, writing the scene after only a few thousand tokens.

GPT-5 mini

OpenAI’s compact reasoning model, the small member of the GPT-5 family, at medium effort. Its code crashes on load: a resize handler reads the camera object before that object is declared — a JavaScript temporal-dead-zone error — so the scene never renders. The black frame is its actual, unedited output.

GLM 5.2

Z.ai’s open-weights flagship (Zhipu) — strong long-horizon coding at a fraction of frontier cost, and the only open-weights model on the page. Run direct on the z.ai API with thinking enabled. It is the one stated exception to the shared 64,000-token ceiling: at that budget it spends every token reasoning and returns no code at all, so it is shown here on the larger ~120k budget where it does finish — the single lane running on more room than the rest, called out rather than hidden. The longest run on the page, at twenty minutes.

Side by side

The headline models on one screen — same prompt, same camera, same clock.

Share this showdown

Split-screen comparison assets, ready to post. Credit appreciated, not required.

⬇ Side-by-side comparison (2×2, MP4)⬇ Comparison grid (PNG)

How we ran it

Every model received the identical prompt in a single turn: a complete single-file Three.js scene with no external assets, plus an exact camera path (eye-level start, constant-speed 10-second dolly down the alley centerline with a ±4° sinusoidal sway) so the renders line up frame for frame. No tools, no retries, no human edits — the first complete HTML file returned is what you see, recorded in headless Chromium under software WebGL (SwiftShader) on a virtualized clock for constant frame pacing. The fair comparison rests on one shared control: every model runs under the same 64,000-token total-output ceiling. Reasoning bills as output on every provider, so that single number bounds cost and thinking-room identically — and it is why each model’s billed tokens exceed the size of the code it returned. We deliberately do NOT force one "effort" label across vendors, because the tiers are not calibrated to each other: OpenAI’s GPT-5 family runs at medium reasoning effort here, which on this brief already spends more reasoning than the Claude models do at high (GPT-5.5 spent 53k tokens against Sonnet 5’s 31k). Matching the label would equalise nothing — and when we tested forcing high effort, it only pushed the OpenAI models to write more ambitious code that failed to compile. So we match the envelope, not the label, and disclose each model’s setting: the three Claude entries ran with adaptive thinking at high effort; the three GPT-5 entries at medium reasoning effort; Gemini 3.1 Pro with an explicit 32,768-token thinking budget (raised from an earlier 8,192 after a reader rightly flagged the smaller cap as unfair). One model is a stated exception: GLM 5.2 cannot finish inside the 64k ceiling — at high effort it spends the whole budget reasoning and emits no code — so it is shown on the larger ~120k budget where it does complete, the single lane on more room than the rest, called out here rather than hidden. Two honest failures stayed in unedited: GPT-5 mini returned code that throws on load and renders a black frame; that black frame is its real output. GPT-5.5 Pro is excluded entirely because its reasoning spend cannot be capped.

New showdown every week

Same format, new brief, latest models — Fable, GPT, Gemini and whatever ships next. Get each one the morning it goes live.

Run your own showdowns

Think you can write a better brief? Run this showdown’s prompt yourself — pick a model, tweak the brief, and get your own live render.

Run this brief yourself →Browse benchmarks Compare models