BachBench: Five Models Play Bach

We gave five frontier models the same MusicXML score — Bach’s Cello Suite No. 1 Prelude — and one deceptively hard task: read it and turn it into something you’d want to watch and hear. Same file, same single shot, each vendor’s current flagship. The results run from a blooming mandala to a 32-second speed-run.

The prompt

You are given a MusicXML file of Bach’s Cello Suite No. 1 Prelude (BWV 1007). Build a single self-contained HTML page that performs it both audibly (Web Audio) and visually (canvas), as creatively as you can — no libraries, one shot. Avoid the notes-falling-on-a-piano-roll cliché.

The performances

Every model got the same MusicXML file and the same instruction, quoted above. 🔊 Unmute any panel to hear it. “Decode fidelity” is the share of note onsets a page actually played whose pitch appears in the score, measured by intercepting each page’s Web Audio scheduling.

Claude Sonnet 5

“a spiral of sound” — Claude Sonnet 5

The richest render of the five: a pitch-coloured spiral that blooms into a full mandala as the music builds, with every note doubled by an octave voice. It also reasoned at length, spending ~45k output tokens, most of them before a line of code was written. It scored 96.7% decode fidelity, and it titled its own piece.
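For readers wondering what "pitch-coloured" and "octave voice" could mean in practice, here is a minimal sketch. It is not the model's code: the hue mapping, the choice of an octave above rather than below, the A4 = 440 Hz tuning, and the helper names `midi_to_hz` and `voices_for` are all assumptions made for illustration.

```python
def midi_to_hz(midi_note, a4_hz=440.0):
    """Equal-temperament frequency for a MIDI note number (A4 = 69, assumed 440 Hz)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


def voices_for(midi_note):
    """Return (main_hz, octave_hz, hue_degrees) for one score note.

    Hypothetical mapping: the second voice sits one octave above the main
    voice, and the colour depends only on the pitch class.
    """
    main_hz = midi_to_hz(midi_note)
    octave_hz = main_hz * 2.0          # one octave up doubles the frequency
    hue_degrees = (midi_note % 12) * 30  # 12 pitch classes spread over 360 degrees
    return main_hz, octave_hz, hue_degrees


a3 = voices_for(57)  # MIDI 57 is A3
print(a3)
```

With this mapping, notes an octave apart share a colour, which is one way a spiral could keep the same hue at the same angle on every turn.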

View generation & live render →

GPT-5.5

“luminous score” — GPT-5.5

Elegant restraint: glowing orbs strung on threads of light, a constellation that swings with register. Its decode fidelity is 99.4%, the page is a lean 13 KB file, and it was the quickest of the heavy reasoners to finish.

View generation & live render →

Gemini 3.1 Pro

Warm golden rings and orbiting nodes. Under the hood it is a maximalist — it spawns tens of thousands of short oscillator grains for a dense, shimmering wash of sound (the busiest, loudest mix here). 92.9% decode fidelity.

View generation & live render →

GLM-5.2

Zhipu’s frontier model and the deepest thinker in the field. It reasoned for ~20 minutes and ~83k tokens before writing a line of code, so heavily that it needed a bigger output-token budget than the rest just to finish. The payoff is elegant: a luminous pitch contour that traces the melody, led by a rippling comet of light. 98.2% decode fidelity.

View generation & live render →

DeepSeek V4 Pro

The cautionary tale. Every pitch is correct, a perfect 100% decode, but it misreads the rhythm and plays the whole prelude roughly four times too fast, so the entire piece is over in 32 seconds. Proof that decode fidelity is not the same as musicality.
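To make the rhythm problem concrete, here is a minimal sketch of how MusicXML timing works and how a fourfold speed-up could arise. In MusicXML, a note's `<duration>` is counted in `<divisions>` of a quarter note, so the length in seconds depends on both values plus the tempo. We do not know which mistake DeepSeek's page actually made. The `buggy` branch below is one hypothetical error, and the fragment, the tempo of 60 bpm, and the function names are ours, not taken from the benchmark file.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment (not the benchmark file): four sixteenth notes,
# with 4 divisions per quarter note.
FRAGMENT = """
<measure number="1">
  <attributes><divisions>4</divisions></attributes>
  <note><pitch><step>G</step><octave>2</octave></pitch><duration>1</duration></note>
  <note><pitch><step>D</step><octave>3</octave></pitch><duration>1</duration></note>
  <note><pitch><step>B</step><octave>3</octave></pitch><duration>1</duration></note>
  <note><pitch><step>A</step><octave>3</octave></pitch><duration>1</duration></note>
</measure>
""".strip()


def note_seconds(duration, divisions, bpm):
    """Seconds for one note: duration / divisions is its length in quarter notes."""
    return (duration / divisions) * (60.0 / bpm)


def total_seconds(xml_text, bpm, buggy=False):
    """Sum the note lengths in a single <measure> element."""
    measure = ET.fromstring(xml_text)
    divisions = int(measure.findtext("attributes/divisions"))
    total = 0.0
    for note in measure.findall("note"):
        duration = int(note.findtext("duration"))
        seconds = note_seconds(duration, divisions, bpm)
        if buggy:
            # Hypothetical mistake: dividing by <divisions> a second time.
            seconds /= divisions
        total += seconds
    return total


correct = total_seconds(FRAGMENT, bpm=60)
wrong = total_seconds(FRAGMENT, bpm=60, buggy=True)
print(correct, wrong, correct / wrong)
```

In this sketch the buggy reading finishes the same four notes in a quarter of the time while playing exactly the same pitches, which is the kind of error a pitch-only metric cannot see.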

View generation & live render →

How we ran it

Five current flagship models, each given the identical MusicXML for Bach’s Cello Suite No. 1 Prelude (BWV 1007) and the identical instruction, in a single shot: no follow-up, no tools, default reasoning effort, and a generous output-token cap (64k, raised for GLM-5.2, which reasons so heavily it needed more room to finish). Each returned one self-contained HTML file; we ran it unmodified, recorded 145 seconds of its canvas and Web Audio output, and normalised loudness across entries. DeepSeek’s clip is 32 seconds because it played the piece roughly four times too fast.

“Decode fidelity” is the share of note onsets whose synthesised pitch falls in the score’s pitch set, measured by intercepting each page’s Web Audio scheduling. The metric checks pitch only, so it says nothing about rhythm, tempo, or note order. Every panel links to a runnable version you can open and play yourself. Routes: Anthropic and OpenAI direct, Z.ai for GLM, a unified gateway for the rest. Cost is an estimate from billed tokens at each model’s list price.
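The decode-fidelity definition can be sketched in a few lines. This is an illustration under stated assumptions, not our measurement harness: it assumes the intercepted onsets are available as a list of oscillator frequencies in Hz, that pitches are compared in equal temperament with A4 = 440 Hz, and that a note counts as a match when it lands within 50 cents of a pitch in the score. The function names and the tolerance are hypothetical.

```python
import math

A4_HZ = 440.0   # assumed tuning reference
A4_MIDI = 69


def midi_to_hz(midi_note):
    """Equal-temperament frequency for a MIDI note number."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def hz_to_midi(freq_hz):
    """Fractional MIDI note number for a frequency in Hz."""
    return A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)


def decode_fidelity(onset_freqs_hz, score_midi_pitches, tolerance_cents=50.0):
    """Share of played onsets whose pitch falls in the score's pitch set.

    Only pitch is checked. Onset times, durations, and note order are
    ignored, so a performance at the wrong tempo can still score 1.0.
    """
    if not onset_freqs_hz:
        return 0.0
    pitch_set = set(score_midi_pitches)
    hits = 0
    for freq in onset_freqs_hz:
        midi = hz_to_midi(freq)
        nearest = round(midi)
        off_by_cents = abs(midi - nearest) * 100.0
        if nearest in pitch_set and off_by_cents <= tolerance_cents:
            hits += 1
    return hits / len(onset_freqs_hz)


# Toy data: the score uses G2, D3, B3, A3; the page played three of those
# pitches plus one C3 that is not in the score.
score = [43, 50, 59, 57]
played = [midi_to_hz(m) for m in (43, 50, 59, 48)]
fidelity = decode_fidelity(played, score)
print(fidelity)
```

Because the function never looks at timing, feeding it the same pitches at any speed gives the same score, which is why an entry can reach 100% while rushing through the piece.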

New showdown every week

Same format, new brief, latest models — Fable, GPT, Gemini and whatever ships next. Get each one the morning it goes live.

Run your own showdowns

PromptFrenzy benchmarks the big AI models on real prompts — images, styles, and now code. Browse the full library or compare models head to head.
