The Great AI Calculator Experiment

December 20, 2025 · 2 min read · Raymond

Benchmark · llm · AI Coding Assistant · experimentation

I decided to run a fun little experiment to see which model could code the best calculator in one shot.

I took seven cutting-edge LLMs (ChatGPT 5.2, Google Gemini 3 Pro, Claude 4.5 Sonnet, Grok 4.1, GLM 4.6, Kimi K2, and Qwen3 Max) and gave them all the exact same prompt:

"Build a fully functional, high-quality calculator with an impressive UI using HTML, JavaScript, and Tailwind CSS in a single HTML file. Respond with the complete working code from top to bottom."

To keep the playing field level, I imposed strict limitations: I disabled all "thinking modes" (chain-of-thought reasoning) and turned off web search. This forced the models to rely entirely on their training data and immediate generation capabilities.

There were no retries, no tweaks, and no intermediaries like Perplexity or OpenRouter. Each model got a single shot to prove it could produce polished, working code on demand.

Below, you’ll find the unedited results from each model, exactly as they were generated:

The Verdict?

Scan through the results, maybe copy-paste a few into your browser to see how they feel, and let me know what you think.
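
If you'd rather not create files by hand for each result, here is a minimal sketch of a helper that writes one model's HTML to a temporary file and opens it in your default browser. The function name `preview_html` and the default file name are my own placeholders, not part of the experiment, and it assumes you paste the model's full response in as a string.

```python
import tempfile
import webbrowser
from pathlib import Path


def preview_html(html: str, name: str = "calculator.html", open_browser: bool = True) -> Path:
    """Write single-file HTML to a fresh temp directory and optionally open it.

    Hypothetical helper for trying out the results; not used in the experiment itself.
    """
    # mkdtemp returns an absolute path, which as_uri() below requires.
    out_dir = Path(tempfile.mkdtemp(prefix="calc-"))
    path = out_dir / name
    path.write_text(html, encoding="utf-8")
    if open_browser:
        # Open the local file in the system's default browser.
        webbrowser.open(path.as_uri())
    return path
```

Call it as `preview_html(pasted_html)`, where `pasted_html` holds one model's complete response. If a result loads Tailwind from a CDN, which is a common choice for single-file pages, it will likely need an internet connection to render with its styling.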