I Tested the Same Prompt on GPT-4, Claude, and Gemini

By AppliedAI

No model scored zero. No model was perfect. And the “best” one kept changing depending on what I asked it to do.

That’s the honest summary of running the same prompt battery across GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro over the course of several weeks. Benchmarks have their place, but they measure performance on curated datasets designed by researchers. This is about practical use — the kinds of tasks most people actually run.

The Test Setup

Every prompt was structured identically: a clear role, a specific task, relevant context, and a defined output format. No vague one-liners, no underspecified requests. Using a consistent structure matters because sloppy prompts introduce noise that makes it impossible to tell whether a weak output is the model’s fault or yours.

For anyone who wants to apply this systematically, Prompt Scaffold enforces exactly this kind of structure — Role, Task, Context, Format — and lets you assemble the same prompt cleanly before testing across models.

The tasks covered five categories: long-form writing, code generation, summarization, reasoning under constraints, and extracting structured data from unstructured text.
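To make the structure concrete, here is a minimal sketch of how every prompt in the battery was assembled. The helper name `build_prompt` and the example field values are my own illustration, not part of any model’s API — the point is only that all four parts (Role, Task, Context, Format) are always present and always in the same order.

```python
def build_prompt(role: str, task: str, context: str, output_format: str) -> str:
    """Assemble the same four-part prompt structure for every model under test."""
    return "\n\n".join([
        f"Role: {role}",
        f"Task: {task}",
        f"Context: {context}",
        f"Output format: {output_format}",
    ])

# Example (values are illustrative):
prompt = build_prompt(
    role="You are a senior technical editor.",
    task="Summarize the attached document in roughly 200 words.",
    context="The audience is engineers evaluating the tool for their team.",
    output_format="Plain prose, no bullet points.",
)
```

Sending the identical `prompt` string to each model is what makes the outputs comparable: any difference in the result is attributable to the model, not the phrasing.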

Category 1: Long-Form Writing

GPT-4o produces well-structured prose with good paragraph flow and appropriate length calibration. Its main weakness is a slight tendency toward the generic. Ask it to write an opinion piece and it produces something technically correct but mild — as if it’s being careful not to offend.

Claude 3.5 Sonnet is the strongest writer of the three. It takes a clear stance, uses more varied sentence structure, and produces text that feels authored rather than assembled. It’s the only model that, unprompted, challenged a weak assumption in my brief.

Gemini 1.5 Pro writes competently but tends toward a listicle structure even when the task calls for prose. It also added caveats I didn’t ask for, which slowed the argument down.

Winner: Claude — not by a little.

Category 2: Code Generation

This is GPT-4o’s home territory. The code it produces is clean, idiomatic, and almost always runnable on the first attempt. It handles edge cases well and its inline comments are actually useful rather than redundant.

Claude is a close second. Its code is readable and it’s more likely to explain why it made a particular design decision — which is genuinely useful when you’re learning or reviewing code, not just copying it.

Gemini lagged noticeably here. On more complex tasks involving async operations and error handling, it produced code that looked right but failed on edge cases that both GPT-4o and Claude caught automatically.

Winner: GPT-4o, with Claude as a near-equal alternative.

Category 3: Summarization

All three models can summarize. The meaningful differences are in what they choose to keep.

GPT-4o tends toward comprehensive summaries — it errs on the side of including more. For dense technical documents, this is usually the right call.

Claude prioritizes the core argument. Its summaries are shorter and more opinionated. This is excellent for executive-style briefs but can drop nuance that matters.

Gemini performs well here too, particularly with long documents. Its summarization of a 40-page PDF was the most balanced of the three.

Winner: Context-dependent. Claude for brevity, Gemini for long-document fidelity, GPT-4o for completeness.

Category 4: Reasoning Under Constraints

This is where the gaps become most visible. The task: apply a set of rules — some contradictory, some layered — to a specific scenario and explain the reasoning.

Claude handled contradictions explicitly. When two rules couldn’t both be satisfied, it named the conflict, explained the trade-off, and made a reasoned call. That’s the behavior you want.

GPT-4o resolved conflicts silently — it picked an answer without flagging that there was a contradiction. The output looked confident, but the reasoning was opaque.

Gemini, on the most complex version of the test, applied rules in the wrong order. The reasoning trace showed it processed constraints sequentially rather than treating them as a system.

Winner: Claude — and it’s not close on complex reasoning tasks.

Category 5: Structured Data Extraction

The task: given a block of unstructured text (meeting notes, scraped content, mixed-format documents), extract specific fields and return them as JSON.

All three handle simple cases well. The differences emerge at the edges: ambiguous values, conflicting data, missing fields.

GPT-4o and Claude both handled ambiguity correctly by either flagging it or using a null value with a note. Gemini occasionally hallucinated plausible-sounding values for fields that weren’t in the source — a problem in any production context.
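The null-with-a-note behavior is also worth enforcing on your side of the pipeline, rather than trusting any model to do it. A sketch of that defensive step, assuming a hypothetical three-field schema (`attendee`, `date`, `decision` are my invention for illustration):

```python
import json

# Assumed schema for this example — replace with the fields your task defines.
EXPECTED_FIELDS = ["attendee", "date", "decision"]

def normalize_extraction(raw_json: str) -> dict:
    """Parse model output; any field absent from the source stays an explicit null
    instead of being silently dropped or invented downstream."""
    data = json.loads(raw_json)
    return {field: data.get(field) for field in EXPECTED_FIELDS}

# A model response that omits "decision" — the gap stays visible as None:
model_output = '{"attendee": "Dana", "date": "2024-03-01"}'
result = normalize_extraction(model_output)
# result == {'attendee': 'Dana', 'date': '2024-03-01', 'decision': None}
```

This doesn’t stop a model from hallucinating a value in the first place, but it guarantees that missing fields surface as nulls you can audit rather than disappearing.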

Winner: GPT-4o and Claude tied, Gemini a notable step behind.

What the Differences Actually Mean

For simple, well-defined tasks, model choice matters far less than most people think. Feed any of these three a clear, complete prompt and you’ll get a usable output.

The differences show up at the edges: ambiguous reasoning, subtle creative judgment, edge cases in code, and tasks where the model has to decide what to prioritize when the brief doesn’t tell it. That’s where model selection becomes a real decision rather than personal preference.

This is also why prompt quality remains the highest-leverage variable. A well-structured prompt consistently outperforms a weak one, regardless of which model it’s sent to. If you’re testing models and writing different prompts for each one, you’re not testing the models — you’re testing your prompts.

The Cost Dimension

Performance is only half the equation. Claude 3.5 Sonnet and GPT-4o are priced similarly at the API level, but Gemini 1.5 Pro has historically been cheaper per token for high-volume use. If you’re running thousands of requests and the task is one where all three perform adequately, the cost difference is real.
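The arithmetic behind that comparison is simple enough to sketch. The prices below are placeholders, not current rates — per-token pricing changes often, so plug in the providers’ published per-million-token figures before trusting any of the numbers:

```python
# (input, output) USD per 1M tokens — illustrative placeholders only.
PRICE_PER_MILLION = {
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gemini-1.5-pro": (1.25, 5.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Projected monthly spend for one model at the assumed rates."""
    price_in, price_out = PRICE_PER_MILLION[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# e.g. 50M input + 10M output tokens per month across each candidate:
for model in PRICE_PER_MILLION:
    print(model, round(monthly_cost(model, 50_000_000, 10_000_000), 2))
```

At high volume, even a small per-token gap compounds: the same traffic can differ by a meaningful monthly sum across models that all handle the task adequately.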

The LLM Cost Calculator is a fast way to run those numbers side-by-side before committing — enter your estimated token volume and see the monthly projection for each model.

Which Model to Actually Use

Use Claude as your default for writing, nuanced reasoning, and any task where the output will be read by humans who have high standards.

Use GPT-4o as your default for code, tool-calling, and structured output tasks that need to be correct on the first pass.

Use Gemini when you need to process very long documents, or when you’re operating at a scale where cost matters and the task quality delta is acceptable.

The right answer for most workflows isn’t to pick one. It’s to know when each model earns its keep — and to stop writing prompts that make it impossible to tell the difference.