
How to Evaluate the Quality of a Prompt

By AppliedAI

Most people evaluate prompts by running them and seeing what comes back. That is an evaluation method — but it is reactive, slow, and expensive when you are iterating at scale.

There is a faster and more consistent approach: evaluate the prompt before you run it, using a structured rubric. This article defines that rubric. Six dimensions, each scored 1–3. A total score guides your decision on whether to run, revise, or redesign.

This is not theoretical. These dimensions map directly to the failure modes that produce bad outputs — each one is something you can assess by reading a prompt, without touching a model.

Why Most Prompt Reviews Fail

The typical approach is to write a prompt, run it, read the output, and decide if it was “good.” The problem is that this conflates two separate questions: did the prompt work? and was the prompt well-constructed?

A poorly constructed prompt can produce a good output by luck — particularly if the task is simple or the model is guessing in the right direction. And a well-constructed prompt can produce a mediocre output if the model version you are using has known weaknesses on that task type.

Evaluating outputs tells you what happened. Evaluating prompts tells you why — and gives you a way to fix it systematically rather than by trial and error.

The rubric below is designed for pre-run evaluation. You apply it to the prompt text itself. No outputs required.

The Six Dimensions

1. Specificity of the Task

What it measures: Whether the task instruction is an action (specific) or a topic (vague).

A task description that could be rephrased as a noun phrase is a topic, not a task. “Marketing strategy” is a topic. “Write a 90-day content marketing plan for a B2B SaaS company targeting mid-market HR teams” is a task. The difference is: a verb, a scope, and a product.

Score 1: The task is a topic or a vague verb (“help me with,” “discuss,” “talk about”). No scope, no product.
Score 2: A clear action verb is present, but scope or output type is ambiguous. A capable person could start, but would have to make significant assumptions.
Score 3: The task specifies an action, a scope, and an expected product. Someone could execute this without clarifying questions.

2. Presence and Quality of Role

What it measures: Whether the model has been given a professional context that constrains its reasoning style and vocabulary.

Without a defined role, the model samples across every context in which the topic has appeared in its training data — technical writers, Reddit commenters, academic papers, marketing copy. The role collapses that distribution.

A role that just names a title (“You are a lawyer”) is better than nothing, but a role that adds a domain, an experience signal, and a behavioral note (“You are a senior employment attorney who writes in plain language for non-legal audiences”) constrains meaningfully.

Score 1: No role defined.
Score 2: Role names a generic title but includes no domain specificity, experience level, or behavioral signal.
Score 3: Role includes at minimum a title, a relevant domain, and either an experience signal or a communication style cue.

3. Context Sufficiency

What it measures: Whether the model has the background information it needs to operate on your actual situation, not a generic version of it.

This is the dimension that separates prompts that produce specific output from prompts that produce plausible-sounding output. Context is the raw material. When it is absent, the model invents a plausible situation — and writes for that instead of yours.

The diagnostic test: could a capable human freelancer, given only this prompt, do the task competently without asking a single clarifying question? If not, context is insufficient.

Score 1: No context provided. The model must invent the situation entirely.
Score 2: Partial context — some background is provided, but the audience, constraints, or downstream purpose is missing.
Score 3: Context covers the situation, the audience (if relevant), and the purpose the output will serve. A freelancer could start immediately.

4. Format Specification

What it measures: Whether the expected output shape is explicitly defined — length, structure, and any formatting rules.

The model has no default format preference. It generates what is statistically most common for the content type. For an analytical question, that might be long-form prose with headers. For a creative question, it might be open-ended narrative. These defaults are often wrong for your specific use context.

Specifying format turns “a reasonable output” into a usable one. This dimension is particularly important when the output feeds into another system, another person, or another prompt.

Score 1: No format specified. Length, structure, and formatting are entirely at the model’s discretion.
Score 2: Some format guidance — for example, a word count or general type (“a bullet list”) — but no structural detail or exclusions.
Score 3: Format specifies length, structure type, and at least one exclusion rule or content constraint that prevents a common default failure mode.

5. Constraint Clarity

What it measures: Whether explicit rules have been defined about what the output must or must not do.

Constraints and format specifications are distinct. Format describes shape; constraints describe rules. “Maximum 200 words” is format. “Do not use passive voice, do not reference competitor names, avoid claims that require a citation” are constraints.

Negative constraints — things the output must not do — are particularly high-leverage. They eliminate specific failure modes before they appear, rather than fixing them in follow-up prompts.

Score 1: No explicit constraints. The model will apply its own judgment on everything.
Score 2: Some constraints present, but stated vaguely (“keep it professional,” “be concise”) — not binary, not testable.
Score 3: Constraints are specific and binary — each one either holds or it doesn’t. At least one negative constraint is present.

6. Verifiability of the Output Standard

What it measures: Whether, once the output arrives, you could evaluate it against the prompt — or whether “good” is purely subjective.

This is the dimension most prompt engineers neglect. If your prompt does not define a measurable or observable standard, you cannot tell whether a borderline output is acceptable. You are just deciding based on feel. That is fine for one-off tasks; it is a problem for anything repeatable.

Verifiability does not require a numeric metric. It requires that the prompt creates a basis for comparison: the desired tone is characterized, the length is bounded, the required sections are named, the one concrete example in the prompt shows the standard you expect.

Score 1: No output standard defined. Evaluation is entirely subjective.
Score 2: Some implicit standard exists — enough that a thoughtful reader could agree or disagree with an output — but it is not stated in the prompt.
Score 3: The prompt contains explicit criteria against which the output can be evaluated objectively (length bounds, required elements, a few-shot example, or a named quality bar).
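An explicit output standard is what makes an output checkable at all, by a person or by code. As a minimal sketch (the criteria, the draft text, and the function name below are illustrative, not from any real tool), here is what verifying an output against binary criteria from a prompt might look like: a length bound, a required element, and a negative constraint.

```python
# Minimal sketch: checking an output against explicit, binary criteria
# taken from a prompt. All names and the draft text are illustrative.

def check_output(text: str,
                 min_words: int,
                 max_words: int,
                 required_phrases: list[str],
                 forbidden_substrings: list[str]) -> dict[str, bool]:
    """Return a pass/fail result for each binary criterion."""
    word_count = len(text.split())
    lowered = text.lower()
    return {
        "length_in_bounds": min_words <= word_count <= max_words,
        "required_present": all(p.lower() in lowered for p in required_phrases),
        "forbidden_absent": not any(s.lower() in lowered for s in forbidden_substrings),
    }

draft = "Cut contract review time by 70%. Start your 14-day free trial today."
result = check_output(draft,
                      min_words=5, max_words=200,
                      required_phrases=["free trial"],   # the required CTA
                      forbidden_substrings=["#"])        # "no hashtags"
```

Each criterion either holds or it does not, which is exactly what a Score 3 prompt makes possible. A Score 1 prompt gives this function nothing to check.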

How to Use the Rubric

Add up your scores across the six dimensions. Maximum is 18.

Total score and interpretation:

6–9: High risk. The prompt is underspecified. Running it will produce generic output; iteration will be slow. Revise before running.
10–13: Acceptable for low-stakes output. Gaps exist but the core is functional. Worth running with attention to which dimensions scored lowest.
14–16: Solid prompt. Running it should produce usable output. Minor gaps are unlikely to cause failure.
17–18: Well-constructed. This is ready to run. At this level, output failure is more likely to be a model issue than a prompt issue.

Use the individual dimension scores diagnostically, not just the total. A prompt that totals 16 with five dimensions at 3 and one at 1 still has a structural gap that could fail the entire task.
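The scoring arithmetic can be sketched in a few lines. This is a hypothetical helper, not an existing library; the bands mirror the interpretation table above, and a diagnostic check flags any dimension scored 1 even when the total looks fine.

```python
# Sketch of the rubric as code: six dimensions, each scored 1-3,
# summed and mapped to interpretation bands. Function and dimension
# names are my own, not from any published tool.

DIMENSIONS = ("specificity", "role", "context", "format", "constraints", "verifiability")

def interpret(scores: dict[str, int]) -> str:
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    if any(not 1 <= s <= 3 for s in scores.values()):
        raise ValueError("each dimension is scored 1-3")
    total = sum(scores.values())
    if total <= 9:
        band = "high risk: revise before running"
    elif total <= 13:
        band = "acceptable for low-stakes output"
    elif total <= 16:
        band = "solid: should produce usable output"
    else:
        band = "well-constructed: ready to run"
    # Diagnostic use: flag structural gaps even when the total looks fine.
    weak = [d for d, s in scores.items() if s == 1]
    if weak:
        band += f" (structural gap in: {', '.join(weak)})"
    return f"{total}/18, {band}"

# The rewritten LinkedIn prompt from the worked example below:
print(interpret({"specificity": 3, "role": 3, "context": 3,
                 "format": 3, "constraints": 2, "verifiability": 2}))
```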

Applying the Rubric: A Worked Example

Here is a prompt in the wild, scored against the rubric:

“Write me a LinkedIn post about our new product launch.”

  • Specificity of Task: 1. “Write a LinkedIn post” is almost a task, but no scope, no length, no angle, no CTA.
  • Role: 1. No role defined.
  • Context Sufficiency: 1. Nothing about the product, the audience, the brand voice, or what makes the launch notable.
  • Format Specification: 1. LinkedIn posts can be 3 lines or 30. Not specified.
  • Constraint Clarity: 1. No constraints.
  • Verifiability: 1. No standard. You will know it when you see it — but you will not.

Total: 6/18. This prompt will produce a generic, competently worded LinkedIn post that has nothing to do with your actual product, audience, or launch context. You will spend more time rewriting the output than writing a better prompt would have taken.

Now the same underlying request, rewritten:

You are a senior B2B marketing manager with experience in enterprise SaaS. Write a LinkedIn post announcing the launch of our AI-powered contract review tool for in-house legal teams.

Context: The launch is today. The tool reduces contract review time by 70% with no legal expertise required. Our audience is General Counsel and their direct reports at companies 200–2000 employees. We have a 14-day free trial at the link.

Format: 150–200 words. Lead with the outcome (time saved), not the feature description. Close with a clear trial CTA. No hashtags. Use short paragraphs, not bullets.

  • Specificity of Task: 3
  • Role: 3
  • Context Sufficiency: 3
  • Format Specification: 3
  • Constraint Clarity: 2 (constraints are present, including a negative one in “No hashtags,” but rules like “lead with the outcome” still leave room for judgment)
  • Verifiability: 2 (outcome-led and CTA requirements are stated and the 70% stat gives a concrete hook to evaluate against, but there is no example or named quality bar)

Total: 16/18. You can run this. The output will be usable. The two 2-scores are refinements, not blockers.

When to Run the Rubric Formally vs. Informally

For one-off, low-stakes prompts, you do not need to score all six dimensions explicitly. Running through them mentally — “does this have a role, do I have enough context, have I said what format I need?” — adds maybe 30 seconds and catches 80% of common gaps.

For prompts that will be reused, embedded in a workflow, or used to generate content at volume, score formally. The discipline of assigning a number catches ambiguities that a quick mental scan misses.

If you are building and iterating on prompts systematically, the Prompt Scaffold tool gives you dedicated input fields for Role, Task, Context, Format, and Constraints, with a live assembled preview of the full prompt. It does not do the scoring, but the structure enforces that you have addressed each dimension — which is most of what the rubric is checking.
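The structured-fields idea can be sketched roughly as follows. This is not the Prompt Scaffold tool itself, just a hypothetical illustration of the principle: one field per component, so an omitted dimension shows up as an empty field instead of a silent gap.

```python
# Hypothetical sketch of the structured-fields idea (not the actual
# Prompt Scaffold tool): one field per prompt component, assembled in
# a fixed order, with empty fields reported rather than silently dropped.

from dataclasses import dataclass, fields

@dataclass
class PromptFields:
    role: str = ""
    task: str = ""
    context: str = ""
    format: str = ""
    constraints: str = ""

    def missing(self) -> list[str]:
        """Name every component left empty -- the rubric's likely 1-scores."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def assemble(self) -> str:
        """Join the filled-in fields into one labeled prompt."""
        parts = [f"{f.name.capitalize()}: {getattr(self, f.name)}"
                 for f in fields(self) if getattr(self, f.name).strip()]
        return "\n\n".join(parts)

p = PromptFields(role="Senior B2B marketing manager",
                 task="Write a LinkedIn launch post",
                 format="150-200 words, no hashtags")
# p.missing() -> ['context', 'constraints']: two dimensions still unaddressed.
```

Making omissions visible by construction is the point: the scaffold does not score the prompt, but an empty field is a dimension you have not addressed.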

The Relationship Between This Rubric and Prompt Frameworks

This rubric is framework-agnostic. It does not care whether you use RTGO, the six-component structure from The Anatomy of a Perfect Prompt, or your own personal system. The six dimensions map to what any complete prompt needs, regardless of the framework used to build it.

That said, if you find you are consistently scoring 1 on the same dimensions — Role every time, or Context every time — that is a signal that your default prompting habit is missing that element structurally. The fix is not to remember to add it each time; it is to change how you build prompts at the start. A structured framework like RTGO is useful precisely because it makes those omissions impossible by construction.

What the Rubric Does Not Catch

The rubric evaluates prompt construction. It does not evaluate:

  • Model fit. Some prompts are well-constructed but designed for the wrong model. A prompt that requires sustained reasoning over a very long document will perform differently on GPT-4o vs. Gemini 1.5 Pro, regardless of prompt quality.
  • Few-shot example quality. The rubric checks whether examples exist (Verifiability) but not whether they are representative, consistent, or correctly formatted for few-shot learning.
  • System prompt conflicts. If you are building on an API or a platform with a system prompt, a well-constructed user prompt can still fail if it conflicts with system-level instructions.
  • Ambiguity from unstated assumptions. Sometimes a prompt is technically complete but has an invisible assumption baked in — a term the writer considers obvious that the model interprets differently. These require output evaluation, not prompt evaluation.

The rubric reduces the probability of bad output. It does not eliminate it. Treat a score of 17–18 as “ready to run with reasonable confidence,” not “guaranteed to succeed.”

Related reading:

  • The Anatomy of a Perfect Prompt — The six-component structure that maps directly to the dimensions in this rubric, with worked examples of each
  • The RTGO Prompt Framework — A four-part prompt framework designed so that high-scoring prompts are the natural output of following it
  • Stop Using One-Liner Prompts — How context sufficiency (Dimension 3) is the most commonly missing element and how to fix it
  • Prompt Scaffold — A structured tool that gives you dedicated fields for each prompt component and a live assembled preview