Most prompts fail for boring reasons: they ask for too much at once, leave the format to chance, and never get measured. Good prompt engineering is less about clever phrasing and more about removing ambiguity the model would otherwise resolve at random.
Here are four levers that reliably move quality: structure, examples, decomposition, and evaluation. Each is cheap to apply and compounds with the others.
Structure: give the model a shape to fill
The single biggest gain in prompt engineering comes from imposing structure. A model fills whatever container you hand it. A vague container produces vague output.
Separate your prompt into labeled parts so the model can tell instructions from data:
- Role and task — one or two sentences on what the model is doing and for whom.
- Context — the source material, fenced off clearly (XML tags like
<document>work well and are hard to confuse with instructions). - Constraints — length, tone, what to avoid, what to do when information is missing.
- Output format — exact schema, ideally JSON or a fixed template.
When you need machine-readable output, specify the schema and a fallback. "Return JSON matching this shape; if a field is unknown, use null" beats hoping the model guesses. Pair this with structured output or tool-calling features in the API rather than parsing freeform text, and a whole class of failures disappears.
One concrete habit: put instructions before the data, and restate the most important constraint at the end. Long-context models attend strongly to the beginning and end of a prompt, so a critical rule buried in the middle gets diluted.
Examples: show, don't just tell
Few-shot prompting is the fastest way to lock in a format or a judgment call that words struggle to describe. Two or three worked examples often outperform a paragraph of instructions.
Examples do two jobs at once. They demonstrate the output shape, and they implicitly define edge-case behavior. If you want the model to write "unknown" instead of inventing a date, include one example where the input lacks a date and the output says "unknown."
A few rules that keep few-shot from backfiring:
- Make examples representative, not just easy. Include a hard case and a near-miss, or the model only learns the happy path.
- Keep formatting identical across examples. Inconsistent examples teach inconsistency.
- Watch for label bias. If every example you show is classified "positive," the model will lean positive. Balance them.
For reasoning-heavy tasks, chain-of-thought helps: ask the model to work through the steps before giving the answer. But place the reasoning before the final answer, and if you need clean output, have the model think inside a scratchpad section and then emit only the final result. On newer reasoning models, heavy-handed "think step by step" scaffolding is often unnecessary and can even hurt.
Decomposition: one prompt, one job
A prompt that extracts data, judges it, and writes a summary in a single pass will do all three worse than three focused prompts would. Decomposition is the prompt engineering version of single responsibility.
Break a complex task into a chain:
- Extract the raw facts from the source.
- Validate or classify those facts.
- Generate the final artifact from the validated facts.
Each step is easier to write, easier to test, and easier to debug when something goes wrong. You can also use cheaper, faster models for the mechanical steps and reserve the strongest model for the step that actually needs judgment.
Decomposition also unlocks self-correction. A common pattern: generate a draft with one call, then critique it against an explicit rubric in a second call, then revise in a third. The critique step catches things a single generation never would, because asking a model to find problems is a different cognitive task than asking it to produce content.
The cost is latency and tokens. Decompose where quality matters and keep simple tasks as a single call.
Evals: measure or you are just guessing
Without evaluation, every prompt change is a vibe. You tweak wording, the next three outputs look better, and you have no idea whether you improved the prompt or got lucky. Evals turn prompt engineering into something you can actually optimize.
Start small and concrete:
- Build a test set of 20 to 50 real inputs, including the weird ones that broke things in production.
- Define a grader. For structured tasks, assert on exact fields. For open-ended ones, use a rubric and an LLM-as-judge with clear criteria, or human review for the cases that matter.
- Run before and after every change. A prompt edit that fixes one case often quietly breaks two others; only a test set catches the regression.
Tools like the OpenAI Evals framework, promptfoo, or a simple spreadsheet plus a script all work. The framework matters far less than the discipline of running the same inputs every time and writing the result down.
Treat prompts like code. Version them, keep the eval set in the repo next to the prompt, and review prompt changes the way you review a pull request. The teams that ship reliable LLM features are not the ones with the cleverest wording. They are the ones who measure.