The sticker price of a model API call looks tiny—fractions of a cent per thousand tokens. Then you ship an agent to production and the bill is 40x what you modeled. The real cost of AI agents lives in the gap between a single call and a finished task.
Why the cost of AI agents surprises everyone
A chatbot makes one call per message. An agent makes ten, twenty, sometimes a hundred calls to finish one job. Each step re-sends the growing conversation history, the tool definitions, and the system prompt. You don't pay once—you pay for the same context over and over.
Here's the trap people miss: input tokens accumulate. If your agent runs a 12-step ReAct loop and the context grows by 2,000 tokens per step, step 12 is re-processing everything from steps 1 through 11. The last call carries about 24,000 tokens, but you have already paid for every shorter version of that transcript along the way: roughly 156,000 input tokens for the whole loop, before counting the system prompt and tool definitions.
So the cost of running an agent is not "tokens per call." It's tokens per call, multiplied by calls per task, multiplied by how fast the context grows, plus everything you retried.
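The arithmetic behind that multiplication can be sketched in a few lines. This is a minimal illustration, not a billing model: the function name and the assumption of flat, even growth per step are made up for the example, and real loops grow unevenly.

```python
def cumulative_input_tokens(steps: int, growth_per_step: int, fixed_prefix: int = 0) -> int:
    """Total input tokens sent across a loop in which every step re-sends
    the fixed prefix plus all context accumulated so far.

    Assumption: context grows by the same amount on every step.
    """
    total = 0
    for step in range(1, steps + 1):
        context_at_step = fixed_prefix + growth_per_step * step
        total += context_at_step
    return total


# The 12-step, 2,000-tokens-per-step loop described above.
final_context = 2_000 * 12
billed_input = cumulative_input_tokens(steps=12, growth_per_step=2_000)
print(final_context, billed_input)  # 24000 156000
```

The final context is 24,000 tokens, but the loop sends 6.5 times that in total, which is why per-call pricing understates the bill.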
The four real cost drivers
1. Token volume from context growth
Most spend is input tokens, not output. Tool schemas are heavy—a single well-documented tool can be 500+ tokens, and agents often carry a dozen. Multiply that by every turn in the loop. Trimming tool definitions and pruning old turns out of the context window is often the highest-leverage thing you can do.
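A rough way to see how much of each task is spent re-sending tool schemas is to estimate their token weight and multiply by the number of turns. The sketch below uses a crude four-characters-per-token heuristic, which is an assumption; for real numbers, use your provider's tokenizer or token-counting endpoint. The function names and the tool dictionary shape are illustrative.

```python
import json


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    # This is an approximation, not a real tokenizer.
    return max(1, len(text) // 4)


def schema_overhead(tools: list[dict], turns: int) -> int:
    """Estimated tokens spent re-sending tool definitions over a whole task."""
    per_turn = sum(estimate_tokens(json.dumps(tool)) for tool in tools)
    return per_turn * turns


# Hypothetical tool definition, repeated to stand in for a dozen tools.
example_tool = {
    "name": "search_orders",
    "description": "Look up customer orders by email address or order id.",
    "parameters": {"type": "object", "properties": {"query": {"type": "string"}}},
}
print(schema_overhead([example_tool] * 12, turns=12))
```

Running this against your actual tool list, per phase of the task, shows which definitions are worth trimming or scoping out.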
2. Retries and failure loops
Agents fail and recover, and recovery isn't free. Common money sinks:
- Malformed tool calls the model has to redo after a validation error.
- Rate-limit backoffs that re-send the full request.
- Self-correction loops where the agent notices a mistake and reasons through a fix—paying full input cost each pass.
- Infinite or near-infinite loops where an agent keeps calling the same tool because it can't satisfy its own goal. Without a hard step cap, one stuck task can cost more than a thousand good ones.
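One inexpensive guard against the stuck-loop case is to notice when the agent repeats the exact same tool call. The sketch below is a complement to a hard step cap, not a replacement for one. `LoopDetector` and `RepeatedCallError` are hypothetical names, and the threshold of three repeats is an arbitrary assumption.

```python
import json
from collections import Counter


class RepeatedCallError(RuntimeError):
    """Raised when the same tool call has been issued too many times."""


class LoopDetector:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def check(self, tool_name: str, args: dict) -> None:
        # Serialize arguments with sorted keys so that key order
        # does not make identical calls look different.
        key = (tool_name, json.dumps(args, sort_keys=True))
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise RepeatedCallError(
                f"{tool_name} called {self._seen[key]} times with identical arguments"
            )
```

Call `check` before dispatching each tool call and treat the exception as a failed task, so the run stops instead of paying for another identical pass.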
3. Orchestration overhead
Multi-agent setups multiply everything. A "planner" agent that spawns three "worker" agents means four context windows, four system prompts, and a coordination layer passing results between them. Frameworks like LangGraph, CrewAI, and AutoGen make this easy to build, and easy to make expensive, because every handoff serializes state back into a prompt. Supervisor patterns are powerful, but each layer of delegation adds at least one more full model call.
4. The wrong model for the step
Using a frontier model for every step is the most common overspend. Most agent steps are mechanical: parse this, decide which tool, format that. Those don't need your most expensive model.
The levers that actually keep AI agents cheap
You don't cut agent costs with one trick. You stack several.
Prompt caching
This is the biggest single lever most teams ignore. Providers like Anthropic and OpenAI support caching the static prefix of your prompt (system instructions, tool definitions, few-shot examples), so repeated calls read it at a steep discount instead of full price. For an agent that re-sends the same 4,000-token preamble on every loop, caching that prefix can shrink what is often the dominant cost line to a small fraction of its uncached price. Caches match on an exact prefix, so order your prompt with the stable parts first and the variable parts last.
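A back-of-the-envelope estimate shows why the static prefix matters. The sketch below compares a loop with and without caching, measured in full-price input-token equivalents. The multipliers are placeholder assumptions (a surcharge on the first write, a discount on later reads), not quoted prices; check your provider's current pricing. It also assumes the cache stays warm for the whole task and that the variable part is the same size on every call.

```python
def uncached_cost(calls: int, prefix_tokens: int, variable_tokens: int) -> float:
    """Every call pays full price for the prefix and the variable part."""
    return float(calls * (prefix_tokens + variable_tokens))


def cached_cost(
    calls: int,
    prefix_tokens: int,
    variable_tokens: int,
    write_multiplier: float = 1.25,  # assumed surcharge for writing the cache
    read_multiplier: float = 0.10,   # assumed discount for reading the cache
) -> float:
    """First call writes the prefix to the cache; later calls read it."""
    if calls <= 0:
        return 0.0
    prefix_cost = (
        prefix_tokens * write_multiplier
        + (calls - 1) * prefix_tokens * read_multiplier
    )
    return prefix_cost + calls * variable_tokens


# A 12-call loop with a 4,000-token static preamble and 1,000 variable tokens.
before = uncached_cost(12, 4_000, 1_000)
after = cached_cost(12, 4_000, 1_000)
print(f"uncached: {before:.0f}, cached: {after:.0f}")
```

Under these assumed multipliers the prefix falls from 48,000 token-equivalents to about 9,400, while the variable part is untouched. That is the reason to keep the variable content small as well.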
Model routing
Route by difficulty. Use a small, cheap model (Claude Haiku, GPT-4o mini, or an open model like Llama on your own hardware) for routing, extraction, and classification. Reserve the expensive model for genuine reasoning and final synthesis. A cascade (cheap model first, escalating only when confidence is low) often handles the majority of steps at a fraction of the price.
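The cascade can be sketched as a small wrapper around two model callables. Everything here is hypothetical: the `ModelFn` signature, the idea that a model returns a confidence score alongside its answer, and the 0.8 threshold are assumptions for illustration. In practice the confidence signal might come from a classifier, log probabilities, or a validation check.

```python
from typing import Callable, Tuple

# Hypothetical interface: a model takes a prompt and returns (answer, confidence).
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(
    prompt: str,
    cheap: ModelFn,
    strong: ModelFn,
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when its confidence is low.

    Returns the answer and the name of the tier that produced it.
    """
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = strong(prompt)
    return answer, "strong"


# Stub models standing in for real API calls.
def stub_cheap(prompt: str) -> Tuple[str, float]:
    return ("refund", 0.95) if "refund" in prompt else ("unsure", 0.3)


def stub_strong(prompt: str) -> Tuple[str, float]:
    return ("needs human review", 0.9)


print(cascade("customer asks for a refund", stub_cheap, stub_strong))
print(cascade("ambiguous legal question", stub_cheap, stub_strong))
```

Logging which tier handled each step tells you whether the threshold is set sensibly: if nearly everything escalates, the cheap call is pure overhead.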
Aggressive context management
- Summarize and compact long histories instead of carrying every raw turn.
- Drop tool outputs once they've been used—a 10,000-token API response rarely needs to ride along for the rest of the task.
- Scope tools per phase so the model only sees the handful relevant to the current step.
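Dropping spent tool outputs, the second item above, might look like the following. This is a minimal sketch assuming messages are plain dictionaries with `role` and `content` keys and that tool results use the role `"tool"`. The function name, the placeholder text, and the choice to keep the four most recent messages intact are all illustrative.

```python
def compact_history(
    messages: list[dict],
    keep_recent: int = 4,
    placeholder: str = "[tool output elided]",
) -> list[dict]:
    """Replace the content of older tool outputs with a short placeholder.

    The most recent `keep_recent` messages are left untouched so the model
    still sees the results it is currently working with.
    """
    cutoff = max(0, len(messages) - keep_recent)
    compacted = []
    for index, message in enumerate(messages):
        if index < cutoff and message["role"] == "tool":
            # Copy the message so the caller's history is not mutated.
            compacted.append({**message, "content": placeholder})
        else:
            compacted.append(message)
    return compacted
```

Keeping the message in place, with only its content swapped out, preserves the call-and-result structure that some APIs require while removing the bulk of the tokens.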
Hard limits and guardrails
- Cap max steps per task. A loop that can't finish in N steps should fail loudly, not spend forever.
- Set per-task token budgets and kill runs that blow them.
- Validate tool inputs before the call so the model isn't paying to fix avoidable errors.
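The first two limits can be enforced by a thin wrapper around the agent loop. The sketch below assumes a hypothetical `step_fn` that runs one agent step and returns whether the task is done along with the tokens that step consumed. The exception names and default limits are placeholders to tune for your workload.

```python
from typing import Callable, Tuple


class StepLimitExceeded(RuntimeError):
    """The task did not finish within the allowed number of steps."""


class TokenBudgetExceeded(RuntimeError):
    """The task used more tokens than its budget allows."""


def run_with_limits(
    step_fn: Callable[[int], Tuple[bool, int]],
    max_steps: int = 15,
    max_tokens: int = 200_000,
) -> Tuple[int, int]:
    """Run agent steps until done, failing loudly on either limit.

    Returns (steps_taken, tokens_used) on success.
    """
    tokens_used = 0
    for step in range(1, max_steps + 1):
        done, step_tokens = step_fn(step)
        tokens_used += step_tokens
        if tokens_used > max_tokens:
            raise TokenBudgetExceeded(
                f"used {tokens_used} tokens after {step} steps (budget {max_tokens})"
            )
        if done:
            return step, tokens_used
    raise StepLimitExceeded(f"no result after {max_steps} steps")
```

Because the budget is checked after each step, a run can overshoot by at most one step's worth of tokens. Checking an estimate before the call would tighten that at the cost of more bookkeeping.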
Smaller output where you can
Output tokens cost several times more than input on most models. If a step only needs a structured decision, ask for JSON, not prose. Don't make the model narrate when you only need the answer.
Measure before you optimize
You can't control what you don't log. Track tokens per task, calls per task, retry rate, and cache hit rate—per agent and per step. The expensive step is almost never the one you'd guess; it's usually a verbose tool result or a redundant verification pass quietly running on every task. Find it, then apply the lever that fits. The cost of AI agents is controllable once you stop looking at the per-call price and start watching the whole loop.
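A per-task tracker for those four numbers does not need to be elaborate. The sketch below is one possible shape, with hypothetical class, field, and step names. It assumes your provider's response reports input, cached, and output token counts that you can pass in after each call.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskMetrics:
    calls: int = 0
    retries: int = 0
    input_tokens: int = 0
    cached_tokens: int = 0
    output_tokens: int = 0
    tokens_by_step: dict = field(default_factory=lambda: defaultdict(int))

    def record(
        self,
        step: str,
        input_tokens: int,
        output_tokens: int,
        cached_tokens: int = 0,
        retry: bool = False,
    ) -> None:
        """Record one model call, attributed to a named step."""
        self.calls += 1
        self.retries += int(retry)
        self.input_tokens += input_tokens
        self.cached_tokens += cached_tokens
        self.output_tokens += output_tokens
        self.tokens_by_step[step] += input_tokens + output_tokens

    @property
    def retry_rate(self) -> float:
        return self.retries / self.calls if self.calls else 0.0

    @property
    def cache_hit_rate(self) -> float:
        # Share of input tokens that were served from the cache.
        return self.cached_tokens / self.input_tokens if self.input_tokens else 0.0

    def most_expensive_step(self) -> Optional[str]:
        if not self.tokens_by_step:
            return None
        return max(self.tokens_by_step, key=self.tokens_by_step.get)


metrics = TaskMetrics()
metrics.record("plan", input_tokens=4_000, output_tokens=300)
metrics.record("verify", input_tokens=9_000, output_tokens=200, cached_tokens=4_000)
metrics.record("verify", input_tokens=9_500, output_tokens=200, cached_tokens=4_000, retry=True)
print(metrics.most_expensive_step(), metrics.retry_rate, metrics.cache_hit_rate)
```

Aggregating `tokens_by_step` across many tasks is usually enough to surface the verbose tool result or redundant verification pass that dominates spend.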