Three levers for cutting your coding agent's token bill, and how to tell which one works

Coding agents burn tokens on naive grep-and-read loops and raw tool output. Three independent levers cut the bill. I ran a real internal benchmark on a code-graph stack over 17 repos and the lever I expected to win did not. Here is what I measured, per engine, behind a quality gate.

If you are new to how LLMs are billed, two minutes of fundamentals make the rest of this post obvious. A token is a chunk of text, roughly three-quarters of a word, and it is the unit you pay for. Every model call has two sides. Input tokens are everything you send: the system prompt, the conversation so far, and every tool result you paste back in. Output tokens are everything the model generates: its reasoning, its tool calls, and the code it writes. You pay for both, usually at different rates, and crucially you pay for the input on every single call. An agent that takes twenty steps re-sends a growing pile of history twenty times. Think of it like a taxi meter that charges for the passengers already in the car, not only the new one you pick up: the longer the ride and the more you carry, the more each additional block costs. That is why a loop with no expensive prompt anywhere in it can still run up a bill nobody can explain.

I noticed the problem on an invoice. A small fleet of coding agents, pointed at a normal-sized repo, was spending more per resolved task than I had any model for. Nobody had written an expensive prompt. The cost was coming from the loop itself.

So I sat with a week of traces and read what the agents were doing token by token. Two patterns explained most of the bill. The first was a grep-and-read reflex: the agent would search for a symbol, get a wall of matches, then read four or five whole files to find the one function it needed. The second was that every tool result went into context raw. A 600-line grep -rn dump, a pretty-printed JSON response, a stack trace with forty frames of framework noise, all of it landing verbatim in the next prompt.

Neither of those is a model problem. They are plumbing problems, and plumbing is fixable.

This post is about three levers I pulled to cut that bill, what each one was worth, and the part that took me longest to accept: they are independent, they do not pay off equally, and the only way to know which one is working is to measure each one on its own. I am not going to give you a vendor “10x” tagline, because when I measured honestly, nothing was 10x, and the lever I expected to win did not.

The benchmark I actually ran

The numbers in this post are not made up. They come from a real internal benchmark I ran against a code-graph-rag stack pointed at 17 of our backend repos, each a medium service of roughly 10,000 to 100,000 lines. The stack has two model calls per question: a small Cypher generator that turns a natural-language question into a graph query, running on Haiku 4.5, and an orchestrator that plans and answers, running on Sonnet 4.6. I instrumented the HTTP client to capture the exact request and response bodies against the Anthropic API, so every cost figure below is from a measured usage object on the wire, not an estimate. Where a number is illustrative rather than measured, I say so.

To keep the worked examples concrete and generic, I will also use a small imaginary service: an HTTP API, a handful of route handlers, a service layer, a data-access layer, and tests. Call it widget-store. The kind of repo where an agent task is “add a discount field to orders and thread it through to the invoice,” which spans layers: the agent has to find where orders are defined, where invoices are built, and what touches both. That is exactly the cross-file discovery that grep-and-read handles badly, and the kind of task the real stack answers against the 17 repos.

Where the tokens go

Before pulling any lever, split the bill into two buckets:

  • Input you fed the model: the system prompt, the running history, and every tool result you pasted back in.
  • Output the model generated: its reasoning, its tool calls, and the code it wrote.

This split decides everything downstream. If a task’s cost is dominated by input you control, retrieval and compression have room to work. If it is dominated by the model’s own output, no amount of smarter reading will help much, because you cannot compress tokens the model has not produced yet.

I underestimated how often output dominates. On a tight task with a clear target file, the agent reads a little and writes a lot, and the input levers barely move the needle. That single fact is why the first lever is not the clean win I assumed it would be.

per-task token bill
├── input (you control this)
│   ├── system + instructions
│   ├── conversation history
│   └── tool results  ← grep dumps, JSON, logs live here
└── output (model controls this)
    ├── reasoning
    ├── tool calls
    └── generated code

Lever 1: structural retrieval with a code graph

The idea is to stop making the model read files to discover structure. You precompute a graph of the codebase once, where nodes are files, classes, and functions, and edges are calls, imports, and definitions. Then the agent asks the graph a question and gets back the three functions that matter instead of the five files that contain them.

Off-the-shelf, you can build this with something like code-graph-rag, which parses a repo into a graph you can query, backed by a graph store such as Memgraph. The agent’s “find where orders connect to invoices” becomes a traversal instead of a grep plus four reads.

flowchart LR
    Q["agent question:<br/>what connects orders to invoices?"]
    subgraph cg["precomputed code graph"]
        O["Order"] -->|used by| S["InvoiceService.build()"]
        S -->|calls| L["lineItemsFor(order)"]
    end
    Q --> cg
    cg --> A["3 functions, ~40 lines<br/>instead of 4 files, ~600 lines"]

On the widget-store discount task, this is where the graph shines. Brute-force, the agent pulled four files into context to trace the path from order to invoice. With the graph, it pulled the three relevant functions. Illustratively, that took the discovery phase from roughly 9,000 input tokens to roughly 2,500. A real cut, and the kind of thing a vendor screenshot would stop at.

Here is the part the screenshot leaves out. I ran the graph against a set of tasks, not one, and the result was lumpy:

  • On deep cross-file discovery, the graph won clearly. Fewer files read, less context, lower input cost.
  • On shallow tasks with an obvious target file, the graph was break-even. The traversal overhead roughly cancelled the reading it saved.
  • On write-heavy tasks, where the model’s own output dominated the bill, the graph was slightly worse, because I was paying to build and query a graph whose savings landed on the small part of the bill.

So the honest finding is that a code graph is question-dependent. It is a sharp tool for discovery and a dull one everywhere else. If your workload is mostly small, targeted edits, the graph can cost you money. I went in expecting a flat discount and came out with a conditional one.

Lever 2: compress at the tool-output boundary

This is the lever I almost skipped, and it turned out to be the reliable large win.

The insight is narrow on purpose. Tool output is machine-shaped and repetitive: grep results repeat paths and line numbers, JSON repeats keys, logs repeat timestamps and framework frames. A prompt compressor like LLMLingua-2 is built for exactly this. It drops tokens that carry little information for the model while keeping the ones that do. On that kind of input it is lossless enough that the agent’s behavior does not change.

The rule I settled on, and the one I would lead with:

Compress tool output. Never compress source code.

Source code is the opposite of a log. Every token can be load-bearing, and the failure mode of dropping the wrong one is silent: the model gets a subtly wrong picture and writes subtly wrong code. So the compressor sits on one boundary only, the place where tool results re-enter context, and source files pass through untouched.

# illustrative: compress only at the tool-output boundary
def feed_to_context(payload, kind):
    if kind in ("grep", "json", "logs", "shell_output"):
        return compress(payload, target_ratio=0.45)  # LLMLingua-2
    if kind == "source":
        return payload  # never touch source code
    return payload

Across the task set, this was the steadiest saver. On tool-heavy tasks it cut total input tokens by roughly 30 to 40 percent, illustratively, and the quality gate did not flinch. I want to be precise about which numbers I stand behind: the per-engine caching figures in lever 3 are measured on the wire, while this compression range is from lighter runs on the same stack rather than the full instrumented benchmark, so I treat it as directional. It does not depend on the shape of the question the way the graph does, because almost every agent task generates verbose tool output at some point.

The catch is that you have to be disciplined about the boundary. The first time I let a compressor see a file because the wiring was sloppy, the agent confidently edited a function that no longer matched what was on disk. One mislabeled payload is all it takes. Tag your payloads by kind and route on the tag, not on a guess.

Lever 3: a gateway with virtual keys, a ledger, prompt caching, and a shared cache

The first two levers make each call cheaper. The third makes calls disappear, and it is the lever I have the most measured data on, because it is the one I benchmarked end to end.

Right now, if you have more than one engineer running agents, you probably have provider keys scattered across machines, no shared view of spend, and the same questions being answered from scratch by every person on the team. Centralize the keys behind a gateway like LiteLLM and four things become possible at once:

  • Per-developer virtual keys. Everyone calls the gateway, not the provider. You can rotate, scope, and rate-limit a person without touching a provider dashboard.
  • A spend ledger. Every call is attributed, so cost stops being a monthly surprise and becomes a number you can watch by repo and by task type. This is also how you keep the benchmarking honest later.
  • Prompt caching. Mark the stable prefix of a request, the system prompt and tool definitions, with cache_control, and the provider charges a reduced rate to re-read it on a subsequent call within the cache window instead of full price. This is the lever I measured most precisely, and it has a sharp edge described below.
  • A shared semantic cache. When one engineer’s agent asks “how does invoice rounding work,” the answer gets cached by meaning, and the next engineer who asks a semantically similar question gets a cheap hit instead of a fresh model call. A tool like GPTCache does the embedding-and-lookup part.

What prompt caching actually returned, per engine

This is where the benchmark earns its keep, and where the honest finding diverges from the slide. I turned on Anthropic prompt caching across the two-engine stack and measured the per-call usage on the wire. The result split cleanly by engine, and the split is the lesson.

The small Cypher generator on Haiku 4.5 got zero reduction. Its system prompt is about 1,750 tokens, which sits below Haiku’s cacheable-prefix minimum, so the provider quietly declined to cache it: the marker was on the wire, but cache_creation_input_tokens came back zero on every call. The cache was a no-op for that engine. At roughly $0.0023 per Cypher call, that engine was already cheap, so the miss did not hurt, but it would have been invisible if I had not read the usage object. A cache that silently does nothing is the easiest savings to imagine and the easiest to never actually get.

The orchestrator on Sonnet 4.6 is where caching paid. Its system prompt plus tool definitions run about 3,400 tokens, comfortably above the threshold, so it cached. The economics measured out like this:

Orchestrator callInput billingTotal costvs uncached
Uncached3,400 tok at full rate$0.0132reference
First call (cache write)3,400 tok at 1.25x$0.0158+20% (the write premium)
Warm call (cache read)3,400 tok at 0.10x$0.0042about 68% cheaper

The first call is more expensive, because a cache write costs 1.25x the input rate. Every subsequent call within the window reads the cached prefix at a tenth of the input rate and lands about 68% cheaper than an uncached call. So caching is not a flat discount; it is a bet that you will reuse the prefix enough times to pay back the one write.

Whether the bet pays depends on how clustered the usage is. A “session” is one developer asking a series of questions in close succession, each triggering one Cypher call and one orchestrator call. Measured across session shapes:

Session shapeNet cost changeWhy
Single question, cold cache+17% worseYou pay the write premium and never read it back
5 questions (4 within the window)about 43% cheaperFour warm orchestrator reads outweigh one write
20 questions, heavy useabout 54% cheaperThe write is amortized across nineteen warm reads

Projected onto a month of real usage, the savings ratio climbs with intensity, because caching only beats baseline when there are reads within the window: roughly 26% for a single light-use developer, around 46% for a small team of three asking regular questions, and near 50% for a wider team of eight. My earlier back-of-envelope guess had been “35 to 50% across the board,” and the measurement corrected it: the floor is lower and the ceiling is real, but only under repetition, and only on the engine whose prefix clears the threshold.

The transferable findings, both of which I would have gotten wrong without reading the wire:

  • Below the prefix threshold, caching is a silent no-op. Haiku’s minimum cacheable prefix is around 2,048 tokens; Sonnet’s is lower. A prompt under the line gets the marker honored on the wire and cached zero tokens. Check cache_creation_input_tokens, not the request.
  • A single-shot call is worse with caching on, not better. The 1.25x write premium with no read back makes a one-question session 17% more expensive. Caching is a repetition lever, so its value is a function of your session shape, not a constant.
flowchart LR
    D1["dev A agent"] --> G["gateway<br/>(virtual keys + ledger)"]
    D2["dev B agent"] --> G
    G --> C{"semantic<br/>cache hit?"}
    C -->|yes, same repo| H["cached answer<br/>~0 tokens"]
    C -->|no| M["model call"]
    M --> C2["write to cache<br/>(scoped to repo + commit)"]

A shared cache sounds like free money, and that is exactly why it is dangerous. Two governance gotchas bit me, and both are the kind you only see in production.

The first is invalidation on code change. A cached answer about how invoice rounding works is correct until someone changes the rounding. If the cache outlives the code it described, it starts handing out confident, stale answers. I key cache entries to the repo state, so a relevant commit invalidates the entries that depended on it. A stale cache hit is worse than a miss, because it costs nothing and lies.

The second is per-repo scoping. Without it, an answer derived from one project can surface as a hit in another, and the two repos do not share invariants. I scope every cache entry to its repo so answers cannot leak across projects. The savings from cross-project sharing are not worth one wrong answer in the wrong codebase.

With those two rules in place, the cache was the lever with the best ceiling, because a hit is close to free. But its value depends entirely on repetition. On a team asking overlapping questions about a shared codebase, the hit rate climbed and the savings compounded. A solo developer on a fast-changing repo would see most entries invalidated before anyone reused them.

Interactive: stack the levers
2,400,000 tokens / month (illustrative) baseline
Code-graph retrieval Pull the call graph, not whole files. Question-dependent, so the saving is smaller and variable. ~ -18%
Tool-output compression Summarize logs, diffs, and command output at the boundary. The reliably-large lever. ~ -45%
Semantic cache A hit is close to free. A step-change, but only on repeated, overlapping questions. ~ -30%

Numbers are illustrative, applied to a representative baseline for the toy repo. Real savings depend on your codebase, your questions, and your hit rate.

Attribution: the step that is easy to skip

The trap is stacking levers and measuring once. You implement all three in a weekend, you look at next month’s invoice, and it is lower. You congratulate yourself and move on.

You have no idea which lever did the work.

Maybe the compressor is carrying the whole thing and the code graph is quietly costing you money on every small edit. Maybe the cache hit rate is near zero and you built infrastructure for nothing. Stacked together and measured once, the levers are indistinguishable, and you will keep paying for the ones that do not pay you back.

So I benchmarked each lever the boring way:

  1. Fix a task set. A frozen list of representative tasks against the widget-store repo, run identically every time.
  2. Fix a quality gate. Each run has to pass the same bar: tests green, the diff does what the task asked. A lever that saves tokens by producing worse code has saved nothing. This gate is non-negotiable, because every one of these levers can cut cost by quietly cutting quality.
  3. Change one thing at a time. Baseline, then baseline plus graph, then baseline plus compression, then baseline plus cache. Never two at once.
  4. Attribute the delta. Whatever the bill moved when you added one lever, and only that lever, is what the lever is worth. The spend ledger from lever three is what makes this measurable per run.
run             token Δ           quality gate   verdict                       basis
baseline        --                pass           reference                     --
+ graph         -28% on discovery pass           wins on deep discovery only   illustrative
+ compression   -34% total        pass           reliable, broad               directional
+ prompt cache  -54% per 20q sess pass           repetition-bound, per-engine  measured on wire
+ semantic cache varies           pass           best ceiling, repetition-bound illustrative

What fell out of doing it this way is the whole point of this post. The lever I expected to be the star, the code graph, was conditional. The lever I almost skipped, tool-output compression, was the dependable win. And caching, the one I measured to the cent, was a step-change on warm, repeated queries against the engine whose prefix cleared the cache threshold, a no-op on the small engine below it, and a 17% penalty on single-shot use. Free money only after I paid the write premium and only under repetition.

None of that is visible if you measure the stack as a whole. It only shows up when each lever has to earn its savings alone, against a quality gate, on the same tasks every time.

What I would do on a fresh setup

If I were starting over on a new team and a new repo, in order:

  1. Gateway, keys, and ledger first. Even before any cache pays off, you want attribution. You cannot run the rest of this honestly without a per-call usage number, and reading that usage is what told me the Haiku cache was a silent no-op.
  2. Prompt caching next, but check the threshold. Turn it on for the engine whose system-plus-tools prefix clears the provider’s cacheable minimum, and confirm cache_creation_input_tokens is non-zero before you believe it. Expect it to help warm, repeated sessions and to cost you on single-shot calls.
  3. Compression alongside. It is a high floor, it does not depend on the question, and the only discipline it asks is keeping source code away from the compressor.
  4. Semantic cache once there is repetition. Turn it on when more than one person is asking overlapping questions about the same codebase, and turn on invalidation and per-repo scoping the same day, not later.
  5. Graph last, and only if your workload is discovery-heavy. If your agents mostly do large cross-file investigations, build it. If they mostly do small targeted edits, skip it and save yourself the infrastructure.

The meta-lesson is the one I keep relearning in this kind of work. Independent levers have to be measured independently, behind a quality gate, or you cannot tell help from theater. The originality is never in any single trick. It is in wiring them together and being honest about what each one was worth.

Key takeaways

  • You pay for input on every call, and input grows as history grows, so a loop with no expensive prompt can still run up a large bill.
  • Split every task’s cost into input you control and output the model generates. The levers only work on the input side, so a write-heavy task limits how much they can help.
  • A code graph wins on deep cross-file discovery and is break-even or worse on shallow or write-heavy tasks. It is question-dependent, not a flat discount.
  • Compress tool output, never source code. Tool output is repetitive and machine-shaped; source code is load-bearing token by token, and dropping the wrong one fails silently.
  • Prompt caching is the lever I measured precisely, and it is per-engine: zero on the Haiku engine whose prefix sat below the cacheable threshold, about 68% cheaper on warm Sonnet orchestrator reads, and a 17% penalty on a single-shot session. Check cache_creation_input_tokens on the wire, because a sub-threshold cache is a silent no-op.
  • A shared semantic cache has the best ceiling but only after you pay the governance tax: invalidate on code change and scope per repo. A stale hit costs nothing and lies.
  • Benchmark each lever alone, against the same task set, behind a fixed quality gate, changing one thing at a time. Stack-and-measure-once tells you nothing about which lever earned its keep.
  • On a fresh setup: gateway and ledger first, prompt caching with a threshold check next, compression alongside, semantic cache once there is repetition, graph last and only if your workload is discovery-heavy.

Common questions

Does a code graph always cut token cost?

No. In my runs it won on deep cross-file discovery, where brute-force reading would have pulled five or six files into context. On shallow questions, or any task where the model's own generated output dominates the bill, the graph was break-even or slightly worse once you count the retrieval overhead. Treat it as a question-dependent tool, not a flat discount.

Why compress tool output but not source code?

Tool output like grep dumps, JSON, and logs is repetitive and machine-shaped, so a compressor such as LLMLingua-2 can drop a large fraction of tokens with little loss the model cares about. Source code is the opposite: every token can be load-bearing, and dropping the wrong one silently changes the answer. I leave source untouched and only compress at the tool-output boundary.

Is a shared semantic cache safe across projects?

Only if you scope it. An answer derived from one repo can be wrong or leaky in another, so I key the cache per repo and invalidate on code change. Without those two rules a cache hit becomes a confident wrong answer, which costs more than the tokens it saved.

Why does an agent loop cost so much when no single prompt is expensive?

You pay for input tokens on every call, and the conversation history is part of the input. A twenty-step task re-sends a growing pile of history twenty times, like a taxi meter charging for the passengers already in the car. The bill comes from the accumulation across the loop, not from any one prompt, which is why reading the per-call cost matters more than reading any single message.

How do I know which lever saved money?

Benchmark each lever independently against the same task set with a fixed quality gate, change one thing at a time, and attribute the delta to that lever alone. If you stack all three and measure once, you cannot tell which one paid for itself, and you will keep paying for the ones that did not.

Keep reading

Prasad Subrahmanya
Prasad Subrahmanya

Founder & CEO at Luminik. 3x technical founder. I turn expensive, repetitive work into products people pay for.

Back to all writing