2. Compress your prompts with Mermaid diagrams
We had one enterprise account with 2,000-3,000 lines of text instructions for test case generation. Nobody was reading them, not even the developers. We just trusted the system. But the LLM had to read them, and we had to pay for that, on every single loop.
The solution
We started converting long text instructions into Mermaid diagrams (flows) and pseudo-scripting formats.
- A 1,000-token verbose text prompt compressed down to under 100 tokens when written as a Mermaid structure.
- The LLM understood the logic perfectly.
- We saw zero negative impact on agent execution quality.
If your prompts are written like essays for humans, compress them into structural code for the models.
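To make the idea concrete, here is a minimal sketch that compares a prose instruction with an equivalent Mermaid flow. The instruction text, the diagram, and the `rough_token_count` helper are all invented for illustration and are not our production prompt. The "4 characters per token" rule is only a crude estimate, so use your model's real tokenizer for actual numbers.

```python
# Hypothetical example: both prompts below are made up for illustration.

VERBOSE_PROMPT = """
When you receive a ticket, first check whether it has acceptance criteria.
If the ticket has no acceptance criteria, stop and ask the reporter to add
them, and do not generate any test cases. If the ticket does have acceptance
criteria, generate one positive test case for every criterion. After that,
generate negative test cases for every criterion that involves user input.
When all test cases are generated, check them for duplicates, remove any
duplicates you find, and finally post the remaining test cases as a comment.
"""

# The same logic expressed as a Mermaid flowchart.
MERMAID_PROMPT = """
flowchart TD
  A[Ticket] --> B{Has AC?}
  B -- no --> C[Ask reporter, stop]
  B -- yes --> D[Positive TC per AC]
  D --> E[Negative TC if user input]
  E --> F[Dedupe]
  F --> G[Post comment]
"""


def rough_token_count(text: str) -> int:
    """Crude estimate: roughly 4 characters per token (an assumption)."""
    return max(1, len(text.strip()) // 4)


verbose_tokens = rough_token_count(VERBOSE_PROMPT)
mermaid_tokens = rough_token_count(MERMAID_PROMPT)
print(f"verbose: ~{verbose_tokens} tokens, mermaid: ~{mermaid_tokens} tokens")
```

The ratio you get depends entirely on how wordy the original prompt is, so measure your own prompts before and after rather than relying on this toy comparison.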
3. Index repositories before reading files for cost control
The default behavior of most orchestrators (like Copilot, Claude Code, etc.) is highly inefficient. They run a basic bash/grep search, find a keyword in a file, and then immediately make the agent read the entire file to understand the context. If you are investigating a large codebase, the token bill compounds quickly.
The solution
We integrated CodeGraph and RepoMix into our runner setups. These tools index the repository and prepare a lightweight snapshot of the codebase (public/private methods, system architecture, folder trees) without loading the actual implementation of every helper method.
- The agent uses CodeGraph to locate the exact 10 lines of code it needs and RepoMix to get a structural snapshot of the file. Only after that does it read the needed fragment, not the whole file.
- Indexing takes only 15 seconds, even on massive codebases.
The result: codebase exploration tokens drop by 50% to 60%, which improves unit economics.
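The "outline first, fragment second" pattern can be sketched with nothing but Python's standard `ast` module. This is not how CodeGraph or RepoMix work internally and it does not use their APIs. `build_outline`, `read_fragment`, and the sample source are hypothetical names made up to show the idea on a single file.

```python
import ast

# Hypothetical file content standing in for a real module in the repo.
SAMPLE_SOURCE = '''\
def load_config(path):
    with open(path) as fh:
        return fh.read()


class Runner:
    def start(self, job):
        self._prepare(job)
        return "started"

    def _prepare(self, job):
        job["ready"] = True
'''


def build_outline(source: str) -> list[dict]:
    """Return name, kind, and line span for every function and class."""
    outline = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            outline.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start": node.lineno,
                "end": node.end_lineno,
            })
    return sorted(outline, key=lambda item: item["start"])


def read_fragment(source: str, start: int, end: int) -> str:
    """Return only lines start..end (1-indexed, inclusive)."""
    return "\n".join(source.splitlines()[start - 1:end])


# Step 1: the agent sees only the cheap outline, not the implementations.
outline = build_outline(SAMPLE_SOURCE)

# Step 2: it picks one symbol and reads just those lines.
target = next(item for item in outline if item["name"] == "start")
fragment = read_fragment(SAMPLE_SOURCE, target["start"], target["end"])
print(fragment)
```

The outline costs a handful of tokens per symbol, while the fragment read is limited to the lines the agent actually asked for.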
4. Enable smarter model selection and workload routing
When trying to save money, the instinctive reaction is to switch everything to the cheapest model (like GPT-5.4 Mini or Haiku-4.5). But cheap models are often "lazy" or make logical errors, leading to infinite loops.
On top of that, your harness may limit the number of loops, so these models may not even get the chance to finish the task. We ran the exact same debugging task across different models and counted the execution loops:
- GPT-5.4 Mini: 10 loops (kept failing, retrying, and re-reading context).
- Claude Sonnet 4.6: 3-5 loops.
- Claude Opus 4.6: 1 loop (one-shot fix).
Opus is expensive per token, but running 1 loop on Opus is often cheaper than running 10 loops on a Mini model because of the accumulated input tokens from retries.
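The arithmetic behind this can be sketched as follows. Every number here is a made-up assumption chosen only to show the mechanism: the per-million prices, the 50,000-token starting context, and the 20,000 tokens added per retry are not real price-list values or measurements from our runs. Output tokens are ignored to keep the sketch short.

```python
def total_input_cost(loops: int, base_tokens: int, growth_per_loop: int,
                     price_per_million: float) -> float:
    """Input cost in dollars when each loop re-sends the grown context.

    Loop i (starting at 0) sends base_tokens + i * growth_per_loop tokens.
    """
    total_tokens = sum(base_tokens + growth_per_loop * i for i in range(loops))
    return total_tokens * price_per_million / 1_000_000


# Hypothetical prices and context sizes, for illustration only.
BASE_TOKENS = 50_000
GROWTH_PER_LOOP = 20_000

mini_cost = total_input_cost(loops=10, base_tokens=BASE_TOKENS,
                             growth_per_loop=GROWTH_PER_LOOP,
                             price_per_million=0.75)
opus_cost = total_input_cost(loops=1, base_tokens=BASE_TOKENS,
                             growth_per_loop=GROWTH_PER_LOOP,
                             price_per_million=15.0)

print(f"mini, 10 loops: ${mini_cost:.2f}")
print(f"opus, 1 loop:   ${opus_cost:.2f}")
```

Under these assumptions the ten cheap loops send 1.4 million input tokens in total against 50,000 for the single expensive loop, which is enough to outweigh a 20x price gap. With a smaller context or fewer retries the cheap model wins, so plug in your own loop counts and prices.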
The solution
Use Tier 2 models (like Sonnet or GPT-5.4) as your workhorses for cost reduction.
Route simple orchestrator tasks (file moving, deployment triggers) to Mini/Flash models. Save reasoning models for complex multi-file bug fixing. Design your budget system around this split.
We've also started routing workloads to hyper-optimized, low-cost providers like SiliconFlow and Chinese local setups (such as Kimi 2.6 and DeepSeek), which offer strong performance for a fraction of the cost.
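A routing table like this can be very small. The sketch below is one possible shape, not our actual router: the task categories, the tier labels, and the "escalate after two failed attempts" rule are all hypothetical assumptions.

```python
# Tiers ordered from cheapest to most capable (labels are assumptions).
TIERS = ["mini", "workhorse", "reasoning"]

# Hypothetical task categories mapped to a default tier.
TIER_BY_TASK = {
    "file_move": "mini",
    "deploy_trigger": "mini",
    "test_generation": "workhorse",
    "single_file_fix": "workhorse",
    "multi_file_bug_fix": "reasoning",
}


def pick_tier(task_kind: str, failed_attempts: int = 0,
              escalate_after: int = 2) -> str:
    """Pick a model tier for a task, moving up one tier after repeated failures.

    Unknown task kinds fall back to the workhorse tier.
    """
    index = TIERS.index(TIER_BY_TASK.get(task_kind, "workhorse"))
    if failed_attempts >= escalate_after:
        # Stop paying for retries on a model that keeps failing.
        index = min(index + 1, len(TIERS) - 1)
    return TIERS[index]


print(pick_tier("file_move"))
print(pick_tier("single_file_fix", failed_attempts=2))
print(pick_tier("multi_file_bug_fix"))
```

The escalation rule is what ties routing to the loop counts above: instead of letting a cheap model burn ten loops, the task moves up a tier once it has clearly stalled.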
5. Fix your CI/CD costs with caching and runner optimization
Tokens are only half the bill. The other half is CI/CD infrastructure. Our May run spent $500 on GitHub runner minutes alone. If you don't optimize your environments, you are burning money on reinstalling dependencies on every single agent run.
The solution
- Cache everything: We cache npm dependencies, container layers, and our custom CLI tools (dm-tools).
- At the start of a runner session, we restore from cache. At the end, we update the cache.
- You pay a minor network fee for the cache restore, but you save precious, expensive runner minutes.
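The restore-at-start, update-at-end cycle can be simulated in a few lines. This is a toy model and not a CI configuration: the in-memory `cache_store` dict stands in for the runner's cache service, and keying the cache on a hash of the lockfile is our assumption about a sensible key, not something the cache requires.

```python
import hashlib


def cache_key(prefix: str, lockfile_content: bytes) -> str:
    """Build a cache key that changes whenever the lockfile changes."""
    digest = hashlib.sha256(lockfile_content).hexdigest()[:16]
    return f"{prefix}-{digest}"


def run_session(cache_store: dict, lockfile_content: bytes) -> str:
    """Simulate one runner session; return how dependencies were obtained."""
    key = cache_key("npm", lockfile_content)

    # Start of session: try to restore from cache.
    if key in cache_store:
        dependencies = cache_store[key]
        outcome = "restored"
    else:
        # Stand-in for the slow, expensive dependency install.
        dependencies = f"node_modules for {key}"
        outcome = "installed"

    # ... the agent run would happen here ...

    # End of session: update the cache for the next run.
    cache_store[key] = dependencies
    return outcome


cache_store: dict = {}
lockfile = b'{"lockfileVersion": 3}'
first_run = run_session(cache_store, lockfile)
second_run = run_session(cache_store, lockfile)
print(first_run, second_run)
```

Only the first session with a given lockfile pays for the install. Every later session skips it until the dependencies change and the key changes with them.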