Claude Code Token Optimization or What Actually Works May 2026
The Two Problems
| If you use... | You pay with... | Fix |
|---|---|---|
| Claude Code CLI | Response quality | Strip context bloat |
| Claude API | Real dollars | Stack four API techniques |
These are separate tracks. Don't mix the numbers.
CLI Track: What's Already In Your Context Window
Before you type a single character: 20,000–30,000 tokens loaded.
| Layer | Tokens | You control? |
|---|---|---|
| System prompt | ~4,200 | No |
| CLAUDE.md | varies | Yes |
| Auto memory | ~680 | Yes |
| MCP tool schemas | ~120+ | Yes |
| Skill descriptions | varies | Yes |
| Environment info | ~280 | No |
Run /context live. Run /memory to audit. That's where waste hides.
CLAUDE.md: Cut to <500 Tokens
Benchmark: 3,847 tokens → 312 tokens = 91.9% reduction, zero quality loss.
Delete: Team rosters, meeting schedules, generic preambles, framework docs Claude already knows (Next.js, TypeScript, React), FAQs it can't act on.
Keep: Non-obvious build commands, architecture decisions that violate defaults, project-specific constraints.
Test: Would this surprise an experienced developer new to the repo? If not, cut it.
Pro tips:
- HTML comments (<!-- note -->) cost zero tokens — use them for human notes
- CLAUDE.local.md (gitignored) for personal sandbox URLs
- Edits don't apply until restart or /compact
.claudeignore vs permissions.deny
| Tool | Behavior | Use for |
|---|---|---|
.claudeignore |
Advisory — Claude can override | Signal layer: node_modules/, dist/, build/, .next/, __pycache__/, *.lock, *.min.* |
permissions.deny |
Hard block — Read tool fails | Enforcement: {"permissions":{"deny":["Read(node_modules/**)","Read(dist/**)"]}} |
Use both. Check .claudeignore into version control.
Path-Scoped Rules: The 41% Win
Rules in .claude/rules/ load at startup by default. Add paths: frontmatter — they load only when Claude touches a matching file.
--- paths: - src/api/** --- All endpoints validate with Zod. Never return raw Prisma errors.
This costs nothing during frontend or test work. Documented case: 1,358 lines always-loaded → 807 lines global = 41% overhead reduction.
MCP: The Silent 10K–20K Token Tax
Every connected MCP server injects its full tool schema into every request.
ENABLE_TOOL_SEARCH defers schemas until needed. Heavy setups recover 50,000–70,000 tokens.
Warning: Connecting/disconnecting MCP mid-session wipes your entire prompt cache. Plan architecture, not ad-hoc toggles.
Hooks: Compress Before Context
A PostToolUse hook on Bash commands turns 10,000-line build logs into 200-line error summaries. The open-source RTK tool reports 80–99% reduction on build/test output.
API Track: The 89.3% Stack
Four separate levers. All require deliberate prompt structure.
| Technique | Savings | Key requirement |
|---|---|---|
| Model routing | 77.1% | Route trivial queries to cheaper models |
| Prompt caching | 71.5% | Stable content before dynamic, above threshold, breakpoints placed |
| Multi-turn caching | 63.2% | Cache persists across turns |
| Output budgeting | 56.8% | Cap max tokens per response |
Combined: 89.3%
The gap is structural, not a bug. "I enabled caching but my bill barely moved" is the default outcome. Caching is prompt architecture.
Quick Wins
- [ ] CLAUDE.md < 500 tokens
- [ ]
.claudeignorein version control - [ ]
permissions.denyfor hard blocks - [ ] Path-scoped rules with
paths:frontmatter - [ ]
ENABLE_TOOL_SEARCHfor heavy MCP setups - [ ] Hooks on Bash/build output
- [ ] API: stable→dynamic prompt structure with breakpoints
- [ ] API: model routing + output budgeting on every call
Bottom Line
90% savings come from six small structural disciplines, not one big toggle. Organize context, scope rules, architect prompts in that order.
Sources: token-optimizer benchmark, Anthropic pricing mid-2026, production CLI telemetry.