dyb

Claude Code Token Optimization or What Actually Works May 2026

The Two Problems

If you use... You pay with... Fix
Claude Code CLI Response quality Strip context bloat
Claude API Real dollars Stack four API techniques

These are separate tracks. Don't mix the numbers.


CLI Track: What's Already In Your Context Window

Before you type a single character: 20,000–30,000 tokens loaded.

Layer Tokens You control?
System prompt ~4,200 No
CLAUDE.md varies Yes
Auto memory ~680 Yes
MCP tool schemas ~120+ Yes
Skill descriptions varies Yes
Environment info ~280 No

Run /context live. Run /memory to audit. That's where waste hides.


CLAUDE.md: Cut to <500 Tokens

Benchmark: 3,847 tokens → 312 tokens = 91.9% reduction, zero quality loss.

Delete: Team rosters, meeting schedules, generic preambles, framework docs Claude already knows (Next.js, TypeScript, React), FAQs it can't act on.

Keep: Non-obvious build commands, architecture decisions that violate defaults, project-specific constraints.

Test: Would this surprise an experienced developer new to the repo? If not, cut it.

Pro tips: - HTML comments (<!-- note -->) cost zero tokens — use them for human notes - CLAUDE.local.md (gitignored) for personal sandbox URLs - Edits don't apply until restart or /compact


.claudeignore vs permissions.deny

Tool Behavior Use for
.claudeignore Advisory — Claude can override Signal layer: node_modules/, dist/, build/, .next/, __pycache__/, *.lock, *.min.*
permissions.deny Hard block — Read tool fails Enforcement: {"permissions":{"deny":["Read(node_modules/**)","Read(dist/**)"]}}

Use both. Check .claudeignore into version control.


Path-Scoped Rules: The 41% Win

Rules in .claude/rules/ load at startup by default. Add paths: frontmatter — they load only when Claude touches a matching file.

---
paths:
  - src/api/**
---
All endpoints validate with Zod. Never return raw Prisma errors.

This costs nothing during frontend or test work. Documented case: 1,358 lines always-loaded → 807 lines global = 41% overhead reduction.


MCP: The Silent 10K–20K Token Tax

Every connected MCP server injects its full tool schema into every request.

ENABLE_TOOL_SEARCH defers schemas until needed. Heavy setups recover 50,000–70,000 tokens.

Warning: Connecting/disconnecting MCP mid-session wipes your entire prompt cache. Plan architecture, not ad-hoc toggles.


Hooks: Compress Before Context

A PostToolUse hook on Bash commands turns 10,000-line build logs into 200-line error summaries. The open-source RTK tool reports 80–99% reduction on build/test output.


API Track: The 89.3% Stack

Four separate levers. All require deliberate prompt structure.

Technique Savings Key requirement
Model routing 77.1% Route trivial queries to cheaper models
Prompt caching 71.5% Stable content before dynamic, above threshold, breakpoints placed
Multi-turn caching 63.2% Cache persists across turns
Output budgeting 56.8% Cap max tokens per response

Combined: 89.3%

The gap is structural, not a bug. "I enabled caching but my bill barely moved" is the default outcome. Caching is prompt architecture.


Quick Wins


Bottom Line

90% savings come from six small structural disciplines, not one big toggle. Organize context, scope rules, architect prompts in that order.


Sources: token-optimizer benchmark, Anthropic pricing mid-2026, production CLI telemetry.

← Previous
Ricky polyglot software developer
Next →