dyb

Claude Code Token Optimization or What Actually Works May 2026

The Two Problems

If you use...	You pay with...	Fix
Claude Code CLI	Response quality	Strip context bloat
Claude API	Real dollars	Stack four API techniques

These are separate tracks. Don't mix the numbers.

CLI Track: What's Already In Your Context Window

Before you type a single character: 20,000–30,000 tokens loaded.

Layer	Tokens	You control?
System prompt	~4,200	No
CLAUDE.md	varies	Yes
Auto memory	~680	Yes
MCP tool schemas	~120+	Yes
Skill descriptions	varies	Yes
Environment info	~280	No

Run /context live. Run /memory to audit. That's where waste hides.

CLAUDE.md: Cut to <500 Tokens

Benchmark: 3,847 tokens → 312 tokens = 91.9% reduction, zero quality loss.

Delete: Team rosters, meeting schedules, generic preambles, framework docs Claude already knows (Next.js, TypeScript, React), FAQs it can't act on.

Keep: Non-obvious build commands, architecture decisions that violate defaults, project-specific constraints.

Test: Would this surprise an experienced developer new to the repo? If not, cut it.

Pro tips: - HTML comments () cost zero tokens — use them for human notes - CLAUDE.local.md (gitignored) for personal sandbox URLs - Edits don't apply until restart or /compact

.claudeignore vs permissions.deny

Tool	Behavior	Use for
`.claudeignore`	Advisory — Claude can override	Signal layer: `node_modules/`, `dist/`, `build/`, `.next/`, `__pycache__/`, `.lock`, `.min.*`
`permissions.deny`	Hard block — Read tool fails	Enforcement: `{"permissions":{"deny":["Read(node_modules/)","Read(dist/)"]}}`

Use both. Check .claudeignore into version control.

Path-Scoped Rules: The 41% Win

Rules in .claude/rules/ load at startup by default. Add paths: frontmatter — they load only when Claude touches a matching file.

---
paths:
  - src/api/**
---
All endpoints validate with Zod. Never return raw Prisma errors.

This costs nothing during frontend or test work. Documented case: 1,358 lines always-loaded → 807 lines global = 41% overhead reduction.

MCP: The Silent 10K–20K Token Tax

Every connected MCP server injects its full tool schema into every request.

ENABLE_TOOL_SEARCH defers schemas until needed. Heavy setups recover 50,000–70,000 tokens.

Warning: Connecting/disconnecting MCP mid-session wipes your entire prompt cache. Plan architecture, not ad-hoc toggles.

Hooks: Compress Before Context

A PostToolUse hook on Bash commands turns 10,000-line build logs into 200-line error summaries. The open-source RTK tool reports 80–99% reduction on build/test output.

API Track: The 89.3% Stack

Four separate levers. All require deliberate prompt structure.

Technique	Savings	Key requirement
Model routing	77.1%	Route trivial queries to cheaper models
Prompt caching	71.5%	Stable content before dynamic, above threshold, breakpoints placed
Multi-turn caching	63.2%	Cache persists across turns
Output budgeting	56.8%	Cap max tokens per response

Combined: 89.3%

The gap is structural, not a bug. "I enabled caching but my bill barely moved" is the default outcome. Caching is prompt architecture.

Quick Wins

[ ] CLAUDE.md < 500 tokens
[ ] .claudeignore in version control
[ ] permissions.deny for hard blocks
[ ] Path-scoped rules with paths: frontmatter
[ ] ENABLE_TOOL_SEARCH for heavy MCP setups
[ ] Hooks on Bash/build output
[ ] API: stable→dynamic prompt structure with breakpoints
[ ] API: model routing + output budgeting on every call

Bottom Line

90% savings come from six small structural disciplines, not one big toggle. Organize context, scope rules, architect prompts in that order.

Sources: token-optimizer benchmark, Anthropic pricing mid-2026, production CLI telemetry.