// AI

Why you’re hitting your limits in Claude

Tokens have always mattered. When the context window was the only thing anyone worried about, the game was pacing yourself against the 200K ceiling, watching retrieval degrade past the halfway mark, compacting before auto-compaction did it at the least intelligent point of the session. That part hasn't changed. What changed is what else counts against you now. 5-hour session windows. Weekly caps across plans. Tighter model access by tier. If you remember working without those, you know they're not coming back any time soon. Usage is cost to the provider, and every new surface Anthropic ships (Claude Code, the desktop app, skills, agents, MCPs, the API) drinks from the same limits pool, even when the dashboard displays the buckets separately. The skill that matters now is optimization. Getting more out of what you already pay for. The entry point into that skill is understanding where your tokens actually go.

The fee before you type

The first thing worth doing, if you haven't already, is opening a fresh session and running /context before you type anything. That number is the fee you pay just to exist in this session. System prompt. Project memory. Every tool the harness exposes. Every MCP server you have installed. Every skill. All of it loads before your first keystroke, and all of it counts against your window.

On a lean project the reading might be a few thousand tokens. On one I'd neglected for a while, it wasn't. The gap between those two numbers is where most people quietly lose their session budget, and it never shows up as a visible expense. It's just there.

Every new message rereads the old ones

The other thing worth getting clear on is that tokens don't add linearly across a conversation. They compound. Every new message rereads the entire session up to that point, so by turn thirty the per-message cost is nothing like the first prompt. That's also where context rot shows up. More to attend to, less precision per attention slice, more drift, more edits without reads.

This is the part that makes a session look fine in the middle and disastrous by the end. You type one more prompt, the model pulls the whole history, answers, and the cost for that single turn dwarfs the first ten turns combined. The progress bar doesn't make any of this visible. It looks like there's room until there isn't.

Compact earlier than the system wants you to

Auto-compaction kicks in near the ceiling. That's the worst possible moment to summarize, because the model is already at its most distracted by the time it fires. Anything it keeps on the way down is a fraction of the original detail, summarized at the least intelligent point of the session.

So I compact manually, earlier. On the 200K model I trigger it around 60%. On the million-token model I still compact around 60% of my working ceiling, which for me sits closer to 120K tokens. Past that, retrieval degrades anyway. Auto-compact is a safety net, not a plan.

A habit I underused for too long is /rewind. When Claude goes wrong and I tell it to try again, the broken attempt stays in the context, reread on every turn, taxing every future response. Rewinding drops it cleanly. The fix then runs over clean history instead of over the failure.

Scope MCPs per project, not globally

The thing that had the biggest effect on my baseline was how I wire MCP servers. For a while I had every useful one installed globally, because why not. The problem is they all load into every session, whether the current project needs them or not. Every new conversation pays the full fee for the full set.

So I broke them out per project. The scraping MCP lives on the project that scrapes. The Chrome devtools MCP lives on the frontend project. The design MCP lives on the design work. The global config stays intentionally small. A new project only pays for the tools that project actually uses, and the first /context reading after a fresh session stays low.

This is where a quieter long-term question lives. Every skill, every agent, every MCP is a few hundred to a few thousand tokens you pay on every session, forever. At some point, the surface area of your tooling starts to cost more than the tooling saves you.

Keep the meter visible while you work

In the terminal the fix for all of this is the status line, which surfaces a running context count for the current session. Glance at it between prompts and you naturally start to pace yourself. Low-friction enough that checking it becomes a reflex.

I spend most of my day in VS Code on Linux, and the native integration there didn't surface the same live readout in a place I'd actually see. So I wrote a small extension for myself that pins a per-session usage indicator into the editor's status bar. Nothing fancy. Does one thing. The habit of glancing at it changes the way I use the session, the same way glancing at the save state changes how I think about a file.

Optimization is the skill now

None of this is new. Anthropic has been publishing variations of the same advice for years. What changed is the cost of ignoring it. With 5-hour windows and weekly caps in the mix, an expensive session doesn't just get slow near the ceiling, it burns budget you can't easily get back. The people getting the most out of Claude on a cheaper plan treat tokens like a shared limited resource. Same way you'd treat a database connection, or a rate-limited API.

The session-management reflexes I'm describing used to be optional polish. They're becoming the job.

  • Open a new session with /context before anything else, and know what you already owe
  • Keep the MCP footprint scoped per project, not global
  • Compact manually before you hit the ceiling, not after
  • Rewind broken attempts out instead of leaving them to be reread
  • Watch the meter while you work

The limits pool isn't going to grow back. Usage is cost on the provider side, and the cheaper plans only work if the caps stay tight. The only variable under your control is how many tokens you burn on things you didn't need to burn them on. That's the skill.