marketingscience.dev

Context Engineering for Long Analytics Sessions (Escaping the Dumb Zone)

Long analytics sessions rot. Here's how I keep Claude Code and Codex sharp across a two-hour MMM build — six patterns, one war story, and why a bigger context window is just a bigger haystack.

Let me start with the moment that made me write this.

I was building an MMM for my Maven course. One of those long sessions where you start with exploration, get the database ready, run diagnostics, fit a model, and two hours later you look up and you've eaten half your context window without noticing. And at some point the model just started spewing nonsense. It didn't crash. Nothing dramatic. The numbers just stopped feeling right. It started making stupid modeling assumptions. It was forgetting things it knew thirty prompts ago.

So I did what we all do — I steered it. And it came back with the three words that should make every analyst's stomach drop:

"You're absolutely right."

That's the tell. When the model starts agreeing with you a little too eagerly, you're not in a productive session anymore. You're deep in the dumb zone of the context window, and everything you do from here is going to be worse than what you'd have gotten an hour ago.

This is specific to Claude Code in my examples, but to be fair it applies to every major harness — Claude Code, Codex, OpenCode, all of them. There are nuances between models (Opus 4.6 decays slower than Sonnet; GPT-5.4 holds up well out to 250–400k tokens), but the underlying problem is the same everywhere.

Why long sessions break

People talk about long context as a coding problem. In analytics we don't immediately see it, because we think of analytics as "do this report, make this pivot table." But modeling an MMM or designing an experiment? You go through exploration, prep, diagnostics, calibration, modeling. An MMM alone can take 15 to 30 minutes just to run, and the whole workflow has a dozen phases. You end up an hour or two deep without realizing it. Long analytical sessions are a reality, and they need to be accounted for.

Here's the thing I want to hammer home: this is not a context window size problem. It's an attention and instruction-budget problem. The context isn't missing. It's that a bigger context window doesn't get paid attention to equally across the session. Attention drifts based on how the model is built, and you get stuck in some local minimum of the model's distribution.

Remember when Gemini shipped the first 1-million-token window? We were all working with 100k at the time, and everyone said "RAG is dead — just dump your entire database, your entire codebase, into the window and you're done." That was never the case. You're letting context rot.

Lost in the middle

The core insight is the needle in a haystack problem, also called lost in the middle. There's a paper by Liu et al. (2023) that traces it back to how attention works under the transformer: models pay closer attention to the beginning and the end of the context window, and the middle degrades.

The benchmark — Chroma has done this extensively — is you insert a "needle" (a record you want retrieved) somewhere around 40–60% of the way through the window, then ask the model to fetch it. What they found, and what dozens of papers since 2023 keep finding: the more you fill the window, the harder it is to extract the needle.

Now connect that to how a long session actually goes. You heavily load context at the start. You do all your real work in the middle. That middle is exactly the dead zone — the dumb zone — where retrieval is weakest. That's not a coincidence; that's the architecture.

A bigger window is just a bigger haystack

So here's the trap. When a lab ships "the same model with a bigger window," your instinct is to think it got smarter. It didn't. It doesn't have more parameters or bigger reasoners. They've used a mathematical trick — something like the YaRN algorithm — to let the model search through more tokens. Sometimes that even means reducing attention on parts of the architecture to make room for more tokens. Go back to 2023 and the sliding-window approach: you bought a bigger window by literally sliding it, which meant longer sessions forgot the earlier parts.

Opus 4.6 with the original 200k window and Opus 4.6 with the 1M window are not two intelligences. Same model, bigger window. Prompt adherence and mid-context reasoning improve in post-training — that's the real jump from Opus 4.5 to 4.6, or between Sonnets. A bigger window, on its own, is just a bigger haystack to lose your needle in.

How we make it worse (and we do)

The structural problem is bad enough. Then we pile on.

CLAUDE.md bloat. This is the big one. There's a study out of the University of Zurich that tested repos with and without a CLAUDE.md file, and found that some CLAUDE.md files actually made Claude Code worse. To be fair, even Claude Code's own init command is guilty of this. The reason: CLAUDE.md is re-injected on every single turn. Every time you reply, you pay for the system prompt, the skill scaffolding, the MCP definitions, and then your CLAUDE.md on top. If it's 2,000 lines of overly prescriptive instructions, you're burning context before you've done anything — and overly prescriptive files are bad practice anyway. Even Anthropic says don't over-prescribe.

The rule I live by: CLAUDE.md is for things the model gets wrong, not things it already knows. Don't tell it you're using Python so use a package manager — it knows that. Do tell it "use uv, not pip," because Claude always defaults to pip and I prefer uv. A colleague of mine has "do not use em-dashes" in his, because these models love an em-dash. That's the right altitude: repeated behavior you want to stop, injected every turn so the model can't forget. What does not belong in there is, say, "here are the channels you should use for the MMM."

Verbose MCPs. MCP servers load at session start, so too many means you flood the window with tool definitions before you type a word. And some are incredibly chatty. The BigQuery MCP is a recent, personal example — let the model loose querying your warehouse tables and it'll fill your window with the model's queries and BigQuery's verbose replies. Newer harnesses do progressive disclosure now (a Jira or Linear MCP used to eat 30k tokens at startup; it's better), but it's still a problem.

Stale markdown memory. Memory lives in markdown these days, and markdown is documentation, and documentation goes stale fast. I was on a side project right before a recent session where yesterday I'd pivoted directions but didn't update the context files. So today one file said "phase 1 is X," another said "phase 1 is Y," and the model got confused. Of course it did.

Do-it-yourself / not-invented-here syndrome. Ask these models for causal inference and they'll write the algorithm from scratch. Ask for a synthetic control and they'll write a synthetic control in Python, all at once, instead of reaching for a tool. It shows they're smart, fine — but it dumps the entire generated algorithm into your window, and if the model takes a wrong turn from there, things get weird.

The war story

Which brings me to the session that earned this post its subtitle.

I was running a geo-experiment design without skills — letting the model do it itself. I asked it to simulate power: can this design detect a 5% and a 15% effect? Reasonable ask. And it went down a rabbit hole building Markov chain Monte Carlo from scratch. Spent twenty minutes on it. Spawned sub-agents to write the script. Then realized it was taking too long, so the main agent stopped its own sub-agents, cut things down, and the output was an error.

That error broke the entire context window. Because then the model generated what it predicted my reply would be — its guess at what I'd say next — and it answered its own question. And kept going. When I finally asked, "why did you do that?", it told me:

"Because you told me to."

I did not tell it to. This was Opus 4.6 on high effort — not Sonnet, not some weak setup. A frontier model, an hour deep in a rotted context, decided to pretend it was me and answer itself. That's what context rot does at the extreme. It's a cautionary tale on two fronts: be careful with long sessions, and be careful letting the model do statistical work itself. For simple things, Python and Bash are all you need. For statistical analysis, this goes wrong fast.

The six patterns I actually use

1. Audit your CLAUDE.md. If you take one thing from this, take this. Open your CLAUDE.md (or AGENTS.md for Codex/OpenCode) and cut anything the model can look up on its own. Keep only what it repeatedly gets wrong or what's specific to your business context. Prescriptive toward execution, not toward explaining things it already knows.

2. Progressive disclosure. This is how skills work under the hood, and it's a great pattern to build around. The idea: tell the model only what it needs, when it needs it. At launch, Claude Code reads only a skill's front matter — name and description — nothing else. When the description matches the task ("when designing a geo-experiment, use this"), then it loads the body. You don't put a manual on causal inference in the core text. You put a short description, and then references: "for synthetic control vs. diff-in-diff, see this file; for GeoLift details, see that one." Fill the window only with what you need.

3. Checkpoints, and be careful with compaction. Auto-compaction kicks in around 80% and summarizes your conversation into a fresh window. In Claude Code that summary is written by Haiku — it gets the whole conversation and is asked to compress it, and you will lose information, possibly the information you needed most. Compaction mostly sucks right now. So I checkpoint instead: if I'm doing research, I explicitly ask the model to write every conclusion to a research file as it goes, then I start a new session pointed at that file. That way I control what gets summarized and at what fidelity. (Anecdotally, Codex's compaction feels less damaging than Claude Code's lately — your mileage may vary.)

4. Sub-agents as context firewalls. I'm borrowing this framing from a developer named Dexter, and I agree with him: stop anthropomorphizing agents. We love to invent "the analyst agent," "the summarizer agent," "the data scientist agent." Wrong mental model. Sub-agents are context firewalls. Back to BigQuery: instead of the main agent flooding your window exploring tables, joining data, and chewing through errors, you spawn a sub-agent. It gets its own context window, does the messy work out there, summarizes the conclusion, and sends only that back. Same with "analyze this codebase" — Claude spawns sub-agents to do the grepping and reading, and only the summary lands in your main context. They're not little people. They're a way to manage sub-context.

5. RPI — one mission, one session. Research, Plan, Implement. Do your research (where's the data in BigQuery, how do I do this), export that context out, and start a new session for planning. Plan the steps — get the dates, run power analysis, find minimum detectable effect, simulate — then start another session to implement. One mission, one session. For tiny stuff, sure, do it all at once. But when the stakes are high enough, splitting gives you dramatically better results and near-free documentation: you end up with research.md, plan.md, and implement.md artifacts.

6. Regenerate, don't stack. If the model goes off track and you've corrected it two or three times, stop correcting it. Go back, rewrite the original prompt, and redo the session. Stacked corrections — "no, go here; no, that's wrong; no, over there" — leave scar tissue in the context window. And scar tissue anchors the model, because at its core it's predicting the next token given every token before it. Leave those broken branches open and it'll try to honor them. Always persist real decisions to files; don't trust the model to remember a choice you made at the top of a 1M-token session.

Some rough numbers

Heuristics, not laws — models differ. Starting context should be around 20% of the window, no more. At 60–70%, I start thinking about wrapping up and spinning a new session; in Claude Code I'm one of the people who gets nervous around 60%. The practical sweet spot is about 200k tokens. With GPT-5.4 I've stretched reliably to 400k. But it also depends on the session and, honestly, on how hard the provider's infrastructure is getting hit — last week the consensus online was Claude Code got dumber because thinking tokens were being throttled, which makes context discipline matter even more.

The cost, and why it's worth it

None of this is free. It would be lovely to talk to a harness for two hours and have something shiny fall out. Instead you split sessions, you research, you plan, you actually think and tell the model what to do. That's more effort. But the payoff is consistent, trustworthy results instead of a model an hour deep pretending to be you.

So three things to do after reading this: open your CLAUDE.md and cut the bloat, add checkpoint prompts to your workflow, and extract your repeated methodologies into skills. If you find yourself teaching Claude the same statistical procedure over and over, that's a skill waiting to be written.


I teach this kind of thing — marketing measurement, MMM, attribution, experimentation, and now the agentic-workflow side of it (skills, context engineering, the works) — at marketingscience.dev. If you're an analyst trying to do real statistical work with these tools without falling into the dumb zone, that's exactly who the course is for. Come say hi.