2026-06-01

The Three Cardinal Sins of Experimentation: Peeking, Early Stopping, and Data Dredging

Power is the unappreciated cousin of significance, and ignoring it is how good analysts get burned. Here's the iPhone-vs-telescope intuition for power, why peeking inflates your false-positive rate toward 100%, and how data dredging lets you 'drain the river and find something' that was never there.

I'll start with one of my pet peeves. Anyone who's worked with me knows it: underpowered experiments. And why they matter.

Everyone has heard about statistical significance, to the point of exhaustion. The significance level, alpha, is roughly the probability of seeing something that's actually not there. A false positive. You want it small: in physics or drug discovery you want 98–99% confidence; in social sciences and marketing we usually live at 90–95%, which means your alpha is somewhere between 1-in-20 and 1-in-10. Fine. Everyone gets that you need to reach significance before you draw a conclusion.

What is not understood is the role of power. Power is the unappreciated cousin of significance, and it's where most marketing experiments quietly fall apart.

The iPhone-vs-telescope analogy

If significance is the probability of detecting something that's not there, power is the probability that, if something is there, you can actually see it.

The best way I've found to think about power is your iPhone camera versus the night sky.

The newest iPhones have genuinely powerful cameras. On a clear night, with a full bright moon, an iPhone 17 Pro can zoom in enough to pick out some definition on the craters of the moon. You can actually do it.

But you will not be able to resolve Saturn. You might catch the ring pattern as a smudge, but you'll get no detail. Why? Your camera simply doesn't have the resolution. If you want detail, you need a telescope. You need a bigger lens.

Experiments are exactly the same. The smaller the effect you're trying to measure, the bigger the lens you need — more power, more sample size. If you're trying to detect a 50% effect, you need very little sample. I've literally been handed the challenge: "can you design an experiment to detect an effect of 0.8% of revenue?" — and worse, sometimes "can you detect it after the fact?" The honest answer is you need a very big sample with a lot of power. You're asking to see minute things.

So the more you go down the assumption ladder — from a clean user-level A/B test, to geo experiments, to quasi-experimental methods, to pre/post — the less powerful your lens gets. That doesn't mean you can't measure things. It means we're not doing magic here. There are limits. There are things you can't see.

Halving the effect quadruples the sample

Here's the key insight that makes this scary: it does not scale linearly. Halving the minimum detectable effect quadruples the required sample size. Sample size grows with the inverse square of the effect, so a quarter of the effect needs sixteen times the sample. This gets out of hand fast.
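To make the scaling concrete, here is a minimal sketch using the textbook normal-approximation formula for a two-sided, two-sample test on means. The function name, the unit standard deviation, and the example MDE values are illustrative assumptions, not numbers from a real experiment.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(mde, sigma, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided, two-sample z-test on means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # bar for declaring significance
    z_power = z.inv_cdf(power)          # extra margin needed to hit the target power
    return ceil(2 * ((z_alpha + z_power) * sigma / mde) ** 2)


# Hypothetical metric with standard deviation 1.0, in the same units as the MDE.
for mde in (0.10, 0.05, 0.025):
    n = sample_size_per_arm(mde, sigma=1.0)
    print(f"MDE {mde:<5} -> {n:>6} users per arm")
```

Each halving of the MDE should roughly quadruple the printed sample size, because the MDE sits squared in the denominator.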

The minimum detectable effect — the MDE — is the smallest effect you can actually measure, and it's the input that drives everything. So how do you pick it? My golden rule: either the team running the experiment knows what to expect, or the finance team has a forecast. At one subscription company I worked with, the forecast was 0.8% of sales. Okay — that's your best guess for the minimum acceptable effect, so that's what you design to detect. For calibrating an MMM, your MDE is whatever the model says the channel is worth: if the MMM says a channel drives 3%, that's the effect you're trying to see.

And the convention worth burning into your head: aim for at least 80% power. That means 4 out of 5 times you'll detect a real effect if it's there. I've seen experiments run live at 50% power. I've seen one at 30% power. Stop and think about what that means: roughly seven times out of ten, you literally cannot see the effect even when it exists.
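As a rough illustration of how an experiment ends up at 50% or 30% power, the sketch below computes approximate power for a two-sided, two-sample z-test at a given sample size. The effect, standard deviation, and sample sizes are hypothetical values, chosen so the first case lands near 80%.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test on means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected z-statistic if the true difference equals `effect`.
    shift = effect / (sigma * sqrt(2 / n_per_arm))
    # Probability that the statistic lands beyond either critical value.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical numbers: true effect 0.05, standard deviation 1.0.
for n in (6280, 3100, 1650):
    print(f"{n:>5} users per arm -> power {power_two_sample(0.05, 1.0, n):.0%}")
```

Under these assumptions, running with about half the planned sample drops you to roughly a coin flip, and about a quarter of it leaves you near 30%.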

One more rule, because it gets ignored: do the power analysis before you run the experiment, not after. Post-experiment power calculations that plug in the effect you observed are circular: they are just the p-value in different clothes, so a non-significant result will always look underpowered and you learn nothing new. And this matters in marketing because honestly, the only thing worse than no experiments is badly designed ones, where you don't have the power to detect, so the results are no better than a coin flip, but now everyone has a dangerous sense of false confidence.
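One way to see the problem with after-the-fact power, assuming it is computed by plugging the observed effect back into the power formula for a two-sided z-test: the result depends on nothing but the p-value. The function name below is illustrative.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """'Power' obtained by treating the observed effect as the true effect."""
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_value / 2)  # |z| implied by the two-sided p-value
    z_crit = z.inv_cdf(1 - alpha / 2)
    return (1 - z.cdf(z_crit - z_obs)) + z.cdf(-z_crit - z_obs)


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power {observed_power(p):.0%}")
```

A result sitting exactly at p = 0.05 comes out at about 50% "power", and anything less significant comes out lower, so the calculation only restates the p-value you already had.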

Geo experiments make this harder still. You have millions of users but only tens or hundreds of geos, and geos are noisier. So you often can't get power by accumulating sample — you have to increase the effect size instead. That usually means swinging your spend hard: instead of nudging budget, you shut the channel off in one set of regions to magnify the effect. If you can choose between more sample and more effect size, choose effect size every time. It's far easier to get power that way.

Okay. Now imagine you did all of this right. You ran the power calc up front, you got the sample size, your stakeholders accepted it, the design was clean. It takes three weeks to read out. This is exactly where the three cardinal sins come for you.

Cardinal sin #1: Peeking

In your life, how many stakeholders have asked you, a week in, "can we just see the results now? Can we get weekly reports?"

My friend Shaun, who's worked with me, laughs every time I bring this up, because he knows it drives me absolutely insane.

Here's the myth. People believe statistical significance works like a switch: once you open the platform and see p < 0.05, you've found the truth, and it stays there. What actually happens — and I draw this in red, because red is evil — is that your significance is all over the place early on, then it converges, then it settles down. Early in the test, it's noise.

And here's the genuinely counterintuitive part, the one I need people to sit with: every time you look, it is itself a statistical test. The act of opening the experimentation platform and asking "is it significant?" is not free. You're not just counting beans that were already sorted into "significant" and "not." Each look is a fresh roll of the dice with its own chance of being wrong.

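Say each peek has a 90% chance of giving you a good read and a 10% chance of a fluke. Peek three times and the errors accumulate: if the looks were independent, you'd be at roughly a 27% chance of at least one fluke instead of 10%. Looks at the same accumulating data overlap, so the real figure is lower than that, but it is still well above the alpha you signed up for. This is why people say peeking inflates your alpha. The whole mathematical architecture of the test assumes you start at time zero and wait until the full duration T. Peek weekly on an 8-week test, stop at the first significant read, and your real false-positive rate is a multiple of the nominal one. Keep the test open and keep peeking with no fixed end, and you are eventually all but guaranteed to see something that isn't there.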
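A small simulation makes the inflation visible. This is a sketch under simplifying assumptions: A/A tests with no true effect, unit-variance outcomes, equal-sized batches of users between looks, and a nominal two-sided 5% threshold. The function name and all parameters are illustrative.

```python
import random
from math import sqrt


def false_positive_rate(n_looks, n_per_look=100, n_sims=20000, z_crit=1.96, seed=0):
    """Share of A/A tests (no true effect) flagged significant at ANY look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        diff_sum, n = 0.0, 0
        for _ in range(n_looks):
            # Difference between the two arms' totals for this batch of users:
            # with unit-variance outcomes it is N(0, 2 * n_per_look).
            diff_sum += rng.gauss(0.0, sqrt(2 * n_per_look))
            n += n_per_look
            z = diff_sum / sqrt(2 * n)  # z-statistic on all data so far
            if abs(z) > z_crit:
                hits += 1               # stop at the first "significant" peek
                break
    return hits / n_sims


for looks in (1, 3, 8):
    print(f"{looks} look(s): false-positive rate ~ {false_positive_rate(looks):.3f}")
```

With a single look at the end, the rate should sit near the nominal 5%. With eight weekly looks it should come out several times higher, even though nothing real is happening in any of these tests.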

The fix exists, but you have to know about it: alpha spending (and more generally, sequential testing). It changes the bar to declare significance depending on when you look. At the end you might accept p < 0.05 as usual — but if you peek in the first few days, the bar might be 1%, or 0.1%. That's what stops you from accepting a flukey early result and shutting the test down.
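As an illustration of how the bar moves, here is a sketch of one common choice, the O'Brien-Fleming-type spending function from the Lan-DeMets approach. Whether your platform uses this function or a different one is an assumption to check. It reports how much of the total alpha you are allowed to have spent by each look on a hypothetical 8-week test.

```python
from math import sqrt
from statistics import NormalDist


def obf_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent under an O'Brien-Fleming-type spending function."""
    z = NormalDist()
    z_half = z.inv_cdf(1 - alpha / 2)
    return 2 * (1 - z.cdf(z_half / sqrt(info_fraction)))


weeks = 8  # hypothetical test length, one look per week
for week in range(1, weeks + 1):
    t = week / weeks  # fraction of the planned sample collected so far
    print(f"week {week}: cumulative alpha allowed = {obf_alpha_spent(t):.5f}")
```

Almost nothing is spent in the first weeks, and the full 5% only becomes available at the planned end. Turning the cumulative spend into exact per-look p-value thresholds takes additional numerical work that dedicated group-sequential software handles.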

Cardinal sin #2: Early stopping

Peeking's evil twin. You peek a week in, you see a "significant" 10% lift on the promo, and you stop. Your stakeholders are happy as clams, because nobody likes an experiment that produces no effect, and you've killed the test early.

The problem is your p-value hasn't stabilized. You have no guarantee the effect is real, because significance is still bouncing around in that early-noise regime. And a quick aside that trips up a lot of stakeholders: statistical significance does not prove there is an effect. It means the data would be surprising if there were none, at the error rate you agreed to up front. The reverse holds too: a non-significant result is not proof that there is no effect, especially in an underpowered test. Significance is about confidence in the reading, not the existence of the thing.

Most platforms don't correct for early stopping, and most people doing experiment math by hand don't either. There are real tools for "anytime-valid" stopping (sequential testing, certain Bayesian models), but unless you're using them, the rule is brutally simple: don't stop early. Let it run to T.

Cardinal sin #3: Data dredging

This is the worst one, and it's the most seductive.

Your test was: did the promotion increase sales? The result comes back: no effect. And then someone says, "well… what if we look at a different metric? What about impressions? Sessions? Checkouts? What if we segment — for all users there's nothing, but what about just high-value users?"

This is the same disease as peeking, just along a different axis. Instead of opening the box multiple times across time, you're opening it multiple times across segments and metrics. You're running the same test on the same data for a question it was never designed to answer. Do this across 8 different metrics at a 5% alpha and, if the metrics were independent, you'd have about a one-in-three chance of at least one false positive. Add a handful of segments on top and you are close to guaranteed to find something.
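As a rough check on how fast the false positives pile up, the sketch below computes the chance of at least one false positive across several tests, under the simplifying assumption that the tests are independent and none of them has a real effect. It also shows the Bonferroni correction, one standard (and conservative) way to tighten the per-test bar. The function names are illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for m in (1, 8, 40):  # e.g. one metric, 8 metrics, 8 metrics x 5 segments
    raw = familywise_error_rate(m)
    corrected = familywise_error_rate(m, bonferroni_alpha(m))
    print(f"{m:>2} tests: uncorrected {raw:.0%}, with Bonferroni {corrected:.1%}")
```

Real metrics and segments are correlated, so the uncorrected figures are an upper-end approximation, but the direction is the point: every extra cut of the data is another chance for noise to clear the bar.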

This is data dredging, and the name is exactly right: you're literally draining the river to find something. And you will — by the nature of how these methods work. Look hard enough and you can find almost anything in statistics.

I've worked with companies where I never had to explain this — a pleasure. And I've worked with companies where I spent months and months arguing with stakeholders not to do it. Because here's how it ends: you dredge until you find a 2% uplift in some segment, you report it (and let's be honest, half the time without confidence intervals — always show your confidence intervals), the director puts $20 million behind it, and six months later nobody can find the return. Then comes the question: "why did you put the money there?" "Well… analytics told me it had an effect." That's the conversation these rules exist to prevent.

It all circles the same thing

If you take one thing from all of this, it's that peeking, early stopping, and multiple comparisons all circle the same point: maximizing your ability to measure something and being sure that what you measured actually exists. We live in the realm of probability, so I need to answer two questions — can I even measure this, and is the thing the test surfaced real, or just a product of random chance?

Every one of these sins inadvertently increases the chance you'll "see" an effect that was never there. And I know it's annoying to stakeholders when we're this strict. But these rules are for everyone's benefit. Measuring marketing is hard enough — you can't calibrate an MMM, you can't trust a readout, if you're not sure about your tests. Proper test design is really, really, really important.

We're statisticians, not magicians. The lens has a resolution limit, the river will always cough up something if you drain it, and the box gets less trustworthy every time you open it early. Respect those three things and most of your experiment program takes care of itself.

Designing tests with enough power, calibrating MMMs from real experiments instead of platform ROAS, and keeping stakeholders out of the river — that's a big part of what I teach in my course and consulting. If you want to go deeper, I'm at marketingscience.dev.