2026-05-28

7 MMM Vendor Red Flags (and How to Rescue a Distrusted Model)

How to tell whether your MMM vendor built you something useful or something sophisticated-looking — and a field-tested playbook for saving a model the org has already stopped trusting, without burning it down and starting from scratch.

I've sat on both sides of this table. I've been the consultant brought in because the performance team stopped believing the model. I've been the person who got paid for six months (and I don't come cheap) to do nothing but clean up data. So when I tell you most MMM engagements go sideways for boring, avoidable reasons, that's not cynicism. It's the t-shirt I've worn.

This post is two things. First, the red flags I look for when I evaluate a vendor or a model. Second — and more useful if you're already in the hole — the playbook for rescuing a model the organization has decided it doesn't trust, without throwing the baby out with the bathwater.

Let's be honest about something up front: the mere word "Bayesian" is not a feature. I've spent weeks of my life extolling the virtues of Bayesian MMMs, and I'll still tell you this isn't voodoo and it isn't magic. There are vendors who will sell you "an incredibly advanced model" whose entire sophistication is the word Bayesian. That doesn't make it advanced. An advanced model is a hierarchical thing with a model-inside-a-model — branding feeding sessions feeding orders, a meta-model on top reconciling them, geography layered under that. That's advanced. A vanilla model with a fancy adjective is just a vanilla model. So calibrate your skepticism accordingly, and watch for these.

Red flag 1: "It's accurate because the forecast error is small"

This is my favorite litmus test, because it sorts vendors instantly.

A vendor will proudly show you forecast accuracy — "we predict what already happened to within 2%." And they'll treat that as proof the model is good to go. It is not. Accuracy matters, I'm not saying it doesn't. But MMMs are causal models, and accuracy is not causal correctness. You can build a model that forecasts beautifully in which every single channel coefficient is wrong, as long as the channels are correlated enough to reconstruct the line. I know of a company whose MMM is literally correlational by design — smart data scientists, they know exactly what they're doing, and they don't pretend otherwise. The danger is the vendor who shows you 99% accuracy and lets you believe that means the channel attribution is correct. The ability to forecast is not the only way to validate a model, and it should never be the headline.
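To make the accuracy-versus-causality point concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the channel names, the spend levels, and the "true" effects are invented, and plain least squares stands in for a real MMM. TV drives all the sales, YouTube drives none, but their spend moves together, so a model that credits everything to YouTube still forecasts to within a couple of percent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks = 156  # three years of weekly data

# Two channels whose spend moves together (same flighting plan, say).
tv = rng.normal(100, 20, n_weeks)
youtube = tv + rng.normal(0, 2, n_weeks)  # almost a copy of TV

# Ground truth we control: TV drives sales, YouTube does nothing.
sales = 500 + 3.0 * tv + 0.0 * youtube + rng.normal(0, 10, n_weeks)

# A deliberately wrong model: intercept plus YouTube only, TV left out.
X_wrong = np.column_stack([np.ones(n_weeks), youtube])
beta_wrong, *_ = np.linalg.lstsq(X_wrong, sales, rcond=None)

pred = X_wrong @ beta_wrong
mape = np.mean(np.abs(sales - pred) / sales)

print(f"in-sample error (MAPE): {mape:.1%}")
print(f"YouTube coefficient:    {beta_wrong[1]:.2f}  (true value is 0.0)")
```

The fit looks great and the attribution is entirely wrong. A small forecast error cannot tell those two situations apart.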

Red flag 2: They won't show you the confidence intervals

Demand the confidence intervals. Always.

Vendors do themselves a genuine disservice here — they hide the intervals because they want you to believe in their certainty. But a single bar with no interval is hiding the most important thing in the whole readout: whether the model can actually tell two channels apart. I've literally seen a return-on-investment chart where the bars looked different but the intervals overlapped so much the difference was meaningless — the model is quietly telling you "I'm not sure TV and YouTube are really different," and the vendor is presenting it as settled fact.

I once asked two vendors at the same company for their intervals. One flatly said no. The other handed them over, and you could see per channel that they didn't overlap much — that's a real degree of confidence you can act on. Guess which vendor I trusted. If they won't show you the uncertainty, assume it's because the uncertainty is embarrassing.
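If the vendor does hand over the uncertainty, the check itself is short. A rough sketch, assuming you can get posterior draws of ROI per channel; the draws below are simulated stand-ins, not output from any real model, and the 90% level is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior ROI draws. In practice these come from the
# vendor's model, paired draw by draw from the same posterior sample.
tv_roi = rng.normal(1.8, 0.6, 4000)
youtube_roi = rng.normal(1.5, 0.6, 4000)

def interval(draws, level=0.9):
    """Central interval covering `level` of the draws."""
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(draws, [tail, 100 - tail])
    return lo, hi

tv_lo, tv_hi = interval(tv_roi)
yt_lo, yt_hi = interval(youtube_roi)

# Do the two intervals overlap at all?
intervals_overlap = tv_lo <= yt_hi and yt_lo <= tv_hi

# More direct question: in what share of draws does TV beat YouTube?
p_tv_better = np.mean(tv_roi > youtube_roi)

print(f"TV ROI 90% interval:      {tv_lo:.2f} to {tv_hi:.2f}")
print(f"YouTube ROI 90% interval: {yt_lo:.2f} to {yt_hi:.2f}")
print(f"intervals overlap: {intervals_overlap}, P(TV > YouTube) = {p_tv_better:.2f}")
```

Two bars at 1.8 and 1.5 look like a ranking. A probability not far above a coin flip says the model cannot really separate them.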

Red flag 3: Priors that came from the platform or attribution

Every Bayesian MMM has priors — assumptions about each channel before the data speaks. Ask where they came from. There's a right answer and a few wrong ones.

The right source for calibrating a prior is a properly designed experiment. That's it. Priors should not come from attribution, and they absolutely should not come from what the platforms claim internally about their own ROAS. Why? Because then you're measuring the same thing twice and feeding it back into itself — you're biasing the model with its own reflection. (You can still compare the MMM to attribution and platform numbers as a sanity check — just never feed them in as priors.)

The other wrong answer is "we used our industry defaults" delivered as if it were secret sauce. Sometimes defaults are fine — it depends entirely on whether they're informative and whether they fit your sector. But you are entitled to ask. "What are your priors for media effectiveness? What's your ad stock, what's your saturation, per channel?" This is not proprietary code. Nobody's asking for their model internals. If a vendor won't share the assumptions, that's iffy, and you should treat it as a flag.

Red flag 4: Not enough data (and why digital-only is an advantage)

Here's the rule I give every client. If you don't have at least two years of weekly data — three is the sweet spot, agreed by Google's research, Meta's research, and my own scars — don't even spend the money on an MMM. You'll get pretty values wrapped in confidence intervals so wide they're worthless, and you'll fool yourself with them. Start with the experimentation layer instead.

And here's the counterintuitive part people get backwards: digital-only is not a limitation, it's an advantage. If you've got three years of weekly data and it's all digital channels, you're in a better spot, not a worse one. Nearly all the pain in MMM calibration comes from the offline side — how on earth do you cleanly test TV, or out of home? With digital you get accurate exposure metrics, you can run real in-platform A/B tests for reliable measurement, and you can usually get city-level geographic granularity to feed the model more variance. The headaches come from offline. Don't apologize for being digital.

Red flag 5: No quarterly refresh and no experiment roadmap

Ask how often they refresh the model. The gold standard is quarterly — feed the last quarter back in every three months. Yearly is too long to let a model go stale. Weekly or monthly is theater: at the MMM level you should not expect your channel effectiveness or baseline to move week to week, and if it does, your model is wrong. I've had clients beg me to update monthly and I'll do it as a favor, but I tell them straight: you're not feeding it as much signal as you think.

The bigger flag is the missing experiment roadmap. I think every MMM implementation should ship with an appendix that says: here are the experiments you need to run, in this order, to test and audit this model. Vendors usually treat that as out of scope — "here's your model, here's our invoice, when do you want it updated?" I don't understand why they stop there, because the only way to ever prove the model's estimates is through experiments. If nobody's planning those tests, the model has no path to being trusted. It's born on life support.

Red flag 6: Priors locked down because a team screamed loud enough

This is the one that actually corrupts results, and it's subtle.

The tightness of a prior is set by its standard deviation. A wide prior says "I have a rough idea, but move freely if the data disagrees." A tight one says "you'll need overwhelming evidence to pull me off this." Go tight enough — a standard deviation of 0.01 — and you've planted a flag no amount of data can move. I've done it by accident; the model got stuck and I spent an afternoon confused before I realized I'd nailed it to the floor.
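A back-of-the-envelope way to see the "nailed to the floor" effect is the textbook normal-normal update, where the posterior mean is a precision-weighted average of the prior and the data. This is a simplification, not how a full MMM is fitted, and the numbers are invented: the prior says ROI is 1.0, the data says 2.0 with a standard error of 0.2.

```python
def posterior_mean(prior_mean, prior_sd, data_estimate, data_se):
    """Normal-normal update: weight each source by 1 / variance."""
    w_prior = 1 / prior_sd**2
    w_data = 1 / data_se**2
    return (w_prior * prior_mean + w_data * data_estimate) / (w_prior + w_data)

# Same prior mean, same data, only the prior's standard deviation changes.
wide = posterior_mean(prior_mean=1.0, prior_sd=1.0, data_estimate=2.0, data_se=0.2)
tight = posterior_mean(prior_mean=1.0, prior_sd=0.01, data_estimate=2.0, data_se=0.2)

print(f"wide prior  (sd=1.0):  posterior mean {wide:.3f}")
print(f"tight prior (sd=0.01): posterior mean {tight:.3f}")
```

With the wide prior the estimate lands close to what the data says. With a standard deviation of 0.01 it barely leaves 1.0, because the prior carries 400 times the weight of the data.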

Now imagine that's done on purpose. A media team is convinced paid social ad stock is exactly three weeks, or wants ROI fixed at some number, and they scream until the vendor locks it down. Most vendors don't do this — but some will, and here's the real damage: when you anchor one parameter so it can't move, everything else has to move to accommodate it. The model contorts the rest of the channels around the one you fixed. That's not calibration, that's the model cheating because you told it to. Tight priors are fine with evidence — if you've run well-powered paid-search experiments every month for a year, sure, be confident. Adding a decimal point of false precision so one channel can't budge is not that.

And it propagates worse when the locked parameter sits early in the model's chain of transformations. Fix the ad stock and you've changed the transformation, which changes the effect size, which changes everything downstream. So when you ask for the priors, also ask which ones are pinned, and why.
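Here is a small sketch of the "everything else has to move" problem, again on invented data with least squares standing in for the real model. Paid social and paid search spend are correlated, the true effects are 2.0 and 1.0, and then someone insists that paid social is worth 4.0.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 156

# Made-up spend series; search spend tracks social spend.
social = rng.normal(50, 10, n)
search = 0.8 * social + rng.normal(0, 3, n)
sales = 200 + 2.0 * social + 1.0 * search + rng.normal(0, 5, n)

# Free fit: every coefficient is allowed to move.
X = np.column_stack([np.ones(n), social, search])
free, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Pinned fit: social is fixed at 4.0, the rest must explain what is left.
pinned_social = 4.0
X_rest = np.column_stack([np.ones(n), search])
rest, *_ = np.linalg.lstsq(X_rest, sales - pinned_social * social, rcond=None)

print(f"free fit:   social {free[1]:.2f}, search {free[2]:.2f}")
print(f"pinned fit: social {pinned_social:.2f} (fixed), search {rest[1]:.2f}")
```

In the pinned fit, paid search ends up with a negative coefficient. Nothing about paid search changed; it is simply absorbing the error that was forced into paid social.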

Red flag 7: The org is weaponizing MMM against attribution

This one isn't a modeling flaw, it's an organizational one, and it's the deadliest because it kills models that are technically fine.

Here's the pattern I've lived through more than once. The performance marketing team isn't the team that brought the MMM in — they're happy, they've got their attribution, they're running their Google Ads tests. It's the brand team that brings the MMM in, because the MMM is the only thing that can prove TV works. In a healthy company brand and performance both answer to the same CMO and you get integrated measurement. In a dysfunctional one — and I've been in several — these teams operate separately, sometimes adversarially. Attribution sits on one side, MMM on the other, and instead of triangulating, the two numbers become ammunition in a budget fight.

The result is predictable. Someone hears the model "has no priors" or "the numbers moved," word spreads that the model is wrong, and the MMM is suddenly on life support with nobody trusting it. That's the worst-case scenario — the model's about to die not because it's bad, but because of politics. Watch for it, because no amount of statistical rigor saves a model the org has already decided to disbelieve.

The rescue playbook: don't rebuild from scratch

So your model is distrusted. A new stakeholder asked about priors, found out the vendor used defaults, declared "I don't trust it, we need to rebuild." Before you torch eight months of work — and I say eight months because that's roughly what my last rescue took — work the checklist. Rebuilding from scratch is almost always throwing the baby out with the bathwater.

Step 1: Check the data first. This is where the real disasters live, and it's almost never the math. On a recent client I got paid for six months purely to clean data, and the headline problem was almost comically dumb: a media agency ran the paid-social activity, a central data engineering team extracted it, and both sent the same account's data to the vendor, so paid social spend was double-counted. That wasn't even the worst issue. Before you question a single coefficient, ask: did the model get the right data? Are we using exposure metrics or spend, and do they reconcile to the actual budget? (I once spent six weeks reconciling a model's channel mapping to the budget.) Is the channel taxonomy sane, or did someone split paid search into ten variables because the account has ten sub-accounts, going too deep, too fast, and starving every other channel of data?
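Both of those checks are easy to automate. A minimal sketch, assuming the model input can be exported as rows with a source, an account, a week, and a spend figure; the field names, the account IDs, and the 2% tolerance are all hypothetical.

```python
from collections import defaultdict

# Hypothetical extract of what was sent to the vendor.
rows = [
    {"source": "agency", "account": "social-001", "week": "2025-01-06", "spend": 1000.0},
    {"source": "data_eng", "account": "social-001", "week": "2025-01-06", "spend": 1000.0},
    {"source": "agency", "account": "search-001", "week": "2025-01-06", "spend": 700.0},
]

def find_double_counts(rows):
    """Account-weeks that arrived from more than one source."""
    seen = defaultdict(set)
    for r in rows:
        seen[(r["account"], r["week"])].add(r["source"])
    return {key: sorted(srcs) for key, srcs in seen.items() if len(srcs) > 1}

def reconcile(model_spend, budget, tolerance=0.02):
    """Channels whose modelled spend is off from the real budget by more than `tolerance`."""
    gaps = {}
    for channel, planned in budget.items():
        gap = model_spend.get(channel, 0.0) / planned - 1
        if abs(gap) > tolerance:
            gaps[channel] = gap
    return gaps

duplicates = find_double_counts(rows)
gaps = reconcile(
    model_spend={"paid_social": 2000.0, "paid_search": 700.0},
    budget={"paid_social": 1000.0, "paid_search": 700.0},
)
print(duplicates)
print(gaps)
```

Here the same paid-social account-week shows up from both the agency and data engineering, and the modelled paid-social spend comes out at double the budget, which is the signature of the double-counting described above.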

Step 2: Get the priors and the ad stock, and smell-test them. Once the data checks out, ask the vendor for the priors, the ad stock, and the saturation per channel. Then apply common sense. A vendor I worked with (genuinely good people) reported paid search with three weeks of ad stock. That smells wrong: no paid-search marketer alive is brand-building on paid search, so there's little reason for its effect to linger that long. Meta's published benchmarks topped out around twelve weeks of ad stock, so a brand person claiming a year of TV carryover is unreasonable on the other end. You can't perfectly identify these parameters (carryover and saturation are genuinely hard, sometimes near-impossible, and you end up trusting the optimizer), but you can smell-test them against industry reality.
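To smell-test an ad stock number it helps to translate it into "how long until most of the effect has landed". The sketch below assumes geometric ad stock, which is one common form but not necessarily what a given vendor uses, and it reads "N weeks of ad stock" as a half-life, which is also an assumption; ask the vendor which definition they mean.

```python
import math

def geometric_adstock(spend, decay):
    """Carry a share `decay` of last week's adstocked value into this week."""
    carried = 0.0
    out = []
    for x in spend:
        carried = x + decay * carried
        out.append(carried)
    return out

def decay_from_half_life(half_life_weeks):
    """Weekly decay rate that halves the effect after `half_life_weeks`."""
    return 0.5 ** (1 / half_life_weeks)

def weeks_to_deliver(decay, share=0.9):
    """Weeks until `share` of a single week's total effect has landed."""
    return math.log(1 - share) / math.log(decay)

# A one-off burst of 100 with a decay of 0.5.
print(geometric_adstock([100, 0, 0, 0], decay=0.5))

# Half-life of 3 weeks versus half-life of a year.
search_tail = weeks_to_deliver(decay_from_half_life(3))
tv_tail = weeks_to_deliver(decay_from_half_life(52))
print(f"3-week half-life:  90% of the effect lands within {search_tail:.1f} weeks")
print(f"52-week half-life: 90% of the effect lands within {tv_tail:.0f} weeks")
```

Under this reading, a three-week half-life means a paid-search click is still paying out roughly ten weeks later, and a one-year half-life means TV is still paying out more than three years later. Put that way, both are much easier to argue about.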

Step 3: Collect experiments and calibrate. This is how you bring trust back. Find the contentious channels — where's the budget actually being argued over? — and start there. The performance team almost certainly has a paid-search test lying around somewhere. Feed those results in as properly-weighted priors and let the coefficients move. And reframe the movement for the nervous stakeholders, because finance people love an objective binary truth and hate probability: when the model shifts after calibration, that's not a bug, that's a feature — it means we're getting more accurate. On my last rescue, we collected experiments, some came close to what the MMM already said, others didn't, we adjusted, and trust came back channel by channel. The offline world is the hardest test and comes last.
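One simple way to turn an experiment readout into a prior is sketched below. It assumes the test reports an ROI point estimate with a roughly symmetric 95% interval, and it widens the implied standard deviation by a factor (`widen`, a hypothetical knob set to 1.5 here) because a test from one period or region rarely transfers perfectly. The `LiftTest` structure and the numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class LiftTest:
    channel: str
    roi_estimate: float
    ci_low: float   # lower end of the reported 95% interval
    ci_high: float  # upper end of the reported 95% interval

def prior_from_experiment(test, widen=1.5):
    """Return (mean, sd) for a normal prior on the channel's ROI."""
    z95 = 1.96
    sd = (test.ci_high - test.ci_low) / (2 * z95)  # interval width to standard deviation
    return test.roi_estimate, sd * widen

test = LiftTest(channel="paid_search", roi_estimate=1.2, ci_low=0.8, ci_high=1.6)
prior_mean, prior_sd = prior_from_experiment(test)
print(f"{test.channel}: prior mean {prior_mean:.2f}, prior sd {prior_sd:.2f}")
```

A well-powered test yields a narrow interval and therefore a tighter prior; a noisy test yields a wide one that the data can still overrule. That is the evidence-based kind of tightness, as opposed to the kind someone asked for.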

What you should expect from a sound model after calibration is telling. The ROI figures shouldn't lurch — paid search shouldn't fly from best channel to worst. Marketing ROI at the MMM level doesn't swing wildly unless you've genuinely changed strategy. What should move is certainty: the confidence intervals tighten. If instead you see TV go from top to bottom of the rankings, then your manager was right and you do need to start over — that's a telltale sign something is broken at the data or structure level.
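That expectation can be turned into a before-and-after check. A minimal sketch, assuming you have ROI summaries (mean, interval low, interval high) per channel from the model before and after calibration; the figures below are invented.

```python
# Hypothetical ROI summaries: channel -> (mean, interval low, interval high).
before = {"paid_search": (2.4, 1.2, 3.6), "tv": (1.6, 0.4, 2.8), "paid_social": (1.0, 0.1, 1.9)}
after = {"paid_search": (2.2, 1.6, 2.8), "tv": (1.7, 0.9, 2.5), "paid_social": (0.9, 0.4, 1.4)}

def ranking(summary):
    """Channels ordered from highest to lowest mean ROI."""
    return sorted(summary, key=lambda ch: summary[ch][0], reverse=True)

def width(summary, channel):
    _, lo, hi = summary[channel]
    return hi - lo

# How many places did each channel move in the ranking?
rank_shift = {
    ch: abs(ranking(before).index(ch) - ranking(after).index(ch)) for ch in before
}
# Did each channel's interval get narrower?
tightened = {ch: width(after, ch) < width(before, ch) for ch in before}

print("rank shift:", rank_shift)
print("tightened: ", tightened)
```

Zero rank shift with every interval narrower is the healthy pattern. A channel jumping from one end of the ranking to the other is the signal to go back to the data and the structure.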

Two stress tests worth stealing

If you want to actually pressure-test a vendor's model, two tricks I use.

Synthetic recapture. Generate synthetic data where you set the true ROIs per channel — you know the answer because you made it. Feed it to the model. A good model recaptures your parameters. This is Bayesian work 101, honestly; most of my own teaching exercises are generating synthetic data in PyMC and recapturing it. If the vendor's model can't recover known answers, that's all you need to know.
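The shape of a recapture test, stripped to the bone, looks like the sketch below. To keep it dependency-light it uses NumPy and ordinary least squares as a stand-in for the model under test, not PyMC, and it skips ad stock and saturation entirely. The channel names, ROIs, baseline, and noise level are all invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 156  # three years of weekly data

# The answers we plant in the data.
true_roi = {"tv": 1.5, "paid_search": 2.5, "paid_social": 0.8}
spend = {ch: rng.uniform(50, 150, n) for ch in true_roi}

baseline = 1000.0
revenue = (
    baseline
    + sum(true_roi[ch] * spend[ch] for ch in true_roi)
    + rng.normal(0, 20, n)
)

# Stand-in for "feed it to the model": intercept plus one column per channel.
X = np.column_stack([np.ones(n)] + [spend[ch] for ch in true_roi])
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)
recovered = dict(zip(true_roi, coef[1:]))

for ch in true_roi:
    print(f"{ch}: planted {true_roi[ch]:.2f}, recovered {recovered[ch]:.2f}")
```

With a real vendor you would hand over the synthetic spend and revenue, keep `true_roi` to yourself, and compare their reported ROIs against it afterwards.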

The channel swap. Sometimes a vendor doesn't see it coming. Swap two channels (put paid search's data under paid social's label and vice versa, leaving the labels themselves untouched) and hand it over. The coefficients should move with the data, because each label now sits on a genuinely different channel. A colleague of mine did this to a vendor once and they were thoroughly confused. It's a quick read on how stable and how real the model actually is.
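One way to read the result of a swap is sketched below, under heavy assumptions: a linear model with independent normal priors per label, fitted by its closed-form posterior mode, on invented data. The priors are attached to the labels (search 2.0, social 1.0), and the data under those labels has been swapped. With loose priors the estimates follow the data; with tight priors they stay glued to the labels.

```python
import numpy as np

def map_estimate(X, y, prior_mean, prior_sd, noise_sd):
    """Posterior mode for a linear model with independent normal priors."""
    X = X - X.mean(axis=0)  # centre everything so the baseline drops out
    y = y - y.mean()
    A = X.T @ X / noise_sd**2 + np.diag(1 / prior_sd**2)
    b = X.T @ y / noise_sd**2 + prior_mean / prior_sd**2
    return np.linalg.solve(A, b)

rng = np.random.default_rng(4)
n = 156
search = rng.uniform(50, 150, n)
social = rng.uniform(50, 150, n)
revenue = 1000 + 2.5 * search + 0.8 * social + rng.normal(0, 20, n)

# The swap: the "search" column now holds social data and vice versa.
X_swapped = np.column_stack([social, search])
prior_mean = np.array([2.0, 1.0])  # priors by label: search, social

loose = map_estimate(X_swapped, revenue, prior_mean, np.full(2, 5.0), noise_sd=20)
tight = map_estimate(X_swapped, revenue, prior_mean, np.full(2, 0.01), noise_sd=20)

print(f"loose priors: 'search' {loose[0]:.2f}, 'social' {loose[1]:.2f}")
print(f"tight priors: 'search' {tight[0]:.2f}, 'social' {tight[1]:.2f}")
```

If a vendor's numbers after the swap look like the tight-prior row, the readout is mostly reporting the assumptions, not the data.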

The honest bottom line

All models are wrong, some are useful. We are simplifying reality with a hundred to two hundred data points. I can't encode the entire psychology of brand pull and awareness and consideration into that, and any vendor who claims to is overfitting their way to a 99% accurate work of fiction. So unless the data was genuinely wrong or the model fails a basic post-modeling check (a negative baseline, a failed posterior predictive check), you almost never need to rebuild. You need to check the data, get the priors, smell-test, run experiments, and calibrate. That's the job.

And the Bayesian framing actually helps here, because it's the one school of thought that's always reasoning in degrees of confidence: how much do I trust this model? The honest answer at the start of a rescue is "this is the best we can do. Is it accurate? We don't know, but we know exactly how to find out."

That checklist — the data reconciliation, the prior and ad-stock smell tests, the experiment roadmap — is the spine of what I teach and what I do on consulting engagements. If your MMM is on life support, or you just want a second set of eyes on a vendor's assumptions before you sign or before you torch it, that's squarely the kind of conversation I'm happy to have. The course and my details are at marketingscience.dev. I've been in your seat, and there is a way out.