White hat p-hacking, a primer

Jargon glossary:
Exploratory data analysis is what you do when you suspect there is something interesting in the data but you don’t have a good idea of what it might be, so you don’t start from a hypothesis.
p-hacking overlaps with it: asking random questions of a noisy world on scant data until the world accidentally misfires and tells you what you want to hear, and then pretending that that was what you thought would happen all along. It is a response to null results, when you spent forever organizing a study and nothing happens.
The replicability crisis, which p-hacking might have caused, is researchers becoming boors when they realize that everything they thought was true is wrong.
Hypothesis registration is when you tell the world what question you’re gonna ask and what you expect to find before doing anything at all. People are excited because it is a solution to p-hacking.
A false positive is when you think you found something that actually isn’t there. It is one of the two types of error, the other being a false negative, when you missed something that actually is there. The reproducibility movement is focused on reducing false positives.
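To put a number on "asking random questions of a noisy world," here is a minimal sketch of how fast false positives pile up when every question is asked of pure noise. The function names, the 0.05 threshold, and the assumption that the questions are independent are all mine, chosen for illustration.

```python
import random


def family_wise_error_rate(alpha: float, n_questions: int) -> float:
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_questions


def simulate_fishing(alpha: float, n_questions: int, n_studies: int, seed: int = 0) -> float:
    """Fraction of simulated all-noise studies that still produce a 'finding'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # When nothing is really there, a p-value is uniform on [0, 1],
        # so each question "succeeds" with probability alpha.
        if any(rng.random() < alpha for _ in range(n_questions)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    print(family_wise_error_rate(0.05, 20))
    print(simulate_fishing(0.05, 20, 10_000))
```

Under those assumptions, twenty questions at the usual 0.05 threshold give you roughly a 64% chance of hearing what you want to hear from a world with nothing in it.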

I almost falsified data once. I was a young research assistant in primatologist Marc Hauser’s lab in 2004 (well before he had to quit for falsifying data, but probably unrelated to that). I was new to Boston, lonely and jobless. I admired science and wanted to do it, but I kept screwing up. I had already screwed up once running my monkey experiment. I got a stern talking to and was put on thin ice. Then I screwed up again. I got scared and prepared to put made-up numbers in the boxes. I immediately saw myself doing it. Then I started to cry, erased them, unloaded on the RA supervising me, quit on the spot, and even quit science for a few years before allowing myself back in in 2008. I know how we fool and pressure ourselves. To be someone you respect requires either inner strength or outside help. Maybe I’ve got the first now. I don’t intend to find out.

That’s what’s great about hypothesis registration: it is outside help. And still, I’m not impressed by it. Yes, it’s rigorous and valuable for some kinds of researchers, but it does not have to be in my toolkit for me to be a good social scientist. First, there are responsible alternatives to registration, which itself is only useful in domains so well understood that you have to wonder why we are still studying them. Second, “exploratory data analysis” is getting lumped in with irresponsible p-hacking. That’s bad, and it will keep happening until we stop pretending that we already know the unknowns. In the study of complicated systems, uncertain data-first exploratory approaches will always precede solid theory-first predictive approaches. We need a good place for exploration, and many of the alternatives to registration have one.

What are the responsible alternatives to hypothesis registration?

Design good experiments, the “critical” kind whose results will be fascinating no matter what happens, even if nothing happens. The first source of my not-being-impressed-enough by the registration craze is that it misses a bigger problem: people should design studies that they know in advance will be interesting no matter the outcome. If you design null results out, you don’t get to a point of having to fish in the first place. Posting your rotten intuitions in advance is no replacement for elegant design. And elegant design can be taught.
Don’t believe everything you read. Replicability concerns don’t acknowledge the hidden importance of tolerating unreplicable research. The ground will always be shaky, so if it feels firm, it’s because you’re intellectual dead weight and an impediment to science. For a given study and a given amount of data, reducing false positives means accepting more false negatives, and trying to eliminate one type of error makes the other kind explode. Never believe that there is anything you can do to get the immutable intellectual foundation you deserve. Example: psychology has a lot of research that’s bunk. Econ has less research that’s bunk. But psychology adapts quickly, and econ needs decades of waiting for the old guard to die before something as obvious as social preferences can be suffered to exist. Those facts have a deep relationship: economists have historically suffered false negatives as the price of avoiding false positives. Psychologists do the opposite, and they cope with the predominance of bunk by not believing most studies they read. Don’t forget what they once said about plate tectonics: “It is not scientific but takes the familiar course of an initial idea, a selective search through the literature for corroborative evidence, ignoring most of the facts that are opposed to the idea, and ending in a state of auto-intoxication in which the subjective idea comes to be considered an objective fact.”
Design experiments that are obvious to you and only you, because you’re so brilliant. If your inside knowledge gives you absolute confidence about what will happen and why it’s interesting, you won’t need to fish: if you’re wrong despite that wild confidence, that’s interesting enough to be publishable itself. Unless you’re like me and your intuition is so awful that you need white hat p-hacking to find anything at all.
Replace p-values with empirical confidence intervals.
Find weak effects boring. After all, they are.
Collect way too much data, and set some aside that you won’t look at until later.
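The last few items are mechanical enough to sketch. What follows is one possible reading, not a prescription: I am assuming that an "empirical confidence interval" means a percentile bootstrap interval on the mean, and that "setting some aside" means a random split made before you look at anything. The names `split_holdout` and `bootstrap_ci` are hypothetical helpers, not from any library.

```python
import random
import statistics


def split_holdout(data, holdout_fraction=0.5, seed=0):
    """Randomly split data into an exploration set and a set you promise not to peek at."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def bootstrap_ci(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean (assumed meaning of 'empirical')."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, recompute the mean each time, and sort the results.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    low = means[int(tail * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return low, high


if __name__ == "__main__":
    rng = random.Random(1)
    effect_sizes = [rng.gauss(0.3, 1.0) for _ in range(400)]  # made-up data
    explore, held_out = split_holdout(effect_sizes)
    print(bootstrap_ci(explore))  # look at this now; leave held_out alone
```

An interval also makes "find weak effects boring" easy to act on: if the whole interval hugs zero, the effect is weak no matter how small its p-value is.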

OK, so you’re with me: Exploratory data analysis is important. It’s impossible to distinguish from p-hacking. Therefore, p-hacking is important. So the important question is not how to avoid p-hacking, but how to p-hack responsibly. We can; we must. Here is one way:

Collect data without a hypothesis.
Explore and hack it unapologetically until you find/create an interesting/counterintuitive/publishable/PhD-granting result.
Make like a responsible researcher by posting your hypothesis about what already happened after the fact.
Self-replicate: Get new data or unwrap your test data.
Test your fishy hypothesis on it.
Live with the consequences.
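The steps above can be sketched in a few lines. This is a toy under loud assumptions: "hacking" is reduced to picking whichever predictor correlates most strongly with the outcome, and the confirmatory test is a permutation test on the untouched half. The function names and the data layout (a dict of named columns) are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither list is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=1000, seed=0):
    """Two-sided p-value for the correlation, obtained by shuffling ys."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


def white_hat_p_hack(predictors, outcome, seed=0):
    """predictors: dict of name -> list of values; outcome: list of the same length."""
    rng = random.Random(seed)
    rows = list(range(len(outcome)))
    rng.shuffle(rows)
    half = len(rows) // 2
    explore, held_out = rows[:half], rows[half:]

    def pick(values, idx):
        return [values[i] for i in idx]

    # Collect and hack: fish freely, but only in the exploration half.
    best = max(
        predictors,
        key=lambda name: abs(
            pearson(pick(predictors[name], explore), pick(outcome, explore))
        ),
    )
    # Post the hypothesis after the fact.
    hypothesis = f"{best} is correlated with the outcome"
    # Self-replicate: exactly one test, on rows the fishing never touched.
    p = permutation_p_value(
        pick(predictors[best], held_out), pick(outcome, held_out), seed=seed
    )
    # Live with the consequences.
    return hypothesis, p
```

The fishing can be as shameless as you like, because it only ever sees the exploration rows. The held-out half gets one question, asked once.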

While it seems crazy to register a hypothesis after the experiment, it’s totally legitimate, and is probably better done after your first study than before it. This whole thing works because good exploratory findings are both interesting and really hard to kill, and testing out of sample forces you to not take the chance on anything that you don’t think will replicate.

I think of it as integrity exogenously enforced. And that’s the real contribution of recent discourse: hypothesis registration isn’t what’s important, it’s tying your hands to the integrity mast, whether by registration, good design, asking fresher questions, or taking every step publicly. It’s important to me because I’m very privileged: I can admit that I can lie to myself. Maybe I’m strong enough to not do it again. I don’t intend to find out.
