What is “Reproducibility” Anyway?

Crossposted from Scimatic

Titus Brown has a very funny spoof about how scientists will probably react to the NSF’s moves towards data management plans. Go read it, I’ll wait. After detailing all manner of horrible data, licensing, and source code management techniques, he closes with:

Meanwhile we will continue publishing exciting sounding (but irerproducible [sic]) analyses

I’m not sure how I feel about this. All the disturbing practices he details are, well, disturbing. And not very scientific. However, he implies in his last sentence that the end goal of having well-structured data, documentation, and source code in a version control tool is “reproducibility.” But he never offers a definition of what “reproducibility” is.

This topic seems to come up a lot in the Open Science and Science 2.0 discussions. And the entry-level definition of “reproducibility” seems to be that another scientist or group will take your data and your tools and verify your result.

Sorry, but that’s not reproducibility. As the climate folks say, that’s replication. If you take all the same data and all the same tools, one of two things can happen:

1. You get a different result than I did, which means one of us made a mistake somewhere.
2. You get exactly the same result, which tells us very little, because every bias in the data and the code is still baked in.

There’s value in the first outcome, especially if you show I’m the sloppy one. I’m just not sure how much value there is. It’s going to be hard to convince some young researcher to take a year or five to figure out that maybe some other dude might be wrong. It’s just not that appealing. I’d rather work on my own ideas.

The real problem is that reproducibility or verification or whatever you want to call it is a lot harder than just running someone else’s code. It probably means designing a different experimental setup. Controlling for different biases. Getting a statistically independent data set. These things cost time and money, both of which are in short supply. But all these things are critical to say that a result has been reproduced.

A short example from my previous life. My thesis was about CP violation in the neutral kaon sector. We measured a parameter called Epsilon-prime. It doesn’t really matter what it is or what it describes. What mattered at the time was whether or not it had a value of zero. The Fermilab results said yes, consistent with zero, and the CERN results said no, it’s non-zero. A real “irreproducible” disagreement. And dammit, both groups had pride on the line and needed to be right.

So, both groups built new experiments. Both groups looked at each other’s techniques, and cherry-picked the best ideas. We went back for more funding. The second round of experiments got different results: now the Fermilab result was farther from zero than the CERN result. But by now, the two results were consistent within their respective uncertainties, and also consistent with a non-zero interpretation. That’s a win, and that’s reproducibility.
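To make “consistent within their respective uncertainties” concrete, here’s a minimal sketch in Python of the usual back-of-the-envelope check: two independent measurements agree if their difference is small compared to their uncertainties added in quadrature. The numbers below are illustrative placeholders, not the actual published results.

```python
import math

def consistent(x1, s1, x2, s2, n_sigma=2.0):
    """Are two independent measurements (value, uncertainty) compatible
    within n_sigma combined standard deviations?"""
    combined = math.sqrt(s1 ** 2 + s2 ** 2)  # independent errors add in quadrature
    z = abs(x1 - x2) / combined              # separation in units of combined sigma
    return z, z < n_sigma

# Illustrative placeholder numbers (in units of 1e-4), NOT the published results:
z, ok = consistent(21.0, 3.0, 15.0, 2.5)
print(f"separation = {z:.1f} sigma; consistent within 2 sigma: {ok}")
```

The same arithmetic, run the other way (value divided by its own uncertainty), is what lets you say each result is also “consistent with a non-zero interpretation.”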

A similar thing is being reported in the New York Times about the DZero collaboration. No one is interested in looking at DZero’s data or their software. They are interested in whether the CDF collaboration has a similar, independent result that verifies what DZero is reporting.

This is going to be a bigger problem in the future, not just in physics, but also in bioinformatics. The scale of the data and the experiments is so large that no one will be able to mount a complementary experiment to confirm the results. Once the LHC produces peta- or yotta-bytes of data, that’s it. It’s all we got.

So in that respect, Brown’s points are good. You have to have decent data management plans. Scientists owe it to the people who will come later, and to the people (i.e. the taxpayer) who paid for the research. Some of these experiments will only be run once, and future scientists may have ideas for finding things in the data that we haven’t thought of yet. However, I’m not sure whether he’s claiming that’s sufficient for reproducibility. I don’t think it is.

As for the source code that did the analysis: if it’s open and available and well architected and concise and documented, great. I’m not going to run it, but seeing it in that shape will give me confidence that you applied similar rigour to the rest of your experiment. It’s the reverse of the Climategate-East Anglia problem. I don’t believe those guys are doing good science because they sure aren’t writing good code. As Steve Easterbrook points out, there are other climatology groups writing really tight software with good development practices. I’m probably going to trust their models more. So there certainly are benefits to all the things Brown indirectly suggests.

Now, none of this discussion is new; new to me maybe, but the climate folks have been all over this for a while. And it’s a really tough problem. I don’t have any answers, but the first step the community can take is to figure out what the terms actually mean.

So, if you got this far, here’s the summary:

1. Taking my data and my code and getting the same answer is replication, not reproducibility.
2. Real reproduction means an independent experiment: a different setup, different biases, a statistically independent data set.
3. Well-managed data and clean, open code still matter, because some experiments will only ever run once, and because they signal rigour in everything else you did.

But let’s drop the idea that I’m going to take your data and your code and “reproduce” your result. I’m not. First, I’ve got my own work to do. More importantly, the odds are that nobody will be any wiser when I’m done.