When you give multiple teams of researchers the same question and data, it's not uncommon to get different results
Science is commonly understood as being a lot more certain than it is. In popular science books and articles, an extremely common approach is to pair a deep dive into one study with an illustrative anecdote. The implication is that’s enough: the study discovered something deep, and the anecdote made the discovery accessible. Or take the coverage of science in the popular press (and even the academic press): most coverage of science revolves around highlighting the results of a single new (cool) study. Again, the implication is that one study is enough to know something new. This isn’t universal, and I think coverage has become more cautious and nuanced in some outlets during the era of covid-19, but it’s common enough that for many people “believe science” is a sincere mantra, as if science made pronouncements in the same way religions do.
But that’s not the way it works. Single studies - especially in the social sciences - are not certain. Over the 2010s, it became clear that a lot of studies (maybe the majority) do not replicate. The failure of studies to replicate is often blamed (not without evidence) on a bias towards publishing new and exciting results. Consciously or subconsciously, that bias leads scientists to employ shaky methods that get them the results they want, but which aren’t reliable.
But perhaps it’s worse than that. Suppose you could erase publication bias and just let scientists choose whatever method they thought was the best way to answer a question. You might expect that, freed from the need to find a cool new result, scientists would pick the best method to answer a question and then, well, answer it, and that they would all arrive at roughly the same answer.
The many-analysts literature shows us that’s not the case though. The truth is, the state of our “methodological technology” just isn’t there yet. There remains a core of unresolvable uncertainty and randomness in the best of circumstances. Science isn’t certain.
In many-analyst studies, multiple teams of researchers test the same previously specified hypothesis, using the exact same dataset. In all the cases we’re going to talk about today, publication is not contingent on results, so we don’t have scientists cherry-picking the findings that make their work look most interesting; nor do we have replicators cherry-picking results to overturn prior findings. Instead, we just have researchers applying judgment to data in the hopes of answering a question. Even so, results can be all over the map.
Let’s start with a paper in economics: Huntington-Klein et al. (2021). In this paper, seven different teams of researchers tackle two research questions that had been previously published in top economics journals (but which were not so well known that the replicators knew about them). In each case, the papers were based on publicly accessible data, and part of the point of the exercise was to see how different decisions about building a dataset from the same public sources lead to different outcomes. In the first case, researchers used variation across US states in compulsory schooling laws to assess the impact of compulsory schooling on teenage pregnancy rates.
Researchers were given a dataset of schooling laws across states and times, but to assess the impact of these laws on teen pregnancy, they had to construct a dataset on individuals from publicly available IPUMS data. In building the data, researchers diverged in how they handled different judgement calls. For example:
One team dropped data on women living in group homes; others kept them.
Some teams counted teenage pregnancy as pregnancy after the age of 14, but one counted pregnancy at the age of 13 as well.
One team dropped data on women who never had any children.
In Ohio, schooling was compulsory until the age of 18 in every year except 1944, when the compulsory schooling age was 8. Was this a genuine policy change? Or a typo? One team dropped this observation, but the others retained it.
Between this and other judgement calls, no team assembled exactly the same dataset. Next, the teams needed to decide how, exactly, to perform the test. Again, each team differed a bit in terms of what variables it chose to control for and which it didn’t. Race? Age? Birth year? Pregnancy year?
It’s not immediately obvious which decisions are the right ones. Unfortunately, they matter a lot! Here were the seven teams’ different results.
Depending on your dataset construction choices and exact specification, you can find that compulsory schooling lowers teenage pregnancy, increases it, or has no impact at all! (There was a second study as well - we will come back to that at the end.)
This isn’t the first paper to take this approach. An early paper in this vein is Silberzahn et al. (2018). In this paper, 29 research teams composed of 61 analysts sought to answer the question “are soccer players with dark skin tone more likely to receive red cards from referees?” This time, teams were given the same data but still had to make decisions about what to include and exclude from analysis. The data consisted of information on all 1,586 soccer players who played in the first male divisions of England, Germany, France and Spain in the 2012-2013 season, and for whom a photograph was available (to code skin tone). There was also data on player interactions with all referees throughout their professional careers, including how many of these interactions ended in a red card and a bunch of additional variables.
As in Huntington-Klein et al. (2021), the teams adopted a host of different statistical techniques, data cleaning methods, and exact specifications. While everyone included “number of games” as one variable, just one other variable was included in more than half of the teams’ regression models. Unlike in Huntington-Klein et al. (2021), the teams in this study also used a much wider range of statistical estimation techniques. The resulting estimates (with 95% confidence intervals) are below.
Is this good news or bad news? On the one hand, most of the estimates lie between 1 and 1.5 (these are odds ratios, so an estimate of 1 means no effect). On the other hand, about a third of the teams cannot rule out zero impact of skin tone on red cards; the other two thirds find a positive effect that is statistically significant at standard levels. In other words, if we picked two of these teams’ results at random and called one the “first result” and the other a “replication,” they would only agree on whether the result is statistically significant or not about 55% of the time!
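Where does the 55% come from? Here is a quick back-of-the-envelope check, under the simplifying assumptions (mine) that exactly two thirds of teams find a significant effect and that the two results are drawn independently. Two results agree when both are significant or both are not.

```python
from fractions import Fraction

def agreement_probability(share_significant):
    """Chance that two independently drawn results agree on significance."""
    p = Fraction(share_significant)
    # Agreement means both significant (p * p) or both not ((1 - p) * (1 - p)).
    return p * p + (1 - p) * (1 - p)

# Assumption: "two thirds" is taken as exactly 2/3.
p_agree = agreement_probability(Fraction(2, 3))
print(p_agree, round(float(p_agree), 3))  # 5/9 0.556
```

The exact figure depends on the precise team counts and on whether the same team can be picked twice, but the logic is the same.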
Let’s look at another. Breznau et al. (2022) get 73 teams, comprising 162 researchers, to answer the question “does immigration lower public support for social policies?” Again, each team was given the same data. In this case, they were given responses to surveys about support for six different government social policies, and various ways to measure “immigration:” as a stock (how many immigrants are in the country?), as a flow (how many new immigrants are coming into the country?), as a change in the flow (how is the annual inflow of immigrants changing?), etc. They were also given other country-level explanatory variables such as GDP per capita and the Gini coefficient.
It’s not surprising that we might get different results when we look at how different measures of immigration (stock vs flow, for example) affect attitudes towards different social policies (jobs vs healthcare, for example). But even when we restrict our attention to teams using the same measure of immigration, and looking at the impact on attitudes about the same category of social policy, we still get a lot of variation across different teams. The paper has a nice web-app that lets you filter the results along lots of different dimensions, which I used to create the following charts. They show how different teams (using different modeling strategies and sets of control variables) obtained different results about the impact of the flow of immigration on social support for six different kinds of government social policy.
These look a lot like the results of Silberzahn et al. (2018); most results are small, but there are some major outlier results, and a lot of variation in whether results are deemed statistically distinguishable from zero or not.
Finally, Menkveld et al. (2021) wrangle 164 teams of economists to test six different hypotheses about financial markets using a big dataset of European trading data. Testing these hypotheses required participants to define and build their own measures and indices, and to see if they have increased or decreased over time. As should be no surprise by now, the teams came up with an enormous range of estimates. For example, on one hypothesis - how has the share of client volume in total volume changed - 4% of teams found it had increased, 46% found it had declined, and 50% found no statistically significant change over time.
We could look at more studies, but the general pattern is the same: when many teams answer the same question, beginning with the same dataset, it is quite common to find a wide spread of conclusions (even when you remove motivations related to beating publication bias).
At this point, it’s tempting to hope the different results stem from differing levels of expertise, or differing quality of analysis. “OK,” we might say, “different scientists will reach different conclusions, but maybe that’s because some scientists are bad at research. Good scientists will agree.” But as best as these papers can tell, that’s not a very big factor.
Silberzahn et al. (2018) (soccer players), Breznau et al. (2022) (immigration), and Menkveld et al. (2021) (financial markets) each try various ways of assessing the expertise of their teams, and then divide their sample into groups with high and low levels of expertise to see if there is more agreement among the high expertise group. For example, Silberzahn and coauthors divide teams based on whether they teach classes on statistics, publish on methodology, and so on. They also had teams score each other’s analysis plans (before knowing the results). Breznau and coauthors have participants fill out a survey assessing their topical and methodological expertise. And Menkveld and coauthors assess team quality along a variety of metrics: publication in top journals, self-assessed expertise, seniority, team size, and replicability of their submitted code.
In each paper, the authors use these various metrics to split their sample of research teams into high expertise and low expertise, but in all cases there is either a very small or no difference between the groups. In Silberzahn et al. (2018), the half with greater expertise was more likely to find a positive and statistically significant effect (78% of teams, instead of 68%), but the variability of their estimates was the same across the groups (just shifted in one direction or another). The quality of the analysis plan, as judged by other teams, was also unrelated to the outcome. This was the case even when they only looked at the grades given by experts in the statistical technique being used. Breznau et al. (2022) split their samples along their measures of expertise, but also find no difference between the groups. Menkveld et al. (2021) do find that higher quality teams were slightly more likely to get results close to the overall average, but only in some specifications, and the effects are not particularly large.
So don’t assume the results of a given study are the definitive answer to the question. It’s quite likely that a different set of researchers, tackling the exact same question and starting with the exact same data, would have obtained a different result. Even if they had the same level of expertise!
But while most people probably overrate the degree of certainty in science, there also seems to be a sizable online contingent that has embraced the opposite conclusion. They know about the replication crisis and the unreliability of research, and have concluded the whole scientific operation is a scam. This goes too far in the opposite direction.
For example, a science nihilist might conclude that if expertise doesn’t drive the results above, then it must be that scientists simply find whatever they want to find, and that their results are designed to fabricate evidence for whatever they happen to believe already. But that doesn’t seem to be the case, at least in these multi-analyst studies. In both the study of soccer players and the one on immigration, participating researchers reported their beliefs before doing their analysis. In both cases there wasn’t a statistically significant correlation between prior beliefs and reported results.
If it’s not expertise and it’s not preconceived beliefs that drive results, what is it? I think it really is simply that research is hard and different defensible decisions can lead to different outcomes.
Right up at the top, there can be disagreement about what even counts as evidence towards answering a particular research question. Auspurg and Brüderl (2021) provide some interesting detail on what drove different answers in the soccer player study, by digging back into the original study’s records. After analyzing each team’s submitted reports, Auspurg and Brüderl argue that the 29 teams were actually trying to answer (broadly) four different questions.
Recall the research prompt was “are soccer players with dark skin tone more likely than those with light skin tone to receive red cards from referees?” Auspurg and Brüderl argue some teams interpreted this quite literally, and sought to compute the simple average difference in the risk of red cards among dark- and light-skinned players, with no effort to adjust for any other systematic differences between the players. Others thought this was a question specifically about racial bias. For them, the relevant hypothetical was the average difference in risk of a red card among two players who were identical except for their skin tone. Yet others interpreted the question as asking “if we are trying to predict the risk of red cards, does skin tone show up as one of the most important factors?” And still others thought of the whole project as being about maximizing the methodological diversity used to tackle a question, and saw their role as trying out novel and unusual methodologies, rather than whatever approach they thought most likely to arrive at the right answer!
Menkveld and coauthors’ paper on financial markets provides some other evidence that tighter bounds on what counts as evidence can reduce, though not eliminate, the dispersion of answers. Recall this paper asked researchers to test six different hypotheses. Some of these hypotheses were relatively ambiguous, such as “how has market efficiency changed over time?”, leaving it to researchers to define and implement a measure of market efficiency. Other hypotheses permitted much less scope for judgment, such as “how has the share of client volume in total volume changed?” The dispersion of answers for the more tightly defined questions was much narrower than for the more nebulous questions.
But the choice of what to look for only explains part of the different answers. In Huntington-Klein et al. (2021), for example, the parameter to be estimated is quite carefully defined. Instead, differences there stem from analysis plans and data construction. To see which matters more, Huntington-Klein et al. (2021) perform an interesting exercise where they apply the same analysis to different teams’ data, or alternatively, apply different analysis plans to the same dataset. That exercise suggests roughly half of the divergence in the teams’ conclusions stems from different decisions made in the dataset construction stage and half from different decisions made about analysis. There’s no silver bullet - just a lot of little decisions that add up.
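To make the logic of that exercise concrete, here is a minimal sketch. It assumes you already have a grid of estimates, one for every pairing of a team’s dataset with a team’s analysis plan. The numbers below are invented for illustration, and the variance comparison is my own simplification, not necessarily the procedure used in the paper.

```python
from statistics import mean, pvariance

def crossed_spread(estimates):
    """estimates[d][a] is the estimate from analysis plan a run on dataset d.

    Returns (data_spread, analysis_spread): the average variance when only
    the dataset changes, and when only the analysis plan changes.
    """
    n_plans = len(estimates[0])
    # Hold the analysis plan fixed, vary the dataset (walk down each column).
    data_spread = mean(
        [pvariance([row[a] for row in estimates]) for a in range(n_plans)]
    )
    # Hold the dataset fixed, vary the analysis plan (walk along each row).
    analysis_spread = mean([pvariance(row) for row in estimates])
    return data_spread, analysis_spread

# Hypothetical grid: 3 teams' datasets (rows) x 3 teams' analysis plans (columns).
grid = [
    [-0.02, 0.01, 0.03],
    [0.00, 0.03, 0.05],
    [-0.04, -0.01, 0.01],
]
data_spread, analysis_spread = crossed_spread(grid)
share_from_data = data_spread / (data_spread + analysis_spread)
print(f"share of spread from dataset choices: {share_from_data:.0%}")
```

If the two spreads come out similar, neither stage dominates, which is the flavor of the “roughly half and half” finding.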
More importantly, while it’s true that any scientific study should not be viewed as the last word on anything, studies still do give us signals about what might be true. And the signals add up.
Looking at the above results, while I am not certain of anything, I come away thinking it’s slightly more likely than not that compulsory schooling reduces teenage pregnancy, pretty likely that dark-skinned soccer players get more red cards, and likely that there is no simple meaningful relationship between immigration and views on government social policy. Given that most of the decisions are defensible, I go with the results that show up more often than not.
And sometimes, the results are pretty compelling. Earlier, I mentioned that Huntington-Klein et al. (2021) actually investigated two hypotheses. In the second, they ask researchers to look at the effect of employer-provided healthcare on entrepreneurship. The key to the research design is that in the US, people become eligible for publicly provided health insurance (Medicare) at age 65. But people’s personalities and opportunities tend to change more slowly and idiosyncratically - they don’t suddenly change on your 65th birthday. So the study looks at how rates of entrepreneurship compare between groups just older than the 65 threshold and those just under it. Again, researchers have to build a dataset from publicly available data. Again, every team made different decisions, such that no two datasets are exactly alike. Again, researchers must decide exactly how to test the hypothesis, and again they choose slight variations in how to test it. But this time, at least, the estimated effects line up reasonably well.
I think this is pretty compelling evidence that there’s something really going on here - at least for the time and place under study.
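The comparison around the age-65 threshold can be sketched in a few lines. This is only the crudest version of the idea (a difference in means within a narrow age window), run on an invented toy sample; the function name and parameters are hypothetical and it is not the estimator any of the teams actually used.

```python
def threshold_gap(people, cutoff=65, bandwidth=2):
    """Mean outcome just above the cutoff minus mean outcome just below it.

    people: list of (age, outcome) pairs, outcome = 1 if self-employed.
    """
    below = [y for age, y in people if cutoff - bandwidth <= age < cutoff]
    above = [y for age, y in people if cutoff <= age < cutoff + bandwidth]
    return sum(above) / len(above) - sum(below) / len(below)

# Invented toy sample: (age, 1 if self-employed else 0).
sample = [(63, 0), (63, 1), (64, 0), (64, 0), (65, 1), (65, 0), (66, 1), (66, 1)]
print(threshold_gap(sample))  # 0.75 - 0.25 = 0.5
```

Note that the bandwidth (here two years on each side) is itself a judgment call of exactly the kind on which teams could differ.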
And it isn’t necessary to have teams of researchers generate the above kinds of figures. “Multiverse analysis” asks researchers to explicitly consider how their results change under all plausible changes to the data and analysis; essentially, it asks individual teams to try to behave like a set of teams. In economics (and I’m sure in many other fields - I’m just writing about what I know here), something like this is supposedly done in the “robustness checks” section of a paper. In this part of a study, the researchers show how their results are or are not robust to alternative data and analysis decisions. The trouble has long been that robustness checks have been selective rather than systematic; the fear is that researchers highlight only the robustness checks that make their core conclusion look good and bury the rest.
But I wonder if this is changing. The robustness checks section of economics papers has been steadily ballooning over time, contributing to the novella-like length of many modern economics papers (the average length rose from 15 pages to 45 pages between 1970 and 2012). Some papers are now beginning to include figures like the following, which show how the core results change when assumptions change and which closely mirror the results generated by multiple-analyst papers. Notably, this figure includes many sets of assumptions that show results that are not statistically different from zero (the authors aren’t hiding everything).
Economists complain about how difficult these requirements make the publication process (and how unpleasant they make it to read papers), but the multiple-analyst work suggests it’s probably still a good idea, at least until our “methodological technology” catches up so that you don’t have a big spread of results when you make different defensible decisions.
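The mechanics of a multiverse analysis are simple to sketch: list every judgment call, enumerate every combination of options, and re-run the estimate for each. The example below uses synthetic data and three made-up decisions (the field names, thresholds, and the difference-in-means estimator are all assumptions for illustration), so treat it as a toy, not a template for any of the studies above.

```python
import itertools
import random
from statistics import mean

random.seed(0)
# Synthetic, purely illustrative records.
records = []
for _ in range(500):
    treated = random.random() < 0.5
    flagged = random.random() < 0.1
    outcome = random.gauss(0.2 if treated else 0.0, 1.0)
    if flagged and treated:
        outcome += 2.0  # a small subgroup that can tilt the estimate
    records.append({"treated": treated, "flagged": flagged,
                    "age": random.randint(13, 19), "outcome": outcome})

# Each key is a judgment call; each value lists the defensible options.
decisions = {
    "drop_flagged": [False, True],
    "min_age": [13, 14],
    "trim_extremes": [False, True],
}

def estimate(data, drop_flagged, min_age, trim_extremes):
    """Difference in mean outcome, treated minus control, under one set of choices."""
    rows = [r for r in data if r["age"] >= min_age]
    if drop_flagged:
        rows = [r for r in rows if not r["flagged"]]
    if trim_extremes:
        rows = [r for r in rows if abs(r["outcome"]) < 3.0]
    treated = [r["outcome"] for r in rows if r["treated"]]
    control = [r["outcome"] for r in rows if not r["treated"]]
    return mean(treated) - mean(control)

def multiverse(data, decisions):
    """Run the estimate once for every combination of decision options."""
    names = list(decisions)
    results = []
    for combo in itertools.product(*(decisions[n] for n in names)):
        spec = dict(zip(names, combo))
        results.append((spec, estimate(data, **spec)))
    return results

for spec, effect in sorted(multiverse(records, decisions), key=lambda x: x[1]):
    print(f"{effect:+.3f}  {spec}")
```

Three binary decisions already give eight specifications; the count doubles with every additional two-option choice, which is part of why robustness sections keep growing.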
Lastly, even if expertise does not seem to be associated with consensus at the outset, this doesn’t mean that expertise and the quality of argument don’t matter at all. One other hopeful sign is that learning and a reduction in the spread of answers does seem to be possible when researchers are allowed to give feedback. Menkveld et al. (2021) design their study to simulate the kind of feedback researchers typically get, and then look to see if the dispersion of results converges over time. First, researchers do their work independently. Next, they each receive peer evaluations from outside experts and have a chance to revise their results in response to feedback. After this stage, they are shown the top five papers, based on peer evaluation scores, and given the opportunity to revise their results again. Lastly, they are allowed to report their preferred results, even if those results are someone else’s. The following figure shows how the dispersion of results changes across all four stages. While some extreme outliers appear immune to persuasion, the whiskers (which span the 2.5th to 97.5th percentiles) and boxes (which span the 25th to 75th percentiles) mostly narrow as feedback is received.
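For readers who want to see what those whiskers and boxes measure, here is a small sketch that computes both ranges for each stage. The estimates are randomly generated with noise that shrinks by assumption, purely to show the calculation; they are not the study’s data.

```python
import random
from statistics import quantiles

def spread(estimates):
    """Widths of the 2.5-97.5 and 25-75 percentile ranges of the estimates."""
    # n=40 yields 39 cut points in 2.5% steps: index 0 is 2.5%, 9 is 25%,
    # 29 is 75%, and 38 is 97.5%.
    q = quantiles(estimates, n=40, method="inclusive")
    return {"whisker_width": q[38] - q[0], "box_width": q[29] - q[9]}

# Invented estimates for four stages, with noise shrinking by assumption.
random.seed(1)
stages = {
    f"stage {k}": [random.gauss(0.0, sd) for _ in range(164)]
    for k, sd in enumerate([1.0, 0.8, 0.6, 0.5], start=1)
}
for name, estimates in stages.items():
    print(name, spread(estimates))
```

Because both ranges ignore the most extreme values, a few teams that never budge will not stop these measures from narrowing.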
Silberzahn et al. (2018) also include a staged review process, and find similar, albeit weaker, evidence of a convergence of beliefs. In their case, they ask participants their subjective beliefs about bias, and find the standard deviation in these beliefs drops after a group discussion, though again, those at the extremes appear difficult to move. This provides at least some evidence that experts can come to agree given space to debate each other’s methods and assumptions, as we would expect happens in science as actually practiced. Of course, these results do not tell us how much publication bias might hinder this process.
More broadly, I take away three things from this literature:
Failures to replicate are to be expected, given the state of our methodological technology, even in the best circumstances, even if there’s no publication bias.
Form your ideas based on suites of papers, or entire literatures, not primarily on individual studies.
There is plenty of randomness in the research process for publication bias to exploit.
Huntington-Klein, Nick, Andreu Arenas, Emily Beam, Marco Bertoni, Jeffrey R. Bloem, Pralhad Burli, et al. 2021. The influence of hidden researcher decisions in applied microeconomics. Economic Inquiry 59: 944-960. https://doi.org/10.1111/ecin.12992
Silberzahn, Raphael, Eric L. Uhlmann, Daniel P. Martin, et al. 2018. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Advances in Methods and Practices in Psychological Science 1(3): 337-356. https://doi.org/10.1177/2515245917747646
Breznau, Nate, Eike Mark Rinke, Alexander Wuttke, Muna Adem, Jule Adriaans, Amalia Alvarez-Benjumea, Henrik K. Andersen, et al. 2022. Observing Many Researchers Using the Same Data and Hypothesis Reveals a Hidden Universe of Uncertainty. PNAS 119 (44): e2203150119. https://doi.org/10.1073/pnas.2203150119
Engzell, Per. 2023. A universe of uncertainty hiding in plain sight. PNAS 120(2): e2218530120. https://doi.org/10.1073/pnas.2218530120
Menkveld, Albert J., Anna Dreber, Felix Holzmeister, Juergen Huber, Magnus Johannesson, Michael Kirchler, Michael Razen, Utz Weitzel, David Abad, et al. 2021. Non-Standard Errors. University of St. Gallen, School of Finance Research Paper No. 2021/17. http://dx.doi.org/10.2139/ssrn.3961574
Bastiaansen, Jojanneke A., Yoram K. Kunkels, Frank J. Blaauw, Steven M. Boker, Eva Ceulemans, Meng Chen, Sy-Miin Chow, et al. 2020. Time to get personal? The impact of researchers choices on the selection of treatment targets using the experience sampling methodology. Journal of Psychosomatic Research 137: 110211. https://doi.org/10.1016/j.jpsychores.2020.110211
Schweinsberg, Martin, Michael Feldman, Nicola Staub, Olmo R. van den Akker, et al. 2021. Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis. Organizational Behavior and Human Decision Processes 165: 228-249. https://doi.org/10.1016/j.obhdp.2021.02.003
Auspurg, Katrin, and Josef Brüderl. 2021. Has the Credibility of the Social Sciences Been Credibly Destroyed? Reanalyzing the “Many Analysts, One Data Set” Project. Socius: Sociological Research for a Dynamic World 7: 1-14. https://doi.org/10.1177/2378023121102442
Swenson, Isaac, Jason M. Lindo, and Krishna Regmi. 2020. Stable Income, Stable Family. NBER Working Paper 27753. https://doi.org/10.3386/w27753