Several strands of evidence on the size of publication bias
Publication bias is when academic journals make publication of a paper contingent on the results obtained. Typically, though not always, it’s assumed that means results that indicate a statistically significant correlation are publishable, and those that don’t are not (where statistically significant typically means a result that would be expected to occur by random chance less than 5% of the time). A rationale for this kind of preference is that the audience of journals is looking for new information that leads them to revise their beliefs. If the default position is that most novel hypotheses are false, then a study that fails to find evidence in favor of a novel hypothesis isn’t going to lead anyone to revise their beliefs. Readers could have skipped it without it impacting their lives, and journals would do better by highlighting other findings. (Frankel and Kasy 2021 develops this idea more fully)
But that kind of policy can generate systematic biases. In another post, I looked at a number of studies that show when you give different teams of researchers the same narrow research question and the same dataset, it is quite common to get very different results. For example, in Breznau et al. (2021), 73 different teams of researchers are asked to answer the question “does immigration lower public support for social policies?” Despite each team being given the same initial dataset, the results generated by these teams spanned the whole range of possible conclusions, as indicated in the figure below: in yellow, more immigration led to less support for social policies, in blue the opposite, and in gray there was no statistically significant relationship between the two.'
Seeing all the results in one figure like this tells us, more or less, that there is no consistent relationship between immigration and public support for social policies (at least in this dataset). But what if this had not been a paper reporting on the coordinated efforts of 73 research teams, committed to publishing whatever they found? What if, instead, this paper was a literature review reporting on the published research of 73 different teams that tackled this question independently? And what if these teams could only get their results published if they found the “surprising” result that more immigration leads to more public support for social policies? In that case, a review of the literature would generate a figure like the following:
In this case, the academic enterprise would be misleading you. More generally, if we want academia to develop knowledge of how the world works that can then be spun into technological and policy applications, we need it to give us an accurate picture of the world.
On the other hand, maybe it’s not that bad. Maybe researchers who find different results can get their work published, though possibly in lower-ranked journals. If that’s the case, then a thorough review of the literature could still recover the full distribution of results that we originally highlighted. That’s not so hard in the era of google scholar and sci-hub.
So what’s the evidence? How much of an issue is publication bias?
As a first step, let’s consider some papers that use experiments to explicitly see whether reviewers treat papers differently, depending on the results. In each of these papers, reviewers receive descriptions of papers (or actual papers) that are basically identical, except for the results. For one random set of reviewers, the papers (or descriptions of papers) obtain statistically significant results; in the other, these results are changed to be statistically insignificant. But as much as possible, what the reviewers see is otherwise unchanged. The papers then compare the recommendations and ratings of the two groups of reviewers to see if the non-significant results are rated more poorly or given lower recommendations than the significant ones.
We have three papers from different fields. Emerson et al. (2010) has 110 actual reviewers of papers of orthopedic journals do a standard peer review of one of two different fictitious papers, each of which are identical but for the results. Berinsky et al. (2021) email short descriptive vignettes of research papers to all faculty in US political science departments that grant PhDs and have respondents fill out surveys about these vignettes, getting about 1,000 responses. Similarly, Chopra et al. (2022) get responses on short descriptive vignettes of economics papers from about 500 responses from economists at top 200 departments. These studies varied a bit in exactly how they measured support for publication and what other dimensions they studied, but in all cases reviewers believed papers with statistically significant results were better candidates for publication. The figure below tracks, in dark blue, the probability a given reviewer would support publication among the reviewers who saw a statistically significant finding, while light blue illustrates the same for reviewers who saw a statistically insignificant result of an otherwise identical paper.
To emphasize - the only difference in the papers or paper vignettes that respondents read in the above figure was whether the result was described as statistically significant or not. Holding everything else fixed - the research question, the methodology, the quality of the writing, the sample size, etc - reviewers were less likely to recommend the versions of the papers that found non-significant results be published.
That said, in the above studies, the results are hypothetical: we’re talking about fictitious papers, and in two of the three studies, reviewers were being asked to do an unfamiliar task (reviewers do not typically review short descriptions of papers). Is this a problem that we see outside these artificial situations?
Ideally, what we would like to do is follow a large number of roughly similar research projects from inception to completion, and observe whether the probability of publication depends on the results of the research. In fact, there are several research settings where something quite close to this is possible. This is because many kinds of research require researchers to submit proposals or pre-register studies before they can begin investigation. A large literature attempts to track the subsequent history of such projects to see what happens to them.
Dwan et al. (2008) reports on the results of 8 different healthcare meta-analyses of this type, published between 1991 and 2008. The studies variously draw on the pool of proposals submitted to different institutional review boards, or drug trial registries, focusing on randomized control trials where a control group is given some kind of placebo and a treatment group some kind of treatment. The studies then perform literature searches and correspond with the authors of these studies to see if the proposals or pre-registered trials end up in the publication record.
In every case, results are more likely to be published if they are positive than if the results are negative or inconclusive. In some cases, the biases are small. One study (Dickersin 1993) of trials approved for funding by the NIH found 98% of positive trial outcomes were published, but only 85% of negative results were (i.e., positive results were 15% more likely to be published). In other cases, they’re quite large. In Decullier (2005)’s study of protocols submitted to the French Research Ethics Committees, 69% of positive results were published, but only 28% of negative or null results were (i.e., positive results were about 2.5 times more likely to be published).
These kinds of studies are in principle possible for any kind of research project that leaves a paper trail, but they are most common in medicine (which has these institutional review boards and drug trial registries). However, Franco, Malhotra, and Simonovits (2014) are able to perform a similar exercise for 228 studies in the social sciences that relied on the NSF’s Time-sharing Experiments in the Social Sciences program. Under this program, researchers can apply to have questions added to a nationally representative survey of US adults. Typically, the questions are manipulated so that different populations receive different versions of the question (i.e., different question framing or visual stimulus), allowing the researchers to conduct survey-based experiments. This paper approaches the ideal in several ways:
The proposals are vetted for quality via a peer review process, ensuring all proposed research question clear some minimum threshold
The surveys are administered in the same way by the same firm in most cases, ensuring some degree of similarity of methodology across studies
All the studies are vetted to ensure they have sufficient statistical power to detect effects of interest
That is, in this setting, what really differentiates research projects is the results they find, rather than the methods or sample size.
Franco, Malhotra, and Simonovits start with the set of all approved proposals from 2002-2012 and then search through the literature to find all published papers or working papers based on the results of this NSF program survey. For more than 100 results, they could find no such paper, and so they emailed the authors to ask what happened and also to summarize the results of the experiment.
The headline result is that 92 of 228 studies got strong results (i.e., statistically significant, consistent evidence in favor of the hypothesis). Of these, 62% were published. In contrast, 49 studies got null results (i.e., no statistically significant difference between the treatment and control groups). Of these, just 22% were published. Another 85 studies got mixed results, and half of these were published. In other words, studies that got strong results were 2.8 times as likely to be published as those that got null results.
The figure below plots the results from both studies, with the share of studies with positive results that were published on the vertical axis and the share of studies with negative results that were published on the horizontal axis. The blue dots correspond to 8 different healthcare meta-analyses from Dwan et al. (2008), and the orange ones to four different categories of social science research in Franco, Malhotra, and Simonovits (2014). For there to be no publication bias, we would expect dots to lie close to the 45-degree line, on either side. Instead, we see they all lie above it, indicating positive results are more likely to be published than negative ones. The worst offenders are in the social sciences. Of the 60 psychology experiments Franco, Malhotra, and Simonovits study, 76% of positive results resulted in a publication, compared to 14% of null results. In the 36 sociology experiments, 54% of positive results got a publication and none of the null results did.
Franco, Malhotra, and Simonovits (2014) has an additional interesting finding. As we have seen, most null results do not result in a publication for this type of social science. But in fact, of the 49 studies with null results, 31 were never submitted to any journal, and indeed were never even written up. In contrast, just 4 of 98 positive results were not written up. In fact, if you restrict attention only to results that get written up, 11 of 18 null results that are written up get published - 61% - and 57 of 88 strong results that get written up get published - 64%. Almost the same, once you make the decision to write them up.
So why don’t null results get written up and submitted for publication? Franco, Malhotra, and Simonovits got detailed explanations from 26 researchers. Most of them (15) said they felt the project had no prospect for publication given the null results. But in other cases, the results just took a backseat to other priorities, such as other projects. And in two cases, the authors eventually did manage to publish on the topic… by getting positive results from a smaller convenience sample(!).
This echoes the results of Brodeur et al. (2023), which conducted an anonymous survey of authors who published in top 25 economics journals in 2018, using common empirical methods. Of the 143 respondents, 30% said they had stopped a study after getting null results; about the same number also said they refrained from seeking to publish the results of a study that found null results.
Why is this interesting? It shows us that scientists actively change their behavior in response to their beliefs about publication bias. In some cases, they just abandon projects that they feel face a hopeless path to publication. In other cases, they might change their research practices to try and get a result they think will have a better shot at publication, for example by re-doing the study with a new sample. The latter is related to p-hacking, where researchers bend their methods to generate publishable results. (A bit more on that here)
These kinds of ideal studies are great when they are an option, but in most cases we won’t be able to observe the complete set of published and unpublished (and even unwritten) research projects. However, a next-best option is when we have good information about what the distribution of results should look like, in the absence of publication bias. We can then compare the actual distribution of published results to what we know it should look like to get a good estimate of how publication bias is distorting things.
Among other things, Andrews and Kasy (2019) show how to do this using data on replications. To explain the method, let’s think of a simple model of research. When we do empirical research, in a lot of cases what’s actually going on is we get our data, clean it up, make various decisions about analysis, and end up with an estimate of some effect size, as well as an estimate of the precision of our estimate - the standard error. Andrews and Kasy ask us to ignore all that detail and just think of research as like drawing a random “result” (an estimated effect size and standard error) from an unknown distribution of possible results. The idea they’re trying to get at is just that if we did the study again and again, we might get different results each time, but if we did it enough we would see they cluster around some kind of typical result, which we could estimate with the average. In the figure below, results are clustered around an average effect size of 1.2 and an average standard error 0.5.
If we weren’t worried about publication bias, then we could just look at all these results and get a pretty good estimate of the “truth.” But because of publication bias we’re worried that we’re not seeing the whole picture. Maybe there is no publication bias at all and the diagram above gives us a good estimate. Or maybe results that are not statistically significant are only published 10% of the time. That would look something like this, with the light gray dots results that aren’t published in this example.
In this case, the unpublished results tend to be closer to zero for any given level of standard error. By including them, the average effect size is cut down by 15%.
What Andrews and Kasy do is show is that if you have a sample of results that isn’t polluted by publication bias, you can compare that to the actual set of published results to infer the extent of publication bias. So where do you get a sample of results that isn’t polluted by publication bias? One option is from systematic replications.
Two such projects are Camerer et al. (2016)’s replication of 18 laboratory experiments published in two top economics journals between 2011 and 2014, and Open Science Collaboration (2015)’s replication of 100 experiments published in three top psychology journals in 2008. In both cases, the results of the replication efforts are bundled together and published as one big article. We have confidence in each case that all replication results get published; there is no selection where successful or unsuccessful replications are excluded from publication.
We then compare the distribution of effect sizes and standard errors, as published in those journals, to an “unbiased” distribution that is derived from the replication projects. For example, if half of the replicated results are not statistically significant, and we think that’s an unbiased estimate, then we should expect half of the published results to be insignificant too in the absence of publication bias. If instead we find that just a quarter of the published results are not statistically significant, that tells us significant results are three times more likely to be published, since there are three times as many significant ones as insignificant, and we would expect them to have equal probability of being significant or insignificant. Using a more sophisticated version of this general idea, Andrews and Kasy estimate that null results (statistically insignificant at the standard 5% level) are just 2-3% as likely to be published in these journals as statistically significant results!
Now, note the important words “in these journals” there. We’re not saying null results are 2-3% as likely to be published as significant results; only that they are 2-3% as likely to be published in these top disciplinary journals. As we’ve seen, the studies from the previous section indicate publication bias exists but is usually not on the order of magnitude seen here. We would hope non-significant results rejected for publication in these top journals would still find a home somewhere else (but we don’t know that for sure).
This technique works when you have systematic replication data to generate an unbiased estimate of the “true” underlying probability of getting different effect sizes and standard errors. But replications remain rare. Fortunately, there is a large literature on ways to estimate the presence of publication bias even without any data on the distribution of unpublished results. I won’t talk about this whole literature (Christensen and Miguel 2018 is a good recent overview) but I’ll talk about Andrews and Kasy’s approach here, and talk a bit about other approaches here.
Let’s go back to this example plot from the last section. Suppose we have this set of published results. How can we know if this is just what the data looks like, or if this is skewed by publication data?
As I noted before, statistically significant results are basically ones where the estimated effect size is “large” relative to imprecision of that estimate, here given by the standard error. For any point on this diagram, we can precisely calculate which results will be statistically significant and which won’t be. It looks sort of like this, with results in the red area not statistically significant.
Notice the red area of statistical insignificance is eerily unpopulated; an indication that we’re missing observations (these are called funnel plots, in the literature).
What Andrews and Kasy propose is something like this: for a given standard error, how many studies do you see in the zone of insignificant results versus outside it? For instance, look at the regions highlighted in blue in the figure below.
If we assume that without publication bias there is no systematic relationship between standard errors and effect sizes (i.e., without publication bias, a big standard error is equally likely for a big effect size as a small effect size), then we should expect to see a similar spread of effect sizes across each rectangle. In the bottom one, they range pretty evenly from -1 to 3, but in the top one they are all clustered on around 2. Also, in the top area, most of the area lies in the “red zone” of statistical insignificance and so we should expect to observe a lot of statistically insignificant results there - the majority of observations, in fact. But instead, we just observe 1 out of 6 total observations statistically insignificant. This kind of information about the difference between what we should expect to see versus what we actually see can also be used to infer the strength of publication bias. In this case, instead of using replications to get at the “true” distribution, it’s like we’re using the distribution of effect sizes for precise results to help tell us about what kinds of effect sizes we should see for less precise results. After all, they should be basically the same, since usually an imprecise result is just a precise result with fewer datapoints. There isn’t anything fundamentally different about them.
Andrews and Kasy apply this methodology to Wolfson and Belman (2015), a meta-analysis of studies on the impact of the minimum wage on employment. They estimate results that find a negative impact of the minimum wage on employment (at conventional levels of statistical significance) are a bit more than 3 times as likely to be published as papers that don’t. Unlike the replication based studies, this estimate is a closer match to the results we found for the ideal data studies, where social science papers that found strong positive results were around 2.8 times as likely to be published as those that didn’t.
So, to sum up; yes - publication bias is a real thing. You can see it’s statistical fingerprints in published data. And when you’re lucky enough to have much better data, such as on the distribution of results that would occur without publication bias, or when you can actually see what happens to unpublished research, you find the same thing: positive results have an easier time getting published. And when you do explicit experiments where different populations review the same paper, with only the results changed, you again find more support for publication when results are significant.
So what do you do about it? Well, you can find ways to reduce the extent of bias, for example by getting journals to precommit to publishing papers based on the methods and the significance of the question asked, rather than the results. Or you can create avenues for the publication of non-significant findings, either in journals or as draft working papers. Recall however that Franco, Malhotra, and Simonovits found most null results weren’t even written up at all - it may be that if non-significant work can be published in some outlets but will still be ignored by other researchers, researchers might not bother to write them up and instead choose to allocate their time to pursuing research that might attract attention.
Alternatively, if you can identify publication bias, you can correct for it with statistical tools. Andrews and Kasy, as well as others, have developed ways to infer the “true” estimate of an effect by estimating the likely value of unpublished research. Indeed, Andrews and Kasy make such a tool freely available on their website.
New articles and updates to existing articles are typically added to this site every two weeks. To learn what’s new on New Things Under the Sun, subscribe to the newsletter.
Frankel, Alexander and Maximilian Kasy. Forthcoming. Which findings should be published? American Economic Journal: Microeconomics. https://www.aeaweb.org/articles?id=10.1257/mic.20190133&&from=f
Breznau, Nate, Eike Mark Rinke, Alexander Wuttke, Muna Adem, Jule Adriaans, Amalia Alvarez-Benjumea, Henrik K. Andersen, et al. 2021. Observing Many Researchers Using the Same Data and Hypothesis Reveals a Hidden Universe of Uncertainty. MetaArXiv. March 24. doi:10.31222/osf.io/cd5j9.
Emerson, Gwendolyn B., Winston J. Warme, Fredric M. Wolf, James D. Heckman, Richard A. Brand, and Seth S. Leopold. 2010. Testing for the Presence of Positive-Outcome Bias in Peer Review: A Randomized Controlled Trial. Archives of Internal Medicine 170(21): 1934-1939. https://doi.org/10.1001/archinternmed.2010.406
Berinsky, Adam J., James N. Druckman, and Teppei Yamamoto. 2021. Publication Biases in Replication Studies. Political Analysis 29: 370-384. https://doi.org/10.1017/pan.2020.34
Chopra, Felix, Ingar Halland, Christopher Roth, and Andreas Stegmann. 2022. The Null Results Penalty. CESifo Working Paper 9776. https://dx.doi.org/10.2139/ssrn.4127663
Dwan, Kerry, Douglas G. Altman, Juan A. Arnaiz, Jill Bloom, An-Wen Chan, Eugenia Cronin, et al. 2008. Systematic Review of the Empirical Evidence of Study Publication Bias and Outcome Reporting Bias. PLoS ONE 3(8): e3081. https://doi.org/10.1371/journal.pone.0003081
Franco, Annie, Neil Malhotra, and Gabor Simonovits. 2014. Publication bias in the social sciences: Unlocking the file drawer. Science 345(6203): 1502-1505. DOI: 10.1126/science.1255484
Brodeur, Abel, Scott E. Carrell, David N. Figlio, and Lester R. Lusher. 2023. Unpacking p-hacking and Publication Bias. NBER Working Paper 31548. https://doi.org/10.3386/w31548
Andrews, Isaiah, and Maximilian Kasy. 2019. Identification of and Correction for Publication Bias. American Economic Review 109(8): 2766-94. https://doi.org/10.1257/aer.20180310
Camerer, Colin F., Anna Deber, Eskil Forsell, Teck-Hua Ho, Jürgen Huber, Magnus Johanson et al. 2016. Evaluating replicability of laboratory experiments in economics. Science 351(6280): 1433-1436. https://doi.org/10.1126/science.aaf0918
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349(6251) aac4716. https://doi.org/10.1126/science.aac4716
Christensen, Garret, and Edward Miguel. 2018. Transparency, Reproducibility, and the Credibility of Economics Research. Journal of Economic Literature 56(3): 920-80. https://doi.org/10.1257/jel.20171350
Wolfson, Paul J., and Dale Belman. 2015. 15 years of research on U.S. employment and the minimum wage. Tuck School of Business Working Paper No. 2705499. http://dx.doi.org/10.2139/ssrn.2705499