Physical sciences seem to have fewer issues - is that due to empirical tools, theory, or both?
Publication bias is real. In the social sciences, more than one study finds that statistically significant results seem to be about three times more likely to be published than insignificant ones. Some estimates from medicine aren’t so bad, but a persistent bias in favor of positive results remains. What about science more generally?
To answer that question, you need a way to measure bias across fields that might be very different in terms of their methodology. One way is to look for a correlation between the size of the standard error and the estimated size of an effect. In the absence of publication bias, there shouldn’t be a relationship between these two things. To see why, suppose your study has a lot of data points. In that case, you should be able to get a very precise estimate that is close to the actual population average of the thing you’re studying. On the other hand, if your study has very few data points, you’ll get a very imprecise estimate, including a high probability of getting something very much bigger than the actual population mean and a high probability of getting something very much smaller. But over lots of studies, if there’s no publication bias, you’ll get some abnormally high estimates, some abnormally low ones, but in the end they’ll cancel each other out. If, however, small estimates are systematically excluded from publication, then you’ll end up with a robust correlation between the size of your standard errors and the size of your effects. The extent of this correlation is a way to measure the extent of publication bias in a given literature.
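To make that logic concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the true effect of 0.1, the range of standard errors (standing in for studies with different amounts of data), and a crude selection rule under which only positive, statistically significant estimates get published. It is not the procedure used in any of the papers discussed here.

```python
import random
import statistics


def simulate_literature(n_studies, true_effect, publish_only_significant, seed=0):
    """Return (estimate, standard_error) pairs for a hypothetical literature."""
    rng = random.Random(seed)
    published = []
    while len(published) < n_studies:
        # Assumed spread of precision: small studies have large standard errors.
        se = rng.uniform(0.05, 0.5)
        estimate = rng.gauss(true_effect, se)
        # Assumed selection rule: only positive, significant results are published.
        if publish_only_significant and estimate / se < 1.96:
            continue
        published.append((estimate, se))
    return published


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def se_effect_correlation(studies):
    """Correlation between standard errors and estimated effects."""
    estimates = [e for e, _ in studies]
    errors = [s for _, s in studies]
    return pearson(errors, estimates)


unbiased = simulate_literature(5000, 0.1, publish_only_significant=False)
biased = simulate_literature(5000, 0.1, publish_only_significant=True)
print("no selection:  ", round(se_effect_correlation(unbiased), 2))
print("with selection:", round(se_effect_correlation(biased), 2))
```

Without selection, the correlation should hover near zero, because the over- and under-estimates from imprecise studies cancel out. With selection, imprecise studies only show up when they happen to produce large estimates, so the correlation turns strongly positive.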
A second approach leverages the fact that journals have strict but essentially arbitrary standards for what constitutes a statistically significant result. If a random process could generate data that looks like yours 4% of the time, purely by chance, that’s a significant finding (using the conventional 5% significance cut-off). But if a random process would generate data that looks like yours 6% of the time, it’s not. Nature, however, has no reason to obey these thresholds. Out there in the real world, we’ll probably observe patterns in data that would arise purely by chance 4% of the time about as often as we observe patterns that arise by chance 6% of the time. If the published literature shows lots of 4% datasets and very few 6% datasets, that’s a sign that studies without significant results aren’t being published.
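A rough sketch of that comparison: count how many reported p-values land in a narrow window just below the cut-off and how many land just above it. The function names, the window width of 0.01, and the p-values themselves are all made up for illustration; real tests of this kind are more careful about how wide the window is and how p-values are rounded.

```python
def caliper_counts(p_values, threshold=0.05, width=0.01):
    """Count p-values just below and just above the significance threshold."""
    just_below = sum(1 for p in p_values if threshold - width <= p < threshold)
    just_above = sum(1 for p in p_values if threshold <= p < threshold + width)
    return just_below, just_above


def caliper_ratio(p_values, threshold=0.05, width=0.01):
    """Ratio of just-significant to just-insignificant results."""
    below, above = caliper_counts(p_values, threshold, width)
    return below / above if above else float("inf")


# Hypothetical p-values from a published literature (invented numbers).
example_p_values = [0.041, 0.044, 0.048, 0.049, 0.043, 0.052, 0.2, 0.01]

below, above = caliper_counts(example_p_values)
print("just below 5%:", below)
print("just above 5%:", above)
print("ratio:", caliper_ratio(example_p_values))
```

If nature really is indifferent to the threshold, the ratio should be in the neighborhood of one. A ratio far above one is the pattern the paragraph above describes.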
Which approach is better? Well, you don’t actually have to choose. Bayesian model averaging gives you a framework to, in effect, see how well a bunch of different modeling assumptions fit the data, and then average the results over all those different models, giving more weight to the ones that better fit the data. For example, one of your models might assume there are no genuine findings in any of these papers: it’s all random noise that’s been filtered by publication bias. Another one of your models might assume there is no publication bias at all, and that there are real findings in your data. Each model implies a different distribution of results and you can see how likely it is the data you observe would arise under each set of assumptions. This Bayesian model averaging approach seems to work pretty well in simulations (where we know the “truth,” because we are simulating both the research and publication processes in the computer), and when comparing settings where we know there is no publication bias to cases where we suspect there is (for example, in settings where journals pre-commit to publish results before seeing them, we know there shouldn’t be any publication bias).1
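The averaging step itself is simple arithmetic, and a toy version may help. The sketch below assumes you already have, for each model, a log marginal likelihood (how well that model's assumptions fit the data) and an effect estimate. The two models and all the numbers are hypothetical, and this is not the actual implementation used in the papers cited here, which fit a much larger set of models.

```python
import math


def posterior_model_probs(log_marginal_likelihoods, prior_probs):
    """Turn per-model fit and prior weights into posterior model probabilities."""
    log_post = [
        ll + math.log(p) for ll, p in zip(log_marginal_likelihoods, prior_probs)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def model_average(estimates, probs):
    """Average the per-model estimates, weighted by posterior model probability."""
    return sum(e * p for e, p in zip(estimates, probs))


# Hypothetical inputs: model 0 says "no real effect, just publication bias",
# model 1 says "a real effect of 0.4, no publication bias".
log_ml = [math.log(3.0), 0.0]  # model 0 fits the data three times better
priors = [0.5, 0.5]            # both models equally plausible beforehand
effects = [0.0, 0.4]

probs = posterior_model_probs(log_ml, priors)
print("posterior model probabilities:", probs)
print("model-averaged effect:", model_average(effects, probs))
```

In this made-up case the better-fitting model gets three quarters of the weight, so the averaged effect is pulled most of the way toward zero without discarding the other model entirely.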
Bartoš et al. (2022) identify about 1,000 meta-analyses across environmental sciences, psychology, and economics, covering more than one hundred thousand individual studies (the lion’s share in economics), and another 67,000 meta-analyses in medicine that cover nearly 600,000 individual studies. For each field, they see how likely it is that different sets of assumptions would generate data displaying these patterns, and then how likely it is that each of these models is “correct.” Lastly, once they have the probability all these different models are correct, they can “turn off” publication bias in their models to infer the actual distribution of effects seen in the data, rather than just the ones that make it into journals. The figure below compares the distribution of findings in the literature (“unadjusted”) to what you find after using models to try and strip out publication bias (“adjusted”).
For their sample, publication bias in economics, environmental sciences, and psychology is much worse than in medicine (though comparing among these three depends on which measure of publication bias you prefer).
Fanelli, Costas, and Ioannidis (2017) is a somewhat older study, but it covers a broader range of fields. They obtain 1,910 meta-analyses drawn from all areas of science, and pull from these 33,355 datapoints from original underlying studies. For each meta-analysis, they compute the correlation between the standard error and the size of the estimated effect; they then do a weighted average across the different meta-analyses to generate a sort of average over the meta-analyses in fields they cover. In general, the more positive the estimate, the stronger the correlation between standard errors and effect size, implying stronger publication bias. Results below:
Note that the social sciences (up at the top) have pretty high measures of bias, estimated with a lot of precision, while many (but not all) of the biological fields also have fairly high bias. But also note the bottom two rows, which seem to exhibit no bias: computer science, chemistry, engineering, geosciences, and mathematics.
A downside of these approaches is that they only work in disciplines where research is primarily about measuring effect sizes with noisy data. Fanelli (2010) uses a simpler, but more flexible measure of publication bias. Fanelli analyses a random sample of 2,434 papers from all disciplines that include some variation of the phrase “test the hypothesis.” For each paper, Fanelli determined if the authors of the paper argued they had found positive evidence for their hypothesis or not (that is, they either found no evidence in favor of the hypothesis, or actually found contrary results). As a rough and ready test of publication bias, he then looked at the share of hypotheses in each field for which positive support was found. He finds that between 75 and 90% of hypotheses mentioned in published papers tend to be supported, across different disciplines. But there are some significant differences across disciplines.
Fanelli sorts papers into six categories: he first divides them into physical sciences, biological sciences, and social sciences, and then sub-divides each of these into pure science and applied science. There are no major differences among applied papers across the three domains - bias seems to be quite high in every case. But in the pure science fields, physical sciences tended to find support for less than 80% of their hypotheses while social sciences tended to find support for nearly 90% of the hypotheses investigated. Biology is in the middle.
Taken together, these three studies suggest the social sciences have bigger problems with publication bias than do the biological sciences, which tend to have more problems than the hard sciences. Why?
Suppose the root cause of publication bias is that journals want to highlight notable research, in order to be relevant to their readership.2 There are at least two different ways this can lead to publication bias, depending on what journals view as “notable” research.
First, it might be that journals consider surprising results to be the most notable. After all, if we’re not surprised by research, doesn’t that imply we already sort-of knew the result? And what would be the point of that? But this leads to publication bias if results that challenge the prevailing wisdom are easier to publish than results that support it. In aggregate the weight of evidence is distorted because we do not observe the bulk of the boring evidence that just supports the conventional wisdom.
This could lead to variation in publication bias across fields if fields vary in the breadth of what is considered surprising. For example, we could imagine one field that is very theoretically contested, with different theories making very different predictions. In that field, perhaps everything is surprising in light of some theory and so most results are publishable. In this field, we might not observe much evidence of publication bias. In another field (social science?), perhaps there is an unstated assumption that most hypotheses are false and so null results are perceived as boring and hence difficult to publish. In this field, we would observe a lot of evidence of publication bias.
A second way that a preference for notable research can lead to bias has to do with a field’s skepticism towards its own empirical methods. Suppose you have a theory that predicts a positive relationship between two variables, but when you test the theory you get a null result. That’s a surprising result, and so under the first theory of publication bias it’s notable and therefore more attractive to publishers. But your willingness to recommend such a paper for publication might depend on whether you think the result is reliable. If you trust the empirical methods are super reliable - you believe that if you replicated the methods you would get the same result - then you might recommend it for publication. But if you are working in a field where you know empirical work is very hard to do right, then it becomes a lot more likely that the surprising result merely reflects inadequacies in the test.
Fields do seem to differ substantially in the reliability of their empirical methods. Fields differ in how feasible it is to precisely measure phenomena of interest, rather than noisy proxies. Fields differ in their ability to run tightly controlled repeatable experiments and isolate specific phenomena of interest from a web of complex interactions. Fields differ in the number of observations they have to work with. It might be that fields with imprecise empirical methods are much more hesitant to publish null results, because there is much less signal imputed to null results. Nine times out of ten, a null result just means the data was too noisy to see the signal.
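The claim that a null result carries less signal when methods are noisy can be put into a back-of-the-envelope Bayes' rule calculation. The sketch below is my own illustration, not something taken from the papers discussed here. It assumes a simple world where an effect is either real or not, a study detects a real effect with probability equal to its statistical power, and a study falsely "detects" a nonexistent effect with probability alpha.

```python
def prob_effect_given_null(prior_effect, power, alpha=0.05):
    """Probability the effect is real even though the study found a null result."""
    p_null_given_effect = 1.0 - power      # a real effect was missed
    p_null_given_no_effect = 1.0 - alpha   # correctly found nothing
    numerator = prior_effect * p_null_given_effect
    denominator = numerator + (1.0 - prior_effect) * p_null_given_no_effect
    return numerator / denominator


# Hypothetical reviewer who thinks the effect is 50/50 before reading the paper.
noisy_field = prob_effect_given_null(prior_effect=0.5, power=0.2)
precise_field = prob_effect_given_null(prior_effect=0.5, power=0.95)
print("noisy methods:  ", round(noisy_field, 3))
print("precise methods:", round(precise_field, 3))
```

Under these made-up numbers, a null result from a low-powered study leaves the reviewer at roughly 46%, barely changed from the 50% they started with, while a null result from a high-powered study drops them to 5%. In the noisy field, the null result just isn't telling the reviewer much.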
A field with very reliable empirical methods might be more willing to publish null results than a field where empirical work is “more art than science.” A tragic outcome of this theory of publication bias is that it predicts publication bias will be worst in the fields with the weakest empirical methods, exacerbating the already rough state of empirical work in those fields!
These theories of publication bias make somewhat different predictions. Under the surprise theory, null results will be easier to publish when positive results are expected, because that’s when null results are most surprising. Under the skepticism theory, null results will be harder to publish when positive results are expected, because that’s when it’s more likely that the researcher messed up the test. That said, it could be that both theories are true to some degree, and pull in different directions.
To begin, we can turn to a strand of literature that conducts experiments on the ultimate source of publication bias: the people helping to decide what gets published. In these experiments, a set of randomly selected peer reviewers or editors view one version of a description of a (fictitious) research project and another set of peer reviewers view a description of the same project, but with the results changed from a positive result to a null result. We then see if those shown the version with a null result rate the project as less publishable than the one with a positive result.
For example, in Chopra et al. (2022), about 500 economists assess a series of short descriptive vignettes of research projects, some of which describe positive results and some of which describe null results. As discussed here, indeed, positive results were more likely to be recommended for publication than negative ones. But Chopra and coauthors introduce a second experiment too: within each of these groups, some participants additionally see information on what a survey of experts expects the result to be. That lets us see how our expectations about what a result “should” be informs its perceived suitability for publication. As discussed above, this can help us distinguish between the surprise and skepticism theories of publication bias. Are null results easier or harder to publish when reviewers are told the result is expected to be positive?
Chopra et al. (2022) finds that a null result is extra unlikely to be recommended for publication if it flies in the face of the expert consensus, consistent with the skepticism theory of publication bias. To provide some further insight into what’s going on, Chopra and coauthors run a second experiment with about 100 economics graduate students and early career researchers, again showing them research vignettes that differ across groups in terms of whether the main results are statistically significant. But this time, they directly ask their respondents to rate the statistical precision of the main result on a scale from 1 to 5, where 1 is “very imprecisely estimated” and 5 is “very precisely estimated.” Even though the statistical precision is the exact same in the positive and null result versions of the vignettes, respondents rated the precision of the null results as significantly lower.
That suggests to me that when economists like myself see an unexpected null result, we tend to think that’s probably due to a weak study design and hence not very informative about the true state of the world.
Another paper in this literature provides additional evidence that reviewers see null results as indicative of something going wrong in the paper’s methods. Emerson et al. (2010) sends 110 reviewers at 2 leading orthopedic journals an article for review: roughly half the reviewers get a version of the paper with a positive result; the others get a basically identical paper, except the main result is negative. As with Chopra et al. (2022), reviewers were more likely to recommend acceptance for the version with a positive result.
As with Chopra and coauthors, this seems to be because reviewers didn’t trust null results. Despite the fact that each paper had a word-for-word identical “methods” section, reviewers who reviewed a version of the paper with positive results rated the paper’s methodological validity at 8.2 out of 10, while those who saw the version with a null result rated the validity at 7.5 out of 10 (a difference that was statistically significant). Emerson and coauthors also embed another clever check inside their experiment: in each paper they purposefully inserted various errors and then they read the review reports to see how many of these errors were detected by reviewers. The set of errors was identical across both sets of papers, but on average reviewers of the null result version of the paper detected 0.8 of them, while reviewers of the positive result version detected 0.4. That suggests reviewers of the null result papers were reading more skeptically, with an eye towards seeing whether the null result reflected ground truth or merely weak methods.
Another way we can see how publication is affected by our expectation of what a result “should” be is with replication studies. Specifically, is it easier to publish a replication study if it finds results that are the same as the original study or different? In this case the results are a bit more in favor of the surprise theory of publication bias.
Berinsky et al. (2021) test this, also by providing short descriptions of research projects to reviewers, this time in political science departments. Once again, some reviewers see versions of these descriptions with positive results, others with negative results. But Berinsky and coauthors also look specifically at the willingness to publish replications of prior work. Just as is the case with original work, they find a bias against replications that get null results. But in this case the effect is slightly ameliorated if the original study had a positive finding. In other words, in the specific case of replications, if people “expect” a positive result, because the original study was positive, they are slightly more likely to be willing to publish a null result than would be the case if the original study was also a null result. However, the opposite does not seem to be the case: a replication that gets a positive result is just as likely to be published, whether the original study is positive or a null result.
Experiments are great for really isolating a particular causal channel, but they can be a bit artificial or may not be generalizable. So let’s turn to one of the few studies that tries to look at this question with observational data.
Doucouliagos and Stanley (2013) looks at 87 different meta-analyses from empirical economics and measures the extent of publication bias in each of the literatures covered using the approach discussed at the beginning of this post, where standard errors are compared with effect sizes. In the figure below, they classify anything smaller than 1 as exhibiting little to modest selection bias, anything between 1 and 2 as exhibiting substantial selection bias, and anything over 2 as exhibiting severe selection bias. They find there are plenty of results in each category.
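To give a feel for how a number like this can be produced and read, here is a sketch of one common regression-based version of the standard-error approach: regress each study's t-statistic on its precision (one over the standard error) and treat the intercept as the selection coefficient. That regression form is my assumption about the measure, not something confirmed above. Only the cut-offs of 1 and 2 come from the text, and the data are invented.

```python
def ols_intercept_slope(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def selection_coefficient(estimates, standard_errors):
    """Intercept from regressing t-statistics on precision (assumed measure)."""
    t_stats = [e / s for e, s in zip(estimates, standard_errors)]
    precisions = [1.0 / s for s in standard_errors]
    intercept, _ = ols_intercept_slope(precisions, t_stats)
    return intercept


def classify_selection(coefficient):
    """Apply the cut-offs of 1 and 2 described in the text."""
    size = abs(coefficient)
    if size < 1:
        return "little to modest"
    if size <= 2:
        return "substantial"
    return "severe"


# Invented literature in which estimates grow with the standard error.
ses = [0.05, 0.1, 0.2, 0.4]
inflated = [0.1 + 2.5 * s for s in ses]
coef = selection_coefficient(inflated, ses)
print(round(coef, 2), classify_selection(coef))
```

In this invented literature the estimates rise by 2.5 for every unit increase in the standard error, so the coefficient comes out at 2.5 and the literature lands in the “severe” bucket. If every study reported the same estimate regardless of its precision, the coefficient would be zero.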
What’s driving the differences? Well, one possibility is that this is driven by variation in the reliability of empirical methods across subfields in economics. For example, my subjective read on the quality of data across economic fields is that macroeconomics has the toughest time with getting lots of clean data. And Doucouliagos and Stanley do find publication bias seems to be more extreme in macroeconomics than other fields.
But Doucouliagos and Stanley (2013) is really set up to see if differences in the range of values permitted by theory explain a big chunk of the variation in publication bias across fields. How are you going to measure that though?
Doucouliagos and Stanley take a few different approaches. First, they just use their own judgement to code up each meta-analysis as pertaining to a question where theory predicts empirical results can go either way (i.e., positive or negative). Second, they use their own reading of the meta-analyses or draw on surveys (where they exist) to assess whether there is “considerable” debate around this area of research. Whereas they claim their first measure is non-controversial and that most economists would agree with how they code things, they acknowledge the second criterion is a subjective one.
By both of these measures, they find that when theory admits a wider array of results, there is less evidence of publication bias. And the effects are pretty large. A field whose theory they code as admitting positive and negative results has a lot less bias than one that doesn’t - the difference is large enough to drop from “severe” selection bias to “little or no” selection bias, for example. This is broadly consistent with the experimental work of Chopra et al. (2022): when the space of expected results is more narrow, it seems to be harder to publish results outside that space.
But maybe we’re worried at this point that we have the direction of causality exactly backwards. Maybe it’s not that wider theory permits a wider array of results to be published. Maybe it’s that a wider array of published results leads theorists to come up with wider theories to accommodate this evidence. Doucouliagos and Stanley have two responses here. First, there is a difference between the breadth of results published and publication bias and they try to control for the former to really isolate the latter. After all, it is possible for a field to have both selection bias and a wide breadth of results published. Their methodology can separately identify both, at least in theory, and so they can check if there is more selection bias when there is more accommodating theory, even when two fields have an otherwise similarly large array of results to explain.
But in practice, I wonder if controlling for this is hard to do. So I am a fan of the second approach they take to address this issue. There are some theories in economics where there just really isn’t much wiggle room about which way the results are supposed to go. One of them is studies estimating demand. Except for some exotic cases, economists expect that if you hold all else constant, when prices go up, demand should go down, and vice-versa. We even permit ourselves to call this the “law” of demand. Economists almost uniformly will believe that apparent violations of this can be explained by a failure to control for confounding factors. They will strongly resist the temptation to derive new theories that predict demand and price go up or down together.
Moreover, it isn’t controversial to identify which meta-analyses are about estimating demand and which are not. So for their final measure, Doucouliagos and Stanley look at estimates of bias in studies that estimate demand and those that don’t. And they find studies that estimate demand exhibit much more selection bias than those that don’t (even more than in their measures about extent of debate or what theory permits). In other words, when economists get results that say there is no relationship between price and demand, or that demand goes up when prices go up, these results appear less likely to be published. That’s pretty consistent with skepticism about the quality of empirical work driving publication bias.
To sum up, there seems to be significant variation in the extent of publication bias across different scientific fields, with social sciences and some parts of biology seeming to suffer from this problem more acutely than other areas. There’s not a ton of work in this area (email me if you know of more!) but what is available seems to suggest one reason for this may be that reviewers will tend to be more skeptical of null results in fields with less reliable methods. In some experiments on reviewers, we see some evidence reviewers read papers with null results more skeptically: they spot more errors, they rate the methodology as weaker, and they judge the results to be less precisely estimated. And experiments and observational data from economics also suggest reviewers tend to be extra skeptical of results that go against the prevailing consensus. In a world where evidence that surprises is most interesting, we would predict that kind of paper to be easier to publish; but that logic only holds if you think the null result reflects a genuine fact about the world.
And in fact, we know from other work that social scientists are right to exercise skepticism towards their own empirical methods. In the many analysts literature (discussed in some detail here), multiple teams of social scientists are given the same data and asked to answer the same question, yet quite frequently arrive at very different conclusions!
Unfortunately, if this mechanism is right, it suggests fields with lower levels of empirical reliability are more likely to additionally face the burden of publication bias. It is precisely the fields where lots of empirical work is needed to average out a semi-reliable answer that we see the most bias in what empirical work is published!
That said, the study of publication bias is itself a field where empirical work might be shaky. And that should give us pause about the strength of this conclusion. Indeed - who knows if others have looked for the same relationships elsewhere and gotten different results, but have been unable to publish them? (Joking! Mostly.)
New articles and updates to existing articles are typically added to this site every three weeks. To learn what’s new on New Things Under the Sun, subscribe to the newsletter.
Maier, Maximilian, František Bartoš, and Eric-Jan Wagenmakers. 2023. Robust Bayesian meta-analysis: Addressing publication bias with model-averaging. Psychological Methods, 28(1), 107–122. https://doi.org/10.1037/met0000405
Bartoš, František, Maximilian Maier, Eric-Jan Wagenmakers, Franziska Nippold, Hristos Doucouliagos, John P. A. Ioannidis, Willem M. Otte, Martina Sladekova, Teshome K. Deressa, Stephan B. Bruns, Daniele Fanelli, and T.D. Stanley. 2022. Footprint of Publication Selection Bias on Meta-Analyses in Medicine, Environmental Sciences, Psychology, and Economics. arXiv: 2208.12334. https://doi.org/10.48550/arXiv.2208.12334
Fanelli, Daniele, Rodrigo Costas, and John P. A. Ioannidis. 2017. Meta-assessment of bias in science. Proceedings of the National Academy of Sciences of the United States of America 114(14): 3714-3719. https://doi.org/10.1073/pnas.1618569114
Fanelli, Daniele. 2010. “Positive” Results Increase Down the Hierarchy of the Sciences. PLoS ONE 5(4): e10068. https://doi.org/10.1371/journal.pone.0010068
Chopra, Felix, Ingar Haaland, Christopher Roth, and Andreas Stegmann. 2022. The Null Results Penalty. CESifo Working Paper 9776. https://dx.doi.org/10.2139/ssrn.4127663
Emerson, Gwendolyn B., Winston J. Warme, Fredric M. Wolf, James D. Heckman, Richard A. Brand, and Seth S. Leopold. 2010. Testing for the Presence of Positive-Outcome Bias in Peer Review: A Randomized Controlled Trial. Archives of Internal Medicine 170(21): 1934-1939. https://doi.org/10.1001/archinternmed.2010.406
Berinsky, Adam J., James N. Druckman, and Teppei Yamamoto. 2021. Publication Biases in Replication Studies. Political Analysis 29: 370-384. https://doi.org/10.1017/pan.2020.34
Doucouliagos, Chris, and T.D. Stanley. 2013. Are all economic facts greatly exaggerated? Theory competition and selectivity. Journal of Economic Surveys 27(2): 316-339. https://doi.org/10.1111/j.1467-6419.2011.00706.x
Skepticism and surprise are not the only possible reasons for publication bias that could plausibly vary across fields. Here I briefly look at two more. Surely there are others too.
First, variation in publication bias could be related to the nature of publication in different fields. If it’s easier to draft and push an article through peer review in some fields than in others then some fields may end up getting more results out there (even if they’re not out there in a top-ranked journal). In the social sciences, we have some evidence that the biggest difference between null results and strong results is that most null results are never even written up and submitted for publication. Maybe that’s because it’s too much work for too little reward. In a field where writing up and publishing results from an experiment somewhere is easy, it might be worth doing, if only to add another line to the CV.
Second, empirical reliability might matter in another, simpler way as well. In some fields, maybe everyone just always gets the same answer, and so even if publication bias is rampant, it will not manifest because there is very little variation in results for selection to act on! For example, as noted above it may be easier in some fields to tightly control for noise in data, or to obtain many more observations, than in others. In economics, a big sample might be hundreds of thousands of observations. In physics, the Large Hadron Collider generates 30 petabytes of data per year. In fields where clean data is plentiful, it might not be the case that when you run an experiment sometimes you find support for a hypothesis and sometimes you don’t. You always find the same thing, or at least always come to the same conclusion about statistical significance. In that case, you won’t find much of a relationship between the size of standard errors and effect sizes: within the range of observed standard errors, everything is either significant or not.