It knows a bit!
People rag on peer review a lot (including, occasionally, New Things Under the Sun). Yet it remains one of the most common ways to allocate scientific resources, whether those be R&D dollars or slots in journals. Is this all a mistake? Or does peer review achieve its purported goal of identifying the science most likely to have an impact and hence, perhaps, most deserving of some of those limited scientific resources?
A simple way to check is to compare peer review scores to other metrics of subsequent scientific impact; does peer review predict eventual impact?
A number of studies find it does.
Let’s start with peer review at the stage of reviewing research proposals.
Li and Agha (2015) look at more than 100,000 research projects funded by the NIH over 1980-2008, comparing the percentile rank of each application’s peer review score to the outcomes of these research projects down the road. For each grant, they look for publications (and patents) that acknowledge the grant’s support. Besides counting the number of publications and patents each grant results in, they can also see how often the publications are cited. Note, they are only looking at projects that actually were funded by the NIH, so we don’t need to worry that their results are just picking up differences between funded and unfunded projects.
The upshot is, better peer review scores are correlated with more impact, whether you want to measure that as the number of resulting journal articles, patents, or citations. For example, here’s a scatter plot of the raw data, comparing peer review percentile ranks (lower is better) to citations and publications. Lots of noise, but among funded projects, if people think your proposal is stronger, you’re more likely to get publications and citations.
Li and Agha also look at the correlation between peer review scores and impact measures after controlling for other potentially relevant factors, such as the year or field of the grant, or the PI’s publication history, institution, and career characteristics. The results are moderated a bit, but basically still stand - compare two grants in the same year, in the same study section, from PIs who look pretty similar on paper, and the grant with higher peer review scores will tend to produce more papers, patents, receive more citations, and produce more very highly cited papers.
Among funded proposals, the predictive power of peer review seems to be highest at the top; the difference in citations, for example, between a top-scoring proposal and one at the 20th percentile tends to be much larger than the difference in citations between one at the 20th and one at the 40th percentile. But even at the top, the correlation between peer review scores and outcomes isn’t great. If you compare proposals that score at the top to proposals at the 10th percentile (of grants that were ultimately still funded), the top proposal is twice as likely to result in a one-in-a-thousand top cited paper. I think that’s not actually that high - since a 10th percentile proposal isn’t that far off from the average, if peer review were really accurate, you might have expected the top proposal to be something like ten times as likely as an average proposal to produce a hit paper.
Park, Lee, and Kim (2015) exploit a peculiar moment in NIH history to provide further evidence that the NIH peer review processes, on average, pick projects with higher scientific impact. In 2009, the US government passed the American Recovery and Reinvestment Act, a stimulus bill meant to fight the economic headwinds of the 2008 financial crisis. The bill authorized $831bn in new spending, of which a tiny corner, $1.7bn, was used by the NIH to fund research projects that would not normally have been funded. This provides a rare opportunity to see how projects that would otherwise have been rejected by the NIH (which relies heavily on peer review to select projects) fare when they unexpectedly receive funding.
When Park, Lee, and Kim (2015) compare stimulus-funded proposals (which got lower peer review scores) to normally funded proposals, they find the stimulus-funded proposals tend to lead to fewer publications, and these publications tend to receive fewer citations. On average, a research proposal with peer review scores high enough to be funded under the NIH’s normal budget produces 13% more publications than a stimulus-funded project. If we focus on a proposal’s most high-impact publication (in terms of citations), Park and coauthors find proposals funded only because of the stimulus got 7% fewer citations. Lastly, we can look at the 5% of publications funded by these NIH grants that received the most citations. A normally funded research proposal had a 7% chance of producing one of these “highest impact” papers; a stimulus-funded proposal had a 4% chance of producing one.
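To put those last two numbers on a common footing, here is a small arithmetic sketch. The 7% and 4% figures are the ones quoted above; the variable and function names are purely illustrative and do not come from the paper.

```python
# Figures quoted in the text from Park, Lee, and Kim (2015).
# All names below are illustrative, not taken from the paper.
p_hit_normal = 0.07    # chance a normally funded proposal yields a top-5% paper
p_hit_stimulus = 0.04  # same chance for a stimulus-funded proposal


def risk_ratio(p_a: float, p_b: float) -> float:
    """How many times more likely the outcome is in group a than in group b."""
    return p_a / p_b


ratio = risk_ratio(p_hit_normal, p_hit_stimulus)
gap_pp = (p_hit_normal - p_hit_stimulus) * 100  # gap in percentage points

print(f"risk ratio: {ratio:.2f}x")
print(f"absolute gap: {gap_pp:.0f} percentage points")
```

So the normally funded batch is a bit less than twice as likely to produce a hit, which is in the same ballpark as the "twice as likely" gap Li and Agha find between top-scoring and 10th percentile proposals.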
I think these results are pretty consistent with Li and Agha (2015) in a few ways. They replicate the general finding that in the NIH, higher peer review scores are associated with more research impact (as measured with imperfect quantitative methods). But they also find peer review doesn’t have super forecasting acumen. Note that Park, Lee, and Kim are not comparing proposals that just barely clear the NIH’s normal funding threshold to proposals that just barely miss it - they don’t have the data needed for that. Instead, they are comparing the entire batch of proposals rated above the NIH’s normal funding threshold to a batch of proposals that fall uniformly below it. The batch of normally funded proposals includes the ones that were rated very highly by peer review, which Li and Agha’s work suggests is where peer review tends to work best. Even so, the differences Park, Lee, and Kim find aren’t enormous.
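To see why comparing whole batches flatters peer review relative to comparing proposals right at the funding threshold, here is a toy simulation. Everything in it is an assumption made for illustration: quality is a standard normal draw, the peer review score is quality plus independent noise, the top 20% of scores are "normally funded," and the next 10% are "stimulus funded." None of these numbers come from Park, Lee, and Kim.

```python
import random


def batch_vs_margin(n=50_000, noise_sd=2.0, seed=1):
    """Return (batch gap, marginal gap) in mean quality for a toy funding rule.

    Hypothetical setup: score = quality + noise. The top 20% of scores are
    'normally funded' and the next 10% are 'stimulus funded'.
    """
    rng = random.Random(seed)
    proposals = []
    for _ in range(n):
        quality = rng.gauss(0, 1)
        score = quality + rng.gauss(0, noise_sd)
        proposals.append((score, quality))
    proposals.sort(reverse=True)  # best peer review score first

    def mean_quality(lo, hi):
        # Mean quality of proposals ranked between fractions lo and hi.
        chunk = proposals[int(lo * n):int(hi * n)]
        return sum(q for _, q in chunk) / len(chunk)

    # Whole batch above the threshold vs. whole batch below it.
    batch_gap = mean_quality(0.00, 0.20) - mean_quality(0.20, 0.30)
    # Thin slices just above vs. just below the threshold.
    margin_gap = mean_quality(0.18, 0.20) - mean_quality(0.20, 0.22)
    return batch_gap, margin_gap


batch_gap, margin_gap = batch_vs_margin()
print(f"batch gap in mean quality:    {batch_gap:.3f}")
print(f"marginal gap in mean quality: {margin_gap:.3f}")
```

Under these made-up settings, the batch gap should come out several times larger than the gap at the margin, because the funded batch includes the very highest-scoring proposals. That is the sense in which the batch comparison gives peer review its best shot.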
Gallo et al. (2014) obtain results pretty similar to those above for the peer review scores of the American Institute of Biological Sciences, an organization that provides expert peer review services for clients. In the figure below, on the horizontal axis we see the peer review scores for 227 projects reviewed by American Institute of Biological Sciences peer reviewers that were ultimately funded by the client. These range from 1 (the best) to 5 (the worst) (note the figure stops at 4; no projects receiving a score worse than that were funded). On the vertical axis we have a normalized count of all the citations to publications that emerged from the grant. As with the NIH data, we again observe a noisy but pretty consistent relationship: the better the peer review score, the more citations eventually earned.
Clavería et al. (2000) obtain similar results in a review of 2,744 proposals funded by the Spanish Health Research Fund over 1988-1994. In this case, the peer review data available is pretty coarse: Clavería and coauthors just know if reviewers classified projects as “excellent/good”, “acceptable”, or “questionable/rejected.” However, a distinguishing feature of this study is that in 1996 the authors arranged for each of these proposals to be reviewed retrospectively by new reviewers. These reviewers looked at the original proposals, the annual and final reports, and published papers originating from the project, and assigned each of the now-completed projects a score of 1-10 (higher is better) for its actual scientific performance. So, if we are concerned that quantitative indicators like citations or publication counts are inappropriate ways to evaluate science, this study gives us a more holistic subjective assessment of research quality.
The study again finds that peer review scores are noisily correlated with measures of quality. Spanish Health Research Fund proposals were reviewed by two commissions, one composed of experts with topical expertise, and one of experts from related fields. After controlling for research level, duration, budget, and year of project onset, projects that received an “excellent/good” review at the proposal stage from the related field commission were rated 0.3 points higher when the completed projects were reviewed. An “excellent/good” review from the commission with more direct topical expertise was associated with a 0.7 point higher rating. (If you do not adjust for research level and the other controls, the association is a bit stronger.) Again - better peer review scores seem to be associated with better outcomes, but the association isn’t super strong (for context, the average rating for completed projects was 5.0/10).
We have some similar results about the correlation between peer review scores and citations at the publication stage too. As discussed in more detail in “Do academic citations measure the impact of new ideas?”, Card and DellaVigna (2020) have data on about 30,000 submissions to four top economics journals, including data on their peer review scores over (roughly) 2004-2013. Because, in economics, it is quite common for draft versions of papers to be posted in advance of publication, Card and DellaVigna can see what happens to papers that are accepted or rejected from these journals, including how many citations they go on to receive (both as drafts and published versions). As with Li and Agha (2015) and Gallo et al. (2014), they find there is indeed a positive correlation between the recommendation of reviewers and the probability a paper is among the top 2% most highly cited in the journal.
Nor is this just because high peer review scores lead to publication in top economics journals (though that’s also true). Card and DellaVigna also track the fate of rejected articles and find that even among rejects to these journals, those that get higher peer review scores still go on to receive more citations.
Siler, Lee, and Bero (2014) obtain similar results using a smaller sample of submissions to the Annals of Internal Medicine, the British Medical Journal, and The Lancet over 2003 and 2004. For a sample of 139 submissions that received at least two peer review scores, they can track down the eventual fate of the submission (either published in one of these three journals or elsewhere). Among the 89 peer-reviewed submissions that were ultimately rejected, the peer review scores (from the first, initial review) were positively correlated with the number of citations the submissions eventually received, though the correlation was pretty weak. For the 40 submissions that were reviewed and accepted, positive (initial) peer review reports were again positively correlated with the number of citations eventually received. In this latter case, the correlation was too weak to be confident it’s not just noise (possibly because the sample was so small).
Siler, Lee, and Bero also emphasize that the three journals actually rejected the 14 papers that would go on to receive the most citations (though they did manage to get the 15th!).
Perhaps more reassuring is the fact that, generally speaking, papers that went on to be highly cited tended to be identified as publishable in other journals pretty quickly. The figure below compares the eventual number of citations received to the time elapsed between submission to one of the three journals under study and eventual publication somewhere else. No highly cited papers took longer than 500 days (not great, but better than 2000!) to find a home. That could be because peer review at one of the next journals the paper was submitted to was quick to recognize the quality of these articles, or possibly because the authors rapidly resubmitted after getting favorable feedback from initial peer reviewers. But this evidence is pretty indirect and other explanations are also possible (for example, maybe the authors believed in their papers’ merit and submitted them more frequently for review, or the papers were more frequently desk-rejected and so could be resubmitted fast).
That said, we also have one more study looking at peer review reports and eventual impact, this time in the American Sociological Review. Teplitskiy and Bakanic (2016) have data on 167 articles published in the American Sociological Review between 1978 and 1982, as well as their peer review scores. Among this set of published articles, they find no statistically significant relationship between peer review scores and the number of citations papers go on to earn.
After analyzing a sample of peer review reports, they argue this is because, for the American Sociological Review, reviewers simply don’t seem to be that focused on the likely impact of submitted papers. None of the reviews explicitly talk about the citation prospects of a paper, and some plausibly related ideas don’t seem to be much discussed either. For example, we might think the choice of topic is an important determinant of eventual impact, but in their analysis of reviews, only around 20% of positive comments (which were rare) or 6% of negative comments focused on the choice of topic. The vast majority of comments focused instead on general impressions (“this is a nice article”), the soundness of the argument, or concrete criticisms, which don’t seem directly tied to any attempt to predict the likely future impact of an article.
I think the above studies find reasonably consistent results. If you have a lot of data - thousands of papers or applications - you can confidently pick out the peer review signal from the noise: stronger peer review scores are correlated with subsequent measures of impact. The Li and Agha (2015) study, which draws on the largest dataset, also finds peer review is predictive after controlling for a lot of other observable variables.
These studies also find, though, that the signal from peer review isn’t very strong. The best we can do is pick something with maybe twice the probability of being a hit, relative to averages (though, that’s averages of funded projects - maybe peer review does a better job of keeping resources away from really bad ideas). In Clavería et al. (2000), the gap between an “excellent/good” proposal and an acceptable or even questionable one is just 0.3-0.7 on a 10-point scale of scientific performance. When we have less data, as with Siler, Lee, and Bero (2014), or Teplitskiy and Bakanic (2016), the association may be too weak to be reliably detected.
Part of that may simply be down to the noise in citations (in another article, I argued they are very noisy measures of the “value” of articles), which will tend to pull estimated correlations towards zero, compared to what they would be if we could properly measure value. But plenty of other papers use citations as well, and find stronger relationships between what they’re interested in and eventual citations. I think that suggests the noise and imprecision of peer review is not purely down to citation data being too noisy to sniff out a strong relationship.
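The way noisy citations pull correlations towards zero can be illustrated with a toy simulation. This is only a sketch under made-up assumptions: a paper’s “true value” is a standard normal draw, the peer review score is value plus noise, and “citations” are value plus (larger) independent noise. Real citation counts are not normally distributed, and none of the noise levels below are estimated from any of the studies discussed here.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n=20_000, review_noise_sd=1.0, citation_noise_sd=2.0, seed=0):
    """Correlate review scores with true value and with a noisy proxy for it."""
    rng = random.Random(seed)
    value = [rng.gauss(0, 1) for _ in range(n)]  # hypothetical true value
    score = [v + rng.gauss(0, review_noise_sd) for v in value]
    citations = [v + rng.gauss(0, citation_noise_sd) for v in value]
    return pearson(score, value), pearson(score, citations)


r_value, r_citations = simulate()
print(f"corr(score, true value): {r_value:.2f}")
print(f"corr(score, citations):  {r_citations:.2f}")
```

With these assumed noise levels, the theoretical correlations are 1/sqrt(2), or about 0.71, against true value and 1/sqrt(10), or about 0.32, against the noisy citation proxy, so the simulated numbers should land close to those. The reviewers are equally informed in both cases; only the yardstick got noisier.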
Li, Danielle, and Leila Agha. 2015. Big names or big ideas: Do peer-review panels select the best science proposals? Science 348(6233): 434-438. https://doi.org/10.1126/science.aaa0185
Fang, Ferric, Anthony Bowen, and Arturo Casadevall. 2016. Research: NIH peer review percentile scores are poorly predictive of grant productivity. eLife 5:e13323. https://doi.org/10.7554/eLife.13323
Park, Hyunwoo, Jeongsik (Jay) Lee, and Byung-Cheol Kim. 2015. Project selection in NIH: A natural experiment from ARRA. Research Policy 44(6): 1145-1159. https://doi.org/10.1016/j.respol.2015.03.004
Gallo, Stephen, Afton S. Carpenter, David Irwin, Caitlin D. McPartland, Joseph Travis, Sofie Reynders, Lisa A. Thompson, and Scott R. Glisson. 2014. The Validation of Peer Review through Research Impact Measures and the Implications for Funding Strategies. PLOS One 9(9): e106474. https://doi.org/10.1371/journal.pone.0106474
Clavería, L. Erik, Eliseo Guallar, Jordi Camí, José Conde, Roberto Pastor, José R. Ricoy, Eduardo Rodríguez-Farré, Fernando Ruiz-Palomo, and Emilio Muñoz. 2000. Does peer review predict the performance of research projects in health sciences? Scientometrics 47(1): 11-23. https://doi.org/10.1023/A:1005609624130
Card, David, and Stefano DellaVigna. 2020. What do Editors Maximize? Evidence from Four Economics Journals. The Review of Economics and Statistics 102(1): 195-217. https://doi.org/10.1162/rest_a_00839
Siler, Kyle, Kirby Lee, and Lisa Bero. 2014. Measuring the effectiveness of scientific gatekeeping. PNAS 112(2): 360-365. https://doi.org/10.1073/pnas.1418218112
Teplitskiy, Misha, and Von Bakanic. 2016. Do Peer Reviews Predict Impact? Evidence from the American Sociological Review, 1978 to 1982. Socius, 2. https://doi.org/10.1177/2378023116640278