Relatively common for ideas we know will be big, but not so much for the rest
An old divide in the study of innovation is whether ideas come primarily from individual/group creativity, or whether they are “in the air”, so that anyone with the right set of background knowledge will be able to see them. As evidence of the latter, people have pointed to prominent examples of multiple simultaneous discovery:
Isaac Newton and Gottfried Leibniz developed calculus independently of each other
Charles Darwin and Alfred Wallace independently developed versions of the theory of evolution via natural selection
Different inventors in different countries claim to have invented the lightbulb (Thomas Edison in the USA, Joseph Swan in the UK, Alexander Lodygin in Russia)
Alexander Graham Bell and Elisha Gray submitted nearly simultaneous patent applications for the invention of the telephone
In 1922, Ogburn and Thomas compiled a list of nearly 150 examples of multiple independent discovery (often called “twin” discoveries or “multiples”); Wikipedia provides many more. These exercises are meant to show that once a new invention or discovery is “close” to existing knowledge, multiple people are likely to have the idea at the same time. It also implies scientific and technological advance have some built-in redundancy: if Einstein had died in childhood, someone else would have come up with relativity.
But in fact, all these lists of anecdotes show is that it is possible for multiple people to come up with the same idea. We don’t really know how common it is, because these lists make no attempt to compile a comprehensive population survey of ideas. What do we find if we do try to do that exercise?
A number of papers have looked at how common it is for multiple independent discovery to occur in academic papers. An early classic is Hagstrom (1974), which reports on a survey of 1,947 academics in the spring of 1966. Hagstrom’s survey asked mathematicians, physicists, chemists, and biologists if they had ever been “anticipated”; today, we would call this getting scooped. Getting scooped isn’t that uncommon: 63% of respondents said they had been scooped at least once in their career, 16% said they had been scooped more than once.
For our purposes, the most illuminating question in Hagstrom’s survey is “how concerned are you that you might be anticipated in your current research?” Fully 1.2% of respondents said they had already been anticipated on their current project!
Let’s assume people are, on average, halfway through a research project. If they have a constant probability of being scooped through the life of a project, then that implies the probability of getting scooped on any given project is on the order of 2.5%, at least in 1966.
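The back-of-envelope behind that 2.5% figure can be made explicit. This is just a sketch of the reasoning, under the stated assumptions (respondents halfway through a project, constant risk of being scooped over a project's life):

```python
# Back out a per-project scoop probability from Hagstrom's survey figure.
# 1.2% of respondents say they have *already* been anticipated on their
# current project. If respondents are, on average, halfway through that
# project, and the risk of being scooped is roughly constant over its life,
# the full-project probability is about double the observed share.

already_scooped = 0.012  # share anticipated on current project (Hagstrom survey)
share_elapsed = 0.5      # assumption: respondents are halfway through on average

per_project_scoop = already_scooped / share_elapsed
print(f"{per_project_scoop:.1%}")  # → 2.4%, i.e. "on the order of 2.5%"
```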
Hill and Stein (2020) get similar results, studying the impact of getting scooped over 1999-2017 for the field of structural biology. Structural biology is a great field for studying how science works because of its unusually good data on the practice of science. Structural biologists try to figure out the 3D structure of proteins and other biological macromolecules using data on the diffraction of x-rays through crystalized proteins. When they have a model that fits the data well, the norm (and often publication requirement) is to submit the model to the Protein Data Bank. This submission is typically confidential until publication, but creates a pre-publication record of completed scientific work, which lets Hill and Stein see when two teams have independently been working on the same thing. Since almost all protein models are submitted to the Protein Data Bank, Hill and Stein really do have something approaching a census of all “ideas” in the field of structural biology, as well as a way of seeing when more than one team “has the same idea” (or more precisely, is working on closely related proteins). Overall, they find 2.9% of proteins involve multiple independent discovery, as defined above, quite close to what Hagstrom reported in 1974.
Painter et al. (2020) takes yet another approach to identifying multiple simultaneous invention, this time in the field of evolutionary medicine (2007-2011). Their approach is to identify important new words in the text of evolutionary medicine articles, and then to look for cases where multiple papers introduce the same new word at roughly the same time. In their context, this usually means an idea has been borrowed from another field (where a word for the concept already exists) and they are looking for cases where multiple people independently realized a concept from another field could be fruitfully applied to evolutionary medicine.
To identify important new keywords, they take all the words in evolutionary medicine articles and algorithmically pick out the ones unlikely to be there based on their frequency in American English. This gives them a set of technical words that are not common English words. They build up a dictionary of such terms mentioned in papers published between 1991 and 2006; these are words that are “known” to evolutionary medicine in 2007. Beginning in 2007, they look for papers that introduce new technical words. Lastly, they consider a word to be important if it is mentioned in subsequent years, rather than once and never again.
Over the period they study, there were 3,488 new keywords introduced that went on to appear in at least one subsequent year. Of this set, 197 were introduced by more than one paper in the same year, or 5.6%. As a measure of independent discovery, that’s probably overstated, since it doesn’t correct for the same author publishing more than one paper using the same keywords. Again, I think something in the ballpark of 2-3% sounds plausible. Painter and coauthors go on to focus on a small subset of 5 keywords that were simultaneously introduced by multiple distinct people and which were very important, being mentioned not just again, but in every subsequent year.
Bikard (2020) is another attempt to identify instances of multiple independent discovery, though in this case it’s harder to use the data to estimate how common they are. Bikard argues that when the same two papers are frequently cited together in the same parenthetical,1 then that is evidence they refer to the same underlying idea. Bikard algorithmically identifies a set of 10,927 such pairs of papers in the PubMed database and shows they exhibit a lot of other hallmarks of being multiple independent discoveries: they are textually quite similar, published close in time, and frequently published literally back-to-back in the same journal issue, which is one way journals acknowledge co-discovery.
Given 29.3 million papers in PubMed, if there are only 10,927 instances of multiple discovery, that would naively suggest something on the order of 0.03% of papers having multiple independent discovery. But while Bikard’s publicly available database of twin discoveries is useful for investigating a lot of questions related to science, it’s less useful for ascertaining the probability of independent discovery. That’s because the algorithm requires articles to have the right mix of characteristics to be identified as simultaneous discoveries. For example, in order to identify if two articles are frequently cited together in the same parenthetical block, Bikard needs each paper to receive at least 5 citations, and he needs at least three papers that jointly cite them to have their full text available, so he can see if those citations happen in sequence inside parentheses. It’s unclear to me how many of the 29.3mn papers in PubMed meet these criteria. But we can at least say that as long as at least 1 in 100 papers meet the criteria, then Bikard’s method suggests a rate of simultaneous discovery that is significantly lower than 3%.
To close out this section, let’s turn to patents.
Until 2013, the US patent system featured an unusual first-to-invent system, wherein patent rights were awarded not to the first person to seek a patent but to the first person to make the invention (provided certain conditions were met). This meant that if two groups filed patents for substantively the same invention, the US patent office initiated something called a “patent interference” to determine which group was in fact first to invent. These patent interferences provide one way to assess how common simultaneous invention is at the US patent office.
Ganguli, Lin, and Reynolds (2020) have data on all 1,329 patent interference decisions from 1998-2014. Of this set, it’s not totally clear how many represent actual simultaneous invention. In a small number of cases (3.5%), the USPTO ruled there had in fact been no interference; in other cases, one party settles or abandons their claim, or ownership of the patents is transferred to a common owner. In these cases, we don’t necessarily know if the patents were the same. But it turns out this doesn’t really matter for making the argument that simultaneous invention is very rare. For the sake of argument, let’s assume all 1,329 patent interference decisions correspond to cases of independent discovery.
On average, it takes a few years for a patent interference decision to be issued. So let’s assume, for the sake of argument, these decisions come from the set of granted patents whose application was submitted between 1996 and 2012. Some 6.3mn patents applications (ultimately granted) were submitted over this time period, which implies 0.02% of patent applications face simultaneous invention. That’s a lot less than the 2-3% we found in some academic papers!
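The arithmetic here is simple, but worth making explicit since it pins down just how different the patent numbers are from the academic ones. A minimal sketch, using the figures from the source:

```python
# Naive upper-bound rate of simultaneous invention at the USPTO: assume
# every patent interference decision (1998-2014) reflects genuine
# independent invention, drawn from ~6.3mn granted patents whose
# applications were filed 1996-2012.

interferences = 1_329            # Ganguli, Lin, and Reynolds (2020)
granted_applications = 6_300_000 # granted patents, applications filed 1996-2012

rate = interferences / granted_applications
print(f"{rate:.2%}")  # → 0.02%, versus 2-3% in the academic-paper studies
```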
Before proceeding, we need to pause and think a bit about what these numbers mean. I have been implicitly interpreting the probability of simultaneous discovery as a measure of how many people have the same ability to make a discovery, where ability consists of the ability to even have the idea, but also to possess the requisite skills and knowledge to execute. Interpreted this way, it’s a measure of redundancy, because it tells us something about how many people are plausible “backup” discoverers of an idea.
But we might be worried the probability of simultaneous discovery doesn’t actually reflect redundancy because scientists are not choosing ideas at random. Instead, scientists carefully choose ideas and they might very well choose to eschew ideas that they believe are already being pursued by others. In that case, low levels of simultaneous discovery are the outcome of scientists/inventors dividing up the intellectual landscape to avoid incursions into the “territory” of their rivals. But that doesn’t mean they are incapable of discovering ideas in those areas; they just choose to avoid them. It may be this avoidance that drives low rates of simultaneous discovery, but that there is actually a lot of redundancy in science since if someone misses something important, others can step up.
The post Contingency and Science examines some evidence on how important this concern is. It looks at studies that match up life scientists who unexpectedly die with those who do not, and then observe the knock-on effects for the people they collaborate with or who work in the same field. These papers do find the concern highlighted above is well founded: when scientists die, there is a reshuffling of what people work on, with new people starting to publish in the domain of the scientist who passed away. But these papers also suggest what is being published isn’t the same as what would have been published, had the life scientist not died. The citation profiles of the new work differ from the citation profiles of the work created by life scientists who did not die. There is also more turnover in the kinds of topics that are studied, as compared to what happens in fields where eminent life scientists remain active. Altogether, this suggests that scientists do indeed respect each other’s intellectual property rights, but also that when someone else “takes over ownership” they don’t do the same things as the “original” owner.
Taken together, it’s likely that the annual rates of simultaneous discovery partially understate redundancy, because scientists capable of making the same discoveries may seek to avoid stepping on each other’s toes. At the same time, it’s not totally spurious to read these as indicative of low redundancy in innovation; it really does seem like, even when a scientist exits a research topic, people do not simply mop up the discoveries they would have made.
It’s tough to know exactly how much we should adjust our beliefs to take all this into account; I’m going to move forward by mostly using the larger estimates in the literature on the rate of simultaneous discovery (on the order of 2-3%). But simultaneous discovery is not exactly the benchmark we’re interested in. What I’m actually curious about is the probability someone would eventually rediscover an idea. If Einstein died in childhood, would someone else have found relativity? And would that have been fast or slow?
We can back out a rough estimate of this based on the probability of multiple simultaneous discovery. Suppose I am working on an idea that takes me 1 year to go from idea conception to publication, and I face a 2-3% probability of being scooped by someone else working on the topic during that time. Imagine I drop the idea, but the probability that someone else out there is working on it and will publish remains 2-3% per year. In other words, in every year, the probability nobody publishes on the idea is 97-98%. If that stayed constant, the probability nobody publishes on it in the next 20 years is 54-67%! In other words, these estimates imply that if someone doesn’t publish on an idea, there’s less than even odds someone else will pick up the idea and run with it in the following two decades.
And even this estimate may be on the high end, for a few reasons. First, assuming a research idea can move from conception to publication in the space of a year is probably too optimistic. If research takes two years and you face a 2-3% probability of being scooped during that longer period, the probability of no independent discovery in 20 years rises to 74-82%.2 Moreover, as noted earlier, a 2-3% probability of simultaneous discovery is actually on the high end in these papers. If it’s closer to 0.1%, it’s all but certain no one will make a discovery someone else missed.
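The survival-probability calculations in the last two paragraphs can be sketched in a few lines. This is just the constant-hazard arithmetic from the text, with the project duration as an explicit assumption:

```python
# Probability that *nobody* rediscovers a dropped idea within a 20-year
# horizon, assuming a constant probability p of independent discovery per
# project cycle of `duration_years`.

def prob_no_rediscovery(p, duration_years, horizon=20):
    """Chance the idea stays undiscovered over the horizon, with a constant
    per-cycle discovery probability p."""
    cycles = horizon / duration_years
    return (1 - p) ** cycles

# One-year projects, 2-3% scoop probability: ~54-67% chance of no rediscovery
print(prob_no_rediscovery(0.03, 1), prob_no_rediscovery(0.02, 1))

# Two-year projects with the same 2-3% per-project probability: ~74-82%
print(prob_no_rediscovery(0.03, 2), prob_no_rediscovery(0.02, 2))
```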
Setting aside uncertainty about the annual rate of rediscovery, we still have another worry. Maybe it is wrong to assume the probability of independent discovery will remain constant over time. Maybe, as more knowledge around an area gets filled in, it gets increasingly likely we’ll make a discovery we might otherwise have missed. On the other hand, maybe the opposite is true; as science and technology move on, it might become increasingly unlikely that we’ll make a missed discovery.
For evidence on the probability of rediscovery over the long-run, we need longer-run data. But this is a very hard question to get at, because once a discovery is made, the kind of people capable of independently discovering it are likely to learn about the discovery and then not spend time trying to re-discover it.
In the patent office, for example, only the first to invent (or first to file) gets patent protection. So it behooves any prospective inventor to check to see if an invention has already been patented before spending a lot of time on reinvention. And in academia, most social credit for a discovery goes to the first to discover something. So it behooves researchers to spend some time searching to see if what they want to do has already been done, before embarking on major research projects.3
That said, there can be a considerable gap between when a discovery is made and when it is publicly disclosed. Patent applications are generally kept confidential until 18 months after filing, and even this is not universal and has only been the case since 2000. In other cases, patent applications are private until the patent is granted, which can take years. Moreover, even if a patent application or grant is public, searching the patent record is an art in and of itself, and people might miss things even when the record is public. The point is, people may inadvertently reinvent patented inventions, either because the patent office hasn’t told anyone a similar invention is already under review, or because the inventor simply failed to locate a similar invention in the public patent record. Fortunately for us, the patent office keeps records of why patent applications are blocked, so this can serve as an alternative metric of reinvention.
Lück et al. (2020) look into what happened when the patent office began disclosing patent applications after 18 months, instead of waiting until patents were issued (which can take years). By comparing patent applications from just before and just after the rule change, they find early disclosure decreased the number of subsequent patent applications whose claims infringed on pending patent applications by 5-15%. If we interpret patent infringement as an indicator of independent reinvention, then that suggests some reinvention occurs because inventors didn’t know a similar idea had already been submitted for patent examination. But that only applies 5-15% of the time. The rest of the time, “reinvention”, if that’s what it is, is not simply a case of someone inventing something that they could not have known about by searching the patent record.
All this is to say, something like reinvention at the patent office does happen at least some of the time. I estimate something like up to 8% of patented inventions might be reinvented over the course of a later decade. How I got to that number is a bit tedious, so I stuck it in an appendix to this post, but basically it’s an upper bound estimate on the share of patents that are cited as a reason for blocking a patent application that is later abandoned. I am not quite sure what to make of this 8% number, other than to remark it isn’t very high over the course of a decade, which is consistent with the probability of reinvention being less than 2-3% per year.
But it could easily be an underestimate of the true extent of reinvention, since it won’t include people who could have reinvented a technology and chose not to after successfully learning a similar patent was already on file at the patent office. Or it could be an overestimate of how much reinvention occurs, if it includes people who did not independently reinvent, but instead tried to adapt a patented invention they had learned of, and failed to make their adaptation sufficiently distinctive.
Fortunately, there is another line of evidence we can turn to, again discussed in more detail in Contingency and Science. Sometimes, geopolitics shuts down the normal communication channels that stymie our attempts to infer the long-run probability of multiple people making the same discovery. Contingency and Science specifically looks at papers on two cases: the fracturing of the global scientific community during World War I and during the Cold War. In both cases, we have strong evidence there was a major divergence in the scientific topics under study.
During World War I, for example, we can track the similarity of scientific topics studied by looking to the text of the titles of scientific articles. Similarity of titles steadily and rapidly declines as the war proceeds, indicating science among the different (non-communicating) belligerents quickly began to proceed down different paths. For the Cold War, our main evidence comes from what happens when the iron curtain fell, and previously isolated mathematical communities began to communicate again. Again we see a lot of evidence that the communities had diverged significantly in terms of the topics under study during their many decades of blocked communication.
While it is once again very hard to assess whether the magnitude of the divergences during World War I and the Cold War match what would be predicted from the probability of simultaneous discovery, it at least shows that long-run divergence of some kind is the norm, in the rare cases when isolated scientific ecosystems emerge.
A final issue with extrapolating from the probability of simultaneous discovery is that it treats all ideas the same. This is a problem, since ideas vary so much in their import. If most random papers are not likely to be rediscovered, but the most important papers are, that has different implications. And it would seem quite sensible that more important discoveries have more people looking for them, and so face a higher rate of multiple independent discovery. That might also explain why it’s possible to draw up compelling lists of multiple discoveries; the most famous discoveries are also the ones most likely to have multiple inventors.
In fact, we do have a lot of evidence this is the case. The cleanest evidence we have on this is probably from structural biology. In a complementary paper, Hill and Stein try to estimate just how important the structure of a given protein is and then directly compare this to how many groups are actively working on figuring out the structure of a protein (something they can do, because data in this field is so good). To estimate how important a discovery might be, they fit a statistical model that tries to predict how many citations a paper about a given protein structure will get, based on data about it that would have been available to any scientist prior to beginning work on the protein.4 This includes stuff like “how many other papers have been written on this protein in the past” and also stuff like “is this protein found in humans?”
The figure below shows how the predicted citation value of a protein is related to the number of groups doing research on a protein. There is a pretty strong positive correlation: proteins with the kinds of characteristics that get lots of citations attract more investigation. And in fact, since the vertical axis is in log units, the correlation is actually much stronger than it seems - the highest potential proteins appear to garner way more interest. In the main data they use for their analysis, the protein cluster with the most submissions gets almost 50 times as many submissions as the median protein cluster.
We can also get some confirmatory descriptive evidence that doesn’t depend on doing anything fancy, like trying to predict citations based on protein characteristics. In Hill and Stein (2020), papers on proteins subject to multiple discovery tend to get 26 citations in the next five years, as compared to 17 citations among those discovered by just one scientist (or team of scientists).
Hagstrom (1974) offers some fuzzier evidence that is consistent with the view that high impact work is more likely to be discovered simultaneously by multiple people. In his survey of scientists, he found those who had received more citations were more likely to report having been anticipated at some point in their career.5 In other words, scientists working on topics that went on to be highly cited were also more likely to report being scooped at some point.
Lastly, some evidence from the patent office is also consistent with the notion that more important discoveries are more likely to be independently invented by multiple people. Cotropia and Schwartz (2018) have data on a sample of 1.4mn US patents issued between 1999 and 2007, including whether they were cited as the basis for rejecting a subsequent patent application (filed between 2008 and 2017), because of a lack of novelty. Optimistically, this data can be read as a way of seeing which kinds of patents are most likely to be reinvented later, though I think there are important caveats to this and so I wouldn’t lean on them too heavily. But supposing we set those aside, Cotropia and Schwartz show patents are more likely to have this kind of infringing reinvention occur if the patent is more valuable by various metrics (how much it gets cited, the probability the owner pays the patent’s renewal fees, the probability it’s the subject of litigation, etc.). That’s consistent with the most valuable inventions attracting more inventors, with most of the follower inventors being bounced out by the patent office for failing to be first.
So all together, what does this imply?
Pick a discovery or innovation at random, and I think the probability it has much in the way of built-in redundancy is probably pretty small. I think it is quite plausible that for most papers or patents, if you erased them from history, no one else would independently reproduce the work in the next two decades.
But that’s for a discovery selected at random. If you pick a patent or paper at random, in all likelihood it won’t be a particularly impactful patent or paper. With innovation, a small number of hits appear to have a disproportionate impact on the direction of a discipline or industry. It seems plausible that the most promising ideas attract many times as many potential discoverers as a randomly selected paper. If the annual probability of getting scooped on an important paper is 10% instead of 2-3%, that implies something quite different about long-term redundancy. With a 10% annual probability of discovery, the probability that no one makes the discovery in twenty years drops from over 50% to just 12%.
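The jump from "over 50%" to "just 12%" falls straight out of the same constant-hazard arithmetic, with the hypothetical 10% annual rate plugged in:

```python
# For a high-profile idea with a (hypothetical) 10% annual probability of
# independent discovery, the chance nobody finds it within 20 years:

p_annual = 0.10
p_missed_20yr = (1 - p_annual) ** 20
print(f"{p_missed_20yr:.0%}")  # → 12%, versus 54-67% at a 2-3% annual rate
```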
That, in turn, suggests there is a lot of redundancy in the most important ideas and inventions, but not in the details. The main trunk of humanity’s scientific and technical knowhow is pretty robust, but the positions of the branches and twigs are not.
Where the real fragility lies would seem to be among ideas that are important in retrospect but not in prospect. Those are the ones that don’t attract a lot of attention, and so there are not many shots on goal; if a scientist studying the topic gives up, it may be a very long time until someone else makes the discovery. But when those discoveries do happen, they can have a big effect.
Bottom line: if we can see an idea is going to be important, there is probably a good chance of multiple independent discovery, which builds in a bit of redundancy. But in all other cases, all bets are off.
Ogburn, William F., and Dorothy Thomas. 1922. Are Inventions Inevitable? A Note on Social Evolution. Political Science Quarterly 37(1): 83-98. https://www.jstor.org/stable/2142320
Hagstrom, Warren O. 1974. Competition in Science. American Sociological Review 39(1): 1-18. https://doi.org/10.2307/2094272
Hill, Ryan, and Carolyn Stein. 2020. Scooped! Estimating Rewards for Priority in Science. Working Paper.
Painter, Deryc T., Frank van der Wouden, Manfred D. Laubichler, and Hyejin Youn. 2020. Quantifying simultaneous innovations in evolutionary medicine. Theory in Biosciences 139: 319-335. https://doi.org/10.1007/s12064-020-00333-3
Bikard, Michaël. 2020. Idea Twins: Simultaneous discoveries as a research tool. Strategic Management Journal 41(8): 1528-1543. https://doi.org/10.1002/smj.3162
Ganguli, Ina, Jeffrey Lin, and Nicholas Reynolds. 2020. The Paper Trail of Knowledge Spillovers: Evidence from Patent Interferences. American Economic Journal: Applied Economics 12(2): 278-302. https://doi.org/10.1257/app.20180017
Lück, Sonja, Benjamin Balsmeier, Florian Seliger, and Lee Fleming. 2020. Early Disclosure of Invention and Reduced Duplication: An Empirical Test. Management Science 66(6): 2677-2685. https://doi.org/10.1287/mnsc.2019.3521
Hill, Ryan, and Carolyn Stein. 2021. Race to the bottom: competition and quality in science. Working paper.
Cotropia, Christopher Anthony, and David L. Schwartz. 2018. Patents Used in Patent Office Rejections as Indicators of Value. SSRN Working Paper https://dx.doi.org/10.2139/ssrn.3274995
Frakes, Michael D., and Melissa F. Wasserman. 2014. Is the Time Allocated to Review Patent Applications Inducing Examiners to Grant Invalid Patents? Evidence from Micro-Level Application Data. NBER Working Paper 20337. https://doi.org/10.3386/w20337
In order to be granted, a patented invention needs to be novel, non-obvious, and useful. So maybe one way we can get a handle on how common independent rediscovery is would be to look at how frequently a patent application gets rejected because it isn’t novel: someone else has already had the idea.
Cotropia and Schwartz (2018) have data on this. For a sample of 1.4mn US patents issued between 1999 and 2007, slightly more than 200,000 patents were cited as the basis for rejecting a subsequent patent application (filed between 2008 and 2017), because of a lack of novelty. In other words, over a future 10-year period, for about 14% of patents, a later inventor submitted something to the patent office close enough to them that it was bounced back for being insufficiently original.
Does that mean 14% of inventions would have been rediscovered in time? Well, not exactly.
For one: patents typically make more than one claim, and it may be that only a subset of claims are not novel. These aren’t necessarily one-for-one rediscoveries; there is just some overlap. And naturally, since the patent examination process is a back-and-forth iterative process, an applicant might be wise to try and make a broad claim initially, which they are prepared to walk back if the patent examiner rejects it. These kinds of broader claims are probably more likely to bump up against pre-existing patents.
Is there a way to separate out cases where a patent application is a close match to an existing patent from cases where there is just a bit of overlap that the patent applicant can get around by amending their application? Well, Frakes and Wasserman (2014) have some complementary data on the fate of 1.4mn patent applications filed after March 2001 and decided one way or another by July 2012. Of this set, 56% were rejected at some point due to lack of novelty, and 32% of patent applications were ultimately not granted at all.
That doesn’t mean 32% of patent applications failed because they lacked novelty. A patent could be rejected for a lot of other reasons too - too obvious, failure to disclose enough, not patentable material, etc. But just to get an order of magnitude, let’s make an unrealistic assumption and suppose that 32% of patent applications really did fail to be granted because they lacked novelty. Given 56% of patent applications were dinged for insufficient novelty at some point, and since 32% out of 56% is 57%, that suggests at best something like 57% of novelty rejections are strong enough to scuttle the whole patent application. Maybe we can consider these cases of genuine reinvention.
Now let’s go back to the Cotropia and Schwarz data that says 14% of patents were subsequently cited as the basis for rejecting a non-novel patent. If, at best, 57% of the time these rejections are severe enough to stop the patent application from moving forward, then that implies at best, something like 8% of patents were reinvented by someone.
So let’s say, at most, something on the order of 8% of patents are reinvented and someone submits something to the patent office close enough to the prior patent that the application is rejected. If we take the number 8% per decade naively, it implies an annual probability of reinvention on the order of 0.8%.
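The chain of rough numbers in this appendix can be laid out end to end. Every input here is an approximation from the cited papers, and the final step spreads the decade figure evenly across years, as in the text:

```python
# Chained back-of-envelope for the reinvention rate at the patent office.
cited_in_novelty_rejection = 0.14    # Cotropia & Schwartz: ~14% of patents
                                     # cited in a later novelty rejection
share_rejections_fatal = 0.32 / 0.56 # Frakes & Wasserman: at best ~57% of
                                     # novelty rejections sink the application

reinvented_per_decade = cited_in_novelty_rejection * share_rejections_fatal
annual_rate = reinvented_per_decade / 10  # naive even spread over a decade

print(f"{reinvented_per_decade:.0%} per decade, {annual_rate:.1%} per year")
# → 8% per decade, 0.8% per year
```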
But as noted in the main piece, there are a lot of additional complications with this number and I hesitate to put much weight on it. Maybe this understates the probability of reinvention, because in most cases a quick patent search reveals to would-be inventors that their idea has already been done and so they stop doing work on it. But maybe it overstates the probability of reinvention, because it includes a lot of people who copied existing ideas and then tried to sneak something past the patent office.