Overdiagnosis: why scientists and statisticians think we should test fewer people
A targeted testing strategy is the long-standing best practice for a reason
Welcome back to Limits of Inference! Thanks for being here.
This newsletter now has readers from many different backgrounds. I love it, but it has presented me with a writing challenge. To keep my posts readable, I have been editing out some details. These edits are sorely needed as I am long-winded by nature. But, sometimes, it makes me sad not to dive into all the nitty-gritty details. I cannot help it, still an academic at heart. So, I am trying something new— footnotes. The footnotes are here for those who care deeply about a topic or want more substance. These long footnotes are tangential, so dive in at your own risk. Let me know how this experiment works.
If you enjoy the newsletter, please tell others about it. If not, reach out to me and tell me why.
This year a manufacturing snafu caused test shortages at a critical point early in the pandemic, and the need for widely available testing became a rallying cry. Testing remains an important tool; however, we've taken it too far. We are testing too many people for SARS-CoV-2. As a result, we are overestimating the health impact of COVID-19 relative to other respiratory illnesses.
Overdiagnosis is not a new phenomenon. We have seen screening contribute to overdiagnosis with prostate cancer, chronic kidney disease, cervical cancer, breast cancer, ADHD, thyroid cancer, and depression. This list is just the tip of the iceberg. As this is a mistake that modern medicine keeps repeating, it's worth talking about how and why.
Data tells us that something happened, not why. The why is a story a person comes up with to explain data using context and domain expertise. Many people assume a test tells us that a person has COVID-19, but that assumption conflates the data (the test result) with the inference or explanation (the presence or absence of disease) [1]. A positive test result tells us that a person got a positive test result. Yes, this statement is annoyingly tautological, but that is exactly how useless data is by itself.
When we go searching for evidence, it becomes harder to guess the correct explanation for what we find. This problem is fundamental, not unique to public health. Let's develop the intuition using an example outside of biology: the metal detectors used to screen crowds entering airports, courthouses, sports arenas, etc. Metal detectors search for criminals by recognizing metal, a material unlikely to be found on a person's body but commonly found in guns and knives. Most of us have seen or heard a metal detector alarm at least once, and odds are it was a false alarm (nearly all alarms are). This is the expected outcome whenever we screen large numbers of people looking for evidence of rare events. False alarms will dominate, but it will still prove difficult to predict why or when the next one will occur.
Metal detectors detect evidence of metal (data being tautological as always). However, it takes inference from that data to determine whether the person has a weapon or violent intent. In practice, there are many reasons why someone has metal on their person that are neither dangerous nor criminal—from bone screws to pacemakers, from underwire bras to steel-toed shoes, from coins to snap closures on a clutch (yep, that last one happened to me) [2]. Likewise, there are alternative explanations for a positive COVID-19 test other than a viral disease — a contaminated swab, a mislabeled sample, a cross-reaction with a different molecule, high background noise, exposure without infection, residual debris from a previous infection. The more weird edge cases that can explain a false alarm, the more errors will occur.
Any single alternate explanation may seem like a rare circumstance that does not matter to the big picture. That changes with scale. Once we search extensively or test thousands or millions of people, all explanations, even rare ones, are likely to occur. Surgical screws are rare in the population overall, but we will almost certainly find some in a crowd of ten thousand sports fans. It is unlikely for a technician to cross-contaminate a sample. However, we still expect regular, consistent mistakes caused by human error anytime we run millions of tests per day.
There is another way to think about the problem of scale. Testing does not cause disease. The total number of cases does not increase with the number of tests performed. However, every single test has a small chance of returning an erroneous result, whether due to a mistake or just the general weirdness of the world. So while the number of true-positives has an upper bound (we cannot detect more infections than actually exist), the number of false-positives grows with every additional test. The signal is constant, but the noise grows each time we search for evidence of disease.
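Here is a minimal sketch of that scaling argument in Python, using hypothetical numbers (a fixed pool of infected people and an assumed 2% per-test error rate):

```python
# Hypothetical illustration: the signal (true-positives) is capped by the number
# of infected people, while the noise (false-positives) grows with every
# additional uninfected person we test.
infected = 83_000            # assumed size of the outbreak (fixed)
sensitivity = 0.98           # assumed chance an infected person tests positive
false_positive_rate = 0.02   # assumed chance an uninfected person tests positive anyway

for uninfected_tested in (100_000, 1_000_000, 8_000_000):
    true_positives = int(infected * sensitivity)                    # never exceeds ~81,000
    false_positives = int(uninfected_tested * false_positive_rate)  # keeps growing
    print(f"{uninfected_tested:>9,} uninfected tested -> "
          f"{true_positives:,} true-positives, {false_positives:,} false-positives")
```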
We can estimate the scale of the false-positives problem using Bayes' theorem. The specific tests used for COVID-19 are too new to have reliable real-world performance data, but we can expect them to be excellent — much more precise than metal detectors. I will use the same numbers the FDA provides as guidance to practitioners here. But be aware that even the guidance numbers are hypothetical and represent a best-case scenario [3].
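For readers who want the formula, the quantity we are after is the probability of disease given a positive test (the positive predictive value). Bayes' theorem expresses it in terms of the test's sensitivity and specificity and the disease prevalence:

```latex
P(\text{disease} \mid \text{positive}) =
  \frac{\text{sensitivity} \times \text{prevalence}}
       {\text{sensitivity} \times \text{prevalence}
        + (1 - \text{specificity}) \times (1 - \text{prevalence})}
```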
Assume a large COVID-19 outbreak occurs in New York City, and Governor Andrew Cuomo goes into beast mode, testing every person in the city in 24 hours. It's a major operational success, but how much should we trust the data that comes out?
Given a large outbreak, we would expect perhaps 1% of the city's population to be infected at once. In a city of 8.3 million, that is a total of 83,000 people. The hypothetical testing process is nearly perfect, correctly flagging 98% of all infected people as positive (about 81,300 true-positives, with roughly 1,700 infected people missed as false-negatives). We tested every person in NYC to find the infected, including approximately 8.2 million uninfected people. Again, the testing procedure is robust, but different types of errors, such as cross-contamination or data-entry mistakes, can happen, so we assume we classify 98% of these people correctly as well. That is around 8 million true-negatives and 164,000 false-positives. The accuracy of this hypothetical would be phenomenal by biological laboratory testing standards [1].
But look closer at what these results mean for someone who received a positive test result. About 81,300 results are true-positives, but roughly 164,000 results are false-positives. That means 1) about two-thirds of positive test results are false-positives; 2) a person who received a positive test result is more likely to be healthy than infected; 3) anyone analyzing this data would overestimate the actual number of COVID-19 cases by around 3X.
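If you want to check the arithmetic, here is a minimal sketch in Python that reproduces the numbers above (the inputs are the hypothetical ones from this thought experiment, not real test performance data):

```python
# Hypothetical NYC screening example: test everyone, 1% prevalence, 98%/98% test.
population = 8_300_000
prevalence = 0.01      # assumed fraction of the city infected during the outbreak
sensitivity = 0.98     # assumed fraction of infected people flagged positive
specificity = 0.98     # assumed fraction of uninfected people flagged negative

infected = population * prevalence          # 83,000
uninfected = population - infected          # ~8.2 million

true_positives = infected * sensitivity            # ~81,300
false_negatives = infected - true_positives        # ~1,700
false_positives = uninfected * (1 - specificity)   # ~164,000
true_negatives = uninfected - false_positives      # ~8 million

ppv = true_positives / (true_positives + false_positives)
print(f"Chance a positive result is real: {ppv:.0%}")              # ~33%
print(f"Apparent cases vs. actual cases: "
      f"{(true_positives + false_positives) / infected:.1f}x")     # ~3.0x
```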
This result is famous and famously counterintuitive, making it the centerpiece of many Bayesian statistics classes. Since it is counterintuitive, I will reiterate the main points. The test is extremely accurate. Most test results provide the correct answer (over 8 million correct, less than 170,000 wrong). Nearly all the negatives are true-negatives; almost all infected people test positive. The data quality problem shows up in the subset of people who received a positive test (the evidence we were actively searching for). After testing positive, there remains a high probability that a person is not infected. We end up overdiagnosing the disease.
Errors may be unlikely, but a tiny percentage of an enormous number is a big number in its own right. Technically, a person who received a positive test result went from having a 1% chance of having the disease to a ~33% chance after the test. Statisticians may appreciate a probabilistic win, but individuals perceive such a test as wrong most of the time.
There are no statistics, no data tricks, that can fix this after it has occurred. If there were a post hoc fix, this wouldn't be a limit of inference, only a math problem. One statistic used in the example above, the disease prevalence, is unknown in real-world scenarios. Figuring out how many people have the disease was the goal of testing in the first place. Though the thought experiment above is not practical, it's a typical case study used to teach epidemiologists how quickly the problem becomes overwhelming when we screen widely for a disease, even with outstanding laboratory tests. If we go out searching for evidence of disease, the resulting data will be garbage.
We can prevent overdiagnosis by testing fewer people. Specifically, we need to test fewer uninfected people, since missing infected people would be counterproductive to the goal. Of course, we don't know who is infected or not, so we must be strategic. Scientists achieve this balancing act by figuring out risk factors and using them to identify smaller subsets of the population more likely to be infected, i.e., high-risk groups. Suppose Cuomo had restricted the tests to people who had been in the vicinity of the outbreak (viral outbreaks are highly geographically localized). In this hypothetical, 70% of the healthy people are not tested, but the infected people still are. A person who tests positive then has about a 62% chance of being a true-positive. That's a significant improvement in data quality, not to mention the time and money saved.
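The same back-of-the-envelope calculation, with the assumption that only 30% of uninfected New Yorkers get tested while all infected people still do, looks like this:

```python
# Targeted testing: assume 70% of uninfected people are never tested, while the
# infected (clustered near the outbreak) still are. Same hypothetical test as above.
infected, uninfected = 83_000, 8_217_000
sensitivity, specificity = 0.98, 0.98
fraction_uninfected_tested = 0.30   # assumed effect of restricting tests to the outbreak area

true_positives = infected * sensitivity
false_positives = uninfected * fraction_uninfected_tested * (1 - specificity)
ppv = true_positives / (true_positives + false_positives)
print(f"Chance a positive result is real: {ppv:.0%}")   # ~62%, up from ~33%
```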
Targeting is an elegant solution to a fundamental limit of inference. Screening only high-risk groups is not about laziness or lack of resources— testing everyone is a bad idea even if we had all the money and time in the world to devote to a single disease. The rallying cry "everyone can get a test" is good politics but bad public health policy. It is an expert's job to know better (and say so loudly) to counter the public's understandable but fear-driven demands.
Testing less is already the gold standard solution to this problem, and we see evidence of this all over medicine. Once ubiquitous, Pap smears and other screening tests for cervical cancer are now recommended at most once every few years due to the resultant overtreatment problems. The PSA test for prostate cancer is now used only sparingly in men over 50, as its data proved mostly to be noise. Using ECGs to screen for heart disease is not recommended for anyone who is asymptomatic. Mammograms are now targeted to increasingly smaller high-risk groups by age and family history after too many non-cancerous abnormalities caused scares. Again, this is just the tip of the iceberg of a well-known problem.
Data is a tool. We choose how we use our tools. When a metal detector alarms, we do not scream and flee, nor do we charge the person with attempted murder in a court of law; instead, we look for other evidence. Multiple data sources can also inform a diagnosis. In fact, this is how doctors typically diagnose disease. Patients come into the office with symptoms, and the symptoms determine which tests are ordered. Laboratory data is combined with clinical data before making a diagnosis, reducing the amount of error.
Some physicians are, no doubt, still operating in this usual way in their practice. But at least in NYC, the official guidance blasted through commercials and street signage is to do the exact thing that we know from both theory and experience causes overdiagnosis: test without reason or evidence. Worse, the tests are free, as long as the person brings up no other health concerns or symptoms at the appointment. Otherwise, it becomes a medical appointment billed to insurance. This situation is perverse, incentivizing patients not to discuss the exact context that would lead to better medicine.
It is under these unprecedented circumstances that the CDC decided that COVID-19 surveillance cases can be confirmed and reported based on laboratory evidence alone. This standard is a break from best practices that, to be honest, I struggle to understand. Yes, there is an operational need to move fast. Standardization is important for comparison. Using one data set means there are fewer phone calls involved, less code, etc. But using laboratory evidence alone when actively screening for a disease means the data becomes effectively noise. Is that really worth it?
One could argue it is, and some do. One reason we screen for this disease is to prevent new infections. Scientists believe transmission can occur without symptoms. Since we cannot wait for symptoms to confirm a case and still achieve the goal, the situation seems to require overdiagnosis as a means to an end [3]. I have argued that this is a flawed goal. However, if we choose to pursue it, we must also accept the consequences.
Scientists and analysts should not use the COVID-19 testing data to do other research or support any scientific conclusions about the severity or prevalence of this disease. Unfortunately and unsurprisingly, many are. But, this data does not accurately estimate the total health impact or spread of COVID-19. Any inference drawn from comparing this data set with historical data would exaggerate this pandemic's severity relative to other illnesses and causes of death. One data set was collected through active screening at an unprecedented scale, and the other was not. One data set required only laboratory tests as evidence, and the other included only cases screened by clinical symptoms and confirmed by laboratory tests. As they say, garbage in, garbage out. We cannot fix the data, so we have to choose not to use the tool.
Lastly, I am concerned that this testing data is being reported to the public directly as news without any context or caveats. Attempts to point out flaws are censored. Some scientific leaders (notably Dr. Fauci) have admitted that they believe it acceptable to lie to the public as a means to an end. Perhaps that is also the reasoning here, to mislead the public to get the desired behavior?
In general, as a leadership strategy, I find there are negative consequences to choosing to lead with lies. Accuracy is a better, more ethical approach. A nuanced message is just as effective. If you got a COVID-19 test as a precaution, not due to an inciting event or symptoms, you can trust a negative more than a positive—act and reason accordingly.
Testing too much can and does make a disease appear more common than it is. The root cause is not biological but epistemological—any data set collected via search will have the same problem. When we screen millions of credit card transactions looking for fraud or billions of network packets looking for evidence of hackers, the same false-positive problem shows up to the tune of billions of dollars of lost revenue and customers. Perhaps it wouldn't feel so jarring to communicate this point to the public if we spoke openly about data's fallibility on these other topics. We should never start with the assumption that any data set is an objective representation of the world. Data can be wrong. No one has a monopoly on truth.