Frustrated statisticians and epidemiologists took to social media this week to call out substantial flaws in two widely publicized studies trying to estimate the true spread of COVID-19 in two California counties, Santa Clara and Los Angeles.
The studies suggested that far more people in each of the counties have been infected with the new coronavirus than previously thought: they estimated that the true case counts are up to 85 times and 55 times the number of currently confirmed cases in Santa Clara and Los Angeles counties, respectively. That, in turn, would suggest COVID-19 is far less deadly than thought. The large estimated case counts, set against an unchanged number of deaths, would put COVID-19's fatality rate in the same range as that of seasonal flu.
How dangerous is this?
We dig into the details of the studies below, but it's important to note that neither of them has been published in a scientific journal, nor have they gone through standard peer review for scientific vetting. Instead, they have been posted online in draft form (a commonplace occurrence amid a rapidly evolving pandemic that pushes researchers to share data quickly, however uncertain it may be).

The findings seemed to support minority arguments that COVID-19 may be no worse than seasonal flu (a leading cause of death in the US) and that the restrictive mitigation efforts currently strangling the economy may be unnecessary. In fact, three researchers who co-authored the new studies have publicly made those exact arguments. In a controversial opinion piece in the biomedical news outlet STAT, population health researcher John Ioannidis, at Stanford, argued back in mid-March that the mortality rate of COVID-19 may be much lower than expected, potentially making current lockdowns "totally irrational." Health policy researchers Eran Bendavid and Jay Bhattacharya, also both at Stanford, made a similar argument in The Wall Street Journal at the end of March. They called current COVID-19 fatality estimates, in the range of 2 percent to 4 percent, "deeply flawed." Ioannidis is a co-author of the study done in Santa Clara county, and Bendavid and Bhattacharya were leading researchers on both of the studies, which appeared online this month.

The new studies seem to back up the researchers' earlier arguments. But a chorus of their peers is far from convinced. In fact, criticism of the two studies has woven a damning tapestry of Twitter threads and blog posts pointing out flaws in the studies, everything from basic math errors to alleged statistical sloppiness and sample bias.

In a blog review of the Santa Clara county study, statistician Andrew Gelman of Columbia University detailed several troubling aspects of the statistical analysis. He concluded:

"I think the authors of the above-linked paper owe us all an apology. We wasted time and effort discussing this paper whose main selling point was some numbers that were essentially the product of a statistical error. I'm serious about the apology. Everyone makes mistakes. I don't think they [sic] authors need to apologize just because they screwed up. I think they need to apologize because these were avoidable screw-ups."

A Twitter account from the lab of Erik van Nimwegen, a computational systems biologist at the University of Basel, responded to the study by tweeting the quip "Loud sobbing reported from under reverend Bayes' grave stone." The tweet refers to Thomas Bayes, an 18th-century English reverend and statistician who put forth a foundational theorem on probability.

Pleuni Pennings, an evolutionary biologist at San Francisco State University, noted in a blog post about the Santa Clara study that "In research, we like to say that 'extraordinary claims require extraordinary evidence.' Here the claim is extraordinary but the evidence isn't. Also, we learn that even if a study comes from a great university—this is no guarantee that the study is good."

Harvard epidemiologist Marc Lipsitch stated on Twitter that he concurred with similar statistical criticisms made online. He added a "kudos" to the authors for conducting the study and "providing one interpretation of it (which supports their 'it's overblown' view)."

So what has all these researchers up in arms?
The aim of the studies
Both studies primarily aimed to estimate how many people in each of the two counties had been infected at some point with SARS-CoV-2. This is an extremely important endeavor because it can tell us the true extent of infection, help guide efforts to stop transmission, and better assess the full spectrum of COVID-19 disease severity and the fatality rate. Because diagnostic testing has been so limited in the US and many COVID-19 cases appear to present with mild or even no symptoms, researchers expect the true number of people who have been infected to be much higher than we know from confirmed cases. There is no debate about that. But just how much higher is the subject of considerable debate.

The researchers went about their studies by recruiting small groups of residents and testing their blood for antibodies against SARS-CoV-2. Antibodies are Y-shaped proteins that the immune system makes to target specific molecular foes, such as viruses. If a person has antibodies that recognize SARS-CoV-2 or its components, that suggests the person was previously infected.

Santa Clara
In the Santa Clara county study, researchers recruited volunteers using Facebook and had them come to one of three drive-through test sites. They ended up testing the blood of 3,330 adults and children for antibodies. They found that 50 blood samples, or 1.5 percent, were positive for SARS-CoV-2 antibodies.

They then adjusted their figures to try to estimate what share of positive tests they would have gotten back if their pool of volunteers better matched the demographics of the county. The volunteer pool skewed toward certain zip codes in the county and was enriched for women and white people relative to the county's real makeup. The researchers' adjustment ended up nearly doubling the prevalence of positives, bringing it from 1.5 percent to an estimated 2.8 percent.

They then adjusted the data again to account for inaccuracies of the antibody test. There are two metrics for accuracy here: sensitivity and specificity. Sensitivity relates to how good the test is at correctly identifying all true positives. Specificity relates to how good the test is at correctly identifying all the true negatives, in other words, at avoiding false positives. According to the authors of the Santa Clara study, the sensitivity and specificity data on their antibody test led them to estimate that the true prevalence of SARS-CoV-2 infections ranged from 2.49 percent to 4.16 percent. Based on the population of the county, that would suggest that somewhere between 48,000 and 81,000 people in the county had been infected. The confirmed case count at the time of publication was only 956. That puts their infection estimate at 50 to 85 times the number of confirmed cases.

The researchers then estimated an infection fatality rate (IFR) using that large number of estimated infections and an estimate of roughly 100 cumulative deaths from infections up to that time (deaths lag behind initial infections, potentially by weeks). They calculated an IFR of 0.12 percent to 0.2 percent. That falls in the ballpark of seasonal flu, which has an estimated fatality rate of about 0.1 percent.
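The arithmetic behind those headline numbers is straightforward to reproduce. Below is a minimal Python sketch of it; note that the county population of roughly 1.9 million is not stated in the study summary above and is inferred here from the reported prevalence and infection ranges.

```python
# Rough back-of-the-envelope check of the Santa Clara figures reported above.
# The county population (~1.9 million) is an assumption, inferred from the
# reported 48,000-81,000 infections at 2.49-4.16 percent prevalence.

county_population = 1_940_000   # assumed, implied by the reported figures
confirmed_cases = 956           # confirmed cases at time of publication
estimated_deaths = 100          # the study's rough estimate of cumulative deaths

for prevalence in (0.0249, 0.0416):
    infections = prevalence * county_population
    multiplier = infections / confirmed_cases   # how many times the confirmed count
    ifr = estimated_deaths / infections         # infection fatality rate
    print(f"prevalence {prevalence:.2%}: ~{infections:,.0f} infections, "
          f"{multiplier:.0f}x confirmed cases, IFR ~{ifr:.2%}")
```

Running this reproduces the study's figures: roughly 48,000 to 81,000 infections, 50 to 85 times the confirmed count, and an IFR of about 0.12 to 0.2 percent.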
Los Angeles

There is less data available from the Los Angeles study. In an unusual move, even by today's pandemic standards, the findings were initially announced in a press release from the county's public health department, which provided little in the way of statistical and methodological details. A short draft of the study (PDF found here) has also circulated online, but it still has less information on the methods than the Santa Clara study does. Also, the draft has even higher prevalence estimates than the press release. It's unclear why the estimates differ, but we'll mainly focus on the conclusions formally released by the health department.

Generally, for the study, researchers used data from a market research firm to randomly select residents and invite them to get tested at one of six testing sites. The researchers set up quotas for participants by age, gender, race, and ethnicity to match the population characteristics of the county. Their goal was to recruit 1,000 participants. They ended up testing 863 adults using the same antibody test used in the Santa Clara study, which was made by Premier Biotech of Minneapolis, Minnesota. Of the tests given, 35 (or 4.1 percent) were positive.

According to the press release, the adjusted data suggested that 2.8 percent to 5.6 percent of the county's adult population had been infected with the new coronavirus. Given the county's population, that suggests that 221,000 to 442,000 adults in the county had been infected. That estimate is 28 to 55 times higher than the 7,994 confirmed cases at the time. As in the Santa Clara study, that puts the IFR in the range of roughly 0.13 percent to 0.3 percent, closer to the IFR of seasonal flu.
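As with Santa Clara, the ratios can be checked quickly. The short sketch below assumes an adult county population of roughly 7.9 million, a figure not stated above but implied by the reported prevalence and infection ranges.

```python
# Rough check of the Los Angeles figures reported above. The adult population
# (~7.9 million) is an assumption, inferred from the reported prevalence range
# (2.8-5.6 percent) and infection counts (221,000-442,000).

adult_population = 7_900_000   # assumed, implied by the reported figures
confirmed_cases = 7_994        # confirmed cases in the county at the time

for prevalence in (0.028, 0.056):
    infections = prevalence * adult_population
    multiplier = infections / confirmed_cases   # how many times the confirmed count
    print(f"prevalence {prevalence:.1%}: ~{infections:,.0f} infections, "
          f"{multiplier:.0f}x confirmed cases")
```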
Problems

Other researchers were quick to flag concerns and flaws in the studies' methods and statistics. First, there were criticisms and notes of caution about the recruitment strategy in the Santa Clara study. Using volunteers who had been roped in by Facebook ads has the potential to target people who are, well, more likely to use Facebook. Setting up drive-through testing sites may enrich for people who have easy access to cars. Most importantly, by taking self-selected volunteers, there's the potential you'll get the people who are most concerned that they have had COVID-19 and want to get tested to know for sure. This could inflate the number of positives in a participant pool, making the disease look more common than it really is.

According to an email obtained by Buzzfeed News and reported April 24, the wife of study author Jay Bhattacharya also recruited parents via an email on a high school listserv, which may have further biased the participant pool. The email urged parents to sign up for the study to have "peace of mind" and "know if you are immune." Bhattacharya wrote in an email to Buzzfeed that his wife's email was "sent out without my permission or my knowledge or the permission of the research team."

The random selection of participants in the LA study, along with the quotas, dodged these criticisms.

... and statistics
But the most troubling concerns with the studies, by far, relate to the statistics. Perhaps the biggest concern from critics is that the antibody test the researchers used for both studies is not as accurate as the estimates suggest. Premier's test, like dozens of others on the market, has not been thoroughly vetted for accuracy. Given the urgency of the pandemic, the Food and Drug Administration has allowed such tests to be sold without the usual vetting. In fact, the FDA even warns healthcare providers to be aware of their limitations.

Premier reported testing its test against known positive and negative samples to determine its sensitivity and specificity, and the study authors ran their own tests at Stanford. In Premier's hands, the test correctly identified 25 known positive samples out of a total of 37. In tests at Stanford, the study authors reported correctly identifying 153 known positives out of 160 with the test. Combining the estimates, they figured the sensitivity was most likely about 80 percent (with a possible range of 72.1 percent to 87 percent). When Premier tried 30 samples known to be negative, its antibody test accurately identified all 30 of them as negative. But in Stanford's labs, the test only correctly identified 369 of 371 truly negative samples tested. The authors of the study concluded the test most likely had a specificity of about 99.5 percent (with a possible range of 98.3 percent to 99.9 percent).

The specificity estimate suggests that just 0.5 percent of tests will be false positives, but the range leaves open the possibility that up to 1.7 percent of tests are false positives. This is a big sticking point for critics. In the Santa Clara study, the researchers found only 50 positive samples out of 3,330. That's a 1.5-percent positive rate. Given that the false positive rate may be as high as 1.7 percent, it's possible (if unlikely) that every positive test detected was a false positive. The point is not that critics think every positive sample the study authors found was actually a false positive. Rather, they note this because it means that estimating the true rate of positives with any precision is impossible. As Gelman notes in his blog:

"Again, the real point here is not whether zero is or 'should be' in the 95% [confidence] interval, but rather that, once the specificity can get in the neighborhood of 98.5% or lower, you can't use this crude approach to estimate the prevalence; all you can do is bound it from above, which completely destroys the '50-85-fold more than the number of confirmed cases' claim."

Going deeper into the math, statistician Will Fithian, of the University of California, Berkeley, identified what he described as a "sloppy" math error in the calculations that the researchers used to generate their estimate ranges. The statistics of the LA study are not yet available for review, but researchers have suggested that they may suffer from the same flaws. The authors have reported that they are currently redoing their statistical analysis and will release the results soon.
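To make the false-positive arithmetic concrete, here is a minimal sketch of the standard textbook correction of a raw positive rate for test error (the Rogan-Gladen estimator). It is not the study's exact statistical procedure, but it shows why a specificity near 98.3 percent is fatal to a 1.5 percent raw positive rate.

```python
# Minimal sketch of the standard correction of a raw positive rate for test
# error (the Rogan-Gladen estimator). This is not the study's exact procedure;
# it only illustrates why specificity dominates when the raw rate is ~1.5%.

def corrected_prevalence(raw_rate, sensitivity, specificity):
    """Estimate true prevalence from a raw positive rate, given test accuracy."""
    false_positive_rate = 1.0 - specificity
    adjusted = (raw_rate - false_positive_rate) / (sensitivity + specificity - 1.0)
    return max(adjusted, 0.0)   # prevalence can't be negative

raw_rate = 50 / 3330        # 1.5% raw positive rate in the Santa Clara sample
sensitivity = 0.80          # the study's combined sensitivity estimate

for specificity in (0.995, 0.983):   # best estimate vs. low end of the range
    p = corrected_prevalence(raw_rate, sensitivity, specificity)
    print(f"specificity {specificity:.1%}: corrected prevalence ~{p:.2%}")
```

With specificity at 99.5 percent, the corrected prevalence comes out to about 1.3 percent; at 98.3 percent, the raw rate falls below the false-positive rate and the estimate collapses to zero. That collapse is the critics' point: with this test, the data alone cannot rule out that nearly all 50 positives were false positives.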