Friday, April 29, 2022

Baselines, Biases and Big Data Problems - The Secrets of Statistics and the Magic of Metrology



The paradox of big data spoils vaccination surveys
Dec 2021, phys.org

Image credit: Labyrinth by Liqen, Miami 2011

So guess what -- big data is really good at minimizing sampling error, but it's also really good at magnifying systematic biases, like nonresponse bias: vaccinated people are more likely to answer a vaccine survey, and marginalized groups less so. In other words, bad data. Yet because the sample size is so large, we're tricked even more into thinking it must be good data. "Biases in the data get worse with bigger sample size." -phys.org

Here's an example: "Two in 10 respondents did not have a college degree, compared with four in 10 of all U.S. adults. And on race and ethnicity, the fraction of Black and Asian respondents was only half of what it is in the general population."

"Worse than no survey at all" they say.

Thanks Facebook, but you can keep your invasive mass surveillance system of the entire American population to yourself. 

via Harvard: Valerie Bradley, Seth Flaxman, et al., Unrepresentative big surveys significantly overestimated US vaccine uptake, Nature 600, 695-700 (2021). DOI: 10.1038/s41586-021-04198-4


On research methods and data quality:

Meng said he began thinking about the problems posed by big data during a visit to Harvard a decade ago by a U.S. Census Bureau official. The official met with a group of statisticians and asked them about the handling of data sets that were becoming available covering large percentages of the U.S. population. Using the hypothetical example of tax data collected by the IRS, he asked whether the statisticians would prefer a sample covering 5 percent of the population that they knew was representative of the larger population or IRS data that they weren't sure was representative but covered 80 percent of the population. The statisticians chose the 5 percent. "What if it was 90 percent?" the Census Bureau official asked. The statisticians still chose the 5 percent, because if they understood the data, their answer would likely be more accurate than even a much larger set with unknown biases.
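To see why they kept picking the 5 percent, here's a rough simulation along the lines of that hypothetical -- the income distribution and the selection mechanism are invented, but the logic is the same: a small sample that's representative beats a huge one whose coverage is correlated with the thing being measured.

```python
import random
import statistics

random.seed(2)

N = 200_000
# Made-up stand-in for "IRS tax data": a skewed income distribution
incomes = [random.lognormvariate(10, 1) for _ in range(N)]
true_mean = statistics.fmean(incomes)

def random_sample_mean(frac):
    """Simple random sample: representative by construction."""
    return statistics.fmean(random.sample(incomes, int(frac * N)))

def biased_sample_mean(frac):
    """Huge but self-selected: low earners are mostly missing -- a crude
    stand-in for coverage that correlates with the measured quantity."""
    ranked = sorted(incomes)
    cut = int((1 - frac) * N)
    sample = ranked[cut:] + random.sample(ranked[:cut], cut // 20)
    return statistics.fmean(sample)

print(f"truth:             {true_mean:10.0f}")
print(f"5% random sample:  {random_sample_mean(0.05):10.0f}")  # close to the truth
print(f"90% biased sample: {biased_sample_mean(0.90):10.0f}")  # covers almost everyone, still way off
```

That's the big data paradox in miniature: coverage is not the same as representativeness, and no amount of coverage short of 100 percent fixes a selection mechanism that's correlated with what you're measuring.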

And now for the hard stuff:
Completely unrelated image; I'm just collecting pictures from science articles of people holding vials in their fingers: Wastewater Filtration, Pacific Northwest National Laboratory and Andrea Starr, 2022

Drinking alcohol to stay healthy? That might not work, says new study
Nov 2021, phys.org

This is such a great example of how statistics works (and how it doesn't). The key word is "baseline".

We've been told for years (forever?) that one glass of red wine a day is not only not bad, it's actually good for you. In health-speak, they call that "protective".

But this study shows us that there is no group of people we can use as a baseline -- a group that ---doesn't--- drink and yet also ---doesn't--- have other health problems caused by a history of substance abuse. In other words, when the only people who don't drink are people recovering from alcohol addiction (a gross exaggeration), you can't find a baseline for a "normal" person. And if you can't find a baseline, then you can't measure anything.

The majority of the alcohol abstainers at baseline were former alcohol consumers and had risk factors that increased the likelihood of early death. Former alcohol use disorders, risky alcohol drinking, ever having smoked tobacco daily, and fair to poor health were associated with early death among alcohol abstainers. Those without an obvious history of these risk factors had a life expectancy similar to that of low to moderate alcohol consumers. The findings speak against recommendations to drink alcohol for health reasons.
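Here's a sketch of that confounding in code -- the groups and the 20-year death risks below are invented, not the study's numbers, but the arithmetic shows how "sick quitters" make abstention look dangerous and drinking look protective:

```python
import random

random.seed(3)

# Invented 20-year death risks for three groups of 10,000 people each
RISK = {
    "never drank, no risk factors": 0.10,
    "former drinkers (quit for health)": 0.30,
    "low-to-moderate drinkers": 0.10,
}

def death_rate(risk, n=10_000):
    return sum(random.random() < risk for _ in range(n)) / n

rates = {group: death_rate(risk) for group, risk in RISK.items()}

# Naive analysis: pool everyone who doesn't drink into "abstainers"
pooled = (rates["never drank, no risk factors"]
          + rates["former drinkers (quit for health)"]) / 2
print(f"abstainers (pooled):      {pooled:.3f}")  # ~0.20: drinking looks protective!
print(f"low-to-moderate drinkers: {rates['low-to-moderate drinkers']:.3f}")  # ~0.10

# Proper baseline: never-drinkers without the risk factors
print(f"never-drinkers only:      {rates['never drank, no risk factors']:.3f}")  # ~0.10
```

Separate out the former drinkers and the "protective" effect vanishes -- which is essentially what the study found once it accounted for those risk factors.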

I tell my friends this story, and they always ask the same first question, a good one -- what about people like Seventh-day Adventists, or Muslims, who don't drink? Why can't we use them? And the answer is the reason health science is hard.

The reason you can't use groups of people who don't drink is that those groups would likely not represent the much larger group of, let's say, all the people in America. It doesn't even matter that they're small groups; making them bigger won't help. It doesn't work because it doesn't match. You can't compare the two groups because the people in them aren't the same.

Another way we see this, and one which is becoming more evident to those who can fix it, is how certain groups of people (like undocumented immigrants) are under-represented in the data. If the majority of the datapoints are White, Christian and middle class, and you're none of those, then it's possible that the data is not relevant to you. Your baseline isn't represented, so whatever health effects you're trying to measure, they aren't being compared to someone like you. 
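Survey statisticians try to patch this with post-stratification weights, but weighting has a hard limit, as this sketch shows (the strata and counts are invented): a group with zero respondents can't be reweighted into existence.

```python
# Post-stratification sketch: rescale each stratum to its population share.
population_share = {"group A": 0.60, "group B": 0.30, "group C": 0.10}
respondents      = {"group A": 900,  "group B": 100,  "group C": 0}

total = sum(respondents.values())
for group, share in population_share.items():
    n = respondents[group]
    if n == 0:
        print(f"{group}: no respondents -- weight undefined, group invisible")
    else:
        weight = share / (n / total)  # inflate or deflate to population share
        print(f"{group}: weight {weight:.2f} per respondent")
```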

via Public Library of Science: John U, Rumpf H-J, Hanke M, Meyer C (2021) Alcohol abstinence and mortality in a general population sample of adults in Germany: A cohort study. PLoS Med 18(11): e1003819. doi.org/10.1371/journal.pmed.1003819

Partially Related:
This also correlates with the story of how lead was discovered in the air -- the geochemist Clair Patterson was doing work so sensitive (measuring the age of the Earth!) that he kept picking up extra lead in his results and could not figure out where it was coming from. It turned out to be in the air itself, vaporized out of the leaded gasoline burned in our cars' engines. And that story implies that until then, all other experiments were "wrong" because they didn't exclude the excess lead from the otherwise "normal" background.

Last One -- Palmar Sweating:
During a nuclear war scare (1950's), all experiments into palmar sweating at a research institute had to be abandoned because the base level of the response had become so abnormal that the tests would have been meaningless.  (p188)
-The Naked Ape, Desmond Morris, 1967
