
Why I don’t Analyze COVID-19 Data

14 Nov 2020

Someone asked me the other day on a COVID-related post on Facebook: “Hey Florents, are you interested in doing some analysis with Python?” He asked because he knew I work with data and have occasionally published data analyses and commentary on topics in the news (examples: Tension, polarization, and new Twitter accounts during the Greek Fires and Machine Learning-Based Personality Analysis of a Failed Finance Minister).

I responded with a short reply along the lines of “it’s irresponsible and not right of me to do such an analysis”. It’s the same response I’ve given to a few outlets that have asked for similar publishable analyses or comments. I thought I’d add some more context and detail to my thinking.

It’s complex. So complex, in fact, that after decades of research in immunology, epidemiology, and general medicine, we’ve fallen back on good old mom’s advice: “don’t breathe on other people”, “cover your mouth when coughing or sneezing”, “don’t invade someone else’s personal space”. One could argue that all real-life, solution-worthy problems are inherently complex, and that is exactly why we must try to solve them. True, but not all problems are equally open to abstractions and modeling assumptions. The one element all models share is the plain fact that they’re all wrong. We can’t afford that fact when our analyses are read by desperate laymen, too.

Out of respect for the dead, the sick, and their families. In desperate situations, people naturally seek simple and actionable answers. Their first inclination is to look for them; their second is to point to them and blame someone (usually the authorities) for not implementing them. But it is not easy to come to conclusions and draw associations. One can easily draw a diagram indicating a correlation between the number of tests and mortality. What will their families think? That, based on my analysis, more tests would have saved their loved ones. I could go on adding asterisks and footnotes discussing fallacies, causality, p-values, and confidence intervals, but no one would listen. We try to do this in presentations even with college-educated C-level executives, and we lose them in the process; let alone the average high-school-educated citizen.
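
To make the trap concrete, here is a minimal sketch with entirely invented numbers: a latent “preparedness” factor drives both how much a region tests and how many patients it loses, so tests and mortality correlate strongly with no causal link between them.

```python
import random

random.seed(0)

# Hypothetical regions (all numbers invented for illustration):
# a latent "preparedness" factor drives both testing volume and
# mortality, so the two correlate without any causal link.
regions = []
for _ in range(50):
    preparedness = random.uniform(0, 1)  # latent health-system capacity
    tests = 1_000 + 9_000 * preparedness + random.gauss(0, 500)
    mortality = 8 - 6 * preparedness + random.gauss(0, 0.5)  # deaths per 100 cases
    regions.append((tests, mortality))

# Pearson correlation, computed by hand to stay dependency-free
n = len(regions)
mx = sum(t for t, _ in regions) / n
my = sum(m for _, m in regions) / n
cov = sum((t - mx) * (m - my) for t, m in regions) / n
sx = (sum((t - mx) ** 2 for t, _ in regions) / n) ** 0.5
sy = (sum((m - my) ** 2 for _, m in regions) / n) ** 0.5
print(f"corr(tests, mortality) = {cov / (sx * sy):.2f}")  # strongly negative
# A naive reading: "more tests would have saved lives". What the data
# actually reflects: better-prepared regions test more AND lose fewer.
```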

I don’t have enough data. We may think we do, but we really don’t; neither historical nor current. Only a handful of countries have the philosophy and processes for truly open data. Let me give you some examples of data points not available in Greece. Greece, because it’s the country I can speak of, but most importantly because it was widely regarded as a “success story” during the first wave of the pandemic.

In Greece we don’t have access to numbers fundamental for serious analysis: daily ICU admissions, ICU availability per region, daily deaths per region, positivity rate per region, deaths inside and outside ICUs, how many new cases are hospitalized and how many are recovering at home, R0 per region. We don’t even know the number of tests per region!

Now think: would you commission or undertake a data science project without access to these variables? I definitely wouldn’t. Still, I see a lot of experts (and wannabe experts) try to extrapolate stories and narratives from a couple of overly simplistic ratios. Journalists, unfortunately, are too susceptible to this. For instance, when 30% of some rapid tests conducted in Thessaloniki came back positive, this was reported, in mainstream outlets, as “1 in 3 living in Thessaloniki test positive for the coronavirus”. I tried explaining the fallacy to a couple of journalists I knew, but the cat was already out of the bag.
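
To see why the headline doesn’t follow, here is a minimal simulation, with every number hypothetical and chosen only for illustration: when tests go disproportionately to symptomatic people, positivity among the tested can sit near 30% while the true prevalence is a tenth of that.

```python
import random

random.seed(42)

# All numbers are hypothetical, chosen only for illustration.
POPULATION = 1_000_000
TRUE_PREVALENCE = 0.02        # 2% of residents actually infected
P_TESTED_IF_INFECTED = 0.30   # the infected (often symptomatic) seek tests
P_TESTED_IF_HEALTHY = 0.014   # the healthy rarely get tested

tested_total = tested_positive = 0
for _ in range(POPULATION):
    infected = random.random() < TRUE_PREVALENCE
    p_tested = P_TESTED_IF_INFECTED if infected else P_TESTED_IF_HEALTHY
    if random.random() < p_tested:
        tested_total += 1
        tested_positive += infected

print(f"True prevalence:         {TRUE_PREVALENCE:.1%}")
print(f"Positivity among tested: {tested_positive / tested_total:.1%}")
# ~30% positivity in a non-random sample, while the city-wide
# prevalence is 2%: "1 in 3 tested" is not "1 in 3 living there".
```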

I don’t have enough skin in the game. Words are cheap; bytes and pixels even cheaper. I am not a policymaker. I cannot light-heartedly come to conclusions or even hint at solutions. It is too easy for someone to extrapolate conclusions from an analysis even when the original author refrains from doing so. Accountability is a critical factor and an indicator of professionalism. It frightens me to see “experts” pitching “free as in pro bono”, untested, not openly documented, standards-averse, AI-driven approaches to governments and getting away with it. Some governments discard them, while others jump on the PR bandwagon. I don’t think I’m the only one disgusted by this AI-driven cynicism when we haven’t yet satisfactorily answered the “why don’t we have enough respirators?” question. This is not the time for “move fast and break things” approaches.

I don’t have enough domain knowledge. Every professional data scientist has had an embarrassing moment where a presentation was thrown off course by a silly remark or assumption made for lack of domain knowledge. The coronavirus pandemic is not one of those domains where one can jump into a project and learn on the fly. The theory of clinical statistics and epidemiology has been evolving for decades. It is simply impossible to acquire, on the fly, the critical mass of domain knowledge needed to conduct meaningful analysis.

Trust is at stake. Trust in the methodology of science, in medicine, in statistics, and in technology has already been jeopardized. We in the tech community, in particular, lost the public’s trust over privacy once, and it has already taken some years to restore. This time we don’t have room for yet another “move fast and break things” approach. Trust is too easy to lose and too difficult to regain. It would be painful to see another AI winter triggered by COVID-19.

Formulating a linear combination of the above aspects, though, is a strictly personal process, and one may weigh them differently. If you think you have some interesting analysis to share, feel free to do so.

Tags #data #covid #pandemic