The Data Detective

The Data Detective by Tim Harford aims to arm you with a set of evaluation criteria for any piece of data you might hear. Instead of dismissing or accepting it outright, the goal is to get you to ask yourself: "what does this mean?"

When talking about his motivation, Harford starts off by sharing an anecdote about How to Lie with Statistics by Darrell Huff. Huff was a journalist who was deeply (and in many cases correctly) suspicious of official statistics. He wrote the book as a masterclass in how facts and figures can be used to mislead.

There are a lot of good lessons in Huff's book (e.g. correlation vs causation, hand-picked samples, etc), but it focuses almost entirely on the ways statistics can mislead and lie. Rather than evaluating data on its merits, Huff came to mistrust most official figures. He later applied that same reasoning as a lobbyist for the tobacco industry, and ended his career testifying that smoking does not cause cancer!

Harford says there's a place and a time for Huff's thinking, but also times when it's important to trust statistics. The book lays out ten rules for evaluating data.

Rule 1: Search your Feelings

Before responding to anything, first understand "how it makes you feel". We're generally predisposed to accept or reject certain facts based on what believing them would say about us as people.

If we don't want something to be true, we ask ourselves "must I believe this?" and look for faults. If we want something to be true, we ask ourselves "can I believe this?" and look for reasons why it might be. This is motivated reasoning, which Julia Galef also touches on in The Scout Mindset.

I find this question to be useful in almost any context (whether it's taking feedback, encountering new data, or getting in an argument with someone else).

Rule 2: Examine your everyday experience

The vignette here is about figures released about the London tube: they indicate that the average train is mostly empty, carrying only a small fraction of the passengers it could!

Obviously this seems false if you commute every day. From the perspective of passengers, these trains are almost always full. But from the perspective of the trains (or the system at large), most are running either at off-peak times or in the reverse-commute direction.
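
The gap between the two perspectives is just a weighted average. Here's a minimal sketch with made-up numbers (mine, not Harford's or TfL's) showing how the same trains can be "mostly empty" on average and "always packed" for the average passenger:

```python
# Toy numbers of my own invention: a system where most trains run
# nearly empty off-peak and a few run packed at rush hour.
trains = [50] * 20 + [900] * 4  # passengers on each of 24 trains

# The system's view: average passengers per train.
system_avg = sum(trains) / len(trains)

# The passenger's view: pick a random *passenger* and ask how full their
# train was. Crowded trains are over-sampled because they carry more riders.
passenger_avg = sum(n * n for n in trains) / sum(trains)

print(f"Average over trains:     {system_avg:.0f}")     # ~192 -> "mostly empty"
print(f"Average over passengers: {passenger_avg:.0f}")  # ~715 -> "always packed"
```

Both numbers are true; they just answer different questions.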

There's an open question here though: when do you discount your experience, and when do you lean into it? I didn't find this question adequately answered, but Harford does mention that it's worth thinking about base rates (even though that's also a hard thing to do).

In the case of a smoker getting lung cancer, the event occurs infrequently enough that our own experience is a poor guide. If an event really is random and rare, maybe we shouldn't trust everyday experience at all!
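
To make "thinking about base rates" concrete, here's a toy calculation. The risk figures are illustrative numbers I've chosen for the example, not from the book:

```python
# Illustrative risk figures, assumed for this example (not from the book).
p_cancer_smoker = 0.15     # assumed lifetime lung cancer risk for smokers
p_cancer_nonsmoker = 0.01  # assumed lifetime risk for non-smokers

# Everyday experience: most smokers you personally know never get lung cancer...
print(f"Smokers who stay cancer-free: {1 - p_cancer_smoker:.0%}")  # 85%

# ...so "my uncle smoked for 40 years and was fine" feels persuasive.
# The base-rate comparison tells the real story:
print(f"Relative risk vs non-smokers: {p_cancer_smoker / p_cancer_nonsmoker:.0f}x")  # 15x
```

Anecdotes sample the 85%; the base rates reveal the 15x.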

Rule 3: Ask what is being measured

Oftentimes, it's too easy to jump into correlations between figures without deeply understanding the thing being measured! This is a sin I find myself guilty of all the time.

An example of this is infant mortality. If a baby born at 24 weeks dies, is that a miscarriage or an infant death? The US tends to count such a baby as born alive (and its death as an infant death), while other countries record the same outcome as a miscarriage or stillbirth. That definitional choice inflates the US infant mortality figure relative to other countries, and it has big ramifications on the policy side, complicating charged issues like the right to abortion. Nobody agrees on the data because nobody agrees on the definitions.

I've seen hundreds of pitch decks which conflate various measures and seek to represent the 'best' data. My personal opinion is that the best measures should be 1) explainable in a single sentence, 2) time-bound, and 3) subject to the minimum number of qualifications + exceptions.

Rule 4: Step back and enjoy the view

The cadence of a news outlet rewards a certain kind of story. The minute-by-minute ticker on Bloomberg is faster-twitch than the daily news, which is faster-twitch still than The Economist, which comes out weekly. The more you zoom out, the less random noise figures into your viewpoint.

There's an interesting thought experiment here: what would a monthly or a yearly newsletter look like? Enlightenment Now argues that across nearly every dimension (infant mortality, poverty, starvation), quality of life has improved dramatically over the last 100 years. Focusing on those trends paints a very different picture of humanity, one that inspires a lot of optimism that the future will be better than the present.

That said, I was recently talking with a friend who astutely noted that "news happens a lot faster now". When you think about the big stories of just the last 18 months (COVID, the Russia-Ukraine war, the Capitol insurrection, weekly school shootings, the overturning of Roe v. Wade, etc), I think he's absolutely right. Any one of these events would have been the "story of the year" ten years ago. And now they happen on a routine basis.

I'm not quite sure what to make of this. Perhaps we are going through a much more turbulent time. Perhaps the proliferation of news + social media causes events that were once missed to now rise to the surface.

Rule 5: Get the backstory

We tend to over-report on findings that are 'interesting' and never bother to publish corrections saying "paper X was wrong". Thus, we end up with a huge replication crisis (why would you bother redoing somebody else's work!?)

Survivorship bias is huge. There's the famous picture of a WWII plane covered in bullet holes, but the bias is really everywhere: in research papers, news stories about Kickstarter projects, tales of famous entrepreneurs, etc.

There's apparently a medical journal, the Cochrane Library, which publishes systematic reviews of randomized controlled trials in plain English. I was surprised to learn such a thing existed. I'd love to have this sort of thing elsewhere. Apparently there's also a similar database for political policy + views, though I neglected to write down the name.

Rule 6: Ask who is missing

The Milgram experiments famously showed that people are likely to defer to authority figures (an experimenter in a lab coat) when asked to apply larger and larger electric shocks to a subject.

Most of the experiments around conformity/deference to authority were tested only on male college students. Obviously those samples leave a lot out! Subsequent research has had trouble replicating these studies.

Another example: the famous 1936 election between Roosevelt and Landon had two competing polls. One was from Gallup, who used careful sampling methods; the other was the Literary Digest's enormous mail-in straw poll. The Digest collected vastly more responses, but it didn't account for sampling bias (who is more likely to respond to the poll). Gallup was far more accurate.

For any piece of data, ask who is most likely to respond.
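
As a sketch of why the huge biased sample loses to the small random one, here's a toy simulation. The electorate split and response rates are my own invented numbers, not figures from the book:

```python
import random

random.seed(0)

# Invented electorate: 55% actually support candidate A.
population = ["A"] * 550_000 + ["B"] * 450_000

# A huge but biased poll: B supporters are three times as likely to
# respond (roughly the Literary Digest failure mode).
biased = [v for v in population
          if random.random() < (0.30 if v == "B" else 0.10)]

# A small but representative poll: 1,000 voters drawn at random.
unbiased = random.sample(population, 1_000)

def support_for_a(sample):
    return sum(v == "A" for v in sample) / len(sample)

print(f"Biased poll (n={len(biased):,}): A at {support_for_a(biased):.1%}")
print(f"Random poll (n={len(unbiased):,}): A at {support_for_a(unbiased):.1%}")
print("Truth: A at 55.0%")
```

The big poll gets a precise answer to the wrong question; the small one gets a noisy answer to the right one.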

Obviously this has big implications for AI as well. If you only train on young white males, your algorithms will tend not to identify women or people of color!

In some cases, there are important differences related to sex. Apparently treatments + outcomes for COVID vary by sex, but most reporting doesn't break the statistics down that way.

Rule 7: Get transparency on how the model was created

This chapter basically delves into the issue of AI explainability.

Google Flu Trends was a hugely impressive project... until it failed to predict flu trends. Nobody was sure exactly why; the correlations the model relied on simply stopped holding, and since the model was a black box, there was no way to tell which ones had broken.

There's also the story of Target predicting a customer's pregnancy before her own family knew about it.

If you can't understand where a model's predictions come from, you're probably going to have a bad time: it's going to be hard to reason about when and why it fails.

Rule 8: Don't Take Statistical Bedrock for Granted

This chapter was pretty eye-opening. It made me realize how many stats we cite actually come from the federal government (unemployment, interest rates, census, demographics, etc).

In order to be credible, a government can't fudge these numbers! Nobody pays attention to the official Argentinian inflation rate because it doesn't line up with lived experience.

It's also difficult to find a higher-ROI outlet for public money than measuring these various statistics. If a figure guides a billion dollars' worth of spend, paying to measure it accurately seems pretty worthwhile!

Rule 9: Misinformation can be beautiful too

I forgot most of this chapter before I wrote it up!

Rule 10: Keep an open mind

Above all, be curious. There's a wonderful line here that really resonated with me:

People have the ability to learn anything... you just have to get them interested first.