Debunking the myth of purely ‘data-driven’ decisions

Data is a great tool, but some expect too much of it

It is impossible to reason from data alone.

We always make assumptions and draw from what we already believe to be true.

This should not be blasphemous. As a data scientist, I do not consider this a particularly negative or pessimistic viewpoint. It’s just a realistic assessment of what data can actually do for us as a tool.

Let’s see if we can quickly build an intuition for why this is true.

Take a moment to examine the shape data above. Ask yourself, what shape will come next? What color will it be? How confident are you in this inference?

. . . 
. . . 
. . .
. . .

Done thinking? I’d love to hear what you thought. Unfortunately, blogs are a one-sided conversation, so you’ll have to settle for my inner monologue.

I see a set of shapes, or rather a set of shape pairs, one inscribed in the other. There are broad patterns I could pull out, like ‘triangles tend to have inscribed white circles’ (ignoring the one exception). I could compute summary stats to predict what comes next:

There is a 61% chance the next shape is a triangle because 17 of the 28 shapes are triangles. There is a 57% chance the next shape is a triangle with an inscribed circle because 16 of the 28 shapes are triangles with inscribed circles.
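The summary-stats approach can be sketched in a few lines of Python. The shape list below is a hypothetical encoding of the figure (only the totals, 17 triangles and 16 triangle-with-circle pairs out of 28, are stated in the text; the remaining split is assumed for illustration):

```python
from collections import Counter

# Hypothetical encoding of the 28 shapes: (outer_shape, inner_shape) pairs.
# Only the totals are given in the text; the exact split is assumed.
shapes = (
    [("triangle", "circle")] * 16
    + [("triangle", "star")] * 1
    + [("square", "circle")] * 11
)

outer = Counter(s[0] for s in shapes)
pairs = Counter(shapes)

p_triangle = outer["triangle"] / len(shapes)                 # 17/28, about 0.61
p_tri_circle = pairs[("triangle", "circle")] / len(shapes)   # 16/28, about 0.57
```

Note that this only summarizes frequencies in the past observations; treating those frequencies as predictions already smuggles in the ‘more of the same’ assumption discussed below.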

Why do I think it’s best to assume things will just continue in the same way? These shapes hardly look natural. They are probably artificial in some way. What do I know in my real life that could have such strange shapes?

These are pieces from the board game we accidentally knocked off the table. We were able to find all the brightly-colored pieces quickly. The white spherical pieces seemed to have rolled farther away. Since it’s all that’s left, there is a 100% chance the next shape we find is a white sphere.

But, wait, we never even observed a white sphere at all. Why would I predict something not before seen or observed? It’s as if I just made that up. Where did that come from? What does it even mean to come next when I am looking at a static image anyway? What if the next shape is not new but evolving, something like …

These are rare tri-polar magnets that change shape and color as they attract each other. The next shape will be an orange triangle with an inscribed blue star—the midway point between the square and triangle magnet evolution. Look, one square magnet has already started to evolve into a triangle, proof that we are on the right track!

All the ideas above explain the data equally well. They are all ‘data-driven’ in that they fit the data perfectly. One even allows complete certainty in our inference. Yet each one points to a very different prediction of what will happen next and why. Which one is the best? How can I tell?

Data does not speak for itself

How can one analyze a data set so as to guarantee a correct conclusion every single time?

This question has plagued epistemologists (philosophers who study the theory of knowledge) and statisticians (mathematicians who study applied data analysis) for centuries. Everyone who has proposed a generalizable solution has so far come up short.

This is what is known as the problem of induction. David Hume famously described this problem in 1739: there is no way to justify a priori that learning from past experiences (from data) is valid except to observe that it has worked before. That’s circular logic — we assume we can learn from experience because we have learned from experience.

Incredible minds have picked up the problem of induction while studying scientific thinking. These include Karl Popper, who proposed that scientific theories can only be falsified, never proven, and Thomas Kuhn, who coined the term ‘paradigm shift’ to describe how consensus models of knowledge explain the uneven timeline of scientific advances. Each of these thinkers deserves separate consideration. This blog is meant to be a quick read, so here’s my best three-sentence summary of the least controversial bits:

Data tells us that something was observed, not why

We need to know the why before we can learn or predict from data.

Humans must come up with the why through creativity.

Infinite stories and yet nothing to do

This leaves us with a new problem, very different from analyzing data. We need a why to make an inference. As we saw above, many different whys can explain any data set. I wrote down three whys for the shape data above, but there are actually infinitely many. You probably thought of something different than I did. If a limit exists, it is a practical one imposed by restrictions on time, energy, and imagination.

A quick note on semantics: different disciplines have used various words to describe the large set of potential whys allowed by any data set — ideas, universes, scenarios, myths, models, generalizations, explanations, algorithms, hypotheses, inferences, etc. From now on, I am choosing to use the word story because it reflects both the creativity inherent to their generation and the skepticism that should be applied before their acceptance.

Any data set is explainable by infinite stories. To act or learn, we must choose one story over another. We need a story selection process. This selection process is called inductive reasoning. Reasoning is, literally, the act of creating reasons. Reasoning is choosing to believe one possible story over another. Reasons have nothing to do with data and everything to do with what we already believe (i.e., assumptions).

Four common inductive reasons (to choose one story over another)

#1 Someone else says that one story is true

One could choose to interview the person who collected the data, or perhaps an expert in the field, to get their opinion. This usually helps us with, at a minimum, figuring out what the data is designed to represent. Is it magnets or board game pieces? How many board game pieces are there? How the data was collected is important too. Was it random or selected for certain qualities? Of course, this person (source of truth) could always be mistaken, confused, tired, biased, or lying. You’d have to assume the person is trustworthy. We all need to start somewhere.

#2 The story is more of the same

By nature of being an observation, all data is about the past. To use data to do anything at all, we have to assume that the past is a good model for the future. The sun has risen each morning; therefore, it will rise again tomorrow morning. This past-is-future assumption is ubiquitous. Hume called it the Principle of Uniformity of Nature. This reason has the added benefit of making the inferences simple — things will continue to be the same. 59% of people polled said they would vote for Candidate A today, so 59% of all people will vote for Candidate A on election day. The problem with this reason is that while many things are constant across time, the world also changes. How can we know what changes and what is constant? We’ll have to make an assumption for any specific problem or inference.
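A quick sketch of the polling leap, with a made-up sample size. Even the familiar 95% margin of error only quantifies sampling noise; it says nothing about whether preferences will hold steady until election day, which is exactly the uniformity assumption doing silent work:

```python
import math

# Hypothetical poll: 59% of n = 1000 respondents favor Candidate A.
n = 1000
p_hat = 0.59

# The 'past is future' leap: project today's sample onto election day.
predicted_share = p_hat

# Standard 95% margin of error for a proportion (normal approximation).
# This covers sampling noise only, not change over time.
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
```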

#3 The story has proven useful before

This is a very famous reason, in my opinion the greatest-of-all-time (GOAT) reason, to prefer one story over another. It is the heart of the modern scientific method. Scientists tell themselves stories (a.k.a. hypotheses, theories) and then try to kill those stories by finding experimental data inconsistent with them (Karl Popper called this falsification). In this view, the best science stories are the ones that have survived the longest, challenged by the most experimentation, without being rejected.

In my years as an experimental scientist, I found this ‘old rules over new rules’ reason particularly elucidating because it allowed me to prefer the story ‘the data is bad’ over stories that were actually consistent with observed data. Here is an inspired-by-real-life example. Which story should you prefer: new data collected by a first-year student demonstrates that Einstein’s theory of special relativity was wrong all this time, or a first-year student makes a methodological error during an experiment? Sure, it could go either way. But the old-rules reason would lead one to prefer the story about a student making a mistake. I, for one, would be willing to bet on it.

You may have heard this ‘old rules’ reasoning in the context of Bayesian statistics. Bayesians call the ‘best existing story’ a prior. This is a powerful approach but also leads us astray from time to time. No story is ever perfect, plus who gets to say which existing story is best anyway?
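The student-versus-relativity bet can be written as a one-line Bayesian update. All the numbers below are invented for illustration; the point is only that when two stories explain the data equally well, the priors decide:

```python
# Invented priors for illustration: relativity has survived a century of
# experiments, while first-year methodological mistakes are common.
prior_relativity_wrong = 1e-9
prior_student_error = 0.05

# Both stories are fully consistent with the anomalous measurement,
# so each likelihood is the same.
likelihood = 1.0

# Unnormalized posteriors; only their ratio matters for the bet.
post_wrong = likelihood * prior_relativity_wrong
post_error = likelihood * prior_student_error

odds = post_error / post_wrong  # millions to one in favor of 'student error'
```

Notice that the data contributed nothing to the decision here; the likelihoods cancel, and the choice comes entirely from what we believed before the experiment.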

#4 The story is the simplest

This reason is better known as Occam’s Razor. We prefer the simplest story because we believe our world is more likely to be simple than complex. Occam’s Razor has been incorporated into many machine learning algorithms, particularly those used in natural language processing with high-dimensional feature spaces. AI practitioners call it regularization, but that is essentially Occam’s Razor renamed — at its core, it is just math that says simpler is better. Is our world exactly as simple as it needs to be, and no more? Or is it sometimes overly complicated? What do you think?
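A minimal sketch of regularization as Occam’s Razor, with made-up numbers: an L2 penalty added to the training loss makes the ‘simpler’ story (smaller weights) win even when both stories fit the data equally well.

```python
def penalized_loss(weights, data_fit_error, lam=1.0):
    """Loss = how badly the story fits the data + how complicated the story is."""
    complexity = sum(w * w for w in weights)  # L2 (ridge-style) penalty
    return data_fit_error + lam * complexity

# Two stories with identical fit to the data, differing only in complexity.
simple_story = penalized_loss([0.5, 0.0, 0.0], data_fit_error=1.0)
complex_story = penalized_loss([0.5, 2.0, -3.0], data_fit_error=1.0)

# With equal fit, the penalized loss prefers the simpler weights.
```

The choice of `lam` (how strongly to prefer simplicity) is itself an assumption, not something the data hands us, which is the point of this whole section.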

All good reasons, except when they are wrong

These are four of the most common reasons to prefer one story over another. For any given data problem, we will use different reasons to get from infinite stories to one.

None of these reasons originate from the data itself. None of them are universally true or universally helpful. Human judgment supplies the reasons, sometimes in the form of existing expertise or context, sometimes by identifying relevant concepts or ideas.

In summary, analyzing data, reasoning from data, is an inherently pragmatic and human endeavor. Data alone is insufficient to reach any conclusion: there are too many stories we could tell ourselves. This idea gives a whole new meaning to the phrase ‘reasonable people can disagree.’

As a data scientist, this ‘infinite stories’ mental model influences how I think about my role in business and society. I code my favorite reasons into an algorithm; this is what we call machine learning. But who is to say my reasons are better than anyone else’s? My reasons will work well for some use cases and poorly for others. It will depend on what is true, or what we choose to assume is true. There is no such thing as a universal best practice of learning from data.

Human or machine, none of this even begins to resemble a repeatable or predictable path to truth. Accepting this reality actually helps data scientists do their job, the job of making data useful for a specific purpose, instead of searching for ‘right answers.’