My statistics obsession was initially fueled by fear.
I was two years into a graduate research project examining differences in neuroplasticity between adult and developing brains. Things were going poorly.
Coming into graduate school with a fellowship, I had the rare freedom to come up with my own research project. Seeking easy wins, I chose to do something crazy. I started by trying to reproduce well-known past experimental results.
I designed conceptual replication experiments. Conceptual replications are experiments designed to test the same hypothesis as a previous study without using the same methods. Some scientists consider this type of experiment inferior to exact replication. Still, conceptual replications are practical. Wet lab research is expensive. Labs do not always have the same equipment or reagents needed to do an exact replication. Plus, experimental methodologies improve with time.
I also happen to like conceptual replications for their scientific value. A scientific finding is not particularly useful if it is only true in a very narrow experimental context. If an effect is also observable from other measurement modalities or in slightly different experimental contexts, it suggests the potential for a larger impact.
So two years in, I had performed dozens of replication experiments. More than half of the time, I did not observe the same result as the scientists who had come before me. This was a trend. It was starting to shake me.
During my undergraduate studies, I worked in three labs doing independent research. In two out of three positions, I started my research by attempting to reproduce the experiments of a Ph.D. student who had graduated. Both times, I failed to replicate the previous graduate student's published results.
I shrugged off these early experiences. I was inexperienced. Graduate school was my chance to do things right. I had set out to do better research, to be a better scientist. Reproducing results, some of which were in textbooks, seemed like it should be easy. It was not working. I was failing again. I sensed my scientific career was at risk.
Statistics became a critical part of my search for an explanation. At first, it was a troubleshooting endeavor. I read dozens of statistics textbooks. I spent my weekends running Monte Carlo simulations to develop strong probabilistic intuitions. I re-implemented hypothesis tests and Bayesian algorithms in Matlab—to prove to myself I could. I ended up reading a ton of philosophy (epistemology). Yes, it was the start of something great.
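To make this concrete, here is a minimal sketch in Python (my original work was in Matlab) of the kind of weekend Monte Carlo exercise I mean. The sample size, effect size, and number of runs are illustrative assumptions, not values from my actual experiments.

```python
# Monte Carlo sketch: how often does a small study detect a modest true effect,
# and how inflated are the "significant" effect sizes it reports?
# All numbers below (n=12 per group, true effect = 0.5 SD, 10,000 runs) are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_per_group = 12        # samples per group (assumed)
true_effect = 0.5       # true group difference in standard-deviation units (assumed)
n_simulations = 10_000

p_values = np.empty(n_simulations)
observed_effects = np.empty(n_simulations)

for i in range(n_simulations):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    _, p_values[i] = stats.ttest_ind(treated, control)
    observed_effects[i] = treated.mean() - control.mean()

significant = p_values < 0.05
power = significant.mean()
exaggeration = observed_effects[significant].mean() / true_effect

print(f"Estimated power: {power:.2f}")
print(f"Mean effect among significant runs: {exaggeration:.2f}x the true effect")
```

Running simulations like this one makes a key intuition visceral: underpowered studies not only miss real effects most of the time, the effects they do report tend to be exaggerated, which is exactly the kind of result that resists replication.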
No data. No problem.
I have my Ph.D.
I am not a scientist anymore.
This period of research changed the trajectory of my entire life. It led me to invent my signature approach to data science and statistical modeling.
I analyze fake data.
Yes, completely fake, as in randomly generated. I generate and use fake data to build models of real-world phenomena. It works wonderfully. How did I arrive at this strange methodology?
Back to graduate school ...
Given the limits of my replication experiments, it was always possible that methodological differences explained the reproducibility problems. But which methodological difference was creating the problem? I started looking for a way to compare my process with the original research, step by step. I needed intermediate results to discern where my results first diverged from the published study. All I had were the final graphs.
This was before the open science movement. There was minimal data archiving. Data sharing was not part of the culture.
So, I reverse-engineered the studies. Working backward from the published graphs, I tried to figure out what the raw data must have been. I used theory, plus experimental data I had collected, to make informed estimates about noise sources and distributions. Eventually, I worked all the way back from published figures to raw data. It was fake raw data, consistent with the effects I had been able to replicate. Then, I added the missing signal. When I analyzed this fake data exactly as described in the Methods sections of the original papers, I could reproduce the figures in the final published articles: every graph, every data point, every error bar, and most of the representative examples. I had mastered the art of simulation.
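To show the shape of that workflow, here is a toy version in Python. The noise model, group sizes, and "published" effect below are placeholder assumptions, not numbers from any real paper; the point is the process: simulate plausible raw data, add the reported signal, analyze exactly as the Methods describe, and compare against the published figure.

```python
# Toy version of the reverse-engineering workflow described above.
# Every number here (noise model, group size, "published" effect) is an
# illustrative assumption, not data from any real study.
import numpy as np

rng = np.random.default_rng(42)

# Step 1: simulate fake raw data consistent with what I could replicate (baseline only).
n_cells = 30
baseline = rng.normal(loc=1.0, scale=0.25, size=n_cells)   # assumed noise distribution

# Step 2: add the "missing" signal the original paper reported but I never observed.
reported_effect = 0.4                                       # hypothetical published effect
treated = rng.normal(loc=1.0 + reported_effect, scale=0.25, size=n_cells)

# Step 3: analyze exactly as the Methods section describes (here: mean +/- SEM).
def summarize(x):
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

base_mean, base_sem = summarize(baseline)
treat_mean, treat_sem = summarize(treated)

# Step 4: compare the simulated summary statistics to the published figure's values.
print(f"baseline: {base_mean:.2f} +/- {base_sem:.2f}")
print(f"treated:  {treat_mean:.2f} +/- {treat_sem:.2f}")
# If these match the published bars and error bars, the fake raw data is a
# plausible reconstruction of what the original raw data must have looked like.
```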
Extraordinary claims require extraordinary evidence.
Carl Sagan.
Those of you who have followed the reproducibility crisis in science will not be surprised to learn that this process exposed evidentiary problems, signs of p-hacking, and other inferential mistakes. Addressing the reproducibility crisis in biological sciences continues to be a passion of mine. This is not the topic for today.
I want to talk about fake data.
Fake data is an amazing tool for developing robust, scalable, and highly accurate algorithms and analytic strategies for business use cases.
With fake but realistic data, I could test the sensitivity of different analytic approaches. I could figure out how much noise or systematic error it took to flip the conclusion. I could identify the root causes when the results deviated from expectations. In short, I could build more robust analytics.
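One concrete way to do this is to plant a known effect in the fake data and sweep the noise level until the analysis can no longer find it. The sketch below shows the idea for a simple two-group comparison; the effect size, sample size, and noise grid are assumptions chosen for illustration.

```python
# Sensitivity sweep sketch: at what noise level does the analysis stop
# detecting a fixed, known effect planted in the fake data?
# Effect size, sample size, and noise grid are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

true_effect = 0.5
n_per_group = 20
noise_levels = [0.5, 1.0, 1.5, 2.0, 2.5]
n_runs = 2_000

for noise in noise_levels:
    detected = 0
    for _ in range(n_runs):
        control = rng.normal(0.0, noise, n_per_group)
        treated = rng.normal(true_effect, noise, n_per_group)
        _, p = stats.ttest_ind(treated, control)
        detected += p < 0.05
    print(f"noise sd = {noise:.1f}: effect detected in {detected / n_runs:.0%} of runs")
```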
By controlling both the quality of the data and how it was analyzed, I controlled the output consistency. In other words, I could engineer the system.
AI products should be the result of information engineering, not data science.
The single most important thing to learn from statistics or epistemology is that data is not knowledge. Data is not objective. Data is only as useful as the business problem or scientific question to which it is applied.
In science, we do not get to control both the data and analytics. The data is as good as our methods and experiments allow. The result is a slow, painful process of iterative discovery and learning from falsification.
In business, we control both the data and analytics. We choose what data we collect and how accurately it is collected. We have the option to enforce quality standards on data. We also build the analytic approach. We get to control the quality of the outcome. Data scientists should use this amazing power more.
In short, I have learned that data science is bad for business. Product development processes should not be scientific—science self-corrects over generations. Products should be engineered. Engineering is the application of scientific theories to achieve specific outcomes. Engineering does not need to create new knowledge.
Data is not magic. I suspect information engineering will be a big part of the future of data science. There are tricks to it. I want to write about it.