Data Science: What’s in a Name?
Explore. Hypothesize. Test. Repeat.
That’s what scientists do. We explore the world around us, come up with hypotheses that generalize our observations, and then test those hypotheses through controlled experiments. The positive and negative outcomes of those experiments advance our understanding of reality. The word “scientist” carries a certain cachet, and deservedly so. Scientists have made key discoveries that make our lives better, creating the foundation for advances in technology. Moreover, the scientific method is a harsh taskmaster: it requires that our leaps of faith be falsifiable, and that we determine the truth of our claims through repeatable experiments.
Hence, it’s no surprise that many professions, and even religions, have wrapped themselves in the flag of science. Science, especially “rocket science,” has come to connote anything that requires a high degree of intelligence. That brings us to the trend of “data scientists” in Silicon Valley and beyond. I use scare quotes because the term acts as a Rorschach test: how a person interprets the term often reveals more about the person than about the profession.
Let’s try the definition from Wikipedia, which aspires to present a neutral point of view:
“Data scientists solve complex data problems through employing deep expertise in some scientific discipline. It is generally expected that data scientists are able to work with various elements of mathematics, statistics and computer science, although expertise in these subjects are not required. However, a data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three.”
But Drew Conway points out the challenge with this definition: “the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and where data science fits.” It’s ironic that a profession devoted to rigorous analysis struggles to converge on a precise definition. Not that there’s a lack of debate. You can get a taste of that debate from a popular Quora post on “How would you define data science and data scientists and distinguish it from older related terms?”
I’m not going to attempt to resolve the debate here. There’s clearly a need for people who blend math, computer science, software engineering, and product sense — I should know, since I hired a bunch of them to be data scientists at LinkedIn, and they’ve made key contributions to our products. Are they analysts? Engineers? Product visionaries? The answer is yes to all of these, which is what makes these folks so hard to hire!
I am, however, skeptical of the use of any term to create an elite club of experts. We are what we do. And if we’re going to call ourselves scientists, then the most important thing we can do is follow the scientific method in our quest to understand the world around us and advance the state of technology.
- Explore: The reason that data science emphasizes technical skills is that those skills are essential for performing exploratory data analysis.
- Hypothesize: The point of exploration is to make surprising observations and generalize those observations to generate hypotheses.
- Test: Testing is what makes this process a science. Testing is how we validate hypotheses by subjecting them to cold, harsh reality.
- Repeat: Science is an endless process. And, like software engineering, data science at its best is agile and iterative.
Explore. Hypothesize. Test. Repeat. If you do all of these, then you’ve earned the right to call it whatever you want.