A primer on data science

Described as the "sexiest job of the 21st century" by Harvard Business Review, data science has worked its way into mainstream business over the last decade. Leaders today would be wise to harness this powerful capability and to consider this guidance from McKinsey & Company's 2017 quarterly report:

"While CEOs and other members of the executive team don’t need to be the experts on data science, they must at least become conversant with a jungle of new jargon and buzzwords (Hadoop, genetic algorithms, in-memory analytics, deep learning, and the like) and understand at a high level the limits of the various kinds of algorithmic models. In addition to constant communication from the top that analytics is a priority and public celebration of successes, small signals such as recommending and showing up for learning opportunities also resonate."


What does data science mean?

Data science is defined as:

An interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.

Additionally, data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data.

Why data science matters

The complexity of data science lends itself to rabbit holes, corner cases, and the risk of getting mired in a black hole of minutiae. So let's first agree on why data science actually matters: IT HELPS BUSINESSES MAKE BETTER DECISIONS. That's about it.

Having personally endured dozens of data science briefings, I can say the most valuable outcome is unequivocally a better-informed decision.

Technology has spawned a galaxy of information so big that Google reportedly requires 4-5 million servers just to keep up. Today's executives have, at their fingertips, more data than any team of humans can analyze. Therefore, we must employ new tools and tactics to make sense of the information.

Data science helps you connect the dots, allowing you to see patterns. A regression model might reveal a constellation in the night sky you've never seen before. A classification model might yield an insight as important as the North Star. Valuable knowledge awaits.

Knowledge = Power?

Contrary to popular thought, however, knowledge by itself is NOT power. It is simply potential power. Knowledge + ACTION = power. And it is here that so many brilliant data science endeavors fail to launch.

To address this conundrum, we recommend that your data science/analytics folks form a working relationship with your business operations team. This handshake is crucial because each data science output (a churn prediction score, for example) must be accompanied by an operational rollout plan with an expectation of accountability from management.

Resources to share

If you are considering starting—or furthering—your data science efforts, below are a few resources & tips you may wish to share with your team:

Most teams today are building their data science models in Python
The scikit-learn website has free, open-source tools for efficient data mining
Pandas is arguably the best Python library for working with data frames (here is a tutorial)
You can use ROC curves to evaluate your data science output, e.g. applied to renewal rate
Want to simply explore the many uses of data science? Check out: The Information Diet
For the more advanced readers, you can delve into Google's machine intelligence library: TensorFlow

The DBT Ventures team hopes this information will help your team make sense of the night sky that is today's business canvas.

The importance of statisticians in SaaS

If you're going to explore data science strategies for your SaaS business, you'd be well-served to learn about "ROC curves".

Why?

Because ROC curves assess the quality of data science output. Think of ROC curves as a report card. They help you visualize the quality of the data science deliverable on your desk.

For example, let's say your data science team (or consultants) builds a model to help your sales team identify which prospects are most likely to buy. We'll call it a "Propensity To Buy" score. And since businesses love lingo, we'll call it a "PTB" score. Acronyms, FTW.

Two models walk into a startup

To take a quick step back, data science models typically fall into two camps: 1) regression: trying to predict a continuous outcome or variable, or 2) classification: trying to predict a category, most often a binary (yes/no) outcome. Our fictitious PTB score is therefore a . . . you guessed it, a "classification" model. Nicely done. Now we're getting somewhere.
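To make the two camps concrete, here is a toy Python sketch. Everything in it is an assumption made up for illustration: the six prospects, the `seats` and `quote_usd` fields, and the two deliberately simple models (a straight-line fit for the regression, a single learned cutoff for the classification). A real PTB model would use far more variables and a proper library.

```python
from statistics import mean

# Hypothetical prospects: (seats requested, quote in USD, bought: 1 = yes, 0 = no)
prospects = [
    (5, 1000, 0),
    (10, 2000, 0),
    (20, 4000, 0),
    (30, 6000, 1),
    (40, 8000, 1),
    (50, 10000, 1),
]
seats = [p[0] for p in prospects]
quotes = [p[1] for p in prospects]
bought = [p[2] for p in prospects]

# Camp 1, regression: predict a continuous number (quote size in dollars)
# by fitting a least-squares line through the points.
mean_x, mean_y = mean(seats), mean(quotes)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(seats, quotes)) / sum(
    (x - mean_x) ** 2 for x in seats
)
intercept = mean_y - slope * mean_x


def predict_quote(n_seats):
    """Regression output: a dollar amount."""
    return slope * n_seats + intercept


# Camp 2, classification: predict a label (buys / does not buy)
# by learning the seat cutoff that classifies the most training rows correctly.
def fit_cutoff(xs, labels):
    best_cutoff, best_correct = None, -1
    for candidate in sorted(set(xs)):
        correct = sum((x >= candidate) == bool(y) for x, y in zip(xs, labels))
        if correct > best_correct:
            best_cutoff, best_correct = candidate, correct
    return best_cutoff


cutoff = fit_cutoff(seats, bought)


def predict_buys(n_seats):
    """Classification output: 1 (likely buyer) or 0 (unlikely buyer)."""
    return int(n_seats >= cutoff)


print(round(predict_quote(25)), predict_buys(25), predict_buys(35))
```

Same data, two different questions: the regression answers "how much?" with a number, while the classifier answers "will they or won't they?" with a label.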

But how do you objectively assess the quality of something very smart people produced by ingesting dozens if not hundreds of variables and training sets? The ROC curve. Boom.

We can thank WWII radar engineers for the lengthy name: Receiver Operating Characteristic. But their intent was much simpler: they needed a way to know how much of the good stuff their model captured (true positive rate/TPR) vs. the amount of bad stuff their model also captured (false positive rate/FPR).

For example:

True positive (raises TPR): Radar imaging model captures a Nazi battalion of Panzer IV tanks = nice work
False positive (raises FPR): Radar imaging model captures a herd of very large French cows = needs work
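The tanks-versus-cows bookkeeping can be sketched in a few lines of Python. The detections, their confidence values, and the 0.5 cutoff below are all invented for illustration. The sketch also assumes the data contains at least one real tank and at least one cow, since otherwise the rates would divide by zero.

```python
# Hypothetical radar detections: (model confidence, was it really a tank?)
detections = [
    (0.95, True),
    (0.85, True),
    (0.70, False),  # a very large cow
    (0.60, True),
    (0.40, False),
    (0.30, True),   # a tank the model was unsure about
    (0.20, False),
    (0.10, False),
]


def tpr_fpr(rows, cutoff):
    """Return (TPR, FPR) when everything scoring >= cutoff is called a tank."""
    tp = fp = fn = tn = 0
    for confidence, is_tank in rows:
        flagged = confidence >= cutoff
        if flagged and is_tank:
            tp += 1  # tank correctly flagged
        elif flagged and not is_tank:
            fp += 1  # cow flagged as a tank
        elif not flagged and is_tank:
            fn += 1  # tank missed
        else:
            tn += 1  # cow correctly ignored
    tpr = tp / (tp + fn)  # share of real tanks that were caught
    fpr = fp / (fp + tn)  # share of cows that raised a false alarm
    return tpr, fpr


tpr, fpr = tpr_fpr(detections, cutoff=0.5)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

With this made-up data, a 0.5 cutoff catches three of the four tanks and raises the alarm on one of the four cows. Lowering the cutoff would catch the last tank at the price of more cows, and that trade-off is exactly what a ROC curve charts.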

Same goes for business: of the prospects who really will buy, how many does your model correctly flag (TPR), and of those who won't, how many does it flag by mistake (FPR)? Here's a visual of what ROC curves look like in the wild:
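For readers who would rather see the mechanics than a picture, a ROC curve can be traced by hand: slide the score cutoff from strict to lenient and record the (FPR, TPR) pair at each stop. The sketch below does that for eight hypothetical PTB scores (the scores and outcomes are invented), then sums the area under the curve (AUC) as a single report-card number, where 1.0 is perfect and 0.5 is no better than a coin flip.

```python
# Hypothetical PTB scores and what each prospect actually did (1 = bought)
ptb_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
bought = [1, 1, 0, 1, 0, 1, 0, 0]


def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct cutoff, strictest first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nobody is flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(ptb_scores, bought)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}  TPR = {tpr:.2f}")
print(f"AUC = {area_under_curve(points):.4f}")
```

Plotting those points with FPR on the x-axis and TPR on the y-axis gives the curve. In practice a team would more likely lean on scikit-learn's `roc_curve` and `roc_auc_score` than on hand-rolled loops, but the arithmetic underneath is the same.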

We'll go much deeper into this topic in future posts, but for now we just wanted to make sure the DBT readership is aware of this crucial tool for assessing data science output.
