
Building a churn prediction model with machine learning

The story of leveraging machine learning to predict—and reduce—churn (SaaS terminology for when a customer leaves). Our approach pitted two ML models against each other: XGBoost vs. Random Forest. The latter emerged victorious, and the model output (a CSV file from a Python notebook in Mode) was integrated both technically (into Salesforce) and operationally (via the weekly Red Account meeting) for the Clearbit Customer Success team.
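If you want to try a similar bake-off yourself, here is a minimal sketch, not the actual Clearbit notebook: synthetic data stands in for real account features, and it assumes pandas, scikit-learn, and the xgboost package are installed.

    # Churn bake-off sketch: XGBoost vs. Random Forest on synthetic data.
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    # Stand-in for real account features (usage, seats, support tickets, etc.).
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.9], random_state=42)

    models = {
        "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
        "xgboost": XGBClassifier(n_estimators=300, eval_metric="logloss",
                                 random_state=42),
    }

    # Compare the two models on cross-validated AUC and keep the winner.
    scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
              for name, m in models.items()}
    winner = max(scores, key=scores.get)
    print(scores, "->", winner)

    # Fit the winner and export churn scores as a CSV, ready for a Salesforce
    # upload or a weekly Red Account review.
    best = models[winner].fit(X, y)
    pd.DataFrame({"churn_score": best.predict_proba(X)[:, 1]}) \
        .to_csv("churn_scores.csv", index=False)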

Demystifying data science

A primer on data science

Described as the "sexiest job of the 21st century" by Harvard Business Review, data science has entered the business mainstream over the last decade. Leaders today would be wise to harness this powerful capability and consider the recent guidance from McKinsey & Company's 2017 quarterly report:

"While CEOs and other members of the executive team don’t need to be the experts on data science, they must at least become conversant with a jungle of new jargon and buzzwords (Hadoop, genetic algorithms, in-memory analytics, deep learning, and the like) and understand at a high level the limits of the various kinds of algorithmic models. In addition to constant communication from the top that analytics is a priority and public celebration of successes, small signals such as recommending and showing up for learning opportunities also resonate."


What does data science mean?

Data science is defined as:

An interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.

Additionally, data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data.

Why data science matters

The complexity of data science lends itself to rabbit holes, corner cases, and the risk of getting mired in a black hole of minutiae. So let's first agree on why data science actually matters: IT HELPS BUSINESSES MAKE BETTER DECISIONS. That's about it.

Having personally endured dozens of data science briefings, the most valuable outcome is unequivocally a better informed decision.

Technology has spawned a galaxy of information so big that Google reportedly requires 4-5 million servers just to keep up. Today's executives have, at their fingertips, more data than any team of humans can analyze. Therefore, we must employ new tools and tactics to make sense of the information.

Data science helps you connect the dots, allowing you to see patterns. A regression model might reveal a constellation in the night sky you've never seen before. A classification model might yield an insight as important as the north star. Valuable knowledge awaits.

Knowledge = Power?

Contrary to popular thought, however, knowledge by itself is NOT power. It is simply potential power. Knowledge + ACTION = power. And it is here where so many brilliant data science endeavors fail to launch.

To address this conundrum, it is recommended that your data science/analytics folks form a working relationship with your business operations team. This handshake is crucial because each data science output (e.g. a churn prediction score) must be accompanied by an operational rollout plan, with an expectation of accountability from management.

Resources to share

If you are considering starting—or furthering—your data science efforts, below are a few resources & tips you may wish to share with your team:

The DBT Ventures team hopes this information will help your team make sense of the night sky that is today's business canvas.


3 ways to improve renewal rates

Whether we're talking about churn, retention or renewal rates, the fundamental aim is the same: keep the revenue you already have.

The gravity of this challenge cannot be overstated: if you aren't able to improve your renewal rates to a sustainable level, your business will bleed out and die.

Okay, maybe that was a bit dramatic but you get the point: the stakes are high. Failing to improve renewal rates can result in:

  • Decelerating growth

  • Poor unit economics

  • A battalion of former customers who never found value in your product, and who talk about it

  • Low employee morale

  • Severe fundraising problems

  • Bankruptcy

Over the last 10 years, the proliferation of software solutions, business models, and pricing strategies has resulted in a plethora of new revenue retention metrics, so let's first align on our terminology. I am defining renewal rate as:

Renewal Rate = 100% - Lost ACV/(Lost ACV + Total Renewed ACV)

[ACV: Annual Contract Value]

Let's do an example: you have $5M of ACV up for renewal in Q1-2017 and your team successfully renews $4.6M of it, but alas $400k is lost to the churn monster:

Renewal Rate = 100% - Lost ACV/(Lost ACV + Total Renewed ACV)

Renewal Rate = 100% - $400k/($400k + $4.6M)

Renewal Rate = 100% - $400k/$5M

Renewal Rate = 100% - 8%

Renewal Rate = 92%
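If you'd rather script it, here is the same arithmetic as a throwaway Python helper (our own function, not anything standard):

    def renewal_rate(lost_acv: float, renewed_acv: float) -> float:
        """Renewal Rate = 100% - Lost ACV / (Lost ACV + Total Renewed ACV)."""
        return 100 * (1 - lost_acv / (lost_acv + renewed_acv))

    # The Q1-2017 example: $400k lost, $4.6M renewed.
    print(renewal_rate(400_000, 4_600_000))  # 92.0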

Easy enough, right? Good. Related to this, if you're interested in seeing all the different ways that SaaS companies report these metrics to the Street, I highly recommend bookmarking Pacific Crest's report: Public SaaS Company Disclosure Metrics for Retention and Renewal Rates.

It's become fashionable to report on "net" revenue metrics. "Net" metrics are created by simply combining two numbers, typically expansion and churn. The risk here is that expansion from existing accounts can mask a churn problem. Or as I like to say, "Nets can cover things up."

Therefore, "net" metrics are out of scope for this article as they increase the complexity of something simple: keeping the revenue you already have.

3 Ways to Improve Renewal Rates

  1. "Screen the team." This strategy has to do with the rigor your company applies to screening customer teams during the sales process. Given the hundreds of software options in the crowded cloud, the real battles aren't being fought over technology, but rather program resources. Similar to organizing an effective sports team, your Sales Engineers and Account Executives must evaluate:

    • Does this potential customer have the right players on the field?

    • If not, how do we make a business case for additional resources, whether that's internal headcount or external agency help?

    • Do they have access to developer resources? If so, how many hours per week? How many sprints per release? How many stories per epic? Specificity is key here.

    • Who is going to own the day-to-day adoption and evolution of the program to use your software, i.e. who is the program manager?

    • Does the potential customer have strong executive sponsorship and a desire to make this work?

  2. "ROC your renewals." If you aren't familiar with ROC curves, now is a good time to start. The goal of this strategy is to perform data science analysis that identifies the 1-2 customer attributes and/or behaviors most highly correlated with retention success (e.g. renewal rates), and then to mobilize your entire company—marketing, sales, customer success, design, engineering, everyone—to prioritize and improve those metrics. A minimal sketch of the analysis follows this list.

  3. "Pay the retention piper." Incentives matter. No matter how inspirational your company or product vision is, the behavior of your employees is primarily driven by how they are compensated. If your Account Executives are comp'd 100% on growth, they will spend 100% of their time closing new business (and 0% on retention). If your Customer Success Managers are comp'd 100% on usage metrics, they will spend the vast majority of their time improving usage (and little time on lead generation, customer references, etc.).
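To make strategy #2 concrete, here is a minimal sketch of the attribute screen, assuming a small table of account attributes plus a binary "renewed" flag; the attribute names and numbers are hypothetical.

    # Rank candidate attributes by how well each one, alone, separates
    # renewals from churns, using single-feature AUC.
    import pandas as pd
    from sklearn.metrics import roc_auc_score

    accounts = pd.DataFrame({
        "weekly_active_users": [5, 40, 3, 55, 22, 8, 60, 12],
        "integrations_enabled": [0, 3, 1, 4, 2, 0, 5, 1],
        "support_tickets": [9, 1, 7, 0, 3, 8, 1, 5],
        "renewed": [0, 1, 0, 1, 1, 0, 1, 0],
    })

    for col in accounts.columns.drop("renewed"):
        auc = roc_auc_score(accounts["renewed"], accounts[col])
        # AUC near 1.0 (or near 0.0, for negative correlates like support
        # tickets) means the attribute is predictive; near 0.5 means noise.
        print(f"{col}: AUC = {auc:.2f}")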


S&M vs. R&D: Where does your company stack up?

Perhaps the biggest decision C-level executives make is resource allocation:

  • Should I double down on hiring more Account Executives?

  • Should we invest in hiring more top-tier PMs to guide product development?

  • What are the cost implications of migrating our backend to HBase/AWS?

  • Does sales really need $350k for SPIFFs next quarter?

Therefore, let's create some guide wires with the data that's publicly available to understand how Sales & Marketing costs typically compare to Research & Development costs:

Notice the spike in the S&M-R&D ratio in year two: this is usually the result of the founding pair hiring some sales muscle once they've established product fit, sometimes prematurely. You can use this chart to benchmark your levels of relative investment over time.

But publicly-available 10-Ks and S-1s have a misleading quality to them: they bundle very big things into bigger things. Therefore, I've injected six line items in blue to help you isolate the following costs relative to median revenue:

  1. S&M Investment

  2. Sales-Only Investment

  3. R&D Investment

[Chart: S&M, Sales-only, and R&D investment relative to median revenue]

Based on the data, marketing is typically 19-22% of revenue, so I've unbundled the above numbers to show just the 'S' in S&M. The critical achievement here really starts in year 3-4 when the successful companies start to focus on driving sales efficiency per rep thereby eventually reducing Sales:Revenue from ~50% to 30-35% in year 7-8. 
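If you want to benchmark your own company against these curves, the arithmetic is simple; here's a sketch with hypothetical P&L figures (swap in your own 10-K/S-1 numbers):

    # Hypothetical P&L figures in $M by year; replace with your own data.
    import pandas as pd

    pl = pd.DataFrame({
        "revenue": [10, 25, 60, 120],
        "s_and_m": [6, 18, 35, 55],
        "r_and_d": [4, 8, 15, 26],
    }, index=["Y1", "Y2", "Y3", "Y4"])

    pl["s_and_m_pct_rev"] = 100 * pl["s_and_m"] / pl["revenue"]
    pl["r_and_d_pct_rev"] = 100 * pl["r_and_d"] / pl["revenue"]
    pl["sm_rd_ratio"] = pl["s_and_m"] / pl["r_and_d"]
    print(pl.round(1))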

Superforecasting to the rescue (again)

Welcome back to forecaster training, inspired by Part V of Edge.org's Master Class in Superforecasting. The unique skill of superforecasting resonates deeply with DBT Ventures due, in part, to the immense impact across the four key components of the DBT endeavor: ideas, data science, customer success, and leadership.

This segment draws heavily from Danny's contingent valuation experiments which, if you haven't perused before, are a hygienic read (1,832 academic citations agree).

The contingent valuation experiments reveal the similarity between 3 superficially very different things:

  1. Subjects' judgements of value, i.e. scope sensitivity

  2. Likelihood of an event happening between 2 different time periods

  3. Scenario bias

For example, what is more probable: the first scenario, or the second?


. . . while continuing to manifest a vexing problem: people's judgement of explanations and forecasting accuracy are vulnerable to rich narratives, i.e. attribute substitution. 

We can also fall prey to assigning too much probability to too many possibilities, which violates the axioms of probability to begin with (assign, say, 40% to each of five mutually exclusive scenarios and you've committed to a total of 200%).

Yet scenarios CAN be useful when thinking backward in time. The relationship between counterfactuals and hindsight bias (which we discussed previously) is powerful.

Getting people to imagine counterfactual alternatives to reality is a way of counteracting hindsight bias. Hindsight bias is a difficulty people have remembering past states of ignorance. Counterfactual scenarios can reconnect us to our past states of ignorance. And that can be a useful, humbling exercise. It's good mental hygiene. It's useful for de-biasing.

"One learns from Shakespeare that self-overhearing is the prime function of soliloquy. Hamlet teaches us how to talk to oneself, and not how to talk to others." -Harold Bloom

Get people to listen to themselves think about how they think, i.e. build the capacity to listen to yourself talk to yourself . . . and decide if you like what you hear. A fleeting achievement of consciousness, to be sure, but relevant to superforecasting nonetheless.

So how can superforecasting improve the world? Well, we could use forecasting skills to improve the quality of high-stakes policy debate. Today's political discourse is NOT motivated by pure accuracy goals. Quite the opposite. And political pundits have a myriad of habits/tactics/issues which actively remove accuracy from the conversation:

  • Ego defense

  • Self-promotion

  • Loyalty to a community of co-believers

  • Rhetorical obfuscation

  • Attribute substitution (big one)

  • Functionalist blurring, and—one of the most pervasive—

  • Super (qualified) forecasting

So what should we do? Introduce a superforecasting tournament in order to disrupt "stale status hierarchies" and invite pundits to compete. Boom. Politics solved.


ROC your world

The importance of statisticians in SaaS

If you're going to explore data science strategies for your SaaS business, you'd be well-served to learn about "ROC curves".

Why?

Because ROC curves assess the quality of data science output. Think of ROC curves as a report card. They help you visualize the quality of the data science deliverable on your desk.

For example, let's say your data science team (or consultants) builds a model to help your sales team identify which prospects are most likely to buy. We'll call it a "Propensity To Buy" score. And since businesses love lingo, we'll call it a "PTB" score. Acronyms, FTW.


Two models walk into a startup

To take a quick step back: data science models typically fall into two camps: 1) regression: trying to predict a continuous outcome or variable, or 2) classification: trying to predict a binary outcome. Our fictitious PTB score is therefore a . . . you guessed it, a "classification" model. Nicely done. Now we're getting somewhere.

But how do you objectively assess the quality of something very smart people produced by ingesting dozens if not hundreds of variables and training sets? The ROC curve. Boom.

We can thank WWII radar engineers for the lengthy name: Receiver Operating Characteristic. But their intent was much simpler: they needed a way to know how much of the good stuff their model captured (true positive rate/TPR) vs. the amount of bad stuff their model also captured (false positive rate/FPR).

For example:

  • TPR: Radar imaging model captures a Nazi battalion of Panzer IV tanks = nice work

  • FPR: Radar imaging model captures a herd of very large French cows = needs work

Same goes for business: how many of your prospects are being correctly classified (TPR) vs. incorrectly classified (FPR)?
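To see what ROC curves look like in the wild, you can draw one yourself. Here's a minimal scikit-learn sketch using synthetic data and a stand-in "PTB" model (both illustrative, not a real sales dataset):

    # Compute and plot an ROC curve: TPR vs. FPR across score thresholds.
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A stand-in "PTB" model scoring each prospect's propensity to buy.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    ptb_scores = model.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, ptb_scores)
    plt.plot(fpr, tpr,
             label=f"PTB model (AUC = {roc_auc_score(y_test, ptb_scores):.2f})")
    plt.plot([0, 1], [0, 1], "--", label="coin flip")  # diagonal = random guessing
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()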

We'll get into this topic much deeper in future posts, but for now we just wanted to make sure the DBT readership is aware of this crucial tool for assessing data science output.


Do you have what it takes to be a superforecaster?

Three top traits of superforecasters include:

  1. They tolerate dissonance

  2. They practice "counterfactualizing"

  3. They embrace (unabashedly) rampant hypothesis generation

Want to learn more?

Jump on over to Edge.org to witness a fantastic synthesis of genius minds challenging each other's thinking, but more so unpacking their questions.

Edge Master Class 2015: A Short Course in Superforecasting

About Edge.org: To arrive at the edge of the world's knowledge, seek out the most complex and sophisticated minds, put them in a room together, and have them ask each other the questions they are asking themselves.