Welcome to Part 1 of our Data Science Primer. This bird’s eye view of the machine learning workflow will give you an end-to-end blueprint for data science and applied ML. You’ll learn the “ELI5” intuition behind machine learning, key terminology, and the ingredients to an effective ML model.

You may have already seen some of the tutorials on our site. Tutorials are excellent for getting your feet wet, but to consistently get great results in data science, you must have a systematic approach to solving problems. So that’s what we’ll provide you here!

First, we must clear up one of the biggest misconceptions about machine learning:

Machine Learning ≠ Algorithms

When you open a textbook or a university syllabus, you’ll often be greeted by a grocery list of algorithms. They all have fancy-sounding names, and they usually fill up the entire table of contents.

This has fueled the misconception that mastering machine learning is about memorizing dozens of algorithms:

Machine learning is not just a list of algorithms.

However, that’s not the full picture at all. In practice, applied machine learning is not about the algorithms. It’s much more than that.

Machine learning is a comprehensive approach to solving problems

…and individual algorithms are only one piece of the puzzle. The rest of the puzzle is how you apply them the right way.

What makes machine learning so special? (ELI5)

Machine learning is the practice of teaching computers how to learn patterns from data, often for making decisions or predictions. For true machine learning, the computer must be able to learn patterns that it’s not explicitly programmed to identify.

ELI5: The curious child

Even though it sounds cool and mysterious, machine learning is simply a reflection of how humans learn naturally. Here’s an example of how we “machine learn” when we’re kids:

Ooh a candle!

Imagine a child playing at home, in the living room. Suddenly, he sees a candle for the first time ever! It piques his curiosity and he cautiously waddles over.

  1. Since he doesn’t know better, he sticks his hand over the candle flame.
  2. “Ouch!” he yells, as he yanks his hand back.
  3. “Hmm… that red and bright thing really hurts!”

Two days later, the child is playing in the kitchen. Suddenly, he sees a stove-top for the first time ever! Again, he cautiously waddles over.

  1. He’s curious again, and he’s thinking about sticking his hand over it.
  2. Suddenly, he notices that it’s red and bright!
  3. “Ahh…” he thinks to himself, “not today!”
  4. He remembers that red and bright means pain, and he ignores the stove top.

To be clear, this is only machine learning because the child learned patterns from the candle. He learned the pattern that “red and bright means pain.”

On the other hand, if he ignored the stove-top simply because his parents warned him, that’d be “explicit programming” instead of machine learning.

Key Terminology

When starting out in data science, it’s better to focus on developing practical intuition instead of diving into technicalities (which you can revisit later). Therefore, it’s critical to be clear and concise with our terminology.

Before going any further, let’s just make sure we have a shared language for discussing the machine learning workflow:

  • Model – a set of patterns learned from data.
  • Algorithm – a specific ML process used to train a model.
  • Training data – the dataset from which the algorithm learns the model.
  • Test data – a new dataset for reliably evaluating model performance.
  • Features – variables (columns) in the dataset used to train the model.
  • Target variable – the specific variable you’re trying to predict.
  • Observations – data points (rows) in the dataset.

Example: Primary school students

For example, let’s say you have a dataset of 150 primary school students, and you wish to predict their Height based on their Age, Gender, and Weight…


Here’s how you would describe the problem:

  • You have 150 observations…
  • 1 target variable (Height)…
  • 3 features (Age, Gender, Weight)…
  • You might then separate your dataset into two subsets:
    1. Set of 120 used to train several models (training set)
    2. Set of 30 used to pick the best model (test set)

By the way, we’ll explain why separate training and test sets are super important in Model Training.
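To make the terminology concrete, here’s a minimal sketch of the student example as code. It assumes pandas and scikit-learn are available; all column names and values are synthetic stand-ins for the hypothetical dataset above.

```python
# Sketch of the example above: a made-up table of 150 students,
# the feature/target separation, and the 120/30 train/test split.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
students = pd.DataFrame({
    "Age":    rng.integers(6, 12, size=150),           # feature
    "Gender": rng.choice(["M", "F"], size=150),        # feature
    "Weight": rng.normal(30, 5, size=150).round(1),    # feature
    "Height": rng.normal(130, 10, size=150).round(0),  # target variable
})

X = students[["Age", "Gender", "Weight"]]  # features (model inputs)
y = students["Height"]                     # target (what we predict)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=0  # hold out 30 rows for evaluation
)

print(len(students), len(X_train), len(X_test))  # 150 observations -> 120 train, 30 test
```

Each row of `students` is one observation; the split mirrors the 120/30 breakdown described above.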

Machine Learning Tasks

Academic machine learning usually focuses on analyzing individual algorithms. However, in applied machine learning, you should first pick the right machine learning task for the job.

  • A task is a specific objective for your algorithms.
  • Algorithms can be swapped in and out, as long as you pick the right task.
  • In fact, you should always try multiple algorithms because you most likely won’t know which one will perform best for your dataset.

The two most common categories of tasks are supervised learning and unsupervised learning. (There are other tasks as well, but let’s start with the basics.)

Supervised Learning

Supervised learning includes tasks for “labeled” data (i.e. you have a target variable). In practice, it’s often used as an advanced form of predictive modeling.

For supervised learning, each observation must be labeled with a “correct answer.” Only then can you build a predictive model because you must tell the algorithm what’s “correct” while training it (hence, “supervising” it).

  • Regression is the task for modeling continuous target variables.
  • Classification is the task for modeling categorical (a.k.a. “class”) target variables.

Unsupervised Learning

Unsupervised learning includes tasks for “unlabeled” data (i.e. you do not have a target variable). In practice, it’s often used either as a form of automated data analysis or automated signal extraction.

Unlabeled data has no predetermined “correct answer.” Instead, you’ll allow the algorithm to directly learn patterns from the data (without “supervision”).

  • Clustering is the most common unsupervised learning task, and it’s for finding groups within your data.
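Here’s a minimal clustering sketch, assuming scikit-learn is available. The two “blobs” of points are synthetic; k-means is just the most common choice of clustering algorithm.

```python
# Find groups in unlabeled data with k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))  # one group of points
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))  # another group
X = np.vstack([blob_a, blob_b])  # no target variable, just features

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Each blob should land in its own cluster (the label values may swap).
print(set(labels[:50]), set(labels[50:]))
```

Note that we never told the algorithm a “correct answer”; it discovered the two groups directly from the data.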

Ingredients to Effective Machine Learning

Even though there are different types of ML tasks (and many different algorithms for each), the key ingredients to success are always the same. To consistently build effective ML models that get great real-world results, you’ll need the following three pillars:


#1: A skilled chef (human guidance)

First, even though we are “teaching computers to learn on their own,” human guidance plays a huge role. Data scientists need to make dozens of decisions along the way.

For example, how much data do you need? Are there any fatal flaws in the data? What’s the right ML task for the job? How do you define success? These are all key decisions you’ll need to make as the human “operator.”


#2: Fresh ingredients (clean, relevant data)

The second essential element is the quality of your data. Garbage In = Garbage Out, no matter which algorithms you use. This is something all professional data scientists pick up on very quickly.

That’s why data scientists spend most of their time understanding the data, cleaning it, and engineering new features. It’s not the “sexiest” part of the job, but it’s what will ultimately move the needle the most in terms of model performance.


#3: Don’t overcook it (avoid overfitting)

One of the most dangerous pitfalls in machine learning is overfitting. An overfit model has “memorized” the noise in the training set, instead of learning the true underlying patterns.

An overfit model within a hedge fund can cost millions of dollars in losses. An overfit model within a hospital can cost thousands of lives. For most applications, the stakes won’t be quite that high, but overfitting is still the single largest mistake you must avoid.

In Model Training, we’ll teach you strategies for preventing overfitting by (A) choosing the right algorithms and (B) tuning them correctly. You can also learn more about it by reading about the Bias-Variance Tradeoff.
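Overfitting is easy to demonstrate on synthetic data. This is a hedged sketch, assuming scikit-learn is available: a high-degree polynomial “memorizes” noisy training points, while a simple line captures the true underlying pattern.

```python
# Overfitting demo: simple model vs. an overly flexible one.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# True pattern is a line (y = 2x) plus noise.
X_train = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_train = 2 * X_train.ravel() + rng.normal(0, 0.2, 20)
X_test = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_test = 2 * X_test.ravel() + rng.normal(0, 0.2, 20)

simple = make_pipeline(PolynomialFeatures(1), LinearRegression()).fit(X_train, y_train)
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X_train, y_train)

simple_train = mean_squared_error(y_train, simple.predict(X_train))
overfit_train = mean_squared_error(y_train, overfit.predict(X_train))
overfit_test = mean_squared_error(y_test, overfit.predict(X_test))

# The flexible model looks better on training data...
print("train MSE:", simple_train, overfit_train)
# ...but falls apart on data it hasn't seen.
print("overfit test MSE:", overfit_test)
```

The degree-15 model achieves a near-zero training error by fitting the noise, which is exactly why its test error balloons; that gap is the signature of overfitting.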

The Applied Machine Learning Workflow

With all of the fundamentals and terminology out of the way, it’s time to talk about the machine learning workflow. Remember, as data scientists we want a consistent process for getting great results. That’s where the machine learning workflow comes in.

There are five core steps:

  1. Exploratory Analysis – First, “get to know” the data. This step should be quick, efficient, and decisive.
  2. Data Cleaning – Then, clean your data to avoid many common pitfalls. Better data beats fancier algorithms.
  3. Feature Engineering – Next, help your algorithms “focus” on what’s important by creating new features.
  4. Algorithm Selection – Choose the best, most appropriate algorithms without wasting your time.
  5. Model Training – Finally, train your models. This step is pretty formulaic once you’ve done the first four.

Of course, there are other situational steps as well:

  • Project Scoping – Sometimes you’ll need to roadmap the project and anticipate data needs.
  • Data Wrangling – You may also need to restructure your dataset into a format that algorithms can handle.
  • Preprocessing – Transforming your features first can often improve performance further.
  • Ensembling – You can squeeze out even more performance by combining multiple models.

For beginners, we recommend focusing on the five core steps first. These are the non-negotiable steps to training an effective model using ML. The other ones slot in easily once you understand the core machine learning workflow.

That wraps it up for the Bird’s Eye View of the Machine Learning Workflow. Next, it’s time to learn more about the first core step: Exploratory Analysis!

More About the ML Workflow

Read the rest of our Intro to Data Science here.