Bird's Eye View of the Machine Learning Workflow

Chapter 1 of 7 in the Data Science Primer

Welcome to our 7-part mini-course on data science and applied machine learning!

Over these 7 chapters, our goal is to provide you with an end-to-end blueprint for applied machine learning, while keeping this as actionable and succinct as possible.

With that, let's get started with a bird's eye view of the machine learning workflow.

First things first. One really cool (optional) challenge you can do in the next hour is training your first machine learning model!

That's right, we've put together a complete step-by-step tutorial for training a model that can predict wine quality. Feel free to go check it out at any time.

Now, tutorials like that are excellent for getting your feet wet, but if you want to consistently get great results with machine learning, you must develop a reliable, systematic approach to solving problems.

And that's what we'll tackle throughout the rest of this mini-course.

Machine Learning ≠ Algorithms

First, we must clear up one of the biggest misconceptions about machine learning:

Machine learning is not about algorithms.

When you open a textbook or a university syllabus, you'll often be greeted by a grocery list of algorithms.

This has fueled the misconception that machine learning is about mastering dozens of algorithms. However, it's much more than that...

Machine learning is a comprehensive approach to solving problems...

...and individual algorithms are only one piece of the puzzle. The rest of the puzzle is how you apply them the right way.

Machine learning is not just a list of algorithms.

What makes machine learning so special?

Machine learning is the practice of teaching computers how to learn patterns from data, often for making decisions or predictions.

For true machine learning, the computer must be able to learn patterns that it's not explicitly programmed to identify.

Example: the curious child

A young child is playing at home... And he sees a candle! He cautiously waddles over.

  1. Out of curiosity, he sticks his hand over the candle flame.
  2. "Ouch!" he yells, as he yanks his hand back.
  3. "Hmm... that red and bright thing really hurts!"

Ooh a candle!

Two days later, he's playing in the kitchen... And he sees a stove-top! Again, he cautiously waddles over.

  1. He's curious again, and he's thinking about sticking his hand over it.
  2. Suddenly, he notices that it's red and bright!
  3. "Ahh..." he thinks to himself, "not today!"
  4. He remembers that red and bright means pain, and he ignores the stove-top.

To be clear, it's only machine learning because the child learned patterns from the candle.

  • He learned the pattern that "red and bright means pain."
  • On the other hand, if he ignored the stove-top simply because his parents warned him, that'd be "explicit programming" instead of machine learning.

#thanksmachinelearning
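
To make the contrast concrete, here's a minimal Python sketch of the same story. The brightness scale and the list of experiences are invented for illustration, and the "learning" is just a simple midpoint threshold. It's one possible way to picture the idea, not a real algorithm you'd use in practice.

```python
# Toy illustration: a hard-coded rule vs. a rule learned from examples.
# All numbers below are made up for illustration.

# "Explicit programming": a parent hands the child the rule directly.
def explicit_rule(brightness):
    return brightness > 5  # threshold chosen by a human, not learned


# "Machine learning": the child infers a threshold from past experience.
# Each observation is (brightness, hurt), where hurt is True or False.
experiences = [(1, False), (2, False), (3, False), (8, True), (9, True)]


def learn_threshold(observations):
    """Place the threshold midway between the brightest safe object
    and the dimmest painful one."""
    safe = [b for b, hurt in observations if not hurt]
    painful = [b for b, hurt in observations if hurt]
    return (max(safe) + min(painful)) / 2


threshold = learn_threshold(experiences)  # learned from data


def learned_rule(brightness):
    return brightness > threshold


# The stove-top is a new object the child has never touched.
stove_brightness = 7
print("Learned threshold:", threshold)
print("Avoid the stove-top?", learned_rule(stove_brightness))
```

Notice that nobody typed the threshold into learned_rule. It came from the experiences list, so different experiences would produce a different rule.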

Key Terminology

For this mini-course, we will focus on developing practical intuition instead of diving into technicalities (which we'll save for Chapter 7: Next Steps).

Therefore, it's even more important to be clear and concise with our terminology.

Before going any further, let's just make sure we have a shared language for discussing these topics:

  • Model - a set of patterns learned from data.
  • Algorithm - a specific ML process used to train a model.
  • Training data - the dataset from which the algorithm learns the model.
  • Test data - a new dataset for reliably evaluating model performance.
  • Features - variables (columns) in the dataset used to train the model.
  • Target variable - a specific variable you're trying to predict.
  • Observations - data points (rows) in the dataset.

Example: Primary school students

For example, let's say you have a dataset of 150 primary school students, and you wish to predict their Height based on their Age, Gender, and Weight...

  • You have 150 observations...
  • 1 target variable (Height)...
  • 3 features (Age, Gender, Weight)...
  • You might then separate your dataset into two subsets:
    1. Set of 120 used to train several models (training set)
    2. Set of 30 used to pick the best model (test set)
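
Here's how that terminology might map onto code. This is only a sketch: the student records are randomly generated stand-ins (we don't have the real dataset), and the value ranges and the formula used to fake Height are assumptions made purely so the example runs.

```python
import random

random.seed(0)  # make the synthetic data reproducible

# Synthetic stand-in for the 150-student dataset (all values are invented).
students = []
for _ in range(150):
    age = random.randint(6, 11)                            # feature
    gender = random.choice(["F", "M"])                     # feature
    weight = round(random.uniform(18, 45), 1)              # feature, in kg
    height = round(80 + 6 * age + random.gauss(0, 4), 1)   # target, in cm
    students.append(
        {"Age": age, "Gender": gender, "Weight": weight, "Height": height}
    )

features = ["Age", "Gender", "Weight"]
target = "Height"

# Shuffle before splitting so both subsets resemble the full dataset.
random.shuffle(students)
train_set = students[:120]   # used to train models
test_set = students[120:]    # held out to pick the best model

print(len(students), "observations")
print(len(features), "features:", features)
print("target variable:", target)
print(len(train_set), "training rows,", len(test_set), "test rows")
```

Each dictionary is one observation (a row), and each key is a column: three features plus the target variable.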

By the way, we'll explain why separate training and test sets are super important in Chapter 6: Model Training.

Machine Learning Tasks

Academic machine learning starts with and focuses on individual algorithms. However, in applied machine learning, you should first pick the right machine learning task for the job.

  • A task is a specific objective for your algorithms.
  • Algorithms can be swapped in and out, as long as you pick the right task.
  • In fact, you should always try multiple algorithms because you most likely won't know which one will perform best for your dataset.
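
As a rough sketch of what "swapping algorithms in and out" can look like, the snippet below fixes one task (regression) and one dataset, then tries two different algorithms on it. The data is synthetic, and the two algorithms (a mean baseline and single-feature least squares) were picked here only because they fit in a few lines.

```python
import random

random.seed(1)

# Synthetic regression data: y depends roughly linearly on x (invented values).
data = [(x, 3.0 * x + 10 + random.gauss(0, 2)) for x in range(50)]
random.shuffle(data)
train, test = data[:40], data[40:]


def fit_mean(rows):
    """Algorithm 1: always predict the average target seen in training."""
    avg = sum(y for _, y in rows) / len(rows)
    return lambda x: avg


def fit_line(rows):
    """Algorithm 2: ordinary least squares with a single feature."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_abs_error(model, rows):
    """Average absolute gap between predictions and true targets."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


# Same task (regression), same data, different algorithms.
algorithms = {"mean baseline": fit_mean, "least squares": fit_line}
scores = {}
for name, fit in algorithms.items():
    model = fit(train)
    scores[name] = mean_abs_error(model, test)
    print(f"{name}: test MAE = {scores[name]:.2f}")

best = min(scores, key=scores.get)
print("Best on this dataset:", best)
```

The loop doesn't care which algorithm it's running. As long as each one solves the same task, you can add more candidates to the dictionary and compare them the same way.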

The two most common categories of tasks are supervised learning and unsupervised learning. (There are other tasks as well, but the concepts you'll learn in this course will be widely applicable.)

Supervised Learning

Supervised learning includes tasks for "labeled" data (i.e. you have a target variable).

  • In practice, it's often used as an advanced form of predictive modeling.
  • Each observation must be labeled with a "correct answer."
  • Only then can you build a predictive model because you must tell the algorithm what's "correct" while training it (hence, "supervising" it).
  • Regression is the task for modeling continuous target variables.
  • Classification is the task for modeling categorical (a.k.a. "class") target variables.
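
The difference between regression and classification comes down to the type of target variable. In the sketch below, the same observations are labeled two ways: once with a continuous score and once with a category. The numbers are invented, and 1-nearest-neighbor is used only because it's short enough to write by hand.

```python
# Labeled data: each observation carries a "correct answer" (values invented).
# Feature: hours studied. Two possible targets for the same observations.
hours = [1, 2, 3, 6, 7, 8]
exam_score = [52.0, 55.5, 61.0, 78.0, 83.5, 90.0]          # continuous target
passed = ["fail", "fail", "fail", "pass", "pass", "pass"]  # categorical target


def nearest_neighbor_predict(train_x, train_y, new_x):
    """Return the label of the closest training observation (1-nearest neighbor)."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - new_x))
    return train_y[closest]


new_student_hours = 7.2
# Regression: the prediction is a number.
print("Regression:", nearest_neighbor_predict(hours, exam_score, new_student_hours))
# Classification: the prediction is a class.
print("Classification:", nearest_neighbor_predict(hours, passed, new_student_hours))
```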

Unsupervised Learning

Unsupervised learning includes tasks for "unlabeled" data (i.e. you do not have a target variable).

  • In practice, it's often used either as a form of automated data analysis or automated signal extraction.
  • Unlabeled data has no predetermined "correct answer."
  • You'll allow the algorithm to directly learn patterns from the data (without "supervision").
  • Clustering is the most common unsupervised learning task, and it's for finding groups within your data.
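
For a feel of what clustering does, here's a small sketch of k-means on one-dimensional data. The points and the starting centers are made up, and real projects would normally rely on a library implementation rather than this hand-rolled version.

```python
# Unlabeled data: no target variable, only measurements (values invented).
points = [1.0, 1.5, 2.0, 2.5, 9.0, 9.5, 10.0, 11.0]


def k_means_1d(values, centers, iterations=10):
    """Basic k-means on one-dimensional data, starting from the given centers."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


centers, groups = k_means_1d(points, centers=[0.0, 6.0])
print("Cluster centers:", centers)
print("Groups found:", groups)
```

Nobody told the algorithm which points belong together. It found the two groups directly from the data.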

The 3 Elements of Great Machine Learning

How to consistently build effective models that get great results.

The Blueprint

Our machine learning blueprint is designed around those 3 elements.

There are 5 core steps:

  1. Exploratory Analysis - First, "get to know" the data. This step should be quick, efficient, and decisive.
  2. Data Cleaning - Then, clean your data to avoid many common pitfalls. Better data beats fancier algorithms.
  3. Feature Engineering - Next, help your algorithms "focus" on what's important by creating new features.
  4. Algorithm Selection - Choose the best, most appropriate algorithms without wasting your time.
  5. Model Training - Finally, train your models. This step is pretty formulaic once you've done the first 4.
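
To show how the 5 core steps fit together, here's a toy pass through all of them in plain Python. Everything in it is an assumption made for illustration: the tiny housing dataset, the helper names, and the choice of two simple candidate algorithms. Later chapters cover each step properly.

```python
# Toy pass through the 5 core steps. The data and helper names are invented.
raw_rows = [
    {"sqft": 800, "price": 160.0},
    {"sqft": 1000, "price": 205.0},
    {"sqft": 1000, "price": 205.0},   # exact duplicate
    {"sqft": 1100, "price": None},    # missing target
    {"sqft": 1200, "price": 238.0},
    {"sqft": 1500, "price": 310.0},
    {"sqft": 1800, "price": 355.0},
    {"sqft": 2000, "price": 405.0},
]

# 1. Exploratory analysis: quick summaries to "get to know" the data.
n_missing = sum(1 for r in raw_rows if r["price"] is None)
print(f"{len(raw_rows)} rows, {n_missing} with a missing price")

# 2. Data cleaning: drop duplicates and rows without a target.
cleaned = []
for row in raw_rows:
    if row["price"] is not None and row not in cleaned:
        cleaned.append(dict(row))  # copy so raw_rows stays untouched

# 3. Feature engineering: create a new feature from an existing one.
for row in cleaned:
    row["sqft_thousands"] = row["sqft"] / 1000


# 4. Algorithm selection: shortlist candidates for this regression task.
def fit_mean(rows):
    """Always predict the average training price."""
    avg = sum(r["price"] for r in rows) / len(rows)
    return lambda r: avg


def fit_line(rows):
    """Least squares on the single engineered feature."""
    n = len(rows)
    mx = sum(r["sqft_thousands"] for r in rows) / n
    my = sum(r["price"] for r in rows) / n
    slope = (sum((r["sqft_thousands"] - mx) * (r["price"] - my) for r in rows)
             / sum((r["sqft_thousands"] - mx) ** 2 for r in rows))
    return lambda r: my + slope * (r["sqft_thousands"] - mx)


candidates = {"mean baseline": fit_mean, "least squares": fit_line}

# 5. Model training: fit on the training set, compare on held-out test rows.
# A fixed split keeps this sketch reproducible; usually you'd shuffle first.
test_rows = cleaned[1::3]
train_rows = [r for r in cleaned if r not in test_rows]


def mean_abs_error(model, rows):
    return sum(abs(model(r) - r["price"]) for r in rows) / len(rows)


results = {name: mean_abs_error(fit(train_rows), test_rows)
           for name, fit in candidates.items()}
for name, mae in results.items():
    print(f"{name}: test MAE = {mae:.1f}")
```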

Of course, there are other situational steps as well:

  • Project Scoping - Sometimes, you'll need to roadmap the project and anticipate data needs.
  • Data Wrangling - You may also need to restructure your dataset into a format that algorithms can handle.
  • Preprocessing - Often, transforming your features first can further improve performance.
  • Ensembling - You can squeeze out even more performance by combining multiple models.
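
Two of these situational steps are easy to picture in code. The sketch below shows one common kind of preprocessing (standardizing a feature) and one simple kind of ensembling (averaging predictions). The weights and the three "models" are hypothetical stand-ins, not trained on anything.

```python
from statistics import mean, pstdev

# Preprocessing: rescale a feature to zero mean and unit variance (values invented).
weights_kg = [20.0, 25.0, 30.0, 35.0, 40.0]
mu, sigma = mean(weights_kg), pstdev(weights_kg)
standardized = [(w - mu) / sigma for w in weights_kg]

# Ensembling: average the predictions of several already-trained models.
# These three "models" are hypothetical stand-ins for real trained models.
models = [
    lambda age: 80 + 6.0 * age,
    lambda age: 86 + 5.5 * age,
    lambda age: 78 + 6.2 * age,
]


def ensemble_predict(age):
    """Combine the models by taking the mean of their predictions."""
    return mean(m(age) for m in models)


print("Standardized weights:", [round(z, 2) for z in standardized])
print("Ensemble prediction for age 10:", round(ensemble_predict(10), 2))
```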

However, for this mini-course, we're going to focus on the 5 core steps. The other ones slot in easily once you understand the core workflow.

"Get the fundamentals down and the level of everything you do will rise." ~ Michael Jordan

Key takeaway: Machine learning should not be haphazard and piecemeal. It should be systematic and organized.

Furthermore, even if you forget everything else taught in this course, please remember: "Better data beats fancier algorithms." This insight will serve you well.

Checkpoint Quiz

Here's a quick quiz to make sure you got everything:

  • What are the 5 core steps of the machine learning workflow?
  • When the curious child learned that "red and bright means pain," what did he learn?
    • (A) An algorithm.
    • (B) A pattern.
    • (C) A model.
    • (D) Both (B) and (C).
    • (E) None of the above.
  • In the example of the curious child, what was the training data? What was the test data?
  • In your own words, describe the 3 essential elements of great machine learning.

You can keep track of your answers in the Companion Worksheet, which also has an answer key at the end.
