Can you guess the answer to this riddle?
- If you’ve studied machine learning, you’ve seen this everywhere…
- If you’re a programmer, you’ve done this a thousand times…
- If you’ve practiced any skill, this is already second-nature for you…
Nope, it’s not overdosing on coffee… It’s… iteration!
Yes, iteration as in repeating a set of tasks to achieve a result.
Wait, isn’t that just… the dictionary definition? Well, yes it is. And yes, that’s all we really mean. And no, we’re not about to reveal some mind-blowing secret about it.
But we do hope to frame this simple concept in a way that might be new to you. Our goal is to take you on a tour of several essential concepts in ML, but from a different perspective than the common approach taught in textbooks.
You see, most books focus on the sequential process for machine learning: load data, then preprocess it, then fit models, then make predictions, etc.
This sequential approach is certainly reasonable and helpful to see, but real-world machine learning is rarely so linear. Instead, practical machine learning has a distinct cyclical nature that demands constant iteration, tuning, and improvement.
Therefore, we hope to showcase how the simple technique of iteration is actually very beautiful and profound in the context of machine learning. This post is intended for beginners, but more experienced users can enjoy it as well.
Why make a fuss about iteration?
Iteration is a central concept of machine learning, and it’s vital on many levels. Knowing exactly where this simple concept appears in the ML workflow has many practical benefits:
- You’ll better understand the algorithms you work with.
- You’ll anticipate more realistic timelines for your projects.
- You’ll spot low-hanging fruit for model improvement.
- You’ll find it easier to stay motivated after poor initial results.
- You’ll be able to solve bigger problems with machine learning.
From our experience, seeing the ML workflow from the perspective of iteration can really help beginners see the big picture concepts behind machine learning.
So without further ado, let’s begin our tour of the 5 levels of machine learning iteration.
Table of Contents
- The Model Level: Fitting Parameters
- The Micro Level: Tuning Hyperparameters
- The Macro Level: Solving Your Problem
- The Meta Level: Improving Your Data
- The Human Level: Improving Yourself
The Model Level: Fitting Parameters
The first level where iteration plays a big role is the model level. Any model, whether it be a regression model, a decision tree, or a neural network, is defined by many (sometimes even millions of) model parameters.
For example, a regression model is defined by its feature coefficients, a decision tree is defined by its branch locations, and a neural network is defined by the weights connecting its layers.
But how does the machine learn the right values for all of the model parameters? Here’s where iterative algorithms come into play!
Fitting Parameters with Gradient Descent
One of the shining successes in machine learning is the gradient descent algorithm (and its modified counterpart, stochastic gradient descent).
Gradient descent is an iterative method for finding the minimum of a function. In machine learning, that function is typically the loss (or cost) function. “Loss” is simply some metric that quantifies the cost of wrong predictions.
Gradient descent calculates the loss achieved by a model with a given set of parameters, and then alters those parameters to reduce the loss. It repeats this process until that loss can’t substantially be reduced further.
The final set of parameters that minimizes the loss now defines your fitted model.
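To make that loop concrete, here’s a minimal sketch in plain Python. It assumes a toy model with just two parameters (a slope `w` and an intercept `b`) and mean squared error as the loss. The function name, learning rate, and stopping tolerance are illustrative choices of ours, not part of any library.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, max_iters=5000, tol=1e-12):
    """Fit y ~ w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0  # initialization: the starting set of parameters
    n = len(xs)

    def loss(w, b):
        return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

    previous = loss(w, b)
    current = previous
    for _ in range(max_iters):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Alter the parameters by stepping against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        current = loss(w, b)
        # Stop once the loss can't be reduced substantially any further.
        if previous - current < tol:
            break
        previous = current
    return w, b, current


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b, final_loss = fit_line_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the loop should recover values close to the true slope and intercept. A learning rate that is too large would make the loss grow instead of shrink, so treat `0.05` as something that happens to work here rather than a general recommendation.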
Gradient Descent Intuition
We won’t derive the math behind gradient descent here, but we’ll paint a picture of the intuition:
- Imagine a mountain range with hills and valleys (loss function).
- Each location (parameter set) on the mountain has an altitude (loss).
- Now drop a ball somewhere on a mountain (initialization).
- At any moment, the ball rolls in the steepest downhill direction (the negative gradient).
- It continues to roll (iteration) until it gets stuck in a valley (local minimum).
- Ideally, you want to find the lowest possible valley (global minimum).
- There are clever ways to prevent the ball from being stuck in local minima (e.g. initializing multiple balls, giving it more momentum so it can traverse small hills, etc.)
- Oh yeah, and if the mountain terrain is shaped like a bowl (convex function), then the ball is guaranteed to reach the lowest point.
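The “multiple balls” idea from the list above can be sketched in a few lines. This is only an illustration under our own assumptions: a made-up one-dimensional loss with two valleys, a fixed learning rate, and a fixed number of steps.

```python
def loss(x):
    # A made-up "mountain range" with two valleys of different depth.
    return x**4 - 3 * x**2 + x


def gradient(x):
    # Derivative of the loss above.
    return 4 * x**3 - 6 * x + 1


def roll_downhill(start, learning_rate=0.01, steps=2000):
    """Drop a ball at `start` and let it roll against the gradient."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# Initialize several balls and keep the one that ends up lowest.
starts = [-2.0, -0.5, 0.5, 2.0]
valleys = [roll_downhill(s) for s in starts]
best = min(valleys, key=loss)

for start, valley in zip(starts, valleys):
    print(f"start={start:+.1f} -> x={valley:+.3f}, loss={loss(valley):.3f}")
```

Balls dropped on the right side settle in the shallower valley, while those dropped on the left find the deeper one. Keeping the best of several runs is a cheap way to reduce the risk of getting stuck, though it offers no guarantee of finding the global minimum.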
Here’s a great short video from Andrew Ng further explaining the intuition behind Gradient Descent.
To learn more about the math behind gradient descent, we recommend these resources:
- Lecture video from MIT on Gradient Descent
- Notes on mathematical optimization from scipy-lectures.org
In practice, especially when using existing ML implementations like Scikit-Learn, you won’t need to implement gradient descent from scratch.
The Micro Level: Tuning Hyperparameters
The next level where iteration plays a huge role is at what we named the “micro” level, more commonly known as the general model or model family.
You can think of a model family as a broad category of models with customizable structures. Logistic regressions, decision trees, SVMs, and neural networks are actually all different families of models. Each model family has a set of structural choices you must make before actually fitting the model parameters.
For example, within the logistic regression family, you can build separate models using either L1 or L2 regularization penalties. Within the decision tree family, you can have different models with different structural choices such as the depth of the tree, pruning thresholds, or even the splitting criteria.
These structural choices are called hyperparameters.
Why Hyperparameters are Special
Hyperparameters are “higher-level” parameters that cannot be learned directly from the data using gradient descent or other optimization algorithms. They describe structural information about a model that must be decided before fitting model parameters.
So when people say they are going to “train a logistic regression model,” what they really mean is a two-stage process.
- First, decide hyperparameters for the model family: e.g. Should the model have an L1 or L2 penalty to prevent overfitting?
- Then, fit the model parameters to the data: e.g. What are the model coefficients that minimize the loss function?
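Here’s a small sketch of how the two kinds of values show up in a loss function. To keep it short, we assume a linear model with squared error rather than a real logistic regression, and the function and argument names are hypothetical.

```python
def penalized_loss(weights, rows, penalty, strength=0.1):
    """Mean squared error plus an L1 or L2 penalty on the weights.

    `penalty` and `strength` are hyperparameters: decided before fitting.
    `weights` are the model parameters: learned during fitting.
    """
    errors = [
        (sum(w * x for w, x in zip(weights, features)) - target) ** 2
        for features, target in rows
    ]
    mse = sum(errors) / len(rows)
    if penalty == "l1":
        return mse + strength * sum(abs(w) for w in weights)
    if penalty == "l2":
        return mse + strength * sum(w * w for w in weights)
    raise ValueError("penalty must be 'l1' or 'l2'")


rows = [((1.0, 2.0), 5.0), ((2.0, 0.0), 2.0)]
weights = [1.0, 2.0]  # fits both rows exactly, so the error term is 0
l1_loss = penalized_loss(weights, rows, penalty="l1")
l2_loss = penalized_loss(weights, rows, penalty="l2")
print(l1_loss, l2_loss)
```

The same weights score differently under each penalty, which is why the penalty has to be fixed before an optimizer can search for the best weights.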
We discussed earlier how gradient descent can help perform Step 2. But in order to fit model parameters using gradient descent, the user must first set the hyperparameters for the model family.
So how can we tackle Step 1, finding the best hyperparameters for the model family?
Tuning Hyperparameters with Cross-Validation
Cross-validation is one of those techniques that works in so many scenarios that you’ll almost feel like you’re cheating when you use it.
In this context, cross-validation is an iterative method for evaluating the performance of models built with a given set of hyperparameters. It’s a clever way to reuse your training data by dividing it into parts and cycling through them (pseudocode below).
With cross-validation, you can fit and evaluate models with various sets of hyperparameters using only your training data. That means you can save the test set as a true untainted hold-out set for your final model selection (more on this in the next section).
Here’s a short and sweet video explaining the idea behind the most popular form of cross-validation, k-fold cross-validation.
Cross-Validation Step-by-Step
These are the steps for selecting hyperparameters using 10-fold cross-validation:
1. Split your training data into 10 equal parts, or “folds.”
2. From all sets of hyperparameters you wish to consider, choose a set of hyperparameters.
3. Train your model with that set of hyperparameters on the first 9 folds.
4. Evaluate it on the 10th fold, or the “hold-out” fold.
5. Repeat steps (3) and (4) 10 times with the same set of hyperparameters, each time holding out a different fold.
6. Aggregate the performance across all 10 folds. This is your performance metric for the set of hyperparameters.
7. Repeat steps (2) to (6) for all sets of hyperparameters you wish to consider.
Here’s how that looks in pseudocode:
    all_folds = split_into_k_parts(all_training_data)

    for set_p in hyperparameter_sets:
        model = InstanceFromModelFamily()

        for fold_k in all_folds:
            training_folds = all_folds besides fold_k
            fit model on training_folds using set_p
            fold_k_performance = evaluate model on fold_k

        set_p_performance = average all k fold_k_performances for set_p

    select set from hyperparameter_sets with best set_p_performance
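For readers who want something they can run, here’s one possible translation of that pseudocode into plain Python. It’s a sketch under several assumptions of ours: the “model family” is a one-feature ridge regression with a closed-form fit, the only hyperparameter is the penalty strength, performance is mean squared error (lower is better), and the data is synthetic.

```python
import random


def split_into_k_parts(rows, k):
    """Deal the rows into k folds of (nearly) equal size."""
    return [rows[i::k] for i in range(k)]


def fit_ridge(rows, penalty):
    """Closed-form fit of y ~ w * x with an L2 penalty on w."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + penalty)


def mean_squared_error(w, rows):
    return sum((w * x - y) ** 2 for x, y in rows) / len(rows)


def cross_validate(rows, penalty, k=10):
    """Average hold-out error across k folds for one hyperparameter value."""
    folds = split_into_k_parts(rows, k)
    scores = []
    for i, hold_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        w = fit_ridge(training, penalty)
        scores.append(mean_squared_error(w, hold_out))
    return sum(scores) / k


# Synthetic training data: y = 3x plus noise. The rows are already in
# random order, so we skip shuffling before splitting into folds.
rng = random.Random(0)
training_data = [
    (x, 3.0 * x + rng.gauss(0, 1))
    for x in [rng.uniform(-5, 5) for _ in range(100)]
]

hyperparameter_sets = [0.0, 1.0, 10.0, 100.0, 1000.0]
cv_scores = {p: cross_validate(training_data, p) for p in hyperparameter_sets}
best_penalty = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_penalty)
```

Notice that the test set never appears here. Everything is scored on folds carved out of the training data.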
The Macro Level: Solving Your Problem
Now we’re going to step away from individual models and even model families. We’re going to discuss iteration at the problem-solving level.
Often, the first model you build for a problem won’t be the best possible, even if you tune it perfectly with cross-validation. That’s because fitting model parameters and tuning hyperparameters are only two parts of the entire machine learning problem-solving workflow.
There are several other iterative techniques that you can leverage to find the best performing solution. We consider these next 2 techniques to be low-hanging fruit for improving your predictive performance.
Trying Different Model Families
There’s a concept in machine learning called the No Free Lunch theorem. There are different interpretations of the NFL theorem (not to be confused with the National Football League), but the one we care about states: There is no one model family that works best for every problem.
Depending on a variety of factors such as the type of data, problem domain, sparsity of data, and even the amount of data you’ve collected, different model families will perform better.
Therefore, one of the easiest ways to improve your solution for a given problem is to try several different model families. This level of iteration sits nicely above the previous level.
Here’s how that looks in pseudocode:
    training_data, test_data = randomly_split(all_data)

    list_of_families = logistic regression,
                       decision tree,
                       SVM,
                       neural network, etc...

    for model_family in list_of_families:
        best_model = tuned with cross-validation on training_data

    evaluate best_model from each model_family on test_data
    select final model
Note that the cross-validation step is the same as the one in the previous section. This beautiful form of nested iteration is an effective way of solving problems with machine learning.
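Here’s a runnable sketch of that nested loop in plain Python. The two “families” are tiny stand-ins we made up for illustration (a one-feature ridge regression and a k-nearest-neighbors regressor), not the families listed in the pseudocode, and the data is synthetic.

```python
import random


def fit_ridge(rows, penalty):
    """Family 1: y ~ w * x with an L2 penalty (hyperparameter: penalty)."""
    w = sum(x * y for x, y in rows) / (sum(x * x for x, _ in rows) + penalty)
    return lambda x: w * x


def fit_knn(rows, k):
    """Family 2: average the k nearest training targets (hyperparameter: k)."""
    def predict(x):
        nearest = sorted(rows, key=lambda row: abs(row[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


def mse(predict, rows):
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


def cv_score(fit, hyper, rows, k=5):
    """k-fold cross-validation error for one family and one hyperparameter."""
    folds = [rows[i::k] for i in range(k)]
    total = 0.0
    for i, hold_out in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        total += mse(fit(training, hyper), hold_out)
    return total / k


# Each family comes with the hyperparameter values we want to try.
families = {
    "ridge": (fit_ridge, [0.0, 1.0, 10.0]),
    "knn": (fit_knn, [1, 5, 15]),
}

# Synthetic data with a curved relationship: y = x^2 plus noise.
rng = random.Random(1)
all_data = [
    (x, x * x + rng.gauss(0, 0.5))
    for x in [rng.uniform(-3, 3) for _ in range(200)]
]
rng.shuffle(all_data)
training_data, test_data = all_data[:150], all_data[150:]

test_scores = {}
for name, (fit, grid) in families.items():
    # Inner loop: tune hyperparameters using the training data only.
    best_hyper = min(grid, key=lambda h: cv_score(fit, h, training_data))
    best_model = fit(training_data, best_hyper)
    # Outer loop: compare the tuned models on the untouched test set.
    test_scores[name] = mse(best_model, test_data)

final_family = min(test_scores, key=test_scores.get)
print(test_scores, final_family)
```

Because the synthetic data is curved, the straight-line family struggles no matter how well it’s tuned, while the nearest-neighbors family adapts. On a different dataset the ranking could easily flip, which is the whole point of trying more than one family.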
Ensembling Models
The next way to improve your solution is by combining multiple models into an ensemble. This is a direct extension from the iterative process needed to fit those models.
We’ll save a detailed discussion of ensemble methods for a different post, but a common form of ensembling is simply averaging the predictions from your multiple models. Of course, there are more advanced ways to combine your models, but the iteration needed to fit multiple models is the same.
This combined prediction will often see a small performance increase over any of the individual models. Here’s the pseudocode for building a simple ensemble model:
    training_data, test_data = randomly_split(all_data)

    list_of_families = logistic regression,
                       decision tree,
                       SVM,
                       neural network, etc...

    for model_family in list_of_families:
        best_model = tuned with cross-validation on training_data

    average predictions by best_model from each model_family
    ... profit! (often)
Note how most of that process is exactly the same as the previous technique!
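As a rough illustration of the averaging step, here’s a sketch with two simple models of our own choosing (a least-squares line and a k-nearest-neighbors regressor) on synthetic data. We skip the hyperparameter tuning here to keep the focus on the ensemble itself.

```python
import math
import random


def fit_line(rows):
    """Least-squares line with an intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return lambda x: mean_y + slope * (x - mean_x)


def fit_knn(rows, k=5):
    """Predict the average target of the k nearest training points."""
    def predict(x):
        nearest = sorted(rows, key=lambda row: abs(row[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


def average_ensemble(models):
    """Combine models by averaging their predictions."""
    return lambda x: sum(model(x) for model in models) / len(models)


def mse(predict, rows):
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic data: a trend with a wiggle on top, plus noise.
rng = random.Random(2)
all_data = [
    (x, 2 * x + math.sin(3 * x) + rng.gauss(0, 0.3))
    for x in [rng.uniform(-3, 3) for _ in range(200)]
]
training_data, test_data = all_data[:150], all_data[150:]

line = fit_line(training_data)
knn = fit_knn(training_data)
ensemble = average_ensemble([line, knn])

for name, model in [("line", line), ("knn", knn), ("ensemble", ensemble)]:
    print(name, round(mse(model, test_data), 3))
```

With squared error, an averaged prediction can never score worse than the average of the individual models’ errors. Whether it beats the single best model depends on the data, so it’s worth checking rather than assuming.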
Practically, that means you can easily double-up on these two techniques. First, you can build the best model from a variety of different model families. Then you can ensemble them. Finally, you can evaluate the individual models and the ensemble model on the same test set.
As a final word of caution: You should always keep an untainted test set to select your final model. We recommend splitting your data into train and test sets at the very beginning of your modeling process. Don’t touch the test set until the very end.
The Meta Level: Improving Your Data
Better data beats better algorithms. That doesn’t always mean more data beats better algorithms. Yes, better data often implies more data, but it also implies cleaner data, more relevant data, and better features engineered from the data.
Improving your data is also an iterative process. As you tackle larger challenges with machine learning, you’ll realize that it’s pretty damn hard to get your data completely right from the start.
Maybe there’s some key feature that you didn’t think to collect. Maybe you didn’t collect enough data. Maybe you misunderstood one of the columns in the dataset and need to circle back with a colleague to have it explained.
A great machine learning practitioner always keeps an open mind toward continuously improving the dataset.
Collecting Better Data
The ability to collect better data is a skill that develops with time, experience, and more domain expertise. For example, if you’re building a real estate pricing model, you should collect every bit of information about the house itself, the nearby neighborhood, and even past property tax payments that are publicly available.
Another element of better data is the overall cleanliness of the data. That means having less missing data, lower measurement error, and doing your best to replace proxy metrics with primary metrics.
Here are a few questions to ask yourself that can spark ideas for improving your dataset:
- Are you collecting all the features that you need?
- Can you clean the data better?
- Can you reduce measurement error?
- Are there outliers that can be removed?
- Is it cheap to collect more data?
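As one small example of a cleaning pass, the sketch below drops missing values and then filters extreme values using a modified z-score based on the median. The function name, the sample values, and the 3.5 cutoff are our own assumptions. Whether an extreme value is truly an error or a legitimate observation is a judgment call that depends on your domain.

```python
import statistics


def drop_missing_and_outliers(values, threshold=3.5):
    """Remove None entries, then values far from the median.

    Uses the modified z-score: 0.6745 * |v - median| / MAD, where MAD is
    the median absolute deviation from the median.
    """
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    mad = statistics.median([abs(v - median) for v in present])
    if mad == 0:
        return present  # no spread to measure against, so keep everything
    return [v for v in present if 0.6745 * abs(v - median) / mad <= threshold]


# Hypothetical measurements with one missing value and one suspicious entry.
measurements = [200, 210, None, 195, 205, 5000, 198]
cleaned = drop_missing_and_outliers(measurements)
print(cleaned)
```

It’s a good habit to look at what was removed before throwing it away, since an “outlier” is sometimes the most interesting row in the dataset.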
Engineering Better Features
Feature engineering, or creating new features from the data by leveraging domain knowledge, is one of the most valuable activities you can do to improve your models.
It’s often difficult and time-consuming, but it’s considered essential in applied machine learning. Therefore, as a machine learning practitioner, it is your duty to continue learning about your chosen domain.
That’s because as you learn more about the domain, you’ll develop better intuition around the types of features that are most impactful. You should treat this as an iterative process that improves alongside your growth in personal expertise.
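To give a flavor of what this looks like in code, here’s a sketch for the real estate example mentioned earlier. Every field name and derived feature is hypothetical. Which features actually help is something you’d discover by iterating.

```python
def engineer_features(house):
    """Return a copy of the record with a few derived features added."""
    features = dict(house)
    # Age at the time of sale is often more meaningful than the build year.
    features["age_years"] = house["sale_year"] - house["year_built"]
    # A rough measure of how spacious the rooms are.
    features["sqft_per_bedroom"] = house["sqft"] / max(house["bedrooms"], 1)
    # Turn a count into a simple yes/no indicator.
    features["has_garage"] = int(house["garage_spaces"] > 0)
    return features


house = {
    "sqft": 1800,
    "bedrooms": 3,
    "garage_spaces": 0,
    "year_built": 1990,
    "sale_year": 2020,
}
print(engineer_features(house))
```

Each new idea about the domain becomes another line in a function like this, followed by another round of model fitting to see whether it helped.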
The Human Level: Improving Yourself
Now we’ve arrived at the most important level of iteration in machine learning: the human level. Even if you forget everything else from this post, we hope you take away the lesson from this section.
Here’s the truth: machine learning and data science are big and hairy topics. Especially if you’re a beginner, you may be feeling overwhelmed with all there is to learn. There are so many moving pieces, and new developments are happening every day.
And you know what? Parts of ML are still very tough and confusing for us. But that’s OK, because we strongly believe the most important level of iteration is at the human level, the machine learning practitioner.
So we want to conclude this lengthy post with a few parting suggestions. We hope that this last section can help you keep things in perspective and feel less overwhelmed by the information overload in this field.
#1. Never stop learning.
As you can see, iteration is built into every layer of the machine learning process. Your personal skills are no exception. Machine learning is a deep and rich field, and everything will become easier the more you practice.
#2. Don’t expect perfection from the start.
You don’t need to win your very first Kaggle competition. And it’s fine if you build a model and find out it completely sucks. The most valuable treasure is your personal growth and improvement, and that should be your main focus.
#3. It’s OK to not know everything.
In fact, it’s almost impossible to know everything about ML. The key is to build a foundation that will help you pick up new algorithms and techniques as you need them. And you guessed it… understanding iteration is part of that foundation.
#4. Try everything at least twice.
Struggling with an algorithm or task? Spending much longer than you thought it would take? No problem, just remember to try it at least one more time. Everything is easier and faster on the second try, and this is the best way to see your progress.
#5. Cycle between theory, practice, and projects.
We believe the most effective way to learn machine learning is by cycling between theory, targeted practice, and larger projects. This is the fastest way to master the theory while developing practical, real-world skills. You can learn more about this approach from our free guide: How to Learn Machine Learning, The Self-Starter Way
Summary of Iteration in Machine Learning
Iteration is a simple concept, yet beautiful in its application. It glues machine learning together on every level.
    Human Level: Repeatedly practice to improve your skills.
    Meta Level:  Continue to improve your data and features.
    Macro Level: Explore different model families and ensembles.
    Micro Level: Cross-validation to tune model hyperparameters.
    Model Level: Gradient descent to fit model parameters.