Algorithm Selection for Machine Learning

Welcome to Part 5 of our Data Science Primer. Choosing the right ML algorithm for your task can be overwhelming. There are dozens of options, each with its own advantages and disadvantages. However, rather than bombarding you with every option, we’re going to jump straight to best practices.

Specifically, we’ll introduce two powerful mechanisms in modern algorithms: regularization and ensembles. As you’ll see, these mechanisms “fix” some fatal flaws in older methods, which has led to their popularity. Let’s get started!

How to Pick ML Algorithms

In this lesson, we’ll introduce five effective machine learning algorithms for regression tasks. They each have classification counterparts as well.

And yes, just five for now. Instead of giving you a long list of algorithms, our goal is to explain a few essential concepts (e.g. regularization, ensembling, automatic feature selection) that will teach you why some algorithms tend to perform better than others.

In applied machine learning, individual algorithms should be swapped in and out depending on which performs best for the problem and the dataset. Therefore, we will focus on intuition and practical benefits over math and theory.

Why Linear Regression is Flawed

To introduce the reasoning for some of the advanced algorithms, let’s start by discussing basic linear regression. Linear regression models are very common, yet deeply flawed.

Simple linear regression models fit a “straight line” (technically a hyperplane depending on the number of features, but it’s the same idea). In practice, they rarely perform well. We actually recommend skipping them for most machine learning problems.

Their main advantage is that they are easy to interpret and understand. However, our goal is not to study the data and write a research report. Our goal is to build a model that can make accurate predictions.

In this regard, simple linear regression suffers from two major flaws:

It’s prone to overfitting when there are many input features.
It cannot easily express non-linear relationships.

Let’s take a look at how we can address the first flaw.

Regularization in Machine Learning

This is the first “advanced” tactic for improving model performance. Many ML courses treat it as an advanced topic, but it’s actually easy to understand and implement.

The first flaw of linear models is that they are prone to overfitting when there are many input features.

The number of features is too damn high!

Let’s take an extreme example to illustrate why this happens:

Let’s say you have 100 observations in your training dataset.
Let’s say you also have 100 features.
If you fit a linear regression model with all of those 100 features, you can perfectly “memorize” the training set.

With as many coefficients as observations, the model has enough freedom to fit every training point exactly, as if each coefficient simply memorized one observation. This model would have perfect accuracy on the training data, but perform poorly on unseen data. It hasn’t learned the true underlying patterns; it has only memorized the noise in the training data.
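
To make the extreme example concrete, here is a minimal sketch in Python with NumPy (our assumption; this lesson does not prescribe a library). The features and the target are random noise generated purely for illustration, so there is no real pattern to learn. Even so, an unpenalized least-squares fit with 100 coefficients reproduces all 100 training targets almost exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# 100 observations, 100 features, and a target that is pure noise.
X_train = rng.normal(size=(100, 100))
y_train = rng.normal(size=100)

# Ordinary least squares with no penalty (and no intercept, for simplicity).
coefs, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Fresh data drawn the same way: still no relationship between X and y.
X_test = rng.normal(size=(100, 100))
y_test = rng.normal(size=100)

train_mse = float(np.mean((X_train @ coefs - y_train) ** 2))
test_mse = float(np.mean((X_test @ coefs - y_test) ** 2))

print(f"train MSE: {train_mse:.2e}")  # essentially zero: the noise was memorized
print(f"test MSE:  {test_mse:.2e}")   # far larger: nothing real was learned
```

The near-zero training error here says nothing about predictive power; the gap between the two numbers is the overfitting.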

Regularization is a technique used to prevent overfitting by artificially penalizing model coefficients.

It can discourage large coefficients (by dampening them).
It can also remove features entirely (by setting their coefficients to 0).
The “strength” of the penalty is tunable. (More on this in the Model Training guide)

Regularized Regression Algos

There are 3 common types of regularized linear regression algorithms.

Lasso Regression

Lasso, or LASSO, stands for Least Absolute Shrinkage and Selection Operator. Lasso regression penalizes the absolute size of coefficients. Practically, this leads to coefficients that can be exactly 0.

Thus, Lasso offers automatic feature selection because it can completely remove some features. Remember, the “strength” of the penalty should be tuned. A stronger penalty leads to more coefficients pushed to zero.
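
The sketch below illustrates where those exact zeros come from. It assumes one common way of writing the Lasso objective, (1 / 2n) * squared error + alpha * sum of absolute coefficients, and solves it with a simple coordinate descent loop. The function names, the synthetic data, and the two `alpha` values are all hypothetical choices for illustration, not a reference implementation.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward 0 and snap it to exactly 0 if it is within threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 one coefficient at a time."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            z = X[:, j] @ X[:, j] / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])  # only two features matter
y = X @ true_w + rng.normal(scale=0.5, size=200)

weak = lasso_coordinate_descent(X, y, alpha=0.01)
strong = lasso_coordinate_descent(X, y, alpha=0.5)

print("weak penalty:  ", np.round(weak, 3))
print("strong penalty:", np.round(strong, 3))  # irrelevant features sit at exactly 0
```

The soft-thresholding step is what removes features: any coefficient whose pull from the data is weaker than `alpha` is set to exactly 0 rather than merely made small.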

Ridge Regression

Ridge stands for Really Intense Dangerous Grapefruit Eating (just kidding… it’s just ridge). Ridge regression penalizes the squared size of coefficients. Practically, this leads to smaller coefficients, but it doesn’t force them to 0.

In other words, Ridge offers feature shrinkage. Again, the “strength” of the penalty should be tuned. A stronger penalty leads to coefficients pushed closer to zero.
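
As a rough illustration of shrinkage, the sketch below uses the closed-form Ridge solution, w = (XᵀX + alpha·I)⁻¹ Xᵀy, on synthetic data. The helper name `ridge_fit`, the data, and the `alpha` values are assumptions made for this example, and the intercept is left out for brevity.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients: solve (X'X + alpha * I) w = X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(size=50)

norms = []
for alpha in (0.0, 10.0, 1000.0):
    w = ridge_fit(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:>6}: coefs={np.round(w, 3)}  norm={norms[-1]:.3f}")
```

As `alpha` grows, the overall size of the coefficient vector shrinks toward zero, but individual coefficients generally stay non-zero, which is the contrast with Lasso.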

Elastic-Net

Elastic-Net is a compromise between Lasso and Ridge. Elastic-Net penalizes a mix of both absolute and squared size. The ratio of the two penalty types should be tuned. The overall strength should also be tuned.
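
One way to picture the two tuning knobs is as a single penalty function. The sketch below is an illustrative parametrization (the names `strength` and `l1_ratio` are our own, and libraries may scale the two terms differently): `l1_ratio` sets the mix, and `strength` scales the whole penalty.

```python
def elastic_net_penalty(coefs, strength, l1_ratio):
    """Mix of the Lasso (absolute) and Ridge (squared) penalties.

    l1_ratio=1.0 gives a pure Lasso penalty, l1_ratio=0.0 a pure Ridge penalty.
    """
    absolute_size = sum(abs(c) for c in coefs)
    squared_size = sum(c * c for c in coefs)
    return strength * (l1_ratio * absolute_size + (1.0 - l1_ratio) * squared_size)

coefs = [3.0, -4.0, 0.0]
for l1_ratio in (0.0, 0.5, 1.0):
    penalty = elastic_net_penalty(coefs, strength=0.1, l1_ratio=l1_ratio)
    print(f"l1_ratio={l1_ratio}: penalty={penalty:.3f}")
```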

Oh and in case you’re wondering, there’s no “best” type of penalty. It really depends on the dataset and the problem. We recommend trying different algorithms that use a range of penalty strengths as part of the tuning process.
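
Trying a range of penalty strengths can be as simple as a loop scored on held-out data. The sketch below does this for a closed-form Ridge fit on synthetic data; the split sizes, the `alphas` grid, and the helper `ridge_fit` are hypothetical choices for illustration, and a real project would typically use cross-validation instead of a single hold-out split.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 30))                      # many features, few rows
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=80)

# Hold out the last 30 rows to score each penalty strength.
X_train, y_train = X[:50], y[:50]
X_val, y_val = X[50:], y[50:]

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_mse = {}
for alpha in alphas:
    coefs = ridge_fit(X_train, y_train, alpha)
    val_mse[alpha] = float(np.mean((X_val @ coefs - y_val) ** 2))
    print(f"alpha={alpha:>6}: validation MSE={val_mse[alpha]:.3f}")

best_alpha = min(val_mse, key=val_mse.get)         # lowest held-out error wins
print("best alpha:", best_alpha)
```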

Decision Tree Algos

Awesome, we’ve just seen 3 algorithms that can protect linear regression from overfitting. But if you remember, linear regression suffers from two main flaws:

It’s prone to overfitting when there are many input features.
It cannot easily express non-linear relationships.

How can we address the second flaw?

Non-linear relationships require a different strategy.

To model non-linear relationships, we need to move away from linear models. We need to bring in a new category of algorithms called decision trees.

Decision trees model data as a “tree” of hierarchical branches. They make branches until they reach “leaves” that represent predictions.

Due to their branching structure, decision trees can easily model nonlinear relationships.

For example, let’s say for Single Family homes, larger lots command higher prices.
However, let’s say for Apartments, smaller lots command higher prices (i.e. it’s a proxy for urban / rural).
This reversal of correlation is difficult for linear models to capture unless you explicitly add an interaction term (which requires anticipating it ahead of time).
On the other hand, decision trees can capture this relationship naturally.
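
The housing example above can be sketched with a few lines of NumPy. The prices and lot sizes are made-up numbers chosen only to create the reversal, and `build_tree` is a deliberately tiny, hypothetical regression tree rather than a production implementation. The sketch compares a linear model without an interaction term, the same model with one, and a shallow tree that is never told about the interaction.

```python
import numpy as np

def build_tree(X, y, max_depth, depth=0):
    """Greedy regression tree: pick the split with the lowest squared error."""
    if depth == max_depth or np.all(y == y[0]):
        return float(y.mean())                      # leaf: predict the mean
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:           # candidate thresholds
            left = X[:, j] <= t
            sse = y[left].var() * left.sum() + y[~left].var() * (~left).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:                                # no feature can be split
        return float(y.mean())
    _, j, t = best
    left = X[:, j] <= t
    return (j, t, build_tree(X[left], y[left], max_depth, depth + 1),
            build_tree(X[~left], y[~left], max_depth, depth + 1))

def predict(tree, x):
    while isinstance(tree, tuple):                  # walk down to a leaf
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

# Made-up data: price rises with lot size for houses, falls for apartments.
lot = np.tile(np.arange(1.0, 11.0), 2)
is_apartment = np.repeat([0.0, 1.0], 10)
price = np.where(is_apartment == 1.0, 300.0 - 10.0 * lot, 100.0 + 10.0 * lot)
X = np.column_stack([lot, is_apartment])

def linear_mse(design):
    w, *_ = np.linalg.lstsq(design, price, rcond=None)
    return float(np.mean((design @ w - price) ** 2))

plain = np.column_stack([np.ones(20), lot, is_apartment])
with_interaction = np.column_stack([plain, lot * is_apartment])

tree = build_tree(X, price, max_depth=3)
tree_mse = float(np.mean((np.array([predict(tree, x) for x in X]) - price) ** 2))

print("linear, no interaction:  ", round(linear_mse(plain), 2))
print("linear, with interaction:", round(linear_mse(with_interaction), 2))
print("depth-3 tree:            ", round(tree_mse, 2))
```

The tree first splits on property type and then on lot size within each branch, so it picks up the opposite trends on its own, at the cost of approximating each trend with a few flat steps.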

Unfortunately, decision trees suffer from a major flaw as well. If you allow them to grow limitlessly, they can completely “memorize” the training data, simply by creating more and more branches.

As a result, individual unconstrained decision trees are very prone to overfitting.

So, how can we take advantage of the flexibility of decision trees while preventing them from overfitting the training data?

Tree Ensembles

Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are “bagging” and “boosting.”

Bagging attempts to reduce the chance of overfitting complex models. It trains a large number of “strong” learners in parallel (a strong learner is a model that’s relatively unconstrained). Bagging then combines all the strong learners together in order to “smooth out” their predictions.

Boosting attempts to improve the predictive flexibility of simple models. It trains a large number of “weak” learners in sequence (a weak learner is a constrained model, e.g. limiting the max depth of each decision tree). Each one in the sequence focuses on learning from the mistakes of the one before it. Boosting then combines all the weak learners into a single strong learner.

While bagging and boosting are both ensemble methods, they approach the problem from opposite directions:

Bagging uses complex base models and tries to “smooth out” their predictions.
Boosting uses simple base models and tries to “boost” their aggregate complexity.

Ensembling is a general term, but when the base models are decision trees, the resulting ensembles have special names: random forests and boosted trees!

Random forests

Random forests train a large number of “strong” decision trees and combine their predictions through bagging. In addition, there are two sources of “randomness” for random forests:

At each split, a tree is only allowed to choose from a random subset of features (leading to feature selection).
Each tree is only trained on a random sample of observations, typically drawn with replacement (a process called resampling).
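
Both sources of randomness fit in a short sketch. The code below is a simplified, hypothetical random forest for regression written with NumPy: `grow_tree` grows an unconstrained tree that sees a random subset of features at every split, and each tree in the forest is trained on rows sampled with replacement. The synthetic data, the 20 trees, and the feature-subset size of 2 are arbitrary choices for illustration.

```python
import numpy as np

def grow_tree(X, y, rng, n_split_features):
    """Grow an unconstrained regression tree.

    Every split may only choose among a random subset of the features.
    """
    if np.all(y == y[0]):
        return float(y[0])                          # pure leaf
    features = rng.choice(X.shape[1], size=n_split_features, replace=False)
    best = None
    for j in features:
        for t in np.unique(X[:, j])[:-1]:           # candidate thresholds
            left = X[:, j] <= t
            sse = y[left].var() * left.sum() + y[~left].var() * (~left).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:                                # chosen features are constant
        return float(y.mean())
    _, j, t = best
    left = X[:, j] <= t
    return (j, t, grow_tree(X[left], y[left], rng, n_split_features),
            grow_tree(X[~left], y[~left], rng, n_split_features))

def predict_tree(tree, x):
    while isinstance(tree, tuple):                  # walk down to a leaf
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

def predict_forest(forest, X):
    """Bagging step: average the predictions of all trees."""
    return np.array([np.mean([predict_tree(tree, x) for tree in forest]) for x in X])

def signal(X):                                      # made-up "true" pattern
    return np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(120, 4))     # features 2 and 3 are irrelevant
y_train = signal(X_train) + rng.normal(scale=0.5, size=120)
X_test = rng.uniform(-1.0, 1.0, size=(200, 4))
y_test = signal(X_test)                             # noise-free targets for scoring

# One unconstrained tree that sees every feature and every row.
single_tree = grow_tree(X_train, y_train, rng, n_split_features=4)

forest = []
for _ in range(20):
    rows = rng.integers(0, len(y_train), size=len(y_train))  # sample with replacement
    forest.append(grow_tree(X_train[rows], y_train[rows], rng, n_split_features=2))

tree_train_mse = float(np.mean((predict_forest([single_tree], X_train) - y_train) ** 2))
tree_test_mse = float(np.mean((predict_forest([single_tree], X_test) - y_test) ** 2))
forest_test_mse = float(np.mean((predict_forest(forest, X_test) - y_test) ** 2))

print(f"single tree, train MSE: {tree_train_mse:.3f}")   # memorized training set
print(f"single tree, test MSE:  {tree_test_mse:.3f}")
print(f"forest, test MSE:       {forest_test_mse:.3f}")
```

The single tree fits its training data perfectly because it keeps splitting until every leaf is pure. Averaging many such trees, each grown on slightly different data and features, is what smooths out the noise that any one tree memorizes.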

In practice, random forests tend to perform very well right out of the box. They often beat many other models that take weeks to develop. They don’t have many complicated parameters to tune, making them the perfect “Swiss Army knife” algorithm that almost always gets good results.

Boosted trees

Boosted trees train a sequence of “weak”, constrained decision trees and combine their predictions through boosting.

Each tree is allowed a maximum depth, which should be tuned.
Each tree in the sequence tries to correct the prediction errors of the one before it.
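
The “correct the errors of the one before it” idea can be sketched with the weakest possible tree, a depth-1 “stump”. The code below is a simplified gradient-boosting loop for squared error on a single made-up feature: each new stump is fit to the current residuals, and its prediction is added with a small `learning_rate`. The function names, the 50 rounds, and the learning rate of 0.3 are assumptions for illustration only.

```python
import numpy as np

def fit_stump(x, residual):
    """Depth-1 tree: the single threshold split that best fits the residuals."""
    best = None
    for t in np.unique(x)[:-1]:                     # candidate thresholds
        left = x <= t
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        sse = ((residual - np.where(left, left_value, right_value)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def fit_boosted_stumps(x, y, n_rounds, learning_rate):
    base = float(y.mean())                          # start from a constant guess
    prediction = np.full(len(y), base)
    stumps, train_mse = [], []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)        # learn from current mistakes
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        train_mse.append(float(np.mean((y - prediction) ** 2)))
    return base, stumps, train_mse

rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=80)      # a curve no single stump can fit

base, stumps, train_mse = fit_boosted_stumps(x, y, n_rounds=50, learning_rate=0.3)
print(f"training MSE after 1 stump:   {train_mse[0]:.4f}")
print(f"training MSE after 50 stumps: {train_mse[-1]:.4f}")
```

Each stump on its own is a crude two-step function, but the running sum of many small corrections traces the curve closely. The number of rounds, the learning rate, and the tree depth all interact, which is part of why boosted trees take more tuning than random forests.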

In practice, boosted trees tend to have the highest performance ceilings. They often beat many other types of models after proper tuning, but they are more complicated to tune than random forests.

Key takeaway: The most effective algorithms typically offer a combination of regularization, automatic feature selection, ability to express nonlinear relationships, and/or ensembling. Those algorithms include:

Lasso regression
Ridge regression
Elastic-Net
Random forest
Boosted tree

That wraps it up for the Algorithm Selection step of the Machine Learning Workflow. Now it’s time to train our models in the next core step: Model Training!
