Feature Engineering for Machine Learning

Welcome to Part 4 of our Data Science Primer. In this guide, we’ll see how we can perform feature engineering to help out our algorithms and improve model performance. Remember, out of all the core steps in applied machine learning, data scientists usually spend the most time on feature engineering.

Feature Engineering is the Biggest Factor of a Successful ML Model

What is Feature Engineering?

Feature engineering is about creating new input features from your existing ones. In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

All data scientists should master the process of engineering new features, for three big reasons:

You can isolate and highlight key information, which helps your algorithms “focus” on what’s important.
You can bring in your own domain expertise.
Most importantly, once you understand the “vocabulary” of feature engineering, you can bring in other people’s domain expertise!

In this guide, we will introduce several heuristics to help spark new ideas. Of course, this will not be an exhaustive compendium of all feature engineering, for which there are limitless possibilities. The good news is that this skill will naturally improve as you gain more experience.

Infuse Domain Knowledge

You can often engineer informative features by tapping into your (or others’) expertise about the domain. Try to think of specific information you might want to isolate or put the focus on. Here, you have a lot of “creative freedom” and “skill expression” as a data scientist.

For example, let’s say you’re working on a US real-estate model, using a dataset of historical prices going back to the 2000s. For this scenario, it’s important to remember that the subprime mortgage crisis occurred within that timeframe:

Domain Knowledge and Feature Engineering — Zillow Home Values (2007 – 2017)

If you suspect that prices would be affected, you could create an indicator variable for transactions during that period. Indicator variables are binary variables that can be either 0 or 1. They “indicate” if an observation meets a certain condition, and they are very useful for isolating key properties.
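To make this concrete, here’s a minimal sketch of an indicator variable in pandas. The column names ('tx_year', 'price'), the sample values, and the 2007 to 2012 window are all assumptions for illustration; you’d pick the window based on your own read of the market.

```python
import pandas as pd

# Hypothetical transactions; 'tx_year' and 'price' are illustrative names.
df = pd.DataFrame({
    "tx_year": [2005, 2008, 2011, 2014, 2016],
    "price": [310000, 265000, 240000, 295000, 330000],
})

# Assumed window for the crisis period; adjust to your own domain judgment.
CRISIS_START, CRISIS_END = 2007, 2012

# Series.between is inclusive on both ends by default.
# Casting the boolean result to int gives the 0/1 indicator.
df["during_crisis"] = (
    df["tx_year"].between(CRISIS_START, CRISIS_END).astype(int)
)

print(df)
```

The new 'during_crisis' column is 1 for transactions inside the window and 0 otherwise, so the model can treat that period differently without you having to drop any rows.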

As you might suspect, “domain knowledge” is very broad and open-ended. At some point, you’ll get stuck or exhaust your ideas. That’s where these next few steps come in. These are a few specific heuristics that can help spark more ideas.

Create Interaction Features

The first of these heuristics is checking to see if you can create any interaction features that make sense. These are combinations of two or more features.

By the way, in some contexts, “interaction terms” must be products between two variables. In our context, interaction features can be products, sums, or differences between two features.

A general tip is to look at each pair of features and ask yourself, “could I combine this information in any way that might be even more useful?”

Example (real-estate)

We know that quality and quantity of nearby schools will affect housing prices. So how can we ensure our ML model picks up on this?

Let’s say we already have a feature in the dataset called ‘num_schools’, i.e. the number of schools within 5 miles of a property.
Let’s say we also have the feature ‘median_school’, i.e. the median quality score of those schools.

However, we might suspect that what’s really important is having many school options, but only if they are good.

Well, to capture that interaction, we could simply create a new feature ‘school_score’ = ‘num_schools’ x ‘median_school’.

This new ‘school_score’ feature would only have a high value (relatively) if both those conditions are met.
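In pandas, that interaction feature is a one-line elementwise product. The sketch below uses made-up values for the two columns; only the column names come from the example above.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "num_schools": [1, 6, 6, 2],
    "median_school": [9.0, 3.0, 8.0, 5.0],
})

# Elementwise product: high only when there are many schools AND they score well.
df["school_score"] = df["num_schools"] * df["median_school"]

print(df)
```

Notice that one excellent school (1 x 9.0) and six mediocre schools (6 x 3.0) both score far below six good schools (6 x 8.0), which is the behavior we were after.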

Combine Sparse Classes

The next heuristic we’ll consider is grouping sparse classes. Sparse classes (in categorical features) are those that have very few total observations. They can be problematic for certain machine learning algorithms, causing models to overfit.

There’s no formal rule for how many observations each class needs. It depends on the size of your dataset and the number of other features you have.

However, as a rule of thumb, we recommend combining classes until each one has at least ~50 observations. As with any “rule” of thumb, use this as a guideline (not actually as a rule).

Let’s take a look at the real-estate example:

To begin, we can group similar classes. In the chart above, the exterior_walls feature has several classes that are quite similar.

We might want to group 'Wood Siding', 'Wood Shingle', and 'Wood' into a single class. In fact, let’s just label all of them as 'Wood'.

Next, we can group the remaining sparse classes into a single ‘Other’ class, even if there’s already an ‘Other’ class.

We’d group 'Concrete Block', 'Stucco', 'Masonry', 'Other', and 'Asbestos shingle' into just 'Other'.

Here’s how the class distributions look after combining similar and other classes:

After combining sparse classes, we have fewer unique classes, but each one has more observations. Often, an eyeball test is enough to decide if you want to group certain classes together.
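Here’s a rough sketch of both grouping steps using pandas’ Series.replace. The class names match the example above, but the tiny DataFrame is invented for illustration (including the 'Brick' class), so the counts are not the real ones from the dataset.

```python
import pandas as pd

# Invented sample rows; only the class names come from the example above.
df = pd.DataFrame({"exterior_walls": [
    "Brick", "Wood Siding", "Wood Shingle", "Wood", "Stucco",
    "Concrete Block", "Masonry", "Other", "Asbestos shingle", "Brick",
]})

# Step 1: merge similar classes under a single label.
df["exterior_walls"] = df["exterior_walls"].replace(
    ["Wood Siding", "Wood Shingle"], "Wood"
)

# Step 2: fold the remaining sparse classes into the existing 'Other' class.
sparse_classes = ["Concrete Block", "Stucco", "Masonry", "Asbestos shingle"]
df["exterior_walls"] = df["exterior_walls"].replace(sparse_classes, "Other")

# Check the new class distribution.
counts = df["exterior_walls"].value_counts()
print(counts)
```

Running value_counts() before and after is a quick way to do the eyeball test on a real dataset.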

Add Dummy Variables

Most machine learning algorithms cannot directly handle categorical features. Specifically, they cannot handle text values. Therefore, we need to create dummy variables for our categorical features.

Dummy variables are a set of binary (0 or 1) variables that each represent a single class from a categorical feature. The information you represent is exactly the same, but this numeric representation allows you to pass the technical requirements for algorithms.

In the example above, after grouping sparse classes, we were left with 8 classes, which translate to 8 dummy variables:
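As a minimal sketch of this step, pandas provides get_dummies, which expands a categorical column into one binary column per class. The DataFrame below is a made-up three-class version of the exterior_walls feature, and 'sqft' is a hypothetical numeric column included to show that other features pass through untouched.

```python
import pandas as pd

# Made-up rows; 'sqft' is a hypothetical numeric feature.
df = pd.DataFrame({
    "sqft": [1200, 1850, 2400],
    "exterior_walls": ["Brick", "Wood", "Other"],
})

# One 0/1 column per class; the original text column is dropped.
# dtype=int forces 0/1 integers instead of booleans.
abt = pd.get_dummies(df, columns=["exterior_walls"], dtype=int)

print(abt)
```

Each row has exactly one 1 across the new 'exterior_walls_*' columns, so the information is the same as before, just in a numeric form.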

Remove Unused Features

Finally, we should remove unused or redundant features from the dataset.

Unused features are those that don’t make sense to pass into our machine learning algorithms. Examples include:

ID columns
Features that wouldn’t be available at the time of prediction
Other text descriptions

Redundant features would typically be those that have been replaced by other features that you’ve added during feature engineering. For example, if you group a numeric feature into a categorical one, you can often improve model performance by removing the “distracting” original feature.
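Dropping columns is straightforward in pandas. In the sketch below, every column name is hypothetical: 'property_id' stands in for an ID column and 'listing_description' for a free-text description.

```python
import pandas as pd

# Hypothetical dataset with an ID column and a free-text column.
df = pd.DataFrame({
    "property_id": [101, 102, 103],
    "listing_description": ["Cozy bungalow", "Spacious lot", "Updated kitchen"],
    "sqft": [1200, 1850, 2400],
    "price": [250000, 340000, 410000],
})

# Columns that don't make sense to pass into the algorithm.
unused = ["property_id", "listing_description"]

# drop returns a new DataFrame; the listed columns are removed.
df = df.drop(columns=unused)

print(df.columns.tolist())
```

Keeping the list of dropped columns in one named variable makes it easy to apply the same step to new data at prediction time.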

Analytical Base Table (ABT)

After completing Data Cleaning and Feature Engineering, you’ll have transformed your raw dataset into an analytical base table (ABT). We call it an “ABT” because it’s what you’ll be building your models on.

Ping Pong Table (Not ABT) — “Would someone please ask Alex to get off the ping-pong table? We’re waiting to play!”

As a final tip: Not all of the features you engineer need to be winners. In fact, you’ll often find that many of them don’t improve your model. That’s fine, because one highly predictive feature can make up for ten duds.

The key is choosing machine learning algorithms that can automatically select the best features among many options (built-in feature selection). This will allow you to avoid overfitting your model despite providing many input features. We’ll talk about this in the next core step of the Machine Learning Workflow: Algorithm Selection!
