Feature Engineering for Better Machine Learning Models

Feature Engineering

Chapter 4 of 7 in the Data Science Primer

Welcome to our 7-part mini-course on data science and applied machine learning!

In the previous chapter, you learned a reliable framework for cleaning your dataset. We fixed structural errors, handled missing data, and filtered observations.

In this guide, we'll see how we can perform feature engineering to help out our algorithms and improve model performance.

Remember, out of all the core steps, data scientists usually spend the most time on feature engineering:

What Goes Into a Successful Model

What is Feature Engineering?

Feature engineering is about creating new input features from your existing ones.

In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

This is often one of the most valuable tasks a data scientist can do to improve model performance, for 3 big reasons:

  1. You can isolate and highlight key information, which helps your algorithms "focus" on what’s important.
  2. You can bring in your own domain expertise.
  3. Most importantly, once you understand the "vocabulary" of feature engineering, you can bring in other people’s domain expertise!

In this lesson, we will introduce several heuristics to help spark new ideas.

Before moving on, we just want to note that this is not an exhaustive compendium of all feature engineering because there are limitless possibilities for this step.

The good news is that this skill will naturally improve as you gain more experience.

Feature engineering

Getting classy.

Infuse Domain Knowledge

You can often engineer informative features by tapping into your (or others’) expertise about the domain.

Try to think of specific information you might want to isolate. Here, you have a lot of "creative freedom."

Going back to our example with the real-estate dataset, let's say you remembered that the housing crisis occurred in the same timeframe...

Zillow Screenshot

Screenshot taken from Zillow Home Values

Well, if you suspect that prices would be affected, you could create an indicator variable for transactions during that period. Indicator variables are binary variables that can be either 0 or 1. They "indicate" if an observation meets a certain condition, and they are very useful for isolating key properties.
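
Here's a minimal sketch of what that could look like in pandas. The DataFrame, the 'tx_year' column, and the crisis window used here are hypothetical stand-ins for your own data.

    import pandas as pd

    # Toy real-estate data (hypothetical column names and values).
    df = pd.DataFrame({'tx_year': [2006, 2009, 2011, 2015],
                       'tx_price': [310000, 245000, 230000, 355000]})

    # Indicator variable: 1 if the sale happened during the assumed
    # crisis window (2008-2012 here), 0 otherwise.
    df['during_recession'] = df['tx_year'].between(2008, 2012).astype(int)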

As you might suspect, "domain knowledge" is very broad and open-ended. At some point, you'll get stuck or exhaust your ideas.

That's where the next few steps come in: a few specific heuristics that can help spark more ideas.

Create Interaction Features

Joining forces.

The first of these heuristics is checking to see if you can create any interaction features that make sense. These are combinations of two or more features.

By the way, in some contexts, "interaction terms" must be products between two variables. In our context, interaction features can be products, sums, or differences between two features.

A general tip is to look at each pair of features and ask yourself, "could I combine this information in any way that might be even more useful?"

Example (real-estate)

  • Let's say we already had a feature called 'num_schools', i.e. the number of schools within 5 miles of a property.
  • Let's say we also had the feature 'median_school', i.e. the median quality score of those schools.
  • However, we might suspect that what's really important is having many school options, but only if they are good.
  • Well, to capture that interaction, we could simply create a new feature, 'school_score' = 'num_schools' x 'median_school' (see the sketch below).
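
For instance, here's a quick pandas sketch of that interaction feature. The DataFrame and its values are made up; only the multiplication itself matters.

    import pandas as pd

    # Hypothetical school features from the real-estate example.
    df = pd.DataFrame({'num_schools': [3, 1, 5],
                       'median_school': [7.5, 9.0, 4.0]})

    # Interaction feature: the product of the two school features.
    df['school_score'] = df['num_schools'] * df['median_school']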

Combine Sparse Classes

The next heuristic we’ll consider is grouping sparse classes.

Sparse classes (in categorical features) are those that have very few total observations. They can be problematic for certain machine learning algorithms, causing models to overfit.

  • There's no formal rule for how many observations each class needs.
  • It also depends on the size of your dataset and the number of other features you have.
  • As a rule of thumb, we recommend combining classes until each one has at least ~50 observations. As with any "rule" of thumb, use this as a guideline (not actually as a rule).

Let's take a look at the real-estate example:

Before grouping sparse classes

To begin, we can group similar classes. In the chart above, the 'exterior_walls' feature has several classes that are quite similar.

  • We might want to group 'Wood Siding', 'Wood Shingle', and 'Wood' into a single class. In fact, let's just label all of them as 'Wood'.

Next, we can group the remaining sparse classes into a single 'Other' class, even if there's already an 'Other' class.

  • We'd group 'Concrete Block', 'Stucco', 'Masonry', 'Other', and 'Asbestos shingle' into just 'Other'.

Here's how the class distributions look after combining similar and other classes:

After grouping sparse classes

After combining sparse classes, we have fewer unique classes, but each one has more observations.

Often, an eyeball test is enough to decide if you want to group certain classes together.
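
If you're working in pandas, grouping classes can be as simple as a couple of replace() calls. This sketch assumes a DataFrame with the 'exterior_walls' feature; the example values are made up.

    import pandas as pd

    # Hypothetical 'exterior_walls' values before grouping.
    df = pd.DataFrame({'exterior_walls': ['Wood Siding', 'Brick', 'Wood Shingle',
                                          'Stucco', 'Masonry', 'Wood', 'Other']})

    # Group similar classes into a single 'Wood' class.
    df['exterior_walls'] = df['exterior_walls'].replace(
        ['Wood Siding', 'Wood Shingle'], 'Wood')

    # Group the remaining sparse classes into 'Other'.
    df['exterior_walls'] = df['exterior_walls'].replace(
        ['Concrete Block', 'Stucco', 'Masonry', 'Asbestos shingle'], 'Other')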

Add Dummy Variables

Most machine learning algorithms cannot directly handle categorical features. Specifically, they cannot handle text values.

Therefore, we need to create dummy variables for our categorical features.

Dummy variables are a set of binary (0 or 1) variables that each represent a single class from a categorical feature.

The information you represent is exactly the same, but this numeric representation allows you to meet the technical requirements of your algorithms.

In the example above, after grouping sparse classes, we were left with 8 classes, which translate to 8 dummy variables:

Dummy Variables Example

(The 3rd column depicts an example for an observation with brick walls)
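
In pandas, you can create these with get_dummies(). The tiny DataFrame below is hypothetical; the point is that each class becomes its own 0/1 column.

    import pandas as pd

    # Hypothetical cleaned-up 'exterior_walls' column.
    df = pd.DataFrame({'exterior_walls': ['Brick', 'Wood', 'Other', 'Brick']})

    # One binary (0/1) dummy column per class.
    dummies = pd.get_dummies(df['exterior_walls'], prefix='exterior_walls', dtype=int)

    # Replace the original text column with its dummy columns.
    df = pd.concat([df.drop('exterior_walls', axis=1), dummies], axis=1)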

Remove Unused Features

Finally, remove unused or redundant features from the dataset.

Unused features are those that don’t make sense to pass into our machine learning algorithms. Examples include:

  • ID columns
  • Features that wouldn't be available at the time of prediction
  • Other text descriptions

Redundant features would typically be those that have been replaced by other features that you’ve added during feature engineering.
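
Dropping them is a one-liner in pandas. The column names below (an ID column plus the school features that 'school_score' replaced) are hypothetical.

    import pandas as pd

    # Hypothetical work-in-progress table.
    df = pd.DataFrame({'property_id': [101, 102],
                       'num_schools': [3, 1],
                       'median_school': [7.5, 9.0],
                       'school_score': [22.5, 9.0],
                       'tx_price': [310000, 245000]})

    # Drop the ID column (unused) plus the features 'school_score' replaced.
    df = df.drop(['property_id', 'num_schools', 'median_school'], axis=1)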

Checkpoint Quiz

After completing Data Cleaning and Feature Engineering, you'll have transformed your raw dataset into an analytical base table (ABT). We call it an "ABT" because it's what you'll be building your models on.

As a final tip: Not all of the features you engineer need to be winners. In fact, you’ll often find that many of them don’t improve your model. That’s fine because one highly predictive feature makes up for 10 duds.

The key is choosing machine learning algorithms that can automatically select the best features among many options (built-in feature selection).

This will allow you to avoid overfitting your model despite providing many input features.
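
As a quick illustration (not the focus of this chapter), here's a sketch using scikit-learn's Lasso, one example of an algorithm with built-in feature selection. The data is random noise plus a single truly predictive column, so most coefficients get pushed to zero.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Toy data: 10 features, but only the first one actually drives the target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

    # L1 regularization shrinks the weights of uninformative features to zero,
    # which acts as built-in feature selection.
    model = Lasso(alpha=0.1).fit(X, y)
    print(model.coef_)  # most entries end up at (or very near) zero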

Ping Pong Table

"Would someone please ask Alex to get off the ping-pong table? We're waiting to play!"


Quiz time!

  • What are indicator variables and why are they useful?
  • What are two criteria you can use to group sparse classes?
  • In a set of dummy variables created from the same feature, would there ever be multiple variables with value 1 (per observation)?
  • In our real-estate example, what would be the values for the 'exterior_walls' dummy variables if a property had metal walls?

You can keep track of your answers in the Companion Worksheet, which also has an answer key at the end.

« Chapter 3: Data Cleaning
Chapter 5: Algorithm Selection »