The Feature-Engineering Process

Features are the input variables (datapoints) used to train Machine Learning and AI models. Feature-engineering is the process of extracting data and transforming it in such a way that it can be useful for training predictive models.

There are many challenges and steps in the feature-engineering process, but essentially there are two key aspects.

  1. The technical or Science part
  2. The intuitive or Art part

The Technical Part

Data wrangling is the process of converting raw data into a usable form.

The technical part of the feature-engineering process (data wrangling) involves extracting, cleaning, formatting, transforming, labeling and classifying raw data in order to provide context that a machine learning model can learn from. It requires various skills, including the ability to manipulate, analyze and format data. Collecting, cleaning and formatting data can be quite time-consuming because data is often located in different places, stored in different formats (e.g. text files, images) and may be extracted from a variety of sources (data files, voice recognition, etc.).
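
To make the wrangling steps concrete, here is a minimal, hypothetical sketch using pandas (the file name and column names are assumptions for illustration, not real data):

    import pandas as pd

    # Hypothetical raw extract pulled from one of several sources
    raw = pd.read_csv("raw_property_data.csv")

    # Cleaning: drop exact duplicates and rows missing key fields
    clean = raw.drop_duplicates().dropna(subset=["purchase_date", "current_age"])

    # Formatting: standardize types so later transformations behave consistently
    clean["purchase_date"] = pd.to_datetime(clean["purchase_date"], errors="coerce")
    clean["current_age"] = pd.to_numeric(clean["current_age"], errors="coerce")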


The Art Part

The art part of the feature-engineering process can be even more challenging than the technical part because transforming raw data into useful data often requires research, brainstorming and experimentation, along with business-domain knowledge and intuitive thinking (intuition). Let’s look at some real-life, practical examples.

A data science team was using property data to train an advertising model for a mortgage product for seniors. The training data contained various datapoints, including purchase date and the current age of the homeowner. The team expected that the homeowner’s age would be an important feature in the model. Initially it was not, and the model validation score was weak. Then one of the team members noticed that the average age was showing up as 84. However, because of their business-domain knowledge, the team knew that the average should be around 70. After closer inspection, the team realized that the age provided was the consumer’s current age, not the age at the time of purchase (aha!). Using the current age of the homeowner in combination with the purchase date, the team created a new feature variable:

Age-at-Time-of-Purchase

Purchase Date    Current Age    Age-at-Time-of-Purchase
01/12/2005       90             70
10/24/2020       76             72
05/08/2016       78             71

The new feature, Age-at-Time-of-Purchase, transformed age from being the least important feature to the most important feature in the model. Additionally, the model validation score (AUC) increased significantly from 0.64 (poor) to 0.91 (excellent).
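
To illustrate, here is a minimal sketch of how such a feature can be derived with pandas (the column names and the snapshot date are assumptions for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "purchase_date": pd.to_datetime(["01/12/2005", "10/24/2020", "05/08/2016"]),
        "current_age": [90, 76, 78],
    })

    # Date on which the "current age" was observed (assumed)
    snapshot_date = pd.Timestamp("2024-12-31")

    # Subtract the years elapsed since purchase from the current age
    years_since_purchase = (snapshot_date - df["purchase_date"]).dt.days / 365.25
    df["age_at_time_of_purchase"] = (df["current_age"] - years_since_purchase).round().astype(int)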

Point-in-Time Correctness

As we saw in the previous example, transforming a feature via point-in-time correctness can have a significant impact. For example, a datapoint such as the date a consumer moved into a new residence is not very useful in its raw state until you incorporate the element of time.

FEATURE VARIABLE       IMPORTANCE    USEFUL
Move-in-Date           0.00%         NO
Length-of-Residence    65.34%        YES


Move-in-Date + Point-in-Time Correctness = Length-of-Residence
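
A minimal sketch of that transformation (hypothetical column names; the as-of date is whatever point in time the prediction is made):

    import pandas as pd

    df = pd.DataFrame({"move_in_date": pd.to_datetime(["2010-06-01", "2021-03-15"])})
    as_of_date = pd.Timestamp("2024-01-01")  # point in time of the prediction (assumed)

    # The raw move-in date says little by itself; elapsed time as of the prediction date does
    df["length_of_residence_years"] = (as_of_date - df["move_in_date"]).dt.days / 365.25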


Data Leakage

Predictive models use past events to predict future events. However, a predictive model cannot use the future to predict the present. When information from the future makes its way into the training data anyway, it’s called data leakage. While a model with data leakage may perform great in a training environment, it will fail in real time because you simply won’t have future data at your disposal.

Point-in-time correctness ensures no future data is used for model training. When preparing training data, you need to watch out for data leakage in order to be certain that your model is only trained on time-correct datapoints.

Data leakage happens when you include features that occurred in the future (after the event). Many commonly used demographic feature variables, such as a consumer’s age, income or geographic location, can change over time. Other feature variables, such as a person’s race, eye color or blood type, will never change, so data leakage is not an issue in those instances.
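
As a simple illustration (hypothetical column names), one basic leakage check is to drop any observation of a time-varying feature that was recorded after the event you are trying to predict:

    import pandas as pd

    # Each row pairs a feature observation with the event it is meant to help predict
    rows = pd.DataFrame({
        "event_time":   pd.to_datetime(["2023-06-01", "2023-06-01"]),
        "feature_time": pd.to_datetime(["2023-05-15", "2023-07-10"]),  # second row is "from the future"
        "income":       [55000, 61000],
    })

    # Keep only feature values that were already known at the time of the event
    time_correct = rows[rows["feature_time"] <= rows["event_time"]]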


Time-Series Features

Data variables that change over time are referred to as time-series feature variables. When you are dealing with time-series features, point-in-time correctness is a critical piece of the puzzle.

The key is to always collect and store timestamps for any feature variables in your training data that can change over time, because the timestamp is vital when you’re dealing with point-in-time-correct features.
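
One common way to put this into practice is an "as-of" join: for each labeled event, pick the most recent feature snapshot taken on or before the event timestamp. A minimal sketch with pandas (tables and column names are hypothetical):

    import pandas as pd

    # Timestamped snapshots of a time-series feature
    income_history = pd.DataFrame({
        "consumer_id": [1, 1, 2],
        "observed_at": pd.to_datetime(["2022-01-01", "2023-01-01", "2022-06-01"]),
        "income": [52000, 58000, 47000],
    }).sort_values("observed_at")

    # Labeled events we want to train on
    events = pd.DataFrame({
        "consumer_id": [1, 2],
        "event_time": pd.to_datetime(["2022-09-15", "2023-02-01"]),
    }).sort_values("event_time")

    # For each event, take the latest income observed on or before the event time
    training = pd.merge_asof(
        events, income_history,
        left_on="event_time", right_on="observed_at",
        by="consumer_id", direction="backward",
    )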

Feature-Variable Power

Feature evaluation and selection is another important step in the feature-engineering process. One of the key decisions the data science team must make is which feature-variables should be included in a model and, conversely, which feature-variables should be excluded.

Feature Importance vs. Feature Coverage

Importance    Coverage    Feature-Power
90%           10%         9%
70%           30%         21%
50%           50%         25%
30%           70%         21%
20%           90%         18%
5%            100%        5%


The current population of the United States in 2024 is approximately 341 million.
Estimated number of US households: 131 million.



Importance    Coverage    Feature-Power
1             0.498       0
2             0.923       1
3             0.350       0
4             0.147       0
5             0.071       0


Feature-Coverage is based on the availability of the value in a data set.


Feature-Importance is based on the predictive contribution of a data point.

Better Practices vs. Best Practices

Feature-Power Ranking is calculated by multiplying the relative importance of a feature’s predictiveness by the percentage of coverage (availability) for that feature in the training data set. This approach is considered the best practice for selecting which datapoints to keep or abandon. However, there can be a great deal of variation in the process because striking the right balance between predictiveness and coverage is highly subjective; every data scientist may have a different opinion on what is optimal, and consensus and agreement may be difficult to reach. Establishing pre-defined thresholds may seem like a good solution, but the problem is that every situation is different and what’s ideal may not be practical. Factors such as coverage and the cost of the data must also be taken into consideration.
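
In code, the ranking comes down to a simple product of the two quantities. A minimal sketch (feature names and values are hypothetical):

    import pandas as pd

    features = pd.DataFrame({
        "feature":    ["length_of_residence", "income", "move_in_date"],
        "importance": [0.90, 0.50, 0.05],   # relative predictive contribution
        "coverage":   [0.10, 0.50, 1.00],   # share of records where the value is populated
    })

    # Feature-Power = importance x coverage; rank features by it
    features["feature_power"] = features["importance"] * features["coverage"]
    ranked = features.sort_values("feature_power", ascending=False)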

We believe predictiveness is more important than coverage because the whole point of a model is to make more precise predictions. Instead of developing a single model with the best coverage, we build multiple models using the most predictive feature variables. This delivers better predictions and better business outcomes (check out our CASE STUDY). As we like to say, when it comes to model development... the more, the merrier!
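
One way to picture the multi-model idea (purely illustrative, not the actual implementation): train one model per feature set and score each record with the first model whose required features are actually populated for it.

    # Illustrative sketch only; `models` is an ordered list of
    # (required_features, model) pairs, most predictive feature set first (assumed).
    def pick_model(record: dict, models):
        for required, model in models:
            if all(record.get(f) is not None for f in required):
                return model
        raise ValueError("no model matches the available features")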