The Feature-Engineering Process
Features are the input variables (data points) used to train Machine Learning and AI models. Feature-engineering is the process of extracting data and transforming it in such a way that it can be useful for training predictive models.
There are many challenges and steps in the feature-engineering process, but essentially it has two key aspects:
- The technical or Science part
- The intuitive or Art part
The Technical Part
Data wrangling is the process of converting raw data into a usable form.
The technical part of the feature-engineering process (data wrangling) involves extracting, cleaning, formatting, transforming, labeling and classifying raw data in order to provide context that a machine learning model can learn from. It requires various skills, including the ability to manipulate, analyze and format data. Collecting, cleaning and formatting data can be quite time-consuming because data is often located in different places, arrives in different formats (e.g. text files, images) and may be extracted from a variety of sources (data files, voice recognition, etc.).
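As a minimal sketch of these wrangling steps, the snippet below cleans and normalizes a few hypothetical raw records (the field names, formats, and values are assumptions for illustration, not data from the example that follows):

```python
from datetime import datetime

# Hypothetical raw records: mixed date formats, stray whitespace,
# inconsistent casing, and missing values.
raw_records = [
    {"purchase_date": "01/12/2005", "price": "$250,000 ", "state": " ca"},
    {"purchase_date": "2016-05-08", "price": "310000", "state": "NY"},
    {"purchase_date": "", "price": None, "state": "tx "},
]

def parse_date(value):
    """Normalize dates that may arrive in more than one format."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date()
        except (TypeError, ValueError):
            continue
    return None  # leave truly unparseable values as missing

def clean(record):
    """Clean and format one raw record into model-ready fields."""
    price = record.get("price")
    return {
        "purchase_date": parse_date(record.get("purchase_date")),
        "price": float(price.replace("$", "").replace(",", "").strip())
                 if isinstance(price, str) else price,
        "state": (record.get("state") or "").strip().upper() or None,
    }

cleaned = [clean(r) for r in raw_records]
```

Real pipelines would typically do this with a library such as pandas, but the steps are the same: parse, strip, standardize, and flag missing values rather than silently dropping them.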
The Art Part
The art part of the feature-engineering process can be even more challenging than the technical part, because transforming raw data into useful data often requires research, brainstorming and experimentation along with business-domain knowledge and intuitive thinking (intuition). Let's look at a real-life, practical example.
A data science team was using property data to train an advertising model for a mortgage product for seniors. The training data contained various data points, including purchase date and current age of the homeowner. The team expected that the homeowner's age would be an important feature in the model. Initially it was not, and the model validation score was weak. Then one of the team members noticed that the average age was showing up as 84. However, because of their business-domain knowledge, the team knew that the average should be around 70. After closer inspection, the team realized that the age provided was the consumer's current age, not the age at time of purchase (aha!). Using the current age of the homeowner in combination with the purchase date, the team created a new feature variable:
Age-at-Time-of-Purchase
Purchase Date | Current Age | Age-at-Time-of-Purchase |
---|---|---|
01/12/2005 | 90 | 70 |
10/24/2020 | 76 | 72 |
05/08/2016 | 78 | 71 |
The new feature, Age-at-Time-of-Purchase, transformed age from being the least important feature to the most important feature in the model. Additionally, the model validation score (AUC) increased significantly, from 0.64 (poor) to 0.91 (excellent).
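A minimal sketch of how such a feature could be derived is shown below. The function name and the evaluation date are assumptions for illustration; exact values also depend on each homeowner's birthday relative to the purchase date, so not every row of the table above will reproduce exactly.

```python
from datetime import date

def age_at_purchase(purchase_date: date, current_age: int, as_of: date) -> int:
    """Derive the homeowner's age at time of purchase by subtracting
    the whole years elapsed since the purchase from the current age."""
    elapsed_years = as_of.year - purchase_date.year - (
        (as_of.month, as_of.day) < (purchase_date.month, purchase_date.day)
    )
    return current_age - elapsed_years

# First row of the table, evaluated as of an assumed date:
print(age_at_purchase(date(2005, 1, 12), 90, as_of=date(2025, 1, 12)))  # → 70
```

This kind of derived feature is exactly what the art part of feature engineering produces: no new data was collected, but combining two existing fields with domain knowledge created the model's most predictive input.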