Feature Engineering
- Feature engineering is the process of transforming raw data into informative input features by creating or modifying variables, enabling machine learning models to capture patterns in the data more effectively.
- It is applied before or during model training
- Example: Creating “Total Spend” from “Price × Quantity”
Feature Selection
- Feature selection involves choosing the most relevant features from the dataset, reducing noise and redundancy to improve model accuracy, efficiency, and interpretability.
- It is applied after features have been created
- Example: Selecting “Price” and “Season” while dropping “Product ID”
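A minimal sketch of both steps, assuming a hypothetical orders DataFrame with Price, Quantity, Season, and Product ID columns:

import pandas as pd

orders = pd.DataFrame({'Product ID': [101, 102], 'Price': [20.0, 35.0],
                       'Quantity': [3, 1], 'Season': ['Summer', 'Winter']})
# Feature engineering: derive a new column from existing ones
orders['Total Spend'] = orders['Price'] * orders['Quantity']
# Feature selection: keep the informative columns, drop the identifier
features = orders[['Price', 'Season', 'Total Spend']]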
Importance of Feature Engineering and Feature Selection
1. Improve Model Accuracy
- Models learn patterns from the input data.
- Well-engineered and selected features lead to better predictions.
- Example: Creating total_sales = price × quantity encodes the business logic directly as a feature.
2. Reduce Overfitting
- Irrelevant or noisy features can confuse the model and make it memorize training data.
- Selecting only meaningful features helps the model generalize better on new data.
3. Speed Up Training
- Fewer, relevant features reduce training time and computational cost.
- Especially important with large datasets (e.g., image or text data).
4. Improve Interpretability
- With fewer and clearer features, it’s easier to explain the model’s behavior.
- Critical in domains like healthcare, finance, or retail strategy.
5. Deal with Real-World, Messy Data
- Real datasets often contain missing values, redundant columns, or raw/unstructured data.
- Feature engineering cleans and transforms such data into a usable format.
- Feature selection removes junk that adds no value.
Example in Fashion Retail:
- Raw data: Product ID, Price, Category, Stock, Launch Date
- Feature Engineering: Create Days_Since_Launch, Is_Discounted
- Feature Selection: Keep Category, Price, Days_Since_Launch; drop Product ID
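A small sketch of this example, assuming a hypothetical products DataFrame with the raw columns above plus a Discount column used to derive the Is_Discounted flag:

import pandas as pd

products = pd.DataFrame({
    'Product ID': [1, 2],
    'Price': [49.9, 19.9],
    'Category': ['Dress', 'T-Shirt'],
    'Stock': [120, 400],
    'Launch Date': ['2024-03-01', '2024-06-15'],
    'Discount': [0.2, 0.0],   # assumed extra column, not in the raw schema above
})
# Feature Engineering
today = pd.Timestamp.today().normalize()
products['Days_Since_Launch'] = (today - pd.to_datetime(products['Launch Date'])).dt.days
products['Is_Discounted'] = (products['Discount'] > 0).astype(int)
# Feature Selection
features = products[['Category', 'Price', 'Days_Since_Launch']]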
Feature Engineering Techniques
For Machine Learning (ML)
Feature engineering is crucial in ML since models rely heavily on hand-crafted features.
Numerical Features
- Binning/Discretization: Convert continuous values into categories or intervals (e.g., age groups).
- Log/Power Transformations: Reduce skewness in data by applying log or power transformations.
- Scaling: Standardize or normalize values to bring them onto the same scale.
- Polynomial Features: Add features like x², x³ to model non-linear relationships.
- Interaction Features: Multiply two or more features to capture their combined effect.
- Aggregation Features: Compute stats like mean, median, std within groups (e.g., by customer).
- Domain-Specific Ratios: Custom ratios based on business logic (e.g., sales per visit).
Categorical Features
- One-Hot Encoding: Converts categories into binary columns.
- Label Encoding: Maps each category to a unique integer (mainly suited to tree-based models, since it implies an artificial order).
- Target Encoding: Replaces each category with the mean of the target for that group (fit on training data only to avoid leakage).
- Frequency Encoding: Replaces each category with its occurrence count or relative frequency.
- Hash Encoding: Maps categories to fixed-size hash buckets to reduce dimensionality.
Time-Series Features
- Lag Features: Include values from previous time steps as features.
- Rolling Statistics: Capture moving mean, std, min, or max over a time window.
- Date-Time Feature Extraction: Extract features like day, month, hour from datetime.
- Fourier Transforms: Transform data to frequency domain to detect seasonality.
Text Features
- Bag-of-Words (BoW): Count frequency of each word across documents.
- TF-IDF: Weighs words based on frequency and uniqueness across documents.
- Word Embeddings: Converts words into dense vectors capturing context and meaning.
- N-grams: Sequences of N words (e.g., bigrams or trigrams) to capture context.
- Character-level Features: Analyze patterns at the character level instead of the word level.
For Deep Learning (DL)
DL models can learn features automatically, but some feature engineering helps:
Numerical Features
- Scaling: Normalize input features to improve training performance.
- Log Transform: Useful to reduce skewed distributions.
- Binning: Convert continuous features into discrete buckets if needed.
Categorical Features
- Embedding Layers: Learn dense representations for high-cardinality categorical data.
- One-Hot Encoding: Useful for low-cardinality features.
Text Features
- Word Embeddings (BERT/GPT): Use pre-trained models to generate rich word vectors.
- Subword Tokenization: Breaks words into smaller chunks for better generalization.
- Positional Encoding: Adds position information to embeddings (used in Transformers).
Image Features
- Pixel Normalization: Scale pixel values between 0 and 1.
- Data Augmentation: Artificially expand dataset by flipping, rotating, etc.
- Pretrained CNN Features: Use features from models like ResNet, VGG, etc.
Time-Series Features
- Sequence Padding: Pad sequences to make equal length for RNNs/Transformers.
- Rolling Windows: Slice data into overlapping time windows for forecasting.
Feature Selection Techniques
- Correlation Analysis: Keep features that correlate strongly with the target; drop features with little target correlation or pairs of features that are highly correlated with each other (redundant).
- Mutual Information: Measures the dependency between each feature and the target, including non-linear relationships.
- Recursive Feature Elimination (RFE): Recursively removes least important features.
- L1 Regularization (Lasso): Automatically selects features by zeroing out coefficients.
- Feature Importance (Tree-based models): Rank features based on model contribution.
Python Implementation of the Above Techniques
# ===== NUMERICAL FEATURES =====
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, KBinsDiscretizer
# Binning
df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100], labels=['child', 'young', 'adult', 'senior'])
# Log Transform
df['income_log'] = np.log1p(df['income'])
# Scaling
scaler = StandardScaler()
df[['age_scaled']] = scaler.fit_transform(df[['age']])
# Polynomial Features
poly = PolynomialFeatures(degree=2)
df_poly = pd.DataFrame(poly.fit_transform(df[['x1', 'x2']]), columns=poly.get_feature_names_out())
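# Interaction Features (explicit product of two columns; PolynomialFeatures above also generates these terms)
df['x1_x2'] = df['x1'] * df['x2']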
# Aggregation
df['avg_income_by_city'] = df.groupby('city')['income'].transform('mean')
# Domain-Specific Ratio
df['bmi'] = df['weight_kg'] / (df['height_m'] ** 2)
# ===== CATEGORICAL FEATURES =====
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# One-Hot Encoding
df_encoded = pd.get_dummies(df['color'], prefix='color')
# Label Encoding
df['color_encoded'] = LabelEncoder().fit_transform(df['color'])
# Target Encoding (computed on the full data for brevity; in practice, fit on training folds only to avoid target leakage)
df['city_target_encoded'] = df.groupby('city')['target'].transform('mean')
# Frequency Encoding (relative frequency; drop normalize=True for raw counts)
df['city_freq'] = df['city'].map(df['city'].value_counts(normalize=True))
# Hash Encoding
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=3, input_type='string')
# FeatureHasher expects each sample to be an iterable of strings, so wrap every category in a list
hashed_features = hasher.transform([[city] for city in df['city'].astype(str)]).toarray()
# ===== TIME-SERIES FEATURES =====
# Lag Features (assumes rows are sorted by time)
df['sales_lag1'] = df['sales'].shift(1)
# Rolling Statistics
df['sales_7day_avg'] = df['sales'].rolling(7).mean()
# Date-Time Extraction
df['hour'] = pd.to_datetime(df['timestamp']).dt.hour
# Fourier Transforms (for seasonality)
from scipy.fft import fft
fft_values = np.abs(fft(df['sales'].values))
# ===== TEXT FEATURES =====
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Bag-of-Words
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['text'])
# TF-IDF
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['text'])
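# N-grams (word bigrams) and character-level features
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigrams = bigram_vectorizer.fit_transform(df['text'])
char_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))
X_char = char_vectorizer.fit_transform(df['text'])
# Word Embeddings (sketch using the optional gensim package; averages word vectors per document)
from gensim.models import Word2Vec
tokenized = df['text'].str.lower().str.split()
w2v = Word2Vec(sentences=tokenized.tolist(), vector_size=50, min_count=1)
doc_vectors = np.array([w2v.wv[tokens].mean(axis=0) for tokens in tokenized])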
# ===== DEEP LEARNING =====
# Embedding Layers (Keras example)
import tensorflow as tf
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=64)
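# Subword Tokenization (sketch using the optional Hugging Face 'transformers' package;
# downloads a pretrained vocabulary on first use)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
subword_tokens = tokenizer.tokenize('unbelievably comfortable sneakers')
token_ids = tokenizer.encode('unbelievably comfortable sneakers')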
# Pixel Normalization (Images)
images_normalized = images / 255.0
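# Data Augmentation (Keras preprocessing layers; assumes a batch of RGB images and TF >= 2.6)
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
])
augmented_images = augmentation(images_normalized, training=True)  # random transforms apply only in training mode
# Pretrained CNN Features (ResNet50 backbone without its classification head; downloads ImageNet weights on first use)
backbone = tf.keras.applications.ResNet50(include_top=False, weights='imagenet', pooling='avg')
cnn_features = backbone.predict(tf.keras.applications.resnet50.preprocess_input(images))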
# Sequence Padding
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences, maxlen=100)
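# Rolling Windows (overlapping input windows with next-step targets; requires NumPy >= 1.20)
window = 7
series = df['sales'].to_numpy()
X_windows = np.lib.stride_tricks.sliding_window_view(series, window)[:-1]
y_next = series[window:]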
# ===== FEATURE SELECTION =====
from sklearn.feature_selection import mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression, Lasso
# Correlation Analysis (numeric features strongly correlated with the target)
target_corr = df.corr(numeric_only=True)['target'].drop('target')
correlated_features = target_corr[target_corr.abs() > 0.5].index
# Mutual Information
mi_scores = mutual_info_classif(X, y)
# Recursive Feature Elimination
selector = RFE(estimator=LogisticRegression(), n_features_to_select=5)
selector.fit(X, y)
# L1 Regularization
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
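# Features kept by Lasso are those with non-zero coefficients (assumes X is a DataFrame)
lasso_selected = X.columns[lasso.coef_ != 0]
# Feature Importance (tree-based models)
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)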
Key Notes
- Replace placeholders (df, X, y) with your actual data.
- For DL-specific techniques (e.g., positional encoding), refer to libraries like transformers or tensorflow.
- Use sklearn.pipeline.Pipeline to chain preprocessing, feature selection, and modeling steps.