Feature Engineering

  • Feature engineering is the process of transforming raw data into informative input features by creating or modifying variables, enabling machine learning models to better capture patterns in the data.
  • It is applied before or during model training.
  • Example: Creating “Total Spend” from “Price × Quantity”.

Feature Selection

  • Feature selection involves choosing the most relevant features from the dataset, reducing noise and redundancy to improve model accuracy, efficiency, and interpretability.
  • It is applied after features have been created.
  • Example: Selecting “Price” and “Season” while dropping “Product ID”.

Importance of Feature Engineering and Feature Selection

1. Improve Model Accuracy

  • Models learn patterns from the input data.
  • Well-engineered and selected features lead to better predictions.
  • Example: Creating total_sales = price × quantity gives the model direct business logic.

2. Reduce Overfitting

  • Irrelevant or noisy features can confuse the model and make it memorize training data.
  • Selecting only meaningful features helps the model generalize better on new data.

3. Speed Up Training

  • Fewer, relevant features reduce training time and computational cost.
  • Especially important with large datasets (e.g., image or text data).

4. Improve Interpretability

  • With fewer and clearer features, it’s easier to explain the model’s behavior.
  • Critical in domains like healthcare, finance, or retail strategy.

5. Deal with Real-World, Messy Data

  • Real datasets often contain missing, redundant, or raw/unstructured data.
  • Feature engineering cleans and transforms them into a usable format.
  • Feature selection removes junk that adds no value.

Example in Fashion Retail:

  • Raw data: Product ID, Price, Category, Stock, Launch Date
  • Feature Engineering: Create Days_Since_Launch, Is_Discounted
  • Feature Selection: Keep Category, Price, Days_Since_Launch; drop Product ID
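
A minimal pandas sketch of this example (the raw values, the Discount column, and the reference date are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({
    'Product_ID': [101, 102],
    'Price': [49.9, 79.9],
    'Category': ['Dress', 'Shoes'],
    'Stock': [120, 35],
    'Launch_Date': pd.to_datetime(['2024-03-01', '2024-06-15']),
    'Discount': [0.0, 0.15],
})

# Feature engineering: derive new, more informative columns
df['Days_Since_Launch'] = (pd.Timestamp('2024-09-01') - df['Launch_Date']).dt.days
df['Is_Discounted'] = (df['Discount'] > 0).astype(int)

# Feature selection: keep informative columns, drop the identifier
features = df[['Category', 'Price', 'Days_Since_Launch']]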

Feature Engineering Techniques

For Machine Learning (ML)

Feature engineering is crucial in classical ML, since these models rely heavily on hand-crafted features.

Numerical Features

  • Binning/Discretization: Convert continuous values into categories or intervals (e.g., age groups).
  • Log/Power Transformations: Reduce skewness in data by applying log or power transformations.
  • Scaling: Standardize or normalize values to bring them onto the same scale.
  • Polynomial Features: Add features like x², x³ to model non-linear relationships.
  • Interaction Features: Multiply two or more features to capture their combined effect.
  • Aggregation Features: Compute stats like mean, median, std within groups (e.g., by customer).
  • Domain-Specific Ratios: Custom ratios based on business logic (e.g., sales per visit).

Categorical Features

  • One-Hot Encoding: Converts categories into binary columns.
  • Label Encoding: Maps each category to a unique integer (fine for tree-based models, but implies a spurious order for linear models).
  • Target Encoding: Replace categories with the mean of the target for each group (fit on training data only to avoid leakage).
  • Frequency Encoding: Replace categories with their occurrence count.
  • Hash Encoding: Maps categories to fixed-size hash buckets to reduce dimensionality.

Time-Series Features

  • Lag Features: Include values from previous time steps as features.
  • Rolling Statistics: Capture moving mean, std, min, or max over a time window.
  • Date-Time Feature Extraction: Extract features like day, month, hour from datetime.
  • Fourier Transforms: Transform data to frequency domain to detect seasonality.

Text Features

  • Bag-of-Words (BoW): Count frequency of each word across documents.
  • TF-IDF: Weighs words based on frequency and uniqueness across documents.
  • Word Embeddings: Converts words into dense vectors capturing context and meaning.
  • N-grams: Sequences of N words (e.g., bigrams or trigrams) to capture context.
  • Character-level Features: Analyzes patterns at the character level instead of the word level.

For Deep Learning (DL)

DL models can learn features automatically, but some feature engineering helps:

Numerical Features

  • Scaling: Normalize input features to improve training performance.
  • Log Transform: Useful to reduce skewed distributions.
  • Binning: Convert continuous features into discrete buckets if needed.

Categorical Features

  • Embedding Layers: Learn dense representations for high-cardinality categorical data.
  • One-Hot Encoding: Useful for low-cardinality features.

Text Features

  • Word Embeddings (BERT/GPT): Use pre-trained models to generate rich, context-aware word vectors.
  • Subword Tokenization: Breaks words into smaller chunks for better generalization.
  • Positional Encoding: Adds position information to embeddings (used in Transformers).

Image Features

  • Pixel Normalization: Scale pixel values between 0 and 1.
  • Data Augmentation: Artificially expand dataset by flipping, rotating, etc.
  • Pretrained CNN Features: Use features from models like ResNet, VGG, etc.

Time-Series Features

  • Sequence Padding: Pad sequences to make equal length for RNNs/Transformers.
  • Rolling Windows: Slice data into overlapping time windows for forecasting.

Feature Selection Techniques

  • Correlation Analysis: Drop features with little correlation to the target, or features that are highly correlated with each other (redundant).
  • Mutual Information: Measures dependency between variables, including non-linear relationships.
  • Recursive Feature Elimination (RFE): Recursively removes least important features.
  • L1 Regularization (Lasso): Automatically selects features by zeroing out coefficients.
  • Feature Importance (Tree-based models): Rank features based on model contribution.

Python Implementation of the Above Techniques

# ===== NUMERICAL FEATURES =====  
import numpy as np  
import pandas as pd  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, KBinsDiscretizer  

# Binning
df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100], labels=['child', 'young', 'adult', 'senior'])
df['age_bin_eq'] = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile').fit_transform(df[['age']]).ravel()

# Log Transform  
df['income_log'] = np.log1p(df['income'])  

# Scaling  
scaler = StandardScaler()  
df[['age_scaled']] = scaler.fit_transform(df[['age']])  

# Polynomial Features  
poly = PolynomialFeatures(degree=2)  
df_poly = pd.DataFrame(poly.fit_transform(df[['x1', 'x2']]), columns=poly.get_feature_names_out())  
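
# Interaction Features (sketch: explicit product of the assumed columns x1, x2)
df['x1_x2'] = df['x1'] * df['x2']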

# Aggregation  
df['avg_income_by_city'] = df.groupby('city')['income'].transform('mean')  

# Domain-Specific Ratio  
df['bmi'] = df['weight_kg'] / (df['height_m'] ** 2)  

# ===== CATEGORICAL FEATURES =====  
from sklearn.preprocessing import OneHotEncoder, LabelEncoder  

# One-Hot Encoding
df_encoded = pd.get_dummies(df['color'], prefix='color')
onehot = OneHotEncoder(sparse_output=False).fit_transform(df[['color']])  # sklearn >= 1.2; use sparse=False on older versions

# Label Encoding (note: sklearn intends LabelEncoder for targets; OrdinalEncoder is the feature-side equivalent)
df['color_encoded'] = LabelEncoder().fit_transform(df['color'])

# Target Encoding (compute on training data only to avoid target leakage)
df['city_target_encoded'] = df.groupby('city')['target'].transform('mean')

# Frequency Encoding (relative frequency; drop normalize=True for raw counts)
df['city_freq'] = df['city'].map(df['city'].value_counts(normalize=True))

# Hash Encoding  
from sklearn.feature_extraction import FeatureHasher  
hasher = FeatureHasher(n_features=3, input_type='string')
# with input_type='string', each sample must be an iterable of strings, so wrap each value in a list
hashed_features = hasher.transform(df['city'].astype(str).map(lambda v: [v])).toarray()

# ===== TIME-SERIES FEATURES =====  
# Lag Features  
df['sales_lag1'] = df['sales'].shift(1)  

# Rolling Statistics  
df['sales_7day_avg'] = df['sales'].rolling(7).mean()  

# Date-Time Extraction  
df['hour'] = pd.to_datetime(df['timestamp']).dt.hour  

# Fourier Transforms (for seasonality)  
from scipy.fft import fft  
fft_values = np.abs(fft(df['sales'].values))  

# ===== TEXT FEATURES =====  
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  

# Bag-of-Words  
vectorizer = CountVectorizer()  
X_bow = vectorizer.fit_transform(df['text'])  

# TF-IDF  
tfidf = TfidfVectorizer()  
X_tfidf = tfidf.fit_transform(df['text'])  
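
# N-grams (unigrams + bigrams) and Character-level Features
ngram_vec = CountVectorizer(ngram_range=(1, 2))
X_ngrams = ngram_vec.fit_transform(df['text'])
char_vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X_char = char_vec.fit_transform(df['text'])

# Word Embeddings (sketch using gensim, assuming it is installed;
# Word2Vec expects pre-tokenized sentences, i.e. lists of words)
from gensim.models import Word2Vec
sentences = df['text'].str.lower().str.split().tolist()
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
word_vector = w2v.wv[sentences[0][0]]  # embedding of the first word of the first document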

# ===== DEEP LEARNING =====  
# Embedding Layers (Keras example)  
import tensorflow as tf  
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=64)  

# Pixel Normalization (Images)  
images_normalized = images / 255.0  
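
# Data Augmentation (Keras preprocessing layers; a minimal sketch)
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
])
images_augmented = augment(images_normalized, training=True)  # randomness only applies in training mode

# Pretrained CNN Features (sketch: frozen ResNet50 as a feature extractor;
# assumes 224x224 RGB inputs — in practice apply
# tf.keras.applications.resnet50.preprocess_input first)
resnet = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, pooling='avg')
image_features = resnet.predict(images_normalized)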

# Sequence Padding  
from tensorflow.keras.preprocessing.sequence import pad_sequences  
padded_sequences = pad_sequences(sequences, maxlen=100)  
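
# Subword Tokenization + Contextual Embeddings (sketch using the Hugging Face
# transformers library, assuming it is installed and weights can be downloaded)
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = TFAutoModel.from_pretrained('bert-base-uncased')
enc = tokenizer(['summer floral dress'], return_tensors='tf', padding=True)
token_vectors = bert(**enc).last_hidden_state  # one contextual vector per subword token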

# ===== FEATURE SELECTION =====  
from sklearn.feature_selection import mutual_info_classif, RFE  
from sklearn.linear_model import LogisticRegression, Lasso  

# Correlation Analysis  
corr_with_target = df.corr(numeric_only=True)['target'].abs()
correlated_features = corr_with_target[corr_with_target > 0.5].index

# Mutual Information  
mi_scores = mutual_info_classif(X, y)  
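
# Keep the top-k features by mutual information (sketch; k=5 is arbitrary)
from sklearn.feature_selection import SelectKBest
X_top = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)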

# Recursive Feature Elimination  
selector = RFE(estimator=LogisticRegression(), n_features_to_select=5)  
selector.fit(X, y)  

# L1 Regularization  
lasso = Lasso(alpha=0.1)  
lasso.fit(X, y)  
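
# Features Lasso kept (non-zero coefficients)
lasso_selected = np.flatnonzero(lasso.coef_)

# Feature Importance (tree-based models; a minimal sketch with a random forest)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
importances = rf.feature_importances_  # one score per feature; higher = larger contribution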

Key Notes

  • Replace placeholders (df, X, y) with your actual data.
  • For DL-specific techniques (e.g., positional encoding), refer to libraries like transformers or tensorflow.
  • Use from sklearn.pipeline import Pipeline to chain multiple steps.
