Time Series:
- A time series is a collection of data points recorded in order over time, typically at regular intervals such as every day, week, or month.
- Each point shows how a value changes over time; preserving the time order is essential when working with time series data.
- We can use time series to analyze daily stock prices, energy consumption rates, social media engagement metrics, retail demand, etc.
- Analyzing time series data yields insights like trends, seasonal patterns, and forecasts of future events that can help generate profits.
- For example, companies can plan promotions to maximize sales throughout the year by understanding the seasonal trends in demand for retail products.
Components of a Time Series:
A time series typically consists of the following components:
Trend
A trend represents the long-term progression of the series, i.e., the general direction in which the data is moving over time. It can be upward, downward, or stationary (no trend).
The trend can follow different forms: linear, exponential, or nonlinear.
Seasonality
Seasonality refers to patterns that repeat at regular, fixed intervals due to seasonal factors (e.g., months, quarters, days of the week). These fluctuations are consistent and predictable.
For example, retail sales might peak every December due to holiday shopping.
Cyclic Patterns
Cyclic behavior involves rises and falls in the data not tied to fixed calendar intervals (unlike seasonality). Cycles are typically influenced by economic or business conditions.
Their duration is usually longer than one year and irregular in length.
Random (Irregular) Component / Noise
This represents the unpredictable variation in the data that cannot be attributed to trend, seasonality, or cycles. Often referred to as white noise, this component consists of random shocks or anomalies.
In a well-modeled time series, the residuals (errors) should resemble white noise: independent, identically distributed, and mean zero.
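These components can be visualized directly. Below is a minimal sketch using statsmodels' seasonal_decompose, assuming ts is a pandas Series with a monthly DatetimeIndex (like the AirPassengers series loaded in the implementation section below):
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
# Split the series into trend, seasonal, and residual components
# (model='multiplicative' suits series whose seasonal swings grow with the level)
result = seasonal_decompose(ts, model='multiplicative', period=12)
result.plot()
plt.show()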
Types of Time Series Models:
1. Moving Average (MA) Model
- The Moving Average (MA) model captures the dependency between an observation and the residual errors (random shocks) from previous time steps.
- Despite the name, it is not a rolling-average smoother: rather than averaging past values, it models the current value as a weighted combination of recent forecast errors.
- In an MA model, the current value of the series depends only on the mean and past error terms.
- Mathematical form (MA(q)):
Y_t = μ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ⋯ + θ_q ε_{t−q}
where:
- μ: mean of the series
- ε_t: white noise
- θ_1, θ_2, …, θ_q: MA coefficients
- A lag is the amount of time by which a time series is shifted relative to itself.
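As a quick illustration, here is a minimal sketch that simulates an MA(1) process and recovers its coefficient with statsmodels (the 0.6 coefficient and the random seed are arbitrary assumptions):
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
# Simulate an MA(1) process: Y_t = ε_t + 0.6·ε_{t−1}
rng = np.random.default_rng(42)
eps = rng.normal(size=500)
y = eps[1:] + 0.6 * eps[:-1]
# An MA(q) model is ARIMA with order=(0, 0, q)
ma_fit = ARIMA(y, order=(0, 0, 1)).fit()
print(ma_fit.params)  # estimated constant, θ_1, and noise variance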
2. Autoregressive (AR) Model
- The Autoregressive (AR) model uses the past values of the time series to predict current and future values.
- The assumption is that the current value of the series is a linear function of its previous values.
- Mathematical form (AR(p)):
Y_t = β_0 + β_1 Y_{t−1} + β_2 Y_{t−2} + ⋯ + β_p Y_{t−p} + ε_t
where:
- Y_{t−k}: past observations
- β_k: AR coefficients
- ε_t: white noise
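A minimal fitting sketch, reusing the simulated series y and the imports from the MA sketch above (the order is an illustrative assumption):
# An AR(p) model is ARIMA with order=(p, 0, 0); here we fit AR(2)
ar_fit = ARIMA(y, order=(2, 0, 0)).fit()
print(ar_fit.params)  # estimated constant, β_1, β_2, and noise variance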
3. ARMA Model (Autoregressive Moving Average)
- The ARMA(p, q) model combines both AR and MA components.
- It is suitable for modeling stationary time series data.
- Mathematical form (ARMA(p, q)):
Y_t = β_0 + β_1 Y_{t−1} + ⋯ + β_p Y_{t−p} + ε_t + θ_1 ε_{t−1} + ⋯ + θ_q ε_{t−q}
where:
- p: order of the autoregressive part
- q: order of the moving average part
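In statsmodels, an ARMA(p, q) model is simply ARIMA with d = 0. A minimal sketch with illustrative orders, again reusing the simulated y from above:
# ARMA(2, 1) on a stationary series
arma_fit = ARIMA(y, order=(2, 0, 1)).fit()
print(arma_fit.aic)  # AIC can be used to compare candidate (p, q) orders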
4. ARIMA Model (Autoregressive Integrated Moving Average)
- The ARIMA(p, d, q) model extends ARMA by adding a differencing step to make the series stationary, which is essential for many time series models.
- Suitable for non-stationary univariate time series.
- It combines:
  - AR: autoregression
  - I: integration, i.e., differencing (to remove trend and stabilize the mean)
  - MA: moving average
- Parameters (p, d, q):
  - p: number of autoregressive terms (lags)
  - d: number of differences needed to make the series stationary
  - q: number of moving average terms
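A minimal sketch of the differencing step in pandas, assuming ts is a pandas Series (as loaded in the implementation section below):
# First difference (d=1): Y'_t = Y_t − Y_{t−1}
ts_diff = ts.diff().dropna()
# A second difference (d=2) if the first is still non-stationary
ts_diff2 = ts_diff.diff().dropna()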
Model | Stationarity Required | Based on Past Values | Based on Errors | Uses Differencing |
---|---|---|---|---|
MA(q) | Yes | No | Yes | No |
AR(p) | Yes | Yes | No | No |
ARMA(p,q) | Yes | Yes | Yes | No |
ARIMA(p,d,q) | No | Yes | Yes | Yes |
5. SARIMA Model (Seasonal AutoRegressive Integrated Moving Average):
- An advanced version of ARIMA that is specifically designed to handle time series data with seasonality patterns that repeat at regular intervals (like every week, month, or quarter).
SARIMA = ARIMA + Seasonality
The full model is written as:
SARIMA(p, d, q) × (P, D, Q, s)
Term | Meaning |
---|---|
p | Number of autoregressive terms |
d | Number of differences to make data stationary |
q | Number of moving average terms |
P | Seasonal autoregressive order |
D | Seasonal differencing |
Q | Seasonal moving average order |
s | The seasonal period (e.g. 12 for monthly data with yearly seasonality) |
6. SARIMAX Model (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables):
- It is an extension of the SARIMA model that allows you to include external variables (called exogenous variables) that might help explain or improve your forecast.
- The full model is written as SARIMAX(p, d, q) × (P, D, Q, s), exog = X
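A minimal sketch of passing exogenous regressors, where X and X_future are hypothetical DataFrames of external variables aligned with the training series and the forecast horizon, respectively:
from statsmodels.tsa.statespace.sarimax import SARIMAX
# X: hypothetical exogenous regressors (e.g., promotions), same index as ts
model = SARIMAX(ts, exog=X, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)
# Forecasting requires future values of the exogenous variables
forecast = fit.forecast(steps=12, exog=X_future)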
Stationarity in Time Series:
Stationarity means that a time series’s statistical properties, such as its mean, variance, and autocovariance, remain constant over time.
Why is Stationarity Important?
Stationary processes are easier to model and interpret.
Models like AR, MA, ARMA, and ARIMA are based on the assumption that the series is stationary (or made stationary through differencing).
How to Check for Stationarity?
One way to check is through the Autocorrelation Function (ACF):
Autocorrelation measures how similar a time series is to its past values (lagged versions of itself).
Plotting autocorrelation values for increasing lags gives a correlogram (ACF plot).
What to look for:
In a stationary series, the autocorrelation drops off quickly (usually within a few lags).
In a non-stationary series, autocorrelation declines slowly, suggesting long-term dependency.
Other Ways to Check Stationarity:
Visual inspection: Plot the series — if it shows obvious trends or changing variance, it’s likely non-stationary.
Statistical tests:
Augmented Dickey-Fuller (ADF) test
KPSS test
These tests formally assess whether a series is stationary.
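A minimal sketch of both tests with statsmodels, assuming ts is a pandas Series:
from statsmodels.tsa.stattools import adfuller, kpss
# ADF: null hypothesis = series is NON-stationary (has a unit root)
adf_result = adfuller(ts)
print(f"ADF statistic: {adf_result[0]:.3f}, p-value: {adf_result[1]:.3f}")
# KPSS: null hypothesis = series IS stationary (note the reversed null)
kpss_result = kpss(ts, regression='c')
print(f"KPSS statistic: {kpss_result[0]:.3f}, p-value: {kpss_result[1]:.3f}")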
Autocorrelation:
- Autocorrelation tells us how strongly a time series is related to its past values.
- It checks if there’s a pattern that repeats over time.
Partial Autocorrelation:
- Partial autocorrelation shows the direct connection between today’s value and a specific past value, without the influence of the values in between.
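Both can be plotted with statsmodels; a minimal sketch, assuming ts is a pandas Series (the ACF plot helps choose the MA order q, the PACF plot the AR order p):
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plot_acf(ts, lags=40)   # correlogram: guides the MA order q
plot_pacf(ts, lags=40)  # partial correlogram: guides the AR order p
plt.show()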
Python Implementation for ARIMA & SARIMAX:
- MA, AR, and ARMA models are mostly academic stepping stones toward understanding ARIMA and SARIMAX.
- In addition, SARIMA is a subset of SARIMAX: if we use SARIMAX without any exog variables, it works just like SARIMA.
- So, we will focus on ARIMA & SARIMAX for implementation.
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt
# Loading Data
data = pd.read_csv('/content/drive/MyDrive/AirPassengers.csv', parse_dates=[0], index_col='Month')
ts = data['#Passengers']
# Fit ARIMA model (p, d, q)
model = ARIMA(ts, order=(2, 1, 2))
model_fit = model.fit()
# Forecast next 10 steps
forecast = model_fit.forecast(steps=10)
# Plot original series and forecast
plt.figure(figsize=(10, 5))
plt.plot(ts, label='Original')
plt.plot(forecast.index, forecast, label='Forecast', color='red')
plt.title('ARIMA Forecast')
plt.legend()
plt.show()
# Fit SARIMAX model
sarimax_model = SARIMAX(ts, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
sarimax_fit = sarimax_model.fit(disp=False)
# Forecast next 12 steps
sarimax_forecast = sarimax_fit.forecast(steps=12)
# Plot original series and forecast
plt.figure(figsize=(10, 5))
plt.plot(ts, label='Original')
plt.plot(sarimax_forecast.index, sarimax_forecast, label='SARIMAX Forecast', color='green')
plt.title('SARIMAX Forecast')
plt.legend()
plt.show()
Notes:
order=(2, 1, 2) → for ARIMA
Position | Parameter | Meaning |
---|---|---|
2 | p | Autoregressive (AR) terms — uses 2 past values |
1 | d | Differencing — subtracts previous value once to make the series stationary |
2 | q | Moving Average (MA) terms — uses 2 past error terms |
order=(1, 1, 1), seasonal_order=(1, 1, 1, 12) → for SARIMAX (order is the non-seasonal p, d, q, as above; the table below covers seasonal_order)
Position | Parameter | Meaning |
---|---|---|
1 | P | Seasonal AR (lag of seasonal period, i.e. 12 months) |
1 | D | Seasonal differencing (removes seasonal patterns) |
1 | Q | Seasonal MA (lag of past seasonal residuals) |
12 | s | Seasonal period (12 months for monthly data) |
Prophet Model:
Prophet is an open-source forecasting tool by Facebook (Meta) built for:
Business users and data scientists
Handling seasonality, holidays, and trend changes
Working well out-of-the-box with minimal tuning
The model works as y(t) = trend(t) + seasonality(t) + holiday(t) + error(t)
trend(t): growth (linear or logistic)
seasonality(t): repeating cycles (like yearly or weekly)
holiday(t): impact of known events (e.g., Black Friday)
error(t): unpredictable noise
Python Implementation for Prophet Model:
Required Column Names: ds and y
ds (Date Stamp): This must contain the dates of your time series. Prophet uses this column to understand the timeline and seasonality patterns.
y: This must contain the numeric values (target variable) you want to forecast.
from prophet import Prophet
import pandas as pd
import matplotlib.pyplot as plt
# Prepare our data (Prophet requires the columns: ds = date, y = value)
# 'Month' was set as the index when loading, so move it back into a column first
data = data.reset_index()
data.rename(columns={'Month': 'ds', '#Passengers': 'y'}, inplace=True)
# Fit the model
model = Prophet(yearly_seasonality=True)
model.fit(data)
# Make future dataframe (next 12 months)
future = model.make_future_dataframe(periods=12, freq='M')
# Forecast
forecast = model.predict(future)
# Plot the forecast, then the trend/seasonality components
fig1 = model.plot(forecast)
plt.title('Prophet Model Forecast')
fig2 = model.plot_components(forecast)
plt.show()
Feature Engineering for Time Series:
Lag Features (Past Values):
- Capture the value of the target variable from previous time steps.
- This helps the model understand how the past affects the present.
- Use for
- Tree-based models (XGBoost, LightGBM),
- Neural networks
- Prophet (only indirectly, via external regressors)
# Creating a new feature that shows the previous time step's value.
data['lag_1'] = data['y'].shift(1)
# Creating a new feature that shows value from 12 months ago
data['lag_12'] = data['y'].shift(12)
Rolling/Window Statistics (Smoothing/Trends):
- Calculate values by averaging (or summarizing) over a moving window of past data.
- Use for
- Any machine learning model
- Also useful for exploratory data analysis
# Shift the target down by 1 row so each row sees only past data,
# then average the 3 values immediately before it
data['rolling_mean_3'] = data['y'].shift(1).rolling(window=3).mean()
# Similarly, take the standard deviation of the 6 values immediately before it
data['rolling_std_6'] = data['y'].shift(1).rolling(window=6).std()
Date-Based Features (Calendar):
- Break down the timestamp into separate calendar components.
- Use for
- All models
- Prophet already uses month, day, and year internally, but we can add custom date-based regressors if needed
data['month'] = data['ds'].dt.month
data['dayofweek'] = data['ds'].dt.dayofweek
data['year'] = data['ds'].dt.year
data['is_weekend'] = data['dayofweek'].isin([5, 6]).astype(int) #5 means Saturday, 6 means Sunday
Trend Features (Time Index):
- A simple number that increases over time is useful for capturing long-term growth or decline.
- Use for
- Linear Regression
- Polynomial Regression
- Prophet (if trend isn’t captured automatically)
data['t'] = np.arange(len(data)) # 0, 1, 2, 3, ...
# captures curvature of trend & Introduces a non-linear trend component to the model.
data['t_squared'] = data['t'] ** 2
External Regressors (Additional Inputs):
- Other factors that influence the target variable like promotions, weather, or holidays.
- Use for
- Prophet
- Tree-based models
- Deep learning
# promo_dates is assumed to be a predefined collection of promotion dates
data['promo'] = [1 if x in promo_dates else 0 for x in data['ds']]
model = Prophet()
model.add_regressor('promo')
model.fit(data)
Cross-Validation for Time Series:
- Splitting the data into training and testing sets to evaluate performance without peeking into the future.
# Sklearn Example:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
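# A minimal usage sketch (assuming ts is the series loaded earlier):
# each split trains on an expanding window of past data and tests on the block after it
for train_idx, test_idx in tscv.split(ts):
    train, test = ts.iloc[train_idx], ts.iloc[test_idx]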
# Prophet Example:
from prophet.diagnostics import cross_validation
df_cv = cross_validation(model, initial='730 days', period='180 days', horizon='365 days')
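The cross-validation output can then be summarized with Prophet's built-in performance_metrics helper:
from prophet.diagnostics import performance_metrics
df_metrics = performance_metrics(df_cv)  # MAE, RMSE, MAPE, etc., per horizon
print(df_metrics.head())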
Evaluation Metrics (Measuring Accuracy)
Metric | Description | When to Use |
---|---|---|
MAE (Mean Absolute Error) | Average of absolute errors | Easy to interpret, doesn’t penalize large errors too harshly |
RMSE (Root Mean Squared Error) | Penalizes large errors | Good when large errors are very bad (e.g. inventory planning) |
MAPE (Mean Absolute Percentage Error) | Percentage of error | Great for business, but fails when values are near 0 |
SMAPE (Symmetric MAPE) | Better version of MAPE | Avoids divide-by-zero issue, balances over/under forecasting |
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
# sklearn has no built-in SMAPE, so we define it ourselves
def smape(y_true, y_pred):
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    diff = np.abs(y_pred - y_true) / denominator
    return np.mean(diff) * 100
smape_score = smape(y_true, y_pred)
print(f"SMAPE: {smape_score:.2f}%")
Let's move on to Time Series Forecasting - Machine Learning Models >>>