Deep Learning

  • Deep Learning is a subset of Machine Learning (ML) that utilizes Artificial Neural Networks (ANNs) with multiple layers to learn representations from data automatically.
  • Unlike traditional ML algorithms, which often require manual feature engineering, deep learning models learn directly from raw inputs like images, audio, or text.
  • Deep learning powers today’s most advanced AI applications: voice assistants, image recognition, self-driving cars, recommender systems, and more.

Artificial Neural Network (ANN)

  • An Artificial Neural Network is a mathematical model inspired by the human brain.
  • It consists of interconnected nodes (neurons) organized in layers:
      • Input Layer
      • Hidden Layers
      • Output Layer
  • ANNs learn by adjusting the weights of connections using data, enabling them to recognize patterns and make predictions.
  • ANNs are the foundation of Deep Learning.
  • Deep Learning refers specifically to ANNs with multiple (deep) hidden layers.
  • However, if an ANN has only one hidden layer, it is referred to as a Shallow Neural Network.
  • ANNs can also be used with tabular datasets.

We will learn some terminology now, let’s go…

GPU (Graphics Processing Unit)

  • Originally made for graphics and gaming, but now also used in deep learning because it can process many calculations in parallel.

Deep learning = lots of matrix operations (e.g., multiplying weights, adding biases, backpropagation).

  • A CPU (Central Processing Unit) has a few powerful cores.
  • A GPU has hundreds or thousands of smaller cores, which can work on many operations at once.

That’s perfect for deep learning, where you need to:

  • Train large models (like ANN, CNN, RNN)
  • Use large datasets (images, text, tabular data)
  • Perform fast matrix math
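For intuition, here is a minimal sketch (assuming the PyTorch library is installed) that places this kind of matrix math on a GPU when one is available; the tensor shapes are arbitrary:

```python
# A minimal sketch of running deep-learning matrix math on a GPU (if present).
import torch

# Use the GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A batch of 256 samples with 1,024 features, and a 1,024 x 512 weight matrix.
x = torch.randn(256, 1024, device=device)
w = torch.randn(1024, 512, device=device)
b = torch.randn(512, device=device)

# The core deep-learning operation: multiply by weights, add biases.
z = x @ w + b
print(z.shape, z.device)
```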

Artificial Neuron

  • An artificial neuron is a mathematical function inspired by a biological neuron.
  • It receives inputs, applies weights, adds a bias, and passes the result through an activation function to produce an output.
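A tiny NumPy sketch of one artificial neuron, using made-up inputs, weights, and a sigmoid activation purely for illustration:

```python
# A minimal sketch of a single artificial neuron (illustrative values only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.7, -0.2])   # weights (one per input)
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs plus bias
y = sigmoid(z)                   # activation function produces the output
print(y)
```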

Biological vs Artificial Neuron

Biological Neuron   | Artificial Neuron
--------------------|-------------------
Dendrites           | Inputs
Synapse             | Weights
Soma (nucleus)      | Node/Neuron
Axon                | Output

Perceptron

  • A Perceptron is the fundamental unit of a neural network (NN).
  • It is a computational model that mimics the structure of a biological neuron and plays a critical role in supervised learning, especially for binary classification tasks.

Invented by Frank Rosenblatt in 1957, the perceptron is considered the building block of more complex architectures in Artificial Neural Networks (ANNs).

Core Components of a Perceptron

  1. Input Layer:
    Receives raw data (features) from the domain. This layer does not perform any computation but passes information to the next layer.

  2. Weights:
    Each input is assigned a weight, representing the strength or importance of that input in influencing the output.

  3. Bias:
    An additional term that shifts the activation function, helping the model generalize better and learn more complex patterns.

  4. Hidden Layer (in multilayer perceptrons):
    Performs mathematical transformations on the inputs and passes the results to the output layer.

  5. Activation Function:
    Applies a non-linear transformation to the weighted sum of inputs plus bias. Common activation functions include:

    • Sigmoid

    • ReLU

    • Tanh

  6. Output:
    The final output is typically a binary value (0 or 1), indicating the class to which the input data belongs.

    • If the result ≥ 0.5 → Output = 1

    • If the result < 0.5 → Output = 0

Mathematical Representation

Step 1: Compute the weighted sum of inputs:

z = ∑(wᵢ ⋅ xᵢ) + b = x₁w₁ + x₂w₂ + … + xₙwₙ + b

Step 2: Apply an activation function:

y = f(z) = f(∑ wᵢxᵢ + b)

The activation function ensures that the output is bounded and helps introduce non-linearity into the model.
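The same two steps, sketched in NumPy with illustrative values and a simple step activation:

```python
# A sketch of a single perceptron forward pass (made-up inputs and weights).
import numpy as np

def step(z, threshold=0.0):
    # Step activation: fire (1) if the weighted sum reaches the threshold, else 0.
    return 1 if z >= threshold else 0

x = np.array([1.0, 0.0, 1.0])    # input features
w = np.array([0.6, -0.4, 0.3])   # weights
b = -0.5                         # bias

z = np.dot(w, x) + b             # Step 1: weighted sum of inputs plus bias
y = step(z)                      # Step 2: activation function -> 0 or 1
print(z, y)
```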

Types of Perceptron

Type                          | Description
------------------------------|-------------------------------------------------------------
Single-Layer Perceptron       | Can solve linearly separable problems
Multi-Layer Perceptron (MLP)  | Can solve complex, non-linear problems using multiple layers and advanced activation functions

How a Perceptron Works (Step-by-Step)

  1. Multiply each input by its corresponding weight.
  2. Add all weighted inputs and include the bias term.
  3. Pass the result through an activation function.
  4. Return the output (either 0 or 1 for binary classification).

Learning Mechanism

1. Feedforward Propagation:

  • Data flows from the input layer through hidden layers to the output.
  • Each neuron computes a weighted sum and applies an activation function.
  • During feedforward propagation, the activation function acts as a mathematical “gate” between the input feeding the current neuron and its output going to the next layer.

2. Backpropagation:

  • After predicting the output, the network compares it to the true value using a loss (cost) function.
  • The error is minimized by adjusting weights and biases using optimization algorithms like gradient descent.
  • The gradient of the cost function with respect to each parameter (weights, biases) determines how much to update it, as in the sketch below.
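As a rough illustration, the sketch below performs one feedforward pass and one gradient-descent update for a single sigmoid neuron with a squared-error loss (the data, learning rate, and choice of loss are all assumptions for the example):

```python
# A toy sketch of one gradient-descent update for a single sigmoid neuron.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_true = np.array([1.0, 2.0]), 1.0
w, b, lr = np.array([0.1, -0.3]), 0.0, 0.5

# Feedforward: weighted sum, then activation, then the loss.
z = np.dot(w, x) + b
y_pred = sigmoid(z)
loss = (y_pred - y_true) ** 2

# Backpropagation: chain rule gives the gradient of the loss w.r.t. w and b.
dloss_dy = 2 * (y_pred - y_true)
dy_dz = y_pred * (1 - y_pred)        # derivative of the sigmoid at z
grad_w = dloss_dy * dy_dz * x
grad_b = dloss_dy * dy_dz

# Gradient descent: move each parameter against its gradient.
w -= lr * grad_w
b -= lr * grad_b
print(loss, w, b)
```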

Activation Function

  • An activation function determines whether a neuron should be activated or not.
  • It helps the neural network learn non-linear patterns by applying a mathematical transformation to the neuron’s input.

Role of Activation Function

  • To transform the weighted sum of inputs into an output value, which is then passed to the next layer or used as the final prediction.
  • Introduces non-linearity into the model so that the network can learn complex patterns.
  • Enables forward propagation to transform data at each layer.
  • Without activation functions, the network behaves like a simple linear model, regardless of depth.

Types of Activation Functions

1. Binary Step Function

  • The binary step function is a simple threshold-based activation function. It determines whether a neuron should be activated based on whether the input surpasses a specific threshold.
    • If the input is greater than or equal to the threshold, the neuron is activated (output = 1).
    • If the input is less than the threshold, the neuron is not activated (output = 0), meaning its output will not influence subsequent layers.
  • Limitations:
    • Not differentiable, so it cannot be used with gradient-based optimization techniques like backpropagation.

    • Not suitable for tasks requiring probability estimation or multi-class classification.

  • Mathematical Representation: f(x) = 0 if x < 0; f(x) = 1 if x ≥ 0

2. Linear Activation Function (Identity Function)

  • The linear activation function, also known as the identity function, directly outputs the weighted sum of the input without applying any transformation.

  • The output is a linear function of the input.

  • It is mathematically simple and differentiable, but not suitable for deep learning models.

  • Limitations:

    • The function returns the same value it receives as input, so it adds no non-linearity.

    • The derivative is constant, which prevents backpropagation from effectively updating the weights.

    • Using linear activation across layers results in the model behaving as a single-layer linear model, regardless of depth. This severely limits the learning capacity of the network.

  • Mathematical Representation: f(x) = x

3. Non-Linear Activation Functions

  • Non-linear activation functions introduce the ability to learn complex relationships in data by transforming inputs in a non-linear way.
  • These are widely used in modern deep learning architectures.

Advantages over Linear Activation:

  • Enable backpropagation, since their derivatives depend on input values.

  • Allow networks to stack multiple layers effectively, resulting in a non-linear transformation of the input data across layers.

  • Support learning of non-trivial functions, such as decision boundaries in classification problems.

Common examples:

  • Sigmoid: σ(x) = 1 / (1 + e^(−x))

  • Tanh: tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

  • ReLU: f(x) = max⁡(0,x)

Different Activation Functions Use:

Activation Function               | Typical Use Case
----------------------------------|--------------------------------------------------
Sigmoid                           | Output for binary classification
Tanh                              | Output or hidden layers
Softmax                           | Final output layer for multi-class classification
Linear                            | Output layer for regression
ReLU (Rectified Linear Unit)      | Default for hidden layers
Leaky ReLU                        | Hidden layers where ReLU causes “dead neurons”
PReLU (Parametric ReLU)           | Like Leaky ReLU but learnable
ELU (Exponential Linear Unit)     | Hidden layers, deep models
Swish (by Google)                 | Hidden layers in large models
GELU (Gaussian Error Linear Unit) | Transformers, NLP (e.g., BERT)

Derivative:

  • The derivative of a function tells us how fast the output changes with respect to its input.
  • According to Wikipedia: “In mathematics, the derivative is a fundamental tool that quantifies the sensitivity to change of a function’s output with respect to its input.”

Imagine we’re walking on a hill.

  • The height at each point = value of a function
  • The steepness of the hill = derivative at that point

So:

  • If the hill is flat, the derivative is 0 → no change
  • If it’s steep uphill, the derivative is positive → increasing fast
  • If it’s steep downhill, the derivative is negative → decreasing fast

In machine learning, especially deep learning:

  • We use derivatives to adjust weights in the model during training
  • This is done through backpropagation
  • The derivative tells the network:
    “How should I change this weight to reduce the error?”

Math Example:

Let’s take a function: f(x) = x²

Its derivative is: f′(x) = 2x

Which means:

  • At x=3, the slope is 2×3=6
  • So at that point, the function is increasing rapidly
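A quick numerical sanity check of this example, approximating the derivative with a small finite difference:

```python
# Estimate f'(3) for f(x) = x**2 numerically and compare with the exact derivative 2*x.
def f(x):
    return x ** 2

x, h = 3.0, 1e-6
numerical = (f(x + h) - f(x)) / h   # slope of a tiny secant line near x = 3
exact = 2 * x
print(numerical, exact)             # both are approximately 6
```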

For an activation function, its derivative tells us how the neuron’s output changes with its input, which is crucial during learning.

For example:

  • Sigmoid derivative: σ′(x)=σ(x)⋅(1−σ(x))

This tells us how sensitive the sigmoid output is to small changes in input.

Let’s study some activation functions in detail…

Sigmoid (Logistic) Activation Function

  • The Sigmoid or Logistic activation function is a smooth, S-shaped function that maps any real-valued input to a value between 0 and 1.
  • This makes it ideal for representing probabilities in binary classification tasks.
  • Mathematical Representation: f(x) = 1 / (1 + e^(−x))

    • When x ≫ 0, f(x) approaches 1.

    • When x ≪ 0, f(x) approaches 0.

  • Derivative (Gradient) of the Sigmoid Function:  f′(x) = f(x)⋅(1−f(x)). 

    • This derivative reaches its maximum at x=0 and decreases symmetrically on either side.
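A small sketch that evaluates the sigmoid and its derivative at a few points, which also previews the vanishing-gradient behaviour discussed next:

```python
# Evaluate the sigmoid and its derivative to see how fast the gradient shrinks away from 0.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

for x in [-10, -3, 0, 3, 10]:
    print(x, round(sigmoid(x), 4), round(sigmoid_derivative(x), 4))
# The derivative peaks at 0.25 when x = 0 and is nearly 0 once |x| reaches 10.
```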

Why does Gradient Vanishing Happen?

Consider the graph of the sigmoid function and its derivative:

  • The sigmoid curve is a smooth S-shape.
  • Its derivative peaks at x=0 and quickly approaches 0 as x→±∞.
  • The gradient (slope) is significant only in the range of approximately -3 to +3.
  • Outside this range, the sigmoid curve becomes almost flat, and its derivative approaches zero.

This means:

  • Inputs much larger than 3 or much smaller than -3 will have almost no gradient.
  • When used in deep networks, especially in hidden layers, these small gradients slow or stop learning, leading to the Vanishing Gradient Problem.

Where not useful:

  • When the gradient is near zero, weights receive very small updates during backpropagation, preventing the model from learning effectively.
  • This is why sigmoid is rarely used in hidden layers of deep networks today.

Where still useful:

  • In the output layer of binary classification problems (to model probability)
  • In logistic regression
  • In some recurrent networks or gating mechanisms (e.g., LSTMs)

Tanh Function (Hyperbolic Tangent)

  • The Tanh function is a widely used activation function in machine learning and deep learning.
  • It’s similar in shape to the sigmoid function but scaled and shifted to be zero-centered, which gives it some important advantages.
  • Mathematical Representation: f(x) = tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
    • Output range: (-1, 1)
    • At x=0, tanh⁡(0)=0
  • Graph Behavior:
    • For large positive x, tanh⁡(x)→1
    • For large negative x, tanh⁡(x)→−1
    • S-shaped curve (like sigmoid), but centered at 0
  • Derivative: f′(x) = 1 − tanh²(x)
    • Like sigmoid, the gradient is strongest near 0
    • Gradient becomes small when x→±∞, leading to vanishing gradients in deep networks
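A brief NumPy sketch of tanh and its derivative at a few sample points:

```python
# Evaluate tanh and its derivative using NumPy's built-in tanh.
import numpy as np

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

for x in [-5, -1, 0, 1, 5]:
    print(x, round(float(np.tanh(x)), 4), round(float(tanh_derivative(x)), 4))
# Outputs range from -1 to 1, and the gradient is strongest near x = 0.
```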

Advantages of the tanh activation function:

  • Its output is zero-centered (ranging from -1 to 1), allowing the network to naturally represent negative, neutral, and positive activations.
  • Being zero-centered helps the hidden layers produce outputs with mean close to zero, which stabilizes and speeds up learning in subsequent layers.
  • Compared to sigmoid, tanh has a steeper gradient, improving learning dynamics.

Where NOT useful:

  • Deep networks with many layers: Because it still suffers from vanishing gradients, tanh can slow down learning in very deep networks.
  • Output layers for classification tasks: For binary classification, sigmoid or softmax are preferred for probability outputs, not tanh.
  • When computational efficiency is critical: Newer activation functions like ReLU and its variants are computationally cheaper and usually perform better in deep networks.

Where useful:

  • Hidden layers of small to medium-sized neural networks: Because it produces zero-centered outputs, it helps with faster and more stable training compared to sigmoid.
  • Recurrent Neural Networks (RNNs): Commonly used in RNNs and LSTMs where outputs need to capture negative and positive values smoothly.
  • When data requires capturing both positive and negative signals:  Since outputs range from -1 to 1, tanh models relationships that involve negative correlations better than sigmoid.

ReLU Activation Function

  • ReLU stands for Rectified Linear Unit.
  • Although it looks linear for positive inputs, ReLU is a non-linear function; it has a well-defined derivative (except at x = 0), allows backpropagation, and is computationally efficient.
  • The main catch is that ReLU does not activate all neurons at the same time.
  • A neuron is deactivated (outputs 0) only when the output of the linear transformation is less than 0.

Mathematical representation: f(x) = max(0, x)

Derivative: f′(x) = 1 if x > 0; 0 if x ≤ 0
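A minimal element-wise sketch of ReLU and its derivative in NumPy (treating the derivative at exactly 0 as 0, a common convention):

```python
# ReLU and its derivative applied element-wise.
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # The derivative at exactly x = 0 is treated as 0 here (a convention).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # negative inputs become 0; positive inputs pass through unchanged
print(relu_derivative(x))  # gradient is 0 for x <= 0 and 1 for x > 0
```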

Advantages of ReLU are as follows:

  • Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions.
  • ReLU accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.

Limitations:

  • Dying ReLU problem: Neurons can become inactive permanently if they only output zero, causing some parts of the network to stop learning.

Where to use:

  • Deep neural networks, especially convolutional networks and feedforward networks, due to fast training and better gradient flow.
  • Situations where computational efficiency is important.

Where not to use:

  • Output layers requiring probabilities (use sigmoid or softmax instead).
  • Networks prone to dying ReLU issues; alternatives like Leaky ReLU or ELU may perform better there.

What is the Dying ReLU Problem?

  • In ReLU, the output is zero for all negative input values: f(x) = max(0, x).
  • If a neuron’s input consistently falls below 0, the output becomes 0, and the gradient is also 0 during backpropagation.
  • This means no weight updates happen for that neuron; it’s effectively “dead.”
  • Once a neuron dies, it may never reactivate, especially in deep networks with poor weight initialization or high learning rates.

Why is it a Problem

  • Reduces the model’s capacity to learn since dead neurons don’t contribute.
  • Can lead to underfitting or poor performance.
  • Especially problematic when many neurons die early in training.

Solutions / Alternatives:

  1. Leaky ReLU: Allows a small gradient when the input is negative.

  2. Parametric ReLU (PReLU): Like Leaky ReLU, but lets the network learn the slope for negative values.

  3. ELU / SELU: Exponential functions that avoid flat zero regions and improve learning.

Leaky ReLU Activation Function

  • Leaky ReLU (Leaky Rectified Linear Unit) is a variation of the standard ReLU function designed to address the dying ReLU problem, where neurons output zero for all inputs and stop learning.
  • Mathematical Formula: f(x) = x if x > 0, or α·x if x ≤ 0,

where α is a small constant (e.g., 0.01), allowing a small, non-zero gradient when the unit is not active.

  • Derivative: f′(x) = 1 if x > 0, or α if x ≤ 0
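A short NumPy sketch of Leaky ReLU and its derivative, with α = 0.01 assumed:

```python
# Leaky ReLU and its derivative with a fixed negative slope alpha.
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))             # negative inputs are scaled by alpha, not zeroed out
print(leaky_relu_derivative(x))  # gradient is alpha (not 0) for x <= 0
```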

Advantages:

  • Solves dying ReLU issue: Allows negative inputs to have a small gradient, so neurons remain active during training.
  • Better gradient flow: Unlike standard ReLU, the backward pass won’t zero out gradients entirely for negative activations.
  • Simple and efficient: Maintains the computational efficiency of ReLU while improving robustness.

Limitations:

  • α is fixed: The negative slope is typically hardcoded and not learned, which might not be optimal for all tasks.

Where to Use:

  • Deep neural networks where ReLU units show signs of dying.
  • When training stability or convergence is affected by standard ReLU.
  • Computer vision models, CNNs, and dense feedforward networks.

Where Not Ideal:

  • In output layers requiring bounded values or probabilistic outputs (use softmax or sigmoid there).
  • If you want the model to learn the negative slope, consider Parametric ReLU (PReLU) instead.

Parametric ReLU (PReLU)

  • Parametric ReLU (PReLU) is an advanced variant of the Rectified Linear Unit (ReLU) activation function used in neural networks.
  • Unlike ReLU, which has a zero slope for negative input values, PReLU introduces a learnable parameter (whereas Leaky ReLU has a fixed parameter) to adapt the slope for negative values during training.
  • This enables the model to learn the optimal activation behavior dynamically, rather than relying on a fixed rule.

Mathematical Formula: f(x) = x if x > 0, or a·x if x ≤ 0

Where:

  • x is the input.
  • a is a learnable parameter (slope of the negative part), optimized during training.

Derivative: f′(x) = 1 if x > 0, or a if x ≤ 0

  • The derivative of PReLU is important for backpropagation.
  • The value of a is updated using gradient descent along with other model parameters.
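One way to see the learnable slope in practice is PyTorch’s built-in nn.PReLU (a sketch assuming PyTorch is installed; 0.25 is its default initial slope):

```python
# PReLU in PyTorch: the negative slope is a parameter learned during training.
import torch
import torch.nn as nn

prelu = nn.PReLU(num_parameters=1, init=0.25)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
y = prelu(x)
print(y)                         # negative inputs are scaled by the current slope a
print(list(prelu.parameters()))  # the slope 'a' is updated by gradient descent like any weight
```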

Advantages of PReLU

  1. Learnable Negative Slope: Unlike ReLU or Leaky ReLU, PReLU allows the network to learn the most suitable negative slope, improving accuracy.
  2. Solves Dying ReLU Problem: Since the slope in the negative region isn’t zero, neurons are less likely to become inactive.

Limitations of PReLU

  1. Overfitting Risk: The added parameter a increases model complexity, potentially leading to overfitting, especially with small datasets.
  2. Computational Cost: Slightly more expensive than ReLU due to parameter updates.

Where to Use PReLU

  • Deep neural networks: Especially convolutional neural networks (CNNs) for image classification (e.g., ResNet variants).
  • When encountering dying ReLU: PReLU can maintain gradient flow better than standard ReLU.

Where Not to Use PReLU

  • On small models or datasets with high overfitting risk, simpler activations like ReLU are safer.
  • On tabular data models, where complexity from learnable activations isn’t usually necessary.
  • When interpretability is key, as dynamic activations can complicate model behavior.

Comparison between Leaky ReLU & PReLU

Comparing Leaky ReLU (with a fixed slope a = 0.1) and PReLU (with a learnable slope, e.g., a = 0.3), as in the plotting sketch after this list:

  • Both functions behave the same for positive x (linear).
  • For negative x, Leaky ReLU has a smaller fixed slope, while PReLU can adapt to a steeper or shallower slope based on training.
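A plotting sketch (assuming Matplotlib is installed) that reproduces this comparison; the slopes 0.1 and 0.3 are illustrative values:

```python
# Plot Leaky ReLU (fixed slope) against PReLU (a slope it could have learned).
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
leaky = np.where(x > 0, x, 0.1 * x)   # fixed negative slope
prelu = np.where(x > 0, x, 0.3 * x)   # example of a learned negative slope

plt.plot(x, leaky, label="Leaky ReLU (a = 0.1)")
plt.plot(x, prelu, label="PReLU (a = 0.3, learned)")
plt.axhline(0, color="gray", linewidth=0.5)
plt.axvline(0, color="gray", linewidth=0.5)
plt.legend()
plt.title("Leaky ReLU vs. PReLU")
plt.show()
```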

Exponential Linear Unit (ELU)

  • ELU is an activation function designed to combine the benefits of ReLU (fast convergence) and smooth gradients for negative inputs.
  • It was introduced to improve learning characteristics and convergence speed in deep neural networks.

Mathematical Formula: f(x) = x if x > 0, or α·(exp(x) − 1) if x ≤ 0

Where:

  • α is a hyperparameter that defines the saturation point for negative inputs (commonly set to 1.0).

  • x is the input.

Derivative of ELU: f′(x) = 1 if x > 0, or f(x) + α if x ≤ 0

Note: The derivative for negative x reuses the function value, which makes it efficient in practice.
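A small NumPy sketch of ELU and its derivative, assuming the common default α = 1.0:

```python
# ELU and its derivative; for x <= 0 the derivative reuses the function value.
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def elu_derivative(x, alpha=1.0):
    # For x <= 0: f'(x) = alpha * exp(x) = f(x) + alpha.
    return np.where(x > 0, 1.0, elu(x, alpha) + alpha)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(x))             # negative inputs saturate smoothly toward -alpha
print(elu_derivative(x))  # gradient stays non-zero for negative inputs
```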

Advantages of ELU

  1. Smooth Curve: Unlike ReLU, ELU is smooth around zero, which helps reduce sharp changes in gradients.
  2. No Dying ReLU Problem: Gradient is non-zero for x<0, so neurons continue learning.
  3. Mean Activation Closer to Zero: Helps with faster learning by reducing internal covariate shift.

Limitations of ELU

  1. Computationally Expensive: Uses exp(x), which is slower than simple multiplications in ReLU/PReLU.
  2. Alpha Tuning Required: The value of α can affect performance and may require tuning.

Where to Use ELU

  • Deep networks, especially when smooth gradients are beneficial.
  • CNNs or RNNs where ReLU struggles.
  • When zero-centered outputs improve convergence.

Where Not to Use ELU

  • On resource-constrained devices where exp() is costly.
  • In shallow models where ReLU works just as well.
  • When negative activations are not appropriate (e.g., certain output layers).

Softmax Function

Swish

Gaussian Error Linear Unit (GELU)

Scaled Exponential Linear Unit (SELU)

Types of ANN & Chronology
