Convolutional Neural Network
- A Convolutional Neural Network (CNN), sometimes called a ConvNet, is a special type of artificial neural network designed specifically for working with image data and other data with a grid-like structure (such as audio spectrograms).
- CNNs are particularly powerful in image recognition and classification tasks.
- It is a class of deep learning models characterized by the presence of one or more convolutional layers, typically accompanied by subsampling or pooling operations, followed by one or more fully connected layers, similar to those found in traditional multilayer perceptrons.
- The architecture of a Convolutional Neural Network (CNN) is specifically designed to utilize the two-dimensional (2D) structure of input data, such as images or spectrograms derived from speech signals, by recognizing patterns in small regions and applying the same filters across the entire input to maintain spatial relationships.
- This is done by connecting each neuron to only a small region of the input and using the same weights for multiple regions.
- Then, a pooling step reduces the size of the data while keeping important information, helping the network recognize features even if they appear in different parts of the input.
- Another advantage of CNNs is that they are easier to train and use fewer parameters than regular fully connected networks.
- This means they need less memory and take less time to learn, while still performing well.
Why CNN Over Traditional Neural Networks?
Traditional fully connected networks fail to scale with image data due to high dimensionality and loss of spatial structure.
For example, a 224×224 RGB image has 224 × 224 × 3 = 150,528 input features, so even a single fully connected layer on top of it would need an enormous number of parameters (see the sketch after the list below).
CNNs solve this using three key ideas:
- Local connectivity
- Shared weights
- Pooling
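As a rough back-of-the-envelope sketch (layer sizes chosen purely for illustration, not taken from any specific model), we can compare the parameter count of one fully connected layer against one convolutional layer on a 224×224×3 input:
# Illustrative parameter counts (hypothetical layer sizes)
# One fully connected layer mapping all 150,528 input features to 1,000 hidden units
fc_params = (224 * 224 * 3) * 1000 + 1000   # weights + biases ≈ 150.5 million
# One convolutional layer with 64 filters of size 3×3 sliding over the same RGB input
conv_params = (3 * 3 * 3) * 64 + 64         # weights + biases = 1,792
print(fc_params, conv_params)               # 150529000 1792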
CNN Architecture Overview
We will review some key terminology and the CNN architecture in detail.
Pixel
- An image is made of a grid of pixels.
- Each pixel holds brightness (for grayscale) or color information (for RGB).
- Grayscale image:
- Each pixel is a number between 0 (black) and 255 (white).
- Middle values like 128 are shades of gray.
- RGB (color) image:
- Each pixel is a triplet: [Red, Green, Blue], with each value ranging from 0 to 255.
- Example:
- [255, 0, 0] → Red
- [0, 255, 0] → Green
- [0, 0, 255] → Blue
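As a minimal sketch (NumPy is used here only for illustration, and the array values are made up), this is how those pixel values look as arrays:
import numpy as np

# A 2×2 grayscale image: 0 = black, 255 = white, values in between are shades of gray
gray = np.array([[0, 128],
                 [200, 255]], dtype=np.uint8)

# A 1×3 RGB image: one red, one green, and one blue pixel
rgb = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)

print(gray.shape)   # (2, 2)
print(rgb.shape)    # (1, 3, 3) -> height, width, channels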
Convolution: The Foundation of CNNs
- The Convolution Layers are the initial layers in a Convolutional Neural Network (CNN) designed to pull out features from images such as edges, textures, and patterns.
- These layers maintain the spatial relationship between pixels by applying learnable filters that extract local patterns from small regions of the input image.
- Convolution is a mathematical operation that involves sliding a small filter (also called a kernel) over the input image and computing a dot product between two inputs:
- An image matrix (the input data)
- A kernel or filter (a smaller matrix)
- This operation helps extract features like:
- Edges
- Corners
- Patterns
Convolution Formula
For a 2D image I and a filter K, the output feature map S is:
S(i, j) = (I ∗ K)(i, j) = Σₘ Σₙ I(i + m, j + n) · K(m, n)
Where:
- S(i, j) = Output pixel value at position (i, j)
- I(i + m, j + n) = Input image pixel at position (i + m, j + n), i.e., the pixel under the filter
- K(m, n) = Kernel/filter value at position (m, n)
How Convolution Works
Step 1: Filter (Kernel)
- A filter is a small matrix of weights (e.g., 3×3). In a CNN these weights are learned during training; a classic hand-crafted example is a vertical-edge detector such as [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]].
Step 2: Slide the Filter Over the Image
- You move the filter over the image from left to right, top to bottom.
- Each time, you focus on a patch of the image the same size as the filter.
Step 3: Multiply and Sum (Dot Product)
At each step:
- Multiply each value in the filter by the corresponding pixel in the image patch.
- Then sum them all to get one number in the output image (feature map).
- This output shows where the pattern (e.g., edge) is strong.
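Here is a minimal NumPy sketch of these three steps, using a made-up 4×4 image and the vertical-edge filter mentioned in Step 1 (strictly speaking this computes cross-correlation, which is what CNN libraries actually implement):
import numpy as np

image = np.array([[10, 10, 80, 80],
                  [10, 10, 80, 80],
                  [10, 10, 80, 80],
                  [10, 10, 80, 80]], dtype=float)   # dark on the left, bright on the right

kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)        # vertical-edge filter

H, W = image.shape
F = kernel.shape[0]
output = np.zeros((H - F + 1, W - F + 1))

for i in range(H - F + 1):
    for j in range(W - F + 1):
        patch = image[i:i + F, j:j + F]        # Step 2: a patch the same size as the filter
        output[i, j] = np.sum(patch * kernel)  # Step 3: multiply element-wise and sum

print(output)
# [[210. 210.]
#  [210. 210.]]  -> large values wherever the vertical edge falls under the filter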
Output Dimension Formula (Without Padding):
For a 2D input image:
- Output Height=⌊(H−F) / S+1⌋
- Output Width=⌊(W−F) / S+1⌋
Output Dimension Formula (With Padding):
For a 2D input image:
- Output Height=⌊(H−F + 2P) / S+1⌋
- Output Width=⌊(W−F + 2P) / S+1⌋
Where:
Symbol | Meaning |
---|---|
H | Input image height |
W | Input image width |
F | Filter (kernel) size (assuming square, e.g., 3×3) |
S | Stride (how many steps the filter moves each time) |
P | Padding |
⌊⋅⌋ | Floor: round down to the nearest whole number |
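A small helper (a sketch, not part of the original formulas, but a direct translation of them) makes it easy to sanity-check output sizes:
import math

def conv_output_size(H, W, F, S=1, P=0):
    # Applies the output dimension formula above to both spatial dimensions
    out_h = math.floor((H - F + 2 * P) / S) + 1
    out_w = math.floor((W - F + 2 * P) / S) + 1
    return out_h, out_w

print(conv_output_size(28, 28, 3))         # (26, 26) -> no padding, stride 1
print(conv_output_size(28, 28, 3, S=2))    # (13, 13) -> stride 2 skips every other position
print(conv_output_size(28, 28, 3, P=1))    # (28, 28) -> padding of 1 keeps the size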
Stride
Stride is the number of pixels the filter (kernel) moves each time as it slides across the image.
- Stride = 1 → The filter moves 1 pixel at a time (default).
- Stride = 2 → The filter moves 2 pixels at a time, skipping every other position.
A larger stride means:
- The filter covers fewer positions
- The output (feature map) becomes smaller, i.e., less detail is preserved
A smaller stride does the opposite.
Padding
Padding means adding extra pixels (usually zeros) around the border of the input image before applying the convolution.
It solves two main problems:
- Shrinking size: Every convolution shrinks the image. In an image classification network with several convolutional layers, the feature maps would keep shrinking at every step, which we often don't want.
- Edge loss: The filter covers center pixels more often than edge pixels, so the edges get less attention.
Why Use Padding?
When we don’t use padding:
- The output becomes smaller after every convolution.
- Important information from edge pixels may be lost.
- The model becomes biased toward the center of the image.
When we use padding:
- The filter can slide over edges properly
- The output image stays the same size (if needed)
- The network can go deeper without shrinking the feature maps too much
Two Main Types of Padding
Padding Type | Description | Output Size |
---|---|---|
Valid | No padding (P=0) | Smaller |
Same | Adds padding so output = input size | Same size |
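A quick Keras sketch comparing the two modes (the output shapes follow from the formulas above; the filter count of 8 is arbitrary):
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 1))                  # one 28×28 grayscale image

valid = tf.keras.layers.Conv2D(8, 3, padding='valid')(x)
same = tf.keras.layers.Conv2D(8, 3, padding='same')(x)

print(valid.shape)   # (1, 26, 26, 8) -> shrinks by F − 1 = 2 pixels in each dimension
print(same.shape)    # (1, 28, 28, 8) -> zero padding keeps the spatial size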
Pooling Layer
- Pooling is like shrinking an image by keeping only the important parts.
- It helps reduce the size of the feature maps and the number of calculations, making our model faster and less likely to overfit.
Purpose of Pooling
- Reduces spatial dimensions (width & height)
- Keeps important information, throws away irrelevant details
- Adds invariance — small shifts or noise in the image won’t change the output much
Types of Pooling
There are two main types of pooling:
1. Max Pooling
- Takes the maximum value in each region (window) of the feature map.
- Keeps the strongest features
- Removes weak signals or noise
- Often works better in practice than average pooling
- Example:
Say we have this 2×2 region:
[1, 3]
[2, 4]
Max Pooling Output = 4
(the highest number)
2. Average Pooling
- Takes the average of the numbers in the region.
- Example:
Same region:
[1, 3]
[2, 4]
Average Pooling Output = (1 + 2 + 3 + 4) / 4 = 2.5
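Both results can be verified with Keras pooling layers (a minimal sketch; the 2×2 region is fed in as a 1×2×2×1 tensor because pooling layers expect batch and channel dimensions):
import tensorflow as tf

region = tf.constant([[1., 3.],
                      [2., 4.]])
region = tf.reshape(region, (1, 2, 2, 1))   # batch, height, width, channels

max_out = tf.keras.layers.MaxPooling2D(pool_size=2)(region)
avg_out = tf.keras.layers.AveragePooling2D(pool_size=2)(region)

print(float(tf.squeeze(max_out)))   # 4.0
print(float(tf.squeeze(avg_out)))   # 2.5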
Pooling Layer Parameters
Just like convolution, pooling has:
- Filter size (F) — common: 2×2
- Stride (S) — how far the filter moves — common: 2
- No padding (usually)
Output Dimension Formula :
For a 2D input image:
- Output Height=⌊(H−F) / S+1⌋
- Output Width=⌊(W−F) / S+1⌋
Where:
- H, W = input dimensions
- F = pooling filter size
- S = stride
Python Implementation of a Convolutional Neural Network (CNN)
- Training a simple Convolutional Neural Network (CNN) to classify Fashion MNIST images.
- The Fashion MNIST dataset contains 70,000 grayscale images of fashion items, categorized into 10 classes with 7,000 images in each class.
- The dataset is divided into 60,000 training images and 10,000 testing images. The classes are mutually exclusive, and there is no overlap between them.
# Import libraries
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint
# Load Fashion MNIST dataset
(x_train, y_train), (x_test, y_test) = datasets.fashion_mnist.load_data()
# Preprocess data: reshape to (28, 28, 1) and scale pixel values from 0-255 to 0-1
x_train = x_train.reshape((-1, 28, 28, 1)).astype('float32') / 255.0
x_test = x_test.reshape((-1, 28, 28, 1)).astype('float32') / 255.0
# Class names for display
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
# Visualize first 20 images
plt.figure(figsize=(10, 2))
for i in range(20):
plt.subplot(2, 10, i + 1)
plt.imshow(x_train[i].reshape(28, 28), cmap='gray')
plt.xticks([])
plt.yticks([])
plt.xlabel(class_names[y_train[i]])
plt.savefig('fashion_mnist.png')
plt.show()
# Let's check the total number of classes
print('Total output classes= ',len(np.unique(y_train)))
Total output classes= 10
- Since we found 10 output classes in total, the final dense layer will have 10 outputs (multiclass classification)
# One-hot encode
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# Split validation set
x_train, x_valid = x_train[5000:], x_train[:5000]
y_train, y_valid = y_train[5000:], y_train[:5000]
print('x_train:', x_train.shape)
print('x_valid:', x_valid.shape)
print('x_test :', x_test.shape)
Output:
x_train: (55000, 28, 28, 1)
x_valid: (5000, 28, 28, 1)
x_test : (10000, 28, 28, 1)
One-hot encoding:
Converts class labels (like 0, 1, …, 9) into one-hot encoded vectors.
Example: If a label was 3, after one-hot encoding: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] # 1 at index 3
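A tiny sketch of that round trip, reusing the to_categorical and numpy imports from the code above:
label = 3
one_hot = to_categorical(label, num_classes=10)
print(one_hot)              # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
print(np.argmax(one_hot))   # 3 -> argmax recovers the original class index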
# Build CNN model
model = models.Sequential([
layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.Dropout(0.3),
layers.Flatten(),
layers.Dense(64, activation='relu'),
layers.Dropout(0.3),
layers.Dense(10, activation='softmax')
])
model.summary()
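Working through the output-size formulas from earlier, the spatial shapes that model.summary() should report look roughly like this (a sketch; dropout layers do not change the shape, and exact parameter counts are left to the summary itself):
# Conv2D(32, 3×3), valid padding: 28×28×1  -> 26×26×32
# MaxPooling2D(2×2):              26×26×32 -> 13×13×32
# Conv2D(64, 3×3):                13×13×32 -> 11×11×64
# MaxPooling2D(2×2):              11×11×64 -> 5×5×64
# Conv2D(64, 3×3):                5×5×64   -> 3×3×64
# Flatten:                        3×3×64   -> 576 features
# Dense(64, relu) -> Dense(10, softmax)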
Dropout
- Dropout is a regularization technique that helps prevent overfitting in neural networks — including CNNs.
During training, dropout:
- Randomly turns off (sets to 0) a percentage of neurons in a layer.
- This is done on each training batch, so the “off” neurons change every time.
- You specify how much to drop — e.g., Dropout(0.3) means 30% of neurons are “dropped” at each step.
Flatten
- Flatten is a layer that converts a multi-dimensional tensor (e.g., a feature map) into a 1D vector.
- After convolutional and pooling layers, your data is in a 3D shape like:
(batch_size, height, width, channels)
e.g., (None, 7, 7, 64)
- But a Dense (fully connected) layer needs 1D input per sample:
(batch_size, features)
e.g., (None, 3136)
So we use Flatten() to reshape the data into a 1D vector per sample.
# Early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(
monitor="val_loss",
patience=10,
verbose=1,
restore_best_weights=True
)
EarlyStopping
- EarlyStopping is a callback in TensorFlow/Keras that stops training automatically when the model stops improving on the validation set.
When training a neural network:
- It may start to overfit after a certain number of epochs (i.e., learning noise instead of useful patterns).
- You don’t want to waste time training for 100 epochs if your model is already at its best by epoch 20.
- So EarlyStopping watches a metric (like val_loss) and stops when things stop getting better.
# Model checkpoint
checkpointer = ModelCheckpoint(
filepath='fashion_model_best.keras',
verbose=1,
save_best_only=True,
save_weights_only=False
)
ModelCheckpoint
- ModelCheckpoint is a callback in Keras that automatically saves your model during training, typically when it's performing at its best.
We use it:
- To save the best model during training, so we can use it later for predictions or deployment.
- So that if training crashes, we don't lose progress.
- Together with EarlyStopping, to keep the best version of the model from before overfitting starts.
# Compile model
model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
Why Compile model?
- model.compile() is a required step in Keras before training the model. It configures how the model will learn by specifying:
- Optimizer – How the model updates weights
- Loss function – What the model tries to minimize
- Metrics – How to evaluate the model’s performance
# Train the model
history = model.fit(x_train, y_train,
batch_size=32,
epochs=10,
validation_data=(x_valid, y_valid),
callbacks=[checkpointer, early_stopping],
verbose=2,
shuffle=True)
Epoch 1: val_loss improved from inf to 0.39301, saving model to fashion_model_best.keras
1719/1719 - 29s - 17ms/step - accuracy: 0.7700 - loss: 0.6278 - val_accuracy: 0.8552 - val_loss: 0.3930
Epoch 2/10
Epoch 2: val_loss improved from 0.39301 to 0.36378, saving model to fashion_model_best.keras
1719/1719 - 27s - 16ms/step - accuracy: 0.8584 - loss: 0.3990 - val_accuracy: 0.8756 - val_loss: 0.3638
Epoch 3/10
Epoch 3: val_loss improved from 0.36378 to 0.29245, saving model to fashion_model_best.keras
1719/1719 - 42s - 24ms/step - accuracy: 0.8765 - loss: 0.3496 - val_accuracy: 0.8946 - val_loss: 0.2924
Epoch 4/10
Epoch 4: val_loss improved from 0.29245 to 0.28631, saving model to fashion_model_best.keras
1719/1719 - 27s - 16ms/step - accuracy: 0.8855 - loss: 0.3287 - val_accuracy: 0.9040 - val_loss: 0.2863
Epoch 5/10
Epoch 5: val_loss did not improve from 0.28631
1719/1719 - 41s - 24ms/step - accuracy: 0.8915 - loss: 0.3154 - val_accuracy: 0.8924 - val_loss: 0.3112
Epoch 6/10
Epoch 6: val_loss did not improve from 0.28631
1719/1719 - 40s - 23ms/step - accuracy: 0.8924 - loss: 0.3136 - val_accuracy: 0.8994 - val_loss: 0.2931
Epoch 7/10
Epoch 7: val_loss did not improve from 0.28631
1719/1719 - 27s - 16ms/step - accuracy: 0.8910 - loss: 0.3163 - val_accuracy: 0.9054 - val_loss: 0.3048
Epoch 8/10
Epoch 8: val_loss did not improve from 0.28631
1719/1719 - 43s - 25ms/step - accuracy: 0.8901 - loss: 0.3214 - val_accuracy: 0.8974 - val_loss: 0.3612
Epoch 9/10
Epoch 9: val_loss did not improve from 0.28631
1719/1719 - 39s - 23ms/step - accuracy: 0.8894 - loss: 0.3298 - val_accuracy: 0.8874 - val_loss: 0.3308
Epoch 10/10
Epoch 10: val_loss did not improve from 0.28631
1719/1719 - 41s - 24ms/step - accuracy: 0.8887 - loss: 0.3357 - val_accuracy: 0.8954 - val_loss: 0.3332
Restoring model weights from the end of the best epoch: 4.
# Load best model (automatically includes weights + optimizer state)
model = tf.keras.models.load_model('fashion_model_best.keras')
# Predict test set
y_hat = model.predict(x_test)
# Plot predictions
fig = plt.figure(figsize=(20, 8))
for i, idx in enumerate(np.random.choice(x_test.shape[0], size=32, replace=False)):
ax = fig.add_subplot(4, 8, i + 1, xticks=[], yticks=[])
ax.imshow(x_test[idx].reshape(28, 28), cmap='gray')
pred_idx = np.argmax(y_hat[idx])
true_idx = np.argmax(y_test[idx])
ax.set_title("{} ({})".format(class_names[pred_idx], class_names[true_idx]),
color=("blue" if pred_idx == true_idx else "red"))
# Accuracy plot
plt.figure()
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.5, 1])
plt.show()
# Evaluate test set
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print("Test accuracy:", test_acc)
Let’s check our model with an external image uploaded from my device
import matplotlib.image as img
# Image path from my device
img_path = '/content/T Shirt.jpg'
# See actual color image
myimage = img.imread(img_path)
plt.imshow(myimage)
plt.axis('off')
plt.show()
# Load and preprocess the image from color to grayscale
img = tf.keras.preprocessing.image.load_img(img_path, target_size=(28, 28), color_mode="grayscale")
img_array = tf.keras.preprocessing.image.img_to_array(img)
img_array = img_array.reshape((1, 28, 28, 1)) / 255.0
# Predict using the model
predictions = model.predict(img_array)
predicted_index = np.argmax(predictions[0])
predicted_class = class_names[predicted_index]
# Plot the image with predicted label
plt.figure(figsize=(4, 4))
plt.imshow(img_array[0].reshape(28, 28), cmap='gray')
plt.title(f"Predicted: {predicted_class}", fontsize=14)
plt.savefig('predicted_image.png')
plt.axis('off')
plt.show()
- We can see that our model predicted my t-shirt image as a Pullover. Although this is not correct, a pullover is close to a t-shirt in appearance. If we improve our model, it might give the correct prediction.