Attention Mechanism

  • An attention mechanism is a method in deep learning that helps models focus on the most important parts of the input data, especially for long or complex sequences.
  • The introduction of attention mechanisms laid the foundation for the transformer architecture, which now powers state-of-the-art large language models (LLMs) like ChatGPT (Vaswani et al., 2017).
  • Attention mechanisms mimic how humans focus on important information and ignore the rest, helping models highlight key parts of input while saving memory and processing power.
  • Attention mechanisms give different importance (weights) to input parts, and the model learns these weights during training to focus on what matters most for the task.
  • Bahdanau et al. (2014) introduced attention to improve RNNs for translation, and later it was applied to CNNs for tasks like image captioning and visual question answering.
  • The 2017 paper “Attention is All You Need” introduced the transformer model, which uses only attention and feedforward layers and now powers today’s leading generative AI models.
  • Although attention is best known from NLP and LLMs, it is also widely used in vision: diffusion models for image generation commonly use attention mechanisms to improve image quality and detail, and Vision Transformers have advanced tasks like object detection and segmentation.

Importance of Attention Mechanism

Transformer models, powered by attention mechanisms, have set new benchmarks across deep learning fields due to their distinct advantages over convolutional and recurrent neural network methods.

  • RNNs process sequential data step-by-step, limiting their ability to capture long-range dependencies, whereas attention mechanisms analyze entire sequences at once and selectively focus on relevant parts.
  • CNNs focus on local data subsets, which limits their ability to capture distant dependencies like connections between words in text or pixels in images, but attention mechanisms effectively handle long-range relationships across the entire input.
  • Attention mechanisms perform many computations simultaneously instead of sequentially, allowing efficient parallel processing that uses the speed and power of GPUs.

Sequence-to-Sequence Model Problem

  • A Sequence-to-Sequence (Seq2Seq) model is a type of neural network architecture used for sequence-to-sequence learning tasks, like machine translation, text summarization, and chatbot responses, where both the input and output are sequences. It essentially maps an input sequence to an output sequence, even if their lengths differ.
  • RNNs are neural networks that process sequences step-by-step, keeping a “memory” called the hidden state. However, RNNs face problems like vanishing gradients, making it hard to learn long sequences.
  • LSTMs improved this by adding gates to keep long-term memory.
  • Seq2Seq uses two LSTMs:
    • Encoder: Reads the input sentence and compresses it into one fixed-size vector called the context vector.
    • Decoder: Uses this context vector to generate the output sentence word by word.
  • The fixed-length context vector works for different sentence lengths but causes problems:
    • It treats long and short sentences equally, causing information loss (bottleneck).
    • It often forgets important details from the start of the sentence, reducing accuracy on longer inputs.

Solution to this Problem

Attention improved Seq2Seq models (Bahdanau et al., 2014) as follows:

  • Instead of passing just one context vector, the model passes all encoder hidden states to the decoder.
  • The attention mechanism helps the decoder focus on the most relevant parts of the input for each output word.
  • This removes the fixed-length bottleneck and improves translation, especially for long sentences.
  • With the introduction of transformer models in 2017, which rely entirely on attention, RNNs became largely outdated in the field of NLP.

Core Processes of Attention Mechanisms

Embedding

  • Input sequence is turned into vector embeddings (numerical representations).
  • Each element in the sequence has its feature vector(s).

Scoring 

  • Checks relationships (similarities/dependencies) between vectors.
  • Computes alignment scores or attention scores that decide how much attention each element should get.
  • Uses softmax to convert scores into attention weights (values between 0 and 1).
    • 0 = Ignore this element.
    • 1 = Focus completely on this element (others get 0).
    • All weights add up to 1 (like probabilities).

Weighting

  • Uses weights to strengthen important elements and weaken unimportant ones.
  • Helps the model make better predictions by focusing on the right information.
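
The three steps above can be sketched in a few lines of NumPy. The toy embeddings and the dot-product scoring below are illustrative assumptions, not the exact computation of any particular model:

import numpy as np

# Embedding: a toy sequence of 4 elements, each represented by a 3-dimensional vector
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
    [0.8, 0.2, 0.1],
])

# Scoring: similarity of every element to a chosen "focus" element (here, the first one)
scores = embeddings @ embeddings[0]

# Softmax: convert scores into attention weights that are positive and sum to 1
weights = np.exp(scores) / np.sum(np.exp(scores))

# Weighting: build a context vector that emphasizes the highest-weighted elements
context = weights @ embeddings

print("Attention weights:", weights.round(3))
print("Context vector:", context.round(3))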

Queries, Keys, and Values in Attention Mechanisms

In the Transformer attention mechanism (from the “Attention is All You Need” paper), every word (or token) in a sentence is turned into three vectors:

  • Query (Q): What is this word looking for (its question)?
  • Key (K): What this word offers (its information or label).
  • Value (V): The actual content that will be passed on if the key matches the query.

Imagine we go to a library looking for a book on “machine learning.”

  1. Our query is: “We want books about machine learning.”
  2. Each book has a key (its title and topic).
  3. Each book also has value (its content inside).

We check the title of each book (key), and if it matches our interest (query), we give more attention to that book. Then, we read the content (value) of the book accordingly. That means, books that match better get more of our time (attention weight); irrelevant books get ignored.
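
This lookup can be made concrete with a tiny sketch: the query is compared against every key, and the values are blended according to how well their keys match. All numbers below are made up purely for illustration:

import numpy as np

query  = np.array([1.0, 0.0])                # "books about machine learning"
keys   = np.array([[0.9, 0.1],               # book 1: mostly about machine learning
                   [0.1, 0.9],               # book 2: unrelated topic
                   [0.6, 0.4]])              # book 3: partly relevant
values = np.array([[1.0, 0.0],               # the content inside each book
                   [0.0, 1.0],
                   [0.5, 0.5]])

match   = keys @ query                               # how well each key answers the query
weights = np.exp(match) / np.sum(np.exp(match))      # attention: most time on the best match
reading = weights @ values                           # what we actually take away

print("Attention per book:", weights.round(2))
print("Blended content:", reading.round(2))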

Variants of Attention Mechanisms

Different types of attention mechanisms mainly vary in how they encode vectors, compute alignment scores, and apply attention weights to extract relevant information. Some common variants of attention mechanisms are as follows: 

Additive Attention (Bahdanau et al., 2015)

  • Introduced by Dzmitry Bahdanau et al. (first published on arXiv in 2014, presented at ICLR 2015).
  • Designed for translation tasks.
  • Uses bidirectional RNN that reads the input text both forwards and backwards, combining the results.
  • This helps when languages structure words differently (e.g., adjectives and nouns appear in a different order).
  • In machine translation, the decoder’s hidden state (representing the translated sentence) acts like a “query” that searches for relevant information. 
  • The encoder’s hidden state (representing the original sentence) acts like a “key” that provides the needed details. 
  • This helps the decoder focus on the right parts of the input when generating each word.
  • The alignment scores are calculated using a simple feedforward neural network called the attention layer.
  • This layer has three sets of trainable weights:
    1. Query Weights (Wq) – Adjusts the decoder’s hidden state (what we want to translate).
    2. Key Weights (Wk) – Adjusts the encoder’s hidden state (the original sentence).
    3. Value Weights (Wv) – Scales the final output for better translation.
  • These weights represent the model’s learned knowledge.
  • As the model trains, it updates these weights to improve translation accuracy by minimizing the error (loss function).
  • How it works step by step:

    • Alignment Check:
      • The query (decoder’s current state) and key (encoder’s hidden states) are transformed using weights (Wq and Wk).

        Alignment Score = Wq * q + Wk * k

      • If q and k are aligned, the result is a large positive value.
      • If they are mismatched, the result is a small or negative value.
    • Activation & Scoring:
      • The result passes through a tanh function (squashing it between -1 and 1).

        Alignment Score = tanh ( Wq * q + Wk * k )

      • Then, it is multiplied by another weight vector (Wv) to produce the final alignment score between the query vector and the key vector:

        Alignment Score = tanh ( Wq * q + Wk * k ) * Wv

    • Softmax & Weighting:
      • The scores go through softmax, converting them into attention weights (probabilities).

                           Attention Weights, α = softmax ( Alignment Score )

      • The final context vector (used for translation) is a weighted sum of all key vectors.

                            Context Vector = ∑ αi * Vi

                            Vi = value vectors (output representations for each input position).

                            αi = attention weights (how much focus the model puts on each input position).

  • This context vector helps the decoder generate the next word in the translation.
  • One key advantage is that the query and key vectors can have different sizes, making the model more flexible.
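
A minimal NumPy sketch of the additive scoring described above. The dimensions and the random weight matrices are illustrative assumptions, not values from the paper, and the query and keys deliberately have different sizes:

import numpy as np

rng = np.random.default_rng(0)

q = rng.normal(size=4)             # decoder hidden state (query), dimension 4
K = rng.normal(size=(6, 5))        # 6 encoder hidden states (keys), dimension 5
V = K                              # values: here, the encoder states themselves

d_att = 8                          # size of the attention layer (assumed)
Wq = rng.normal(size=(4, d_att))   # query weights
Wk = rng.normal(size=(5, d_att))   # key weights
Wv = rng.normal(size=d_att)        # value/output weights of the attention layer

# Alignment score for each encoder position: tanh(Wq*q + Wk*k) * Wv
scores = np.tanh(q @ Wq + K @ Wk) @ Wv

# Softmax turns scores into attention weights; the context vector is a weighted sum of values
alpha = np.exp(scores) / np.sum(np.exp(scores))
context = alpha @ V

print("Attention weights:", alpha.round(3))
print("Context vector:", context.round(3))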

Dot Product Attention (Luong et al., 2015)

  • Luong et al. improved Bahdanau’s attention by replacing addition and tanh activation with dot products (multiplication) between vectors for faster and simpler alignment scoring.
  • In machine translation:
    • The decoder’s current hidden state acts as the query (Q) — what the model is trying to translate.
    • The encoder’s hidden states are the keys (K) — parts of the original sentence.
    • The values (V) are typically the same as the keys and represent the information used for output.
  • Instead of using a small neural network (like in additive attention), dot product attention directly compares query and key vectors using their dot product:
  • How It Works – Step by Step
    • Alignment Score: The dot product is calculated between the query and key:

                  Alignment Score = Q * Kᵀ

      • If Q and K are similar (aligned in meaning), the score is large.
      • If not, the score is small or negative.

      • Limitation: Both Q and K must have the same dimension (dₖ), unlike in additive attention.

    • Softmax:
      • Convert scores into attention weights (just like in additive attention):

                  Attention Weights, α = softmax ( Alignment Score )

      • This ensures the weights are positive and add up to 1.
    • Context Vector:

      • As in additive attention, use the attention weights to calculate a weighted sum of the value vectors:

                  Context Vector = ∑ αi * Vi

This gives the decoder focused, relevant information from the input to help generate the next word.

Importance: Dot-product attention became the foundation for Transformer models (like GPT and BERT), enabling modern AI breakthroughs.
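
A short sketch of the same pipeline with Luong-style dot-product scoring (toy matrices assumed; note that Q and K must share the same dimension here):

import numpy as np

rng = np.random.default_rng(1)
d_k = 4

Q = rng.normal(size=(1, d_k))      # decoder state (query)
K = rng.normal(size=(6, d_k))      # 6 encoder states (keys), same dimension as Q
V = K                              # values: here, the encoder states themselves

scores  = Q @ K.T                                   # alignment scores via dot products
alpha   = np.exp(scores) / np.sum(np.exp(scores))   # softmax -> attention weights
context = alpha @ V                                 # weighted sum of the values

print("Attention weights:", alpha.round(3))
print("Context vector:", context.round(3))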

Scaled Dot-Product Attention (Vaswani et al., 2017)

  • The authors of “Attention Is All You Need” (Transformers) improved dot-product attention by adding a scaling factor to handle long input sequences more effectively.

Why Scaling Is Needed

While dot product attention is fast and simple, it has a drawback:

  • When the dimension of the query/key vectors (dₖ) is very large, their dot product also becomes very large.
  • Large values in the softmax function can cause extremely small gradients during training (known as vanishing gradients).
  • This makes learning harder and slows down the training process.

To fix this, the dot product is scaled down before applying softmax:

  • Alignment Score Calculation:
    • Compute dot products between Query (Q) and Key (K) vectors (like Luong’s method).
    • Scale the scores by dividing by √dₖ (the square root of the key dimension):

                  Scaled Alignment Score = Q * Kᵀ / √dₖ

    • This prevents overly large values that hurt training.

  • Softmax & Weighting:

    • Apply softmax to get attention weights (α).

    • This ensures the weights are between 0 and 1 and add up to 1.
  • Context Vector
    • Compute context vector as a weighted average of Value (V) vectors:

                   Context Vector = softmax ( Scaled Alignment Score ) * V

Finally, we can write the above steps as a single formula, which is the one used in Transformers:

                   Attention( Q, K, V) = softmax ( Q * Kᵀ / √dₖ ) * V

    • Where:
      • Q = Query matrix (what we’re focusing on)
      • K = Key matrix (what we’re comparing to)
      • V = Value matrix (information we want to extract)
      • dₖ = Dimension of the key/query vectors
      • The division by √dₖ is the scaling factor

Why It’s Revolutionary:

This tiny change (division by √dₖ) enabled Transformers to handle massive-scale data, powering modern AI like ChatGPT and beyond.
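
The full formula can be written as one small function. This is only a NumPy sketch of the equation above (toy matrices, naive softmax), not a production implementation:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # scaled alignment scores
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

# Toy example: 3 queries attending over 5 key/value positions, d_k = 8
rng = np.random.default_rng(2)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))

context, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))   # each row of attention weights sums to 1
print(context.shape)          # (3, 8): one context vector per query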

Self-Attention (Cheng et al., 2016)

  • So far, we have discussed cross-attention, where the queries come from one data source and the keys and values come from another. For example, in machine translation, the keys and values come from the source language (e.g., English text), while the queries come from the target language (e.g., French). Both Bahdanau’s and Luong’s attention mechanisms were designed specifically for machine translation.
  • Cheng et al. introduced self-attention, referred to as “intra-attention” in their work, as a technique to enhance overall machine reading performance.
  • Self-attention enables a model to weigh different parts of its input dynamically, deciding which elements are most relevant for prediction.

How Self-Attention Works

  • Imagine we’re reading a sentence:

    “The dress she bought from Paris is stunning.”

    When trying to understand the word “she”, our brain quickly connects it to “bought”, “dress”, and “Paris”. We’re paying attention to other words that help define context.

    This is what self-attention does—it learns what to focus on.

  • Self-attention uses the same formula as scaled dot-product attention. The difference is that in self-attention, the query, key, and value vectors all come from the same source input, whereas in other settings (like cross-attention), the queries and keys/values may come from different sources.
  • Self-Attention formula:

          Attention( Q, K, V) = softmax ( Q * Kᵀ / √dₖ ) * V

    • Where:
      • Q = Query matrix (what we’re focusing on)
      • K = Key matrix (what we’re comparing to)
      • V = Value matrix (information we want to extract)
      • dₖ = Dimension of the key/query vectors
      • Softmax = converts scores into probabilities
      • The division by √dₖ is the scaling factor to stabilize gradients
  • Cheng et al. initially explored self-attention for reading and understanding text, but it was later found to be equally powerful for generating text. This advancement, along with the development of Transformers, laid the foundation for modern generative AI and large language models (LLMs).
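
A sketch of self-attention built on the same formula: the only change is that Q, K, and V are all projections of one input sequence. The random projection matrices below stand in for learned weights:

import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(size=(5, 16))       # one input sequence: 5 tokens, 16-dimensional embeddings
d_k = 8

# Learned projections (random here for illustration) map the SAME input to Q, K, and V
Wq, Wk, Wv = (rng.normal(size=(16, d_k)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Scaled dot-product attention of the sequence over itself
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
output = weights @ V               # each token's new representation mixes in related tokens

print(weights.shape)   # (5, 5): how much each token attends to every other token
print(output.shape)    # (5, 8)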

How Language Models Like ChatGPT Translate Text Using Self-Attention

Imagine we ask a language model:

“Translate bonjour into English.”

The model doesn’t understand languages the way humans do. Instead, it learns from tons of examples how groups of words (called tokens) relate to each other, using a technique called self-attention.

  • Self-attention helps the model look at all the words in a sentence — both the original and the translated — at the same time.
  • Unlike older models that separate source and target sentences, these new models (like GPT) treat everything as one long sequence. Example:

Input: “Translate bonjour into English.”

Output: “hello”

  • During training, the model sees lots of mixed-language examples. It learns from a big vocabulary that includes words from many languages (English, French, Spanish, etc.).
  • The model doesn’t “know” that French and English are different — it just learns patterns between words across languages. When we say “Translate X into Y”, the next word often comes from Language Y.

Python Implementation of Attention Mechanism to Predict Fashion Category 

#Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, Attention, GlobalAveragePooling1D



# Load dataset
df = pd.read_csv('/content/fashion_descriptions.csv')

## Data Preprocessing
# Encode target labels
le = LabelEncoder()
df['label'] = le.fit_transform(df['category'])

# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['description'])
sequences = tokenizer.texts_to_sequences(df['description'])
word_index = tokenizer.word_index

# Pad sequences
max_len = max(len(seq) for seq in sequences)
X = pad_sequences(sequences, maxlen=max_len)
y = np.array(df['label'])

vocab_size = len(word_index) + 1
num_classes = len(le.classes_)


# Input layer
input_ = Input(shape=(max_len,), name='input')

# Embedding layer
embedding = Embedding(input_dim=vocab_size, output_dim=64, name='embedding')(input_)

# BiLSTM layer
lstm_out = Bidirectional(LSTM(64, return_sequences=True), name='bilstm')(embedding)

# Self-Attention using built-in layer
attention_out = Attention(name='attention')([lstm_out, lstm_out])  # query = value = key = lstm_out

# Pooling to reduce sequence
context_vector = GlobalAveragePooling1D(name='pooling')(attention_out)

# Output layer
output = Dense(num_classes, activation='softmax', name='output')(context_vector)

# Model
model = Model(inputs=input_, outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

model.fit(X, y, epochs=20, batch_size=2, verbose=1)

#Prediction
user_input = input("Enter a product description: ")

# Convert input text to sequence and pad
seq = tokenizer.texts_to_sequences([user_input])
padded = pad_sequences(seq, maxlen=max_len)

# Make prediction
pred = model.predict(padded)

# Decode the predicted label
predicted_category = le.inverse_transform([np.argmax(pred)])

print("Predicted category:", predicted_category[0])
Output:
Enter a product description: waterproof hiking shoes for travel
Predicted category: shoes

Reference:

  1. IBM. (n.d.). Attention Mechanism. IBM Think. Retrieved July 19, 2025, from https://www.ibm.com/think/topics/attention-mechanism
  2. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv. Retrieved July 19, 2025, from https://arxiv.org/abs/1409.0473
  3. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. arXiv. Retrieved July 19, 2025, from https://arxiv.org/abs/1409.3215
  4. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. arXiv. Retrieved July 19, 2025, from https://arxiv.org/abs/1706.03762
