Attention Mechanism
- An attention mechanism is a method in deep learning that helps models focus on the most important parts of the input data, especially for long or complex sequences.
- The introduction of attention mechanisms laid the foundation for the transformer architecture (Vaswani et al., 2017), which now powers state-of-the-art large language models (LLMs) such as ChatGPT.
- Attention mechanisms mimic how humans focus on important information and ignore the rest, helping models highlight key parts of input while saving memory and processing power.
- Attention mechanisms assign different importance (weights) to parts of the input, and the model learns these weights during training so it can focus on what matters most for the task (a minimal numerical sketch follows this list).
- Bahdanau et al. (2014) introduced attention to improve RNNs for translation, and later it was applied to CNNs for tasks like image captioning and visual question answering.
- The 2017 paper “Attention is All You Need” introduced the transformer model, which uses only attention and feedforward layers and now powers today’s leading generative AI models.
- Attention is not limited to NLP and LLMs: diffusion models for image generation commonly use attention mechanisms to improve image quality and detail, and Vision Transformers apply attention to vision tasks such as object detection and segmentation.
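To make the weighting idea above concrete, here is a minimal NumPy sketch of scaled dot-product attention for a toy sequence. The sequence length, dimensions, and random query/key/value matrices are illustrative assumptions; in a trained model, queries, keys, and values come from learned projections of the input, which is how the attention weights end up being learned.
# Toy example: 4 tokens, each represented by an 8-dimensional vector
import numpy as np
np.random.seed(0)
seq_len, d_model = 4, 8
Q = np.random.randn(seq_len, d_model)  # queries (random stand-ins for learned projections)
K = np.random.randn(seq_len, d_model)  # keys
V = np.random.randn(seq_len, d_model)  # values
# Similarity of every query with every key, scaled by sqrt(d_model)
scores = Q @ K.T / np.sqrt(d_model)                       # shape (4, 4)
# Softmax turns each row of scores into weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
# Each output is a weighted average of the value vectors
output = weights @ V                                      # shape (4, 8)
print(weights.round(2))  # how strongly each token attends to every other token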
Importance of Attention Mechanism
Transformer models, powered by attention mechanisms, have set new benchmarks across deep learning fields due to their distinct advantages over convolutional and recurrent neural network methods.
- RNNs process sequential data step-by-step, limiting their ability to capture long-range dependencies, whereas attention mechanisms analyze entire sequences at once and selectively focus on relevant parts.
- CNNs focus on local data subsets, which limits their ability to capture distant dependencies like connections between words in text or pixels in images, but attention mechanisms effectively handle long-range relationships across the entire input.
- Attention mechanisms perform many computations simultaneously rather than sequentially, enabling efficient parallel processing that exploits the speed and throughput of GPUs (contrasted with recurrent processing in the sketch after this list).
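As a rough illustration of the parallelism point in the last bullet, the sketch below contrasts a simplified recurrent update, which must loop over time steps, with attention over the same sequence computed in a few matrix products. The dimensions and the bare-bones recurrent update are assumptions made for this example, not code from any particular library.
import numpy as np
np.random.seed(1)
seq_len, d = 6, 16
X = np.random.randn(seq_len, d)                 # one sequence of 6 token vectors
# Recurrent processing: each step depends on the previous hidden state
W_h, W_x = 0.1 * np.random.randn(d, d), 0.1 * np.random.randn(d, d)
h = np.zeros(d)
for t in range(seq_len):                        # inherently sequential
    h = np.tanh(h @ W_h + X[t] @ W_x)
# Attention-style processing: the whole sequence at once, easy to parallelize
scores = X @ X.T / np.sqrt(d)                   # all pairwise similarities in one product
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ X                           # every position attends to every other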
Sequence-to-Sequence Model Problem
- A Sequence-to-Sequence (Seq2Seq) model is a type of neural network architecture used for sequence-to-sequence learning tasks, like machine translation, text summarization, and chatbot responses, where both the input and output are sequences. It essentially maps an input sequence to an output sequence, even if their lengths differ.
- RNNs are neural networks that process sequences step-by-step, keeping a “memory” called the hidden state. However, RNNs face problems like vanishing gradients, making it hard to learn long sequences.
- LSTMs improved this by adding gates to keep long-term memory.
- Seq2Seq uses two LSTMs (a minimal Keras sketch of this setup follows this list):
- Encoder: Reads the input sentence and compresses it into one fixed-size vector called the context vector.
- Decoder: Uses this context vector to generate the output sentence word by word.
- The fixed-length context vector works for different sentence lengths but causes problems:
- It treats long and short sentences equally, causing information loss (bottleneck).
- It often forgets important details from the start of the sentence, reducing accuracy on longer inputs.
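Here is a rough Keras sketch of the encoder-decoder setup just described; the vocabulary sizes, dimensions, and layer names are illustrative assumptions, not values from the source. The encoder compresses the whole input sentence into its final LSTM states, and that fixed-size context is the only information the decoder receives.
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
src_vocab, tgt_vocab, units = 5000, 5000, 128   # illustrative sizes
# Encoder: reads the whole source sentence and keeps only its final LSTM states
enc_inputs = Input(shape=(None,), name='encoder_input')
enc_emb = Embedding(src_vocab, units)(enc_inputs)
_, state_h, state_c = LSTM(units, return_state=True, name='encoder_lstm')(enc_emb)
context = [state_h, state_c]   # the fixed-size context vector (the bottleneck)
# Decoder: generates the target sentence conditioned only on that fixed context
dec_inputs = Input(shape=(None,), name='decoder_input')
dec_emb = Embedding(tgt_vocab, units)(dec_inputs)
dec_out, _, _ = LSTM(units, return_sequences=True, return_state=True,
                     name='decoder_lstm')(dec_emb, initial_state=context)
dec_preds = Dense(tgt_vocab, activation='softmax', name='decoder_output')(dec_out)
seq2seq = Model([enc_inputs, dec_inputs], dec_preds)
seq2seq.summary()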
Solution to this Problem
Attention improved Seq2Seq models (Bahdanau et al., 2014) as follows:
- Instead of passing just one context vector, the model passes all encoder hidden states to the decoder.
- The attention mechanism helps the decoder focus on the most relevant parts of the input for each output word (a numerical sketch of this step follows this list).
- This removes the fixed-length bottleneck and improves translation, especially for long sentences.
- With the introduction of transformer models in 2017, which rely entirely on attention, RNNs became largely outdated in the field of NLP.
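Below is a minimal NumPy sketch of the attention step just described, assuming additive (Bahdanau-style) scoring: for one decoder step, a score is computed against every encoder hidden state, a softmax turns the scores into weights, and the weighted sum of encoder states becomes the context vector for that step. The matrices here are random stand-ins for parameters that would be learned during training.
import numpy as np
np.random.seed(2)
src_len, enc_dim, dec_dim, attn_dim = 5, 32, 32, 16   # illustrative sizes
encoder_states = np.random.randn(src_len, enc_dim)    # one hidden state per source word
decoder_state = np.random.randn(dec_dim)              # current decoder hidden state
# Learned parameters in a real model; random here for illustration
W_enc = np.random.randn(enc_dim, attn_dim)
W_dec = np.random.randn(dec_dim, attn_dim)
v = np.random.randn(attn_dim)
# Additive score of every encoder state against the current decoder state
scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v   # shape (src_len,)
# Softmax -> attention weights over the source positions
weights = np.exp(scores) / np.exp(scores).sum()
# Context vector: weighted sum of all encoder states, recomputed at every decoder step
context = weights @ encoder_states                    # shape (enc_dim,)
print(weights.round(3))   # how much the decoder attends to each source word
The script that follows puts attention to work in a complete Keras example: a BiLSTM text classifier with a self-attention layer, trained on a small product-description dataset.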
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, Attention, GlobalAveragePooling1D
# Load dataset
df = pd.read_csv('/content/fashion_descriptions.csv')
## Data Preprocessing
# Encode target labels
le = LabelEncoder()
df['label'] = le.fit_transform(df['category'])
# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['description'])
sequences = tokenizer.texts_to_sequences(df['description'])
word_index = tokenizer.word_index
# Pad sequences
max_len = max(len(seq) for seq in sequences)
X = pad_sequences(sequences, maxlen=max_len)
y = np.array(df['label'])
# Vocabulary size (+1 for the padding index 0) and number of target classes
vocab_size = len(word_index) + 1
num_classes = len(le.classes_)
# Input layer
input_ = Input(shape=(max_len,), name='input')
# Embedding layer
embedding = Embedding(input_dim=vocab_size, output_dim=64, name='embedding')(input_)
# BiLSTM layer
lstm_out = Bidirectional(LSTM(64, return_sequences=True), name='bilstm')(embedding)
# Self-Attention using built-in layer
attention_out = Attention(name='attention')([lstm_out, lstm_out])  # self-attention: query = value = lstm_out (key defaults to value)
# Pooling to reduce sequence
context_vector = GlobalAveragePooling1D(name='pooling')(attention_out)
# Output layer
output = Dense(num_classes, activation='softmax', name='output')(context_vector)
# Model
model = Model(inputs=input_, outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
model.fit(X, y, epochs=20, batch_size=2, verbose=1)
# Prediction
user_input = input("Enter a product description: ")
# Convert input text to sequence and pad
seq = tokenizer.texts_to_sequences([user_input])
padded = pad_sequences(seq, maxlen=max_len)
# Make prediction
pred = model.predict(padded)
# Decode the predicted label
predicted_category = le.inverse_transform([np.argmax(pred)])
print("Predicted category:", predicted_category[0])
Output:
Enter a product description: waterproof hiking shoes for travel
Predicted category: shoes
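The script above trains and predicts on the full dataset, which says nothing about generalization. As an optional follow-up sketch, a simple hold-out split (using scikit-learn's train_test_split, run in place of the full-data model.fit call above) gives a rough accuracy estimate; the 80/20 ratio and the random seed are arbitrary choices for illustration.
from sklearn.model_selection import train_test_split
# Illustrative hold-out split; use this instead of fitting on all of X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train, epochs=20, batch_size=2, verbose=1)
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Held-out accuracy: {acc:.2f}")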
Reference:
- IBM. (n.d.). Attention Mechanism. IBM Think. Retrieved July 19, 2025, from https://www.ibm.com/think/topics/attention-mechanism
- Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv. Retrieved July 19, 2025, from https://arxiv.org/abs/1409.0473
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. arXiv. Retrieved July 19, 2025, from https://arxiv.org/abs/1706.03762