Naïve:
- It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
- For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
- Hence each feature individually contributes to identifying it as an apple, without depending on the others.
Bayes:
- It is called Bayes because it depends on the principle of Bayes’ Theorem.
Bayes’ Theorem:
According to Wikipedia, in probability theory and statistics, Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
Mathematically, it can be written as: P(A|B) = P(B|A) × P(A) / P(B)
Where A and B are events and P(B)≠0
- P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
- P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
- P(A) and P(B) are the probabilities of observing A and B respectively; they are known as the marginal probability.
Example
Let’s understand it with the help of an example:
The problem statement:
You are planning a picnic today, but the morning is cloudy.
Oh no! 50% of all rainy days start off cloudy. But cloudy mornings are common (about 40% of days start cloudy), and this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%). What is the chance of rain during the day?
We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.
The chance of Rain given Cloud is written P(Rain|Cloud)
So let’s put that in the formula: P(Rain|Cloud) = P(Rain) × P(Cloud|Rain) / P(Cloud)
- P(Rain) is Probability of Rain = 10%
- P(Cloud|Rain) is Probability of Cloud, given that Rain happens = 50%
- P(Cloud) is Probability of Cloud = 40%
P(Rain|Cloud) = (0.1 × 0.5) / 0.4 = 0.125
Or a 12.5% chance of rain. Not too bad, let’s have a picnic!
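As a quick check, the same arithmetic can be reproduced in a few lines of Python (an illustrative sketch; the variable names are ours, the probabilities are the ones stated above):
# Bayes' theorem for the picnic example: P(Rain|Cloud) = P(Rain) * P(Cloud|Rain) / P(Cloud)
p_rain = 0.10               # probability of rain during the day
p_cloud = 0.40              # probability of a cloudy morning
p_cloud_given_rain = 0.50   # probability of a cloudy morning, given that it rains
p_rain_given_cloud = p_rain * p_cloud_given_rain / p_cloud
print(p_rain_given_cloud)   # 0.125 -> a 12.5% chance of rain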
Types of Naive Bayes:
1. Gaussian Naive Bayes
- Assumption: Features follow a Gaussian (normal) distribution.
- Use Case: Suitable for continuous data. It’s commonly used when the features are real-valued and continuous.
- Example: Predicting whether a student will pass or fail based on their scores in various subjects, assuming that the scores follow a normal distribution.
2. Multinomial Naive Bayes
- Assumption: Features follow a multinomial distribution.
- Use Case: Suitable for discrete data, especially when features represent counts or frequencies.
- Example: Text classification tasks like spam detection or sentiment analysis, where the features are word frequencies in a document.
3. Bernoulli Naive Bayes
- Assumption: Features are binary (i.e., they take on two values, typically 0 or 1).
- Use Case: Suitable for binary/Boolean data.
- Example: Email spam detection, where each feature indicates the presence or absence of a word in an email.
4. Complement Naive Bayes
- Assumption: The same multinomial assumption as Multinomial Naive Bayes, but the statistics are estimated from the complement of each class, which makes it particularly effective for imbalanced datasets.
- Use Case: Often used in text classification when classes are imbalanced.
- Example: Document classification tasks where one class (e.g., spam) is much less frequent than the other.
5. Categorical Naive Bayes
- Assumption: Features follow a categorical distribution.
- Use Case: Suitable for categorical data where features are discrete and have no order (nominal features).
- Example: Classifying animals based on features like color, type, and habitat.
6. Kernel Naive Bayes
- Assumption: Extends Gaussian Naive Bayes by using kernel density estimation to estimate the distribution of features.
- Use Case: Useful when the feature distribution is not Gaussian.
- Example: Non-linear data where Gaussian Naive Bayes might not be accurate.
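Most of these variants are implemented directly in scikit-learn (Kernel Naive Bayes is not part of scikit-learn and needs a separate implementation); a minimal sketch of how each would be instantiated:
# Naive Bayes variants available in scikit-learn
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
gnb = GaussianNB()      # continuous features assumed to be normally distributed
mnb = MultinomialNB()   # count / frequency features (e.g. word counts)
bnb = BernoulliNB()     # binary presence/absence features
cnb = ComplementNB()    # count features with imbalanced classes
cat = CategoricalNB()   # nominal categorical features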
Python Implementation for Naive Bayes:
Problem Statement: Spam filtering using a naive Bayes classifier to predict whether a new mail, based on its content, can be categorized as spam or not spam.
# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Load dataset
data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /06_Naive-Bayes/Naive Bayes Class/spam.tsv',sep='\t',names=['Class','Message'])
Basic Checks
data.head()
Output:
  Class                                            Message
0   ham  I've been searching for the right words to tha...
1  spam  Free entry in 2 a wkly comp to win FA Cup fina...
2   ham  Nah I don't think he goes to usf, he lives aro...
3   ham  Even my brother is not like to speak with me. ...
4   ham               I HAVE A DATE ON SUNDAY WITH WILL!!!
data.tail()
Output:
     Class                                            Message
5562  spam  This is the 2nd time we have tried 2 contact u...
5563   ham              Will ü b going to esplanade fr home?
5564   ham  Pity, * was in mood for that. So...any other s...
5565   ham  The guy did some bitching but I acted like i'd...
5566   ham                        Rofl. Its true to its name
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5567 entries, 0 to 5566
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Class    5567 non-null   object
 1   Message  5567 non-null   object
dtypes: object(2)
memory usage: 87.1+ KB
# Create a column to keep the count of the characters
data['Length'] = data['Message'].apply(len)
data.head()
Output:
  Class                                            Message  Length
0   ham  I've been searching for the right words to tha...     196
1  spam  Free entry in 2 a wkly comp to win FA Cup fina...     155
2   ham  Nah I don't think he goes to usf, he lives aro...      61
3   ham  Even my brother is not like to speak with me. ...      77
4   ham               I HAVE A DATE ON SUNDAY WITH WILL!!!      36
data.describe()
Output:
            Length
count  5567.000000
mean     80.450153
std      59.891023
min       2.000000
25%      36.000000
50%      62.000000
75%     122.000000
max     910.000000
Insight: the longest message is 910 characters. Let’s use boolean masking to find this message.
data.Class.value_counts()
Output:
Class
ham     4821
spam     746
Name: count, dtype: int64
EDA:
data[data.Length==910]['Message'].iloc[0]
Output: For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later..
data[data.Length==2]['Message'].iloc[0]
Output: Ok
Data Preprocessing (Text Preprocessing):
# Check the distribution of the target values
print(data['Class'].value_counts())
Output:
Class
ham     4821
spam     746
Name: count, dtype: int64
# Let's assign 1 for ham & 0 for spam
data.loc[data['Class']=='ham','Class']=1
data.loc[data['Class']=='spam','Class']=0
data['Class'].value_counts()
Output:
Class
1    4821
0     746
Name: count, dtype: int64
Removing Punctuation
- Python’s built-in string module provides a quick list of all the possible punctuation characters.
import string
string.punctuation
Output: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Why is it important to remove punctuation?
- Punctuation affects string comparison: two otherwise identical texts compare as different.
- Let’s understand with the example below:
"This message is spam" == "This message is spam."
Output: False
# Let's remove the punctuation
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text
data['text_clean'] = data['Message'].apply(lambda x: remove_punct(x))
data.head()
Output:
  Class                                            Message  Length                                         text_clean
0   ham  I've been searching for the right words to tha...     196  Ive been searching for the right words to than...
1  spam  Free entry in 2 a wkly comp to win FA Cup fina...     155  Free entry in 2 a wkly comp to win FA Cup fina...
2   ham  Nah I don't think he goes to usf, he lives aro...      61  Nah I dont think he goes to usf he lives aroun...
3   ham  Even my brother is not like to speak with me. ...      77  Even my brother is not like to speak with me T...
4   ham               I HAVE A DATE ON SUNDAY WITH WILL!!!      36                  I HAVE A DATE ON SUNDAY WITH WILL
Tokenization
- The process of converting normal text strings into a list of tokens (smaller pieces of text, such as individual words).
# Splitting X & y
Xset = data['text_clean'].values
yset = data['Class'].values
yset
Output: array([1, 0, 1, ..., 1, 1, 1], dtype=object)
#It is object type , so lets convert to integer
yset = yset.astype('int')
yset
Output: array([1, 0, 1, ..., 1, 1, 1])
Xset
Output: ['Ive been searching for the right words to thank you for this breather I promise i wont take your help for granted and will fulfil my promise You have been wonderful and a blessing at all times' 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s' 'Nah I dont think he goes to usf he lives around here though' ... 'Pity was in mood for that Soany other suggestions' 'The guy did some bitching but I acted like id be interested in buying something else next week and he gave it to us for free' 'Rofl Its true to its name']
# Splitting train & test data
from sklearn.model_selection import train_test_split
Xset_train,Xset_test,yset_train,yset_test = train_test_split(Xset,yset,test_size=0.2,random_state=73)
CountVectorizer
We need to convert each of those messages into a vector that scikit-learn’s algorithms and the machine learning model we will use can work with.
#Initialize the object for countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = "english")
Stopwords
Stop words are words in a language that do not add much meaning to a sentence.
They are words which are very common in text documents, such as a, an, the, you, your, etc.
Although stop words appear very frequently in text documents, in many cases they are not helpful for text analysis.
So it is better to remove them from the text.
Once the stop words are removed, we can focus on the important words.
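As a side note, the built-in English stop-word list that CountVectorizer(stop_words="english") relies on can be inspected directly (a quick optional check, not required for the pipeline):
# Inspect scikit-learn's built-in English stop-word list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(len(ENGLISH_STOP_WORDS))           # size of the stop-word list
print(sorted(ENGLISH_STOP_WORDS)[:10])   # e.g. 'a', 'about', 'above', ...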
Xset_train_cv = cv.fit_transform(Xset_train)
Xset_test_cv = cv.transform(Xset_test)
cv.get_feature_names_out()
Output: array(['008704050406', '0089my', '0121', ..., 'zyada', 'üll', '〨ud'], dtype=object)
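It can also be useful to confirm the shapes of the resulting document-term matrices (a quick sanity check using the variables defined above; the exact vocabulary size depends on the train/test split):
# Rows = messages, columns = words in the training vocabulary
print(Xset_train_cv.shape)
print(Xset_test_cv.shape)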
Training a model
- With messages represented as vectors, we can finally train our spam/ham classifier.
- Now we can actually use almost any sort of classification algorithm.
- For a variety of reasons, the Naive Bayes classifier algorithm is a good choice.
# Creating the model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(Xset_train_cv,yset_train)
yset_predict = model.predict(Xset_test_cv)
yset_predict
Output: array([0, 0, 1, ..., 1, 1, 1])
Evaluation:
from sklearn.metrics import accuracy_score
AccuracyScore = accuracy_score(yset_test,yset_predict)*100
AccuracyScore
Output: 98.20466786355476
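Since the dataset is imbalanced (roughly 87% of the messages are ham), accuracy alone can be misleading; it is worth also checking a confusion matrix and per-class metrics, for example:
# Per-class evaluation (0 = spam, 1 = ham, following the encoding above)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(yset_test, yset_predict))
print(classification_report(yset_test, yset_predict))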
Spam Classification Application:
msg = input("Enter Message: ")   # get the input message
msgInput = cv.transform([msg])   # vectorize it with the same CountVectorizer
predict = model.predict(msgInput)
if(predict[0]==0):               # 0 = spam, 1 = ham (as encoded earlier)
    print("------------------------MESSAGE-SENT-[CHECK-SPAM-FOLDER]---------------------------")
else:
    print("---------------------------MESSAGE-SENT-[CHECK-INBOX]------------------------------")
Output: Enter Message: Please can you give me yearly sales report ---------------------------MESSAGE-SENT-[CHECK-INBOX]------------------------------
Bag of Words:
- In Natural Language Processing we cannot pass raw text directly to train our models, so we need to convert it into numbers that the machine can understand and perform the required modelling on.
# Creating a list of sentences
doc = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
doc[3]
Output: man eats food
# corpus is the collection of text
print("Our corpus: ", doc)
# Initialise the object for CountVectorizer
cv = CountVectorizer()
#Build a BOW representation for the corpus
bow_doc = cv.fit_transform(doc)
#Look at the vocabulary mapping
print("Our vocabulary: ", cv.vocabulary_)
#see the BOW rep for first 2 documents
print("BoW representation for 'dog bites man': ", bow_doc[0].toarray())
print("BoW representation for 'man bites dog: ",bow_doc[1].toarray())
#Get the representation using this vocabulary, for a new text
temp = cv.transform(["dog and dog are friends"])
print("Bow representation for 'dog and dog are friends':", temp.toarray())
Output:
Our corpus:  ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
BoW representation for 'dog bites man':  [[1 1 0 0 1 0]]
BoW representation for 'man bites dog':  [[1 1 0 0 1 0]]
Bow representation for 'dog and dog are friends': [[0 2 0 0 0 0]]
TF-IDF:
In BOW approach we saw so far, all the words in the text are treated equally important.
There is no notion of some words in the document being more important than others.
TF-IDF addresses this issue.
It aims to quantify the importance of a given word relative to other words in the document.
Term Frequency (TF):
Term Frequency, which measures how frequently a term occurs in a document.
Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones.
Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization:
- TF(t) = (Number of times term ‘t’ appears in a document) / (Total number of terms in the document).
Inverse Document Frequency (IDF):
It measures how important a term is.
While computing TF, all terms are considered equally important.
However it is known that certain terms, such as “is”, “of”, and “that”, may appear a lot of times but have little importance.
Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:
- IDF(t) = log(Total number of documents / Number of documents with term t in it).
TF-IDF is a commonly used representation scheme in information retrieval systems, for extracting relevant documents from a corpus for a given text query.
Let’s see an example:
Consider a document containing 100 words wherein the word cat appears 3 times.
The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03.
Now, assume we have 10 million documents and the word cat appears in one thousand of these.
Then, the inverse document frequency (i.e., IDF) is calculated as log(10,000,000 / 1,000) = 4, using a base-10 logarithm.
Thus, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12
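The same arithmetic can be verified in Python (an illustrative sketch of the numbers above; a base-10 logarithm is used):
import math
tf = 3 / 100                             # term frequency of 'cat' in the document
idf = math.log10(10_000_000 / 1_000)     # inverse document frequency = 4.0
print(tf * idf)                          # TF-IDF weight = 0.12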
The previous model started from CountVectorizer; here we will use TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
Xset_train_tv = tv.fit_transform(Xset_train)
Xset_test_tv = tv.transform(Xset_test)
Model Creation:
from sklearn.naive_bayes import BernoulliNB
model1 = BernoulliNB(alpha = 0.01)
model1.fit(Xset_train_tv,yset_train)
yset_predict1 = model1.predict(Xset_test_tv)
from sklearn.metrics import accuracy_score
AccuracyScore = accuracy_score(yset_test,yset_predict1)*100
AccuracyScore
Output: 98.56373429084381
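To reuse the spam-filtering application from earlier with this model, the new message must be vectorized with the TfidfVectorizer rather than the CountVectorizer (a sketch using the variables above; the example message is hypothetical):
# Classify a new message with the TF-IDF + BernoulliNB model (0 = spam, 1 = ham)
msg = "Free entry in a weekly competition, text WIN now to claim your prize"
msgInput = tv.transform([msg])
print("spam" if model1.predict(msgInput)[0] == 0 else "ham")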
Application of Naive Bayes
Text classification / spam filtering / Sentiment analysis:
- Naive Bayes classifiers are mostly used in text classification.
- News article classification: SPORTS, TECHNOLOGY, etc.
- Spam or Ham: Naive Bayes is one of the most popular methods for mail filtering.
- Sentiment analysis focuses on identifying whether the customers think positively or negatively about a certain topic (product or service).
Recommendation System:
- A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.