Naïve:

  • It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
  • For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
  • Hence each feature individually contributes to identifying the fruit as an apple, without depending on the others.
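
Formally, for a class y and features x1, x2, …, xn, this independence assumption means the joint likelihood factors into a product of per-feature likelihoods:

P( x1, x2, …, xn | y ) = P( x1 | y ) × P( x2 | y ) × … × P( xn | y )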

Bayes:

  • It is called Bayes because it depends on the principle of Bayes’ Theorem.

Bayes’ Theorem:

  • According to Wikipedia: in probability theory and statistics, Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event.

  • Mathematically, it can be written as: P(A|B) = P(B|A) × P(A) / P(B)

  • Where A and B are events and P(B) ≠ 0:

    • P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
    • P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
    • P(A) and P(B) are the probabilities of observing A and B respectively; they are known as the marginal probabilities.

Example

Let’s understand it with the help of an example:

The problem statement:

  • You are planning a picnic today, but the morning is cloudy.

  • Oh no! 50% of all rainy days start off cloudy! But cloudy mornings are common (about 40% of days start cloudy), and this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%). What is the chance of rain during the day?

  • We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.

  • The chance of Rain given Cloud is written P(Rain|Cloud)

So let’s put that in the formula: P(Rain|Cloud) = P(Rain) × P(Cloud|Rain) / P(Cloud)

  • P(Rain) is Probability of Rain = 10%
  • P(Cloud|Rain) is Probability of Cloud, given that Rain happens = 50%
  • P(Cloud) is Probability of Cloud = 40%

P(Rain|Cloud) = (0.1 × 0.5) / 0.4 = 0.125

Or a 12.5% chance of rain. Not too bad; let’s have a picnic!
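
As a quick check, here is the same arithmetic in Python (a minimal sketch; the variable names are just for illustration):

# Bayes' Theorem: P(Rain|Cloud) = P(Rain) * P(Cloud|Rain) / P(Cloud)
p_rain = 0.10               # prior probability of rain (3 of 30 days)
p_cloud = 0.40              # probability of a cloudy morning
p_cloud_given_rain = 0.50   # fraction of rainy days that start off cloudy

p_rain_given_cloud = p_rain * p_cloud_given_rain / p_cloud
print(p_rain_given_cloud)   # 0.125, i.e. a 12.5% chance of rain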

Types of Naive Bayes:

1. Gaussian Naive Bayes

  • Assumption: Features follow a Gaussian (normal) distribution.
  • Use Case: Suitable for continuous data. It’s commonly used when the features are real-valued and continuous.
  • Example: Predicting whether a student will pass or fail based on their scores in various subjects, assuming that the scores follow a normal distribution.

2. Multinomial Naive Bayes

  • Assumption: Features follow a multinomial distribution.
  • Use Case: Suitable for discrete data, especially when features represent counts or frequencies.
  • Example: Text classification tasks like spam detection or sentiment analysis, where the features are word frequencies in a document.

3. Bernoulli Naive Bayes

  • Assumption: Features are binary (i.e., they take on two values, typically 0 or 1).
  • Use Case: Suitable for binary/Boolean data.
  • Example: Email spam detection, where each feature indicates the presence or absence of a word in an email.

4. Complement Naive Bayes

  • Assumption: Uses the same multinomial model as Multinomial Naive Bayes, but estimates parameters from the complement of each class, which makes it particularly effective for imbalanced datasets.
  • Use Case: Often used in text classification when classes are imbalanced.
  • Example: Document classification tasks where one class (e.g., spam) is much less frequent than the other.

5. Categorical Naive Bayes

  • Assumption: Features follow a categorical distribution.
  • Use Case: Suitable for categorical data where features are discrete and have no order (nominal features).
  • Example: Classifying animals based on features like color, type, and habitat.

6. Kernel Naive Bayes

  • Assumption: Extends Gaussian Naive Bayes by using kernel density estimation to estimate the distribution of features.
  • Use Case: Useful when the feature distribution is not Gaussian.
  • Example: Non-linear data where Gaussian Naive Bayes might not be accurate.
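
The first five variants are available in scikit-learn (Kernel Naive Bayes is not included in scikit-learn). A minimal sketch of how each class is instantiated; the data are assumed to match each model's feature type:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB

# Each classifier shares the same fit/predict API; only the assumed
# feature distribution differs.
gnb   = GaussianNB()      # continuous, normally distributed features
mnb   = MultinomialNB()   # count/frequency features (e.g. word counts)
bnb   = BernoulliNB()     # binary presence/absence features
cnb   = ComplementNB()    # count features, robust to class imbalance
catnb = CategoricalNB()   # nominal, integer-encoded categorical features

# Usage is identical for all of them, e.g.:
# gnb.fit(X_train, y_train); gnb.predict(X_test)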

Python Implementation for Naive Bayes:

Problem Statement: Spam filtering using a naive Bayes classifier to predict whether a new mail, based on its content, should be categorized as spam or not spam.

# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
# Load dataset

data = pd.read_csv('/content/drive/MyDrive/Data Science/CDS-07-Machine Learning & Deep Learning/06. Machine Learning Model /06_Naive-Bayes/Naive Bayes Class/spam.tsv',sep='\t',names=['Class','Message'])

Basic Checks

data.head()
Output:
	Class	Message
0	ham	I've been searching for the right words to tha...
1	spam	Free entry in 2 a wkly comp to win FA Cup fina...
2	ham	Nah I don't think he goes to usf, he lives aro...
3	ham	Even my brother is not like to speak with me. ...
4	ham	I HAVE A DATE ON SUNDAY WITH WILL!!!
data.tail()
Output:
	Class	Message
5562	spam	This is the 2nd time we have tried 2 contact u...
5563	ham	Will ü b going to esplanade fr home?
5564	ham	Pity, * was in mood for that. So...any other s...
5565	ham	The guy did some bitching but I acted like i'd...
5566	ham	Rofl. Its true to its name
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5567 entries, 0 to 5566
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5567 non-null   object
 1   Message  5567 non-null   object
dtypes: object(2)
memory usage: 87.1+ KB
# Create a column to keep the count of the characters

data['Length'] = data['Message'].apply(len)

data.head()
Output:
	Class	Message	Length
0	ham	I've been searching for the right words to tha...	196
1	spam	Free entry in 2 a wkly comp to win FA Cup fina...	155
2	ham	Nah I don't think he goes to usf, he lives aro...	61
3	ham	Even my brother is not like to speak with me. ...	77
4	ham	I HAVE A DATE ON SUNDAY WITH WILL!!!	36
data.describe()
Output:
            Length
count  5567.000000
mean     80.450153
std      59.891023
min       2.000000
25%      36.000000
50%      62.000000
75%     122.000000
max     910.000000

Insight: there is a 910-character-long message. Let’s use masking to find it.

data.Class.value_counts()
Output:
Class
ham     4821
spam     746
Name: count, dtype: int64

EDA:

data[data.Length==910]['Message'].iloc[0]
Output:
For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later..
data[data.Length==2]['Message'].iloc[0]
Output:
Ok

Data Preprocessing (Text Preprocessing):

# Check the target class counts

print(data['Class'].value_counts())
Output:
Class
ham     4821
spam     746
Name: count, dtype: int64
# Let's assign 1 for ham & 0 for spam

data.loc[data['Class']=='ham','Class']=1
data.loc[data['Class']=='spam','Class']=0

data['Class'].value_counts()
Output:
Class
1    4821
0     746
Name: count, dtype: int64

Removing Punctuation

  • Python’s built-in string library provides a quick list of all the possible punctuation characters:
import string

string.punctuation
Output:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Why is it important to remove punctuation?

  • Punctuation has an impact on how strings compare.
  • Let’s understand with the example below:
"This message is spam" == "This message is spam."
Output:
False
# Let's remove the punctuation

def remove_punct(text):
    # keep only the characters that are not punctuation
    text = "".join([char for char in text if char not in string.punctuation])
    return text

data['text_clean'] = data['Message'].apply(lambda x: remove_punct(x))

data.head()
Output:
	Class	                                        Message	Length	text_clean
0	ham	I've been searching for the right words to tha...	196	Ive been searching for the right words to than...
1	spam	Free entry in 2 a wkly comp to win FA Cup fina...	155	Free entry in 2 a wkly comp to win FA Cup fina...
2	ham	Nah I don't think he goes to usf, he lives aro...	61	Nah I dont think he goes to usf he lives aroun...
3	ham	Even my brother is not like to speak with me. ...	77	Even my brother is not like to speak with me T...
4	ham	I HAVE A DATE ON SUNDAY WITH WILL!!!	36	I HAVE A DATE ON SUNDAY WITH WILL

Tokenization

  • The process of converting normal text strings into a list of tokens (which may later be normalized into lemmas).
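
For instance, a simple whitespace tokenization (a simplified sketch; the CountVectorizer used below applies its own tokenizer internally):

print("dog bites man".split())   # ['dog', 'bites', 'man']
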
# Splitting X & y

Xset = data['text_clean'].values

yset = data['Class'].values

yset
Output:
array([1, 0, 1, ..., 1, 1, 1], dtype=object)
# It is object dtype, so let's convert it to integer

yset = yset.astype('int')

yset
Output:
array([1, 0, 1, ..., 1, 1, 1])
Xset
Output:
['Ive been searching for the right words to thank you for this breather I promise i wont take your help for granted and will fulfil my promise You have been wonderful and a blessing at all times'
 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s'
 'Nah I dont think he goes to usf he lives around here though' ...
 'Pity  was in mood for that Soany other suggestions'
 'The guy did some bitching but I acted like id be interested in buying something else next week and he gave it to us for free'
 'Rofl Its true to its name']
# Splitting train & test data

from sklearn.model_selection import train_test_split

Xset_train,Xset_test,yset_train,yset_test = train_test_split(Xset,yset,test_size=0.2,random_state=73)

CountVectorizer

  • We need to convert each of those messages into a vector that scikit-learn’s algorithm models, including the machine learning model we will use, can work with.

# Initialize the object for CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = "english")

Stopwords

  • Stopwords are words in a language which do not add much meaning to a sentence.

  • They are words which are very common in text documents, such as a, an, the, you, your, etc.

  • Stop words appear very frequently in text documents.

  • However, in many cases they are not helpful for text analysis.

  • So it is better to remove them from the text.

  • We can focus on the important words once the stop words have been removed (see the snippet below).
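
Passing stop_words="english" makes CountVectorizer drop scikit-learn’s built-in English stop word list, which can be inspected directly:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(len(ENGLISH_STOP_WORDS))         # 318 words in recent scikit-learn versions
print(sorted(ENGLISH_STOP_WORDS)[:5])  # e.g. ['a', 'about', 'above', ...]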

Xset_train_cv = cv.fit_transform(Xset_train)
Xset_test_cv = cv.transform(Xset_test)
cv.get_feature_names_out()
Output:
array(['008704050406', '0089my', '0121', ..., 'zyada', 'üll', '〨ud'],
      dtype=object)

Training a model

  • With messages represented as vectors, we can finally train our spam/ham classifier.
  • Now we can actually use almost any sort of classification algorithm.
  • Naive Bayes is a good choice here: it is fast to train, handles high-dimensional sparse count data well, and is a strong baseline for text classification.
# Creating the model

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

model.fit(Xset_train_cv,yset_train)

yset_predict = model.predict(Xset_test_cv)
yset_predict
Output:
array([0, 0, 1, ..., 1, 1, 1])

Evaluation:

from sklearn.metrics import accuracy_score

AccuracyScore = accuracy_score(yset_test,yset_predict)*100
AccuracyScore
Output:
98.20466786355476
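
Since the classes are imbalanced (about 87% ham), accuracy alone can be misleading; a confusion matrix and per-class report give a fuller picture. A minimal sketch using the same predictions:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes (0 = spam, 1 = ham)
print(confusion_matrix(yset_test, yset_predict))
print(classification_report(yset_test, yset_predict, target_names=['spam', 'ham']))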

Spam Classification Application:

msg = input("Enter Message: ") # get the input message
msgInput = cv.transform([msg]) # vectorize it with the fitted CountVectorizer
predict = model.predict(msgInput)
if predict[0] == 0: # 0 was assigned to spam earlier
    print("------------------------MESSAGE-SENT-[CHECK-SPAM-FOLDER]---------------------------")
else:
    print("---------------------------MESSAGE-SENT-[CHECK-INBOX]------------------------------")
Output:
Enter Message: Please can you give me yearly sales report
---------------------------MESSAGE-SENT-[CHECK-INBOX]------------------------------

Bag of Words:

  • We cannot pass text directly to train our models in Natural Language Processing; we need to convert it into numbers, which the machine can understand and perform the required modelling on.
# Creating a list of sentences

doc = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

doc[3]
Output:
man eats food
# corpus is the collection of text

print("Our corpus: ", doc)

# Initialise the object for CountVectorizer
cv = CountVectorizer()

#Build a BOW representation for the corpus
bow_doc = cv.fit_transform(doc)

#Look at the vocabulary mapping
print("Our vocabulary: ", cv.vocabulary_)

#see the BOW rep for first 2 documents
print("BoW representation for 'dog bites man': ", bow_doc[0].toarray())
print("BoW representation for 'man bites dog: ",bow_doc[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = cv.transform(["dog and dog are friends"])
print("Bow representation for 'dog and dog are friends':", temp.toarray())
Output:
Our corpus:  ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
BoW representation for 'dog bites man':  [[1 1 0 0 1 0]]
BoW representation for 'man bites dog':  [[1 1 0 0 1 0]]
Bow representation for 'dog and dog are friends': [[0 2 0 0 0 0]]

TF-IDF:

  • In the BoW approach we have seen so far, all the words in the text are treated as equally important.

  • There is no notion of some words in the document being more important than others.

  • TF-IDF addresses this issue.

  • It aims to quantify the importance of a given word relative to other words in the document.

Term Frequency (TF):

  • Term Frequency, which measures how frequently a term occurs in a document.

  • Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones.

  • Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization:

    • TF(t) = (Number of times term ‘t’ appears in a document) / (Total number of terms in the document).

Inverse Document Frequency (IDF):

  • It measures how important a term is.

  • While computing TF, all terms are considered equally important.

  • However it is known that certain terms, such as “is”, “of”, and “that”, may appear a lot of times but have little importance.

  • Thus we need to weigh down the frequent terms while scale up the rare ones, by computing the following:

    • IDF(t) = log(Total number of documents / Number of documents with term t in it), using the base-10 logarithm.

TF-IDF is a commonly used representation scheme for information retrieval systems, for extracting relevant documents from a corpus for a given text query.

Let’s see an example:

  • Consider a document containing 100 words wherein the word cat appears 3 times.

  • The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03.

  • Now, assume we have 10 million documents and the word cat appears in one thousand of these.

  • Then, the inverse document frequency (i.e., IDF) is calculated as log10(10,000,000 / 1,000) = 4.

  • Thus, the TF-IDF weight is the product of these quantities: 0.03 × 4 = 0.12.
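
The same arithmetic in Python (using the base-10 logarithm, consistent with the example above):

import math

tf = 3 / 100                           # term frequency of 'cat'
idf = math.log10(10_000_000 / 1_000)   # = 4.0
print(tf * idf)                        # 0.12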

The previous model started from CountVectorizer; here we will use TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer()
Xset_train_tv = tv.fit_transform(Xset_train)
Xset_test_tv = tv.transform(Xset_test)

Model Creation:

from sklearn.naive_bayes import BernoulliNB

model1 = BernoulliNB(alpha = 0.01)

model1.fit(Xset_train_tv,yset_train)

yset_predict1 = model1.predict(Xset_test_tv)
from sklearn.metrics import accuracy_score

AccuracyScore = accuracy_score(yset_test,yset_predict1)*100
AccuracyScore
Output:
98.56373429084381

Application of Naive Bayes

  • Text classification / spam filtering / Sentiment analysis:

    • Naive Bayes classifiers are mostly used in text classification.
    • News article classification: SPORTS, TECHNOLOGY, etc.
    • Spam or Ham: Naive Bayes is the most popular method for mail filtering.
    • Sentiment analysis focuses on identifying whether customers think positively or negatively about a certain topic (product or service).
  • Recommendation System:

    • A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
