Naive Bayes

What is Naive Bayes? 

Naive Bayes is a powerful algorithm for text classification based on Bayes’ Theorem, so it works with a prior probability and a posterior probability: the posterior probability equals the prior probability times the likelihood ratio. This lets you calculate the probability of events such as a car being stolen based on evidence, even if the exact example you want to predict does not appear in your data set. With Python, you can apply the Naive Bayes algorithm to detect whether a text message belongs to a certain group, for example to detect whether it is spam.

The “Naive” part of the classifier is the assumption that every feature in a class is unrelated to (independent of) every other feature, even if in reality these features depend on each other. All features are assumed to contribute independently and equally to the probability that a particular hypothesis, say class A or class B, is the predicted outcome.

Bayes’ theorem describes the conditional probability of an event based on prior knowledge of conditions that might be related to the event. Bayes’ Theorem states that the relationship between the probability of the hypothesis before getting the evidence, P(H), and the probability of the hypothesis after getting the evidence, P(H|E), is P(H|E) = P(E|H) * P(H) / P(E). Here P(H) is called the prior probability, while P(H|E) is called the posterior probability. The factor that relates the two, P(E|H) / P(E), is called the likelihood ratio.

This leads us to the practical statement: “The posterior probability equals the prior probability times the likelihood ratio”.

Not all smoke is fire

Example: What is the probability of a disaster fire when you see smoke?

  • A disaster fire is rare: only 1% of all fires (most are harmless, like fireplaces and barbecues)
  • Smoke is commonly seen: it accompanies 50% of all fires
  • A disaster fire usually produces visible smoke: in 90% of cases

P(Fire when you see smoke) = P(Fire) * P(Smoke | Fire) / P(Smoke) = 1% * 90% / 50% = 1.8%
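To double-check the arithmetic, here is a minimal Python sketch of the same calculation (the variable names are only illustrative):

# Bayes' theorem: P(Fire | Smoke) = P(Smoke | Fire) * P(Fire) / P(Smoke)
p_fire = 0.01               # prior: disaster fires are 1% of all fires
p_smoke = 0.50              # smoke is seen in 50% of all fires
p_smoke_given_fire = 0.90   # 90% of disaster fires produce visible smoke

p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print(f"P(Fire | Smoke) = {p_fire_given_smoke:.1%}")   # 1.8%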



What is the probability that your car will be stolen?

Here is a dataset where we need to classify whether a car with the given features will be stolen or not. The columns represent features, and each row represents one car and whether it was stolen. If we take the first row of the dataset, we can observe that the car was stolen when the Color is Red, the Type is Sports and the Model is BMW. Now we want to classify whether a Red BMW SUV is likely to be stolen, and with what probability. Note that there is no example of a Red BMW SUV in our data set.

The posterior probability P(y|X) can be calculated by first creating a frequency table for each attribute against the target, then converting the frequency tables into likelihood tables, and finally using the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction. Below are the frequency and likelihood tables for all three predictors.
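These steps can be sketched in pandas. The toy table below is hypothetical (a few rows in the spirit of the example, not the original dataset), and the helper function likelihood() is our own illustration:

import pandas as pd

# Hypothetical toy data with the three predictors from the example
cars = pd.DataFrame({
    'Color': ['Red', 'Red', 'Yellow', 'Red', 'Yellow', 'Red'],
    'Type':  ['Sports', 'SUV', 'SUV', 'Sports', 'Sports', 'SUV'],
    'Model': ['BMW', 'BMW', 'Audi', 'Audi', 'BMW', 'BMW'],
    'Stolen': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
})

# Frequency table: counts of each Color value per class
freq_color = pd.crosstab(cars['Color'], cars['Stolen'])
# Likelihood table: P(Color value | class)
print(freq_color / freq_color.sum(axis=0))

# Posterior score for a class: P(class) * product of P(feature value | class)
prior = cars['Stolen'].value_counts(normalize=True)

def likelihood(feature, value, cls):
    table = pd.crosstab(cars[feature], cars['Stolen'])
    return table.loc[value, cls] / table[cls].sum()

def score(cls):
    return prior[cls] * likelihood('Color', 'Red', cls) \
                      * likelihood('Type', 'SUV', cls) \
                      * likelihood('Model', 'BMW', cls)

# Normalised posterior probability that a Red BMW SUV is stolen
print(score('Yes') / (score('Yes') + score('No')))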

The probability of a car being stolen can now be estimated, so we run the calculation for several combinations of features. As you can see, the probability that a Red BMW SUV is stolen comes out at 17%. It follows from the overall probability of 65%, while the conditional probabilities are lower than the overall one: 54% for Red, and 38% and 54% for SUV and BMW respectively.


Naive Bayes with Python

You can apply the Naive Bayes algorithm to text classification and detect whether a text message belongs to a certain group, for example to detect whether it is spam.

Data sample: Kaggle / Quora Insincere Questions

import numpy as np
import pandas as pd
import os

f_csv = "C:/Users/prac/Documents/Programy/NaiveBayes/"
data_set = pd.read_csv(f_csv+"train.csv")
print(data_set.sample(10))

print(data_set.info())
print(data_set.groupby('target').count())

# we want to run the experiment on the same number of positive and negative examples,
# so we downsample the negative class (target == 0)
t1 = data_set[data_set["target"]==0].sample(80810).copy(deep=True)
print(t1.shape)
print(t1.groupby('target').count())

t1 = pd.concat([t1, data_set[data_set.target == 1]], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(t1.shape)
print(t1.groupby('target').count())
print(t1.sample(3))
print(t1.columns)

We divide the dataset into training and test samples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(t1['question_text'], t1['target'], test_size=0.25)

print('Taking a look at Sincere Questions')
print(t1.loc[t1['target'] == 0].sample(5)['question_text'])

print('Taking a look at Insincere Questions')
print(t1.loc[t1['target'] == 1].sample(5)['question_text'])

Looking at Sincere Questions
1032251    How is the infrastructure at Allen Kota and re...
717865     How do I convince my younger sibling that read...
204382     Should I visit a doctor after being kicked in ...
486891                   What is side reaction in chemistry?
1178030              Which is the best horror movie of 2017?

Looking at Insincere Questions
877440     Why are men more likely to accept sex from ran...
363728     Why do Indians have such a cheap thinking when...
1248044    Why do anti-gunners trust the government enough...
73326      Why do Indian moms hate sex life of her son or...
786647     During partition, why didn't Sikhs support Pak...

# TF-IDF features + multinomial Naive Bayes in one pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

pred_labels = model.predict(X_test)

from sklearn.metrics import confusion_matrix
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()


mat = confusion_matrix(y_test, pred_labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,  xticklabels=[0,1], yticklabels=[0,1])
plt.xlabel('true label')
plt.ylabel('predicted label');

In text classification, as we can see, most of our sentences have been classified as 1:1 or 0:0, which means they were recognized as true positives or true negatives. We got 4,282 false positives, which means they were negative in the test sample but classified as positive, and 1,478 false negatives, which means they were positive but classified as negative. Thus, we get an accuracy of 0.857443.
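The accuracy can be read off the confusion matrix (the diagonal, i.e. the correct predictions, divided by the total) or computed directly with scikit-learn:

from sklearn.metrics import accuracy_score

print(mat.trace() / mat.sum())              # (TN + TP) / total, from the confusion matrix
print(accuracy_score(y_test, pred_labels))  # the same value computed by scikit-learn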


Word cloud

Let’s see how word clouds show the differences between these groups.

from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords  # requires nltk.download('stopwords') the first time
stop_words = set(stopwords.words('english'))

We build a function to count word frequencies.

def word_freq_dict(text):
    wordList = [x for x in text.split() if x not in stop_words]  # convert text into a list of words, dropping stop words
    vocab = Counter(wordList)  # generate a word-frequency dictionary
    freq_dict = dict(vocab.most_common())
    return freq_dict

We select the data for word-frequency counting and split the test sentences into true positives, true negatives, false positives and false negatives.

dft = pd.DataFrame()
dft['txt'] = X_test
dft['target'] = y_test
dft['pred_labels'] = pred_labels
tp = dft.loc[(dft.target==1) & (dft.pred_labels==1)]['txt'].to_list()
tn = dft.loc[(dft.target==0) & (dft.pred_labels==0)]['txt'].to_list()
fp = dft.loc[(dft.target==0) & (dft.pred_labels==1)]['txt'].to_list()
fn = dft.loc[(dft.target==1) & (dft.pred_labels==0)]['txt'].to_list()

[Word clouds: true positives, false positives, true negatives and false negatives]

Below, the word cloud is generated for the false negatives (fn); the same code can be reused for tp, tn and fp.
words_sample = " ".join(fn)
words_freq = word_freq_dict(words_sample)

X_words_freq = dict(list(words_freq.items()))
wordcloud = WordCloud(width= 5000,
                      height=3000,
                      max_words=200,
                      colormap='Oranges',
                      background_color='white')

wordcloud.generate_from_frequencies(X_words_freq)
figure_size=(10,6)
plt.figure(figsize=figure_size)
plt.axis("off")
title = 'fn'
plt.title(title)
plt.imshow(wordcloud)
plt.show()

How do you like it? Please comment.
