Naive Bayes
What is Naive Bayes?
Naive Bayes is a classification algorithm based on Bayes’ Theorem and is widely used for text classification. It works with two quantities, the prior probability and the posterior probability: the posterior probability equals the prior probability times the likelihood ratio. It lets you calculate the probability of an event, such as a car being stolen, based on evidence, even if the exact case being predicted does not appear in the data set. With Python, you can apply the Naive Bayes algorithm to detect whether a text message belongs to a certain group, for example whether it is spam.
The “Naive” part of the name refers to the assumption that each feature in a class is unrelated to (independent of) every other feature, even if in reality the features depend on each other. All features are assumed to contribute independently and equally to the probability that an observation belongs to class A or class B.
Bayes’ theorem describes the conditional probability of an event based on prior knowledge of conditions that might be related to the event. It relates the probability of the hypothesis before getting the evidence, P(H), to the probability of the hypothesis after getting the evidence, P(H|E): P(H|E) = P(E|H)*P(H)/P(E). P(H) is called the prior probability, while P(H|E) is called the posterior probability. The factor that relates the two, P(E|H) / P(E), is called the likelihood ratio.
This leads us to the practical statement: “The posterior probability equals the prior probability times the likelihood ratio”.

Example: What is the probability of a disaster fire when you see smoke?
- A disaster fire is rare: 1% of all fires (most are fireplaces and barbecues)
- Smoke is commonly seen: in 50% of all fires
- A disaster fire usually has visible smoke: in 90% of cases
P(Fire | Smoke) = P(Fire) * P(Smoke | Fire) / P(Smoke) = 1% * 90% / 50% = 1.8%
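The arithmetic of the smoke example can be checked with a few lines of Python. This is a minimal sketch: the function name `posterior` and the variable names are chosen here for illustration, and the three input probabilities are the ones listed in the example.

```python
def posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    if evidence <= 0:
        raise ValueError("P(E) must be positive")
    return likelihood * prior / evidence


p_fire = 0.01              # P(Fire): disaster fires are rare
p_smoke = 0.50             # P(Smoke): smoke is commonly seen
p_smoke_given_fire = 0.90  # P(Smoke | Fire)

p_fire_given_smoke = posterior(p_fire, p_smoke_given_fire, p_smoke)
print(f"P(Fire | Smoke) = {p_fire_given_smoke:.1%}")
```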
What is the probability that your car will be stolen?

Here is a dataset in which we need to classify whether a car with the given features will be stolen or not. The columns represent features, and each row represents an individual car. In the first row of the dataset, the car is stolen, its Color is Red, its Type is Sports and its Model is BMW. We want to classify whether a Red BMW SUV is likely to be stolen and with what probability. Note that there is no example of a Red BMW SUV in our data set.
The posterior probability P(y|X) can be calculated in three steps: first, create a frequency table for each attribute against the target; then, convert the frequency tables into likelihood tables; finally, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction. Below are the frequency and likelihood tables for all three predictors.


The probability of a car being stolen can now be estimated, so we calculate it for several combinations of features. As you can see, the probability that a Red BMW SUV is stolen is 17%. The calculation starts from the general (prior) probability of 65%; the probability for Red is lower than the general one (54%), and the probabilities for SUV and BMW are 38% and 54% respectively.
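The three steps (frequency tables, likelihood tables, posterior) can be sketched in plain Python. The rows below are a hypothetical toy data set invented for illustration, not the data set used in this article, so the resulting numbers differ from the ones quoted above. No smoothing is applied, so a feature value never seen with a class gives that class a probability of zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows (color, type, model, stolen); NOT the article's data.
rows = [
    ("Red", "Sports", "BMW", "Yes"),
    ("Red", "Sports", "Audi", "No"),
    ("Yellow", "SUV", "BMW", "No"),
    ("Yellow", "Sports", "Audi", "Yes"),
    ("Red", "SUV", "Audi", "No"),
    ("Yellow", "SUV", "BMW", "Yes"),
]
feature_names = ["color", "type", "model"]

# Step 1: frequency tables, one Counter per (feature, class) pair.
class_counts = Counter(row[-1] for row in rows)
freq = defaultdict(Counter)
for *features, label in rows:
    for name, value in zip(feature_names, features):
        freq[(name, label)][value] += 1


# Step 2: likelihood P(value | class), read off the frequency tables.
def likelihood(name, value, label):
    return freq[(name, label)][value] / class_counts[label]


# Step 3: posterior for each class, normalized so the classes sum to 1.
def posteriors(query):
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(class)
        for name, value in query.items():
            score *= likelihood(name, value, label)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# A Red BMW SUV does not appear in the toy rows either.
result = posteriors({"color": "Red", "type": "SUV", "model": "BMW"})
print(result)
```

The predicted class is the key with the largest value in `result`.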
Naive Bayes with Python
You can apply the Naive Bayes algorithm to text classification and detect whether a text message belongs to a certain group, for example detect if it is spam.
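Before using a library, it may help to see the idea on a handful of messages. The sketch below is a simplified word-count Naive Bayes with add-one smoothing, written for illustration only; it is not scikit-learn's implementation, and the messages, `toy_messages`, `log_score` and `predict` are all hypothetical names and data.

```python
import math
from collections import Counter

# Hypothetical toy messages; label 1 = spam, 0 = not spam.
toy_messages = [
    ("win a free prize now", 1),
    ("free money win big", 1),
    ("are we meeting for lunch today", 0),
    ("please send the report today", 0),
]

# Count words per class and messages per class.
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in toy_messages:
    word_counts[label].update(text.lower().split())
    doc_counts[label] += 1

vocab = set(word_counts[0]) | set(word_counts[1])


def log_score(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(toy_messages))
    for word in text.lower().split():
        if word not in vocab:
            continue  # ignore words never seen in training
        score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score


def predict(text):
    """Return the class with the highest log score."""
    return max((0, 1), key=lambda label: log_score(text, label))


print(predict("free prize"), predict("lunch today"))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long texts.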
Data sample: Kaggle/Quora
import numpy as np
import pandas as pd
import os
f_csv = "C:/Users/prac/Documents/Programy/NaiveBayes/"
data_set = pd.read_csv(f_csv+"train.csv")
print(data_set.sample(10))
print(data_set.info())
print(data_set.groupby('target').count())
# we want to run our experiment on the same number of positive and negative examples
t1 = data_set[data_set["target"]==0].sample(80810).copy(deep=True)
print(t1.shape)
print(t1.groupby('target').count())
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
t1 = pd.concat([t1, data_set[data_set.target==1]], ignore_index=True)
print(t1.shape)
print(t1.groupby('target').count())
print(t1.sample(3))
print(t1.columns)
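The balancing step above hard-codes the sample size. As a hedged alternative, the sketch below downsamples every class to the size of the smallest one, so no count has to be typed in by hand. The helper name `balance_classes` and the tiny `toy` frame are hypothetical; only the column names `question_text` and `target` come from the code above.

```python
import pandas as pd


def balance_classes(df, label_col="target", random_state=0):
    """Downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    balanced = pd.concat(parts, ignore_index=True)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Hypothetical toy frame: 8 negative rows and 2 positive rows.
toy = pd.DataFrame({
    "question_text": [f"q{i}" for i in range(10)],
    "target": [0] * 8 + [1] * 2,
})
balanced = balance_classes(toy)
print(balanced["target"].value_counts())
```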
We would like to split the dataset into training and test samples
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
X_train, X_test, y_train, y_test = train_test_split(t1['question_text'], t1['target'], test_size=0.25)
# sample a few questions of each class from the full data set loaded above
print('Looking at Sincere Questions')
print(data_set.loc[data_set['target'] == 0].sample(5)['question_text'])
print('Looking at Insincere Questions')
print(data_set.loc[data_set['target'] == 1].sample(5)['question_text'])
Looking at Sincere Questions
1032251 How is the infrastructure at Allen Kota and re...
717865 How do I convince my younger sibling that read...
204382 Should I visit a doctor after being kicked in ...
486891 What is side reaction in chemistry?
1178030 Which is the best horror movie of 2017?
Looking at Insincere Questions
877440 Why are men more likely to accept sex from ran...
363728 Why do Indians have such a cheap thinking when...
1248044 Why do anti-gunners trust the government enough...
73326 Why do Indian moms hate sex life of her son or...
786647 During partition, why didn't Sikhs support Pak...
# TF-IDF features fed into a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
pred_labels = model.predict(X_test)
from sklearn.metrics import confusion_matrix
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
mat = confusion_matrix(y_test, pred_labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, xticklabels=[0,1], yticklabels=[0,1])
plt.xlabel('true label')
plt.ylabel('predicted label');

In text classification, as we can see, most of our sentences have been classified as 1:1 or 0:0, which means that they were recognized as true positives or true negatives. We got 4,282 false positives, which means they were negative in the test sample but classified as positive, and 1,478 false negatives, which means they were positive but classified as negative. Thus, we have an accuracy of 0.857443.
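The accuracy figure can be reproduced from the error counts alone. The sketch below assumes the test set holds 40,405 rows, that is 25% of a balanced set of 2 × 80,810 questions; this size is inferred from the sampling code and is an assumption, not a number reported in the text. The helper name `accuracy_from_errors` is hypothetical.

```python
def accuracy_from_errors(n_samples: int, false_pos: int, false_neg: int) -> float:
    """Accuracy = 1 - (misclassified samples / all samples)."""
    return 1 - (false_pos + false_neg) / n_samples


n_test = 40405  # assumption: 25% of 2 * 80,810 balanced rows
acc = accuracy_from_errors(n_test, false_pos=4282, false_neg=1478)
print(round(acc, 6))
```

With the fitted model at hand, `sklearn.metrics.accuracy_score(y_test, pred_labels)` computes the same quantity directly from the labels.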
Word cloud
Let’s see how the differences between those data sets show up in word clouds
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
# the corpus name is lowercase; run nltk.download('stopwords') once beforehand
stop_words = set(stopwords.words('english'))
We build a function to count word frequencies
def word_freq_dict(text):
    wordList = [x for x in text.split()] # if x not in stop_words] # Convert text into word list
    vocab = Counter(wordList) # Generate word freq dictionary
    freq_dict = dict(vocab.most_common())
    return freq_dict
We select data for word frequency counting
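The stop-word filter is commented out in the function above. A variant with the filter switched on might look like the sketch below. The function name `word_freq_filtered` and the tiny `DEMO_STOP_WORDS` set are hypothetical stand-ins so the example runs without NLTK; in the article's setting you would pass the NLTK `stop_words` set instead.

```python
from collections import Counter

# Hypothetical mini stop-word list for illustration; the NLTK list is much longer.
DEMO_STOP_WORDS = {"the", "a", "is", "of", "in"}


def word_freq_filtered(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase the text, drop stop words, and count the remaining words."""
    words = [w for w in text.lower().split() if w not in stop_words]
    # most_common() sorts by frequency, so the dict keeps that order.
    return dict(Counter(words).most_common())


freqs = word_freq_filtered("The cat is in the hat of the cat")
print(freqs)
```

Lowercasing before the lookup matters because the stop-word list is lowercase, while questions usually start with a capital letter.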
dft = pd.DataFrame()
dft['txt'] = X_test
dft['target'] = y_test
dft['pred_labels'] = pred_labels
tp = dft.loc[(dft.target==1) & (dft.pred_labels==1)]['txt'].to_list()
tn = dft.loc[(dft.target==0) & (dft.pred_labels==0)]['txt'].to_list()
fp = dft.loc[(dft.target==0) & (dft.pred_labels==1)]['txt'].to_list()
fn = dft.loc[(dft.target==1) & (dft.pred_labels==0)]['txt'].to_list()




words_sample = " ".join(fn)
words_freq = word_freq_dict(words_sample)
X_words_freq = dict(list(words_freq.items()))
wordcloud = WordCloud(width=5000,
                      height=3000,
                      max_words=200,
                      colormap='Oranges',
                      background_color='white')
wordcloud.generate_from_frequencies(X_words_freq)
figure_size=(10,6)
plt.figure(figsize=figure_size)
plt.axis("off")
title = 'fn'
plt.title(title)
plt.imshow(wordcloud)
plt.show()
How do you like it? Please leave a comment.
