Similarity measure in text corpus – Python secrets for analytics
A similarity measure in a text corpus is computed from the distance between two vectors whose dimensions represent the features of the two objects being compared. Similarity is the measure of how alike or how different two objects are: the smaller the distance, the greater the degree of similarity. Usually it is measured in the range 0 to 1 and is called the similarity score. Classification, automated document linking and spell correction all make use of similarity measures.
Inspired by: Ethan Brown: Product categorization and named entity recognition
Data
I will practice on the following dataset: Product-data.csv. Before processing, the dataset was converted to Excel and divided into three sheets: names, descriptions and prices.
The problem of similarity measure in text corpus
The problem is to extract or create product groups and differentiation factors from product titles (product names). The idea comes from marketplaces, where this information lets a program group products into categories, or compare prices across different sellers by measuring the similarity of their product titles.
Preprocessing
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# input/output files
input_xls_file = 'product_dataset.xlsx'
output_xls_file = 'product_groups.xlsx'

dfp = pd.read_excel(input_xls_file, sheet_name='product_names')
print(dfp.columns)

# preprocessing
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

def preprocessing_text(text):
    '''Remove stopwords, convert to lowercase and apply stemming.'''
    # lowercase each token and drop English stopwords; punctuation such as
    # '-' is kept as a separate token, as the grouped results below show
    words = [word.lower() for word in text.split() if word.lower() not in stop_words]
    # reduce each word to its stem, e.g. 'computer' -> 'comput'
    words = [porter.stem(word) for word in words]
    return " ".join(words)

dfp['name'] = dfp['name'].astype(str)
dfp['name_clr'] = dfp['name'].apply(preprocessing_text)
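A quick sanity check of the cleaning step (the raw title below is a hypothetical example in the style of the dataset; NLTK's stopwords corpus must be downloaded first via nltk.download('stopwords')):
print(preprocessing_text('Sony VAIO CS Series Red Notebook Computer - VGNCS110ER'))
# expected output: 'soni vaio cs seri red notebook comput - vgncs110er'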
Grouping
Jaccard similarity, or intersection over union, is defined as the size of the intersection divided by the size of the union of two sets. Let's take an example of two sentences.
Sentence 1: President greets the press in Chicago and Sentence 2: Obama speaks in Illinois. After stopword removal these sentences share no words, so they get a Jaccard score of 0. That is a terrible score, because the two sentences have very similar meanings: Jaccard similarity captures neither the semantic nor the lexical relationship between them.
Moreover, this approach has an inherent flaw: as the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.
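A minimal sketch of the example above, computing word-level Jaccard similarity directly on the two sentences (with a tiny stopword set just for this illustration):
stop_words_demo = {'the', 'in'}  # minimal stopword set for the illustration
s1 = set("President greets the press in Chicago".lower().split()) - stop_words_demo
s2 = set("Obama speaks in Illinois".lower().split()) - stop_words_demo
# intersection over union: no shared words left, so the score is 0
print(len(s1 & s2) / len(s1 | s2))  # 0.0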
def tokenize_text(text):
    # deduplicate tokens by converting to a set, then re-join into a string
    tokens_set = set(text.split())
    return ' '.join(tokens_set)

# tokenize and sort
dfp['name_set'] = dfp['name_clr'].apply(tokenize_text)
dfps = dfp.sort_values(by='name_clr')
def jaccard_similarity(query, document):
    # size of the intersection divided by size of the union
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)

# pairwise scores: one new column per product
# note: q and x are whole strings here, so set() builds sets of characters
# and this loop computes character-level Jaccard similarity
i = 0
for q in dfps['name_clr']:
    dfps[i] = dfps['name_clr'].apply(lambda x: jaccard_similarity(q, x))
    i += 1
    print(i / len(dfps['name_clr']))  # progress indicator
# for each product, collect all products with a score above 0.9
best_score = []
for i in range(len(dfps['name_clr'])):
    bsn = dfps[dfps[i] > 0.9]['name_clr'].to_list()
    bss = dfps[dfps[i] > 0.9][i].to_list()
    # positional indexing: after sort_values the index is no longer 0..n-1
    best_score.append({'q': dfps['name_clr'].iloc[i], 'bsn': bsn, 'bss': bss})

xx = pd.DataFrame(best_score)
xx.to_excel('xx.xlsx')
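As a sanity check, the top score in the first group of the results below can be reproduced by hand; the two strings share 21 of their 23 distinct characters:
a = 'soni vaio cs seri red notebook comput - vgncs110er'
b = 'soni vaio cs seri red notebook comput - vgncs180jr'
print(jaccard_similarity(a, b))  # 21/23 = 0.9130434782608695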
Results
You can print the results with this code:
for i, r in xx.iterrows():
    print(r['q'])
    for k, j in zip(r.bsn, r.bss):
        if j < 1:  # a score of 1 is the product matched with itself
            print('-', k, j)
It prints the results as follows:
soni vaio cs seri red notebook comput - vgncs110er
- soni vaio cs seri red notebook comput - vgncs180jr 0.9130434782608695
- soni vaio rt seri black all-in-on desktop comput - vgcrt150i 0.9130434782608695
speck clear 13' macbook see thru hard shell case - mb13clrseev2
- speck green 13' macbook see thru hard shell case - mb13grnseev2 0.9166666666666666
- speck seethru aqua hard shell case 13' macbook - mb13aquseev2 0.9565217391304348
- speck seethru blue hard shell case 15' macbook - mb15bluseev2 0.9130434782608695
- speck seethru clear hard shell case 15' macbook - mb15clrseev2 0.9130434782608695
- speck seethru orang hard shell case 13' macbook - mb13orgseev2 0.9166666666666666
- speck seethru pink hard shell case 13' macbook - mb13pnkseev2 0.9166666666666666
- speck seethru purpl hard shell case 15' macbook - mb15purseev2 0.9130434782608695
Optimisation
# example pairs for comparing character-level and word-level scores
q1 = 'yamaha rx-v363bl 5.1 channel digit home theater receiv black - rxv363bk'
d1 = 'nikon black 13.5 megapixel coolpix digit camera - coolpixp6000bk'

q1 = 'speck black toughskin case iphon 3g - iph3gblkt'
d1 = 'speck black toughskin ipod classic case - icblkt'

# splitting into words first makes this word-level Jaccard similarity
jaccard_similarity(q1.split(), d1.split())
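For the second pair above, the difference between the two granularities is striking: the titles describe different products, yet they share 17 of their 19 distinct characters while sharing only 5 of their 11 distinct words.
print(jaccard_similarity(q1, d1))                  # character-level: 17/19, about 0.89
print(jaccard_similarity(q1.split(), d1.split()))  # word-level: 5/11, about 0.45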
# rerun the pairwise comparison at word level, keeping only scores >= 0.4
i = 0
bsnt = []
for q in dfps['name_clr']:
    print(i / len(dfps['name_clr']), q)  # progress indicator
    dfpst = dfps['name_clr'].apply(lambda x: jaccard_similarity(q.split(), x.split()))
    bs = [(k, j) for j, k in zip(dfpst, dfps['name_clr']) if j >= 0.4]
    bsnt.append({'q': q, 'bsn': bs})
    i += 1

xx1 = pd.DataFrame(bsnt)
xx1.to_excel('xx1.xlsx')
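Splitting every name on every pass is still wasteful, since q.split() and x.split() are recomputed inside the loop. A further speed-up, sketched here as a suggestion rather than as part of the original code, is to build each token set once and reuse it:
# precompute one token set per product name
names = dfps['name_clr'].tolist()
token_sets = [set(name.split()) for name in names]

rows = []
for q_name, q_set in zip(names, token_sets):
    bs = []
    for d_name, d_set in zip(names, token_sets):
        score = len(q_set & d_set) / len(q_set | d_set)
        if score >= 0.4:
            bs.append((d_name, score))
    rows.append({'q': q_name, 'bsn': bs})

xx2 = pd.DataFrame(rows)  # same structure as xx1 above
xx2.to_excel('xx2.xlsx')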
Case study
The problem of extracting features from unstructured textual data goes by different names depending on the circumstances and the desired outcome. Generally, we can split the tasks into two camps: sequence classification and sequence tagging.
In sequence classification, we take a text fragment (usually anything from a sentence up to an entire document) and try to project it into a categorical space. This is considered many-to-one classification, in that we take a set of many features and produce a single output.
Sequence tagging, on the other hand, is often considered a many-to-many problem, since you take in an entire sequence and attempt to apply a label to each element of it. An example of sequence tagging is part-of-speech labeling, where one attempts to label the part of speech of each word in a sentence. Other methods that fall into this camp include chunking (breaking a sentence into relational components) and named entity recognition (extracting pre-specified features like geographic locations or proper names); a small tagging sketch follows below.
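A minimal sketch of sequence tagging using NLTK's part-of-speech tagger (an illustration only, not part of the product pipeline; the sample sentence is invented, and the tokenizer and tagger resources must be downloaded first):
import nltk

# one-time downloads, uncomment on first run:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

sentence = "Speck ships a clear hard shell case for the 13-inch MacBook"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# one tag per token, e.g. [('Speck', 'NNP'), ('ships', 'VBZ'), ('a', 'DT'), ...]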
An example of text mining can be found in the article Organizational aspiration for social impact.