Similarity measure in text corpus

Similarity measure in a text corpus is the distance between two vectors whose dimensions represent the features of two objects. Similarity is the measure of how alike or how different two objects are: the higher the measure, the greater the degree of similarity. It is usually expressed in the range 0 to 1 and is called the similarity score. Classification, automated document linking and spell correction all make use of similarity measures.
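As a quick illustration of such a 0-to-1 score, here is a minimal sketch using SequenceMatcher from Python's standard library (just an illustration of a similarity score, not the measure used later in this post):

from difflib import SequenceMatcher

# ratio() returns a similarity score between 0 and 1
print(SequenceMatcher(None, 'notebook computer', 'notebook computers').ratio())  # close to 1
print(SequenceMatcher(None, 'notebook computer', 'digital camera').ratio())      # much lower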

Inspired by Ethan Brown's "Product categorization and named entity recognition"

Data

I will practice on the following dataset: Product-data.csv. Before processing, the dataset has been converted to Excel and divided into three sheets: names, descriptions and prices.
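A hedged sketch of that conversion step (the CSV column names below are assumptions; only the product_names sheet name is confirmed by the code that follows):

import pandas as pd

df = pd.read_csv('Product-data.csv')

# write each column to its own sheet of one workbook
with pd.ExcelWriter('product_dataset.xlsx') as writer:
    df[['name']].to_excel(writer, sheet_name='product_names', index=False)
    df[['description']].to_excel(writer, sheet_name='descriptions', index=False)
    df[['price']].to_excel(writer, sheet_name='prices', index=False)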

The problem of similarity measure in text corpus

The problem is to extract or create product groups and differentiation factors from product titles (product names). The idea comes from marketplaces, where based on this information a program can group products into categories or compare prices by measuring the similarity of a title to products from other sellers.

Preprocessing

import pandas as pd


# input/output files
input_xls_file = 'product_dataset.xlsx'
output_xls_file = 'product_groups.xlsx'

dfp = pd.read_excel(input_xls_file,sheet_name='product_names')
print(dfp.columns)

#preprocessing

def preprocessing_text(text):    
    '''The function is to remove punctuation, stopwords and apply stemming'''
    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stop_words = stopwords.words('english')    
    porter = PorterStemmer()
    
    # eliminate punctuation, convert to lowercase, drop stopwords and apply stemming
    words = re.sub("[^a-zA-Z]", " ", text)
    words = [word.lower() for word in words.split() if word.lower() not in stop_words]
    words = [porter.stem(word) for word in words]
    return " ".join(words)


dfp['name'] = dfp['name'].astype(str)
dfp['name_clr'] = dfp['name'].apply(preprocessing_text)
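A quick sanity check of the cleaning step on a product name like those in the dataset (the stemmed forms below match the ones that appear in the results later on):

# punctuation goes first, then lowercasing, stopword removal and stemming
print(preprocessing_text('Sony VAIO CS Series Red Notebook Computer'))
# -> soni vaio cs seri red notebook comput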

Grouping

Jaccard similarity, or intersection over union, is defined as the size of the intersection divided by the size of the union of two sets: J(A, B) = |A ∩ B| / |A ∪ B|. Let's take an example of two sentences.

Sentence 1: "President greets the press in Chicago" and Sentence 2: "Obama speaks in Illinois" share no content words and will have a Jaccard score of 0. This is a terrible distance score, because the two sentences have very similar meanings. Here Jaccard similarity captures neither the semantic similarity nor the lexical relatedness of these two sentences.

Moreover, this approach has an inherent flaw: as the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.
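A quick worked example on two cleaned product names: A = {speck, black, toughskin, case} and B = {speck, green, toughskin, case} share 3 words, and there are 5 distinct words in total, so J(A, B) = 3/5 = 0.6.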

def tokenize_text(text):
    # deduplicate the tokens by converting to a set, then rejoin into a string
    tokens_set = set(text.split())
    return ' '.join(tokens_set)

# deduplicate tokens in each name
dfp['name_set'] = dfp['name_clr'].apply(tokenize_text)

# sort rows by the cleaned name
dfps = dfp.sort_values(by='name_clr')


def jaccard_similarity(query, document):
    # note: set(query) on a plain string yields a set of characters,
    # so passing unsplit strings gives character-level Jaccard;
    # pass lists of words (text.split()) for word-level Jaccard
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)
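The distinction between character-level and word-level input matters; a small illustration with two made-up names:

q = 'speck black case'
d = 'speck green case'
print(jaccard_similarity(q, d))                  # character sets: 7/12 ~ 0.58
print(jaccard_similarity(q.split(), d.split()))  # word sets: 2/4 = 0.5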

# compare every cleaned name against every other one (O(n^2));
# each pass adds one similarity column to the frame
i = 0
for q in dfps['name_clr']:
    dfps[i] = dfps['name_clr'].apply(lambda x: jaccard_similarity(q,x))
    i += 1
    print(i/len(dfps['name_clr']))  # progress indicator

best_score = []
for i in range(len(dfps['name_clr'])):
    # names and scores with similarity above the 0.9 threshold
    bsn = dfps[dfps[i]>0.9]['name_clr'].to_list()
    bss = dfps[dfps[i]>0.9][i].to_list()
    # iloc matches column i, which was built in sorted row order
    best_score.append({'q':dfps['name_clr'].iloc[i],'bsn':bsn, 'bss':bss})
                    
xx = pd.DataFrame(best_score)
xx.to_excel('xx.xlsx')

Results

You can print the results with this code:

for i, r in xx.iterrows():
    print(r['q'])
    for k,j in zip(r.bsn, r.bss):
        if j<1:  # skip the self-match (score 1.0)
            print('-',k,j) 

It prints the results as follows:

soni vaio cs seri red notebook comput - vgncs110er
- soni vaio cs seri red notebook comput - vgncs180jr 0.9130434782608695
- soni vaio rt seri black all-in-on desktop comput - vgcrt150i 0.9130434782608695

speck clear 13' macbook see thru hard shell case - mb13clrseev2
- speck green 13' macbook see thru hard shell case - mb13grnseev2 0.9166666666666666
- speck seethru aqua hard shell case 13' macbook - mb13aquseev2 0.9565217391304348
- speck seethru blue hard shell case 15' macbook - mb15bluseev2 0.9130434782608695
- speck seethru clear hard shell case 15' macbook - mb15clrseev2 0.9130434782608695
- speck seethru orang hard shell case 13' macbook - mb13orgseev2 0.9166666666666666
- speck seethru pink hard shell case 13' macbook - mb13pnkseev2 0.9166666666666666
- speck seethru purpl hard shell case 15' macbook - mb15purseev2 0.9130434782608695

Optimisation

Passing raw strings to jaccard_similarity compares sets of characters; splitting the names into words first gives the word-level similarity we actually want:

# two example pairs; the second assignment overwrites the first,
# so only the last pair is evaluated below
q1 = 'yamaha rx-v363bl 5.1 channel digit home theater receiv black - rxv363bk'
d1 = 'nikon black 13.5 megapixel coolpix digit camera - coolpixp6000bk'

q1 = 'speck black toughskin case iphon 3g - iph3gblkt'
d1 = 'speck black toughskin ipod classic case - icblkt'

# split first, so Jaccard is computed over words rather than characters
jaccard_similarity(q1.split(), d1.split())


i = 0
dfpst = []
bsnt = []
for q in dfps['name_clr']:
    print(i/len(dfps['name_clr']),q)  # progress indicator
    # word-level similarity of q against every cleaned name
    dfpst = dfps['name_clr'].apply(lambda x: jaccard_similarity(q.split(),x.split()))
    # keep (name, score) pairs above the 0.4 threshold
    bs = [(k,j) for j, k in zip(dfpst, dfps['name_clr']) if j >= 0.4]
    bsnt.append({'q':q, 'bsn': bs})
    i += 1
            
xx1 = pd.DataFrame(bsnt)
xx1.to_excel('xx1.xlsx')

Case study

The problem of extracting features from unstructured textual data can be given different names depending on the circumstances and the desired outcome. Generally, we can split the tasks into two camps: sequence classification and sequence tagging.

In sequence classification, we take a text fragment (usually a sentence, up to an entire document) and try to project it into a categorical space. This is considered many-to-one classification, in that we take a set of many features and produce a single output; a minimal sketch follows below.
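As a minimal sketch of many-to-one classification on titles like the ones above (scikit-learn and the toy category labels are assumptions for illustration, not part of the original pipeline):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training set: a whole title (many features) maps to one category label
titles = ['soni vaio red notebook comput', 'speck hard shell case macbook',
          'nikon coolpix digit camera', 'speck toughskin case iphon']
labels = ['laptop', 'accessory', 'camera', 'accessory']

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(titles), labels)
print(clf.predict(vec.transform(['speck seethru pink case macbook'])))  # likely ['accessory']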

Sequence tagging, on the other hand, is often considered a many-to-many problem, since you take in an entire sequence and attempt to apply a label to each element of it. An example of sequence tagging is part-of-speech labeling, where one attempts to label the part of speech of each word in a sentence; see the sketch after this paragraph. Other methods that fall into this camp include chunking (breaking a sentence into relational components) and named entity recognition (extracting pre-specified features like geographic locations or proper names).
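A minimal sequence-tagging sketch with NLTK (assuming the punkt and averaged_perceptron_tagger resources have been fetched with nltk.download beforehand):

import nltk

sentence = 'Obama speaks in Illinois'
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))  # one (word, tag) pair per token, e.g. ('Obama', 'NNP')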


An example of text mining can be found in the article Organizational aspiration for social impact.
