<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Data Science &#8211; Customer Experience Management</title>
	<atom:link href="https://mietwood.com/category/data-science/feed" rel="self" type="application/rss+xml" />
	<link>https://mietwood.com</link>
	<description>Customer Experience Can Be Managed</description>
	<lastBuildDate>Wed, 31 Dec 2025 11:44:55 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://mietwood.com/wp-content/uploads/2022/09/cropped-Fav7-32x32.png</url>
	<title>Data Science &#8211; Customer Experience Management</title>
	<link>https://mietwood.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Product Grouping in Noisy E-commerce Datasets</title>
		<link>https://mietwood.com/product-grouping-in-noisy-e-commerce-datasets</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Tue, 30 Dec 2025 09:34:33 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3438</guid>

					<description><![CDATA[<p>Hybrid Approaches to Product Grouping in Noisy E-commerce Datasets: A Comparative Analysis of Set-Theoretic vs. Vector Space Models. In scientific literature, this problem is formally known as Short Text Clustering (STC) or Product Entity Resolution. The product similarity measures can be read from here: Measuring product similarity &#8211; 5 important secrets of python programming. Machine...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/product-grouping-in-noisy-e-commerce-datasets">Product Grouping in Noisy E-commerce Datasets</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>This post compares hybrid approaches to product grouping in noisy e-commerce datasets, contrasting set-theoretic and vector space models. In the scientific literature, this problem is formally known as <strong>Short Text Clustering (STC)</strong> or <strong>Product Entity Resolution</strong>. Product similarity measures are covered in <a href="https://mietwood.com/measuring-product-similarity">Measuring product similarity &#8211; 5 important secrets of python programming</a>, and a machine-learning approach to product clustering is described in <a href="https://mietwood.com/hierarchical-agglomerative-clustering-for-product-grouping">Hierarchical Agglomerative Clustering for Product Grouping</a>.</p>



<h3 class="wp-block-heading">The &#8220;Marketplace Deduplication&#8221; Problem</h3>



<p>Context: Large marketplaces (Amazon, eBay, Alibaba, or a niche aggregator) allow thousands of third-party sellers to upload their own product feeds.</p>



<p>The Problem:</p>



<ul class="wp-block-list">
<li>Seller A uploads: <em>&#8220;Apple iPhone 13, 128GB, Midnight&#8221;</em></li>



<li>Seller B uploads: <em>&#8220;iPhone 13 128 GB Black Unlocked&#8221;</em></li>



<li>Seller C uploads: <em>&#8220;Smartfon Apple iPhone 13 128GB (MLPF3PM/A)&#8221;</em></li>
</ul>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>Why it&#8217;s hard &#8211; The Scientific Challenge</strong></summary>
<ul class="wp-block-list">
<li><strong>Missing Identifiers:</strong> Sellers often omit EAN/UPC codes to avoid price comparisons.</li>



<li><strong>Attribute Noise:</strong> &#8220;Midnight&#8221; vs. &#8220;Black&#8221; (synonyms).</li>



<li><strong>Goal:</strong> You must cluster these into a <strong>single catalog entry</strong> (The &#8220;Golden Record&#8221;) to show the user one product page with a list of 3 sellers, rather than 3 separate search results.</li>



<li><strong>Metric:</strong> <em>False Positives</em> are costly here (grouping an iPhone 13 <strong>Pro</strong> with a regular iPhone 13 causes returns).</li>
</ul>
</details>



<h3 class="wp-block-heading">The &#8220;Omnichannel Customer Stitching&#8221; Problem (Single Customer View)</h3>



<p>Context: A retailer sells through a Website, a Mobile App, and Physical Stores. They want to know if the person browsing the app is the same person buying in the store.</p>



<p>The Problem:</p>



<ul class="wp-block-list">
<li><strong>Record A (Online):</strong> <code>email: j.smith@gmail.com</code>, <code>cookie_id: xyz123</code>, <code>behavior: viewed running shoes</code></li>



<li><strong>Record B (In-Store POS):</strong> <code>card_hash: ****-1234</code>, <code>loyalty_id: 998877</code>, <code>name: John Smith</code></li>



<li><strong>Record C (Customer Support):</strong> <code>phone: +48 500...</code>, <code>name: Johnny Smith</code>, <code>complaint: "Shoes size 42 too small"</code></li>
</ul>






<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>Why it&#8217;s hard</strong></summary>
<ul class="wp-block-list">
<li><strong>Disjoint Attributes:</strong> Record A has no phone number; Record C has no email. You need &#8220;transitive linking&#8221; (A links to B, B links to C $\rightarrow$ A links to C).</li>



<li><strong>Privacy/Hashing:</strong> You are often matching hashed values or partial PII (Personally Identifiable Information).</li>



<li><strong>Goal:</strong> Create a <strong>Customer 360</strong> profile to send a targeted email: <em>&#8220;Hi John, sorry the size 42 didn&#8217;t fit. Here is a discount for size 43.&#8221;</em></li>
</ul>
</details>
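<p>The transitive linking described above can be sketched with a union-find (disjoint-set) structure. This is a minimal illustration, not a production identity-resolution system: the <code>stitch</code> helper, the record layout, and the sample values (including the shared email on Record B and the phone number) are hypothetical, and it assumes exact matches on identifier values.</p>

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure used to merge records that share an identifier."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def stitch(records):
    """Group record ids that are connected through any shared identifier value."""
    uf = UnionFind()
    first_seen = {}  # (field, value) -> first record id that carried it
    for rec_id, fields in records.items():
        uf.find(rec_id)  # register the record even if it links to nothing
        for key in fields.items():
            if key in first_seen:
                uf.union(rec_id, first_seen[key])
            else:
                first_seen[key] = rec_id
    groups = defaultdict(set)
    for rec_id in records:
        groups[uf.find(rec_id)].add(rec_id)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical data: A and B share an email, B and C share a loyalty id,
# so A and C end up in one profile although they share no attribute.
records = {
    "A": {"email": "j.smith@gmail.com", "cookie_id": "xyz123"},
    "B": {"email": "j.smith@gmail.com", "loyalty_id": "998877"},
    "C": {"loyalty_id": "998877", "phone": "555-0100"},
    "D": {"email": "someone.else@example.com"},
}
profiles = stitch(records)
print(profiles)
```

<p>With hashed or partial PII the equality test would have to be replaced by a fuzzy or probabilistic match, which this sketch does not attempt.</p>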



<h3 class="wp-block-heading">The &#8220;Competitor Price Monitoring&#8221; Problem</h3>



<p>Context: An e-commerce store wants to automatically adjust its prices to be $1 cheaper than its biggest competitor.</p>



<p>The Problem:</p>



<ul class="wp-block-list">
<li><strong>Your Product:</strong> <em>&#8220;Samsung Galaxy S20 FE 5G Cloud Navy&#8221;</em></li>



<li><strong>Competitor Site:</strong> <em>&#8220;Samsung S20 Fan Edition (Navy) &#8211; 5G Compatible&#8221;</em></li>
</ul>






<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>Why it&#8217;s hard &#8211; The Scientific Challenge:</strong></summary>
<ul class="wp-block-list">
<li><strong>Adversarial Data:</strong> Competitors deliberately make small changes to product names or use unique internal SKUs to prevent scraping and matching.</li>



<li><strong>Asymmetry:</strong> You have your full database (structured), but the competitor data is scraped (unstructured, noisy HTML).</li>



<li><strong>Goal:</strong> Map competitor SKUs to your SKUs with high precision. If you map to the wrong product (e.g., a cheaper &#8220;Lite&#8221; version), your dynamic pricing algorithm will lower your price too much and you lose money.</li>
</ul>
</details>



<h4 class="wp-block-heading"><strong>Abstract</strong></h4>



<ul class="wp-block-list">
<li><strong>Problem:</strong> E-commerce catalogs suffer from redundancy (same product, different sizes/variants).</li>



<li><strong>Gap:</strong> Manual grouping is impossible; Deep Learning is overkill/imprecise for strict SKU grouping.</li>



<li><strong>Method:</strong> We compare Jaccard (Set) vs. TF-IDF (Vector) and propose a normalization pipeline.</li>



<li><strong>Result:</strong> Our method achieved X% accuracy with Y% reduction in computational time.</li>
</ul>



<h4 class="wp-block-heading"><strong>Set-Theoretic Approaches for Product Grouping (Jaccard)</strong></h4>



<ul class="wp-block-list">
<li><strong>Concept:</strong> Treats text as an unweighted set of tokens (a &#8220;Bag of Words&#8221; (BoW) without counts or weights).</li>



<li><strong>Key Papers/Concepts:</strong>
<ul class="wp-block-list">
<li><em>Cohen et al. (2003)</em> discuss string metrics for entity matching.</li>



<li><strong>Shingling / MinHash:</strong> In large datasets, calculating Jaccard for all pairs is $O(N^2)$. Literature focuses on <strong>Locality Sensitive Hashing (LSH)</strong> (MinHash) to approximate Jaccard similarity efficiently.</li>



<li><strong>Pros in Literature:</strong> High interpretability, excellent for &#8220;near-duplicate&#8221; detection.</li>



<li><strong>Cons:</strong> Fails when synonyms are used (e.g., &#8220;pants&#8221; vs &#8220;trousers&#8221;) or when word importance varies.</li>
</ul>
</li>
</ul>
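<p>As a small illustration of the set-theoretic view, the sketch below computes Jaccard similarity on the token sets of two of the listings from the marketplace example. The <code>tokens</code> and <code>jaccard</code> helpers are names chosen here for illustration, and the simple alphanumeric tokenizer is an assumption.</p>

```python
import re


def tokens(title):
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


seller_a = tokens("Apple iPhone 13, 128GB, Midnight")
seller_b = tokens("iPhone 13 128 GB Black Unlocked")

score = jaccard(seller_a, seller_b)
print(sorted(seller_a & seller_b), round(score, 2))
```

<p>Only <code>iphone</code> and <code>13</code> overlap: &#8220;128GB&#8221; and &#8220;128 GB&#8221; tokenize differently, and &#8220;Midnight&#8221; and &#8220;Black&#8221; are unrelated strings to a set-based measure, which is the weakness listed under Cons.</p>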



<h4 class="wp-block-heading"><strong>Vector Space Models (TF-IDF)</strong></h4>



<ul class="wp-block-list">
<li><strong>Concept:</strong> Maps each text to a weighted term vector in a high-dimensional space, where similarity is measured by the angle between vectors.</li>



<li><strong>Key Papers/Concepts:</strong>
<ul class="wp-block-list">
<li><em>Salton et al. (1975)</em> (The foundational VSM paper).</li>



<li><strong>Character n-grams:</strong> Papers often report that for noisy user-generated content (UGC), character n-grams outperform word tokens, because a misspelling changes only a few n-grams instead of replacing the whole token.</li>



<li><strong>Pros:</strong> Handles &#8220;rare words&#8221; (like model numbers) better due to IDF (Inverse Document Frequency) weighting.</li>
</ul>
</li>
</ul>



<h4 class="wp-block-heading"><strong>The State-of-the-Art (Deep Learning / Embeddings)</strong></h4>



<ul class="wp-block-list">
<li>If you publish, reviewers will ask: <em>&#8220;Why not BERT?&#8221;</em></li>



<li><strong>SBERT (Sentence-BERT):</strong> Current SOTA uses transformer models to generate dense vector embeddings.</li>



<li><strong>Your Counter-Argument:</strong> Deep learning is computationally expensive and &#8220;black-box&#8221;. For industrial product grouping where <em>exact</em> feature matching (like &#8220;Samsung&#8221; + &#8220;Galaxy&#8221;) is critical, classical methods (TF-IDF/Jaccard) often offer better precision and control than semantic embeddings which might group &#8220;iPhone 12&#8221; with &#8220;Samsung S20&#8221; because they are both &#8220;phones&#8221;.</li>
</ul>



<h4 class="wp-block-heading"><strong>II. Related Work</strong></h4>



<ul class="wp-block-list">
<li>Mention <strong>LSH</strong> (Locality Sensitive Hashing) for Jaccard.</li>



<li>Mention <strong>DBSCAN</strong> and <strong>Agglomerative Clustering</strong> as standard algorithms.</li>



<li>Cite limitations of <strong>BERT</strong> in high-precision SKU matching.</li>



</ul>



<h4 class="wp-block-heading"><strong>III. Methodology (The Core)</strong></h4>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Group definition in dataset</summary>
<ul class="wp-block-list">
<li>Define <strong>SKU</strong> vs <strong>Product Group</strong> (Parent-Child relationship).</li>



<li>The challenge: &#8220;Noise&#8221; in titles (e.g., <code>500ml</code>, <code>XL</code>, <code>Pack of 2</code>).</li>
</ul>
</details>



<ol start="1" class="wp-block-list">
<li><strong>Preprocessing &#8211; The Normalization Filter:</strong>
<ul class="wp-block-list">
<li>Define your Regex rules mathematically.</li>



<li>$T_{clean} = f(T_{raw})$ where $f$ removes tokens $t \in \{Dimensions, Colors, Stopwords\}$.</li>
</ul>
</li>



<li><strong>Representation:</strong>
<ul class="wp-block-list">
<li><strong>Approach A (Jaccard):</strong> $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$</li>



<li><strong>Approach B (TF-IDF):</strong> Cosine Similarity $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$</li>
</ul>
</li>



<li><strong>Clustering Algorithm:</strong>
<ul class="wp-block-list">
<li>Explain why you chose <strong>Connected Components</strong> (Graph theory) for Jaccard or <strong>Hierarchical Clustering</strong> for TF-IDF.</li>
</ul>
</li>
</ol>
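<p>The three steps above can be sketched end to end for the Jaccard branch. Everything specific here is an assumption made for illustration: the noise patterns in <code>NOISE_PATTERNS</code>, the sample titles, and the 0.8 threshold. Connected components are found with a plain breadth-first search over the graph of similar pairs.</p>

```python
import re
from collections import defaultdict, deque

# Hypothetical noise rules; a real catalog needs its own, much longer list.
NOISE_PATTERNS = [
    r"\bpack of \d+\b",                # multipack markers
    r"\b\d+(\.\d+)?\s?(ml|l|g|kg)\b",  # volumes and weights
    r"\b(xs|s|m|l|xl|xxl)\b",          # clothing sizes
    r"\b(black|white|red|blue)\b",     # colours
    r"\b(the|of|for|with)\b",          # stopwords
]
NOISE = re.compile("|".join(NOISE_PATTERNS))


def normalize(title):
    """T_clean = f(T_raw): lowercase, drop noise tokens, return the token set."""
    cleaned = NOISE.sub(" ", title.lower())
    return frozenset(re.findall(r"[a-z0-9]+", cleaned))


def group_products(titles, threshold=0.8):
    """Link titles whose cleaned token sets have Jaccard >= threshold and
    return the connected components as sorted lists of title indices."""
    sets = [normalize(t) for t in titles]
    graph = defaultdict(set)
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            if union and len(sets[i] & sets[j]) / len(union) >= threshold:
                graph[i].add(j)
                graph[j].add(i)
    seen, groups = set(), []
    for start in range(len(sets)):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        groups.append(sorted(component))
    return groups


titles = [
    "Nivea Shower Gel 250ml",
    "Nivea Shower Gel 500 ml Pack of 2",
    "Basic T-Shirt Blue XL",
    "Basic T-Shirt Red M",
    "Basic Hoodie Red M",
]
groups = group_products(titles)
print(groups)
```

<p>Connected components merge anything reachable through a chain of similar pairs, so a single bad link can fuse two product groups; that is one reason to keep the threshold strict in this branch.</p>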



<h4 class="wp-block-heading"><strong>IV. Experiments</strong></h4>



<ul class="wp-block-list">
<li><strong>Dataset:</strong> Your dataset of &#8220;several thousand products&#8221;.</li>



<li><strong>Metrics:</strong> You <em>must</em> measure quality.
<ul class="wp-block-list">
<li><strong>Precision:</strong> Are elements in the cluster actually the same product?</li>



<li><strong>Recall:</strong> Did we find <em>all</em> sizes of that product?</li>



<li><strong>F1-Score:</strong> Harmonic mean of the two.</li>
</ul>
</li>
</ul>
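<p>One common way to compute these metrics for a clustering is pairwise: every pair of listings placed in the same group counts as a predicted match and is checked against the ground truth. The sketch below assumes that pairwise definition; the labels are hypothetical.</p>

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_scores(true_labels, predicted_labels):
    """Pairwise precision, recall and F1 of a predicted grouping."""
    truth = same_cluster_pairs(true_labels)
    predicted = same_cluster_pairs(predicted_labels)
    hits = len(truth & predicted)
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(truth) if truth else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels for six listings: the prediction wrongly merges listing 3
# into group "a" and fails to keep listings 4 and 5 together.
true_labels = ["a", "a", "a", "b", "c", "c"]
predicted_labels = ["a", "a", "a", "a", "c", "d"]

precision, recall, f1 = pairwise_scores(true_labels, predicted_labels)
print(precision, recall, round(f1, 2))
```

<p>Here three of the six predicted pairs are correct (precision 0.5) and three of the four true pairs are found (recall 0.75), giving an F1 of 0.6.</p>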



<h4 class="wp-block-heading"><strong>V. Results &amp; Discussion</strong></h4>



<ul class="wp-block-list">
<li><em>Hypothesis:</em> Jaccard works better for clean data; TF-IDF works better for noisy data (typos).</li>



<li><em>Observation:</em> Jaccard is faster but TF-IDF + N-grams is more robust.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Novel Algorithm:</strong></p>



<ol start="1" class="wp-block-list">
<li><strong>Stage 1 (Blocking):</strong> Use <strong>Jaccard</strong> on tokens to quickly find &#8220;Candidate Pairs&#8221; (fast, rough filter).</li>



<li><strong>Stage 2 (Refinement):</strong> Use <strong>TF-IDF with Character N-grams</strong> on the candidates to calculate a precise similarity score (handles typos).</li>



<li><strong>Stage 3 (Decision):</strong> Hard threshold (e.g., >0.85).</li>
</ol>
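<p>A rough sketch of the three stages, using only the standard library. Several choices are assumptions rather than part of the proposal above: character trigrams, a smoothed IDF formula, the 0.3 blocking threshold, and the sample titles. Note that the blocking stage here still loops over all pairs for simplicity; at scale it would be replaced by an inverted index or MinHash/LSH.</p>

```python
import math
import re
from collections import Counter
from itertools import combinations


def word_set(title):
    """Token set used by the cheap blocking stage."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def char_ngrams(title, n=3):
    """Counts of character n-grams over the normalized, space-padded title."""
    text = " " + " ".join(re.findall(r"[a-z0-9]+", title.lower())) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf_vectors(titles, n=3):
    """Character n-gram TF-IDF vectors with a smoothed IDF (assumed weighting)."""
    counts = [char_ngrams(t, n) for t in titles]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(titles)
    return [
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1) for g, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def match_pairs(titles, block_threshold=0.3, final_threshold=0.85):
    words = [word_set(t) for t in titles]
    vectors = tfidf_vectors(titles)
    matches = []
    for i, j in combinations(range(len(titles)), 2):
        # Stage 1 (blocking): cheap token-level Jaccard filter.
        union = words[i] | words[j]
        if not union or len(words[i] & words[j]) / len(union) < block_threshold:
            continue
        # Stage 2 (refinement): character n-gram TF-IDF cosine similarity.
        score = cosine(vectors[i], vectors[j])
        # Stage 3 (decision): hard threshold on the refined score.
        if score > final_threshold:
            matches.append((i, j, round(score, 3)))
    return matches


titles = [
    "Samsung Galaxy S20 FE 5G",
    "SAMSUNG Galaxy S20-FE 5G",
    "Samsng Galaxy S20 FE 5G",  # typo on purpose
    "Apple iPhone 13 128GB",
]
matches = match_pairs(titles)
print(matches)
```

<p>The first two titles normalize to the same string and always match, while the iPhone listing never survives blocking. Whether the misspelled title clears 0.85 depends on the corpus-wide IDF weights, so both thresholds should be tuned on labeled pairs rather than taken from this sketch.</p>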



<h3 class="wp-block-heading">Relevant Search Terms &amp; Papers</h3>



<p>Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. </p>



<p>However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper (<a href="https://dl.acm.org/doi/pdf/10.1145/1559845.1559870" target="_blank" rel="noopener">https://dl.acm.org/doi/pdf/10.1145/1559845.1559870</a>), the authors propose an iterative blocking framework in which the ER results of one block are propagated to subsequently processed blocks. Blocks are processed iteratively until no block contains any more matching records.</p>



<p>Compared to simple blocking, iterative blocking may achieve higher accuracy because reflecting the ER results of blocks to other blocks may generate additional record matches. Iterative blocking may also be more efficient because processing a block now saves the processing time for other blocks.</p>



<ul class="wp-block-list">
<li><strong>&#8220;Short Text Clustering for E-commerce&#8221;</strong></li>



<li><strong>&#8220;Product Entity Resolution with Noise&#8221;</strong></li>



<li><strong>&#8220;Comparison of Jaccard and Cosine Similarity in Text Mining&#8221;</strong></li>



<li><strong>&#8220;Blocking techniques for Entity Resolution&#8221;</strong> (This is crucial for scaling to thousands/millions of products).</li>
</ul>



<p><strong>Specific types of papers to look for:</strong></p>



<ol start="1" class="wp-block-list">
<li><em>Koporec et al.</em>: Papers on combining Jaccard with other metrics. <a href="https://www.sciencedirect.com/science/article/pii/S1364815225002981" target="_blank" rel="noopener">https://www.sciencedirect.com/science/article/pii/S1364815225002981</a></li>



<li><em>Ganesan et al.</em>: Research on &#8220;abstractive summarization&#8221; or short-text clustering in retail.</li>
</ol>



<h3 class="wp-block-heading"><strong>FAQ: Product Grouping in Noisy E-Commerce Datasets</strong></h3>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>1. What does “noisy data” mean in e-commerce?</strong></summary>
<p>Noisy data refers to inconsistencies, errors, or irrelevant information in product listings—such as misspellings, incomplete descriptions, or duplicate entries—that make grouping products challenging.</p>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>2. Why is product grouping important?</strong></summary>
<p>Grouping similar products improves search accuracy, recommendation quality, and overall user experience. It also helps businesses manage inventory and pricing strategies more effectively.</p>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>3. What are common challenges in product grouping?</strong></summary>
<ul class="wp-block-list">
<li>Variations in product names and descriptions</li>



<li>Missing or incorrect attributes</li>



<li>Multiple languages or regional differences</li>



<li>Inconsistent categorization by sellers</li>
</ul>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>4. Which techniques are used to handle noisy datasets?</strong></summary>
<ul class="wp-block-list">
<li><strong>Text normalization</strong> (removing special characters, standardizing case)</li>



<li><strong>Tokenization and similarity measures</strong> (e.g., cosine similarity, Jaccard index)</li>



<li><strong>Machine learning models</strong> for clustering and classification</li>



<li><strong>Attribute-based matching</strong> using structured data</li>
</ul>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>5. Can AI improve product grouping accuracy?</strong></summary>
<p>Yes. AI models like BERT or domain-specific embeddings can capture semantic meaning in product descriptions, making grouping more accurate even with noisy data.</p>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>6. How do I start implementing product grouping?</strong></summary>
<p>Begin with data cleaning and normalization, then apply similarity-based clustering or train a supervised model if labeled data is available.</p>
</details>



<p></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/product-grouping-in-noisy-e-commerce-datasets">Product Grouping in Noisy E-commerce Datasets</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>15 Essential SQL Tips You Can&#8217;t Live Without</title>
		<link>https://mietwood.com/sql-tips</link>
					<comments>https://mietwood.com/sql-tips#comments</comments>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sat, 11 Oct 2025 15:54:17 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Business Analytics]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3353</guid>

					<description><![CDATA[<p>Whether you&#8217;re optimizing performance or simplifying your queries, these SQL tips from mietwood.com will help you write cleaner, faster, and more efficient code. SQL is the backbone of data-driven decision-making, and mastering it can dramatically improve how you interact with databases. Whether you&#8217;re a seasoned developer or just starting out, writing efficient, readable, and scalable...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/sql-tips">15 Essential SQL Tips You Can&#8217;t Live Without</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Whether you&#8217;re optimizing performance or simplifying your queries, these SQL tips from <a href="https://mietwood.com/10-tips-how-to-make-sql-lighter" target="_blank" rel="noreferrer noopener">mietwood.com</a> will help you write cleaner, faster, and more efficient code.</p>



<p>SQL is the backbone of data-driven decision-making, and mastering it can dramatically improve how you interact with databases. Whether you&#8217;re a seasoned developer or just starting out, writing efficient, readable, and scalable SQL queries is a skill that pays off daily. In this post, I&#8217;ve compiled fifteen essential tips that will help you write smarter SQL: tips that I&#8217;ve learned, refined, and shared over time. These aren&#8217;t just theoretical best practices; they&#8217;re practical techniques that can make your queries faster, your code cleaner, and your debugging easier.</p>



<p>You can try Microsoft SQL Server from here: <a href="https://www.microsoft.com/pl-pl/sql-server/sql-server-downloads" target="_blank" rel="noopener">https://www.microsoft.com/pl-pl/sql-server/sql-server-downloads</a>. The Developer edition is available <a href="https://go.microsoft.com/fwlink/p/?linkid=2215158&amp;clcid=0x415&amp;culture=pl-pl&amp;country=pl" target="_blank" rel="noopener">here</a>.</p>



<p>From avoiding <code>SELECT *</code> to choosing the right join types, each tip is designed to help you think critically about how your queries perform and how they scale. You’ll also learn how to use indexes effectively, filter data early, and make smart choices between <code>EXISTS</code> and <code>IN</code>. Each section includes a short summary and a link to a full post where you can dive deeper into the topic. Whether you&#8217;re optimizing a legacy system or building something new, these tips will help you get the most out of SQL—and avoid common pitfalls that slow down your work.</p>



<h3 class="wp-block-heading"><strong>Select Only What You Need</strong></h3>



<p>Avoid <code>SELECT *</code> and specify only the columns you need. This reduces data transfer and memory usage and improves query speed. For example, instead of pulling all employee data, just select <code>employee_id</code>, <code>first_name</code>, and <code>last_name</code>. <a href="https://mietwood.com/10-tips-how-to-make-sql-lighter" target="_blank" rel="noreferrer noopener">Read more</a></p>



<h3 class="wp-block-heading" id="1optimizejoinconditionsforperformance"><strong>Optimize Join Conditions for Performance</strong></h3>



<p>Avoid non-SARGable joins that prevent index usage. Instead of applying functions to columns in join conditions, restructure the logic to preserve index efficiency. This dramatically improves query speed. <a href="https://mietwood.com/query-optimization-with-join-condition">Read the full guide</a></p>



<h3 class="wp-block-heading" id="2usethepivotoperatorforbetterreporting"><strong>Use the PIVOT Operator for Better Reporting</strong></h3>



<p>Transform row-based data into columnar format using <code>PIVOT</code>. This is ideal for cross-tab reports and trend analysis, especially when comparing metrics across time or categories. <a href="https://mietwood.com/the-pivot-operator-in-sql">Explore the PIVOT tutorial</a></p>



<h3 class="wp-block-heading" id="3masterrecursivectesforhierarchicaldata"><strong>Master Recursive CTEs for Hierarchical Data</strong></h3>



<p>Recursive Common Table Expressions (CTEs) allow you to elegantly query hierarchical or tree-structured data. They’re powerful for tasks like organizational charts or category trees. <a href="https://mietwood.com/blog">Learn about recursive CTEs</a></p>
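<p>A minimal sketch of a recursive CTE walking an organizational chart, run here through Python&#8217;s built-in <code>sqlite3</code> module so it is self-contained. The <code>employees</code> table and its rows are hypothetical. SQLite and PostgreSQL write <code>WITH RECURSIVE</code>; in SQL Server the same query starts with plain <code>WITH</code>.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        manager_id  INTEGER REFERENCES employees(employee_id)
    );
    INSERT INTO employees VALUES
        (1, 'Alice', NULL),
        (2, 'Bob',   1),
        (3, 'Carol', 1),
        (4, 'Dave',  2);
    """
)

org_chart = conn.execute(
    """
    WITH RECURSIVE org_chart(employee_id, name, depth) AS (
        -- Anchor member: the root of the hierarchy has no manager.
        SELECT employee_id, name, 0
        FROM employees
        WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: attach each employee to an already-found manager.
        SELECT e.employee_id, e.name, oc.depth + 1
        FROM employees AS e
        JOIN org_chart AS oc ON e.manager_id = oc.employee_id
    )
    SELECT name, depth FROM org_chart ORDER BY depth, name
    """
).fetchall()
print(org_chart)
conn.close()
```

<p>Each pass of the recursive member adds one level of the tree, and the recursion stops when a pass returns no new rows.</p>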



<h3 class="wp-block-heading" id="4setthefirstdayoftheweekwithdatefirst"><strong>Set the First Day of the Week with DATEFIRST</strong></h3>



<p>Use <code>SET DATEFIRST</code> to control how SQL Server interprets weekday numbers. This is crucial for accurate time-based reporting and week-based aggregations. <a href="https://mietwood.com/category/sql">See how to use DATEFIRST</a></p>



<h3 class="wp-block-heading" id="5updatemultipletableswithconditions"><strong>Update Multiple Tables with Conditions</strong></h3>



<p>Learn how to structure multi-table updates using joins and conditional logic. This technique is essential for synchronizing data across related tables. <a href="https://mietwood.com/category/sql">Read the multi-table update example</a></p>



<h3 class="wp-block-heading" id="7filterearlywithwhereclauses"><strong>Filter Early with WHERE Clauses</strong></h3>



<p>Apply filters as early as possible to reduce the number of rows processed in joins and aggregations. <a href="https://mietwood.com/10-tips-how-to-make-sql-lighter">Optimize your filtering</a></p>



<h3 class="wp-block-heading" id="8useunionallinsteadofunion"><strong>Use UNION ALL Instead of UNION</strong></h3>



<p><code>UNION ALL</code> is faster than <code>UNION</code> because it skips duplicate elimination. Use it when duplicates aren’t a concern. <a href="https://mietwood.com/10-tips-how-to-make-sql-lighter">Performance tip explained</a></p>
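<p>The difference in results is easy to see on a toy example. The sketch below uses Python&#8217;s <code>sqlite3</code> module with two hypothetical tables; it shows the semantics only and does not measure speed.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE online_orders (customer_id INTEGER);
    CREATE TABLE store_orders  (customer_id INTEGER);
    INSERT INTO online_orders VALUES (1), (2), (3);
    INSERT INTO store_orders  VALUES (2), (3), (4);
    """
)

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT customer_id FROM online_orders "
    "UNION SELECT customer_id FROM store_orders ORDER BY customer_id"
).fetchall()

# UNION ALL simply concatenates both inputs and keeps duplicates.
union_all_rows = conn.execute(
    "SELECT customer_id FROM online_orders "
    "UNION ALL SELECT customer_id FROM store_orders ORDER BY customer_id"
).fetchall()

print(len(union_rows), len(union_all_rows))
conn.close()
```

<p>Customers 2 and 3 appear in both tables, so <code>UNION</code> returns four rows and <code>UNION ALL</code> returns six. If the inputs cannot overlap, the extra deduplication work buys nothing.</p>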



<h3 class="wp-block-heading" id="9avoidfunctionsonindexedcolumns"><strong>Avoid Functions on Indexed Columns</strong></h3>



<p>Applying functions like <code>LOWER()</code> or <code>DATEADD()</code> to an indexed column in a predicate usually prevents the optimizer from using the index. Rewrite the condition so the column stands alone on one side of the comparison. <a href="https://mietwood.com/query-optimization-with-join-condition">Join optimization example</a></p>
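<p>One way to observe the effect is to compare query plans. The sketch below uses SQLite through Python&#8217;s <code>sqlite3</code> module as a stand-in; the <code>users</code> table, the index name, and the <code>plan</code> helper are hypothetical, and SQL Server exposes the same information through its execution plans instead of <code>EXPLAIN QUERY PLAN</code>.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT);
    CREATE INDEX idx_users_email ON users (email);
    INSERT INTO users (email, name) VALUES
        ('test@example.com', 'Test'), ('other@example.com', 'Other');
    """
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Function wrapped around the indexed column: the plain index cannot be used.
wrapped = plan("SELECT name FROM users WHERE LOWER(email) = 'test@example.com'")
# Bare column compared to a constant: the index on email is eligible.
bare = plan("SELECT name FROM users WHERE email = 'test@example.com'")

print("wrapped:", wrapped)
print("bare:   ", bare)
conn.close()
```

<p>The exact plan wording differs between SQLite versions, but the wrapped predicate is reported as a table scan while the bare comparison is a search using <code>idx_users_email</code>.</p>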



<h3 class="wp-block-heading" id="10exploresqlforbusinessanalytics"><strong>Explore SQL for Business Analytics</strong></h3>



<p>Advanced SQL techniques like statistical analysis, predictive modeling, and time series forecasting are essential for business analysts. Learn how to combine SQL with Python for deeper insights. <a href="https://mietwood.com/advanced-programming-in-sql-and-python">Check out the full course</a></p>



<h2 class="wp-block-heading"><strong>5 additional SQL tips</strong></h2>



<h3 class="wp-block-heading"><strong>Use CTEs for Readability and Reuse</strong></h3>



<p>Common Table Expressions (CTEs) make complex queries easier to read and maintain. They allow you to define temporary result sets that can be referenced multiple times.</p>



<pre class="wp-block-code"><code>-- Count orders per customer over the last 30 days (PostgreSQL-style date arithmetic)
WITH recent_orders AS (
  SELECT customer_id, order_date
  FROM orders
  WHERE order_date > CURRENT_DATE - INTERVAL '30 days'
)
SELECT customer_id, COUNT(*) AS order_count
FROM recent_orders
GROUP BY customer_id;</code></pre>



<h3 class="wp-block-heading" id="12avoidfunctionsonindexedcolumns"><strong>Avoid Functions on Indexed Columns</strong></h3>



<p>Wrapping an indexed column in a function usually prevents the database from using the index, which slows queries down. Instead, transform the value you compare against, or store the column in a normalized form.</p>



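<pre class="wp-block-code"><code>-- Avoid: the function call on the column blocks a plain index on email
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';
-- Better: compare the bare column (assumes emails are stored in lowercase)
SELECT * FROM users WHERE email = 'test@example.com';</code></pre>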



<h3 class="wp-block-heading" id="13usecaseforconditionallogic"><strong>Use CASE for Conditional Logic</strong></h3>



<p><code>CASE</code> lets you embed conditional logic directly in your queries, useful for categorizing or transforming data.</p>



<pre class="wp-block-code"><code>SELECT name,
  CASE
    WHEN score >= 90 THEN 'Excellent'
    WHEN score >= 75 THEN 'Good'
    ELSE 'Needs Improvement'
  END AS performance
FROM students;</code></pre>



<h3 class="wp-block-heading" id="14optimizeaggregationswithgroupby"><strong>Optimize Aggregations with GROUP BY</strong></h3>



<p>When aggregating data, ensure you&#8217;re grouping only necessary columns to avoid performance hits and incorrect results.</p>



<pre class="wp-block-code"><code>SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;</code></pre>



<h3 class="wp-block-heading" id="15useparameterizedqueriestopreventsqlinjection"><strong>Use Parameterized Queries to Prevent SQL Injection</strong></h3>



<p>Always use parameterized queries in application code to protect against SQL injection.</p>



<pre class="wp-block-code"><code>-- Example in Python with psycopg2
cursor.execute("SELECT * FROM users WHERE username = %s", (username,))</code></pre>



<h2 class="wp-block-heading">Remember &#8211; tips summary</h2>



<h3 class="wp-block-heading"><strong>Use Joins Efficiently</strong></h3>



<p>Choose the right join type—<code>INNER JOIN</code> for matched rows, and avoid <code>CROSS JOIN</code> unless necessary. Efficient joins reduce unnecessary data processing and improve clarity.</p>



<h3 class="wp-block-heading"><strong>Filter Data Early</strong></h3>



<p>Apply filters as soon as possible in your query using <code>WHERE</code> clauses. This minimizes the number of rows processed in joins and aggregations, leading to faster execution.</p>



<h3 class="wp-block-heading"><strong>Use Indexes Wisely</strong></h3>



<p>Indexes speed up data retrieval, especially in <code>WHERE</code>, <code>JOIN</code>, and <code>ORDER BY</code> clauses. But don’t over-index—too many can slow down <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> operations.</p>



<h3 class="wp-block-heading"><strong>Avoid Subqueries in WHERE Clauses</strong></h3>



<p>Correlated subqueries can be slow. Replace them with joins when possible to improve performance and readability.</p>



<h3 class="wp-block-heading"><strong>Use UNION ALL Instead of UNION</strong></h3>



<p><code>UNION</code> removes duplicates, which is costly. If duplicates aren’t a concern, use <code>UNION ALL</code> for faster results.</p>



<h3 class="wp-block-heading"><strong>Limit Your Results</strong></h3>



<p>Use <code>LIMIT</code> or <code>TOP</code> to restrict the number of rows returned. This is especially useful for pagination or sampling large datasets.</p>



<h3 class="wp-block-heading"><strong>Be Cautious with LIKE and Functions</strong></h3>



<p>Avoid leading wildcards in <code>LIKE</code> and functions in <code>WHERE</code> clauses—they prevent index usage. Instead, use indexed columns and consistent casing.</p>



<h3 class="wp-block-heading"><strong>Use EXISTS Instead of IN</strong></h3>



<p><code>EXISTS</code> is often faster than <code>IN</code> because it stops scanning once a match is found. Use it for subqueries checking row existence.</p>
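<p>The two forms are interchangeable for this kind of existence check, as the sketch below shows with Python&#8217;s <code>sqlite3</code> module and two hypothetical tables. It demonstrates equivalence of results only; whether <code>EXISTS</code> is actually faster depends on the engine, and many optimizers produce the same plan for both, so it is worth checking the execution plan on your own data.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cat');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
    """
)

# IN: compare against the list of customer ids produced by the subquery.
with_in = conn.execute(
    """
    SELECT name FROM customers
    WHERE customer_id IN (SELECT customer_id FROM orders)
    ORDER BY name
    """
).fetchall()

# EXISTS: a correlated check that only needs one matching order per customer.
with_exists = conn.execute(
    """
    SELECT name FROM customers AS c
    WHERE EXISTS (
        SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id
    )
    ORDER BY name
    """
).fetchall()

print(with_in, with_exists)
conn.close()
```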



<h3 class="wp-block-heading"><strong>Use Appropriate Data Types</strong></h3>



<p>Choosing the right data type, such as <code>TINYINT</code> over <code>INT</code> for small ranges or <code>CHAR</code> over <code>VARCHAR</code> for fixed-length values, can save space and improve performance.</p>



<h2 class="wp-block-heading">SQL Server Management Studio</h2>



<p><strong>SSMS as a Comprehensive SQL Environment</strong><br>SQL Server Management Studio (<a href="https://learn.microsoft.com/en-us/ssms/" target="_blank" rel="noopener">https://learn.microsoft.com/en-us/ssms/</a>) is a powerful, integrated environment for managing SQL Server infrastructure. It provides tools for writing, executing, and debugging SQL queries, as well as managing databases, tables, views, and stored procedures. SSMS supports both on-premises and cloud-based SQL Server instances, making it versatile for hybrid environments. Its intuitive interface includes Object Explorer for navigating server components and Query Editor for crafting and testing SQL scripts. Whether you&#8217;re a database administrator or developer, SSMS offers a unified workspace that streamlines daily tasks and enhances productivity through built-in templates, syntax highlighting, and error diagnostics.</p>



<p><strong>Security, Performance, and Monitoring Tools</strong><br>SSMS includes robust features for security management, such as configuring roles, permissions, and auditing access. It also provides performance tuning tools like the Database Engine Tuning Advisor and graphical execution plans to help identify bottlenecks. With Activity Monitor, users can track real-time server performance, view active sessions, and analyze resource usage. These tools empower teams to maintain optimal database health and ensure compliance with organizational policies. SSMS also integrates with SQL Server Agent for scheduling jobs and alerts, making it a central hub for automation and proactive monitoring across enterprise environments.</p>



<p><strong>Integration, Extensibility, and Collaboration</strong><br>SSMS supports integration with source control systems like Git, enabling versioning and collaborative development. It allows exporting and importing data via wizards, scripting database objects, and generating reports for documentation. Users can extend SSMS functionality through add-ins and connect to Azure services for cloud-based analytics and storage. Its support for multiple query windows and tabbed editing enhances multitasking, while customizable keyboard shortcuts and themes improve user experience. SSMS continues to evolve with regular updates, ensuring compatibility with the latest SQL Server features and providing a stable platform for modern data operations.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/sql-tips">15 Essential SQL Tips You Can&#8217;t Live Without</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://mietwood.com/sql-tips/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Query optimization with join condition</title>
		<link>https://mietwood.com/query-optimization-with-join-condition</link>
					<comments>https://mietwood.com/query-optimization-with-join-condition#comments</comments>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Mon, 06 Oct 2025 18:04:33 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3349</guid>

					<description><![CDATA[<p>Microsoft SQL Server (MS SQL Server) is a relational database management system (RDBMS) developed by Microsoft. Transact-SQL (T-SQL) is the specific dialect of the SQL (Structured Query Language) that you use to communicate with a Microsoft SQL Server database. The query to be optimized Here there is a query. The table has no index. How...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/query-optimization-with-join-condition">Query optimization with join condition</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Microsoft SQL Server (MS SQL Server) is a <strong>relational database management system</strong> (RDBMS) developed by Microsoft. Transact-SQL (T-SQL) is the <strong>specific dialect of the SQL</strong> (Structured Query Language) that you use to communicate with a Microsoft SQL Server database. </p>



<h2 class="wp-block-heading">The query to be optimized</h2>



<p>Here is the query. The tables have no indexes. How can we optimize the query and the read performance of the tables by setting up indexes?</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>select k.*, 
       c.&#91;Top_segment&#93;
      ,c.&#91;CompanyType&#93;
      ,c.&#91;Potencial&#93; from (
    select 
          a.&#91;Dt_ym&#93;
          ,a.&#91;CustomerIdEx&#93;
          ,sum(a.ErpRegisterDoc_count) ErpRegisterDoc_count
          ,sum(a.ErpValueNet_sum) ErpValueNet_sum, sum(b.ErpValueNet_sum) ErpValueNet_sum_py
    FROM &#91;onn&#93;.&#91;DBCust_stat_baskets&#93; a with (nolock)
      left join &#91;onn&#93;.&#91;DBCust_stat_baskets&#93; b with (nolock) 
          on dateadd(year, 1, b.dt_ym) = a.Dt_ym and b.&#91;CustomerIdEx&#93; = a.CustomerIdEx
group by           a.&#91;Dt_ym&#93;
          ,a.&#91;CustomerIdEx&#93;
 ) as k 
  left join &#91;onn&#93;.&#91;DBCust_erp&#93; c with (nolock) on k.CustomerIdEx = c.Cust_nr</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">select</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">k</span><span style="color: #ECEFF4">.</span><span style="color: #81A1C1">*</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">       </span><span style="color: #D8DEE9">c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">Top_segment</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9">c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">CompanyType</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9">c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">Potencial</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #88C0D0">from</span><span style="color: #D8DEE9FF"> (</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">select</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #D8DEE9">a</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">Dt_ym</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9">a</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">CustomerIdEx</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #ECEFF4">,</span><span style="color: #88C0D0">sum</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">a</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">ErpRegisterDoc_count</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">ErpRegisterDoc_count</span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #ECEFF4">,</span><span style="color: #88C0D0">sum</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">a</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">ErpValueNet_sum</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">ErpValueNet_sum</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">sum</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">b</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">ErpValueNet_sum</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">ErpValueNet_sum_py</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">onn</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">DBCust_stat_baskets</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #D8DEE9">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">with</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">nolock</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">      </span><span style="color: #D8DEE9">left</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">join</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">onn</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">DBCust_stat_baskets</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #D8DEE9">b</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">with</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">nolock</span><span style="color: #D8DEE9FF">) </span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #D8DEE9">on</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">dateadd</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">year</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">b</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">dt_ym</span><span style="color: #D8DEE9FF">) </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">a</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">Dt_ym</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">b</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">CustomerIdEx</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">a</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">CustomerIdEx</span></span>
<span class="line"><span style="color: #D8DEE9">group</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">by</span><span style="color: #D8DEE9FF">           </span><span style="color: #D8DEE9">a</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">Dt_ym</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9">a</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">CustomerIdEx</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF"> ) </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> k </span></span>
<span class="line"><span style="color: #D8DEE9FF">  </span><span style="color: #D8DEE9">left</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">join</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">onn</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">DBCust_erp</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #D8DEE9">c</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">with</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">nolock</span><span style="color: #D8DEE9FF">) </span><span style="color: #D8DEE9">on</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">k</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">CustomerIdEx</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">c</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">Cust_nr</span></span></code></pre></div>



<p>The biggest performance issues here are the <strong>complete lack of indexes</strong> and a <strong>join condition that prevents the use of an index</strong> (even if one existed). You can dramatically speed this up by creating the right indexes and making a small but critical change to the query.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Setting Up the Right Indices</h3>



<p>An index acts like a book&#8217;s index, allowing SQL Server to find data quickly without scanning the entire table. Based on your query&#8217;s <code>JOIN</code> and <code>GROUP BY</code> clauses, these are the indexes you need.</p>



<h4 class="wp-block-heading">On the <code>DBCust_stat_baskets</code> table:</h4>



<p>This table is joined to itself and grouped by <code>CustomerIdEx</code> and <code>Dt_ym</code>. A <strong>covering index</strong> will be most effective here. This special type of index not only includes the key columns for searching but also the data columns the query needs, allowing the entire subquery to be satisfied from the index alone.</p>






<pre class="wp-block-code"><code>CREATE NONCLUSTERED INDEX IX_DBCust_stat_baskets_Customer_Date
ON &#91;onn].&#91;DBCust_stat_baskets] (&#91;CustomerIdEx], &#91;Dt_ym])
INCLUDE (&#91;ErpRegisterDoc_count], &#91;ErpValueNet_sum]);
</code></pre>



<ul class="wp-block-list">
<li><strong><code>ON ([CustomerIdEx], [Dt_ym])</code></strong>: This covers your <code>JOIN</code> and <code>GROUP BY</code> columns. The order matters: <code>CustomerIdEx</code> is assumed to be the more selective column (many customers, comparatively few months), so it comes first.</li>



<li><strong><code>INCLUDE (...)</code></strong>: This &#8220;covers&#8221; the <code>SUM</code> calculations, so SQL Server doesn&#8217;t have to look up the data in the main table, which is a massive performance boost.</li>
</ul>
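
<p>The idea of a covering index can be sketched with Python&#8217;s built-in <code>sqlite3</code> module. This is an illustration under stated assumptions, not SQL Server behaviour: SQLite has no <code>INCLUDE</code> clause, so the value columns are appended as trailing key columns, and the <code>baskets</code> table with its <code>Notes</code> column is a hypothetical simplification of <code>DBCust_stat_baskets</code>.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE baskets (
        CustomerIdEx INTEGER, Dt_ym TEXT,
        ErpRegisterDoc_count INTEGER, ErpValueNet_sum REAL, Notes TEXT
    );
    -- SQLite has no INCLUDE clause, so the value columns become trailing keys.
    CREATE INDEX IX_baskets_Customer_Date
        ON baskets (CustomerIdEx, Dt_ym, ErpRegisterDoc_count, ErpValueNet_sum);
""")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Every referenced column lives in the index, so the table is never touched.
covered = plan(
    "SELECT Dt_ym, SUM(ErpValueNet_sum) FROM baskets "
    "WHERE CustomerIdEx = 42 GROUP BY Dt_ym"
)
# Notes is not in the index, so each match needs a lookup in the table.
not_covered = plan("SELECT Dt_ym, Notes FROM baskets WHERE CustomerIdEx = 42")

print(covered)
print(not_covered)
```

<p>The first plan should mention a covering index, while the second uses the index only to locate rows and then reads the table. In SQL Server the equivalent check is to look for key lookups in the graphical execution plan.</p>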



<h4 class="wp-block-heading">On the <code>DBCust_erp</code> table:</h4>



<p>This table is joined on the <code>Cust_nr</code> column. It also needs an index to make the final <code>LEFT JOIN</code> fast.</p>






<pre class="wp-block-code"><code>CREATE NONCLUSTERED INDEX IX_DBCust_erp_Cust_nr
ON &#91;onn].&#91;DBCust_erp] (&#91;Cust_nr])
INCLUDE (&#91;Top_segment], &#91;CompanyType], &#91;Potencial]);
</code></pre>



<ul class="wp-block-list">
<li><strong><code>ON ([Cust_nr])</code></strong>: Speeds up the join to the subquery <code>k</code>.</li>



<li><strong><code>INCLUDE (...)</code></strong>: Covers the columns you are selecting, again avoiding a second lookup to the main table.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Optimizing the T-SQL Query 🚀</h3>



<p>Your query has a &#8220;non-SARGable&#8221; join condition. SARGable is short for <strong>S</strong>earch <strong>ARG</strong>ument-<strong>able</strong>: a predicate is SARGable when SQL Server can evaluate it with an index seek.</p>



<p>The problem is this line: <code>dateadd(year, 1, b.dt_ym) = a.Dt_ym</code>. By applying the <code>dateadd</code> function to the column <code>b.dt_ym</code>, you force SQL Server to calculate a new value for <strong>every row it considers</strong> before it can compare the dates. An index on <code>dt_ym</code> cannot be used for a seek on that predicate, which usually means a slow scan.</p>



<p>The fix is to apply the function to the other side of the equation, so the column itself remains &#8220;clean.&#8221;</p>



<p><strong>Change this:</strong> <code>ON dateadd(year, 1, b.dt_ym) = a.Dt_ym</code></p>



<p><strong>To this:</strong> <code>ON b.dt_ym = dateadd(year, -1, a.Dt_ym)</code></p>



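<p>This small change allows SQL Server to take a value from <code>a</code>, calculate the prior year&#8217;s date <em>once</em>, and then perform a highly efficient index seek on <code>b.dt_ym</code> to find the matching row. Assuming <code>Dt_ym</code> is a date column, the two forms match the same rows as long as it never holds 29 February (for example, when it always stores the first day of a month), because <code>DATEADD</code> clamps a non-existent date to the last day of the month.</p>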
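
<p>The effect of moving the function off the column can be sketched with Python&#8217;s built-in <code>sqlite3</code> module. SQLite is only a stand-in here: its <code>date()</code> function plays the role of <code>DATEADD</code>, the <code>baskets</code> table is a hypothetical simplification, and SQL Server&#8217;s plans will look different. The principle being illustrated is the same.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE baskets (CustomerIdEx INTEGER, Dt_ym TEXT);
    CREATE INDEX IX_baskets_Dt_ym ON baskets (Dt_ym);
    INSERT INTO baskets VALUES
        (1, '2024-01-01'), (1, '2025-01-01'), (2, '2024-06-01');
""")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Function wrapped around the indexed column: every row must be evaluated.
wrapped = (
    "SELECT CustomerIdEx FROM baskets "
    "WHERE date(Dt_ym, '+1 year') = '2025-01-01'"
)
# Function moved to the constant side: the column is compared directly.
bare = (
    "SELECT CustomerIdEx FROM baskets "
    "WHERE Dt_ym = date('2025-01-01', '-1 year')"
)

wrapped_plan = plan(wrapped)
bare_plan = plan(bare)
print(wrapped_plan)  # expected to describe a scan
print(bare_plan)     # expected to describe an index search

# Both predicates select the same rows for first-of-month values.
same_rows = conn.execute(wrapped).fetchall() == conn.execute(bare).fetchall()
print(same_rows)
```

<p>The wrapped predicate should be reported as a scan and the bare one as an index search, while both return the same rows for this data.</p>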



<h4 class="wp-block-heading">The Final Optimized Query</h4>



<p>For better readability, I&#8217;ve rewritten your query using a Common Table Expression (CTE). The logic is the same, but it&#8217;s easier to follow. This version incorporates the SARGable join fix.</p>






<pre class="wp-block-code"><code>-- Use a CTE for better readability to get the current and prior year sums
WITH CustomerMonthlyStats AS (
    SELECT
        a.&#91;Dt_ym],
        a.&#91;CustomerIdEx],
        SUM(a.ErpRegisterDoc_count) AS ErpRegisterDoc_count,
        SUM(a.ErpValueNet_sum) AS ErpValueNet_sum,
        SUM(b.ErpValueNet_sum) AS ErpValueNet_sum_py -- Sum from the previous year
    FROM
        &#91;onn].&#91;DBCust_stat_baskets] AS a
    LEFT JOIN
        &#91;onn].&#91;DBCust_stat_baskets] AS b
        -- OPTIMIZED: The function is now on the outer side of the join, making it SARGable
        ON a.&#91;CustomerIdEx] = b.&#91;CustomerIdEx]
        AND b.dt_ym = DATEADD(year, -1, a.Dt_ym)
    GROUP BY
        a.&#91;Dt_ym],
        a.&#91;CustomerIdEx]
)
-- Final SELECT joining the aggregated stats with customer details
SELECT
    k.*,
    c.&#91;Top_segment],
    c.&#91;CompanyType],
    c.&#91;Potencial]
FROM
    CustomerMonthlyStats AS k
LEFT JOIN
    &#91;onn].&#91;DBCust_erp] AS c ON k.CustomerIdEx = c.Cust_nr;

</code></pre>



<p>By creating these indexes and applying this simple query fix, you should see a dramatic improvement in performance, going from potentially minutes to seconds. I&#8217;d also recommend removing the <code>WITH (NOLOCK)</code> hints after these changes, as they won&#8217;t be necessary and can lead to reading &#8220;dirty&#8221; or incorrect data.</p>



<p><em>You also converted my query to a CTE, but I can&#8217;t use CTEs in Power BI DirectQuery, which is why I wrote it that way. Am I missing something, or is a CTE really better than my version of &#8220;cascade&#8221; querying?</em></p>



<p>That&#8217;s an excellent point about Power BI DirectQuery&#8217;s limitations with CTEs. You are correct; you often have to use subqueries (also called derived tables) instead.</p>



<p>To answer your main question: for this specific query, <strong>a CTE is not better than your subquery in terms of performance.</strong> The SQL Server query optimizer is smart enough to see that your subquery version and the CTE version are logically identical. It will typically generate the same execution plan for both, meaning they will run at the same speed.</p>



<p>The primary advantage of a CTE is <strong>readability and maintainability</strong>, not performance. It helps break down complex logic into clean, sequential steps, making the query easier for a human to understand and debug later.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">The Important Part: The Optimization Itself</h3>



<p>The crucial takeaway from the previous answer was not the switch to a CTE, but these two points:</p>



<ol start="1" class="wp-block-list">
<li><strong>Creating the right indexes</strong> on the tables.</li>



<li><strong>Fixing the non-SARGable join condition</strong>.</li>
</ol>



<p>As long as you apply those two changes, your original query structure is perfectly fine and will be just as fast as the CTE version.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Your Original Query, Optimized</h3>



<p>Here is your original query with the single, critical change to the <code>JOIN</code> condition. This version avoids CTEs and is ready for Power BI DirectQuery, while still containing the essential performance fix.</p>






<pre class="wp-block-code"><code>SELECT
    k.*,
    c.&#91;Top_segment],
    c.&#91;CompanyType],
    c.&#91;Potencial]
FROM (
    -- This subquery is functionally the same as the CTE
    SELECT
        a.&#91;Dt_ym],
        a.&#91;CustomerIdEx],
        SUM(a.ErpRegisterDoc_count) AS ErpRegisterDoc_count,
        SUM(a.ErpValueNet_sum) AS ErpValueNet_sum,
        SUM(b.ErpValueNet_sum) AS ErpValueNet_sum_py
    FROM
        &#91;onn].&#91;DBCust_stat_baskets] AS a WITH (NOLOCK)
    LEFT JOIN
        &#91;onn].&#91;DBCust_stat_baskets] AS b WITH (NOLOCK)
        -- The single most important optimization is right here:
        ON a.CustomerIdEx = b.CustomerIdEx AND b.dt_ym = DATEADD(year, -1, a.Dt_ym)
    GROUP BY
        a.&#91;Dt_ym],
        a.&#91;CustomerIdEx]
) AS k
LEFT JOIN
    &#91;onn].&#91;DBCust_erp] AS c WITH (NOLOCK) ON k.CustomerIdEx = c.Cust_nr;

</code></pre>



<p>The post <a rel="nofollow" href="https://mietwood.com/query-optimization-with-join-condition">Query optimization with join condition</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://mietwood.com/query-optimization-with-join-condition/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Student Profiling for your lecture</title>
		<link>https://mietwood.com/profiling</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Tue, 30 Sep 2025 08:55:38 +0000</pubDate>
				<category><![CDATA[Customer Experience Management]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3334</guid>

					<description><![CDATA[<p>People profiling problem can be approached as an analysis of co-occurrence, how often lectures are chosen together and correlation, the strength and direction of the relationship between choosing lecture X and choosing other lectures. Frequency Analysis for Profiling This is the most direct approach to identify the most and least selected lectures by your L6...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/profiling">Student Profiling for your lecture</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The profiling problem can be approached as an analysis of <strong>co-occurrence</strong> (how often lectures are chosen together) and <strong>correlation</strong> (the strength and direction of the relationship between choosing lecture X and choosing other lectures).</p>



<h2 class="wp-block-heading">Frequency Analysis for Profiling</h2>



<p>This is the most direct approach to identify the most and least selected lectures by your L6 students.</p>



<ol start="1" class="wp-block-list">
<li><strong>Filter Data:</strong> Isolate the rows (students) who selected <strong>L6</strong>.</li>



<li><strong>Calculate Frequencies:</strong> For this subset of L6 students, count how many of them selected each of the <em>other</em> available lectures (L1, L2, L3, etc.).
<ul class="wp-block-list">
<li><strong>Most Selected:</strong> The lectures with the highest counts.</li>



<li><strong>Least Selected:</strong> The lectures with the lowest counts (or those not selected at all).</li>
</ul>
</li>
</ol>
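
<p>The two steps above can be sketched in a few lines of Python. The data here is a hypothetical toy sample (five students, each mapped to the set of lectures they selected), not the dataset analyzed later in this post.</p>

```python
from collections import Counter

# Hypothetical toy data: each student maps to the set of lectures they selected.
selections = {
    "s1": {"L3", "L4", "L6"},
    "s2": {"L4", "L5", "L6"},
    "s3": {"L1", "L3"},
    "s4": {"L4", "L6", "L12"},
    "s5": {"L5", "L7"},
}

target = "L6"

# Step 1: keep only the students who selected the target lecture.
cohort = [chosen for chosen in selections.values() if target in chosen]

# Step 2: count how often each other lecture appears in that cohort.
counts = Counter(
    lecture for chosen in cohort for lecture in chosen if lecture != target
)

# Express each count as a share of the cohort, most selected first.
profile = sorted(
    ((lecture, n / len(cohort)) for lecture, n in counts.items()),
    key=lambda item: (-item[1], item[0]),
)
for lecture, share in profile:
    print(f"{lecture}: {share:.0%}")
```

<p>Lectures that never appear in the cohort (here <code>L1</code> and <code>L7</code>) are simply absent from <code>counts</code>, which is how the &#8220;not selected at all&#8221; case shows up.</p>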



<h2 class="wp-block-heading">Association Rule Mining &#8211; Co-occurrence</h2>



<p>This sophisticated approach, often used in market basket analysis, can determine which lectures are most <strong>frequently chosen together</strong> with L6.</p>



<ul class="wp-block-list">
<li><strong>Support:</strong> The proportion of all students who selected both L6 and a specific lecture (e.g., L3).</li>



<li><strong>Confidence:</strong> The likelihood that a student selected L6 <em>given</em> that they selected another lecture (e.g., L3 → L6), or vice-versa.</li>



<li><strong>Lift:</strong> A measure of how much more likely a student is to select L3 if they also selected L6, compared to the overall likelihood of selecting L3. A Lift >1 suggests a <strong>positive association</strong> (students who take one tend to take the other).</li>
</ul>
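
<p>A minimal sketch of the three metrics for a single lecture pair follows. It assumes the standard definitions (support as the share of all students who selected both lectures, confidence as support divided by the share selecting the antecedent, lift as support divided by the product of the two individual shares). The five students are hypothetical toy data and <code>rule_metrics</code> is an illustrative helper name.</p>

```python
# Hypothetical toy data: one set of selected lectures per student.
students = [
    {"L3", "L6"},
    {"L3", "L6"},
    {"L3"},
    {"L6"},
    {"L1"},
]

def rule_metrics(rows, a, b):
    """Support, confidence (a -> b), and lift for the lecture pair (a, b)."""
    n = len(rows)
    p_a = sum(a in r for r in rows) / n               # share selecting a
    p_b = sum(b in r for r in rows) / n               # share selecting b
    p_ab = sum(a in r and b in r for r in rows) / n   # share selecting both
    confidence = p_ab / p_a if p_a else 0.0
    lift = p_ab / (p_a * p_b) if p_a and p_b else 0.0
    return p_ab, confidence, lift

support, confidence, lift = rule_metrics(students, "L6", "L3")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

<p>Here two of five students chose both lectures (support 0.40), two of the three L6 students also chose L3 (confidence about 0.67), and the lift of about 1.11 is only slightly above 1, so the association is weak.</p>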



<h2 class="wp-block-heading">Correlation Analysis &#8211; the Strength and Direction of Relation &#8211; application in Profiling</h2>



<p>This method quantifies the relationship between selecting L6 and selecting any other lecture (Lx). Since the selection data is <strong>binary</strong> (0 for not selected, 1 for selected), you would use a correlation measure suitable for binary variables.</p>



<ul class="wp-block-list">
<li><strong>Phi Coefficient (ϕ):</strong> This is a measure of association for two binary variables. It ranges from −1 to +1.
<ul class="wp-block-list">
<li><strong>Strong Positive Correlation (ϕ≈+1):</strong> Students who select <strong>L6</strong> are highly likely to also select <strong>Lx</strong>. This suggests the lectures are perhaps complementary or targeted at the same student group.</li>



<li><strong>Strong Negative Correlation (ϕ≈−1):</strong> Students who select <strong>L6</strong> are highly likely to <strong>not</strong> select <strong>Lx</strong>. This suggests the lectures might be alternatives, require conflicting time slots, or appeal to entirely different student interests.</li>



<li><strong>Weak/No Correlation (ϕ≈0):</strong> Selection of L6 has little to no impact on the selection of Lx.</li>
</ul>
</li>
</ul>
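
<p>The phi coefficient can be computed directly from the four cells of the 2×2 contingency table. The sketch below is a plain-Python illustration; the function name <code>phi_coefficient</code> and the six-student vectors are hypothetical, chosen so that the three cases from the list above are easy to verify by hand.</p>

```python
from math import sqrt

def phi_coefficient(x, y):
    """Phi coefficient for two equal-length sequences of 0/1 values."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A constant column makes phi undefined; 0.0 is returned here by convention.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

l6 = [1, 1, 1, 0, 0, 0]           # hypothetical L6 selections
lx_same = [1, 1, 1, 0, 0, 0]      # always chosen together with L6
lx_opposite = [0, 0, 0, 1, 1, 1]  # never chosen together with L6
lx_mixed = [1, 0, 1, 0, 1, 0]     # partial overlap

print(phi_coefficient(l6, lx_same))
print(phi_coefficient(l6, lx_opposite))
print(phi_coefficient(l6, lx_mixed))
```

<p>The first two calls give the extreme values of +1 and −1, and the mixed case gives one third, a modest positive association.</p>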



<h2 class="wp-block-heading">Dimensionality Reduction &#8211; Clustering</h2>



<p>For a very large number of lectures, you could use methods like <strong>Principal Component Analysis (PCA)</strong> or <strong>clustering algorithms</strong> to group similar students or lectures together. This can identify underlying student profiles (e.g., &#8220;The Data Science Crowd&#8221; or &#8220;The Humanities Enthusiasts&#8221;) that include L6 as part of their typical selection.</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="956" height="647" src="https://mietwood.com/wp-content/uploads/2025/09/image-8.jpg" alt="data sample for profiling" class="wp-image-3336" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-8.jpg 956w, https://mietwood.com/wp-content/uploads/2025/09/image-8-300x203.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-8-768x520.jpg 768w" sizes="(max-width: 956px) 100vw, 956px" /><figcaption class="wp-element-caption">Dataset for profiling</figcaption></figure>



<p>In this table, rows are students and columns are the lectures they selected. My lecture is L6. I would like to know the profile of my students: which lectures they selected the most and which the least, and how strong the relations are (positive, or negative if they tend to omit certain lectures).</p>



<h2 class="wp-block-heading">Profile of L6 Students: Most and Least Selected Lectures</h2>



<p>This is a <strong>Frequency Analysis</strong> of the lectures selected by the 15 students who chose L6. The percentages are based on that total number of L6 students (15).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="271" height="357" src="https://mietwood.com/wp-content/uploads/2025/09/image-9.jpg" alt="frequency analysis for profiling" class="wp-image-3337" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-9.jpg 271w, https://mietwood.com/wp-content/uploads/2025/09/image-9-228x300.jpg 228w" sizes="(max-width: 271px) 100vw, 271px" /><figcaption class="wp-element-caption">Frequency analysis</figcaption></figure>
</div>


<p><strong>Most Selected Lectures &#8211; The &#8220;Typical Package&#8221;:</strong> Your L6 students most frequently select <strong>L4</strong> (67%), <strong>L3</strong> (53%), and <strong>L5</strong> (53%). These three form the core lecture package with L6. <strong>Least Selected Lecture:</strong> <strong>L12</strong> is the least popular choice, selected by only 3 out of 15 students (20%).</p>



<h2 class="wp-block-heading">Strength of Relation: Correlation Analysis</h2>



<p>The <strong>Phi Coefficient</strong> quantifies the strength and direction of the relationship between choosing L6 and choosing any other lecture, using all 25 students in the dataset.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="716" height="513" src="https://mietwood.com/wp-content/uploads/2025/09/image-10.jpg" alt="Strength of Relation in profiling: Correlation Analysis" class="wp-image-3338" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-10.jpg 716w, https://mietwood.com/wp-content/uploads/2025/09/image-10-300x215.jpg 300w" sizes="(max-width: 716px) 100vw, 716px" /><figcaption class="wp-element-caption">Strength of Relation: Correlation Analysis</figcaption></figure>
</div>


<p><strong>Strongest Positive Relations (Complementary):</strong> <strong>L4</strong> (ϕ=0.263) and <strong>L5</strong> (ϕ=0.230) show the strongest positive correlation with L6. This suggests that students interested in L6 are often those who also select L4 and L5.</p>



<p><strong>Strongest Negative Relation (Alternative/Avoided):</strong> <strong>L12</strong> (ϕ=−0.218) shows the only notable negative correlation. This is consistent with the frequency finding and suggests L12 may be an alternative path, or may conflict with L6 in time slot or prerequisites.</p>



<p><strong>Weak/No Relation:</strong> Lectures like L3 and L7 have a high selection frequency but a very weak (L3) or zero (L7) correlation. This indicates that while many L6 students <em>do</em> take these, they are likely popular lectures chosen by many students across the board, and the choice of L6 is not a significant predictor of their selection.</p>



<h2 class="wp-block-heading"><strong>Association Rule Mining</strong></h2>



<p>By analyzing the entire student population, we can discover <strong>general student curriculum patterns</strong> that exist beyond your specific L6 cohort. I used <strong>Association Rule Mining</strong> metrics (<strong>Support</strong>, <strong>Confidence</strong>, and <strong>Lift</strong>) to find lecture pairs that are frequently selected together.</p>



<ul class="wp-block-list">
<li><strong>Support:</strong> The percentage of all 25 students who selected both lectures.</li>



<li><strong>Lift:</strong> A measure of how much the selection of one lecture <em>increases</em> the chance of selecting the other. A Lift&gt;1.2 is used here as the threshold for a meaningful positive association.</li>
</ul>



<p>Here are the top co-selected lecture groups (pairs) among the entire student population, filtered for those selected by at least 16% of students and showing a strong positive association (Lift&gt;1.2):</p>






<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd
from io import StringIO
import itertools

# Read the data
df = pd.read_csv(StringIO(csv_data), sep=';')
N_students = len(df)

lecture_cols = &#91;col for col in df.columns if col.startswith('L')&#93;
lecture_df = df&#91;lecture_cols&#93;.fillna(0).astype(int)

# --- Association Rule Mining (Pairs) ---
# 1. Calculate Support for individual lectures
support_single = lecture_df.sum() / N_students

# 2. Calculate Support and Lift for all pairs
association_rules = []
for l1, l2 in itertools.combinations(lecture_cols, 2):
    # Calculate Support for the pair: count students who selected both
    co_selection_count = (lecture_df&#91;l1&#93; * lecture_df&#91;l2&#93;).sum()
    support_pair = co_selection_count / N_students

    # Calculate Lift: (Support(L1 and L2)) / (Support(L1) * Support(L2))
    # Handle division by zero if single support is 0, though unlikely here
    if support_single&#91;l1&#93; > 0 and support_single&#91;l2&#93; > 0:
        lift = support_pair / (support_single&#91;l1&#93; * support_single&#91;l2&#93;)
    else:
        lift = 0

    # Calculate Confidence (L1 -> L2)
    confidence_l1_to_l2 = support_pair / support_single&#91;l1&#93; if support_single&#91;l1&#93; > 0 else 0

    association_rules.append({
        'Antecedent': l1,
        'Consequent': l2,
        'Support': support_pair,
        'Confidence (L1 -> L2)': confidence_l1_to_l2,
        'Lift': lift
    })

# Convert to DataFrame
rules_df = pd.DataFrame(association_rules)

# Filter for meaningful associations:
# 1. Minimum Support: Selected by at least 4 students (4/25 = 0.16)
# 2. Lift > 1.2: A strong positive relationship
min_support = 4 / N_students  # 0.16

# Filter and sort the results by Lift
top_associations = rules_df[
    (rules_df&#91;'Support'&#93; >= min_support) &amp;
    (rules_df&#91;'Lift'&#93; > 1.2)
].sort_values(by='Lift', ascending=False).reset_index(drop=True)

# Add the reverse rules (L2 -> L1) to the table where Lift is high.
# Since Lift is symmetrical, only one direction needs to be calculated, but Confidence is not.

# Helper function to get Confidence (L2 -> L1) for presentation
def get_confidence_l2_to_l1(row):
    l1 = row&#91;'Antecedent'&#93;
    l2 = row&#91;'Consequent'&#93;
    support_pair = row&#91;'Support'&#93;
    return support_pair / support_single&#91;l2&#93; if support_single&#91;l2&#93; > 0 else 0

top_associations&#91;'Confidence (L2 -> L1)'&#93; = top_associations.apply(get_confidence_l2_to_l1, axis=1)

# Reorder columns for presentation
top_associations = top_associations[&#91;'Antecedent', 'Consequent', 'Support', 'Confidence (L1 -> L2)', 'Confidence (L2 -> L1)', 'Lift'&#93;]

print("Top Co-Selected Lecture Groups (Pairs):")
print(top_associations.to_markdown(index=False, floatfmt=".3f"))</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">io</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">StringIO</span></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">itertools</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Read</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span></span>
<span class="line"><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">read_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">StringIO</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">csv_data</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sep</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">N_students</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">len</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">lecture_cols</span><span style="color: #D8DEE9FF"> = &#91;</span><span style="color: #8FBCBB">col</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">col</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">columns</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">col</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">startswith</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">L</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)&#93;</span></span>
<span class="line"><span style="color: #8FBCBB">lecture_df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">lecture_cols</span><span style="color: #D8DEE9FF">&#93;.</span><span style="color: #8FBCBB">fillna</span><span style="color: #D8DEE9FF">(0).</span><span style="color: #8FBCBB">astype</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">int</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># --- </span><span style="color: #8FBCBB">Association</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Rule</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Mining</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">Pairs</span><span style="color: #D8DEE9FF">) ---</span></span>
<span class="line"><span style="color: #D8DEE9FF"># 1. </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">individual</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">lectures</span></span>
<span class="line"><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">lecture_df</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sum</span><span style="color: #D8DEE9FF">() / </span><span style="color: #8FBCBB">N_students</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># 2. </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">all</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pairs</span></span>
<span class="line"><span style="color: #8FBCBB">association_rules</span><span style="color: #D8DEE9FF"> = []</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">l1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">itertools</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">combinations</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">lecture_cols</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 2):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pair</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">count</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">students</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">who</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">selected</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">both</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">co_selection_count</span><span style="color: #D8DEE9FF"> = (</span><span style="color: #8FBCBB">lecture_df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">lecture_df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93;).</span><span style="color: #8FBCBB">sum</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">co_selection_count</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">N_students</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF">: (</span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF">)) / (</span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF">) </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">Handle</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">division</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">zero</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">single</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> 0</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">though</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">unlikely</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">here</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; &gt; 0 </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93; &gt; 0:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">lift</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> / (</span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93;)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">lift</span><span style="color: #D8DEE9FF"> = 0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Confidence</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF"> -&gt; </span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">confidence_l1_to_l2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; &gt; 0 </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> 0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">association_rules</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Antecedent</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">l1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Consequent</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">l2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">support_pair</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Confidence</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF"> -&gt; </span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF">)&#39;: </span><span style="color: #8FBCBB">confidence_l1_to_l2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">lift</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Convert</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DataFrame</span></span>
<span class="line"><span style="color: #8FBCBB">rules_df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">association_rules</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Filter</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">meaningful</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">associations</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF"># 1. </span><span style="color: #8FBCBB">Minimum</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">Selected</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">at</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">least</span><span style="color: #D8DEE9FF"> 4 </span><span style="color: #8FBCBB">students</span><span style="color: #D8DEE9FF"> (4/25 = 0.16)</span></span>
<span class="line"><span style="color: #D8DEE9FF"># 2. </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF"> &gt; 1.2: </span><span style="color: #8FBCBB">A</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">strong</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">positive</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">relationship</span></span>
<span class="line"><span style="color: #8FBCBB">min_support</span><span style="color: #D8DEE9FF"> = 4 / </span><span style="color: #8FBCBB">N_students</span><span style="color: #D8DEE9FF">  # 0.16</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Filter</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sort</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">results</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span></span>
<span class="line"><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">rules_df</span><span style="color: #D8DEE9FF">[</span></span>
<span class="line"><span style="color: #D8DEE9FF">    (</span><span style="color: #8FBCBB">rules_df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Support</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; &gt;= </span><span style="color: #8FBCBB">min_support</span><span style="color: #D8DEE9FF">) &amp;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    (</span><span style="color: #8FBCBB">rules_df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Lift</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; &gt; 1.2)</span></span>
<span class="line"><span style="color: #D8DEE9FF">].</span><span style="color: #8FBCBB">sort_values</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Lift</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ascending</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">False</span><span style="color: #D8DEE9FF">).</span><span style="color: #8FBCBB">reset_index</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">drop</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Add</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">reverse</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">rules</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF"> -&gt; </span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">table</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">where</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">high</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Since</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">symmetrical</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">only</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">one</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">direction</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">needs</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">be</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">calculated</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">but</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Confidence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Helper</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">get</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Confidence</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF"> -&gt; </span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">presentation</span></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">get_confidence_l2_to_l1</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Antecedent</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Consequent</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Support</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93; &gt; 0 </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> 0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Confidence (L2 -&gt; L1)</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; = </span><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">apply</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">get_confidence_l2_to_l1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">axis</span><span style="color: #D8DEE9FF">=1)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Reorder</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">columns</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">presentation</span></span>
<span class="line"><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF">[&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Antecedent</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Consequent</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Support</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Confidence (L1 -&gt; L2)</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Confidence (L2 -&gt; L1)</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Lift</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Top Co-Selected Lecture Groups (Pairs):</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">to_markdown</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">False</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">floatfmt</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">.3f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">))</span></span></code></pre></div>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="655" height="384" src="https://mietwood.com/wp-content/uploads/2025/09/image-11.jpg" alt="" class="wp-image-3339" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-11.jpg 655w, https://mietwood.com/wp-content/uploads/2025/09/image-11-300x176.jpg 300w" sizes="auto, (max-width: 655px) 100vw, 655px" /></figure>
</div>


<h2 class="wp-block-heading">Interpretation of Lecture Groups</h2>



<p>The <strong>Lift</strong> values indicate the strength of the relationships:</p>



<h3 class="wp-block-heading">The Strongest Cohort (The 2.2+ Lift Groups)</h3>



<p>These are the three strongest, non-obvious combinations. A Lift of about 2.2 means that students who take one lecture are more than <strong>twice as likely</strong> to take the associated lecture as a student from the general population.</p>



<ul class="wp-block-list">
<li><strong>L9 ↔ L12 ↔ L8:</strong> These three lectures form a tight cluster. Specifically, <strong>L8 → L9</strong> and <strong>L9 → L12</strong> have extremely high association scores. This suggests a dedicated academic track or a highly correlated set of topics.</li>
</ul>
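<p>To make the Lift figure concrete, here is a minimal sketch of how it follows from raw counts. The counts are hypothetical and chosen only for illustration; they are not the numbers behind the table above.</p>

```python
def lift(n_total, n_a, n_b, n_both):
    """Lift of the pair (A, B): P(A and B) / (P(A) * P(B))."""
    support_ab = n_both / n_total   # share of students taking both lectures
    support_a = n_a / n_total       # share of students taking lecture A
    support_b = n_b / n_total       # share of students taking lecture B
    return support_ab / (support_a * support_b)


# Hypothetical counts (an assumption, not the post's data):
# 50 students, 10 take lecture A, 12 take lecture B, 6 take both.
example_lift = lift(n_total=50, n_a=10, n_b=12, n_both=6)
print(round(example_lift, 3))
```

<p>A value above 1 means the pair is chosen together more often than independent choices would predict; a value near 1 means the two choices look unrelated.</p>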



<h3 class="wp-block-heading">High Confidence Groups (The 1.5+ Lift Groups)</h3>



<p>These are groups where the selection of one lecture is a very strong predictor for the other:</p>



<ul class="wp-block-list">
<li><strong>L2 → L4 (87.5% Confidence):</strong> If a student selects L2, there is a very high, <strong>87.5% chance</strong> they also select L4. This suggests L2 might be a prerequisite, a foundational course, or a direct complement to L4.</li>



<li><strong>L2 → L3 (75.0% Confidence):</strong> Similarly, L2 and L3 are frequently chosen together, indicating a strong connection.</li>



<li><strong>L1 ↔ L2:</strong> These two lectures are highly associated, suggesting they are often taken in tandem.</li>
</ul>



<pre class="wp-block-verse">Does this mean that I should go to the lecturers of L5 and L2 and ask them to promote my L6, because their students are the most likely to have a satisfying experience in my lecture?</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Data-Driven Strategy for L6 Promotion</h2>



<h3 class="wp-block-heading">1. <strong>Prioritize L5 (High Co-Selection &amp; High Confidence)</strong></h3>



<p>You should absolutely focus on the L5 lecturer.</p>



<ul class="wp-block-list">
<li><strong>L5 → L6 Confidence: 72.7%</strong>
<ul class="wp-block-list">
<li>This means nearly 3 out of every 4 students who take L5 also choose your L6 lecture.</li>



<li><strong>Action:</strong> The L5 lecturer is teaching the same target audience as you. You could ask them to mention L6 as a <strong>natural follow-up</strong> or <strong>complementary course</strong> to their students.</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading">2. <strong>Prioritize L2 (High Predictive Power)</strong></h3>



<p>The relationship with L2 is even more predictive of a student landing in your lecture.</p>



<ul class="wp-block-list">
<li><strong>L2 → L6 Confidence: 75.0%</strong>
<ul class="wp-block-list">
<li>This means 3 out of every 4 students who take L2 end up in L6.</li>



<li><strong>Action:</strong> The L2 lecturer is essentially teaching a foundational course for a majority of your class. Ask them to promote L6 as the <strong>direct next step</strong> or <strong>most relevant application course</strong> for their content.</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading">3. <strong>The Importance of Satisfaction (What the Data Doesn&#8217;t Say)</strong></h3>



<p>The data confirms a strong <strong>selection link</strong>, but not a <strong>satisfaction link</strong>.</p>



<ul class="wp-block-list">
<li><strong>Selection:</strong> L5 and L2 students <em>are</em> your target market.</li>



<li><strong>Satisfaction:</strong> To confirm they find your lecture satisfactory, you&#8217;d need student feedback/evaluation data. A student who disliked L6 might still be highly likely to take it if it&#8217;s a required course for a specific program, for example.</li>
</ul>



<p>The strongest rationale for promotion is simply the high overlap: you are addressing a student cohort that already has a demonstrated interest pattern (L5/L2 → L6).</p>



<h3 class="wp-block-heading">Secondary Focus: L4 and L3</h3>



<p>While L4 and L3 have high total co-selection with L6 (L4: 66.7%, L3: 53.3%), their Lift and correlation scores are weaker. This means they are generally popular lectures, and while many of your students take them, the choice of L4 or L3 <strong>does not strongly predict</strong> the choice of L6.</p>



<p>Focusing your limited promotional time on the lecturers for <strong>L5 and L2</strong> will yield the highest return.</p>
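<p>The difference between a high co-selection share and a high confidence can be sketched in a few lines. Confidence is directional: the share of L5 students who also take L6 is not the same number as the share of L6 students who also take L5. The selections below are hypothetical and only illustrate the calculation; they do not reproduce the 72.7% or 75.0% reported above.</p>

```python
def confidence(baskets, antecedent, consequent):
    """Share of students who chose `antecedent` that also chose `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


# Hypothetical lecture selections, one set per student (illustrative only).
baskets = [
    {"L5", "L6"}, {"L5", "L6"}, {"L5", "L6"}, {"L5"},
    {"L4", "L6"}, {"L4", "L6"}, {"L4"}, {"L4"}, {"L4"}, {"L4"},
]

conf_l5_l6 = confidence(baskets, "L5", "L6")  # 3 of 4 L5 students take L6
conf_l6_l5 = confidence(baskets, "L6", "L5")  # 3 of 5 L6 students take L5
conf_l4_l6 = confidence(baskets, "L4", "L6")  # 2 of 6 L4 students take L6
print(conf_l5_l6, conf_l6_l5, round(conf_l4_l6, 3))
```

<p>In this toy data L4 appears in two of the five L6 baskets, yet choosing L4 says little about choosing L6, which is the pattern described for L4 and L3 above.</p>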



<h2 class="wp-block-heading"><strong>Country</strong> and <strong>University</strong> as moderating factors</h2>



<p>The factors of <strong>Country</strong> and <strong>University</strong> do show clear patterns in lecture selection among your L6 students, suggesting that existing relationships or shared academic paths likely influence their choices.</p>



<p>Here is the analysis of the moderating factors, based on the 15 students in your L6 lecture:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="724" height="195" src="https://mietwood.com/wp-content/uploads/2025/09/image-12.jpg" alt="" class="wp-image-3340" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-12.jpg 724w, https://mietwood.com/wp-content/uploads/2025/09/image-12-300x81.jpg 300w" sizes="auto, (max-width: 724px) 100vw, 724px" /></figure>
</div>


<p>Analyzing the two largest university groups shows even sharper differences, which is expected as they are likely organized groups of students who know each other.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="716" height="256" src="https://mietwood.com/wp-content/uploads/2025/09/image-13.jpg" alt="" class="wp-image-3341" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-13.jpg 716w, https://mietwood.com/wp-content/uploads/2025/09/image-13-300x107.jpg 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>
</div>
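<p>Tables like the two above can be produced by computing, for each group, the share of its students who selected each lecture. The following is a rough sketch using only the standard library; the student records and the field names <code>country</code> and <code>lectures</code> are assumptions made for illustration, not the real dataset.</p>

```python
from collections import defaultdict


def selection_rates(students, group_key):
    """Return {group: {lecture: share of the group's students who chose it}}."""
    counts = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for student in students:
        group = student[group_key]
        sizes[group] += 1
        for lecture in student["lectures"]:
            counts[group][lecture] += 1
    return {
        group: {lec: n / sizes[group] for lec, n in sorted(lectures.items())}
        for group, lectures in counts.items()
    }


# Hypothetical records (illustrative only, not the 15 students of L6).
students = [
    {"country": "Spain", "lectures": {"L4", "L1"}},
    {"country": "Spain", "lectures": {"L4", "L7"}},
    {"country": "Morocco", "lectures": {"L5"}},
    {"country": "Morocco", "lectures": {"L5", "L4"}},
]

rates = selection_rates(students, "country")
for country, by_lecture in rates.items():
    print(country, by_lecture)
```

<p>Passing a different <code>group_key</code> (for example a hypothetical <code>university</code> field) gives the same breakdown at the university level. With groups this small, the shares should be read as descriptive patterns rather than as statistically tested effects.</p>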


<h2 class="wp-block-heading">Theoretical Background for Moderating Factors</h2>



<h3 class="wp-block-heading">The Influence of Country: Cultural and Institutional Homophily</h3>



<p>The tendency for students from the same country (e.g., Spain or Morocco) to share similar lecture profiles can be explained by <strong>Homophily</strong> and <strong>Institutional Alignment</strong>.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Theory/Concept</td><td>Explanation</td><td>Application to Your Data</td></tr></thead><tbody><tr><td><strong>Cultural Homophily</strong></td><td>The principle that <em>&#8220;birds of a feather flock together.&#8221;</em> Individuals prefer to associate and bond with others who are similar to themselves (e.g., same nationality, language, cultural background).</td><td>Students from the same country are likely to <strong>communicate about their choices</strong> primarily in their shared native language (e.g., Spanish for Spain, Arabic/French for Morocco). This exchange promotes the selection of a common set of lectures (e.g., Spanish students favoring <strong>L4</strong> and <strong>L1</strong>).</td></tr><tr><td><strong>Institutional Alignment / Mobility Programs</strong></td><td>The structured academic agreements between home and host institutions dictate which courses are approved for credit.</td><td>Exchange programs often pre-approve specific &#8220;study packages.&#8221; If the University of Malaga exchange agreement primarily covers a field requiring <strong>L4</strong> and <strong>L7</strong>, those students will select that bundle. Your finding that Malaga students disproportionately select <strong>L7</strong> strongly supports this institutional influence.</td></tr><tr><td><strong>Country-Level Curriculum/Prerequisites</strong></td><td>Students from the same country may have completed similar foundational courses at home, making a certain set of lectures (like L6) compatible.</td><td>If Spanish universities standardize a curriculum where L4 is a logical next step to a prerequisite, those Spanish students will follow that path, leading to the high <strong>L4</strong> selection.</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">The Influence of University/Peer Group: Social Network Effects</h3>



<p>The even stronger, more granular influence of the specific university groups (like the unique L7 selection by U. Malaga students) is supported by <strong>Social Influence Theory</strong> and <strong>Bounded Rationality</strong>.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Theory/Concept</td><td>Explanation</td><td>Application to Your Data</td></tr></thead><tbody><tr><td><strong>Social Proof / Herd Behavior</strong></td><td>A type of conformity where individuals assume the actions of a large group in an attempt to reflect correct behavior. When students face <strong>uncertainty</strong> in a new academic environment, they look to their trusted peers for guidance.</td><td>Students arriving from the same university (e.g., U. Cadi Ayyad) form a <strong>tight local network</strong>. When one or two students select a lecture (like <strong>L5</strong>), the rest of the group follows to reduce the perceived risk and workload associated with taking classes alone.</td></tr><tr><td><strong>Network Constraint / Information Asymmetry</strong></td><td>The limited size of a local network dictates the information flow. Students only trust information from their close peers, ignoring broader student trends.</td><td>The U. Malaga students may not interact with the U. Cadi Ayyad students. Therefore, they only hear about the &#8220;safe&#8221; lecture bundle (e.g., L4, <strong>L7</strong>, L1) from their U. Malaga friends, leading to the highly specific, unique selection of <strong>L7</strong>.</td></tr><tr><td><strong>Convenience/Logistics</strong></td><td>At a practical level, coordinating lecture selection allows peers to have the same class schedule, making group projects, social planning, and travel easier.</td><td>The high co-selection of <strong>L4, L7, L1, and L2</strong> among U. Malaga students suggests they are coordinating their choices to minimize logistical friction, effectively treating the selection as a <strong>group decision</strong>.</td></tr></tbody></table></figure>



<pre class="wp-block-verse">By applying these concepts, you can elevate your analysis from mere correlation to a discussion of <strong>student decision-making under conditions of uncertainty and social constraint.</strong></pre>



<h2 class="wp-block-heading">The informational role of the lecture title</h2>



<p>Assuming that students read only the lecture titles, can you analyze the whole population and discover which lectures students select as a group (L1 &amp; L2 … etc.) based on the informational role of the title?</p>



<p>This is an excellent way to use the data to understand the <strong>thematic logic</strong> that guides students&#8217; decisions, rather than just the frequency. By analyzing the corpus of words in the high-Lift lecture titles, we can discover the &#8220;driving wording&#8221; that creates strong, predictable co-selection patterns.</p>



<p>The primary driving themes are <strong>Economics</strong>, <strong>Management</strong>, and <strong>Business</strong>, but the strongest associations occur when these words are combined with <strong>qualifying terms</strong> like <em>International</em>, <em>Applied</em>, or <em>Process</em>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Analysis of Driving Wording in Lecture Titles</h2>



<p>Based on the highest Lift scores (strongest association patterns), the lecture groups cluster into three distinct thematic tracks driven by specific keywords:</p>



<h3 class="wp-block-heading">1. Driving Theme: International &amp; Political Economy 🌍</h3>



<p>This is the strongest thematic driver in the entire dataset, creating three of the top four co-selection groups.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Lecture Group</td><td>Titles &amp; Key Wording</td><td>Driving Wording Logic</td></tr></thead><tbody><tr><td><strong>L9 &amp; L12</strong> (Lift&nbsp;2.232)</td><td>L9: <strong>International Economics</strong> / L12: <strong>Political economy</strong></td><td>Students seek a deep understanding of how <strong>global power (Political)</strong> and <strong>global markets (International)</strong> interact. The co-selection is driven by the desire to merge theoretical macroeconomics with political strategy.</td></tr><tr><td><strong>L8 &amp; L9</strong> (Lift&nbsp;2.232)</td><td>L8: <strong>International Competitiveness</strong> / L9: <strong>International Economics</strong></td><td>The term <strong>&#8220;International&#8221;</strong> is the central driver. Students are selecting a specialized track in global trade, where L9 provides the foundational theory and L8 provides the policy application (Competitiveness).</td></tr></tbody></table></figure>






<p><strong>Driving Wording:</strong> <strong>International</strong>, <strong>Economy</strong>, <strong>Political</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">2. Driving Theme: Applied Economic Analysis</h3>



<p>This theme links foundational economic knowledge with quantitative skills and real-world application.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Lecture Group</td><td>Titles &amp; Key Wording</td><td>Driving Wording Logic</td></tr></thead><tbody><tr><td><strong>L1 &amp; L2</strong> (Lift&nbsp;2.083)</td><td>L1: <strong>Analysis</strong> of&#8230; <strong>Economic</strong> and Social Indicators / L2: <strong>Applied Economics</strong> Real-World Challenges</td><td>The core terms <strong>&#8220;Analysis&#8221;</strong> and <strong>&#8220;Applied&#8221;</strong> signal a curriculum path focused on practical data skills (L1) to solve real-world problems (L2), appealing to students who want measurable, deployable skills.</td></tr><tr><td><strong>L2 &amp; L3</strong> (Lift&nbsp;1.562)</td><td>L2: <strong>Applied Economics</strong> / L3: <strong>Business Analytics</strong> for <strong>Financial Decisions</strong></td><td>The combination of <strong>&#8220;Applied&#8221;</strong> and <strong>&#8220;Analytics&#8221;</strong> defines a quantitative financial student. They select L2 for the general economic context and L3 for the specific financial toolset.</td></tr></tbody></table></figure>






<p><strong>Driving Wording:</strong> <strong>Applied</strong>, <strong>Analysis</strong>, <strong>Economics</strong>, <strong>Decisions</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">3. Driving Theme: Business Management &amp; Strategy</h3>



<p>This group is driven by a focus on business processes and the organizational changes brought by technology.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Lecture Group</td><td>Titles &amp; Key Wording</td><td>Driving Wording Logic</td></tr></thead><tbody><tr><td><strong>L7 &amp; L11</strong> (Lift&nbsp;1.786)</td><td>L7: <strong>Economics of Innovation</strong> / L11: <strong>People management</strong> in the <strong>digital economy</strong></td><td>The terms <strong>&#8220;Innovation&#8221;</strong> and <strong>&#8220;Digital&#8221;</strong> are the semantic link. Students are building a profile focused on managing organizations in a rapidly changing, technology-driven environment, linking macro strategy (L7) with HR/people skills (L11).</td></tr><tr><td><strong>L5 &amp; L6</strong> (Lift&nbsp;1.212)</td><td>L5: <strong>Business process management</strong> / L6: <strong>Customer Experience Management</strong></td><td>The recurring term <strong>&#8220;Management&#8221;</strong> creates the link. L5 focuses on the <strong>internal</strong> view (Process) and L6 focuses on the <strong>external</strong> view (Customer), showing students seek comprehensive skills for managing the entire value chain.</td></tr></tbody></table></figure>






<p><strong>Driving Wording:</strong> <strong>Management</strong>, <strong>Process</strong>, <strong>Digital</strong>, <strong>Innovation</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Summary of Driving Wording</h2>



<p>The strongest selection drivers are not single words but rather <strong>qualified thematic phrases</strong> that define an academic specialization:</p>



<p><strong>Modern Business Specialization:</strong> Driven by <strong>Management</strong> combined with <strong>Process</strong> or <strong>Digital/Innovation</strong> (L5, L6, L7, L11).</p>



<p><strong>Macro/Global Specialization:</strong> Driven by <strong>International</strong> and <strong>Political Economy</strong> (L9, L12, L8).</p>



<p><strong>Quantitative Specialization:</strong> Driven by <strong>Applied Economics</strong> and <strong>Business Analytics</strong> (L2, L3, L1).</p>
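<p>A simple way to check which words two titles literally share is to tokenize them and intersect the word sets. The sketch below uses a handful of the titles quoted above; the stop-word list and the exact-match rule are simplifying assumptions, not the method used to produce the tables.</p>

```python
import re
from itertools import combinations

# Small, hand-picked stop-word list (an assumption for this sketch).
STOPWORDS = {"of", "in", "the", "and", "for"}

titles = {
    "L5": "Business process management",
    "L6": "Customer Experience Management",
    "L8": "International Competitiveness",
    "L9": "International Economics",
    "L12": "Political economy",
}


def tokens(title):
    """Lower-case words of a title, without stop words."""
    words = re.findall(r"[a-z]+", title.lower())
    return {w for w in words if w not in STOPWORDS}


def shared_words(titles):
    """Map each pair of lectures to the words their titles have in common."""
    result = {}
    for a, b in combinations(sorted(titles), 2):
        common = tokens(titles[a]).intersection(tokens(titles[b]))
        if common:
            result[(a, b)] = common
    return result


overlaps = shared_words(titles)
for pair, words in overlaps.items():
    print(pair, sorted(words))
```

<p>Exact matching finds <em>management</em> for L5 and L6 and <em>international</em> for L8 and L9, but it misses the L9 and L12 link because <em>economics</em> and <em>economy</em> are different tokens. Catching that kind of link would need stemming or a semantic similarity measure.</p>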



<p>You can read more about wording analysis here: <a href="https://mietwood.com/organizational-aspiration-for-social-impact">Organizational aspiration for social impact</a></p>



<h2 class="wp-block-heading">What wording is recommended for L6? How can I modify or extend the title of L6 to experiment with attracting more students in the next period?</h2>



<p>The goal is to maximize enrollment by aligning the L6 title with the <strong>thematic drivers</strong> of the most predictive lectures (L5, L2, L4). Your current title is <strong>L6: Customer Experience Management</strong>.</p>



<p>Here are the recommendations for wording modifications, structured by the target student profile you want to attract:</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Recommended Title Modifications for L6</h2>



<h3 class="wp-block-heading">Strategy 1: Attract <strong>L5</strong> Students (Strongest Existing Link)</h3>



<p>The L5 title is <em>Business process management</em>. These students seek <strong>internal efficiency</strong> as a foundation for external success.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Recommendation</td><td>Suggested Title (Experiment 1)</td><td>Rationale</td></tr></thead><tbody><tr><td><strong>Integrate &#8220;Process&#8221;</strong></td><td><strong>L6: Customer Experience Management and Service Design Process</strong></td><td>By including the word <strong>&#8220;Process,&#8221;</strong> you explicitly link L6 to the operational skills L5 students value, making it the logical <strong>next step</strong> for their expertise.</td></tr><tr><td><strong>Focus on Value</strong></td><td><strong>L6: Managing Business Processes for Customer Value and Experience</strong></td><td>This title frames L6 as the <strong>culmination</strong> of L5, showing how mastering L5&#8217;s internal processes directly leads to the high-value outcome of great customer experience.</td></tr></tbody></table></figure>






<h3 class="wp-block-heading">Strategy 2: Attract <strong>L2/L4</strong> Students (Applied &amp; Analytical) </h3>



<p>The L2 title is <em>Applied Economics Real-World Challenges and Solutions</em>. L4 is <em>Business plan</em>. These students are <strong>practical and analytical</strong>.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Recommendation</td><td>Suggested Title (Experiment 2)</td><td>Rationale</td></tr></thead><tbody><tr><td><strong>Integrate &#8220;Analytics&#8221;</strong></td><td><strong>L6: Customer Experience Analytics: Data-Driven Strategies for Service Improvement</strong></td><td>The term <strong>&#8220;Analytics&#8221;</strong> strongly attracts L2/L4 students (who also take L3: <em>Business Analytics&#8230;</em>). This signals that L6 is a <strong>quantitative course</strong>, not just a soft skill.</td></tr><tr><td><strong>Focus on &#8220;Metrics/KPIs&#8221;</strong></td><td><strong>L6: Customer Experience Management: Measuring and Optimizing Key Service Metrics</strong></td><td>This appeals to the <strong>Applied/Solutions</strong> mindset, promising tools to measure CX performance and directly influence business outcomes, fitting the L2/L4 focus on solutions and planning.</td></tr></tbody></table></figure>






<h3 class="wp-block-heading">Strategy 3: Attract <strong>L7/L11</strong> Students (Future/Digital Focus) </h3>



<p>The L7/L11 titles feature <strong>Innovation</strong> and <strong>Digital Economy</strong>.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Recommendation</td><td>Suggested Title (Experiment 3)</td><td>Rationale</td></tr></thead><tbody><tr><td><strong>Integrate &#8220;Digital&#8221;</strong></td><td><strong>L6: Digital Customer Experience (DCX) Management</strong></td><td>The term <strong>&#8220;Digital&#8221;</strong> is a powerful modern driver. If your course includes any digital touchpoints (apps, online service, social media), using &#8220;DCX&#8221; will immediately pull in the students from the L7/L11 &#8220;Innovation&#8221; track.</td></tr></tbody></table></figure>






<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Final Recommendation: The Best Title to Test</h3>



<p>The most balanced and powerful title that leverages multiple co-selection themes is: <strong>Customer Experience Management: Data, Process, and Digital Strategy</strong></p>



<p>This phrase:</p>



<ul class="wp-block-list">
<li>Includes <strong>Management</strong> (L5, L6, L11 theme).</li>



<li>Includes <strong>Process</strong> (L5 link).</li>



<li>Includes <strong>Data</strong> (L2/L3 Analytics link).</li>



<li>Includes <strong>Digital Strategy</strong> (L7/L11 Innovation link).</li>
</ul>



<p>Thank you for reading.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/profiling">Student Profiling for your lecture</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Fuzzy-Set Qualitative Comparative Analysis</title>
		<link>https://mietwood.com/qualitative-comparative-analysis</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sun, 28 Sep 2025 12:16:51 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3329</guid>

					<description><![CDATA[<p>Qualitative comparative analysis (QCA) is an asymmetric data analysis technique that combines the logic and empirical intensity of qualitative approaches. The symmetric data analysis (e.g., correlation and multiple regression analysis) &#8230; The asymmetric data analysis (i.e., individual case outcome forecasts) &#8230; Based on: Fuzzy-set Qualitative Comparative Analysis (fsQCA): Guidelines for research practice in Information Systems...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/qualitative-comparative-analysis">Fuzzy-Set Qualitative Comparative Analysis</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Qualitative comparative analysis (QCA) is an asymmetric data analysis technique that combines the logic and empirical intensity of qualitative approaches.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="461" src="https://mietwood.com/wp-content/uploads/2025/09/image-7.jpg" alt="Qualitative comparative analysis (QCA) is an asymmetric data analysis technique that combines the logic and empirical intensity of qualitative approaches" class="wp-image-3330" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-7.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/09/image-7-300x135.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-7-768x346.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Qualitative comparative analysis (QCA) is an asymmetric data analysis technique that combines the logic and empirical intensity of qualitative approaches</figcaption></figure>



<p>The symmetric data analysis (e.g., correlation and multiple regression analysis) &#8230;</p>



<p>The asymmetric data analysis (i.e., individual case outcome forecasts) &#8230;</p>



<p>Based on: <a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037" target="_blank" rel="noopener">Fuzzy-set Qualitative Comparative Analysis (fsQCA): Guidelines for research practice in Information Systems and marketing</a> (<a href="https://www.sciencedirect.com/author/55387371600/ilias-o-pappas" target="_blank" rel="noopener">Ilias O. Pappas</a>, <a href="https://www.sciencedirect.com/author/7006553735/arch-g-woodside" target="_blank" rel="noopener">Arch G. Woodside</a>)</p>



<p><strong>Qualitative </strong>inductive reasoning, with data analyzed “by case” rather than “by variable”, is combined with quantitative empirical testing, in which sufficient and necessary conditions for an outcome are identified through statistical methods. In most cases, QCA is useful in quantitative studies, as it gives the researcher a deep view of the data through a quantitative analysis that also has several characteristics of qualitative analysis.</p>



<p><strong>Case studies</strong> focus on describing, explaining, and forecasting the effects of single and combinatorial antecedent conditions on outcomes, while variable studies focus on the similarities of variances of two or more variables. A “condition” is a point or interval range of an antecedent or outcome; a “variable” is a characteristic that varies.</p>



<p>Here are a few examples of conditions versus variables: “Male” is a condition; “gender” is a variable. “Swedish” is a condition; “nationality” is a variable. “Expert” is a condition; “expertise” is a variable.</p>



<p>The <strong>goal of QCA</strong> is to explain causality in complex real-life phenomena. QCA assumes “multiple-conjunctural causation”, a nonlinear, nonadditive, non-probabilistic conception that rejects any form of permanent causality and stresses that different paths can lead to the same outcome. QCA investigates complex combinations of conditions and diversity. It uses Boolean algebra and Boolean minimization algorithms to capture patterns of multiple-conjunctural causation and to simplify complex data structures.</p>
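<p>The core step of Boolean minimization can be sketched as follows: if two configurations lead to the same outcome and differ in exactly one condition, that condition is logically irrelevant and can be dropped. This is a simplified illustration of a single merge step, not a full minimization algorithm; the function name and the tuple encoding are assumptions made for this sketch.</p>

```python
def merge_once(config_a, config_b):
    """Merge two configurations that differ in exactly one condition.

    Each configuration is a tuple of 1 (condition present), 0 (absent)
    or "-" (already eliminated). Returns the merged tuple, or None if
    the two configurations cannot be merged in one step.
    """
    if len(config_a) != len(config_b):
        return None
    differing = [
        i for i, (x, y) in enumerate(zip(config_a, config_b)) if x != y
    ]
    if len(differing) != 1:
        return None
    position = differing[0]
    # Only a present/absent contrast makes the condition redundant.
    if "-" in (config_a[position], config_b[position]):
        return None
    merged = list(config_a)
    merged[position] = "-"
    return tuple(merged)


# Hypothetical conditions (A, B, C): A*B*C and A*B*~C both show the outcome,
# so C is irrelevant and the pair reduces to A*B.
reduced = merge_once((1, 1, 1), (1, 1, 0))
print(reduced)
```

<p>Two configurations that differ in more than one condition, such as <code>(1, 0, 1)</code> and <code>(0, 1, 1)</code>, are not merged; they remain separate paths to the same outcome.</p>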



<h2 class="wp-block-heading" id="sect0025">Types of qualitative comparative analysis (QCA)</h2>



<h3 class="wp-block-heading" id="sect0030">CsQCA and mvQCA</h3>



<p><strong>CsQCA</strong> (crisp-set QCA) is the first variation of QCA. It is a tool created to deal with <strong>complex sets</strong> of binary data. Because it relies on Boolean algebra, csQCA takes binary data (0 or 1) as input and uses logical operations in its procedure. It is therefore very important to dichotomize the variables in a useful and meaningful manner.</p>
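<p>As a rough illustration of what dichotomizing means in practice, the sketch below turns raw values into 0/1 conditions with fixed thresholds and groups the cases into truth-table rows. The variable names, thresholds, and cases are all hypothetical; in a real study the cut-off points need a substantive justification.</p>

```python
from collections import defaultdict


def truth_table(cases, thresholds, outcome_key):
    """Group cases by their dichotomized configuration.

    Returns {configuration tuple: list of observed outcome values}.
    """
    rows = defaultdict(list)
    for case in cases:
        config = tuple(
            1 if case[name] >= cut else 0 for name, cut in thresholds.items()
        )
        rows[config].append(case[outcome_key])
    return dict(rows)


# Hypothetical cases and thresholds (illustrative only).
cases = [
    {"satisfaction": 6.5, "tenure_years": 4, "loyal": 1},
    {"satisfaction": 6.0, "tenure_years": 1, "loyal": 1},
    {"satisfaction": 3.0, "tenure_years": 5, "loyal": 0},
    {"satisfaction": 2.5, "tenure_years": 1, "loyal": 0},
]
thresholds = {"satisfaction": 5.0, "tenure_years": 3}

table = truth_table(cases, thresholds, "loyal")
for config, outcomes in table.items():
    print(config, outcomes)
```

<p>In this toy table, both rows with high satisfaction show the outcome regardless of tenure, which is the kind of pattern that Boolean minimization then simplifies.</p>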



<p><strong>mvQCA</strong> treats variables as <strong>multi-valued</strong> instead of dichotomous. MvQCA retains the idea of performing a synthesis of the dataset: cases with the same value on the outcome variable are explained by a solution, which contains combinations of variables that explain a number of cases with the outcome.</p>



<p><strong>FsQCA</strong> addresses an important limitation of csQCA: because its variables are binary, csQCA cannot fully capture the complexity of cases that naturally vary by level or degree. This restriction is likely an important reason that QCA has not been widely adopted in multiple contexts, including IS and marketing research. FsQCA extends csQCA by integrating fuzzy-sets and fuzzy-logic principles with QCA, so the variables can take any value in the range 0–1. FsQCA is able to overcome several limitations of both csQCA and mvQCA, and has received increased attention recently. When fsQCA is applied together with complexity theory, it provides the opportunity to gain deeper and richer insight into data.</p>
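<p>With membership scores between 0 and 1, the logical operations of csQCA carry over to fuzzy sets in the usual way: AND becomes the minimum, OR the maximum, and NOT the complement. A minimal sketch, with hypothetical membership scores for a single case:</p>

```python
def fuzzy_and(*memberships):
    """Membership in the intersection of sets: the weakest link."""
    return min(memberships)


def fuzzy_or(*memberships):
    """Membership in the union of sets: the strongest link."""
    return max(memberships)


def fuzzy_not(membership):
    """Membership in the negated set."""
    return 1.0 - membership


# Hypothetical scores for one case (illustrative only).
high_satisfaction = 0.8
long_tenure = 0.3

both = fuzzy_and(high_satisfaction, long_tenure)       # satisfied AND long tenure
either = fuzzy_or(high_satisfaction, long_tenure)      # satisfied OR long tenure
short_tenure = fuzzy_not(long_tenure)                  # NOT long tenure
print(both, either, round(short_tenure, 2))
```

<p>With scores restricted to 0 and 1, these three functions reduce to the ordinary Boolean operations, which is why fsQCA can be seen as an extension of csQCA rather than a separate method.</p>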



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-1024x576.jpg" alt="fsQCA and cluster analysis" class="wp-image-3331" srcset="https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-1024x576.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-300x169.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-768x432.jpg 768w, https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-1536x864.jpg 1536w, https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-2048x1152.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">fsQCA and cluster analysis</figcaption></figure>



<h3 class="wp-block-heading" id="sect0040">FsQCA and cluster analysis</h3>



<p>Case-based techniques, such as fsQCA and cluster analysis, have been employed as a way of moving beyond variance-based methods. These two techniques have similarities as they both employ multidimensional spaces and often people ask how fsQCA differs from cluster analysis and why do we need it. A main difference between the two methods is the kind of research questions they are able to address.</p>



<p>Specifically, <strong>cluster analysis</strong> answers questions such as which cases are more similar to each other, while fsQCA can identify the different configurations that constitute sufficient and/or necessary conditions for the outcome of interest. The researcher should choose the more appropriate method depending on the focus of the study. The differences stem from the fact that <em>“QCA addresses the positioning of cases in [multidimensional] spaces via set theoretic operations while cluster analysis relies on geometric distance measures and concepts of variance minimization”</em>. To this end, prior studies compare fsQCA with cluster analysis and show how fsQCA can handle causal complexity with fine-grained level data, or how it can identify more solutions than cluster analysis. The literature continues to discuss QCA and cluster analysis, and the differences between the two approaches make them suitable for different types of studies.</p>
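<p>The set-theoretic view can be made concrete with the two measures commonly reported in fsQCA for a sufficiency claim "X is sufficient for Y": consistency (how far membership in X is a subset of membership in Y) and coverage (how much of Y is accounted for by X). The sketch below uses the standard formulas with hypothetical membership scores; it is not tied to any particular fsQCA software.</p>

```python
def overlap(x, y):
    """Sum of the case-wise minimum of two membership score lists."""
    return sum(min(xi, yi) for xi, yi in zip(x, y))


def sufficiency_consistency(x, y):
    """sum(min(x_i, y_i)) / sum(x_i): degree to which X is a subset of Y."""
    return overlap(x, y) / sum(x)


def sufficiency_coverage(x, y):
    """sum(min(x_i, y_i)) / sum(y_i): share of Y accounted for by X."""
    return overlap(x, y) / sum(y)


# Hypothetical fuzzy membership scores for four cases (illustrative only).
x = [0.8, 0.6, 0.2, 0.9]   # membership in the configuration
y = [0.9, 0.7, 0.4, 0.6]   # membership in the outcome

consistency = sufficiency_consistency(x, y)
coverage = sufficiency_coverage(x, y)
print(round(consistency, 3), round(coverage, 3))
```

<p>Cluster analysis would instead compare cases by a distance between their score vectors. Neither measure above uses a distance, which is the difference in logic that the quotation describes.</p>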



<p>Read an example here: <a href="https://mietwood.com/hierarchical-agglomerative-clustering-for-product-grouping">Hierarchical Agglomerative Clustering for Product Grouping</a></p>



<h2 class="wp-block-heading" id="sect0045">Adoption of fsQCA in relevant studies</h2>



<p>Configurational approaches have become more popular over the past few years in different areas, with fsQCA playing a large part in this, as most studies prefer fuzzy-set over crisp-set and multi-value QCA (<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0365" target="_blank" rel="noopener">Thiem &amp; Dusa, 2013</a>). In detail, fsQCA has been employed in&nbsp;<a href="https://www.sciencedirect.com/topics/computer-science/information-system" target="_blank" rel="noopener">information systems</a>&nbsp;(<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0095" target="_blank" rel="noopener">Fedorowicz, Sawyer, &amp; Tomasino, 2018</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0165" target="_blank" rel="noopener">Liu et al., 2017</a>), online business and marketing (<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0235" target="_blank" rel="noopener">Pappas et al., 2016</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0225" target="_blank" rel="noopener">Pappas, 2018</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0420" target="_blank" rel="noopener">Woodside, 2017</a>),&nbsp;<a href="https://www.sciencedirect.com/topics/psychology/consumer-psychology" target="_blank" rel="noopener">consumer psychology</a>&nbsp;(<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0325" target="_blank" rel="noopener">Schmitt, Grawe, &amp; Woodside, 2017</a>), strategy and organizational research (<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0105" target="_blank" rel="noopener">Fiss, 2011</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0135" target="_blank" rel="noopener">Greckhamer et al., 2018</a>), education (<a
href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0245" target="_blank" rel="noopener">Pappas, Giannakos et al., 2017</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0270" target="_blank" rel="noopener">Plewa, Ho, Conduit, &amp; Karpen, 2016</a>),&nbsp;<a href="https://www.sciencedirect.com/topics/social-sciences/data-science" target="_blank" rel="noopener">data science</a>&nbsp;(<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0375" target="_blank" rel="noopener">Vatrapu, Mukkamala, Hussain, &amp; Flesch, 2016</a>) and&nbsp;<a href="https://www.sciencedirect.com/topics/computer-science/learning-analytics" target="_blank" rel="noopener">learning analytics</a>&nbsp;(<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0215" target="_blank" rel="noopener">Papamitsiou et al., 2018</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0335" target="_blank" rel="noopener">Sergis, Sampson, &amp; Giannakos, 2018</a>). This tutorial aims to increase the adoption of fsQCA in IS and marketing studies following the call for more empirical work in the area (<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0085" target="_blank" rel="noopener">El Sawy, Malhotra, Park, &amp; Pavlou, 2010</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0105" target="_blank" rel="noopener">Fiss, 2011</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0415" target="_blank" rel="noopener">Woodside, 2014</a>,&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0420" target="_blank" rel="noopener">2017</a>).</p>



<h2 class="wp-block-heading">Benefits of fsQCA</h2>



<p>FsQCA is useful for both inductive and deductive reasoning in theory building, elaboration, and testing. The analysis allows the researcher to identify specific cases in the sample. With this knowledge, the researcher can go back to the cases and use contextual information that was not included in the analysis to further explain and discuss the findings.</p>



<p>A typical variance-based analysis would identify a single best solution, thus limiting the results. FsQCA studies can compare the findings between different data analysis techniques to describe how different stories are hidden in the same dataset. It is recommended to combine fsQCA with other data analysis techniques if possible.</p>



<h2 class="wp-block-heading">How to use fsQCA in a typical e-commerce study</h2>



<h3 class="wp-block-heading">Sampling</h3>



<p>The study examined cognitive and affective perceptions as antecedents of online shopping behavior in personalized e-commerce environments. We used a typical snowball sampling methodology to recruit participants and controlled for respondents’ previous experience with both online shopping and personalized services. The final sample comprised 582 individuals with experience in online shopping and personalized services. We collected data through a questionnaire built with measures adopted from the literature. Appendix A (as presented in the original study) lists the construct definitions and the questionnaire items used to measure each construct, along with descriptive statistics and loadings.</p>



<h3 class="wp-block-heading">Evaluate constructs for reliability and validity</h3>



<p>As is typical in similar quantitative studies, we first <strong>evaluate constructs for reliability and validity</strong>. This step should always be performed when appropriate; it is not directly related to the fsQCA analysis, as it depends on the type of variables used in the study. Construct reliability and validity, as the names imply, refer to the construct itself and not to the method of analysis used to examine relations between constructs.</p>



<p>Reliability testing, based on the Cronbach’s alpha indicator, showed acceptable indices of internal consistency, since all constructs exceed the cut-off threshold of 0.70. The average variance extracted (AVE) for all constructs ranged between 0.55 and 0.84, all correlations were lower than 0.80, and the square roots of the AVEs for all constructs were larger than their correlations. The detailed findings of the confirmatory analysis may be found in the original paper.</p>
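


<p>The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of the original study: the function names, the five hypothetical respondents, and the example correlation of 0.62 are all assumptions made for the example.</p>

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, same respondents in each."""
    k = len(items)
    # Total score of each respondent across all items of the construct
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / pvariance(totals))

def fornell_larcker_ok(ave_a, ave_b, corr_ab):
    """True if the square root of both AVEs exceeds the constructs' correlation."""
    return min(ave_a, ave_b) ** 0.5 > abs(corr_ab)

# Hypothetical answers of five respondents to three items of one construct
items = [
    [5, 6, 3, 7, 2],
    [4, 6, 3, 6, 2],
    [5, 7, 2, 6, 3],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))

# Hypothetical pair of constructs with AVEs of 0.55 and 0.84
print(fornell_larcker_ok(0.55, 0.84, 0.62))
```

<p>An alpha above 0.70 would pass the cut-off mentioned above; the second function only covers the comparison of square-root AVEs with correlations, not the estimation of the AVEs themselves.</p>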



<h3 class="wp-block-heading" id="sect0070">Contrarian case analysis</h3>



<p>Contrarian case analysis is performed outside fsQCA, but we present it here because it can serve as an easy and quick way to examine how many cases in our sample are not explained by main effects, and thus they would not be included in the outcome of a typical variance-based approach, e.g., correlation or regression analysis.</p>
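


<p>One way to run such a check, sketched here under assumptions, is to split a predictor and the outcome into quintiles and count the cases that fall into the opposite corners of the cross-tabulation. The quintile split, the function names, and the data are illustrative choices, not taken from the study.</p>

```python
def quintile(values):
    """Assign each value a quintile from 1 (lowest) to 5 (highest) by rank.

    Ties are broken by position in the list, which is a simplification.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // len(values) + 1
    return bins

def contrarian_cases(x, y):
    """Indices of cases that are high on x but low on y, or the reverse."""
    qx, qy = quintile(x), quintile(y)
    return [i for i in range(len(x))
            if (qx[i] >= 4 and qy[i] <= 2) or (qx[i] <= 2 and qy[i] >= 4)]

# Hypothetical scores: a perception measure (x) and intention to purchase (y)
x = [1, 2, 3, 4, 5, 6, 7, 6, 2, 7]
y = [1, 2, 3, 4, 5, 6, 7, 1, 7, 6]
contrarians = contrarian_cases(x, y)
print(contrarians)
```

<p>Cases returned by this function run against a positive main effect, so a correlation or regression would treat them as noise.</p>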



<h3 class="wp-block-heading">Data Calibration</h3>



<p>Unlike traditional methods, fsQCA does not work with probabilities; instead, <strong>data are transformed from ordinal or interval scales into degrees of membership in the target set</strong>, which show whether and to what degree a case belongs to a specific set. “In essence, a fuzzy membership score attaches a truth value, not a probability, to a statement”.</p>



<p>For example, the variable intention to purchase can be coded as “high intention to purchase”, and we will be looking for the presence or absence of the condition high intention to purchase (“intention to purchase” is the variable; “high intention to purchase” is a condition). Similarly, we code the rest of the variables. </p>



<p>The method computes the presence of a condition or its opposite (i.e., its negation). The negation of a condition is referred to in the literature as the absence of a condition, and the two terms have been used interchangeably, depending on how the absence is computed. The term absence has also been used to describe a condition that is irrelevant in a configuration, similar to the “do not care” term that is also often used in the literature.</p>



<p>This distinction is not often addressed or clarified; we therefore suggest that researchers clearly define these terms in future work to avoid misunderstandings.</p>
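


<p>The difference between a negated condition and a “do not care” condition can be made concrete with a small sketch. It uses the standard fuzzy-set operations (negation as one minus the membership score, logical AND as the minimum); the condition names and scores are hypothetical.</p>

```python
def negate(membership):
    """Fuzzy negation: membership in the set 'not A'."""
    return 1 - membership

def configuration_membership(case, present=(), negated=()):
    """Fuzzy AND (minimum) over the conditions a configuration specifies.

    Conditions listed in neither tuple are 'do not care' and are ignored.
    """
    scores = [case[name] for name in present]
    scores += [negate(case[name]) for name in negated]
    return min(scores)

# Hypothetical calibrated scores of one respondent
case = {"quality": 0.9, "enjoyment": 0.2, "anxiety": 0.6}

# Configuration: quality AND NOT enjoyment; anxiety is 'do not care'
score = configuration_membership(case, present=("quality",), negated=("enjoyment",))
print(score)
```

<p>Here the negated condition enters the calculation with a transformed score, while the “do not care” condition does not enter it at all.</p>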



<h3 class="wp-block-heading" id="sect0085">Transform data into fuzzy-sets</h3>



<p>In fsQCA we need to calibrate our variables to form fuzzy sets with values ranging from 0 to 1. If we think of a fuzzy set as a group, the values from 0 to 1 define whether and to what degree a case belongs to this group. A case with a fuzzy membership score of 1 is a <em>full member</em> of the fuzzy set (fully in the set), and a case with a membership score of 0 is a <em>full non-member</em> of the set (fully out of the set). A membership score of 0.5 is exactly in the middle: such a case is both a member and a non-member of the fuzzy set, and therefore belongs to what is known as the <em>intermediate</em> set. The intermediate-set point is the value of maximum ambiguity as to whether a case is more in or more out of the target set.</p>



<p>Data calibration may be either <strong>direct or indirect</strong>. In <strong>direct calibration</strong>, the researcher chooses exactly three qualitative breakpoints, which define the level of membership in the fuzzy set for each case (fully in, intermediate, fully out). In the <strong>indirect method</strong>, the measurements are rescaled based on qualitative assessments. The researcher may calibrate a measure differently depending on what is being investigated, and either method may be chosen depending on the researcher’s substantive knowledge of both the data and the underlying theory. The direct method is recommended and more common: the researcher sets three values corresponding to full-set membership, intermediate-set membership, and full-set non-membership. This can lead to more rigorous studies that are easier to replicate and validate, since it is clearer how the thresholds have been chosen.</p>



<p>Percentiles allow the calibration of any measure regardless of its original values. Specifically, we can compute the 95th, 50th, and 5th percentiles of our measures and use these values as the three thresholds (full membership, intermediate, and full non-membership) in the fsQCA software.</p>



<p>For the widely used seven-point Likert scales (1=Not at all, 7=Very much) in particular, previous studies suggest that the values 6, 4, and 2 can be used as thresholds. Similarly, for a five-point Likert scale the thresholds could be 4, 3, and 2.</p>
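


<p>A minimal sketch of direct calibration with three thresholds follows. It assumes the common log-odds convention in which the full-membership threshold maps to log-odds of +3 and the full non-membership threshold to −3, so the thresholds themselves receive scores of about 0.95 and 0.05 rather than exactly 1 and 0. The helper names and the <code>spend</code> data are hypothetical.</p>

```python
import math

def percentile(values, p):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100
    low = math.floor(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)

def calibrate(x, full_in, crossover, full_out):
    """Map a raw value to a fuzzy membership score between 0 and 1."""
    if x >= crossover:
        log_odds = 3 * (x - crossover) / (full_in - crossover)
    else:
        log_odds = 3 * (x - crossover) / (crossover - full_out)
    return 1 / (1 + math.exp(-log_odds))

# Seven-point Likert item with the fixed thresholds 6, 4 and 2
likert_scores = [calibrate(v, 6, 4, 2) for v in range(1, 8)]
print([round(s, 2) for s in likert_scores])

# Hypothetical continuous measure calibrated at its 95th, 50th and 5th percentiles
spend = [12, 18, 25, 31, 40, 44, 52, 67, 80, 120]
p95, p50, p5 = (percentile(spend, p) for p in (95, 50, 5))
spend_scores = [calibrate(v, p95, p50, p5) for v in spend]
print([round(s, 2) for s in spend_scores])
```

<p>The same function serves both cases: only the three thresholds change, which is what makes the direct method easy to document and replicate.</p>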



<h3 class="wp-block-heading" id="sect0105">Interpreting and presenting the solutions</h3>



<p>The fsQCA software provides all three solutions (complex, parsimonious, and intermediate) every time. The complex and parsimonious solutions are computed regardless of any simplifying assumptions made by the researcher (e.g., choosing the presence or absence/negation of a variable), while the intermediate solution depends on these assumptions. Because the intermediate solution includes both core and peripheral conditions, we need an easy way to distinguish them that helps us interpret and present the solutions more clearly.</p>



<p>To improve the presentation of the findings we can transform the solutions from fsQCA output into a table that is easier to read. Typically, </p>



<ol class="wp-block-list">
<li>the presence of a condition is indicated with a black circle (●), </li>



<li>the absence/negation with a crossed-out circle (⊗), </li>



<li>and the “do not care” condition with a blank space. </li>
</ol>



<p>The negation of a condition is also referred to in the literature as absence, and the two terms have been used interchangeably. The distinction between core and peripheral conditions is made by using large and small circles, respectively. The researcher needs to report the overall solution consistency and the overall solution coverage. The overall coverage describes the extent to which the outcome of interest may be explained by the configurations, and is comparable to the R-square reported in regression-based methods. In our example, the results indicate an overall solution coverage of 0.84, which suggests that a substantial proportion of the outcome is covered by the nine solutions.</p>
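


<p>Consistency and coverage can be computed directly from calibrated membership scores. The sketch below uses the standard fuzzy-set formulas (the sum of the case-wise minima divided by the sum of the solution scores, or of the outcome scores); the five cases are hypothetical and unrelated to the 0.84 reported above.</p>

```python
def consistency(x, y):
    """Degree to which membership in the solution X is a subset of the outcome Y."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(x)

def coverage(x, y):
    """Share of the outcome Y that is accounted for by the solution X."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(y)

# Hypothetical memberships of five cases
solution = [0.9, 0.7, 0.2, 0.6, 0.1]   # membership in the solution
outcome = [0.8, 0.9, 0.3, 0.6, 0.4]    # membership in the outcome

print(round(consistency(solution, outcome), 2))
print(round(coverage(solution, outcome), 2))
```

<p>The two measures share a numerator and differ only in the denominator, which is why a solution can be highly consistent and still cover little of the outcome.</p>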



<p>All graphics and further explanation can be found in: Pappas, I. O., &amp; Woodside, A. G. (2021). Fuzzy-set Qualitative Comparative Analysis (fsQCA): Guidelines for research practice in Information Systems and marketing. <em>International Journal of Information Management</em>, <em>58</em>, 102310.</p>



<p></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/qualitative-comparative-analysis">Fuzzy-Set Qualitative Comparative Analysis</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What are independent variables?</title>
		<link>https://mietwood.com/independent-variables</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 18:44:48 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3316</guid>

					<description><![CDATA[<p>Independent variables, also called predictors, features, or explanatory variables, are the variables in a statistical or machine learning model that are used to explain or predict changes in another variable — the dependent variable, also called the outcome or target. Independent variables in simple terms: Example of independent variables in Customer Management (RFM Model): Suppose...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/independent-variables">What are independent variables?</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Independent variables, </strong>also called <strong>predictors</strong>, <strong>features</strong>, or <strong>explanatory variables</strong>, are the variables in a statistical or machine learning model that are used to <strong>explain or predict</strong> changes in another variable — the <strong>dependent variable</strong>, also called the outcome or target.</p>



<h3 class="wp-block-heading" id="insimpleterms"><strong>Independent variables</strong> in simple terms:</h3>



<ul class="wp-block-list">
<li><strong>Independent variables</strong>: Inputs you control or observe.</li>



<li><strong>Dependent variable</strong>: Output you want to understand or predict.</li>
</ul>



<h3 class="wp-block-heading" id="exampleincustomermanagementrfmmodel">Example of <strong>independent variables</strong> in Customer Management (RFM Model):</h3>



<p>Suppose you&#8217;re analyzing customer behavior to predict <strong>churn</strong> (whether a customer will stop buying).</p>



<ul class="wp-block-list">
<li><strong>Independent variables</strong>:
<ul class="wp-block-list">
<li><strong>Recency</strong>: How recently a customer made a purchase.</li>



<li><strong>Frequency</strong>: How often they purchase.</li>



<li><strong>Monetary</strong>: How much they spend.</li>
</ul>
</li>



<li><strong>Dependent variable</strong>:
<ul class="wp-block-list">
<li><strong>Churn</strong>: 1 if the customer churned, 0 if they stayed.</li>
</ul>
</li>
</ul>



<p>In this case, <strong>Recency</strong>, <strong>Frequency</strong>, and <strong>Monetary</strong> are independent variables used to predict the likelihood of <strong>churn</strong>. See also <a href="https://mietwood.com/python-for-business-analytics-2">Python for business analytics &#8211; rfm analysis</a></p>



<h3 class="wp-block-heading" id="whycheckforindependenceamongindependentvariables">Why check for independence among independent variables?</h3>



<p>If an independent variable is <strong>highly correlated with other independent variables</strong>, i.e., it is not truly independent, it can cause <strong>multicollinearity</strong>, which makes model coefficients unstable, reduces interpretability, and can lead to misleading conclusions.</p>



<h2 class="wp-block-heading"><strong>Multicollinearity</strong></h2>



<p><strong>Multicollinearity</strong> refers to a statistical phenomenon in which two or more independent variables in a regression model are highly correlated. This makes it difficult to determine the individual effect of each variable on the dependent variable because they essentially carry overlapping information.</p>



<h3 class="wp-block-heading"><strong>Assessment of Multicollinearity</strong></h3>



<p>To assess multicollinearity, you can use the following methods:</p>



<ol class="wp-block-list">
<li><strong>Correlation Matrix</strong>
<ul class="wp-block-list">
<li>Check pairwise correlations between independent variables.</li>



<li>High correlation (e.g., > 0.8 or &lt; -0.8) may indicate multicollinearity.</li>
</ul>
</li>



<li><strong>Variance Inflation Factor (VIF)</strong>
<ul class="wp-block-list">
<li>Measures how much the variance of a regression coefficient is inflated due to multicollinearity.</li>



<li>A <strong>VIF above 5</strong> (or, under a more lenient rule of thumb, above 10) is often considered problematic.</li>
</ul>
</li>



<li><strong>Tolerance</strong>
<ul class="wp-block-list">
<li>Tolerance = 1 / VIF.</li>



<li>Low tolerance values (close to 0) indicate high multicollinearity.</li>
</ul>
</li>



<li><strong>Condition Index and Eigenvalues</strong>
<ul class="wp-block-list">
<li>Part of a more advanced diagnostic using matrix decomposition.</li>



<li>A <strong>condition index > 30</strong> may suggest serious multicollinearity.</li>
</ul>
</li>
</ol>
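


<p>The four diagnostics listed above can be sketched with NumPy alone. The data are synthetic and the variable names are hypothetical. The VIFs are read from the diagonal of the inverse correlation matrix, and the condition indices are derived from the eigenvalues of the correlation matrix, which is one common variant of that diagnostic rather than the only definition in use.</p>

```python
import numpy as np

# Hypothetical predictors: x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2, x3])

# 1. Pairwise correlation matrix of the predictors
corr = np.corrcoef(X, rowvar=False)

# 2. VIF of each predictor: diagonal of the inverse correlation matrix
vif = np.diag(np.linalg.inv(corr))

# 3. Tolerance is the reciprocal of VIF
tolerance = 1 / vif

# 4. Condition indices from the eigenvalues of the correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)
condition_index = np.sqrt(eigenvalues.max() / eigenvalues)

print(np.round(corr, 2))
print(np.round(vif, 1))
print(np.round(tolerance, 3))
print(np.round(condition_index, 1))
```

<p>Because the third column is built as a near-exact sum of the other two, its VIF ends up far above the usual thresholds, which is the kind of multi-variable dependence a pairwise correlation check alone can understate.</p>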



<h3 class="wp-block-heading"><strong>How to deal with multicollinearity?</strong></h3>



<ul class="wp-block-list">
<li><strong>Remove one of the correlated variables.</strong></li>



<li><strong>Combine variables</strong> (e.g., using PCA or creating an index).</li>



<li><strong>Regularization techniques</strong> like Ridge or Lasso regression.</li>



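<li><strong>Centering variables</strong> (subtracting the mean) can help in some cases, mainly when the collinearity comes from interaction or polynomial terms built from the same variable.</li>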
</ul>
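


<p>The effect of centering is easiest to see with a polynomial term. The sketch below uses a hypothetical, evenly spaced recency variable and compares the correlation between the variable and its square before and after centering; the helper function is written out so the example needs no external packages.</p>

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical recency values in days, used together with their square
recency = [10, 20, 30, 40, 50, 60, 70, 80, 90]
raw_r = pearson(recency, [v ** 2 for v in recency])

# Centering before squaring removes the artificial correlation in this example
mean = sum(recency) / len(recency)
centered = [v - mean for v in recency]
centered_r = pearson(centered, [v ** 2 for v in centered])

print(round(raw_r, 3), round(centered_r, 3))
```

<p>Centering does not help when two different variables are correlated with each other; in that situation removing, combining, or regularizing remains the relevant remedy.</p>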



<h2 class="wp-block-heading">Calculation example</h2>



<p>Assume you have data similar to this sample.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="488" height="337" src="https://mietwood.com/wp-content/uploads/2025/09/image-4.jpg" alt="" class="wp-image-3317" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-4.jpg 488w, https://mietwood.com/wp-content/uploads/2025/09/image-4-300x207.jpg 300w" sizes="auto, (max-width: 488px) 100vw, 488px" /><figcaption class="wp-element-caption">RFM data sample &#8211; for testing independent variables </figcaption></figure>
</div>


<h2 class="wp-block-heading">Variable independence testing</h2>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd

# Load the data with the specified delimiter
df = pd.read_csv("RFM_analysis_614.csv", delimiter=",")

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Select the independent variables (RFM)
X = df[&#91;'recency', 'freq', 'monetary'&#93;]

# Add a constant for the VIF calculation (required by the statsmodels function)
X = add_constant(X)

# Create a DataFrame to hold the VIF results
vif_data = pd.DataFrame()
vif_data&#91;"Variable"&#93; = X.columns
vif_data&#91;"VIF"&#93; = [variance_inflation_factor(X.values, i) for i in range(X.shape&#91;1&#93;)]

# Exclude the constant row from the final output since it's not a true variable
vif_data = vif_data&#91;vif_data.Variable != 'const'&#93;.reset_index(drop=True)

print(vif_data)

# Save the VIF results to a CSV file
vif_data.to_csv("vif_results.csv", index=False)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Load</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">specified</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">delimiter</span></span>
<span class="line"><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">read_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">RFM_analysis_614.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">delimiter</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">,</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">stats</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">outliers_influence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">variance_inflation_factor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">tools</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">tools</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">add_constant</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Select</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">independent</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">variables</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">RFM</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">[&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">recency</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">freq</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">monetary</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Add</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">constant</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">calculation</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">required</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">add_constant</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">hold</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">results</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Variable</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93; = </span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">columns</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">VIF</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93; = [</span><span style="color: #8FBCBB">variance_inflation_factor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">values</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">range</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">shape</span><span style="color: #D8DEE9FF">&#91;1&#93;)]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Exclude</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">constant</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">final</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">output</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">since</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">it</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">s not a true variabl</span><span style="color: #D8DEE9">e</span></span>
<span class="line"><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">vif_data</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">Variable</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">!=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">const</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">reset_index</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">drop</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Save</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">results</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">CSV</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">file</span></span>
<span class="line"><span style="color: #D8DEE9">vif_data</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">to_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">vif_results.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">False</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p>Finally, the program prints the following results:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="179" height="85" src="https://mietwood.com/wp-content/uploads/2025/09/image-5.jpg" alt="" class="wp-image-3318"/></figure>
</div>


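<p>Based on the Variance Inflation Factor (VIF) calculation, the columns <strong>recency</strong>, <strong>frequency</strong>, and <strong>monetary</strong> show <strong>no problematic multicollinearity</strong>: they are only weakly linearly related to each other and carry <strong>little overlapping information</strong>. Note that a low VIF rules out strong linear dependence, not every form of statistical dependence.</p>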



<p>This means that you can use all three variables together as independent predictors in a statistical model, such as a Cox Proportional Hazards (Cox PH) model, without concern for severe multicollinearity.</p>



<h2 class="wp-block-heading">Variance Inflation Factor (VIF) Results</h2>



<p>The VIF (<a href="https://en.wikipedia.org/wiki/Variance_inflation_factor" target="_blank" rel="noopener">https://en.wikipedia.org/wiki/Variance_inflation_factor</a>)  is a measure of how much the variance of an estimated regression coefficient is increased due to collinearity. A common rule of thumb is that a VIF value <strong>less than 5</strong> or sometimes 10 indicates that the correlation between the variables is not high enough to warrant concern.</p>



<p>The calculated VIF values for your RFM variables are very low:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Variable</td><td>VIF</td></tr></thead><tbody><tr><td><strong>recency</strong></td><td>1.437</td></tr><tr><td><strong>freq</strong></td><td>1.488</td></tr><tr><td><strong>monetary</strong></td><td>1.065</td></tr></tbody></table></figure>
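<p>To make these numbers easier to read, the short sketch below converts each reported VIF back into the R-squared it implies, using the standard definition VIF = 1 / (1 - R&sup2;). The function name <code>implied_r_squared</code> is only illustrative and is not part of the original program; the three VIF values are copied from the table above.</p>

```python
# Convert a VIF back into the R-squared of the auxiliary regression
# (one predictor regressed on all the others): VIF = 1 / (1 - R^2).

def implied_r_squared(vif):
    """Share of a predictor's variance explained by the other predictors."""
    if 1.0 > vif:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


# Values copied from the VIF table in this post.
reported_vif = {"recency": 1.437, "freq": 1.488, "monetary": 1.065}

implied = {name: round(implied_r_squared(vif), 3)
           for name, vif in reported_vif.items()}

for name, r2 in implied.items():
    print(f"{name}: R^2 = {r2}")
```

<p>Read this way, roughly a third of the variance of recency and of frequency, and only about 6% of the variance of monetary, is explained linearly by the other two variables. A VIF of 5 would correspond to an R-squared of 0.8.</p>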






<h2 class="wp-block-heading">Conclusion on Independence</h2>



<p>Since all VIF values are close to 1.0 and well below the 5.0 threshold:</p>



<ul class="wp-block-list">
<li><strong>Independent Variables:</strong> You can confidently treat <strong>recency</strong>, <strong>frequency</strong>, and <strong>monetary</strong> as independent variables for your statistical analysis (e.g., in a Cox PH model).</li>



<li><strong>No Informational Overlap:</strong> The variables are providing distinct, non-redundant information to the model. For instance, knowing a customer&#8217;s frequency does not allow the model to strongly predict their recency or monetary value.</li>
</ul>



<p>The post <a rel="nofollow" href="https://mietwood.com/independent-variables">What are independent variables?</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Top IDEs for Python developers</title>
		<link>https://mietwood.com/ides-for-python-developers</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 09:11:31 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3312</guid>

					<description><![CDATA[<p>Here are the top graphical IDEs (Integrated Development Environments) for Python developers as of 2024: 1.&#160;PyCharm 2.&#160;Visual Studio Code (VS Code) 3.&#160;Spyder 4.&#160;Thonny 5.&#160;Wing IDE 6.&#160;Eric 7.&#160;IDLE Summary: For professional development,&#160;PyCharm&#160;and&#160;VS Code&#160;are the most popular. For data science,&#160;Spyder&#160;is widely used. For beginners,&#160;Thonny&#160;or&#160;IDLE&#160;are great choices.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/ides-for-python-developers">Top IDEs for Python developers</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Here are the top graphical IDEs (Integrated Development Environments) for Python developers</strong> as of 2024:</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">1.&nbsp;<strong>PyCharm</strong></h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="421" src="https://mietwood.com/wp-content/uploads/2025/09/image-3.jpg" alt="" class="wp-image-3313" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-3.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/09/image-3-300x123.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-3-768x316.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<ul class="wp-block-list">
<li><strong>Developer:</strong> JetBrains</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Advanced code analysis, smart code completion, integrated debugger, Git support, virtual environment management, Django support.</li>



<li><strong>Editions:</strong> Community (free) and Professional (paid).</li>



<li><strong>Website:</strong> <a href="https://www.jetbrains.com/pycharm/" target="_blank" rel="noreferrer noopener">PyCharm</a></li>
</ul>



<h3 class="wp-block-heading">2.&nbsp;<strong>Visual Studio Code (VS Code)</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Microsoft</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Lightweight but powerful; excellent Python extension; integrated terminal; rich plugin ecosystem; Git support; Jupyter notebook integration.</li>



<li><strong>Website:</strong> <a href="https://code.visualstudio.com/" target="_blank" rel="noreferrer noopener">VS Code</a></li>
</ul>



<h3 class="wp-block-heading">3.&nbsp;<strong>Spyder</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Scientific Python Development Environment Community</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Focused on scientific computing and data science; variable explorer; integrated IPython console; plotting support.</li>



<li><strong>Website:</strong> <a href="https://www.spyder-ide.org/" target="_blank" rel="noreferrer noopener">Spyder</a></li>
</ul>



<h3 class="wp-block-heading">4.&nbsp;<strong>Thonny</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> University of Tartu</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Beginner-friendly; simple UI; built-in debugger; good for learning and education.</li>



<li><strong>Website:</strong> <a href="https://thonny.org/" target="_blank" rel="noreferrer noopener">Thonny</a></li>
</ul>



<h3 class="wp-block-heading">5.&nbsp;<strong>Wing IDE</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Wingware</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Powerful debugger; code intelligence; remote development support.</li>



<li><strong>Website:</strong> <a href="https://wingware.com/" target="_blank" rel="noreferrer noopener">Wing IDE</a></li>
</ul>



<h3 class="wp-block-heading">6.&nbsp;<strong>Eric</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Detlev Offenbach</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Full-featured Python and Ruby IDE; integrated debugger; plugin support.</li>



<li><strong>Website:</strong> <a href="https://eric-ide.python-projects.org/" target="_blank" rel="noreferrer noopener">Eric Python IDE</a></li>
</ul>



<h3 class="wp-block-heading">7.&nbsp;<strong>IDLE</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Python Software Foundation (bundled with Python)</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Basic, lightweight, good for quick scripts and learning.</li>



<li><strong>Website:</strong> <a href="https://docs.python.org/3/library/idle.html" target="_blank" rel="noreferrer noopener">IDLE Documentation</a></li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Summary:</strong><br>For professional development,&nbsp;<strong>PyCharm</strong>&nbsp;and&nbsp;<strong>VS Code</strong>&nbsp;are the most popular. For data science,&nbsp;<strong>Spyder</strong>&nbsp;is widely used. For beginners,&nbsp;<strong>Thonny</strong>&nbsp;or&nbsp;<strong>IDLE</strong>&nbsp;are great choices.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/ides-for-python-developers">Top IDEs for Python developers</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Scrape a Website and Search Inside PDFs with Python</title>
		<link>https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sat, 30 Aug 2025 09:13:30 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3292</guid>

					<description><![CDATA[<p>Ever found yourself on a webpage with dozens of PDF links, needing to find a specific piece of information buried in one of them? 😩 We will teach you how to scrape a website and search inside PDFs with Python. Manually downloading and searching each file is tedious, time-consuming, and prone to errors. What if...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python">How to Scrape a Website and Search Inside PDFs with Python</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Ever found yourself on a webpage with dozens of PDF links, needing to find a specific piece of information buried in one of them? 😩 Manually downloading and searching each file is tedious, time-consuming, and prone to errors. What if you could automate the entire process with just a few lines of code? This post shows how to scrape a website and search inside PDFs with Python.</p>



<p>In this tutorial, we&#8217;ll show you exactly how to do that. We&#8217;ll build a simple yet powerful Python script that automatically scans a webpage, finds all the PDF links, and searches for specific text inside each one. Using popular libraries like <strong>Requests</strong>, <strong>BeautifulSoup</strong>, and <strong>pypdf</strong>, you&#8217;ll learn a practical skill that can save you hours of manual work. Let&#8217;s get started!</p>






<h2 class="wp-block-heading">Scrape a Website</h2>



<p>The script works in six steps. It <strong>finds</strong> all links on the initial page and <strong>filters</strong> for links that end with <code>.pdf</code>. Then, for each PDF link, it <strong>downloads</strong> the PDF file into memory, <strong>extracts</strong> the text from every page of the PDF, and <strong>searches</strong> the extracted text for your <code>search_string</code>. Finally, it <strong>reports</strong> which PDF files contain the phrase.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader
import io
from urllib.parse import urljoin

# Scrape a website: collect the PDF links found on a page
def find_linked_pdfs(url):
    """
    Scans a webpage and returns the PDF links found on it.

    Args:
        url: The URL of the webpage to scan.

    Returns:
        A list of absolute PDF URLs (empty if none were found).
    """
    print(f"Scanning {url} for PDF links...")
    try:
        # 1. Get the main page to find all links
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"An error occurred fetching the main URL: {e}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')

    # Resolve relative links against the page URL and keep only PDFs
    pdf_links = [urljoin(url, a&#91;'href'&#93;)
                 for a in soup.find_all('a', href=True)
                 if a&#91;'href'&#93;.lower().endswith('.pdf')]

    if not pdf_links:
        print("No PDF links found on the page.")
    else:
        print(f"Found {len(pdf_links)} PDF files. Now searching inside them...")

    return pdf_links</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">requests</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bs4</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BeautifulSoup</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pypdf</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">PdfReader</span></span>
<span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">io</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">urllib.parse</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">urljoin</span></span>
<span class="line"></span>
<span class="line"><span style="color: #616E88"># Scrape a website: collect the PDF links found on a page</span></span>
<span class="line"><span style="color: #81A1C1">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_linked_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">url</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #A3BE8C">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    Scans a webpage and returns the PDF links found on it.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #A3BE8C">    Args:</span></span>
<span class="line"><span style="color: #A3BE8C">        url: The URL of the webpage to scan.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #A3BE8C">    Returns:</span></span>
<span class="line"><span style="color: #A3BE8C">        A list of absolute PDF URLs (empty if none were found).</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #A3BE8C">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #81A1C1">f</span><span style="color: #A3BE8C">&quot;Scanning {url} for PDF links...&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">try</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #616E88"># 1. Get the main page to find all links</span></span>
<span class="line"><span style="color: #D8DEE9FF">        response </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> requests</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(url</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> timeout</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">30</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        response</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">raise_for_status</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">except</span><span style="color: #D8DEE9FF"> requests.exceptions.RequestException </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #81A1C1">f</span><span style="color: #A3BE8C">&quot;An error occurred fetching the main URL: {e}&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    soup </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">BeautifulSoup</span><span style="color: #D8DEE9FF">(response.text</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #A3BE8C">&#39;html.parser&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #616E88"># Resolve relative links against the page URL and keep only PDFs</span></span>
<span class="line"><span style="color: #D8DEE9FF">    pdf_links </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span><span style="color: #88C0D0">urljoin</span><span style="color: #D8DEE9FF">(url, a&#91;</span><span style="color: #A3BE8C">&#39;href&#39;</span><span style="color: #D8DEE9FF">&#93;)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                 </span><span style="color: #81A1C1">for</span><span style="color: #D8DEE9FF"> a </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> soup</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">find_all</span><span style="color: #D8DEE9FF">(</span><span style="color: #A3BE8C">&#39;a&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> href</span><span style="color: #81A1C1">=True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                 </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> a&#91;</span><span style="color: #A3BE8C">&#39;href&#39;</span><span style="color: #D8DEE9FF">&#93;.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">().</span><span style="color: #88C0D0">endswith</span><span style="color: #D8DEE9FF">(</span><span style="color: #A3BE8C">&#39;.pdf&#39;</span><span style="color: #D8DEE9FF">)]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">not</span><span style="color: #D8DEE9FF"> pdf_links:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #A3BE8C">&quot;No PDF links found on the page.&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">else</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #81A1C1">f</span><span style="color: #A3BE8C">&quot;Found {len(pdf_links)} PDF files. Now searching inside them...&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pdf_links</span></span></code></pre></div>



<p><strong>Handling URLs</strong>: It constructs a full, absolute URL for each PDF, as many links on a page can be relative (e.g., <code>/path/to/file.pdf</code>). The links themselves are collected with <a href="https://pypi.org/project/beautifulsoup4/" target="_blank" rel="noopener">BeautifulSoup</a>.</p>
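<p>As a small illustration of the URL handling described here, the sketch below resolves a few link styles against a page address with the standard library&#8217;s <code>urljoin</code>. The page URL, the sample links, and the helper name <code>to_absolute</code> are hypothetical and chosen only for this example.</p>

```python
from urllib.parse import urljoin

# Hypothetical page that contains the links.
page_url = "https://example.com/reports/index.html"


def to_absolute(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


links = [
    "/path/to/file.pdf",              # root-relative link
    "annual.pdf",                     # relative to the page's folder
    "https://cdn.example.org/a.pdf",  # already absolute, left unchanged
]

absolute = [to_absolute(page_url, href) for href in links]
for link in absolute:
    print(link)
```

<p>Unlike simply gluing the site root and the path together, <code>urljoin</code> also handles links that do not start with a slash.</p>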



<p><strong>In-Memory Processing</strong>: Instead of saving each PDF to your disk, it uses <code>io.BytesIO</code> to treat the downloaded content as a file in your computer&#8217;s memory. This avoids writing temporary files and cleaning them up afterwards.</p>
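<p>A minimal sketch of the in-memory idea, using only the standard library: the bytes below are a made-up stand-in for a downloaded response body, and <code>open_in_memory</code> is a hypothetical helper that is not part of the script. It wraps the bytes in <code>io.BytesIO</code>, peeks at the header that PDF files begin with, and rewinds the buffer so a reader can start from the beginning.</p>

```python
import io


def open_in_memory(content):
    """Wrap downloaded bytes in a file-like object without touching the disk."""
    buffer = io.BytesIO(content)
    is_pdf = buffer.read(5) == b"%PDF-"  # PDF files begin with this marker
    buffer.seek(0)                       # rewind so a reader starts at byte 0
    return buffer, is_pdf


# Made-up bytes standing in for the body of a downloaded file.
fake_download = b"%PDF-1.7\n(not a real document)"

buffer, is_pdf = open_in_memory(fake_download)
print(is_pdf, buffer.tell())
```

<p>The returned buffer supports <code>read()</code> and <code>seek()</code> like an opened file, which is what lets a PDF reader consume it directly.</p>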



<p><strong>Text Extraction</strong>: The <code>pypdf</code> library&#8217;s <code>PdfReader</code> opens this in-memory file. The script then loops through each page, calls <code>extract_text()</code>, and combines the text from all pages.</p>
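<p>The page loop can be sketched without a real PDF. In the example below, <code>FakePage</code> is a hypothetical stand-in for a page object (it only mimics the <code>extract_text()</code> method), and <code>combine_page_text</code> is an illustrative helper rather than part of the original script. With a real document you would pass <code>reader.pages</code> instead of the fake list.</p>

```python
class FakePage:
    """Hypothetical stand-in for a PDF page, used only in this sketch."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def combine_page_text(pages):
    """Join the text of all pages, treating pages with no text as empty."""
    # A page with no extractable text (for example a scanned image)
    # contributes an empty string instead of breaking the join.
    return "\n".join(page.extract_text() or "" for page in pages)


pages = [FakePage("Annual report"), FakePage(None), FakePage("Total revenue")]
full_text = combine_page_text(pages)
print(repr(full_text))
```

<p>Joining with a newline keeps the last word of one page from running into the first word of the next.</p>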



<p><strong>Searching and Reporting</strong>: Finally, it performs a case-insensitive search on the extracted text and prints the URL of any PDF that contains your search term.</p>
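<p>The case-insensitive check can be sketched as follows. Collapsing whitespace is an extra step that the script itself does not perform; it is added here on the assumption that extracted PDF text often breaks a phrase across lines. The helper name <code>contains_phrase</code> is hypothetical.</p>

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace into single spaces."""
    return " ".join(text.casefold().split())


def contains_phrase(text, phrase):
    """Case-insensitive search that ignores line breaks and repeated spaces."""
    return normalize(phrase) in normalize(text)


extracted = "Total\nREVENUE   for 2024"  # made-up text resembling PDF output

print(contains_phrase(extracted, "total revenue"))
print(contains_phrase(extracted, "net income"))
```

<p>A plain <code>search_string.lower() in full_text.lower()</code> test, as used in the script, would miss the first phrase here because of the line break between the two words.</p>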



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>from urllib.parse import urljoin

# Search inside PDF
def find_text_in_pdfs(pdf_links, search_string, page_url=""):
    """
    Downloads each PDF and reports the ones that contain search_string.

    Args:
        pdf_links: Links to the PDF files (absolute, or relative to page_url).
        search_string: The string to search for inside the PDFs.
        page_url: The page the links were found on, used to resolve relative links.
    """
    # 2. Loop through each PDF link
    found_in_files = []
    for pdf_path in pdf_links:
        # Construct absolute URL if the link is relative
        pdf_url = urljoin(page_url, pdf_path)

        try:
            # 3. Download the PDF content
            pdf_response = requests.get(pdf_url, timeout=30)
            pdf_response.raise_for_status()

            # Use an in-memory buffer to read the PDF
            pdf_file = io.BytesIO(pdf_response.content)
            reader = PdfReader(pdf_file)

            # 4. Extract text and search
            full_text = ""
            for page in reader.pages:
                full_text += page.extract_text() or ""

            if search_string.lower() in full_text.lower():
                print(f"✔️ Found '{search_string}' in: {pdf_url}")
                found_in_files.append(pdf_url)

        except Exception as e:
            print(f"⚠️ Could not process {pdf_url}. Reason: {e}")

    if not found_in_files:
        print(f"\nSearch complete. The string '{search_string}' was not found in any of the PDFs.")

    return found_in_files
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #616E88"># Search inside PDF</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_text_in_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_links</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_string</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        # </span><span style="color: #B48EAD">2.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Loop</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">through</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">each</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">link</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">found_in_files</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> pdf_links</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            # </span><span style="color: #D8DEE9">Construct</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">absolute</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">URL</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">link</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">relative</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">startswith</span><span style="color: #D8DEE9FF">((</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">http://</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">https://</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)):</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">{base_url}{pdf_path}</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">else</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">try</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #B48EAD">3.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Download</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">content</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">requests</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_response</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">raise_for_status</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #D8DEE9">Use</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">an</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in-</span><span style="color: #D8DEE9">memory</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">buffer</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">read</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_file</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">io</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">BytesIO</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_response</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">content</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">reader</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">PdfReader</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_file</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #B48EAD">4.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Extract</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">full_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">page</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reader</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">pages</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #D8DEE9">full_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">+=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">page</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">extract_text</span><span style="color: #D8DEE9FF">() </span><span style="color: #D8DEE9">or</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_string</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">() </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">full_text</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">✔️ Found &#39;{search_string}&#39; in: {pdf_url}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #D8DEE9">found_in_files</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">except</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Exception</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">⚠️ Could not process {pdf_url}. Reason: {e}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">not</span><span style="color: #D8DEE9FF"> found_in_files</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Search complete. The string &#39;{search_string}&#39; was not found in any of the PDFs.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">except</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">requests</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">exceptions</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">RequestException</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">An error occurred fetching the main URL: {e}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>
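<p>The script above builds absolute URLs by concatenating <code>base_url</code> and the link path, which only works when the path starts with a slash. As a minimal sketch of an alternative, the standard library's <code>urljoin</code> resolves relative, root-relative and already absolute links in one call. The helper name <code>to_absolute_url</code> and the sample <code>href</code> values are assumptions made for illustration; they are not part of the original script.</p>

```python
from urllib.parse import urljoin


def to_absolute_url(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# The page URL comes from the article; the hrefs are hypothetical examples.
page = "https://www.umcs.pl/pl/plany-zajec,10795.htm"
for href in ("/files/plan.pdf", "files/plan.pdf", "https://example.com/plan.pdf"):
    # Root-relative, page-relative and absolute links are all handled.
    print(to_absolute_url(page, href))
```

<p>A helper like this could replace the <code>if</code>/<code>else</code> branch that checks for <code>http://</code> and <code>https://</code> prefixes.</p>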






<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>if __name__ == "__main__":
    target_url = "https://www.umcs.pl/pl/plany-zajec,10795.htm"
    search_term = "programming"
    pdf_links = find_linked_pdfs(target_url)
    find_text_in_pdfs(pdf_links, search_term)
    </textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__name__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">==</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">__main__</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">target_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">https://www.umcs.pl/pl/plany-zajec,10795.htm</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">search_term</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">programming</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">pdf_links</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_linked_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">target_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">find_text_in_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_links</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_term</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span></span></code></pre></div>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="769" height="255" src="https://mietwood.com/wp-content/uploads/2025/08/image-8.jpg" alt="Scrape a Website and Search Inside PDFs with Python" class="wp-image-3293" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-8.jpg 769w, https://mietwood.com/wp-content/uploads/2025/08/image-8-300x99.jpg 300w" sizes="auto, (max-width: 769px) 100vw, 769px" /><figcaption class="wp-element-caption">Scrape a Website and Search Inside PDFs with Python</figcaption></figure>
</div>


<h3 class="wp-block-heading"><strong>Wrapping Up and Next Steps</strong></h3>



<p>Congratulations! You&#8217;ve built an automation script that bridges web scraping and document analysis. By combining <strong>Requests</strong>, <strong>BeautifulSoup</strong>, and <strong>pypdf</strong>, you can programmatically search for text inside the PDF files that a web page links to, provided those PDFs contain extractable text (scanned, image-only PDFs would need OCR first). This saves a lot of manual work and opens up new possibilities for data collection and analysis. Feel free to adapt the code for your own projects.</p>



<p>The applications for this technique extend far beyond a single use case. Imagine using this script for <strong>academic research</strong>, automatically scanning university archives for papers mentioning a specific topic. You could adapt it for <strong>financial analysis</strong> by pulling keywords from dozens of quarterly earnings reports, or for <strong>legal work</strong> by searching through court filings for a particular case name. Job seekers could even use it to scan company websites for PDF job descriptions that contain key skills. </p>



<p>To perform a statistical analysis of the overall economy, you can leverage a variety of online resources, including government and intergovernmental data portals, as well as academic publications. These sources often provide data in structured formats like CSVs and APIs, but also in less-structured formats like HTML tables and PDFs, which can be parsed using Python libraries like Beautiful Soup and pypdf.</p>



<h3 class="wp-block-heading"><strong>Government and Intergovernmental Data Sources</strong></h3>



<p>For raw, official economic data, these are your most reliable sources. They offer a wealth of information on everything from GDP and inflation to employment rates and international trade.</p>



<ul class="wp-block-list">
<li><strong>Federal Reserve Economic Data (FRED)</strong>: A fantastic resource from the St. Louis Fed, FRED offers over 800,000 economic time series from more than 100 sources. It&#8217;s a goldmine for anyone doing macroeconomic analysis.</li>



<li><strong>The World Bank Open Data</strong>: This portal provides comprehensive global development data, including indicators on economic policy, poverty, gender, and more, making it perfect for cross-country comparisons.</li>



<li><strong>Data.gov</strong>: The home of U.S. government open data, this site aggregates datasets from various federal agencies, including the Bureau of Economic Analysis (BEA) and the Bureau of Labor Statistics (BLS).</li>



<li><strong>United Nations Statistics Division (UNSD)</strong>: The UNSD offers a wide array of international statistics, including the UNdata portal which provides free access to over 60 million statistical records from various UN agencies.</li>



<li><strong>The Bureau of Economic Analysis (BEA)</strong>: The BEA produces some of the most critical U.S. economic statistics, such as GDP, personal income, and corporate profits.</li>
</ul>



<p>You can read about the business analyst career path <a href="https://mietwood.com/the-allure-of-business-analysis-as-a-career-path">here</a>.</p>



<p>The core principle remains the same: automate the discovery of information, no matter the format. Happy coding! 🚀</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python">How to Scrape a Website and Search Inside PDFs with Python</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Semantic Search</title>
		<link>https://mietwood.com/semantic-search</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Wed, 20 Aug 2025 11:33:03 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3279</guid>

					<description><![CDATA[<p>Since the advent of ChatGPT in November 2022, not a single day goes by without hearing or reading about vector or semantic search. It’s everywhere, and so prevalent that we often get the impression this is a new cutting-edge technology. Vector vs lexical search An easy way to introduce vector search is by...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search">Semantic Search</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Since the advent of ChatGPT in November 2022, not a single day goes by without hearing or reading about vector or semantic search. It’s everywhere, and so prevalent that we often get the impression this is a new cutting-edge technology.</p>



<h2 class="wp-block-heading">Vector vs lexical search</h2>



<p>An easy way to introduce <strong>vector search</strong> is by comparing it to the more conventional <strong>lexical search</strong> that you’re probably used to. <strong>Vector search, also commonly known as semantic search</strong>, and lexical search work very differently. </p>



<p><strong>Lexical search</strong> is the kind of search that we’ve all been using for years. In short, it doesn’t try to understand the real meaning of what is indexed and queried. Instead, it <strong>lexically</strong> matches the literals of the words the user types in a query (or variants of them, such as stems or synonyms) against the literals that were previously indexed into the database. Matches are then scored by a ranking algorithm such as TF-IDF.</p>



<p>Documents are tokenized and analyzed. The resulting terms are then stored in an inverted index, which simply maps each analyzed term to the documents containing it. By default, searching for “yellow texas roses” will match every document that contains at least one of those terms, with varying scores.</p>
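<p>To make the inverted index concrete, here is a minimal sketch in plain Python. The four sample documents are made up for illustration, the tokenizer only lowercases and splits on whitespace, and the score is a simple count of matching query terms rather than TF-IDF, so treat it as a toy model of what a search engine does, not as its actual implementation.</p>

```python
from collections import defaultdict


def tokenize(text):
    """Very naive analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Score documents by the number of query terms they contain."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample documents.
docs = {
    1: "yellow roses from texas",
    2: "red roses",
    3: "texas state fair",
    4: "blue tulips",
}
index = build_inverted_index(docs)
print(search(index, "yellow texas roses"))
```

<p>Document 1 contains all three query terms and ranks first, documents 2 and 3 match one term each, and document 4 is not returned at all because it shares no term with the query.</p>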



<p><strong>Semantic search</strong> &#8211; the whole purpose of semantic search is to index data in such a way that it can be searched based on the meaning it represents.</p>



<h5 class="wp-block-heading">What&#8217;s the difference between semantic search and lexical search?</h5>



<p>Lexical search doesn’t try to understand the real meaning of what is indexed and queried; it matches the literals of the words or their variants. In contrast, vector search indexes data in a way that allows it to be searched based on the meaning it represents.</p>
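<p>The idea of searching by meaning can be sketched with cosine similarity between vectors. The three-dimensional vectors below are hand-made stand-ins for real embeddings (an embedding model produces hundreds of dimensions), so the numbers are assumptions chosen only to show the mechanics.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings"; not produced by any real model.
vectors = {
    "yellow roses": [0.9, 0.1, 0.0],
    "golden flowers": [0.8, 0.2, 0.1],
    "diesel engine": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # toy vector standing in for an embedded query

# Rank the stored texts by similarity to the query vector.
ranked = sorted(
    vectors,
    key=lambda name: cosine_similarity(query, vectors[name]),
    reverse=True,
)
print(ranked)
```

<p>Note that "golden flowers" shares no word with "yellow roses", yet its vector points in almost the same direction, which is exactly what a lexical match cannot capture.</p>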



<p>Read more about vector similarity <a href="https://www.elastic.co/search-labs/blog/introduction-to-vector-search" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/introduction-to-vector-search</a></p>



<h2 class="wp-block-heading">Semantic search with Elasticsearch</h2>



<p>Load the embedding model. See details <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/04-multilingual.ipynb" target="_blank" rel="noopener">here</a> and <a href="https://huggingface.co/intfloat/multilingual-e5-base" target="_blank" rel="noopener">here</a>.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")
VECTOR_DIMENSION = model.get_sentence_embedding_dimension()
print(f"Model loaded successfully. VECTOR_DIMENSION = {VECTOR_DIMENSION}")
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sentence_transformers</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SentenceTransformer</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">SentenceTransformer</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">intfloat/multilingual-e5-base</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">VECTOR_DIMENSION</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">get_sentence_embedding_dimension</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Model loaded successfully. VECTOR_DIMENSION = {VECTOR_DIMENSION}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>
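<p>One detail worth keeping in mind: according to the model card linked above, the E5 models expect every input text to start with either <code>query: </code> or <code>passage: </code>. The sketch below shows one way to add these prefixes before encoding. The helper names and the sample product names are assumptions made for illustration, and the <code>model.encode</code> calls are left as comments because they require the model loaded in the previous block.</p>

```python
def as_query(text: str) -> str:
    """Prefix a search query the way the E5 model card describes."""
    return f"query: {text.strip()}"


def as_passage(text: str) -> str:
    """Prefix a document (here: a product name) for indexing."""
    return f"passage: {text.strip()}"


# Hypothetical product names, only for illustration.
product_names = ["Cordless drill 18V", "  Laser distance meter "]
passages = [as_passage(name) for name in product_names]
query = as_query("battery powered drill")
print(passages)
print(query)

# With the model loaded above, the texts could then be encoded, e.g.:
# passage_vectors = model.encode(passages, normalize_embeddings=True)
# query_vector = model.encode(query, normalize_embeddings=True)
```

<p>Normalizing the embeddings is optional, but it makes cosine similarity and dot product give the same ranking.</p>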



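<p>Connect to the Elasticsearch server. The example assumes a server listening on port 9000; a default installation listens on port 9200.</p>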



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>from elasticsearch import Elasticsearch
ES_HOST = "http://localhost:9000"
es = Elasticsearch(hosts=&#91;ES_HOST&#93;)
print(f"Connection successful: {es.info().body&#91;'cluster_name'&#93;}")
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span></span>
<span class="line"><span style="color: #8FBCBB">ES_HOST</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">http://localhost:9000</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Elasticsearch</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">hosts</span><span style="color: #D8DEE9FF">=&#91;</span><span style="color: #8FBCBB">ES_HOST</span><span style="color: #D8DEE9FF">&#93;)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Connection successful: {es.info().body&#91;&#39;cluster_name&#39;&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>



<p>Index product data into Elasticsearch. First, fetch the products from the SQL database:</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Fetch products from database
def select_products():
    q = """
        SELECT *
        FROM &#91;DB_Products&#93; 
        where &#91;BrandName&#93; in ('PCE','De Walt')
        """
    dfp = read_from_sql_server(q, odbc_conect='DSN=SQLxx')
    return dfp

# >-------------------------------------------------
df_docs = select_products()
print(df_docs.info())
# -------------------------------------------------</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Fetch</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">database</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">select_products</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">q</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">        SELECT </span><span style="color: #D8DEE9">*</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">DB_Products</span><span style="color: #D8DEE9FF">&#93; </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">where</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">BrandName</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> (</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">PCE</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">De Walt</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    dfp = read_from_sql_server(q, odbc_conect=&#39;DSN=SQLxx&#39;</span><span style="color: #D8DEE9">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">dfp</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">&gt;-------------------------------------------------</span></span>
<span class="line"><span style="color: #D8DEE9">df_docs</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">select_products</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">df_docs</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">info</span><span style="color: #D8DEE9FF">())</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">-------------------------------------------------</span></span></code></pre></div>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Index product names into elastic
index_mapping = {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims":  VECTOR_DIMENSION,
                "index": True,
                "similarity": "cosine"
                },

            "product_name": { 
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256 
                    }
                  }
                },
            "product_name_org": {
                "type": "keyword",
                "ignore_above": 256
                },
            "prodidx": {
                "type": "keyword",
                "ignore_above": 256
                },
            "img_url": {
                "type": "keyword",
                "ignore_above": 512 
                },
            }
        }

# --- 4. Create the Index -------------------------------------------------

INDEX_NAME = 'prod_names_for_search_hybrid'

import time
es.options(ignore_status=&#91;400, 404&#93;).indices.delete(index=INDEX_NAME, ignore_unavailable=True)
time.sleep(3) # Give a moment for index deletion to propagate

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, mappings=index_mapping)
    print("Index created.")
else:
    print("Index already exists.")
# -----------------------------------------------------------------------

# Indexing product data ------------------------------------------------- 
product_names = df_docs&#91;'ProductName'&#93;.apply(lambda x: x.lower()&#91;:256&#93;).to_list()

i = 1
for name in product_names:
    print(f"Indexing:{i} {name}")
    i += 1
    
    vector = model.encode(f"passage: {name}").tolist()
    doc = {
        "product_name": name,
        "embedding": vector
    }
    
    es.index(index=INDEX_NAME, document=doc, refresh=True)

# ---------------------------------------------------------------------</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Index</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">names</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elastic</span></span>
<span class="line"><span style="color: #D8DEE9">index_mapping</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">properties</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dense_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dims</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF">  </span><span style="color: #D8DEE9">VECTOR_DIMENSION</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">index</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">similarity</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">cosine</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">fields</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                  </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ignore_above</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">256</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                  </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name_org</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ignore_above</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">256</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prodidx</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ignore_above</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">256</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">img_url</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ignore_above</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">512</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">---</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">4.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Index</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">-------------------------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">prod_names_for_search_hybrid</span><span style="color: #ECEFF4">&#39;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">time</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">options</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">ignore_status</span><span style="color: #D8DEE9FF">=&#91;400</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 404&#93;).</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">delete</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">INDEX_NAME</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ignore_unavailable</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">time</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sleep</span><span style="color: #D8DEE9FF">(3) # </span><span style="color: #8FBCBB">Give</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">moment</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">deletion</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">propagate</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">exists</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">INDEX_NAME</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">create</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">INDEX_NAME</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">mappings</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">index_mapping</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Index created.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Index already exists.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF"># -----------------------------------------------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Indexing</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> ------------------------------------------------- </span></span>
<span class="line"><span style="color: #8FBCBB">product_names</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">df_docs</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">ProductName</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;.</span><span style="color: #8FBCBB">apply</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">lambda</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">lower</span><span style="color: #D8DEE9FF">()&#91;:256&#93;).</span><span style="color: #8FBCBB">to_list</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF"> = 1</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">name</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">product_names</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Indexing:{i} {name}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF"> += 1</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">vector</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">encode</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">passage: {name}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">).</span><span style="color: #8FBCBB">tolist</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">doc</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &quot;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&quot;: </span><span style="color: #8FBCBB">name</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &quot;</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">&quot;: </span><span style="color: #8FBCBB">vector</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">INDEX_NAME</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">doc</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">refresh</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># ---------------------------------------------------------------------</span></span></code></pre></div>
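<p>The loop above sends one request per product and refreshes the index after every document, which can be slow for larger catalogues. A minimal sketch of a batched alternative follows, assuming the <code>elasticsearch</code> Python client 8.x and the field names declared in the mapping (<code>product_name</code>, <code>embedding</code>). The names <code>build_actions</code>, <code>bulk_index</code> and <code>fake_encode</code> are hypothetical helpers introduced only for illustration.</p>

```python
def build_actions(index_name, names, encode,
                  text_field="product_name", vector_field="embedding"):
    """Yield one bulk action per product name.

    `encode` is any callable that turns a string into a list of floats,
    for example `lambda s: model.encode(s).tolist()`.
    """
    for name in names:
        yield {
            "_index": index_name,
            "_source": {
                text_field: name,
                # The "passage: " prefix mirrors the per-document loop above.
                vector_field: encode(f"passage: {name}"),
            },
        }


def bulk_index(es, index_name, names, encode):
    """Send all documents in batches, then refresh the index once."""
    # Imported here so build_actions() can be tried without the client installed.
    from elasticsearch import helpers

    indexed, errors = helpers.bulk(es, build_actions(index_name, names, encode))
    es.indices.refresh(index=index_name)
    return indexed, errors


# Offline demonstration with a stand-in encoder (hypothetical, not a real model).
def fake_encode(text):
    return [float(len(text)), 0.0]


demo_actions = list(build_actions("demo_index", ["drill 18v", "laser level"], fake_encode))
print(demo_actions[0]["_source"]["product_name"])
```

<p>With a live cluster the call would look like <code>bulk_index(es, INDEX_NAME, product_names, lambda s: model.encode(s).tolist())</code>. Refreshing once at the end, instead of after every document, is the main design choice here.</p>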



<p>You can check your index with a few diagnostic functions: list the indices with their record counts and print the field types from the mapping.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># --- Diagnostic functions  ---------------------------------------
def indices_list():
    return [index&#91;'index'&#93; for index in es.cat.indices(format='json')]   

def count_records(index_name):
    return es.count(index=index_name)&#91;'count'&#93;    
# -----------------------------------------------------------------

# Get indices and record count ------------------------------------
for i in indices_list():
    print(i, count_records(i))

# Get mappings
mapping = es.indices.get_mapping(index=INDEX_NAME)
fields = mapping&#91;INDEX_NAME&#93;&#91;'mappings'&#93;&#91;'properties'&#93;
for field, details in fields.items():
    print(f"Model {INDEX_NAME}, fields {field}: {details&#91;'type'&#93;}")
#------------------------------------------------------------------</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">---</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Diagnostic</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">functions</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">---------------------------------------</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">indices_list</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> [</span><span style="color: #D8DEE9">index</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">index</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">index</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">cat</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">indices</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">format</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">json</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)]   </span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">count_records</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">index_name</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">count</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">index_name</span><span style="color: #D8DEE9FF">)&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">count</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;    </span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">-----------------------------------------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Get</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">indices</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">record</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">count</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">------------------------------------</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">i</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">indices_list</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">i</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">count_records</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">i</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Get</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">mappings</span></span>
<span class="line"><span style="color: #D8DEE9">mapping</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">indices</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get_mapping</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">fields</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">mapping</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">mappings</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">properties</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">field</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">details</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">fields</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">items</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Model {INDEX_NAME}, fields {field}: {details&#91;&#39;type&#39;&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">#</span><span style="color: #81A1C1">------------------------------------------------------------------</span></span></code></pre></div>



<h2 class="wp-block-heading">Semantic search</h2>



<p>Now that you have an index, you can search it. You can read more in my post <a href="https://mietwood.com/semantic-search-with-elasticsearch">Semantic Search with Elasticsearch</a>.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Semantic Search --- 

query_text = "yellow texas rose"
query_vector = model.encode(f"query: {query_text}").tolist()
knn_query = {
    "field": "embedding", #"ProductNameVector",
    "query_vector": query_vector,
    "k": 10,
    "num_candidates": 50
}
response = es.search(index=INDEX_NAME, knn=knn_query, source=&#91;"product_name"&#93;)
for hit in response&#91;'hits'&#93;&#91;'hits'&#93;:
    print(f"  - Product: {hit&#91;'_source'&#93;&#91;'product_name'&#93;} (Score: {hit&#91;'_score'&#93;:.4f})")
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Semantic</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Search</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">---</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">query_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">yellow texas rose</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9">query_vector</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">model</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">encode</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query: {query_text}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">tolist</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9">knn_query</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">field</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> #</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ProductNameVector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_vector</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">k</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">num_candidates</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">50</span></span>
<span class="line"><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">search</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">knn</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">knn_query</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">source</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;)</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">  - Product: {hit&#91;&#39;_source&#39;&#93;&#91;&#39;product_name&#39;&#93;} (Score: {hit&#91;&#39;_score&#39;&#93;:.4f})</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>



<p>You can also perform a lexical search or, finally, a hybrid search that combines a text match with a kNN clause.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>def search_score_script(query_h):
    response = es.search(
        index=INDEX_NAME,
        body={
            "_source": &#91;'product_name','prodidx','img_url'&#93;,
            "query": {
                "function_score": {
                    "query": {
                        "bool": {
                            "should": &#91;
                                {
                                    "match": {
                                        "product_name": {
                                            "query": query_h,
                                            'operator': 'and',
                                            "_name": "text_match"
                                        }
                                    }
                                },
                                {
                                    "knn": {
                                        "field": "embedding",
                                        "query_vector": model.encode(f"query: {query_h}").tolist(),
                                        "k": 30,
                                        "num_candidates": 300,
                                        "_name": "semantic_search"
                                    }
                                }
                            &#93;
                        }
                    },                 
                }
            },
            "size": 100
        }
    )
    return response

response = search_score_script(query_text)
products = [
         {
             "product_name": hit&#91;"_source"&#93;&#91;"product_name"&#93;,
             "score": hit&#91;"_score"&#93;,
             "matched_queries": hit.get("matched_queries", []),
             "prodidx": hit&#91;"_source"&#93;&#91;"prodidx"&#93;,
         }
         for hit in response&#91;"hits"&#93;&#91;"hits"&#93;
     ]

print([(p&#91;"product_name"&#93;, p&#91;"score"&#93;) for p in products])
 </textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">search_score_script</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">search</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">body</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">prodidx</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">img_url</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">function_score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">bool</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">should</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> &#91;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">match</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_h</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                            </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">operator</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">and</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text_match</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">knn</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">field</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">model</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">encode</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query: {query_h}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">tolist</span><span style="color: #D8DEE9FF">()</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">k</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">30</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">num_candidates</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">300</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">semantic_search</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            &#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">},</span><span style="color: #D8DEE9FF">                 </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">size</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">100</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">    )</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">search_score_script</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">query_text</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span></span>
<span class="line"><span style="color: #D8DEE9FF">         </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">matched_queries</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">matched_queries</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> [])</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prodidx</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prodidx</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">         </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">         </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">     ]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">([(</span><span style="color: #D8DEE9">p</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">p</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;) </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">p</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF">])</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span></span></code></pre></div>
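
<p>The hybrid query above names its two clauses <code>text_match</code> and <code>semantic_search</code>, and each hit reports which of them it satisfied in <code>matched_queries</code>. As a minimal sketch, the helper below groups product names by the clause that matched them. The function name <code>group_by_matched_query</code> and the sample response are hypothetical, made up for illustration; only the response shape follows what the code above already reads.</p>

```python
from collections import defaultdict


def group_by_matched_query(response):
    """Map each named query to the product names of the hits it matched."""
    groups = defaultdict(list)
    for hit in response["hits"]["hits"]:
        # A hit may match the lexical clause, the kNN clause, or both.
        for name in hit.get("matched_queries", []):
            groups[name].append(hit["_source"]["product_name"])
    return dict(groups)


# Hypothetical response fragment, shaped like an Elasticsearch search result.
sample_response = {
    "hits": {
        "hits": [
            {
                "_score": 1.9,
                "_source": {"product_name": "Yellow Rose of Texas"},
                "matched_queries": ["text_match", "semantic_search"],
            },
            {
                "_score": 0.8,
                "_source": {"product_name": "Golden Climbing Rose"},
                "matched_queries": ["semantic_search"],
            },
        ]
    }
}

groups = group_by_matched_query(sample_response)
print(groups)
```

<p>Products that appear only under <code>semantic_search</code> are the ones a purely lexical search would have missed, which is a quick way to check what the vector clause adds.</p>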



<h4 class="wp-block-heading"><strong>Sentence Transformers (e.g., MiniLM, BERT variants)</strong> are a strong model choice for semantic search</h4>



<ul class="wp-block-list">
<li><strong>Type</strong>: Dense vector models.</li>



<li><strong>Pros</strong>:
<ul class="wp-block-list">
<li>Rich semantic understanding.</li>



<li>Multilingual support.</li>



<li>Fine-tuning possible for domain-specific needs.</li>
</ul>
</li>



<li><strong>Popular Models</strong>:
<ul class="wp-block-list">
<li><code>msmarco-MiniLM-L-12-v3</code> – optimized for asymmetric search (short queries vs. long product descriptions) <a>3</a>.</li>



<li><code>all-MiniLM-L6-v2</code> – fast and lightweight for general semantic tasks.</li>
</ul>
</li>



<li><strong>Use Case</strong>: Ideal for large-scale product catalogs and multilingual e-commerce platforms <a>3</a> <a>4</a>.</li>
</ul>



<h4 class="wp-block-heading"><strong>Hybrid Search (BM25 + Semantic)</strong></h4>



<ul class="wp-block-list">
<li>Combine <strong>BM25</strong> (keyword relevance) with <strong>semantic embeddings</strong> using <strong>Reciprocal Rank Fusion (RRF)</strong>.</li>



<li>Delivers highly relevant results by balancing literal matches and contextual meaning <a>1</a>.</li>
</ul>
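
<p>Reciprocal Rank Fusion can be sketched in a few lines of plain Python. Each document receives <code>1 / (k + rank)</code> from every ranked list it appears in, and the contributions are summed. The function name, the product ids, and the constant <code>k = 60</code> (a commonly used default) are assumptions for illustration, not something taken from the search setup above.</p>

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists: product ids ordered by each retriever.
bm25_ranking = ["p1", "p2", "p3"]
knn_ranking = ["p3", "p1", "p4"]

fused = reciprocal_rank_fusion([bm25_ranking, knn_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

<p>Because RRF uses only rank positions, the BM25 and vector scores do not need to be on the same scale. Documents that both retrievers rank highly (here <code>p1</code> and <code>p3</code>) rise to the top.</p>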



<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search">Semantic Search</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Google Ads Conversion and GA4 Revenue Difference</title>
		<link>https://mietwood.com/google-ads-conversion-and-ga4-revenue-difference</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Wed, 13 Aug 2025 12:54:53 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Customer Experience Management]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3257</guid>

					<description><![CDATA[<p>Google Ads Conversion and GA4 Revenue Difference &#8211; It&#8217;s a common and frustrating issue for digital marketers. A difference between Google Ads Total Conversion Value and Google Analytics (GA4) CPC Revenue can really ruin your day. While a small discrepancy (10-20%) is often considered normal, a large difference signals a need for investigation. Why Google Ads Conversion...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/google-ads-conversion-and-ga4-revenue-difference">Google Ads Conversion and GA4 Revenue Difference</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The difference between Google Ads conversion value and GA4 revenue is a common and frustrating issue for digital marketers. A gap between Google Ads Total Conversion Value and Google Analytics (GA4) CPC Revenue can really ruin your day. While a small discrepancy (10-20%) is often considered normal, a large difference signals a need for investigation. The sections below explain why the difference between Google Ads Conversion Value and GA4 CPC Revenue exists.</p>
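


<p>To put a number on the gap, a small helper like the one below can be enough. This is only a sketch: the function names, the choice of GA4 revenue as the base of the percentage, and the 20% tolerance are assumptions for illustration, not an official formula.</p>

```python
def discrepancy_percent(ads_value, ga4_revenue):
    """Percent by which Google Ads conversion value differs from GA4 revenue.

    GA4 revenue is used as the base (an assumption; pick one base and keep it).
    """
    if ga4_revenue == 0:
        raise ValueError("GA4 revenue is zero, so the percentage is undefined")
    return (ads_value - ga4_revenue) / ga4_revenue * 100


def needs_investigation(ads_value, ga4_revenue, tolerance_percent=20.0):
    """True when the gap is larger than the tolerated discrepancy."""
    return abs(discrepancy_percent(ads_value, ga4_revenue)) > tolerance_percent


# Hypothetical monthly totals in the account currency.
print(round(discrepancy_percent(11500.0, 10000.0), 1))
print(needs_investigation(11500.0, 10000.0))
print(needs_investigation(14000.0, 10000.0))
```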



<h3 class="wp-block-heading">Attribution Models and Credit</h3>



<p><strong>Google Ads Attribution:</strong> By default, Google Ads uses a data-driven attribution model, which gives credit to various touchpoints along the conversion path. The default conversion window is 30 days: Google Ads will take credit for a conversion as long as the user interacted with one of your ads within the previous 30 days. You can define a shorter lookback window.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="385" src="https://mietwood.com/wp-content/uploads/2025/08/image-1.jpg" alt="" class="wp-image-3258" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-1.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/08/image-1-300x113.jpg 300w, https://mietwood.com/wp-content/uploads/2025/08/image-1-768x289.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Why Google Ads Conversion Value and GA4 CPC Revenue Difference &#8211; credit days</figcaption></figure>



<p><strong>GA4 Attribution:</strong> GA4&#8217;s default is also data-driven, but it&#8217;s &#8220;cross-channel.&#8221; This means it considers all marketing channels (organic search, direct, social, email, etc.) in the customer journey, not just Google Ads. If a user clicks a Google ad, then later comes back to your site via organic search and converts, GA4&#8217;s data-driven model will distribute credit to both channels, while Google Ads will likely claim all or most of the conversion value.</p>



<h3 class="wp-block-heading">Time-Based Reporting</h3>



<p>The way each platform logs a conversion can create a significant reporting gap, especially when analyzing recent data.</p>



<ul class="wp-block-list">
<li><strong>Google Ads:</strong> Attributes a conversion to the <strong>date of the ad click</strong> or impression.</li>



<li><strong>GA4:</strong> Attributes a conversion to the <strong>date of the actual transaction</strong> or conversion event.</li>
</ul>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="810" height="452" src="https://mietwood.com/wp-content/uploads/2025/08/image-3.jpg" alt="" class="wp-image-3260" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-3.jpg 810w, https://mietwood.com/wp-content/uploads/2025/08/image-3-300x167.jpg 300w, https://mietwood.com/wp-content/uploads/2025/08/image-3-768x429.jpg 768w" sizes="auto, (max-width: 810px) 100vw, 810px" /><figcaption class="wp-element-caption"><a href="https://www.youtube.com/watch?v=kJSxckE3E6k" target="_blank" rel="noopener">https://www.youtube.com/watch?v=kJSxckE3E6k</a></figcaption></figure>



<p>For example, if a user clicks a Google ad on September 20th but makes a purchase on October 5th, Google Ads will report the conversion value in September, while GA4 will report it in October. This &#8220;conversion lag&#8221; can cause large differences when comparing monthly or weekly reports.</p>
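


<p>The effect is easy to reproduce with a few rows of data. The sketch below uses made-up conversions (the first row mirrors the September 20th / October 5th example, with the year assumed) and sums the same values once by click date and once by purchase date.</p>

```python
from collections import defaultdict
from datetime import date

# Hypothetical conversions: (ad click date, purchase date, value).
conversions = [
    (date(2025, 9, 20), date(2025, 10, 5), 120.0),
    (date(2025, 9, 28), date(2025, 9, 29), 80.0),
    (date(2025, 10, 2), date(2025, 10, 3), 50.0),
]


def monthly_totals(rows, use_click_date):
    """Sum conversion value per (year, month) bucket."""
    totals = defaultdict(float)
    for click_day, purchase_day, value in rows:
        day = click_day if use_click_date else purchase_day
        totals[(day.year, day.month)] += value
    return dict(totals)


ads_view = monthly_totals(conversions, use_click_date=True)   # Google Ads style
ga4_view = monthly_totals(conversions, use_click_date=False)  # GA4 style
print(ads_view)
print(ga4_view)
```

<p>Both views add up to the same total, but the split between September and October is different, which is exactly what you see when comparing short date ranges.</p>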



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="816" height="418" src="https://mietwood.com/wp-content/uploads/2025/08/image-4.jpg" alt="Google Ads Conversion and GA4 Revenue Difference" class="wp-image-3261" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-4.jpg 816w, https://mietwood.com/wp-content/uploads/2025/08/image-4-300x154.jpg 300w, https://mietwood.com/wp-content/uploads/2025/08/image-4-768x393.jpg 768w" sizes="auto, (max-width: 816px) 100vw, 816px" /><figcaption class="wp-element-caption">Google Ads Conversion and GA4 Revenue Difference</figcaption></figure>



<h3 class="wp-block-heading">Conversion Counting and Definitions</h3>



<p>How you set up your conversion counting and conversion values can lead to different numbers. Conversion values for e-commerce purchases are especially important.</p>



<p><strong>Conversion Counting:</strong> In Google Ads, you can choose to count a conversion &#8220;once&#8221; or &#8220;every&#8221; time it happens. For example, if a user submits a form multiple times, you might want to count it only once. However, for a purchase, you&#8217;d want to count every transaction. If these settings are not aligned between Google Ads and GA4, your numbers will not match.</p>
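


<p>The difference between the two counting methods can be sketched in a few lines. The events below are invented, and &#8220;once&#8221; is modeled here as one conversion per ad click, which is a simplifying assumption for the example.</p>

```python
# Hypothetical conversion events: (click_id, conversion_action).
events = [
    ("click-1", "lead_form"),
    ("click-1", "lead_form"),  # the same user submits the form twice
    ("click-2", "lead_form"),
    ("click-3", "purchase"),
    ("click-3", "purchase"),   # two separate orders after one click
]


def count_conversions(rows, action, counting):
    """Count conversions for one action using 'once' or 'every' counting."""
    matching = [click_id for click_id, name in rows if name == action]
    if counting == "every":
        return len(matching)
    if counting == "once":
        # One conversion per ad click (assumption for this sketch).
        return len(set(matching))
    raise ValueError("counting must be 'once' or 'every'")


print(count_conversions(events, "lead_form", "once"))
print(count_conversions(events, "lead_form", "every"))
print(count_conversions(events, "purchase", "every"))
```

<p>If one platform counts the purchase action with &#8220;once&#8221; and the other with &#8220;every&#8221;, the same events produce different totals.</p>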



<p><strong>Conversion Actions:</strong> If you have different conversion actions set up in Google Ads and GA4, or if they are not correctly linked, you will see a discrepancy. For example, if you track a &#8220;purchase&#8221; in GA4 but have a different conversion action for &#8220;leads&#8221; in Google Ads, the numbers will naturally be different.</p>



<h3 class="wp-block-heading">Technical and User-Based Factors</h3>



<p>These are often smaller but can add up to a substantial difference.</p>



<ul class="wp-block-list">
<li><strong>Ad Blockers and User Consent:</strong> Some ad blockers and privacy settings can prevent GA4&#8217;s tracking code from firing, meaning a session and conversion might not be recorded in GA4. However, the Google Ads conversion tag is often less affected, so Google Ads may still report the conversion. Similarly, if a user opts out of tracking via a consent banner, GA4 may not receive the data.</li>



<li><strong>Quick Exits:</strong> A user may click a Google ad, be charged for the click, and then hit the back button before the GA4 tracking tag has a chance to load. Google Ads will count the click, but GA4 won&#8217;t record a session, leading to a discrepancy between clicks and sessions, and ultimately, conversions.</li>



<li><strong>Cross-Device Conversions:</strong> Google Ads uses modeled conversions to account for users who start their journey on one device and finish it on another. This can lead to a higher conversion count in Google Ads than in GA4, especially if Google Signals is not enabled in GA4.</li>



<li><strong>View-Through Conversions:</strong> Google Ads counts &#8220;view-through conversions,&#8221; which are conversions that happen after a user sees a display ad but doesn&#8217;t click on it. GA4 does not track these by default, which can cause Google Ads to report more conversions and higher conversion value.</li>
</ul>



<h3 class="wp-block-heading">What to Do About It</h3>



<p><strong>Align Your Attribution Models:</strong> Consider using the same attribution model in both platforms for a more direct comparison. While Google Ads defaults to data-driven, you can use the &#8220;Model Comparison&#8221; tool in GA4 to see how your data would look under a different model, like &#8220;Last Click.&#8221;</p>



<p>The Model Comparison report, also referred to as the Attribution models report, in Google Analytics 4 (GA4) is a tool that allows you to compare how different attribution models distribute credit for conversions. It helps you understand how various marketing channels, like paid search, social media, and organic search, contribute to a user&#8217;s conversion path. This is a crucial report for understanding the true value of your marketing efforts beyond just the last touchpoint.</p>



<p>The report lets you select a conversion event and then view how its value would be allocated to different channels under two different attribution models. You can then see a percentage change, which highlights which channels are over- or undervalued depending on the model you use.</p>
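


<p>The percentage change itself is simple arithmetic. The sketch below assumes two hypothetical sets of channel values (the channel names and numbers are made up) and computes the change when moving from the first model to the second.</p>

```python
def percent_change(model_a, model_b):
    """Percent change in credited value per channel, from model A to model B."""
    changes = {}
    for channel, value_a in model_a.items():
        value_b = model_b.get(channel, 0.0)
        if value_a == 0:
            # No credit under model A, so the percent change is undefined.
            changes[channel] = None
        else:
            changes[channel] = round((value_b - value_a) / value_a * 100, 1)
    return changes


# Hypothetical conversion value credited to each channel under two models.
last_click = {"Paid Search": 1000.0, "Organic Search": 600.0, "Paid Social": 200.0}
data_driven = {"Paid Search": 850.0, "Organic Search": 630.0, "Paid Social": 320.0}

print(percent_change(last_click, data_driven))
```

<p>In this made-up case the total value is the same under both models, but paid social gains credit and paid search loses some, which is the kind of shift the report is meant to surface.</p>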



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Key Attribution Models to Compare</h3>



<p>GA4 offers a few attribution models to choose from, with <strong>Data-driven</strong> being the default. Comparing this with other models is where the tool&#8217;s value really shines.</p>



<ul class="wp-block-list">
<li><strong>Data-driven:</strong> This model uses machine learning to analyze all the touchpoints in a user&#8217;s journey, including both converting and non-converting paths. It then assigns credit based on the actual impact of each touchpoint. It&#8217;s considered the most accurate and sophisticated model.</li>



<li><strong>Paid and organic last click:</strong> This is a rule-based model that gives 100% of the conversion credit to the last channel a user clicked on before converting, ignoring any direct traffic.</li>



<li><strong>Google paid channels last click:</strong> This model gives 100% of the credit to the last Google Ads click before a conversion. If there was no Google Ads click, it defaults to the &#8220;Paid and organic last click&#8221; model.</li>
</ul>
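


<p>The two rule-based models can be sketched as small functions over a conversion path. The channel labels and sample paths are hypothetical, and the handling of a path made up only of direct visits (direct keeps the credit) is an assumption of this sketch.</p>

```python
def paid_and_organic_last_click(path):
    """Channel that gets 100% of the credit: the last non-direct touchpoint."""
    non_direct = [channel for channel in path if channel != "direct"]
    # Assumption: if the whole path is direct traffic, direct keeps the credit.
    return non_direct[-1] if non_direct else "direct"


def google_paid_last_click(path):
    """Credit Google Ads if it appears in the path, otherwise fall back."""
    if "google_ads" in path:
        return "google_ads"
    return paid_and_organic_last_click(path)


# Hypothetical conversion paths, oldest touchpoint first.
path_with_ad = ["google_ads", "organic_search", "direct"]
path_without_ad = ["email", "direct"]

print(paid_and_organic_last_click(path_with_ad))
print(google_paid_last_click(path_with_ad))
print(google_paid_last_click(path_without_ad))
```

<p>The same path gives credit to organic search under one model and to Google Ads under the other, which mirrors why the two platforms can disagree about a single conversion.</p>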



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Why is this important?</h3>



<p>Understanding how different attribution models impact your data is critical for making smart business decisions. For instance, a <strong>&#8220;last click&#8221;</strong> model might make your paid search campaigns look like they&#8217;re driving all your conversions, leading you to undervalue channels like social media or display ads that introduce users to your brand much earlier in their journey. By using the Model Comparison report, you can see how a channel&#8217;s value changes when you shift from a last-click to a data-driven model, which can lead to better budgeting and optimization.</p>



<p>The video tutorial &#8220;How to Use the Model Comparison Tool in Google Analytics to Compare Your Traffic Channels&#8221; shows how to use the Model Comparison Tool to analyze your traffic channels and conversion data in Google Analytics.</p>



<p><strong>Check Your Conversion Settings:</strong> Ensure that your conversion actions are correctly set up and linked, and that the counting method is consistent across both platforms.</p>



<p><strong>Check Your Date Ranges:</strong> When comparing, make sure you&#8217;re using a long enough date range (e.g., a full month or longer) to account for any conversion lag.</p>



<p><strong>Enable Auto-Tagging:</strong> Make sure auto-tagging is enabled in your Google Ads account so that GA4 can properly attribute traffic back to your campaigns.</p>



<p><strong>Don&#8217;t Panic:</strong> It&#8217;s normal to have some level of discrepancy. The key is to understand <em>why</em> the differences exist and use both platforms for what they&#8217;re best at: Google Ads for optimizing your paid campaigns and GA4 for understanding the full, multi-channel customer journey.</p>



<p>Read more <a href="https://mietwood.com/e-commerce-manager-dashboard">here</a></p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
