<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Python &#8211; Customer Experience Management</title>
	<atom:link href="https://mietwood.com/category/python/feed" rel="self" type="application/rss+xml" />
	<link>https://mietwood.com</link>
	<description>Customer Experience Can Be Managed</description>
	<lastBuildDate>Wed, 31 Dec 2025 11:44:55 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://mietwood.com/wp-content/uploads/2022/09/cropped-Fav7-32x32.png</url>
	<title>Python &#8211; Customer Experience Management</title>
	<link>https://mietwood.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Product Grouping in Noisy E-commerce Datasets</title>
		<link>https://mietwood.com/product-grouping-in-noisy-e-commerce-datasets</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Tue, 30 Dec 2025 09:34:33 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3438</guid>

					<description><![CDATA[<p>Hybrid Approaches to Product Grouping in Noisy E-commerce Datasets: A Comparative Analysis of Set-Theoretic vs. Vector Space Models. In scientific literature, this problem is formally known as Short Text Clustering (STC) or Product Entity Resolution. The product similarity measures can be read from here: Measuring product similarity &#8211; 5 important secrets of python programming. Machine...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/product-grouping-in-noisy-e-commerce-datasets">Product Grouping in Noisy E-commerce Datasets</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>This post outlines hybrid approaches to product grouping in noisy e-commerce datasets and compares set-theoretic with vector space models. In the scientific literature, this problem is formally known as <strong>Short Text Clustering (STC)</strong> or <strong>Product Entity Resolution</strong>. Product similarity measures are covered in <a href="https://mietwood.com/measuring-product-similarity">Measuring product similarity &#8211; 5 important secrets of python programming</a>, and a machine learning approach to product clustering is described in <a href="https://mietwood.com/hierarchical-agglomerative-clustering-for-product-grouping">Hierarchical Agglomerative Clustering for Product Grouping</a>.</p>



<h3 class="wp-block-heading">The &#8220;Marketplace Deduplication&#8221; Problem</h3>



<p>Context: Large marketplaces (Amazon, eBay, Alibaba, or a niche aggregator) allow thousands of third-party sellers to upload their own product feeds, so the same product arrives under many different titles.</p>



<p>The Problem:</p>



<ul class="wp-block-list">
<li>Seller A uploads: <em>&#8220;Apple iPhone 13, 128GB, Midnight&#8221;</em></li>



<li>Seller B uploads: <em>&#8220;iPhone 13 128 GB Black Unlocked&#8221;</em></li>



<li>Seller C uploads: <em>&#8220;Smartfon Apple iPhone 13 128GB (MLPF3PM/A)&#8221;</em></li>
</ul>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>Why it&#8217;s hard &#8211; The Scientific Challenge</strong></summary>
<ul class="wp-block-list">
<li><strong>Missing Identifiers:</strong> Sellers often omit EAN/UPC codes to avoid price comparisons.</li>



<li><strong>Attribute Noise:</strong> &#8220;Midnight&#8221; vs. &#8220;Black&#8221; (synonyms).</li>



<li><strong>Goal:</strong> You must cluster these into a <strong>single catalog entry</strong> (The &#8220;Golden Record&#8221;) to show the user one product page with a list of 3 sellers, rather than 3 separate search results.</li>



<li><strong>Metric:</strong> <em>False Positives</em> are costly here (grouping an iPhone 13 <strong>Pro</strong> with a regular iPhone 13 causes returns).</li>
</ul>
</details>



<h3 class="wp-block-heading">The &#8220;Omnichannel Customer Stitching&#8221; Problem (Single Customer View)</h3>



<p>Context: A retailer sells through a Website, a Mobile App, and Physical Stores. They want to know if the person browsing the app is the same person buying in the store.</p>



<p>The Problem:</p>



<ul class="wp-block-list">
<li><strong>Record A (Online):</strong> <code>email: j.smith@gmail.com</code>, <code>cookie_id: xyz123</code>, <code>behavior: viewed running shoes</code></li>



<li><strong>Record B (In-Store POS):</strong> <code>card_hash: ****-1234</code>, <code>loyalty_id: 998877</code>, <code>name: John Smith</code></li>



<li><strong>Record C (Customer Support):</strong> <code>phone: +48 500...</code>, <code>name: Johnny Smith</code>, <code>complaint: "Shoes size 42 too small"</code></li>
</ul>






<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>Why it&#8217;s hard</strong></summary>
<ul class="wp-block-list">
<li><strong>Disjoint Attributes:</strong> Record A has no phone number; Record C has no email. You need &#8220;transitive linking&#8221; (A links to B, B links to C $\rightarrow$ A links to C).</li>



<li><strong>Privacy/Hashing:</strong> You are often matching hashed values or partial PII (Personally Identifiable Information).</li>



<li><strong>Goal:</strong> Create a <strong>Customer 360</strong> profile to send a targeted email: <em>&#8220;Hi John, sorry the size 42 didn&#8217;t fit. Here is a discount for size 43.&#8221;</em></li>
</ul>
</details>
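
<p>The transitive linking mentioned above can be sketched with a union-find (disjoint-set) structure. This is a minimal illustration, not a production matcher: the record names and the two pairwise links are assumptions that stand in for whatever matching rules produce them.</p>

<pre class="wp-block-code"><code>def find(parent, x):
    """Return the representative of x, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def link_records(records, links):
    """Group records into customers, following pairwise links transitively."""
    parent = {r: r for r in records}
    for a, b in links:
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for r in records:
        groups.setdefault(find(parent, r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical links: A and B share a loyalty account, B and C share a name match.
records = ["A", "B", "C", "D"]
links = [("A", "B"), ("B", "C")]
customers = link_records(records, links)
print(customers)
</code></pre>

<p>Records A and C never match directly, yet they end up in the same group because both are linked to B. Record D has no link and stays on its own.</p>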



<h3 class="wp-block-heading">The &#8220;Competitor Price Monitoring&#8221; Problem</h3>



<p>Context: An e-commerce store wants to automatically adjust its prices to be $1 cheaper than its biggest competitor.</p>



<p>The Problem:</p>



<ul class="wp-block-list">
<li><strong>Your Product:</strong> <em>&#8220;Samsung Galaxy S20 FE 5G Cloud Navy&#8221;</em></li>



<li><strong>Competitor Site:</strong> <em>&#8220;Samsung S20 Fan Edition (Navy) &#8211; 5G Compatible&#8221;</em></li>
</ul>






<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>Why it&#8217;s hard &#8211; The Scientific Challenge:</strong></summary>
<ul class="wp-block-list">
<li><strong>Adversarial Data:</strong> Competitors intentionally slightly alter names or use unique internal SKUs to prevent scraping and matching.</li>



<li><strong>Asymmetry:</strong> You have your full database (structured), but the competitor data is scraped (unstructured, noisy HTML).</li>



<li><strong>Goal:</strong> Map competitor SKUs to your SKUs with high precision. If you map to the wrong product (e.g., a cheaper &#8220;Lite&#8221; version), your dynamic pricing algorithm will lower your price too much and you lose money.</li>
</ul>
</details>



<h4 class="wp-block-heading"><strong>Abstract</strong></h4>



<ul class="wp-block-list">
<li><strong>Problem:</strong> E-commerce catalogs suffer from redundancy (same product, different sizes/variants).</li>



<li><strong>Gap:</strong> Manual grouping does not scale; Deep Learning can be overkill and too imprecise for strict SKU grouping.</li>



<li><strong>Method:</strong> We compare Jaccard (Set) vs. TF-IDF (Vector) and propose a normalization pipeline.</li>



<li><strong>Result:</strong> Our method achieved X% accuracy with Y% reduction in computational time.</li>
</ul>



<h4 class="wp-block-heading"><strong>Set-Theoretic Approaches for Product Grouping (Jaccard)</strong></h4>



<ul class="wp-block-list">
<li><strong>Concept:</strong> Treats text as an unweighted set of tokens (a &#8220;Bag of Words&#8221; without counts or weights).</li>



<li><strong>Key Papers/Concepts:</strong>
<ul class="wp-block-list">
<li><em>Cohen et al. (2003)</em> compare string distance metrics for name-matching tasks.</li>



<li><strong>Shingling / MinHash:</strong> In large datasets, calculating Jaccard for all pairs is $O(N^2)$. Literature focuses on <strong>Locality Sensitive Hashing (LSH)</strong> (MinHash) to approximate Jaccard similarity efficiently.</li>



<li><strong>Pros in Literature:</strong> High interpretability, excellent for &#8220;near-duplicate&#8221; detection.</li>



<li><strong>Cons:</strong> Fails when synonyms are used (e.g., &#8220;pants&#8221; vs &#8220;trousers&#8221;) or when word importance varies.</li>
</ul>
</li>
</ul>
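
<p>As a minimal sketch of the set-theoretic approach, the snippet below computes token-level Jaccard similarity for two of the seller titles from the marketplace example. The <code>tokens</code> helper and its very simple tokenization are assumptions made for illustration.</p>

<pre class="wp-block-code"><code>def tokens(title):
    """Lowercase a title and split it into a set of word tokens."""
    return set(title.lower().replace(",", " ").split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: shared tokens over all tokens."""
    if not a and not b:
        return 1.0
    return len(a.intersection(b)) / len(a.union(b))


seller_a = tokens("Apple iPhone 13, 128GB, Midnight")
seller_b = tokens("iPhone 13 128 GB Black Unlocked")
score = jaccard(seller_a, seller_b)
print(round(score, 3))
</code></pre>

<p>Only <code>iphone</code> and <code>13</code> are shared, so the score stays low even though both titles describe the same phone: <code>128GB</code> and <code>128 GB</code> tokenize differently, and &#8220;Midnight&#8221; and &#8220;Black&#8221; are treated as unrelated words.</p>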



<h4 class="wp-block-heading"><strong>Vector Space Models (TF-IDF)</strong></h4>



<ul class="wp-block-list">
<li><strong>Concept:</strong> Maps text to a high-dimensional Euclidean space.</li>



<li><strong>Key Papers/Concepts:</strong>
<ul class="wp-block-list">
<li><em>Salton et al. (1975)</em> (The foundational VSM paper).</li>



<li><strong>Character n-grams:</strong> Papers often note that for noisy user-generated content (UGC), character n-grams outperform word tokens because a misspelled word still shares most of its n-grams with the correct form.</li>



<li><strong>Pros:</strong> Handles &#8220;rare words&#8221; (like model numbers) better due to IDF (Inverse Document Frequency) weighting.</li>
</ul>
</li>
</ul>
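
<p>The effect of character n-grams on typos can be illustrated with a small pure-Python TF-IDF and cosine similarity. This is a sketch under stated assumptions: trigrams, a smoothed IDF of the form <code>log((1 + N) / (1 + df)) + 1</code>, and three invented titles. A real pipeline would more likely use a library vectorizer.</p>

<pre class="wp-block-code"><code>import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf_vectors(docs, n=3):
    """Build TF-IDF weighted n-gram vectors (smoothed IDF) for all docs."""
    counts = [char_ngrams(d, n) for d in docs]
    df = Counter(g for c in counts for g in c)
    total = len(docs)
    idf = {g: math.log((1 + total) / (1 + k)) + 1 for g, k in df.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical titles: the second one contains a typo ("Samsnug").
titles = ["Samsung Galaxy S20", "Samsnug Galaxy S20", "Apple iPhone 13"]
vecs = tfidf_vectors(titles)
typo_score = cosine(vecs[0], vecs[1])
other_score = cosine(vecs[0], vecs[2])
print(round(typo_score, 2), round(other_score, 2))
</code></pre>

<p>The misspelled title keeps most of its trigrams in common with the correct one, so its score stays well above that of the unrelated product, which shares no trigrams at all.</p>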



<h4 class="wp-block-heading"><strong>The State-of-the-Art (Deep Learning / Embeddings)</strong></h4>



<ul class="wp-block-list">
<li>If you publish, reviewers will ask: <em>&#8220;Why not BERT?&#8221;</em></li>



<li><strong>SBERT (Sentence-BERT):</strong> Current SOTA uses transformer models to generate dense vector embeddings.</li>



<li><strong>Your Counter-Argument:</strong> Deep learning is computationally expensive and &#8220;black-box&#8221;. For industrial product grouping where <em>exact</em> feature matching (like &#8220;Samsung&#8221; + &#8220;Galaxy&#8221;) is critical, classical methods (TF-IDF/Jaccard) often offer better precision and control than semantic embeddings which might group &#8220;iPhone 12&#8221; with &#8220;Samsung S20&#8221; because they are both &#8220;phones&#8221;.</li>
</ul>



<h4 class="wp-block-heading"><strong>II. Related Work</strong></h4>



<ul class="wp-block-list">
<li>Mention <strong>LSH</strong> (Locality Sensitive Hashing) for Jaccard.</li>



<li>Mention <strong>DBSCAN</strong> and <strong>Agglomerative Clustering</strong> as standard algorithms.</li>



<li>Cite limitations of <strong>BERT</strong> in high-precision SKU matching.</li>



</ul>



<h4 class="wp-block-heading"><strong>III. Methodology (The Core)</strong></h4>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary>Group definition in dataset</summary>
<ul class="wp-block-list">
<li>Define <strong>SKU</strong> vs <strong>Product Group</strong> (Parent-Child relationship).</li>



<li>The challenge: &#8220;Noise&#8221; in titles (e.g., <code>500ml</code>, <code>XL</code>, <code>Pack of 2</code>).</li>
</ul>
</details>



<ol start="1" class="wp-block-list">
<li><strong>Preprocessing &#8211; The Normalization Filter:</strong>
<ul class="wp-block-list">
<li>Define your Regex rules mathematically.</li>



<li>$T_{clean} = f(T_{raw})$ where $f$ removes tokens $t \in \{Dimensions, Colors, Stopwords\}$.</li>
</ul>
</li>



<li><strong>Representation:</strong>
<ul class="wp-block-list">
<li><strong>Approach A (Jaccard):</strong> $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$</li>



<li><strong>Approach B (TF-IDF):</strong> Cosine Similarity $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$</li>
</ul>
</li>



<li><strong>Clustering Algorithm:</strong>
<ul class="wp-block-list">
<li>Explain why you chose <strong>Connected Components</strong> (Graph theory) for Jaccard or <strong>Hierarchical Clustering</strong> for TF-IDF.</li>
</ul>
</li>
</ol>
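
<p>The normalization filter $T_{clean} = f(T_{raw})$ from step 1 can be sketched with a few regular expressions. The unit list, size codes, color set, and stopword set below are hypothetical and only illustrate the idea; a real catalog needs its own token classes.</p>

<pre class="wp-block-code"><code>import re

# Hypothetical token classes; a real catalog would need its own lists.
DIMENSIONS = re.compile(
    r"\b\d+(?:\.\d+)?\s?(?:ml|l|g|kg|gb|cm)\b"  # quantities such as 500ml
    r"|\b(?:xs|xl|xxl)\b"                        # size codes such as XL
    r"|\bpack of \d+\b"                          # multipacks such as Pack of 2
)
COLORS = {"black", "white", "red", "blue", "navy", "midnight"}
STOPWORDS = {"the", "a", "an", "with", "for", "and"}


def normalize(title):
    """Return the cleaned title: lowercase, no dimensions, colors, stopwords."""
    text = DIMENSIONS.sub(" ", title.lower())
    words = re.findall(r"[a-z0-9]+", text)
    return " ".join(w for w in words if w not in COLORS and w not in STOPWORDS)


clean = normalize("The Shampoo Aloe 500ml Pack of 2 Blue")
print(clean)
</code></pre>

<p>After this step, variants that differ only in size, pack count, or color collapse to the same cleaned title, which is what lets the similarity measures in step 2 group SKUs under one parent product.</p>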



<h4 class="wp-block-heading"><strong>IV. Experiments</strong></h4>



<ul class="wp-block-list">
<li><strong>Dataset:</strong> Your dataset of &#8220;several thousand products&#8221;.</li>



<li><strong>Metrics:</strong> You <em>must</em> measure quality.
<ul class="wp-block-list">
<li><strong>Precision:</strong> Are elements in the cluster actually the same product?</li>



<li><strong>Recall:</strong> Did we find <em>all</em> sizes of that product?</li>



<li><strong>F1-Score:</strong> Harmonic mean of the two.</li>
</ul>
</li>
</ul>
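
<p>One common way to compute these metrics for a grouping is pairwise: every pair of SKUs placed in the same group counts as a predicted match. The sketch below assumes that convention and uses invented labels, so it shows the calculation rather than a result.</p>

<pre class="wp-block-code"><code>from itertools import combinations


def same_group_pairs(labels):
    """Return the set of item pairs that share a group label."""
    items = sorted(labels)
    return {(a, b) for a, b in combinations(items, 2) if labels[a] == labels[b]}


def pairwise_scores(truth, predicted):
    """Pairwise precision, recall and F1 of a predicted grouping."""
    true_pairs = same_group_pairs(truth)
    pred_pairs = same_group_pairs(predicted)
    hits = len(true_pairs.intersection(pred_pairs))
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical labels: SKU id mapped to product group id.
truth = {"sku1": "g1", "sku2": "g1", "sku3": "g1", "sku4": "g2"}
predicted = {"sku1": "p1", "sku2": "p1", "sku3": "p2", "sku4": "p2"}
precision, recall, f1 = pairwise_scores(truth, predicted)
print(precision, recall, f1)
</code></pre>

<p>Here the prediction splits one true group and merges <code>sku3</code> with an unrelated SKU, so both precision and recall drop below 1.</p>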



<h4 class="wp-block-heading"><strong>V. Results &amp; Discussion</strong></h4>



<ul class="wp-block-list">
<li><em>Hypothesis:</em> Jaccard works better for clean data; TF-IDF works better for noisy data (typos).</li>



<li><em>Observation:</em> Jaccard is faster but TF-IDF + N-grams is more robust.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Novel Algorithm:</strong></p>



<ol start="1" class="wp-block-list">
<li><strong>Stage 1 (Blocking):</strong> Use <strong>Jaccard</strong> on tokens to quickly find &#8220;Candidate Pairs&#8221; (fast, rough filter).</li>



<li><strong>Stage 2 (Refinement):</strong> Use <strong>TF-IDF with Character N-grams</strong> on the candidates to calculate a precise similarity score (handles typos).</li>



<li><strong>Stage 3 (Decision):</strong> Apply a hard threshold to the similarity score (e.g., > 0.85) to accept a pair as a match.</li>
</ol>
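
<p>The three stages can be sketched end to end as follows. This is an illustration under assumptions, not a tuned implementation: the blocking threshold of 0.3 and the example titles are invented, the IDF is the smoothed form <code>log((1 + N) / (1 + df)) + 1</code>, and Stage 1 still scans all pairs for clarity. At scale, an inverted index or MinHash LSH would replace that scan.</p>

<pre class="wp-block-code"><code>import math
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def ngram_counts(text, n=3):
    """Count overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values()) * sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def match_pairs(titles, block_at=0.3, accept_at=0.85):
    """Return index pairs accepted as the same product."""
    lowered = [t.lower() for t in titles]
    token_sets = [set(t.split()) for t in lowered]
    counts = [ngram_counts(t) for t in lowered]
    df = Counter(g for c in counts for g in c)
    idf = {g: math.log((1 + len(titles)) / (1 + k)) + 1 for g, k in df.items()}
    vecs = [{g: tf * idf[g] for g, tf in c.items()} for c in counts]
    matches = []
    for i, j in combinations(range(len(titles)), 2):
        # Stage 1 (blocking): cheap token Jaccard filter.
        if jaccard(token_sets[i], token_sets[j]) >= block_at:
            # Stage 2 (refinement): character n-gram TF-IDF cosine.
            score = cosine(vecs[i], vecs[j])
            # Stage 3 (decision): hard threshold.
            if score > accept_at:
                matches.append((i, j))
    return matches


# Hypothetical titles: the second one contains a typo ("Galxy").
titles = [
    "Samsung Galaxy S20 FE 5G Cloud Navy",
    "Samsung Galxy S20 FE 5G Cloud Navy",
    "Apple iPhone 13 128GB",
]
matches = match_pairs(titles)
print(matches)
</code></pre>

<p>The unrelated title is discarded in Stage 1 without ever computing its n-gram vector similarity, while the misspelled title survives blocking and is confirmed by the more precise score.</p>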



<h3 class="wp-block-heading">Relevant Search Terms &amp; Papers</h3>



<p>Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. </p>



<p>However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper (<a href="https://dl.acm.org/doi/pdf/10.1145/1559845.1559870" target="_blank" rel="noopener">https://dl.acm.org/doi/pdf/10.1145/1559845.1559870</a>), the authors propose an iterative blocking framework in which the ER results of one block are propagated to subsequently processed blocks. Blocks are processed iteratively until no block contains any more matching records.</p>



<p>Compared to simple blocking, iterative blocking may achieve higher accuracy because reflecting the ER results of blocks to other blocks may generate additional record matches. Iterative blocking may also be more efficient because processing a block now saves the processing time for other blocks.</p>



<ul class="wp-block-list">
<li><strong>&#8220;Short Text Clustering for E-commerce&#8221;</strong></li>



<li><strong>&#8220;Product Entity Resolution with Noise&#8221;</strong></li>



<li><strong>&#8220;Comparison of Jaccard and Cosine Similarity in Text Mining&#8221;</strong></li>



<li><strong>&#8220;Blocking techniques for Entity Resolution&#8221;</strong> (This is crucial for scaling to thousands/millions of products).</li>
</ul>



<p><strong>Specific types of papers to look for:</strong></p>



<ol start="1" class="wp-block-list">
<li><em>Koporec et al.</em>: Papers on combining Jaccard with other metrics. <a href="https://www.sciencedirect.com/science/article/pii/S1364815225002981" target="_blank" rel="noopener">https://www.sciencedirect.com/science/article/pii/S1364815225002981</a></li>



<li><em>Ganesan et al.</em>: Research on &#8220;abstractive summarization&#8221; or short-text clustering in retail.</li>
</ol>



<h3 class="wp-block-heading"><strong>FAQ: Product Grouping in Noisy E-Commerce Datasets</strong></h3>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>1. What does “noisy data” mean in e-commerce?</strong></summary>
<p>Noisy data refers to inconsistencies, errors, or irrelevant information in product listings—such as misspellings, incomplete descriptions, or duplicate entries—that make grouping products challenging.</p>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>2. Why is product grouping important?</strong></summary>
<p>Grouping similar products improves search accuracy, recommendation quality, and overall user experience. It also helps businesses manage inventory and pricing strategies more effectively.</p>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>3. What are common challenges in product grouping?</strong></summary>
<ul class="wp-block-list">
<li>Variations in product names and descriptions</li>



<li>Missing or incorrect attributes</li>



<li>Multiple languages or regional differences</li>



<li>Inconsistent categorization by sellers</li>
</ul>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>4. Which techniques are used to handle noisy datasets?</strong></summary>
<ul class="wp-block-list">
<li><strong>Text normalization</strong> (removing special characters, standardizing case)</li>



<li><strong>Tokenization and similarity measures</strong> (e.g., cosine similarity, Jaccard index)</li>



<li><strong>Machine learning models</strong> for clustering and classification</li>



<li><strong>Attribute-based matching</strong> using structured data</li>
</ul>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>5. Can AI improve product grouping accuracy?</strong></summary>
<p>Yes. AI models like BERT or domain-specific embeddings can capture semantic meaning in product descriptions, making grouping more accurate even with noisy data.</p>
</details>



<details class="wp-block-details is-layout-flow wp-block-details-is-layout-flow"><summary><strong>6. How do I start implementing product grouping?</strong></summary>
<p>Begin with data cleaning and normalization, then apply similarity-based clustering or train a supervised model if labeled data is available.</p>
</details>



<p></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/product-grouping-in-noisy-e-commerce-datasets">Product Grouping in Noisy E-commerce Datasets</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to check virtual environments</title>
		<link>https://mietwood.com/virtual-environment</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 19:19:18 +0000</pubDate>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3321</guid>

					<description><![CDATA[<p>You can check virtual environments using a simple for loop in your terminal. This command finds all subdirectories in a specified parent folder, assumes each is a virtual environment, and then uses that environment&#8217;s pip to check for lifelines. How it works: ## For Windows 🪟 You can use a similar loop in either Command...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/virtual-environment">How to check virtual environments</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>You can check which of your virtual environments have a given package installed using a simple <code>for</code> loop in your terminal. The command below walks all subdirectories of a parent folder, treats each one that contains a <code>pip</code> executable as a virtual environment, and then uses that environment&#8217;s <code>pip</code> to check for <code>lifelines</code>.</p>



<ol start="1" class="wp-block-list">
<li><strong>Navigate</strong> to the folder that contains all your virtual environments.</li>



<li><strong>Run the following command</strong> (Bash): <code>for venv in */ ; do if [ -f "${venv}bin/pip" ]; then echo "--- Checking in '${venv%?}' ---"; "${venv}bin/pip" list | grep 'lifelines'; fi; done</code></li>
</ol>



<p><strong>How it works:</strong></p>



<ul class="wp-block-list">
<li>It loops through each subdirectory (e.g., <code>my_project_venv/</code>).</li>



<li>It checks if a <code>pip</code> executable exists inside the <code>bin</code> folder to confirm it&#8217;s likely a venv.</li>



<li>It then runs the <code>pip list</code> command from <em>within</em> that specific environment and uses <code>grep</code> to filter for the line containing &#8220;lifelines&#8221;.</li>



<li>If <code>lifelines</code> is installed, it will print the package name and its version. If not, it will print nothing for that environment.</li>
</ul>
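
<p>The same check can be sketched in Python so that one script works on Linux, macOS, and Windows. The function names are hypothetical, and the script assumes the standard venv layout (<code>bin/pip</code> or <code>Scripts\pip.exe</code>). It uses <code>pip show</code>, which exits with a non-zero status when the package is not installed.</p>

<pre class="wp-block-code"><code>import subprocess
import sys
from pathlib import Path


def find_venv_pips(parent):
    """Return pip executables of subdirectories that look like venvs."""
    sub = "Scripts/pip.exe" if sys.platform == "win32" else "bin/pip"
    return sorted(p / sub for p in Path(parent).iterdir() if (p / sub).is_file())


def has_package(pip_path, package):
    """Ask that environment's own pip whether the package is installed."""
    result = subprocess.run(
        [str(pip_path), "show", package], capture_output=True, text=True
    )
    return result.returncode == 0


def report(parent, package):
    """Print, per environment, whether the package is installed."""
    for pip_path in find_venv_pips(parent):
        status = "installed" if has_package(pip_path, package) else "not installed"
        print(pip_path.parent.parent.name, status)
</code></pre>

<p>Calling <code>report(".", "lifelines")</code> from the folder that contains your virtual environments prints one line per environment.</p>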



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">For Windows 🪟</h3>



<p>You can use a similar loop in either Command Prompt (CMD) or PowerShell.</p>



<h4 class="wp-block-heading">In PowerShell:</h4>



<ol start="1" class="wp-block-list">
<li><strong>Open PowerShell</strong> and navigate to the folder containing your virtual environments.</li>



<li><strong>Run the following command</strong> (PowerShell): <code>Get-ChildItem -Directory | ForEach-Object { $pipPath = Join-Path $_.FullName "Scripts\pip.exe"; if (Test-Path $pipPath) { Write-Host "--- Checking in '$($_.Name)' ---"; &amp; $pipPath list | findstr "lifelines" } }</code></li>
</ol>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>Get-ChildItem -Directory | ForEach-Object {
    $pipPath = Join-Path $_.FullName "Scripts\pip.exe"
    if (Test-Path $pipPath) {
        Write-Host "--- Checking in '$($_.Name)' ---"
        &amp; $pipPath list | findstr "lifelines"
    }
}</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">Get</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">ChildItem</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Directory</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">|</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">ForEach</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Object</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">$pipPath</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Join</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Path</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">$_</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">FullName</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Scripts</span><span style="color: #EBCB8B">\p</span><span style="color: #A3BE8C">ip.exe</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">Test</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Path</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">$pipPath</span><span style="color: #D8DEE9FF">) </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">Write</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Host</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">--- Checking in &#39;$($_.Name)&#39; ---</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">&amp;</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">$pipPath</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">list</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">|</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">findstr</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">lifelines</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #ECEFF4">}</span></span></code></pre></div>



<p>Inspection results</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="969" height="458" src="https://mietwood.com/wp-content/uploads/2025/09/image-6.jpg" alt="" class="wp-image-3322" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-6.jpg 969w, https://mietwood.com/wp-content/uploads/2025/09/image-6-300x142.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-6-768x363.jpg 768w" sizes="(max-width: 969px) 100vw, 969px" /></figure>
<p>The post <a rel="nofollow" href="https://mietwood.com/virtual-environment">How to check virtual environments</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What are independent variables?</title>
		<link>https://mietwood.com/independent-variables</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 18:44:48 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3316</guid>

					<description><![CDATA[<p>Independent variables, also called predictors, features, or explanatory variables, are the variables in a statistical or machine learning model that are used to explain or predict changes in another variable — the dependent variable, also called the outcome or target. Independent variables in simple terms: Example of independent variables in Customer Management (RFM Model): Suppose...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/independent-variables">What are independent variables?</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Independent variables</strong>, also called <strong>predictors</strong>, <strong>features</strong>, or <strong>explanatory variables</strong>, are the variables in a statistical or machine learning model that are used to <strong>explain or predict</strong> changes in another variable — the <strong>dependent variable</strong>, also called the outcome or target.</p>



<h3 class="wp-block-heading" id="insimpleterms"><strong>Independent variables</strong> in simple terms:</h3>



<ul class="wp-block-list">
<li><strong>Independent variables</strong>: Inputs you control or observe.</li>



<li><strong>Dependent variable</strong>: Output you want to understand or predict.</li>
</ul>



<h3 class="wp-block-heading" id="exampleincustomermanagementrfmmodel">Example of <strong>independent variables</strong> in Customer Management (RFM Model):</h3>



<p>Suppose you&#8217;re analyzing customer behavior to predict <strong>churn</strong> (whether a customer will stop buying).</p>



<ul class="wp-block-list">
<li><strong>Independent variables</strong>:
<ul class="wp-block-list">
<li><strong>Recency</strong>: How recently a customer made a purchase.</li>



<li><strong>Frequency</strong>: How often they purchase.</li>



<li><strong>Monetary</strong>: How much they spend.</li>
</ul>
</li>



<li><strong>Dependent variable</strong>:
<ul class="wp-block-list">
<li><strong>Churn</strong>: 1 if the customer churned, 0 if they stayed.</li>
</ul>
</li>
</ul>



<p>In this case, <strong>Recency</strong>, <strong>Frequency</strong>, and <strong>Monetary</strong> are independent variables used to predict the likelihood of <strong>churn</strong>. See also <a href="https://mietwood.com/python-for-business-analytics-2">Python for business analytics &#8211; RFM analysis</a>.</p>



<h3 class="wp-block-heading" id="whycheckforindependenceamongindependentvariables">Why check for independence among independent variables?</h3>



<p>If an independent variable is <strong>highly correlated with other independent variables</strong>, i.e., it is not truly independent, the result is <strong>multicollinearity</strong>, which makes model coefficients unstable, reduces interpretability, and can lead to misleading conclusions.</p>



<h2 class="wp-block-heading"><strong>Multicollinearity</strong></h2>



<p><strong>Multicollinearity</strong> refers to a statistical phenomenon in which two or more independent variables in a regression model are highly correlated. This makes it difficult to determine the individual effect of each variable on the dependent variable because they essentially carry overlapping information.</p>



<h3 class="wp-block-heading"><strong>Assessment of Multicollinearity</strong></h3>



<p>To assess multicollinearity, you can use the following methods:</p>



<ol class="wp-block-list">
<li><strong>Correlation Matrix</strong>
<ul class="wp-block-list">
<li>Check pairwise correlations between independent variables.</li>



<li>High correlation (e.g., > 0.8 or &lt; -0.8) may indicate multicollinearity.</li>
</ul>
</li>



<li><strong>Variance Inflation Factor (VIF)</strong>
<ul class="wp-block-list">
<li>Measures how much the variance of a regression coefficient is inflated due to multicollinearity.</li>



<li>A <strong>VIF above 5</strong> (or above 10, under a more lenient rule of thumb) is often considered problematic.</li>
</ul>
</li>



<li><strong>Tolerance</strong>
<ul class="wp-block-list">
<li>Tolerance = 1 / VIF.</li>



<li>Low tolerance values (close to 0) indicate high multicollinearity.</li>
</ul>
</li>



<li><strong>Condition Index and Eigenvalues</strong>
<ul class="wp-block-list">
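<li>A more advanced diagnostic based on the eigenvalues of the scaled predictor matrix: each condition index is the square root of the ratio of the largest eigenvalue to a given eigenvalue.</li>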



<li>A <strong>condition index > 30</strong> may suggest serious multicollinearity.</li>
</ul>
</li>
</ol>
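<p>The first three diagnostics above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy numbers are made up (they are not the dataset used later in this post), the helper names <code>pearson</code>, <code>invert</code> and <code>vif_values</code> are hypothetical, and the VIF is read from the diagonal of the inverted correlation matrix, which matches the usual regression-based definition when the model includes an intercept.</p>

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def invert(matrix):
    """Invert a small, non-singular square matrix (Gauss-Jordan, partial pivoting)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_values(columns):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(columns))]


# Hypothetical toy data: recency and freq deliberately move in opposite directions
data = {
    "recency":  [5, 40, 12, 90, 30, 60, 3, 75],
    "freq":     [12, 3, 9, 1, 5, 2, 15, 1],
    "monetary": [300, 120, 80, 150, 400, 60, 220, 90],
}
names = list(data)
vifs = dict(zip(names, vif_values([data[name] for name in names])))

for name, vif in vifs.items():
    # Tolerance is simply the reciprocal of the VIF
    print(f"{name}: VIF={vif:.3f}, tolerance={1 / vif:.3f}")
```

<p>For real work, a tested library routine such as the statsmodels function used later in this post is the safer choice; the sketch only shows what the numbers mean.</p>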



<h3 class="wp-block-heading"><strong>How to deal with multicollinearity?</strong></h3>



<ul class="wp-block-list">
<li><strong>Remove one of the correlated variables.</strong></li>



<li><strong>Combine variables</strong> (e.g., using PCA or creating an index).</li>



<li><strong>Regularization techniques</strong> like Ridge or Lasso regression.</li>



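<li><strong>Centering variables</strong> (subtracting the mean) can help when the collinearity comes from polynomial or interaction terms; it does not remove correlation between distinct variables.</li>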
</ul>
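<p>To see when centering helps, here is a minimal sketch with a made-up predictor <code>x</code> and its square. The data and the <code>pearson</code> helper are assumptions for illustration only. Before centering, <code>x</code> and <code>x**2</code> are almost perfectly correlated; after subtracting the mean, the correlation disappears because the centered values are symmetric around zero.</p>

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical predictor: the integers 1..10
x = [float(v) for v in range(1, 11)]
x_squared = [v ** 2 for v in x]

# Center x first, then build the squared term from the centered values
mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]
x_centered_squared = [v ** 2 for v in x_centered]

r_raw = pearson(x, x_squared)
r_centered = pearson(x_centered, x_centered_squared)

print(f"corr(x, x^2) before centering: {r_raw:.3f}")
print(f"corr(x, x^2) after centering:  {r_centered:.3f}")
```

<p>Centering would not change the correlation between two different columns such as recency and frequency, so it is not a general cure.</p>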



<h2 class="wp-block-heading">Calculation example</h2>



<p>Assume you have data similar to this sample.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="488" height="337" src="https://mietwood.com/wp-content/uploads/2025/09/image-4.jpg" alt="" class="wp-image-3317" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-4.jpg 488w, https://mietwood.com/wp-content/uploads/2025/09/image-4-300x207.jpg 300w" sizes="(max-width: 488px) 100vw, 488px" /><figcaption class="wp-element-caption">RFM data sample &#8211; for testing independent variables </figcaption></figure>
</div>


<h2 class="wp-block-heading">Variable independence testing</h2>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd

# Load the data with the specified delimiter
df = pd.read_csv("RFM_analysis_614.csv", delimiter=",")

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Select the independent variables (RFM)
X = df[&#91;'recency', 'freq', 'monetary'&#93;]

# Add a constant for the VIF calculation (required by the statsmodels function)
X = add_constant(X)

# Create a DataFrame to hold the VIF results
vif_data = pd.DataFrame()
vif_data&#91;"Variable"&#93; = X.columns
vif_data&#91;"VIF"&#93; = [variance_inflation_factor(X.values, i) for i in range(X.shape&#91;1&#93;)]

# Exclude the constant row from the final output since it's not a true variable
vif_data = vif_data&#91;vif_data.Variable != 'const'&#93;.reset_index(drop=True)

print(vif_data)

# Save the VIF results to a CSV file
vif_data.to_csv("vif_results.csv", index=False)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Load</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">specified</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">delimiter</span></span>
<span class="line"><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">read_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">RFM_analysis_614.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">delimiter</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">,</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">stats</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">outliers_influence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">variance_inflation_factor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">tools</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">tools</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">add_constant</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Select</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">independent</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">variables</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">RFM</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">[&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">recency</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">freq</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">monetary</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Add</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">constant</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">calculation</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">required</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">add_constant</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">hold</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">results</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Variable</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93; = </span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">columns</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">VIF</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93; = [</span><span style="color: #8FBCBB">variance_inflation_factor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">values</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">range</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">shape</span><span style="color: #D8DEE9FF">&#91;1&#93;)]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Exclude</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">constant</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">final</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">output</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">since</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">it</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">s not a true variabl</span><span style="color: #D8DEE9">e</span></span>
<span class="line"><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">vif_data</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">Variable</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">!=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">const</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">reset_index</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">drop</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Save</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">results</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">CSV</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">file</span></span>
<span class="line"><span style="color: #D8DEE9">vif_data</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">to_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">vif_results.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">False</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p>Finally, the program prints the following results:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="179" height="85" src="https://mietwood.com/wp-content/uploads/2025/09/image-5.jpg" alt="" class="wp-image-3318"/></figure>
</div>


<p>Based on the Variance Inflation Factor (VIF) calculation, the columns <strong>recency</strong>, <strong>freq</strong> (frequency), and <strong>monetary</strong> show <strong>no problematic multicollinearity</strong>: none of them is strongly linearly predictable from the other two, so they carry <strong>little overlapping information</strong>.</p>



<p>This means that you can use all three variables together as independent predictors in a statistical model, such as a Cox Proportional Hazards (Cox PH) model, without concern for severe multicollinearity.</p>



<h2 class="wp-block-heading">Variance Inflation Factor (VIF) Results</h2>



<p>The <a href="https://en.wikipedia.org/wiki/Variance_inflation_factor" target="_blank" rel="noopener">Variance Inflation Factor (VIF)</a> measures how much the variance of an estimated regression coefficient is increased due to collinearity. A common rule of thumb is that a VIF value <strong>less than 5</strong> (or, more leniently, less than 10) indicates that the correlation between the variables is not high enough to warrant concern.</p>



<p>The calculated VIF values for your RFM variables are very low:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Variable</td><td>VIF</td></tr></thead><tbody><tr><td><strong>recency</strong></td><td>1.437</td></tr><tr><td><strong>freq</strong></td><td>1.488</td></tr><tr><td><strong>monetary</strong></td><td>1.065</td></tr></tbody></table></figure>
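<p>These values are easier to read once converted back to tolerance and R². Since VIF = 1 / (1 &#8211; R²), where R² comes from regressing one predictor on the others, the table above can be translated with a short sketch like this one (the variable names are illustrative; the VIF values are copied from the table):</p>

```python
# VIF values copied from the results table above
vif = {"recency": 1.437, "freq": 1.488, "monetary": 1.065}

summary = {}
for name, value in vif.items():
    tolerance = 1 / value        # Tolerance = 1 / VIF
    r_squared = 1 - tolerance    # share of variance explained by the other predictors
    summary[name] = (round(tolerance, 3), round(r_squared, 3))
    print(f"{name}: tolerance={tolerance:.3f}, R^2={r_squared:.3f}")
```

<p>Even for <strong>freq</strong>, the variable with the highest VIF, the other two predictors explain only about a third of its variance (R² of roughly 0.33).</p>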






<h2 class="wp-block-heading">Conclusion on Independence</h2>



<p>Since all VIF values are close to 1.0 and well below the 5.0 threshold:</p>



<ul class="wp-block-list">
<li><strong>Independent Variables:</strong> You can confidently treat <strong>recency</strong>, <strong>frequency</strong>, and <strong>monetary</strong> as independent variables for your statistical analysis (e.g., in a Cox PH model).</li>



<li><strong>No Informational Overlap:</strong> The variables are providing distinct, non-redundant information to the model. For instance, knowing a customer&#8217;s frequency does not allow the model to strongly predict their recency or monetary value.</li>
</ul>



<p>The post <a rel="nofollow" href="https://mietwood.com/independent-variables">What are independent variables?</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Top IDEs for Python developers</title>
		<link>https://mietwood.com/ides-for-python-developers</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 09:11:31 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3312</guid>

					<description><![CDATA[<p>Here are the top graphical IDEs (Integrated Development Environments) for Python developers as of 2024: 1.&#160;PyCharm 2.&#160;Visual Studio Code (VS Code) 3.&#160;Spyder 4.&#160;Thonny 5.&#160;Wing IDE 6.&#160;Eric 7.&#160;IDLE Summary: For professional development,&#160;PyCharm&#160;and&#160;VS Code&#160;are the most popular. For data science,&#160;Spyder&#160;is widely used. For beginners,&#160;Thonny&#160;or&#160;IDLE&#160;are great choices.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/ides-for-python-developers">Top IDEs for Python developers</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Here are the top graphical IDEs (Integrated Development Environments) for Python developers</strong> as of 2024:</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">1.&nbsp;<strong>PyCharm</strong></h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="421" src="https://mietwood.com/wp-content/uploads/2025/09/image-3.jpg" alt="" class="wp-image-3313" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-3.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/09/image-3-300x123.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-3-768x316.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<ul class="wp-block-list">
<li><strong>Developer:</strong> JetBrains</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Advanced code analysis, smart code completion, integrated debugger, Git support, virtual environment management, Django support.</li>



<li><strong>Editions:</strong> Community (free) and Professional (paid).</li>



<li><strong>Website:</strong> <a href="https://www.jetbrains.com/pycharm/" target="_blank" rel="noreferrer noopener">PyCharm</a></li>
</ul>



<h3 class="wp-block-heading">2.&nbsp;<strong>Visual Studio Code (VS Code)</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Microsoft</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Lightweight but powerful; excellent Python extension; integrated terminal; rich plugin ecosystem; Git support; Jupyter notebook integration.</li>



<li><strong>Website:</strong> <a href="https://code.visualstudio.com/" target="_blank" rel="noreferrer noopener">VS Code</a></li>
</ul>



<h3 class="wp-block-heading">3.&nbsp;<strong>Spyder</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Spyder project contributors (Spyder stands for Scientific Python Development Environment)</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Focused on scientific computing and data science; variable explorer; integrated IPython console; plotting support.</li>



<li><strong>Website:</strong> <a href="https://www.spyder-ide.org/" target="_blank" rel="noreferrer noopener">Spyder</a></li>
</ul>



<h3 class="wp-block-heading">4.&nbsp;<strong>Thonny</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> University of Tartu</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Beginner-friendly; simple UI; built-in debugger; good for learning and education.</li>



<li><strong>Website:</strong> <a href="https://thonny.org/" target="_blank" rel="noreferrer noopener">Thonny</a></li>
</ul>



<h3 class="wp-block-heading">5.&nbsp;<strong>Wing IDE</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Wingware</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Powerful debugger; code intelligence; remote development support.</li>



<li><strong>Website:</strong> <a href="https://wingware.com/" target="_blank" rel="noreferrer noopener">Wing IDE</a></li>
</ul>



<h3 class="wp-block-heading">6.&nbsp;<strong>Eric</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Detlev Offenbach</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Full-featured Python and Ruby IDE; integrated debugger; plugin support.</li>



<li><strong>Website:</strong> <a href="https://eric-ide.python-projects.org/" target="_blank" rel="noreferrer noopener">Eric Python IDE</a></li>
</ul>



<h3 class="wp-block-heading">7.&nbsp;<strong>IDLE</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Python Software Foundation (bundled with Python)</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Basic, lightweight, good for quick scripts and learning.</li>



<li><strong>Website:</strong> <a href="https://docs.python.org/3/library/idle.html" target="_blank" rel="noreferrer noopener">IDLE Documentation</a></li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Summary:</strong><br>For professional development,&nbsp;<strong>PyCharm</strong>&nbsp;and&nbsp;<strong>VS Code</strong>&nbsp;are the most popular. For data science,&nbsp;<strong>Spyder</strong>&nbsp;is widely used. For beginners,&nbsp;<strong>Thonny</strong>&nbsp;or&nbsp;<strong>IDLE</strong>&nbsp;are great choices.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/ides-for-python-developers">Top IDEs for Python developers</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Scrape a Website and Search Inside PDFs with Python</title>
		<link>https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sat, 30 Aug 2025 09:13:30 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3292</guid>

					<description><![CDATA[<p>Ever found yourself on a webpage with dozens of PDF links, needing to find a specific piece of information buried in one of them? 😩 We will teach you how to scrape a website and search inside PDFs with Python. Manually downloading and searching each file is tedious, time-consuming, and prone to errors. What if...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python">How to Scrape a Website and Search Inside PDFs with Python</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Ever found yourself on a webpage with dozens of PDF links, needing to find a specific piece of information buried in one of them? 😩 We will teach you how to scrape a website and search inside PDFs with Python. Manually downloading and searching each file is tedious, time-consuming, and prone to errors. What if you could automate the entire process with just a few lines of code?</p>



<p>In this tutorial, we&#8217;ll show you exactly how to do that. We’ll build a powerful yet simple Python script that automatically scans a webpage, finds all the PDF links, and searches for specific text inside each one. Using popular libraries like <strong>Requests</strong>, <strong>BeautifulSoup</strong>, and <strong>PyPDF</strong>, you&#8217;ll learn a practical skill that can save you hours of manual work. Let&#8217;s get started!</p>






<h2 class="wp-block-heading">Scrape a Website</h2>



<p>The script works in the following steps:</p>



<ol class="wp-block-list">
<li><strong>Find</strong> all links on the initial page.</li>



<li><strong>Filter</strong> for links that end with <code>.pdf</code>.</li>



<li>For each PDF link, <strong>download</strong> the PDF file into memory, <strong>extract</strong> text from every page, and <strong>search</strong> the extracted text for your <code>search_string</code>.</li>



<li><strong>Report</strong> which PDF files contain the phrase.</li>
</ol>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader
import io

def find_linked_pdfs(url):
    """
    Scans a webpage and returns the links that point to PDF files.

    Args:
        url: The URL of the webpage to scan.

    Returns:
        A list of PDF links (empty if none are found or the request fails).
    """
    print(f"Scanning {url} for PDF links...")
    try:
        # 1. Get the main page to find all links
        base_url_parts = requests.utils.urlparse(url)
        base_url = f"{base_url_parts.scheme}://{base_url_parts.netloc}"

        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # 2. Keep only the links that end with .pdf
        pdf_links = [a&#91;'href'&#93;
                     for a in soup.find_all('a', href=True)
                     if a&#91;'href'&#93;.endswith('.pdf')]

        if not pdf_links:
            print("No PDF links found on the page.")
            return []

        print(f"Found {len(pdf_links)} PDF files. Now searching inside them...")
    except requests.exceptions.RequestException as e:
        # Network problems and HTTP error codes end up here
        print(f"Could not fetch {url}: {e}")
        return []

    return pdf_links</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">requests</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bs4</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BeautifulSoup</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pypdf</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">PdfReader</span></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">io</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">find_linked_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">url</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    Scans a webpage for PDF links and searches for a string within each PDF</span><span style="color: #D8DEE9">.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">Args</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">url</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">The</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">URL</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">of</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">webpage</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">scan</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">search_string</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">The</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">string</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">search</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">inside</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">PDFs</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    print(f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #8FBCBB">Scanning</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span><span style="color: #8FBCBB">url</span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">links</span><span style="color: #D8DEE9FF">...</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">try</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        # 1. </span><span style="color: #8FBCBB">Get</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">main</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">page</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">find</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">all</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">links</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">base_url_parts</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">requests</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">utils</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">urlparse</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">base_url</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">{base_url_parts.scheme}://{base_url_parts.netloc}</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">requests</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">raise_for_status</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">soup</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BeautifulSoup</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">text</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">html.parser</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">pdf_links</span><span style="color: #D8DEE9FF"> = [</span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">href</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">soup</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">find_all</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">a</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">href</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">) \</span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">href</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;.</span><span style="color: #8FBCBB">endswith</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">.pdf</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)]</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pdf_links</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">No PDF links found on the page.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #8FBCBB">return</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Found {len(pdf_links)} PDF files. Now searching inside them...</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pdf_links</span></span></code></pre></div>



<p><strong>Handling URLs</strong>: Many links on a page are relative (e.g., <code>/path/to/file.pdf</code>), so the script builds a full, absolute URL for each PDF before downloading it. The links themselves are extracted with BeautifulSoup: <a href="https://pypi.org/project/beautifulsoup4/" target="_blank" rel="noopener">https://pypi.org/project/beautifulsoup4/</a></p>
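
<p>As a minimal sketch of the link-resolution step, the snippet below uses the standard library&#8217;s <code>urljoin</code> to turn relative links into absolute ones. The helper name <code>resolve_pdf_links</code> and the <code>example.com</code> addresses are hypothetical and only serve the illustration.</p>

```python
from urllib.parse import urljoin


def resolve_pdf_links(page_url, hrefs):
    """Return absolute URLs for every href that points at a PDF."""
    absolute = []
    for href in hrefs:
        # Ignore any fragment or query string and compare case-insensitively.
        path = href.split("#", 1)[0].split("?", 1)[0]
        if path.lower().endswith(".pdf"):
            absolute.append(urljoin(page_url, href))
    return absolute


# Hypothetical page and links, for illustration only.
page = "https://example.com/docs/index.html"
links = ["/files/a.pdf", "report.PDF", "https://cdn.example.com/b.pdf?v=2", "about.html"]
print(resolve_pdf_links(page, links))
```

<p>Resolving against the full page URL rather than only the scheme and host also covers links that are relative to the current directory, such as <code>report.PDF</code> above.</p>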



<p><strong>In-Memory Processing</strong>: Instead of saving each PDF to your disk, it uses <code>io.BytesIO</code> to treat the downloaded content as a file in your computer&#8217;s memory. This avoids writing temporary files and cleaning them up afterwards.</p>
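
<p>A URL ending in <code>.pdf</code> can still return an HTML error page, so it may be worth checking the bytes before parsing them. The sketch below wraps the payload in <code>io.BytesIO</code> only when it starts with the <code>%PDF-</code> marker. The helper <code>open_pdf_buffer</code> and the sample payloads are hypothetical, and the check assumes the marker sits at the very start of the file, which is the common case.</p>

```python
import io


def open_pdf_buffer(data):
    """Wrap downloaded bytes in a file-like buffer, or return None if they are not a PDF."""
    # PDF files begin with the marker "%PDF-"; an HTML error page does not.
    if not data.startswith(b"%PDF-"):
        return None
    return io.BytesIO(data)


# Hypothetical payloads standing in for the downloaded response content.
good = b"%PDF-1.7\n..."
bad = b"<html><body>404 Not Found</body></html>"

buffer = open_pdf_buffer(good)
print(buffer.read(8))        # b'%PDF-1.7'
print(open_pdf_buffer(bad))  # None
```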



<p><strong>Text Extraction</strong>: The <code>pypdf</code> library&#8217;s <code>PdfReader</code> opens this in-memory file. The script then loops through each page, calls <code>extract_text()</code>, and combines the text from all pages.</p>



<p><strong>Searching and Reporting</strong>: Finally, it performs a case-insensitive search on the extracted text and prints the URL of any PDF that contains your search term.</p>
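
<p>Text extracted from a PDF often contains line breaks and repeated spaces in the middle of a phrase, which can make a plain substring check miss a match. The following is one possible refinement, not part of the original script: a hypothetical helper <code>contains_term</code> that collapses whitespace before doing the case-insensitive comparison.</p>

```python
import re


def contains_term(text, term):
    """Case-insensitive search that ignores differences in whitespace."""
    def normalise(value):
        # Collapse runs of spaces and line breaks, then fold case.
        return re.sub(r"\s+", " ", value).strip().casefold()

    return normalise(term) in normalise(text)


# Sample text imitating the line breaks a PDF extractor might produce.
sample = "Introduction to\nObject-Oriented   PROGRAMMING\nin Python"
print(contains_term(sample, "oriented programming"))  # True
print(contains_term(sample, "databases"))             # False
```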



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Search inside PDFs
def find_text_in_pdfs(pdf_links, search_string, base_url=""):
    # base_url is only needed when pdf_links contains relative links

    # 2. Loop through each PDF link
    found_in_files = []
    for pdf_path in pdf_links or []:
        # Construct an absolute URL if the link is relative
        pdf_url = requests.compat.urljoin(base_url, pdf_path)

        try:
            # 3. Download the PDF content
            pdf_response = requests.get(pdf_url, timeout=30)
            pdf_response.raise_for_status()

            # Use an in-memory buffer to read the PDF
            pdf_file = io.BytesIO(pdf_response.content)
            reader = PdfReader(pdf_file)

            # 4. Extract text and search
            full_text = ""
            for page in reader.pages:
                full_text += page.extract_text() or ""

            if search_string.lower() in full_text.lower():
                print(f"✔️ Found '{search_string}' in: {pdf_url}")
                found_in_files.append(pdf_url)

        except Exception as e:
            print(f"⚠️ Could not process {pdf_url}. Reason: {e}")

    if not found_in_files:
        print(f"\nSearch complete. The string '{search_string}' was not found in any of the PDFs.")

    return found_in_files
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># Search inside PDFs</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_text_in_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_links</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_string</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">base_url</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;&quot;</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    # base_url is only needed when pdf_links contains relative links</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #B48EAD">2.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Loop</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">through</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">each</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">link</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">found_in_files</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> pdf_links </span><span style="color: #D8DEE9">or</span><span style="color: #D8DEE9FF"> []</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        # Construct an absolute URL if the link is relative</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">requests</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">compat</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">urljoin</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">base_url</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">try</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            # </span><span style="color: #B48EAD">3.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Download</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">content</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">pdf_response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">requests</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_url</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">timeout</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">30</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">pdf_response</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">raise_for_status</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            # </span><span style="color: #D8DEE9">Use</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">an</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in-</span><span style="color: #D8DEE9">memory</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">buffer</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">read</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">pdf_file</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">io</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">BytesIO</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_response</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">content</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">reader</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">PdfReader</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_file</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            # </span><span style="color: #B48EAD">4.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Extract</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">full_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">page</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reader</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">pages</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">full_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">+=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">page</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">extract_text</span><span style="color: #D8DEE9FF">() </span><span style="color: #D8DEE9">or</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_string</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">() </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">full_text</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">✔️ Found &#39;{search_string}&#39; in: {pdf_url}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">found_in_files</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">except</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Exception</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">⚠️ Could not process {pdf_url}. Reason: {e}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">not</span><span style="color: #D8DEE9FF"> found_in_files</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Search complete. The string &#39;{search_string}&#39; was not found in any of the PDFs.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">found_in_files</span></span>
<span class="line"></span></code></pre></div>



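<p><strong>Running the Script</strong>: The entry point below sets the page to scan and the search term, then calls the two functions in order.</p>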



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>if __name__ == "__main__":
    target_url = "https://www.umcs.pl/pl/plany-zajec,10795.htm"
    search_term = "programming"
    pdf_links = find_linked_pdfs(target_url)
    find_text_in_pdfs(pdf_links, search_term)
    </textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__name__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">==</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">__main__</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">target_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">https://www.umcs.pl/pl/plany-zajec,10795.htm</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">search_term</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">programming</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">pdf_links</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_linked_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">target_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">find_text_in_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_links</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_term</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span></span></code></pre></div>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="769" height="255" src="https://mietwood.com/wp-content/uploads/2025/08/image-8.jpg" alt="Scrape a Website and Search Inside PDFs with Python" class="wp-image-3293" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-8.jpg 769w, https://mietwood.com/wp-content/uploads/2025/08/image-8-300x99.jpg 300w" sizes="auto, (max-width: 769px) 100vw, 769px" /><figcaption class="wp-element-caption">Scrape a Website and Search Inside PDFs with Python</figcaption></figure>
</div>


<h3 class="wp-block-heading"><strong>Wrapping Up and Next Steps</strong></h3>



<p>Congratulations! You&#8217;ve built an automation script that bridges the gap between web scraping and document analysis. By combining <strong>Requests</strong>, <strong>BeautifulSoup</strong>, and <strong>pypdf</strong>, you can now programmatically find information that was previously locked away inside PDF files linked from a web page. This saves a great deal of manual searching and opens up new possibilities for data collection and analysis. Feel free to adapt the code for your own projects and take your web scraping skills to the next level.</p>



<p>The applications for this technique extend far beyond a single use case. Imagine using this script for <strong>academic research</strong>, automatically scanning university archives for papers mentioning a specific topic. You could adapt it for <strong>financial analysis</strong> by pulling keywords from dozens of quarterly earnings reports, or for <strong>legal work</strong> by searching through court filings for a particular case name. Job seekers could even use it to scan company websites for PDF job descriptions that contain key skills. </p>



<p>To perform a statistical analysis of the overall economy, you can leverage a variety of online resources, including government and intergovernmental data portals, as well as academic publications. These sources often provide data in structured formats like CSVs and APIs, but also in less-structured formats like HTML tables and PDFs, which can be parsed using Python libraries like Beautiful Soup and pypdf.</p>
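
<p>For the structured end of that spectrum, the standard library is often enough. The sketch below reads a small CSV and computes year-over-year growth. The column names, the helper <code>year_over_year_growth</code>, and the figures themselves are made up for illustration and do not come from any of the sources listed in this post.</p>

```python
import csv
import io

# Made-up figures for illustration; these are not real statistics.
RAW_CSV = """year,gdp
2021,100.0
2022,104.0
2023,106.6
"""


def year_over_year_growth(csv_text, value_column):
    """Return {year: percent change from the previous year}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    growth = {}
    # Pair each row with the one that follows it.
    for previous, current in zip(rows, rows[1:]):
        old = float(previous[value_column])
        new = float(current[value_column])
        growth[current["year"]] = round((new - old) / old * 100, 2)
    return growth


print(year_over_year_growth(RAW_CSV, "gdp"))
```

<p>In practice the CSV text would come from a downloaded file rather than an inline string, and the rows are assumed to be sorted by year.</p>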



<h3 class="wp-block-heading"><strong>Government and Intergovernmental Data Sources</strong></h3>



<p>For raw, official economic data, these are your most reliable sources. They offer a wealth of information on everything from GDP and inflation to employment rates and international trade.</p>



<ul class="wp-block-list">
<li><strong>Federal Reserve Economic Data (FRED)</strong>: A fantastic resource from the St. Louis Fed, FRED offers over 800,000 economic time series from more than 100 sources. It&#8217;s a goldmine for anyone doing macroeconomic analysis.</li>



<li><strong>The World Bank Open Data</strong>: This portal provides comprehensive global development data, including indicators on economic policy, poverty, gender, and more, making it perfect for cross-country comparisons.</li>



<li><strong>Data.gov</strong>: The home of U.S. government open data, this site aggregates datasets from various federal agencies, including the Bureau of Economic Analysis (BEA) and the Bureau of Labor Statistics (BLS).</li>



<li><strong>United Nations Statistics Division (UNSD)</strong>: The UNSD offers a wide array of international statistics, including the UNdata portal which provides free access to over 60 million statistical records from various UN agencies.</li>



<li><strong>The Bureau of Economic Analysis (BEA)</strong>: The BEA produces some of the most critical U.S. economic statistics, such as GDP, personal income, and corporate profits.</li>
</ul>



<p>You can read about the business analyst career path <a href="https://mietwood.com/the-allure-of-business-analysis-as-a-career-path">here</a>.</p>



<p>The core principle remains the same: automate the discovery of information, no matter the format. Happy coding! 🚀</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python">How to Scrape a Website and Search Inside PDFs with Python</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Semantic Search</title>
		<link>https://mietwood.com/semantic-search</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Wed, 20 Aug 2025 11:33:03 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3279</guid>

					<description><![CDATA[<p>Since the advent of ChatGPT in November 2022, not a single day goes by without hearing or reading about vector or semantic search. It’s everywhere and so prevalent that we often get the impression this is a new cutting-edge technology. Vector vs lexical search An easy way to introduce vector search is by...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search">Semantic Search</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Since the advent of ChatGPT in November 2022, not a single day goes by without hearing or reading about vector or semantic search. It’s everywhere and so prevalent that we often get the impression this is a new cutting-edge technology.</p>



<h2 class="wp-block-heading">Vector vs lexical search</h2>



<p>An easy way to introduce <strong>vector search</strong> is by comparing it to the more conventional <strong>lexical search</strong> that you’re probably used to. <strong>Vector search, also commonly known as semantic search</strong>, and lexical search work very differently. </p>



<p><strong>Lexical search</strong> is the kind of search that we’ve all been using for years. Briefly, it doesn’t try to understand the real meaning of what is indexed and queried. Instead, it <strong>lexically</strong> matches the literal words the user types in a query, or variants of them produced by stemming or synonym expansion, against the terms that were previously indexed into the database. Relevance is determined by a ranking algorithm such as TF-IDF rather than by similarity of meaning.</p>
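
<p>To make the ranking step concrete, here is a minimal pure-Python sketch of TF-IDF scoring. It assumes whitespace tokenization and a smoothed IDF of <code>log(N / df) + 1</code>. The sample documents and the function names are made up for illustration, and real search engines use more refined variants of this formula.</p>

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Return one TF-IDF relevance score per document for the query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += tf * idf
        scores.append(score)
    return scores


# Illustrative documents, not taken from any real dataset.
docs = ["yellow roses from texas", "red roses", "texas barbecue recipes"]
scores = tf_idf_scores("yellow texas roses", docs)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

<p>The document containing all three query terms ranks first, while documents sharing only one term still match, just with a lower score.</p>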



<p>Documents are tokenized and analyzed. Then, the resulting terms are indexed in an inverted index, which simply maps the analyzed terms to the documents containing them. Searching for “yellow texas roses” will match every document that contains at least one of those terms, each with a different score.</p>
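
<p>The inverted index described above can be sketched in a few lines of Python. This is a toy illustration, assuming whitespace tokenization and a score equal to the number of matching query terms. The documents and helper names are hypothetical.</p>

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return (doc_id, matched_term_count) pairs, best match first."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Illustrative documents only.
docs = [
    "yellow roses from texas",
    "red roses",
    "texas barbecue recipes",
    "blue violets",
]
index = build_inverted_index(docs)
results = search(index, "yellow texas roses")
print(results)
```

<p>Documents 0, 1 and 2 all match because each contains at least one query term, while document 3 is never touched by the lookup.</p>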



<p><strong>Semantic search</strong> &#8211; the whole purpose of semantic search is to index data in such a way that it can be searched based on the meaning it represents.</p>
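
<p>In practice, "meaning" is represented as a vector, and two texts are considered close when their vectors point in a similar direction. The sketch below computes cosine similarity by hand. The three-dimensional vectors are invented toy values standing in for real embeddings, which typically have hundreds of dimensions.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): hand-made stand-ins for model embeddings.
embeddings = {
    "cordless drill": [0.9, 0.1, 0.0],
    "battery screwdriver": [0.8, 0.2, 0.1],
    "garden hose": [0.0, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.05]

ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query_vector, embeddings[name]),
    reverse=True,
)
for name in ranked:
    print(f"{cosine_similarity(query_vector, embeddings[name]):.3f}  {name}")
```

<p>The two power tools score close to 1 against the query vector even though they share no words, while the unrelated item ranks last.</p>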



<h5 class="wp-block-heading">What&#8217;s the difference between semantic search and lexical search?</h5>



<p>Lexical search doesn’t try to understand the real meaning of what is indexed and queried; it matches the literals of the words or their variants. In contrast, vector search indexes data in a way that allows it to be searched based on the meaning it represents.</p>



<p>Read more about vector similarity <a href="https://www.elastic.co/search-labs/blog/introduction-to-vector-search" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/introduction-to-vector-search</a></p>



<h2 class="wp-block-heading">Semantic search with Elasticsearch</h2>



<p>Load the embedding model. See details <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/04-multilingual.ipynb" target="_blank" rel="noopener">here</a> and <a href="https://huggingface.co/intfloat/multilingual-e5-base" target="_blank" rel="noopener">here</a>.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")
VECTOR_DIMENSION = model.get_sentence_embedding_dimension()
print(f"Model loaded successfully. VECTOR_DIMENSION {VECTOR_DIMENSION}")
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sentence_transformers</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SentenceTransformer</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">SentenceTransformer</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">intfloat/multilingual-e5-base</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">VECTOR_DIMENSION</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">get_sentence_embedding_dimension</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Model loaded successfully. VECTOR_DIMENSION {VECTOR_DIMENSION}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>
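
<p>One detail worth knowing about the E5 model family: according to the model card, input texts should start with <code>passage: </code> when they are documents to be indexed and with <code>query: </code> when they are search queries. The indexing code in this post applies the <code>passage: </code> prefix inline. A small pair of helpers, hypothetical names used here only for illustration, keeps the two cases from getting mixed up.</p>

```python
def as_passage(text):
    """Format a document (e.g. a product name) for E5 indexing."""
    return f"passage: {text.strip()}"


def as_query(text):
    """Format a user search string for E5 querying."""
    return f"query: {text.strip()}"


# Usage sketch; `model` is assumed to be the SentenceTransformer loaded above.
# doc_vector = model.encode(as_passage("cordless drill 18v")).tolist()
# query_vector = model.encode(as_query("battery drill")).tolist()
print(as_passage("cordless drill 18v"))
print(as_query("battery drill"))
```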



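<p>Connect to the Elasticsearch server. Elasticsearch listens on port 9200 by default, so adjust <code>ES_HOST</code> to match your own setup.</p>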



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>from elasticsearch import Elasticsearch
ES_HOST = "http://localhost:9000"
es = Elasticsearch(hosts=&#91;ES_HOST&#93;)
print(f"Connection successful: {es.info().body&#91;'cluster_name'&#93;}")
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span></span>
<span class="line"><span style="color: #8FBCBB">ES_HOST</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">http://localhost:9000</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Elasticsearch</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">hosts</span><span style="color: #D8DEE9FF">=&#91;</span><span style="color: #8FBCBB">ES_HOST</span><span style="color: #D8DEE9FF">&#93;)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Connection successful: {es.info().body&#91;&#39;cluster_name&#39;&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>



<p>Index the product data into Elasticsearch</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Fetch products from database
def select_products():
    q = """
        SELECT *
        FROM &#91;DB_Products&#93; 
        where &#91;BrandName&#93; in ('PCE','De Walt')
        """
    dfp = read_from_sql_server(q, odbc_conect='DSN=SQLxx')
    return dfp

# >-------------------------------------------------
df_docs = select_products()
print(df_docs.info())
# -------------------------------------------------</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Fetch</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">database</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">select_products</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">q</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">        SELECT </span><span style="color: #D8DEE9">*</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">DB_Products</span><span style="color: #D8DEE9FF">&#93; </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">where</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">BrandName</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> (</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">PCE</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">De Walt</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    dfp = read_from_sql_server(q, odbc_conect=&#39;DSN=SQLxx&#39;</span><span style="color: #D8DEE9">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">dfp</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">&gt;-------------------------------------------------</span></span>
<span class="line"><span style="color: #D8DEE9">df_docs</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">select_products</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">df_docs</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">info</span><span style="color: #D8DEE9FF">())</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">-------------------------------------------------</span></span></code></pre></div>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Index product names into elastic
index_mapping = {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims":  VECTOR_DIMENSION,
                "index": True,
                "similarity": "cosine"
                },

            "product_name": { 
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256 
                    }
                  }
                },
            "product_name_org": {
                "type": "keyword",
                "ignore_above": 256
                },
            "prodidx": {
                "type": "keyword",
                "ignore_above": 256
                },
            "img_url": {
                "type": "keyword",
                "ignore_above": 512 
                },
            }
        }

# --- 4. Create the Index -------------------------------------------------

INDEX_NAME = 'prod_names_for_search_hybrid'

import time
es.options(ignore_status=&#91;400, 404&#93;).indices.delete(index=INDEX_NAME, ignore_unavailable=True)
time.sleep(3) # Give a moment for index deletion to propagate

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, mappings=index_mapping)
    print("Index created.")
else:
    print("Index already exists.")
# -----------------------------------------------------------------------

# Indexing product data ------------------------------------------------- 
product_names = df_docs&#91;'ProductName'&#93;.apply(lambda x: x.lower()&#91;:256&#93;).to_list()

i = 1
for name in product_names:
    print(f"Indexing:{i} {name}")
    i += 1
    
    vector = model.encode(f"passage: {name}").tolist()
    doc = {
        "Product_name": name,
        "ProductNameVector": vector
    }
    
    es.index(index=INDEX_NAME, document=doc, refresh=True)

# ---------------------------------------------------------------------</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Index</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">names</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elastic</span></span>
<span class="line"><span style="color: #D8DEE9">index_mapping</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">properties</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dense_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">dims</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF">  </span><span style="color: #D8DEE9">VECTOR_DIMENSION</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">index</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">similarity</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">cosine</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">fields</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                  </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ignore_above</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">256</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                  </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name_org</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ignore_above</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">256</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prodidx</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ignore_above</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">256</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">img_url</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">type</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">keyword</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ignore_above</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">512</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">---</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">4.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Index</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">-------------------------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">prod_names_for_search_hybrid</span><span style="color: #ECEFF4">&#39;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">time</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">options</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">ignore_status</span><span style="color: #D8DEE9FF">=&#91;400</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 404&#93;).</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">delete</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">INDEX_NAME</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ignore_unavailable</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">time</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sleep</span><span style="color: #D8DEE9FF">(3) # </span><span style="color: #8FBCBB">Give</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">moment</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">deletion</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">propagate</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">exists</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">INDEX_NAME</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">create</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">INDEX_NAME</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">mappings</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">index_mapping</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Index created.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Index already exists.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF"># -----------------------------------------------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Indexing</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> ------------------------------------------------- </span></span>
<span class="line"><span style="color: #8FBCBB">product_names</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">df_docs</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">ProductName</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;.</span><span style="color: #8FBCBB">apply</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">lambda</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">lower</span><span style="color: #D8DEE9FF">()&#91;:256&#93;).</span><span style="color: #8FBCBB">to_list</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF"> = 1</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">name</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">product_names</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Indexing:{i} {name}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF"> += 1</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">vector</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">encode</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">passage: {name}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">).</span><span style="color: #8FBCBB">tolist</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">doc</span><span style="color: #D8DEE9FF"> = </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &quot;</span><span style="color: #8FBCBB">Product_name</span><span style="color: #D8DEE9FF">&quot;: </span><span style="color: #8FBCBB">name</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &quot;</span><span style="color: #8FBCBB">ProductNameVector</span><span style="color: #D8DEE9FF">&quot;: </span><span style="color: #8FBCBB">vector</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">INDEX_NAME</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">doc</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">refresh</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># ---------------------------------------------------------------------</span></span></code></pre></div>
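<p>The loop above sends one request per product and forces a refresh after every document, which can become slow on larger catalogs. As a rough sketch (not part of the original pipeline), the same documents could be prepared as bulk actions and sent with the Python client's bulk helper. The function name <code>build_actions</code> and its <code>encode</code> argument are illustrative assumptions; the field names and the "passage: " prefix follow the indexing loop above.</p>

```python
from typing import Callable, Iterable, Iterator


def build_actions(
    names: Iterable[str],
    encode: Callable[[str], list],
    index_name: str,
) -> Iterator[dict]:
    """Yield one bulk action per product name (hypothetical helper)."""
    for name in names:
        clean = name.lower()[:256]  # same normalisation as the loop above
        yield {
            "_index": index_name,
            "_source": {
                "Product_name": clean,
                # The "passage: " prefix mirrors the indexing loop above.
                "ProductNameVector": encode(f"passage: {clean}"),
            },
        }


# Usage with a real cluster (assumes `es`, `model`, `df_docs` and INDEX_NAME
# from earlier in the post):
#   from elasticsearch import helpers
#   encode = lambda text: model.encode(text).tolist()
#   helpers.bulk(es, build_actions(df_docs["ProductName"], encode, INDEX_NAME))
#   es.indices.refresh(index=INDEX_NAME)  # refresh once, after all documents
```

<p>Keeping the action builder separate from the client call makes it possible to inspect the generated documents before anything is sent to the cluster.</p>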



<p>You can check your index with a few diagnostic functions that list the indices, count their records, and print the field mappings.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># --- Diagnostic functions  ---------------------------------------
def indices_list():
    return [index&#91;'index'&#93; for index in es.cat.indices(format='json')]   

def count_records(index_name):
    return es.count(index=index_name)&#91;'count'&#93;    
# -----------------------------------------------------------------

# Get indices and record count ------------------------------------
for i in indices_list():
    print(i, count_records(i))

# Get mappings
mapping = es.indices.get_mapping(index=INDEX_NAME)
fields = mapping&#91;INDEX_NAME&#93;&#91;'mappings'&#93;&#91;'properties'&#93;
for field, details in fields.items():
    print(f"Model {INDEX_NAME}, fields {field}: {details&#91;'type'&#93;}")
#------------------------------------------------------------------</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">---</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Diagnostic</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">functions</span><span style="color: #D8DEE9FF">  </span><span style="color: #81A1C1">---------------------------------------</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">indices_list</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> [</span><span style="color: #D8DEE9">index</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">index</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">index</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">cat</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">indices</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">format</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">json</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)]   </span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">count_records</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">index_name</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">count</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">index_name</span><span style="color: #D8DEE9FF">)&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">count</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;    </span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">-----------------------------------------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Get</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">indices</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">record</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">count</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">------------------------------------</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">i</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">indices_list</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">i</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">count_records</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">i</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Get</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">mappings</span></span>
<span class="line"><span style="color: #D8DEE9">mapping</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">indices</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get_mapping</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">fields</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">mapping</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">mappings</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">properties</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">field</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">details</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">fields</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">items</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Model {INDEX_NAME}, fields {field}: {details&#91;&#39;type&#39;&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">#</span><span style="color: #81A1C1">------------------------------------------------------------------</span></span></code></pre></div>



<h2 class="wp-block-heading">Semantic search</h2>



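<p>Now that you have an index, you can search it. The indexing loop above writes the fields Product_name and ProductNameVector, so make sure the field names used in the query match your index mapping. You can read more in my post <a href="https://mietwood.com/semantic-search-with-elasticsearch">Semantic Search with Elasticsearch</a>.</p>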



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Semantic Search --- 

query_text = "yellow texas rose"
query_vector = model.encode(f"query: {query_text}").tolist()
knn_query = {
    "field": "embedding", #"ProductNameVector",
    "query_vector": query_vector,
    "k": 10,
    "num_candidates": 50
}
response = es.search(index=INDEX_NAME, knn=knn_query, source=&#91;"product_name"&#93;)
for hit in response&#91;'hits'&#93;&#91;'hits'&#93;:
    print(f"  - Product: {hit&#91;'_source'&#93;&#91;'product_name'&#93;} (Score: {hit&#91;'_score'&#93;:.4f})")
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Semantic</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Search</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">---</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">query_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">yellow texas rose</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9">query_vector</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">model</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">encode</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query: {query_text}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">tolist</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9">knn_query</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">field</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> #</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">ProductNameVector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_vector</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">k</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">num_candidates</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">50</span></span>
<span class="line"><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">search</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">knn</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">knn_query</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">source</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;)</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">  - Product: {hit&#91;&#39;_source&#39;&#93;&#91;&#39;product_name&#39;&#93;} (Score: {hit&#91;&#39;_score&#39;&#93;:.4f})</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>
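<p>Conceptually, the kNN query returns the k stored vectors that are closest to the query vector. The sketch below shows the exact (brute-force) version of that idea in plain Python with cosine similarity. Elasticsearch itself uses an approximate nearest-neighbour index, and its similarity function depends on the mapping, so treat this only as an illustration. The function names and the toy 2-dimensional vectors are assumptions, not part of the original code.</p>

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=10):
    """Return the k (name, similarity) pairs most similar to query_vec.

    docs maps a product name to its embedding vector.
    """
    scored = [(name, cosine(query_vec, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings", invented for illustration only
docs = {
    "yellow rose": [0.9, 0.1],
    "red tulip": [0.5, 0.5],
    "garden hose": [0.0, 1.0],
}
nearest = top_k([1.0, 0.0], docs, k=2)
print(nearest)
```

<p>In the real query, <code>num_candidates</code> controls how many candidates the approximate search considers per shard before the best <code>k</code> are returned, trading speed for accuracy.</p>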



<p>You can also run a lexical search or, finally, a hybrid search that combines lexical and semantic matching.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>def search_score_script(query_h):
    response = es.search(
        index=INDEX_NAME,
        body={
            "_source": &#91;'product_name','prodidx','img_url'&#93;,
            "query": {
                "function_score": {
                    "query": {
                        "bool": {
                            "should": &#91;
                                {
                                    "match": {
                                        "product_name": {
                                            "query": query_h,
                                            'operator': 'and',
                                            "_name": "text_match"
                                        }
                                    }
                                },
                                {
                                    "knn": {
                                        "field": "embedding",
                                        "query_vector": model.encode(f"query: {query_h}").tolist(),
                                        "k": 30,
                                        "num_candidates": 300,
                                        "_name": "semantic_search"
                                    }
                                }
                            &#93;
                        }
                    },                 
                }
            },
            "size": 100
        }
    )
    return response

response = search_score_script(query_text)
products = [
         {
             "product_name": hit&#91;"_source"&#93;&#91;"product_name"&#93;,
             "score": hit&#91;"_score"&#93;,
             "matched_queries": hit.get("matched_queries", []),
             "prodidx": hit&#91;"_source"&#93;&#91;"prodidx"&#93;,
         }
         for hit in response&#91;"hits"&#93;&#91;"hits"&#93;
     ]

print([(p&#91;"product_name"&#93;, p&#91;"score"&#93;) for p in products])
 </textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">search_score_script</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">search</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">INDEX_NAME</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">body</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">prodidx</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">img_url</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">function_score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">bool</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">should</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> &#91;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">match</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_h</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                            </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">operator</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">and</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text_match</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                    </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">knn</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">field</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">model</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">encode</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query: {query_h}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">tolist</span><span style="color: #D8DEE9FF">()</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">k</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">30</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">num_candidates</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">300</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">semantic_search</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            &#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">},</span><span style="color: #D8DEE9FF">                 </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">size</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">100</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">    )</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">search_score_script</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF">)   </span></span>
<span class="line"><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span></span>
<span class="line"><span style="color: #D8DEE9FF">         </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">matched_queries</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">matched_queries</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> [])</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">             </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prodidx</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prodidx</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">         </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">         </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">     ]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">([(</span><span style="color: #D8DEE9">p</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9">p</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;) </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">p</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF">])</span></span>
<span class="line"><span style="color: #D8DEE9FF"> </span></span></code></pre></div>



<h4 class="wp-block-heading"><strong>Sentence Transformers (e.g., MiniLM, BERT variants)</strong>: a strong default for semantic search</h4>



<ul class="wp-block-list">
<li><strong>Type</strong>: Dense vector models.</li>



<li><strong>Pros</strong>:
<ul class="wp-block-list">
<li>Rich semantic understanding.</li>



<li>Multilingual support.</li>



<li>Fine-tuning possible for domain-specific needs.</li>
</ul>
</li>



<li><strong>Popular Models</strong>:
<ul class="wp-block-list">
<li><code>msmarco-MiniLM-L-12-v3</code> – optimized for asymmetric search (short queries vs. long product descriptions) <a>3</a>.</li>



<li><code>all-MiniLM-L6-v2</code> – fast and lightweight for general semantic tasks.</li>
</ul>
</li>



<li><strong>Use Case</strong>: Ideal for large-scale product catalogs and multilingual e-commerce platforms <a>3</a> <a>4</a>.</li>
</ul>



<h4 class="wp-block-heading">3.&nbsp;<strong>Hybrid Search (BM25 + Semantic)</strong></h4>



<ul class="wp-block-list">
<li>Combine <strong>BM25</strong> (keyword relevance) with <strong>semantic embeddings</strong> using <strong>Reciprocal Rank Fusion (RRF)</strong>.</li>



<li>Delivers highly relevant results by balancing literal matches and contextual meaning <a>1</a>.</li>
</ul>
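<p>To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is an illustration only, not the code behind the search shown earlier: the function name <code>reciprocal_rank_fusion</code>, the sample product ids, and the constant <code>k=60</code> (a commonly used default) are assumptions.</p>

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists: ids ordered by BM25 and by vector similarity.
bm25_ranking = ["p1", "p2", "p3"]
semantic_ranking = ["p3", "p1", "p4"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

<p>Because RRF looks only at rank positions, the BM25 and vector scores never have to be normalized to a common scale.</p>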



<hr class="wp-block-separator has-alpha-channel-opacity"/>
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search">Semantic Search</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python for analysts most important datetime functions</title>
		<link>https://mietwood.com/python-for-analysts-most-important-datetime-functions</link>
					<comments>https://mietwood.com/python-for-analysts-most-important-datetime-functions#comments</comments>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sun, 20 Jul 2025 16:18:38 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3211</guid>

					<description><![CDATA[<p>Python’s powerful date and time functions using the datetime and pandas libraries gives you a robust date table ready for Power BI and other business intelligence and analytical tools. Python for analysts most important datetime functions. Mastering Date and Time Functions in Python for Power BI Date Tables When working with Power BI, a well-structured...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/python-for-analysts-most-important-datetime-functions">Python for analysts most important datetime functions</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Python’s <strong>date and time functions</strong>, available through the <code>datetime</code> and <code>pandas</code> libraries, let you build a robust date table ready for Power BI and other business intelligence and analytical tools.</p>



<h2 class="wp-block-heading" id="masteringdateandtimefunctionsinpythonforpowerbidatetables">Mastering Date and Time Functions in Python for Power BI Date Tables</h2>



<p>When working with Power BI, a well-structured <strong>Date Table</strong> is essential for time intelligence calculations like YTD, QTD, MTD, and custom period comparisons. While Power BI has built-in date table features, using <strong>Python</strong> to generate a custom date table gives you full control over the structure, granularity, and logic.</p>



<p>In this post, we’ll explore Python’s powerful <strong>date and time functions</strong> using the <code>datetime</code> and <code>pandas</code> libraries, and show how to create a robust date table ready for Power BI.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="1pythondateandtimebasics">Python for analysts: date and time basics</h2>



<p>Python provides the <a href="https://docs.python.org/3/library/datetime.html" target="_blank" rel="noopener"><code>datetime</code> module</a> to work with dates and times. Here&#8217;s a quick overview:</p>



<pre class="wp-block-code"><code>from datetime import datetime, timedelta, date

# Current date and time
now = datetime.now()
print("Now:", now)

# Just the date
today = date.today()
print("Today:", today)

# Add 7 days
next_week = today + timedelta(days=7)
print("Next week:", next_week)

# Subtract 30 days
last_month = today - timedelta(days=30)
print("30 days ago:", last_month)
</code></pre>



<p>These functions are the foundation for generating date ranges and calculating custom columns like fiscal periods or holidays.</p>
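<p>As a small illustration of how these building blocks combine, the sketch below parses a date string, walks a range one day at a time with <code>timedelta</code>, and formats each date again. The helper name <code>daily_range</code> and the sample dates are made up for the example.</p>

```python
from datetime import date, datetime, timedelta


def daily_range(start, end):
    """Yield every date from start to end, both inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


# Parse text into a date, then format each generated date back to text.
start = datetime.strptime("2025-02-26", "%Y-%m-%d").date()
end = date(2025, 3, 2)

labels = [d.strftime("%Y-%m-%d") for d in daily_range(start, end)]
print(labels)
```

<p>The range crosses the end of February, which shows that <code>timedelta</code> arithmetic handles month lengths without any special cases.</p>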



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="2creatingadaterangewithpandas">Creating a Date Range with Pandas</h2>



<p>To build a date table, we need a continuous range of dates. <code>pandas.date_range()</code> is perfect for this:</p>



<pre class="wp-block-code"><code>import pandas as pd

# Generate a date range from 2020 to 2030
date_range = pd.date_range(start='2020-01-01', end='2030-12-31', freq='D')
df = pd.DataFrame({'Date': date_range})
</code></pre>



<p>This gives us a DataFrame with one row per day — the backbone of our date table.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="3enrichingthedatetable">Enriching the Date Table</h2>



<p>Now let’s add useful columns for Power BI:</p>



<pre class="wp-block-code"><code>df&#91;'Year'] = df&#91;'Date'].dt.year
df&#91;'Month'] = df&#91;'Date'].dt.month
df&#91;'MonthName'] = df&#91;'Date'].dt.strftime('%B')
df&#91;'Quarter'] = df&#91;'Date'].dt.quarter
df&#91;'Day'] = df&#91;'Date'].dt.day
df&#91;'Weekday'] = df&#91;'Date'].dt.weekday + 1  # Monday = 1
df&#91;'WeekdayName'] = df&#91;'Date'].dt.strftime('%A')
df&#91;'IsWeekend'] = df&#91;'Weekday'].isin(&#91;6, 7])
df&#91;'Week'] = df&#91;'Date'].dt.isocalendar().week
df&#91;'DayOfYear'] = df&#91;'Date'].dt.dayofyear
</code></pre>



<p>These columns allow for slicing and dicing your data in Power BI by year, month, weekday, and more.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="4fiscalcalendarsupport">Fiscal Calendar Support</h2>



<p>Many businesses use fiscal calendars that don’t align with the calendar year. Here’s how to add a fiscal year starting in July:</p>



<pre class="wp-block-code"><code>df&#91;'FiscalYear'] = df&#91;'Date'].apply(lambda x: x.year if x.month &lt; 7 else x.year + 1)
df&#91;'FiscalMonth'] = df&#91;'Date'].apply(lambda x: x.month - 6 if x.month &gt;= 7 else x.month + 6)
df&#91;'FiscalQuarter'] = ((df&#91;'FiscalMonth'] - 1) // 3) + 1
</code></pre>



<p>This logic adjusts the fiscal year, month, and quarter based on a July start.</p>
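<p>The same rule can be written once for any starting month. The following is a sketch under the assumption that the fiscal year is labelled by the calendar year in which it ends, as in the July example; the function name <code>fiscal_period</code> and its <code>start_month</code> parameter are illustrative, not part of any library.</p>

```python
from datetime import date


def fiscal_period(d, start_month=7):
    """Return (fiscal_year, fiscal_month, fiscal_quarter) for a date."""
    # Shift months so that start_month becomes fiscal month 1.
    fiscal_month = (d.month - start_month) % 12 + 1
    # Dates on or after the start month roll into the next fiscal year,
    # unless the fiscal year equals the calendar year (start_month == 1).
    rolls_over = start_month > 1 and d.month >= start_month
    fiscal_year = d.year + 1 if rolls_over else d.year
    fiscal_quarter = (fiscal_month - 1) // 3 + 1
    return fiscal_year, fiscal_month, fiscal_quarter


print(fiscal_period(date(2025, 6, 30)))                 # last day before the July rollover
print(fiscal_period(date(2025, 7, 1)))                  # first day of the next fiscal year
print(fiscal_period(date(2025, 4, 1), start_month=4))   # April-based fiscal year
```

<p>Changing <code>start_month</code> covers April- or October-based fiscal years without touching the rest of the logic.</p>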



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="5flagsfortimeintelligence">Flags for Time Intelligence</h2>



<p>Power BI benefits from flags that simplify DAX calculations:</p>



<pre class="wp-block-code"><code>today = pd.to_datetime('today').normalize()

df&#91;'IsToday'] = df&#91;'Date'] == today
df&#91;'IsCurrentMonth'] = (df&#91;'Date'].dt.month == today.month) &amp; (df&#91;'Date'].dt.year == today.year)
df&#91;'IsCurrentYear'] = df&#91;'Date'].dt.year == today.year
</code></pre>



<p>You can also add flags for holidays, fiscal periods, or custom business logic.</p>
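<p>A holiday flag usually comes down to a membership test against a list of dates. Here is a rough sketch with the standard library only; the <code>HOLIDAYS</code> set is a hypothetical sample (in practice it would be loaded from a CSV or an API), and the helper names are invented for the example.</p>

```python
from datetime import date, timedelta

# Hypothetical holiday list; replace with dates for your own calendar.
HOLIDAYS = frozenset({date(2025, 1, 1), date(2025, 12, 25), date(2025, 12, 26)})


def is_business_day(d, holidays=HOLIDAYS):
    """True for Monday-Friday dates that are not in the holiday set."""
    return d.weekday() < 5 and d not in holidays   # weekday(): Monday = 0


def business_days_between(start, end, holidays=HOLIDAYS):
    """Count business days from start to end, both inclusive."""
    days = (end - start).days + 1
    return sum(is_business_day(start + timedelta(days=i), holidays) for i in range(days))


print(is_business_day(date(2025, 1, 1)))
print(business_days_between(date(2025, 12, 22), date(2025, 12, 28)))
```

<p>In the date table itself, the same list could feed an <code>IsHoliday</code> column, for instance with <code>df&#91;'Date'].isin(pd.to_datetime(sorted(HOLIDAYS)))</code>.</p>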



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="6exportingtocsvforpowerbi">Exporting to CSV for Power BI</h2>



<p>Once your date table is ready, export it:</p>



<pre class="wp-block-code"><code>df.to_csv('DateTable.csv', index=False)
</code></pre>



<p>You can now import this CSV into Power BI as a static date table.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="7sampleoutput">Sample Output</h2>



<p>Here’s a preview of what your date table might look like:</p>



<figure class="wp-block-table aligncenter is-style-regular has-small-font-size"><table><thead><tr><th>Date</th><th>Year</th><th class="has-text-align-center" data-align="center">Month</th><th>MonthName</th><th>Quarter</th><th>WeekdayName</th><th>IsWeekend</th><th>FiscalYear</th><th>IsToday</th></tr></thead><tbody><tr><td>2025-01-01</td><td>2025</td><td class="has-text-align-center" data-align="center">1</td><td>January</td><td>1</td><td>Wednesday</td><td>False</td><td>2025</td><td>False</td></tr><tr><td>2025-07-01</td><td>2025</td><td class="has-text-align-center" data-align="center">7</td><td>July</td><td>3</td><td>Tuesday</td><td>False</td><td>2026</td><td>False</td></tr><tr><td>2025-12-25</td><td>2025</td><td class="has-text-align-center" data-align="center">12</td><td>December</td><td>4</td><td>Thursday</td><td>False</td><td>2026</td><td>False</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Pandas to_datetime() function</h2>



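<p>Date columns loaded from CSV files or databases often arrive as plain strings (dtype <code>object</code>). <code>pd.to_datetime()</code> converts them to <code>datetime64&#91;ns]</code>. Note that the <code>format</code> argument takes strftime-style codes such as <code>%Y-%m-%d</code>, not patterns like <code>yyyy-mm-dd</code>.</p>



<pre class="wp-block-code"><code># df_cust.info() before the conversion: purchase dates are stored as object (strings)
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   CustomerId              13483 non-null  int64         
 1   Dt_first_rew_income     2987 non-null   datetime64&#91;ns]
 2   Dt_first_purchase       13483 non-null  object        
 3   Dt_last_purchase        13483 non-null  object        

# Convert the string columns; format uses strftime-style codes
df_cust&#91;'Dt_first_purchase'] = pd.to_datetime(df_cust&#91;'Dt_first_purchase'], format="%Y-%m-%d")
df_cust&#91;'Dt_last_purchase'] = pd.to_datetime(df_cust&#91;'Dt_last_purchase'], format="%Y-%m-%d")

# df_cust.info() after the conversion
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   CustomerId              13483 non-null  int64         
 1   Dt_first_rew_income     2987 non-null   datetime64&#91;ns]
 2   Dt_first_purchase       13483 non-null  datetime64&#91;ns]
 3   Dt_last_purchase        13483 non-null  datetime64&#91;ns]</code></pre>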
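<p>Real columns often contain a few values that do not parse. A minimal sketch of one way to handle that, assuming ISO-formatted text; the sample values are invented for the example:</p>

```python
import pandas as pd

# Hypothetical raw column: ISO dates stored as text, with one impossible date.
raw = pd.Series(["2025-01-15", "2025-02-30", "2025-03-01"])

# An explicit strftime-style format avoids per-row format guessing;
# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("Unparseable rows:", int(parsed.isna().sum()))
```

<p>Counting the <code>NaT</code> values right after the conversion is a cheap way to catch data-quality problems before the table reaches Power BI.</p>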



<h2 class="wp-block-heading" id="8advancedtips">Advanced Tips</h2>



<ul class="wp-block-list">
<li><strong>Holidays</strong>: Use external APIs or CSVs to mark public holidays.</li>



<li><strong>Week Start</strong>: Adjust <code>Weekday</code> to match your locale (e.g., Monday vs. Sunday).</li>



<li><strong>Time Zones</strong>: Use <code>pytz</code> or <code>zoneinfo</code> for timezone-aware datetime handling.</li>



<li><strong>Dynamic Updates</strong>: Automate the script to regenerate the table monthly or yearly.</li>
</ul>
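<p>The week-start tip, for example, can be handled with a small helper. This is a sketch, not a fixed convention: the function names and the <code>week_start</code> parameter are illustrative, and the weekend rule assumes a Saturday/Sunday weekend.</p>

```python
from datetime import date


def weekday_number(d, week_start="monday"):
    """Return 1-7, where day 1 is Monday (default) or Sunday."""
    iso = d.isoweekday()          # Monday = 1 ... Sunday = 7
    if week_start == "sunday":
        return iso % 7 + 1        # Sunday = 1 ... Saturday = 7
    return iso


def is_weekend(d):
    """Saturday/Sunday check that does not depend on the numbering in use."""
    return d.isoweekday() >= 6


sunday = date(2025, 7, 20)
print(weekday_number(sunday), weekday_number(sunday, week_start="sunday"))
```

<p>Keeping the weekend test separate from the numbering means an <code>IsWeekend</code> column stays correct when the week start changes.</p>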



<hr class="wp-block-separator has-alpha-channel-opacity"/>






<h2 class="wp-block-heading" id="sqldateandtimefunctionsforpowerbidatetables">SQL Date and Time Functions for Power BI Date Tables</h2>



<p>When building reports in Power BI, a comprehensive <strong>Date Table</strong> is essential for enabling time-based calculations like YTD, MTD, and custom period comparisons. While Power BI can auto-generate a date table, using <strong>SQL</strong> to create one gives you full control over its structure and logic.</p>



<p>Let’s explore key <strong>SQL Server date and time functions</strong> and how to use them to build a robust date table.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="1generatingadaterange">1. Generating a Date Range</h2>



<p>To create a date table, you need a continuous range of dates. In SQL Server, you can use a loop or a recursive CTE:</p>



<pre class="wp-block-code"><code>DECLARE @StartDate DATE = '2020-01-01';
DECLARE @EndDate DATE = '2030-12-31';

WITH DateCTE AS (
    SELECT @StartDate AS DateValue
    UNION ALL
    SELECT DATEADD(DAY, 1, DateValue)
    FROM DateCTE
    WHERE DateValue &lt; @EndDate
)
-- Materialize the CTE as a table; the default recursion limit is 100
SELECT * INTO DateTable FROM DateCTE
OPTION (MAXRECURSION 32767);

SELECT * FROM DateTable;
</code></pre>



<p>Here is an example of the output:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="440" height="575" src="https://mietwood.com/wp-content/uploads/2025/07/image-17.jpg" alt="Python for analysts most important datetime functions in sql" class="wp-image-3213" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-17.jpg 440w, https://mietwood.com/wp-content/uploads/2025/07/image-17-230x300.jpg 230w" sizes="auto, (max-width: 440px) 100vw, 440px" /><figcaption class="wp-element-caption">Python for analysts most important datetime functions in sql</figcaption></figure>



<p>This creates a table with one row per day between 2020 and 2030.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="2addingdateattributes">2. Adding Date Attributes</h2>



<p>Once you have the base dates, enrich them with useful columns:</p>



<pre class="wp-block-code"><code>ALTER TABLE DateTable ADD 
    Year INT,
    Month INT,
    MonthName VARCHAR(20),
    Quarter INT,
    Weekday INT,
    WeekdayName VARCHAR(20);
-- Start a new batch (SSMS/sqlcmd) so the UPDATE is compiled after the columns exist
GO

UPDATE DateTable
SET 
    Year = YEAR(DateValue),
    Month = MONTH(DateValue),
    MonthName = DATENAME(MONTH, DateValue),
    Quarter = DATEPART(QUARTER, DateValue),
    Weekday = DATEPART(WEEKDAY, DateValue),  -- numbering depends on SET DATEFIRST
    WeekdayName = DATENAME(WEEKDAY, DateValue);
</code></pre>



<p>These columns allow for flexible filtering and grouping in Power BI.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="3fiscalcalendarandflags">3. Fiscal Calendar and Flags</h2>



<p>You can also add fiscal logic and flags:</p>



<pre class="wp-block-code"><code>ALTER TABLE DateTable ADD FiscalYear INT;
GO

-- Fiscal year starting in July, labelled by the calendar year in which it ends
UPDATE DateTable
SET FiscalYear = CASE 
    WHEN MONTH(DateValue) &gt;= 7 THEN YEAR(DateValue) + 1
    ELSE YEAR(DateValue)
END;
</code></pre>



<p>Add flags like <code>IsWeekend</code>, <code>IsToday</code>, or <code>IsCurrentMonth</code> to simplify DAX expressions.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="conclusion">Conclusion</h2>



<p>Python offers a flexible and powerful way to create a <strong>custom date table</strong> for Power BI. With just a few lines of code, you can generate a rich dataset that supports advanced time intelligence and reporting needs.</p>



<p>Whether you&#8217;re working with fiscal calendars, custom flags, or multilingual support, Python gives you the tools to tailor your date table exactly to your business requirements.</p>



<p>SQL’s date and time functions like <code>DATEADD</code>, <code>DATEPART</code>, <code>DATENAME</code>, and <code>YEAR</code> are powerful tools for building a custom date table. Once created, export it to Power BI or use it as a view for dynamic reporting.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Find more resources in our course <a href="https://mietwood.com/programowanie-zaawansowane-w-analityce">Advanced programming for business analysts</a>.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/python-for-analysts-most-important-datetime-functions">Python for analysts most important datetime functions</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://mietwood.com/python-for-analysts-most-important-datetime-functions/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Biłgoraj &#8211; Zalew Bojary &#8211; swimming 100 km</title>
		<link>https://mietwood.com/bilgoraj-zalew-bojary-plywanie-100-km</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sun, 13 Jul 2025 13:56:53 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3173</guid>

					<description><![CDATA[<p>Damian Błaszczyk will swim in Zalew Bojary, aiming to cover 100 km. We have prepared a math problem showing what such a swim could look like and how the 100 km can be covered. The solution follows later in the article. The purpose of this event: Ośrodek Sportu i Rekreacji w Biłgoraju &#8220;Na Fali&#8221; has the honour of inviting you to an exceptional event,...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/bilgoraj-zalew-bojary-plywanie-100-km">Biłgoraj &#8211; Zalew Bojary &#8211; swimming 100 km</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Damian Błaszczyk will swim in Zalew Bojary, aiming to cover 100 km. We have prepared a math problem showing what such a swim could look like and how the 100 km can be covered. The solution is given later in the article.</p>



<p><strong>Purpose of the event</strong></p>



<p><strong>The &#8220;Na Fali&#8221; Sports and Recreation Centre in Biłgoraj has the honour of inviting you to an exceptional event that will, for the second time, unite us around a noble cause! On 25-27 July 2025, Zalew Bojary in Biłgoraj will host the second edition of the campaign &#8220;Na Fali Nadziei – Przekraczając Granice&#8221; (&#8220;On the Wave of Hope – Crossing Boundaries&#8221;)!</strong> &#8211; <a href="https://osir.lbl.pl/aktualnosci/2025/3428" target="_blank" rel="noopener">read more here</a></p>



<p>This year we are supporting <strong>Adaś Iwanejko</strong>, a 12-year-old from Wola Mała who is bravely fighting Duchenne muscular dystrophy, a fatal disease. His only chance of recovery is a costly gene therapy in the USA, estimated at more than 16 million złoty. We believe that by joining forces we can work wonders and help Adaś fulfil his dream of a normal life!</p>



<h2 class="wp-block-heading"><strong>Task</strong> 1. Biłgoraj &#8211; Zalew Bojary &#8211; 100 km swim</h2>



<p>Swimmer Damian Błaszczyk is to cover a distance of 100 km by swimming in a circle, tethered by a 100 m rope to a pole of negligible thickness. The rope rotates around the pole and keeps its 100 m length at all times. This is feasible on Zalew Bojary because the reservoir is about 230 m across. We therefore assume that the swimmer circles the reservoir until he has covered D = 100 km. The question: how many laps must the swimmer complete?</p>



<p>The calculations are shown in the table below. The swimmer completes 8 rounds of 10 laps to the right (80 laps) and 80 laps to the left, 160 laps in total. Each lap covers d = 2πr ≈ 628 m, so one round of 10 laps is 6,280 m and 8 rounds are 50,240 m. Swimming 100 km therefore requires 16 rounds, i.e. 160 laps of the reservoir. At a pace of 4 km/h this takes about 25 hours. The press release mentions that the swimmer intends to swim for 48 hours, so that is a realistic time frame for covering 100 km.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="755" height="462" src="https://mietwood.com/wp-content/uploads/2025/07/image-8.jpg" alt="" class="wp-image-3178" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-8.jpg 755w, https://mietwood.com/wp-content/uploads/2025/07/image-8-300x184.jpg 300w" sizes="auto, (max-width: 755px) 100vw, 755px" /></figure>
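
<p>The arithmetic for Task 1 can be checked with a few lines of Python. This is a minimal sketch rather than part of the original calculation; the function name <code>laps_for_distance</code> and its parameters are illustrative assumptions.</p>

<pre class="wp-block-code"><code>import math

def laps_for_distance(target_m=100_000, rope_m=100, speed_kmh=4):
    """Return (lap length in m, whole laps needed, swimming time in hours)."""
    lap_m = 2 * math.pi * rope_m              # circumference of one lap
    laps = math.ceil(target_m / lap_m)        # round up to whole laps
    hours = (target_m / 1000) / speed_kmh     # t = d / v
    return lap_m, laps, hours

lap_m, laps, hours = laps_for_distance()
print(f"lap = {lap_m:.0f} m, laps = {laps}, time = {hours:.0f} h")
</code></pre>

<p>With the values from the text this gives a lap of about 628 m, 160 laps (16 rounds of 10) and 25 hours at 4 km/h.</p>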



<div class="wp-block-kadence-accordion alignnone"><div class="kt-accordion-wrap kt-accordion-id3173_02a6a7-f3 kt-accordion-has-2-panes kt-active-pane-0 kt-accordion-block kt-pane-header-alignment-left kt-accodion-icon-style-basic kt-accodion-icon-side-right" style="max-width:none"><div class="kt-accordion-inner-wrap" data-allow-multiple-open="false" data-start-open="0">
<div class="wp-block-kadence-pane kt-accordion-pane kt-accordion-pane-1 kt-pane3173_d3b4d2-ba"><div class="kt-accordion-header-wrap"><button class="kt-blocks-accordion-header kt-acccordion-button-label-show" type="button"><span class="kt-blocks-accordion-title-wrap"><span class="kt-blocks-accordion-title">How do you calculate the time needed to cover 100 km when swimming at 4 km/h?</span></span><span class="kt-blocks-accordion-icon-trigger"></span></button></div><div class="kt-accordion-panel kt-accordion-panel-hidden"><div class="kt-accordion-panel-inner">
<p>The time needed to cover 100 km follows from the formula t = d/v, where d = 100 km is the distance and v = 4 km/h is the speed. Swimming 100 km at 4 km/h takes about 25 hours. &#8220;About&#8221; means that the swimmer's speed may be somewhat higher or lower, in which case the time changes accordingly.</p>
</div></div></div>
</div></div></div>



<h2 class="wp-block-heading">Task 2</h2>



<p>Now we assume that the pole is 20 cm in diameter, which is more realistic. The rope winds onto the pole in such a way that after every 10 laps it jumps onto the next layer, thereby increasing the effective diameter of the pole. The question: how many laps must the swimmer complete to cover 100 km?</p>



<p>Solution:</p>



<ul class="wp-block-list">
<li>Initial rope length = initial radius of the circle = 100 m.</li>



<li>Rope diameter = 1 cm (rl = 0.01 m).</li>



<li>Pole diameter = 20 cm (pole radius r0 = 10 cm = 0.1 m).</li>
</ul>



<p>Stage 1. In the first lap the swimmer covers d = 2πr ≈ 628 m. During this lap the rope winds onto the pole and shortens by d0 = 2π · r0 ≈ 0.628 m. The second lap is therefore swum on a radius of 100 m &#8211; 0.628 m ≈ 99.37 m and is about 624 m long, and so on up to the 10th lap. In laps 1-10 the rope thus shortens by d0 = 2π · r0 = 6.28 · 0.1 m = 0.628 m per lap, 6.28 m in total. After 10 laps the rope jumps on top of the previously wound layer, which increases the radius of the pole by the thickness of the rope, i.e. by 1 cm (0.01 m). So in laps 11-20 the rope shortens by d2 = 2π · (r0 + rl) = 6.28 · 0.11 m ≈ 0.69 m per lap.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="926" height="721" src="https://mietwood.com/wp-content/uploads/2025/07/image-5.jpg" alt="Biłgoraj - Zalew Bojary - pływanie 100 km" class="wp-image-3175" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-5.jpg 926w, https://mietwood.com/wp-content/uploads/2025/07/image-5-300x234.jpg 300w, https://mietwood.com/wp-content/uploads/2025/07/image-5-768x598.jpg 768w" sizes="auto, (max-width: 926px) 100vw, 926px" /></figure>



<p>The full calculation is shown in the table below. The swimmer swims on an ever shorter rope until the whole rope is wound onto the pole. After how many laps does that happen, and what distance does the swimmer cover in that time?</p>



<p>If the first 10 laps shorten the rope by 6.28 m, the next 10 laps by 6.91 m and so on (see the fourth column, Ubytek*10, i.e. the loss per 10 laps), then 100 m of rope is enough for 10 rounds (stages) of 10 laps and an 11th round of 6 laps, 106 laps in total. This gives a distance of 37,655 m. If the swimmer swam to the right, he should now swim to the left, which adds another 37,655 m. That is still short of the full 100 km: a further 4 rounds of 10 laps plus 8 laps are needed. To sum up, the swimmer completes 106 laps to the right, 106 to the left and then another 48 laps to the right, 260 laps in total. Because the radius of the circle changes, the distance covered changes with every lap, so covering 100 km takes 260 laps, 100 more than in the previous method.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="796" height="685" src="https://mietwood.com/wp-content/uploads/2025/07/image-6.jpg" alt="" class="wp-image-3176" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-6.jpg 796w, https://mietwood.com/wp-content/uploads/2025/07/image-6-300x258.jpg 300w, https://mietwood.com/wp-content/uploads/2025/07/image-6-768x661.jpg 768w" sizes="auto, (max-width: 796px) 100vw, 796px" /></figure>
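
<p>The layered winding described above can also be simulated step by step. The sketch below is an illustration under stated assumptions, not a reproduction of the table: each lap is treated as a circle whose radius is the mean rope length during that lap, only full laps are counted, and the function name <code>wind_rope</code> and its parameters are made up for this example.</p>

<pre class="wp-block-code"><code>import math
from itertools import count

def wind_rope(rope=100.0, pole_radius=0.1, rope_thickness=0.01, laps_per_layer=10):
    """Return (full laps, distance in m) swum until the rope is wound up."""
    laps, distance = 0, 0.0
    for layer in count():
        # Rope taken up per lap = circumference of the current winding radius.
        wrap = 2 * math.pi * (pole_radius + layer * rope_thickness)
        # Full laps possible on this layer with the rope that is left.
        n = min(laps_per_layer, int(rope // wrap))
        for _ in range(n):
            distance += 2 * math.pi * (rope - wrap / 2)   # mean radius of the lap
            rope -= wrap
        laps += n
        if n != laps_per_layer:       # rope ran out before the layer was full
            return laps, distance

laps, distance = wind_rope()
print(f"{laps} full laps, {distance / 1000:.1f} km in one direction")
</code></pre>

<p>Under these assumptions the sketch gives 107 full laps and roughly 37.7 km per direction, close to the 106 laps and 37,655 m in the table; the small difference may come from rounding and from how the last laps are counted.</p>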



<div class="wp-block-kadence-accordion alignnone"><div class="kt-accordion-wrap kt-accordion-id3173_7c2f57-98 kt-accordion-has-2-panes kt-active-pane-0 kt-accordion-block kt-pane-header-alignment-left kt-accodion-icon-style-basic kt-accodion-icon-side-right" style="max-width:none"><div class="kt-accordion-inner-wrap" data-allow-multiple-open="false" data-start-open="0">
<div class="wp-block-kadence-pane kt-accordion-pane kt-accordion-pane-1 kt-pane3173_aebbfa-ed"><div class="kt-accordion-header-wrap"><button class="kt-blocks-accordion-header kt-acccordion-button-label-show" type="button"><span class="kt-blocks-accordion-title-wrap"><span class="kt-blocks-accordion-title">How do you calculate the length of a lap around a circle of radius R when R changes as the rope winds onto a pole of radius r?</span></span><span class="kt-blocks-accordion-icon-trigger"></span></button></div><div class="kt-accordion-panel kt-accordion-panel-hidden"><div class="kt-accordion-panel-inner">
<ul class="wp-block-list">
<li>The length D of one lap around a circle of radius R is D = 2πR. In our case R = 100 m and 2π ≈ 6.28, so one lap on a 100 m radius is 628 m.</li>



<li>If the rope winds onto a pole of radius r = 0.1 m, it shortens by 2πr = 0.628 m during each lap.</li>
</ul>
</div></div></div>
</div></div></div>



<h2 class="wp-block-heading">Task 3</h2>



<p>An interesting problem arises if the rope shortens with every lap, winding over the previous layers and forming a spiral. In such a spiral each lap is 0.628 m shorter than the previous one, so the lap lengths form an arithmetic sequence. To solve this problem we used Google Gemini <a href="https://gemini.google.com" target="_blank" rel="noopener">https://gemini.google.com</a>, as shown in the image below:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="684" height="481" src="https://mietwood.com/wp-content/uploads/2025/07/image-10.jpg" alt="Biłgoraj - Zalew Bojary - pływanie 100 km" class="wp-image-3182" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-10.jpg 684w, https://mietwood.com/wp-content/uploads/2025/07/image-10-300x211.jpg 300w" sizes="auto, (max-width: 684px) 100vw, 684px" /></figure>



<p>This equation can be simplified to the following form:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="291" height="117" src="https://mietwood.com/wp-content/uploads/2025/07/image-14.jpg" alt="" class="wp-image-3202"/></figure>



<p>and then</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="473" height="193" src="https://mietwood.com/wp-content/uploads/2025/07/image-13.jpg" alt="" class="wp-image-3201" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-13.jpg 473w, https://mietwood.com/wp-content/uploads/2025/07/image-13-300x122.jpg 300w" sizes="auto, (max-width: 473px) 100vw, 473px" /></figure>



<p>which, after multiplying out, gives a quadratic equation</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="303" height="38" src="https://mietwood.com/wp-content/uploads/2025/07/image-15.jpg" alt="" class="wp-image-3203" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-15.jpg 303w, https://mietwood.com/wp-content/uploads/2025/07/image-15-300x38.jpg 300w" sizes="auto, (max-width: 303px) 100vw, 303px" /></figure>



<p>and then</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="276" height="69" src="https://mietwood.com/wp-content/uploads/2025/07/image-16.jpg" alt="Równanie - Biłgoraj - Zalew Bojary - pływanie 100 km" class="wp-image-3204"/></figure>



<p>We now have a quadratic equation of the form ax² + bx + c = 0. Using the quadratic formula x = (−b ± √(b² − 4ac)) / (2a), we get n = 160.43. In this scenario the swimmer therefore completes 160 full laps on a continuously shortening rope.</p>
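
<p>The arithmetic-series step can be written out and solved numerically. If the first lap has length L1 and every lap is shorter by a constant d, the first n laps add up to n·L1 − d·n(n − 1)/2; setting this equal to 100 km gives (d/2)·n² − (L1 + d/2)·n + 100000 = 0, and the smaller root is the physically meaningful one. The sketch below is only an illustration: the function name <code>laps_on_spiral</code>, its parameters and the two decrements tried are assumptions made here.</p>

<pre class="wp-block-code"><code>import math

def laps_on_spiral(target_m, first_lap_m, decrement_m):
    """Solve n*L1 - d*n*(n-1)/2 = target for n (smaller root of the quadratic)."""
    a = decrement_m / 2
    b = -(first_lap_m + decrement_m / 2)
    c = target_m
    disc = b * b - 4 * a * c              # discriminant b^2 - 4ac
    return (-b - math.sqrt(disc)) / (2 * a)

first_lap = 2 * math.pi * 100             # first lap on a 100 m rope, about 628 m
n_small = laps_on_spiral(100_000, first_lap, 2 * math.pi * 0.01)
n_large = laps_on_spiral(100_000, first_lap, 2 * math.pi * 0.1)
print(f"{n_small:.2f} {n_large:.2f}")
</code></pre>

<p>With these inputs, a decrement of about 0.0628 m per lap (2π · 0.01 m) reproduces n ≈ 160.43, whereas a decrement of 0.628 m per lap gives roughly 174 laps, so the reported result appears to correspond to the smaller decrement.</p>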



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>For Python enthusiasts, the winding problem from Task 2 can be solved with the procedure below.</p>



<pre class="wp-block-code"><code>import math

def calculate_turns(rope_length, rope_thickness, bar_radius, turns_per_layer):
    """Count the full turns completed before the rope is wound onto the bar."""
    turns = 0
    # Rope taken up by one turn equals the circumference at the current winding radius.
    shorten_per_turn = 2 * math.pi * bar_radius
    while rope_length &gt;= shorten_per_turn:
        turns += 1
        rope_length -= shorten_per_turn
        if turns % turns_per_layer == 0:
            # A new layer starts on top of the previous one,
            # so the winding radius grows by one rope thickness.
            bar_radius += rope_thickness
            shorten_per_turn = 2 * math.pi * bar_radius
    return turns

# Given values
rope_length = 100  # meters
rope_thickness = 0.01  # meters (1 cm)
bar_radius = 0.1  # meters
turns_per_layer = 10

# Calculate the number of turns
turns = calculate_turns(rope_length, rope_thickness, bar_radius, turns_per_layer)

print(f"The swimmer can make {turns} turns before the rope ends.")
</code></pre>



<p>More python <a href="https://mietwood.com/advanced-programming-in-sql-and-python">here</a></p>






<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading" id="100kilometrwpywaniadlanadzieizalewbojarywbigoraju"><strong>Swimming for a good cause – </strong>Biłgoraj &#8211; Zalew Bojary &#8211; 100 km swim</h3>



<p>The article describes an extraordinary endurance challenge taken on by swimmer <strong>Damian Błaszczyk</strong>, who set out to swim <strong>100 kilometres</strong> in the <strong>Zalew Bojary</strong> reservoir in Biłgoraj. The event, part of the <strong>“Na Fali Nadziei – Przekraczając Granice”</strong> initiative, is not only a sporting one but above all a charitable one. This year's edition aims to support <strong>Adaś Iwanejko</strong>, a 12-year-old boy with Duchenne muscular dystrophy. The funds raised are meant to help finance a costly gene therapy in the USA, which costs more than <strong>16 million złoty</strong>.</p>



<h3 class="wp-block-heading" id="matematykapywaniamodelowaniedystansu"><strong>The mathematics of swimming – modelling the distance</strong></h3>



<p>The author presents mathematical models describing how 100 km can be swum in a single reservoir. In the first scenario Damian swims in a perfect circle, tethered to a pole by a <strong>100-metre</strong> rope. Each lap is then about <strong>628 metres</strong> long (from the circumference formula 2πr). Reaching 100 km takes <strong>160 laps</strong>, with a change of direction every 10 laps to load the body evenly. At a steady pace of <strong>4 km/h</strong> the whole swim would take about <strong>25 hours</strong>, while the event is planned for <strong>48 hours</strong>, which makes the goal realistic.</p>



<h3 class="wp-block-heading" id="realizmlinanawijajcasinasup"><strong>Realism: the rope winding onto the pole</strong></h3>



<p>The second scenario takes a more realistic aspect into account: the rope winds onto the pole, shortening the swimming radius with every lap. The pole is <strong>20 cm in diameter</strong> and the rope is <strong>1 cm thick</strong>. After every 10 laps the rope forms a new layer, increasing the radius of the pole. As a result, each successive lap is shorter. The calculations show that Damian can complete <strong>106 laps in one direction</strong>, covering <strong>37.7 km</strong>. Repeating this in the opposite direction gives another 37.7 km. Reaching 100 km takes <strong>48 additional laps</strong>, for a total of <strong>260 laps</strong>, 100 more than in the first model.</p>



<h3 class="wp-block-heading" id="modelspiralnymatematycznaelegancja"><strong>The spiral model – mathematical elegance</strong></h3>



<p>The third model assumes that the rope winds in a spiral, with each lap shorter by a constant amount. This forms an arithmetic sequence of lap lengths. Using the formula for the sum of the sequence, the author calculates that about <strong>160.43 laps</strong> are needed to reach 100 km. This model is close to the first one, but more realistic and mathematically elegant.</p>



<h3 class="wp-block-heading" id="podsumowanie"><strong>Summary</strong></h3>



<p>The article combines <strong>sport, charity and mathematics</strong> into an inspiring story. It shows not only an extraordinary physical challenge but also how mathematics can help to understand and plan real-world activities. The event at Zalew Bojary is proof of human perseverance and solidarity: every metre swum is a step towards hope for a sick boy.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>The event is also an example of how passion and science can together serve a higher purpose.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/bilgoraj-zalew-bojary-plywanie-100-km">Biłgoraj &#8211; Zalew Bojary &#8211; 100 km swim</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Semantic Search with Elasticsearch</title>
		<link>https://mietwood.com/semantic-search-with-elasticsearch</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Fri, 11 Jul 2025 10:11:19 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Customer Experience Management]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3158</guid>

					<description><![CDATA[<p>Semantic search with Elasticsearch is a must-have for modern e-commerce. Elasticsearch is a powerful search engine, scalable data store, and vector database built on Apache Lucene. It’s optimized for speed and relevance on production-scale workloads. You can use Elasticsearch to index your product database and build semantic search on top of it. How AI is changing...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search-with-elasticsearch">Semantic Search with Elasticsearch</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Semantic search with Elasticsearch is a must-have for modern e-commerce. Elasticsearch is a powerful search engine, scalable data store, and vector database built on Apache Lucene. It’s optimized for speed and relevance on production-scale workloads. You can use Elasticsearch to index your product database and build semantic search on top of it.</p>



<h2 class="wp-block-heading"><a href="https://www.nngroup.com/articles/ai-changing-search-behaviors" target="_blank" rel="noopener">How AI is changing search behaviors</a></h2>



<p><a href="https://www.nngroup.com/articles/ai-changing-search-behaviors" target="_blank" rel="noopener">https://www.nngroup.com/articles/ai-changing-search-behaviors</a></p>





<h2 class="wp-block-heading">Semantic search with Elasticsearch &#8211; intro</h2>



<p><a href="https://www.elastic.co/search-labs/blog/introduction-to-vector-search" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/introduction-to-vector-search</a></p>



<h2 class="wp-block-heading">Language model implementation</h2>



<p>First, we initialize a language model and prepare the product data for input. Semantic search with Elasticsearch requires the data to be indexed as vectors produced by a sentence-transformer model.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd
import numpy as np

# function to normalize vectors
def normalize_embedding(embedding):
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm != 0 else embedding

# select products from sql database
def select_products():
    q = """
        SELECT 
         &#91;ProductId&#93;
        ,&#91;ProdIdx&#93;
        ,&#91;ProductName&#93;
        FROM &#91;DB_Products&#93; 
    """
    dfp = read_from_sql_server(q);
    return dfp

# import sentence transformers and initiate model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
print(model)

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

model = model.to(device)
print(model)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">numpy</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">np</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">normalize</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">vectors</span></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">normalize_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">np</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">linalg</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> != 0 </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">select</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sql</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">database</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">select_products</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">q</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">        SELECT</span><span style="color: #D8DEE9"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">         &#91;</span><span style="color: #D8DEE9">ProductId</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">ProdIdx</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">ProductName</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">DB_Products</span><span style="color: #D8DEE9FF">&#93; </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    dfp = read_from_sql_server(q)</span><span style="color: #D8DEE9">;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">dfp</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sentence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">transformers</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">initiate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">model</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sentence_transformers</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SentenceTransformer</span></span>
<span class="line"><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">SentenceTransformer</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">all-MiniLM-L6-v2</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">torch</span></span>
<span class="line"><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">torch</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">cuda</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">torch</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">cuda</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">is_available</span><span style="color: #D8DEE9FF">() </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">cpu</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
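
<p>The indexing step later in this post calls a <code>get_embedding</code> helper that is not shown. A minimal sketch of what such a helper might look like is given below. Its signature is an assumption made here: the encoder is passed in as a second argument so that the example runs without downloading a model, and the vector is returned as a plain list of floats scaled to unit length, mirroring <code>normalize_embedding</code> above.</p>

<pre class="wp-block-code"><code>import math

def get_embedding(text, encode):
    """Encode text and return a unit-length vector as a plain list of floats."""
    vector = [float(x) for x in encode(text)]
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return vector                      # leave an all-zero vector unchanged
    return [x / norm for x in vector]

# Stand-in encoder for a quick check (hypothetical, for illustration only).
def fake_encode(text):
    return [3.0, 4.0]

embedding = get_embedding("red shoes", fake_encode)
print(embedding)
</code></pre>

<p>With the model loaded above, one way to call it would be <code>get_embedding(name, model.encode)</code>; a one-argument version can be derived with <code>functools.partial</code>.</p>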



<h2 class="wp-block-heading">Data indexing for semantic search</h2>



<p>Now we can index the product data into an Elasticsearch index. We index each product name both as a vector and as plain text, which makes hybrid (semantic plus lexical) search possible.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># initiate Elasticsearch client
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch('http://localhost:9100')

# check es client
from pprint import pprint
pprint(es.info().body)

# ReCreate the index with dense_vector and text mappings
# step 1
es.indices.delete(index="prod_search_hybrid", ignore_unavailable=True)

# step 2 - mappings
es.indices.create(
    index="prod_search_hybrid",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "product_name": {
                "type": "text"
            }
        }
    },
)

# list the names of all indices
def indices_list():
    indices = es.cat.indices(format='json')
    return [x&#91;'index'&#93; for x in indices]
# -------------------
print(indices_list())
# ------------------------------

# Create documents for embedding
documents = []
for i, r in df_docs.iterrows():
    documents.append({        
        'product_name': r&#91;'ProductName'&#93;&#91;:256&#93;.lower(),        
    })

print(f' Created table of {len(documents)} docs')

# Prepare bulk operations
from tqdm import tqdm                                         # for a progress bar
operations = []
for document in tqdm(documents, total=len(documents)):
    operations.append({'index': {'_index': 'prod_search_hybrid'}})
    operations.append({
        **document,
        'embedding': get_embedding(document&#91;'product_name'&#93;), # vectors for semantic search
        'product_name': document&#91;'product_name'&#93;              # the text field for hybrid search
    })

# Bulk insert the data into Elasticsearch
response = es.bulk(operations=operations)
print(f' Records indexed: {len(response&#91;"items"&#93;)}')</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">initiate</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">client</span></span>
<span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">helpers</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Elasticsearch</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">http://localhost:9100</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">check</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">client</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pprint</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pprint</span></span>
<span class="line"><span style="color: #8FBCBB">pprint</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">info</span><span style="color: #D8DEE9FF">().</span><span style="color: #8FBCBB">body</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">ReCreate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dense_vector</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">mappings</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">step</span><span style="color: #D8DEE9FF"> 1</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">delete</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ignore_unavailable</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">step</span><span style="color: #D8DEE9FF"> 2 - </span><span style="color: #8FBCBB">mappings</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">create</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">mappings</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &quot;</span><span style="color: #8FBCBB">properties</span><span style="color: #D8DEE9FF">&quot;: {</span></span>
<span class="line"><span style="color: #D8DEE9FF">            &quot;</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">&quot;: {</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">type</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">dense_vector</span><span style="color: #D8DEE9FF">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">dims</span><span style="color: #D8DEE9FF">&quot;: 384</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">&quot;: </span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">similarity</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">cosine</span><span style="color: #D8DEE9FF">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">: </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">type</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        }</span></span>
<span class="line"><span style="color: #D8DEE9FF">    }</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">list</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">names</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">of</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">all</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices</span></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices_list</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">cat</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">format</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">json</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> [</span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">index</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF"># -------------------</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">indices_list</span><span style="color: #D8DEE9FF">())</span></span>
<span class="line"><span style="color: #D8DEE9FF"># ------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span></span>
<span class="line"><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF"> = []</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">r</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">df_docs</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">iterrows</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">r</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">ProductName</span><span style="color: #D8DEE9FF">&#39;&#93;&#91;:256&#93;.</span><span style="color: #8FBCBB">lower</span><span style="color: #D8DEE9FF">()</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C"> Created table of {len(documents)} docs</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Prepare</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bulk</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">operations</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF">                                         # </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">progress</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bar</span></span>
<span class="line"><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF"> = []</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">documents</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">total</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">len</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF">)):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF">&#39;</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">&#39;: {&#39;</span><span style="color: #8FBCBB">_index</span><span style="color: #D8DEE9FF">&#39;: &#39;</span><span style="color: #8FBCBB">prod_search_hybrid</span><span style="color: #D8DEE9FF">&#39;</span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">})</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">**</span><span style="color: #8FBCBB">document</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">get_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;&#93;)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> # </span><span style="color: #8FBCBB">vectors</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">semantic</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;&#93;              # </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">field</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">hybrid</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Bulk</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">insert</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span></span>
<span class="line"><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bulk</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C"> Records indexed: {len(response&#91;&quot;items&quot;&#93;)}</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
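<p>The block above collects every operation in one list and sends it in a single request. For larger catalogues it may be more comfortable to stream the documents instead. The sketch below is one possible way to do that, not the method used in this post: it builds the actions with a generator so they can be handed to <strong>helpers.bulk</strong>, which is already imported above. The names <strong>generate_actions</strong> and <strong>fake_embedding</strong> are hypothetical, and fake_embedding only stands in for get_embedding so the example runs without a model or a cluster.</p>

```python
from typing import Callable, Dict, Iterable, Iterator, List


def generate_actions(
    documents: Iterable[Dict[str, str]],
    embed: Callable[[str], List[float]],
    index_name: str = "prod_search_hybrid",
) -> Iterator[dict]:
    """Yield one bulk action per document, computing the embedding lazily."""
    for doc in documents:
        yield {
            "_index": index_name,
            "_source": {
                "product_name": doc["product_name"],       # text field for lexical search
                "embedding": embed(doc["product_name"]),   # vector for semantic search
            },
        }


def fake_embedding(text: str) -> List[float]:
    """Hypothetical stand-in for get_embedding so the sketch runs offline."""
    return [float(len(text)), 0.0, 0.0]


sample_docs = [{"product_name": "gas detector"}, {"product_name": "flame arrester"}]
actions = list(generate_actions(sample_docs, fake_embedding))
print(actions[0]["_source"]["product_name"])

# With a live cluster the generator can be passed straight to helpers.bulk,
# which sends the actions in chunks of chunk_size documents:
#   from elasticsearch import helpers
#   ok_count, errors = helpers.bulk(es, generate_actions(documents, get_embedding), chunk_size=500)
```

<p>Because the generator yields one action at a time, the embeddings are computed as the client consumes them, so the full list of operations never has to sit in memory at once.</p>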



<h2 class="wp-block-heading">Hybrid search</h2>



<p>For hybrid search we combine a <strong>match</strong> query and a <strong>knn</strong> query inside a bool query. The <strong>_name</strong> parameter labels each clause, so every hit reports which part of the query matched it in its matched_queries list. This makes it possible to build hybrid scoring. Note that <strong>query_h</strong> is vectorized into <strong>query_vector</strong> with the same get_embedding function that was used during indexing.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>query_h = "anti-explosion device"

# Vectorize the query with the same get_embedding function used at indexing time
query_vector = get_embedding(query_h)
# print("Query Vector:", query_vector)  # uncomment for debugging

response = es.search(
    index='prod_search_hybrid',
    body={
        "query": {
            "bool": {
                "should": &#91;
                    {
                        "match": {
                            "product_name": {
                                "query": query_h,
                                "_name": "text_match"
                            }
                        }
                    },
                    {
                        "knn": {
                            "field": "embedding",
                            "query_vector":  query_vector,
                            "k": 10,
                            "num_candidates": 100,
                            "_name": "semantic_search"
                        }
                    }
                &#93;
            }
        }
        ,
        'size': 30
    }
)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">anti-explosion device</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># Vectorize the query with the same get_embedding function used at indexing time</span></span>
<span class="line"><span style="color: #D8DEE9">query_vector</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">get_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Query Vector:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_vector</span><span style="color: #D8DEE9FF">)  # uncomment for debugging</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">search</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">body</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">bool</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">should</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> &#91;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">match</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_h</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text_match</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">knn</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">field</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF">  </span><span style="color: #D8DEE9">query_vector</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">k</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">num_candidates</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">100</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">semantic_search</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">size</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">30</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p>Next, we extract the product names and scores from the response and build the lists we need. Here we split the hits by the <code>_name</code> assigned to each query clause, which Elasticsearch returns in <code>matched_queries</code>, and then build the top 10 lexical matches and the top 10 semantic matches. You can read more about similarity measures in the post <a href="https://mietwood.com/measuring-product-similarity">Measuring product similarity</a>.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Extract product names, scores, and sources
products = [
    (hit&#91;"_source"&#93;&#91;"product_name"&#93;, hit&#91;"_score"&#93;, hit.get("matched_queries", []))
    for hit in response&#91;"hits"&#93;&#91;"hits"&#93;
]

# Separate products into text_match and semantic_search groups
text_match_products = [product for product in products if "text_match" in product&#91;2&#93;]
semantic_search_products = [product for product in products if "semantic_search" in product&#91;2&#93;]

# Sort each group by score in descending order
sorted_text_match_products = sorted(text_match_products, key=lambda x: x&#91;1&#93;, reverse=True)&#91;:10&#93;
sorted_semantic_search_products = sorted(semantic_search_products, key=lambda x: x&#91;1&#93;, reverse=True)&#91;:10&#93;

# Print top 10 text_match products
print("\nTop 10 Text Match Products:")
for product in sorted_text_match_products:
    print(f"Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}")

# Print top 10 semantic_search products
print("\nTop 10 Semantic Search Products:")
for product in sorted_semantic_search_products:
    print(f"Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}")</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Extract</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">names</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">scores</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">sources</span></span>
<span class="line"><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span></span>
<span class="line"><span style="color: #D8DEE9FF">    (</span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">matched_queries</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> []))</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Separate</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text_match</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">semantic_search</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">groups</span></span>
<span class="line"><span style="color: #D8DEE9">text_match_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text_match</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">2</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"><span style="color: #D8DEE9">semantic_search_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">semantic_search</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">2</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Sort</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">each</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">group</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">score</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">descending</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">order</span></span>
<span class="line"><span style="color: #D8DEE9">sorted_text_match_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">sorted</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">text_match_products</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">key</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">lambda</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">: </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reverse</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)&#91;:</span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9">sorted_semantic_search_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">sorted</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">semantic_search_products</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">key</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">lambda</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">: </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reverse</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)&#91;:</span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Print</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">top</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text_match</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Top 10 Text Match Products:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> sorted_text_match_products</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Print</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">top</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">semantic_search</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Top 10 Semantic Search Products:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> sorted_semantic_search_products</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p>Full-text search, also known as lexical search, is a technique for fast, efficient searching through text fields in documents. Documents and search queries are transformed to enable returning&nbsp;<a href="https://www.elastic.co/what-is/search-relevance" target="_blank" rel="noreferrer noopener">relevant</a>&nbsp;results instead of simply exact term matches. Fields of type&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/text#text-field-type" target="_blank" rel="noopener"><code>text</code></a>&nbsp;are analyzed and indexed for full-text search.</p>



<p>You can combine full-text search with&nbsp;<a href="https://www.elastic.co/docs/solutions/search/semantic-search" target="_blank" rel="noopener">semantic search using vectors</a>&nbsp;to build modern hybrid search applications. While vector search may require additional GPU resources, the full-text component remains cost-effective by leveraging existing CPU infrastructure.</p>



<p>Another example of vector indexing and semantic search can be found here: <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb" class="ek-link" target="_blank" rel="noopener">https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb</a></p>



<h2 class="wp-block-heading">Vector search setup and performing hybrid search</h2>



<p><a href="https://www.elastic.co/search-labs/blog/vector-search-set-up-elasticsearch" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/vector-search-set-up-elasticsearch</a></p>



<p><a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch</a></p>



<h2 class="wp-block-heading" id="Filtering">Filtering</h2>



<p>Filter context is mostly used for filtering structured data. For example, use filter context to answer questions like:</p>



<ul class="wp-block-list">
<li><em>Does this timestamp fall into the range 2015 to 2016?</em></li>



<li><em>Is the status field set to &#8220;published&#8221;?</em></li>
</ul>



<p>Filter context is in effect whenever a query clause is passed to a filter parameter, such as the&nbsp;<code>filter</code>&nbsp;or&nbsp;<code>must_not</code>&nbsp;parameters in a&nbsp;<code>bool</code>&nbsp;query. <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context" target="_blank" rel="noopener">Learn more</a>&nbsp;about filter context in the Elasticsearch docs.</p>



<h3 class="wp-block-heading" id="Example:-Keyword-Filtering">Keyword Filtering</h3>



<p>This example adds a keyword filter to the kNN query. It retrieves the top books whose title vectors are most similar to &#8220;javascript books&#8221; and whose publisher is Addison-Wesley.</p>



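<pre class="wp-block-code"><code>from pprint import pprint

# Assumes `client` (an Elasticsearch client) and `model` (an embedding model exposing `encode`) already exist
response = client.search(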
    index="book_index",
    knn={
        "field": "title_vector",
        "query_vector": model.encode("javascript books"),
        "k": 10,
        "num_candidates": 100,
<span style="background-color:var(--global-palette1)" class="has-inline-background"><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-theme-palette-9-color">        "filter": {"term": {"publisher.keyword": "addison-wesley"}},</mark></span>
    },
)

pprint(response)</code></pre>



<h2 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#query-filter-context" target="_blank" rel="noopener">Query and filter context</a></h2>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#relevance-scores" target="_blank" rel="noopener">Relevance scores</a></h3>



<p>By default, Elasticsearch sorts matching search results by&nbsp;<strong>relevance score</strong>, which measures how well each document matches a query. The relevance score is a positive floating point number, returned in the&nbsp;<code>_score</code>&nbsp;metadata field of the&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search" target="_blank" rel="noreferrer noopener" class="ek-link">search</a>&nbsp;API. The higher the&nbsp;<code>_score</code>, the more relevant the document. While each query type can calculate relevance scores differently, score calculation also depends on whether the query clause is run in a&nbsp;<strong>query</strong>&nbsp;or&nbsp;<strong>filter</strong>&nbsp;context.</p>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#query-context" class="ek-link" target="_blank" rel="noopener">Query context</a></h3>



<p>In the query context, a query clause answers the question&nbsp;<em>How well does this document match this query clause?</em>&nbsp;Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the&nbsp;<code>_score</code>&nbsp;metadata field. Query context is in effect whenever a query clause is passed to a&nbsp;<code>query</code>&nbsp;parameter, such as the&nbsp;<code>query</code>&nbsp;parameter in the&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#request-body-search-query" target="_blank" rel="noreferrer noopener">search</a>&nbsp;API.</p>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#filter-context" target="_blank" rel="noopener">Filter context</a></h3>



<p>A filter answers the binary question “Does this document match this query clause?”. The answer is simply &#8220;yes&#8221; or &#8220;no&#8221;. Filtering has several benefits:</p>



<ol class="wp-block-list">
<li><strong>Simple binary logic</strong>: In a filter context, a query clause determines document matches based on a yes/no criterion, without score calculation.</li>



<li><strong>Performance</strong>: Because they don’t compute relevance scores, filters execute faster than queries.</li>



<li><strong>Caching</strong>: Elasticsearch automatically caches frequently used filters, speeding up subsequent search performance.</li>



<li><strong>Resource efficiency</strong>: Filters consume less CPU resources compared to full-text queries.</li>



<li><strong>Query combination</strong>: Filters can be combined with scored queries to refine result sets efficiently.</li>
</ol>



<p>Filters are particularly effective for querying structured data and implementing &#8220;must have&#8221; criteria in complex searches.</p>



<p>Structured data refers to information that is highly organized and formatted in a predefined manner. In the context of Elasticsearch, this typically includes:</p>



<ul class="wp-block-list">
<li>Numeric fields (integers, floating-point numbers)</li>



<li>Dates and timestamps</li>



<li>Boolean values</li>



<li>Keyword fields (exact match strings)</li>



<li>Geo-points and geo-shapes</li>
</ul>



<p>Unlike full-text fields, structured data has a consistent, predictable format, making it ideal for precise filtering operations.</p>



<p>Common filter applications include:</p>



<ul class="wp-block-list">
<li>Date range checks: for example, is the&nbsp;<code>timestamp</code>&nbsp;field between 2015 and 2016?</li>



<li>Specific field value checks: for example, is the&nbsp;<code>status</code>&nbsp;field equal to &#8220;published&#8221;, or is the&nbsp;<code>author</code>&nbsp;field equal to &#8220;John Doe&#8221;?</li>
</ul>



<p>Filter context applies when a query clause is passed to a&nbsp;<code>filter</code>&nbsp;parameter, such as:</p>



<ul class="wp-block-list">
<li><code>filter</code>&nbsp;or&nbsp;<code>must_not</code>&nbsp;parameters in&nbsp;<a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query" target="_blank" rel="noopener"><code>bool</code></a>&nbsp;queries</li>



<li><code>filter</code>&nbsp;parameter in&nbsp;<a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-constant-score-query" target="_blank" rel="noopener"><code>constant_score</code></a>&nbsp;queries</li>



<li><a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-filter-aggregation" target="_blank" rel="noopener"><code>filter</code></a>&nbsp;aggregations</li>
</ul>



<p>Filters optimize query performance and efficiency, especially for structured data queries and when combined with full-text searches.</p>



<pre class="wp-block-code"><code>GET /_search
{
  "query": {
    "bool": {
      "must": &#91;
        { "match": { "title":   "Search"        }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": &#91;
        { "term":  { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015-01-01" }}}
      ]
    }
  }
}</code></pre>
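
<p>The same request can be assembled in Python. The sketch below is only an illustration: the helper name <code>build_bool_query</code> and the index name in the comment are assumptions, not part of the Elasticsearch API. It keeps the scored <code>must</code> clauses apart from the unscored <code>filter</code> clauses.</p>

```python
def build_bool_query(must=None, filters=None):
    """Build a bool query: `must` clauses are scored, `filter` clauses are not."""
    bool_body = {}
    if must:
        bool_body["must"] = list(must)
    if filters:
        bool_body["filter"] = list(filters)
    return {"bool": bool_body}


query = build_bool_query(
    must=[
        {"match": {"title": "Search"}},
        {"match": {"content": "Elasticsearch"}},
    ],
    filters=[
        {"term": {"status": "published"}},
        {"range": {"publish_date": {"gte": "2015-01-01"}}},
    ],
)

# With the official Python client this could be sent as, for example:
# response = client.search(index="my-index", query=query)  # index name is hypothetical
print(query)
```

<p>Only the two <code>match</code> clauses contribute to <code>_score</code>; the <code>term</code> and <code>range</code> clauses just include or exclude documents.</p>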



<p>Read more: <a href="https://mietwood.com/product-search-and-product-classification-for-e-commerce">Product Search and Product classification for E-commerce</a></p>



<h1 class="wp-block-heading">Reciprocal rank fusion</h1>



<p><a href="https://plg.uwaterloo.ca/%7Egvcormac/cormacksigir09-rrf.pdf" target="_blank" rel="noreferrer noopener">Reciprocal rank fusion (RRF)</a>&nbsp;is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results.</p>



<p>RRF uses the following formula to determine the score for ranking each document:</p>



<pre class="wp-block-code"><code>score = 0.0
for q in queries:
    if d in result(q):
        score += 1.0 / ( k + rank( result(q), d ) )
return score

# where
# k is a ranking constant
# q is a query in the set of queries
# d is a document in the result set of q
# result(q) is the result set of q
# rank( result(q), d ) is d's rank within the result(q) starting from 1</code></pre>
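
<p>The formula above can be sketched in plain Python as follows. This is a minimal illustration, not Elasticsearch's implementation: the function names <code>rrf_scores</code> and <code>rrf_rank</code>, the default value of <code>k</code>, the sample document ids, and the tie-breaking rule are assumptions made for this example, and each document is assumed to appear at most once per result set.</p>

```python
from collections import defaultdict


def rrf_scores(result_sets, k=60):
    """Return {doc_id: RRF score} for several ranked lists of document ids.

    result_sets: list of lists, each ordered from best to worst match.
    k: the ranking constant from the formula (60 is only an assumed default).
    """
    scores = defaultdict(float)
    for results in result_sets:
        # Ranks start from 1, as in the formula
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def rrf_rank(result_sets, k=60, size=10):
    """Return the top `size` document ids ordered by descending RRF score."""
    scores = rrf_scores(result_sets, k=k)
    # Ties are broken by document id so the order is deterministic (an assumption)
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], str(doc_id)))
    return ordered[:size]


# Hypothetical result sets: one from a lexical query, one from a kNN query
lexical = ["a", "b", "c"]
semantic = ["c", "a", "d"]
print(rrf_rank([lexical, semantic], k=20, size=3))
```

<p>Documents that appear in both lists collect one term per list, so they tend to rise above documents that rank well in only one of them.</p>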



<p>You can use RRF as part of a&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search" target="_blank" rel="noreferrer noopener">search</a>&nbsp;to combine and rank documents using separate sets of top documents (result sets) from a combination of&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers" target="_blank" rel="noopener">child retrievers</a>&nbsp;using an&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers#rrf-retriever" target="_blank" rel="noopener">RRF retriever</a>. A minimum of&nbsp;<strong>two</strong>&nbsp;child retrievers is required for ranking.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="431" height="198" src="https://mietwood.com/wp-content/uploads/2025/07/image-1.jpg" alt="Semantic search with Elasticsearch. RRF retriever combined the results." class="wp-image-3163" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-1.jpg 431w, https://mietwood.com/wp-content/uploads/2025/07/image-1-300x138.jpg 300w" sizes="auto, (max-width: 431px) 100vw, 431px" /><figcaption class="wp-element-caption">Semantic search with Elasticsearch. RRF retriever combined the results.</figcaption></figure>
</div>


<p>We rank the documents based on the RRF formula with a&nbsp;<code>rank_window_size</code>&nbsp;of&nbsp;<code>5</code>&nbsp;truncating the bottom&nbsp;<code>2</code>&nbsp;docs in our RRF result set with a&nbsp;<code>size</code>&nbsp;of&nbsp;<code>3</code>. <strong>We end with&nbsp;<code>_id: 3</code>&nbsp;as&nbsp;<code>_rank: 1</code>,&nbsp;<code>_id: 2</code>&nbsp;as&nbsp;<code>_rank: 2</code>, and&nbsp;<code>_id: 4</code>&nbsp;as&nbsp;<code>_rank: 3</code>. </strong>This ranking matches the result set from the original RRF search as expected.</p>



<p>In this example, we execute the&nbsp;<code>knn</code>&nbsp;and&nbsp;<code>standard</code>&nbsp;retrievers independently of each other. Then we use the&nbsp;<code>rrf</code>&nbsp;retriever to combine the results.</p>



<ol class="wp-block-list">
<li>First, we execute the kNN search specified by the&nbsp;<code>knn</code>&nbsp;retriever to get its global top 50 results.</li>



<li>Second, we execute the query specified by the&nbsp;<code>standard</code>&nbsp;retriever to get its global top 50 results.</li>



<li>Then, on a coordinating node, we combine the kNN search top documents with the query top documents and rank them based on the RRF formula using parameters from the&nbsp;<code>rrf</code>&nbsp;retriever to get the combined top documents using the default&nbsp;<code>size</code>&nbsp;of&nbsp;<code>10</code>.</li>
</ol>



<p>Note that if&nbsp;<code>k</code>&nbsp;from a knn search is larger than&nbsp;<code>rank_window_size</code>, the results are truncated to&nbsp;<code>rank_window_size</code>. If&nbsp;<code>k</code>&nbsp;is smaller than&nbsp;<code>rank_window_size</code>, the results are&nbsp;<code>k</code>&nbsp;size.</p>



<pre class="wp-block-code"><code>GET example-index/_search
{
    "retriever": {
        "rrf": {
            "retrievers": &#91;
                {
                    "standard": {
                        "query": {
                            "term": {
                                "text": "shoes"
                            }
                        }
                    }
                },
                {
                    "knn": {
                        "field": "vector",
                        "query_vector": &#91;1.25, 2, 3.5],
                        "k": 50,
                        "num_candidates": 100
                    }
                }
            ],
            "rank_window_size": 50,
            "rank_constant": 20
        }
    }
}</code></pre>
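
<p>For use from Python, the request body above can be built as a dictionary. The sketch below is hedged: the helper names <code>build_rrf_retriever</code> and <code>knn_contribution</code> are hypothetical, the field name <code>vector</code> is taken from the example, and how the body is sent depends on the client version in use. The second helper restates the note about <code>k</code> and <code>rank_window_size</code>.</p>

```python
def build_rrf_retriever(text_query, query_vector, k=50, num_candidates=100,
                        rank_window_size=50, rank_constant=20):
    """Return a request body combining a standard and a knn retriever with RRF."""
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {"standard": {"query": text_query}},
                    {
                        "knn": {
                            "field": "vector",
                            "query_vector": list(query_vector),
                            "k": k,
                            "num_candidates": num_candidates,
                        }
                    },
                ],
                "rank_window_size": rank_window_size,
                "rank_constant": rank_constant,
            }
        }
    }


def knn_contribution(k, rank_window_size):
    """Number of kNN hits that take part in the fusion, per the note on k."""
    return min(k, rank_window_size)


body = build_rrf_retriever({"term": {"text": "shoes"}}, [1.25, 2, 3.5])
print(body["retriever"]["rrf"]["rank_constant"])
print(knn_contribution(k=100, rank_window_size=50))
```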
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search-with-elasticsearch">Semantic Search with Elasticsearch</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Survival Analysis Models</title>
		<link>https://mietwood.com/survival-analysis-models</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Wed, 02 Jul 2025 10:30:53 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3115</guid>

					<description><![CDATA[<p>Survival analysis is&#160;a statistical method used to model object behavior dependent on a set of variables (x1 .. xn) over a time-to-event period. It is especially useful in modeling the probability of an object&#8217;s survival in certain circumstances. One can analyze the timeline of event occurrences in relation to variables influencing the time until a specific event occurs&#160;like...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/survival-analysis-models">Survival Analysis Models</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Survival analysis is&nbsp;a statistical method used to model object behavior dependent on a set of variables (x1 .. xn) over a time-to-event period. It is especially useful in modeling the probability of an object&#8217;s survival in certain circumstances. One can analyze the timeline of event occurrences in relation to variables influencing the time until a specific event occurs,&nbsp;like death, failure, or a customer saying &#8220;goodbye&#8221; to the company. Here&#8217;s a breakdown of the key terminology of survival analysis.</p>





<h2 class="wp-block-heading">Key terminology of survival analysis models</h2>



<ul class="wp-block-list">
<li><strong>Event</strong>: The outcome of interest, such as death, disease occurrence, customer churn, or equipment failure.</li>



<li><strong>Time</strong>: The duration from a defined starting point (e.g., start of treatment or customer acquisition) to the occurrence of the event or the end of observation (censoring).</li>



<li><strong>Censoring</strong>: When the event of interest is not observed for some individuals during the study period, making their exact survival time unknown.</li>



<li><strong>Survival Function</strong> S(t): The survival function&nbsp;<em>S</em>(<em>t</em>)&nbsp;is defined as the probability that a subject survives (i.e., does not experience the event) beyond time&nbsp;<em>t</em>.</li>



<li><strong>Hazard Function</strong> h(t): The hazard function&nbsp;<em>h</em>(<em>t</em>)&nbsp;represents the instantaneous rate at which the event occurs at time&nbsp;<em>t</em>, given that the subject has survived up to that time.</li>



<li><strong>Linking the Survival and Hazard Functions</strong>: The survival and hazard functions are mathematically related. Knowing one allows you to derive the other, reflecting the duality between retention (survival) and churn (event occurrence).</li>
</ul>



<h2 class="wp-block-heading"><strong>Linking the Survival and Hazard Functions</strong></h2>



<p>The survival and hazard functions are mathematically related. Knowing one allows you to derive the other, reflecting the duality between retention (survival) and churn (event occurrence).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="210" height="63" src="https://mietwood.com/wp-content/uploads/2025/06/image-13.jpg" alt="" class="wp-image-3132"/></figure>
</div>


<p class="has-text-align-center"><strong>In summary, the cumulative hazard rate of subject i at time t can also be defined as the negative logarithm of the survival function at time t.</strong></p>
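
<p>The relationship can be checked numerically. The snippet below is a minimal sketch, assuming the standard identities H(t) = −log S(t) and S(t) = exp(−H(t)); the survival probabilities in <code>survival_curve</code> are made-up values used only for illustration.</p>

```python
import math


def cumulative_hazard_from_survival(survival_prob):
    """Return H(t) = -log(S(t)) for a survival probability in (0, 1]."""
    return -math.log(survival_prob)


def survival_from_cumulative_hazard(cum_hazard):
    """Return S(t) = exp(-H(t))."""
    return math.exp(-cum_hazard)


# Hypothetical survival probabilities at three time points (illustration only)
survival_curve = {1: 0.9, 5: 0.6, 10: 0.3}

cumulative_hazard = {
    t: cumulative_hazard_from_survival(s) for t, s in survival_curve.items()
}

for t, h in cumulative_hazard.items():
    print(f"t={t}: S(t)={survival_curve[t]:.2f}, H(t)={h:.3f}")
```

<p>As survival falls toward zero, the cumulative hazard grows without bound, which is why the two curves mirror each other.</p>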


<div class="kb-row-layout-wrap kb-row-layout-id3115_1f2f65-c8 alignnone wp-block-kadence-rowlayout"><div class="kt-row-column-wrap kt-has-2-columns kt-row-layout-equal kt-tab-layout-inherit kt-mobile-layout-row kt-row-valign-top">

<div class="wp-block-kadence-column kadence-column3115_17ced8-bb"><div class="kt-inside-inner-col">
<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="580" height="365" src="https://mietwood.com/wp-content/uploads/2025/07/image-20.jpg" alt="" class="wp-image-3236" style="width:433px;height:auto" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-20.jpg 580w, https://mietwood.com/wp-content/uploads/2025/07/image-20-300x189.jpg 300w" sizes="auto, (max-width: 580px) 100vw, 580px" /></figure>
</div></div>



<div class="wp-block-kadence-column kadence-column3115_5097c2-f7"><div class="kt-inside-inner-col">
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="573" height="372" src="https://mietwood.com/wp-content/uploads/2025/07/image-21.jpg" alt="" class="wp-image-3237" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-21.jpg 573w, https://mietwood.com/wp-content/uploads/2025/07/image-21-300x195.jpg 300w" sizes="auto, (max-width: 573px) 100vw, 573px" /></figure>
</div></div>

</div></div>


<h2 class="wp-block-heading">Stratification</h2>



<p>When a variable starts to act at some moment during the observation, we can divide the observation period into two periods: the time before the variable was applied and the time after. The moment of application, for example the start of a new treatment for a patient, becomes a new starting point. A shift in the survival function can then be observed.</p>



<h2 class="wp-block-heading"><strong>Kaplan-Meier Estimator</strong></h2>



<p>The Kaplan-Meier Estimator is a non-parametric model for estimating the survival function, often used as a first step in survival analysis.&nbsp;It shows the probability of surviving beyond a given time as a function of time. The curve starts at its maximum value of 1 and steps down at each event time, typically approaching zero as time increases.</p>



<h2 class="wp-block-heading"><strong>Cox Proportional Hazards Model</strong></h2>



<p>The Cox Proportional Hazards Model is a semi-parametric model that incorporates the influence of predictor variables on the hazard rate.&nbsp;Each subject has an observed survival time over the interval 0 to t and an event variable xE that shows whether the event has occurred. Over the same interval, the variables x1 .. xn are observed, and they may influence whether and when the event xE occurs. The Cox model uses x1 .. xn as n explanatory variables when estimating the hazard rate, which means the model can also be used to predict survival time.</p>



<p>Disclaimer: the post is inspired by: <strong>Benjamin Lee</strong>, A Comparison Study of Parametric and Machine Learning Survival Analysis Models to Predict Customer Churn in the Edtech Sector, Vienna, 27th January, 2025<br>Benjamin Lee, Peter Filzmoser, link: <a href="https://repositum.tuwien.at/bitstream/20.500.12708/213329/1/Lee%20Benjamin%20-%202025%20-%20Survival%20Analysis%20Model%20to%20Predict%20Customer%20Churn%20in%20the...pdf" target="_blank" rel="noopener">here</a></p>



<h2 class="wp-block-heading">Dataset for <strong>Cox Proportional Hazards Model</strong></h2>



<p>Suppose we have a dataset like the one below:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="466" height="187" src="https://mietwood.com/wp-content/uploads/2025/06/image-7.jpg" alt="Survival Analysis Models. The general data input structure." class="wp-image-3116" srcset="https://mietwood.com/wp-content/uploads/2025/06/image-7.jpg 466w, https://mietwood.com/wp-content/uploads/2025/06/image-7-300x120.jpg 300w" sizes="auto, (max-width: 466px) 100vw, 466px" /></figure>



<p>The Kaplan-Meier estimator is the statistical tool used to estimate a true survival function from available data and can be considered the ’best’ estimator of survival probability when no parametric structure is assumed. It is a non-parametric estimator that only requires the time-to-event (or time-to-censoring) t and the event status e for every subject. With this information, the survival function estimator S(t) is given by:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="280" height="85" src="https://mietwood.com/wp-content/uploads/2025/06/image-8.jpg" alt="survival function estimator" class="wp-image-3117"/></figure>
</div>


<p>where t<sub>j</sub> is a time at which at least one event occurred, e<sub>j</sub> is the number of events at t<sub>j</sub>, and n<sub>j</sub> is the number of subjects still at risk at t<sub>j</sub>, that is, those who have neither been censored nor had the event before t<sub>j</sub>.</p>



<p>One of the main drawbacks of the Kaplan-Meier model is that it cannot take subject covariates into account. That is, it has no explanatory power regarding how different variables affect survival within a group.</p>



<h2 class="wp-block-heading">Understanding the Cox Proportional Hazards Model</h2>



<p>When we want to understand how factors like age, treatment type, or blood pressure affect survival time, the <strong>Cox Proportional Hazards model</strong> is a go-to tool. Instead of trying to pin down the exact risk at every single moment, this model cleverly separates the hazard rate into two distinct parts.</p>



<p>The core idea is to model how specific factors (or <strong>covariates</strong>) have a <strong>multiplicative effect</strong> on an underlying hazard rate. The formula for the model looks like this:</p>



<p class="has-text-align-center">h(t,X)=h<sub>0</sub>​(t)⋅exp(βX)</p>



<p>Let&#8217;s break that down:</p>



<ul class="wp-block-list">
<li><strong>h<sub>0​</sub>(t) is the Baseline Hazard.</strong> Think of this as the underlying risk of the event for a &#8220;standard&#8221; individual (where all covariate values are zero) over time. It&#8217;s the part of the model that changes with time (t), but it&#8217;s &#8220;non-parametric,&#8221; meaning we don&#8217;t assume its shape. We let the data speak for itself.</li>



<li><strong>exp(βX) is the Covariate Effect.</strong> This is the parametric part of the model and it&#8217;s where your specific factors come in.
<ul class="wp-block-list">
<li>X is the set of your covariates (e.g., age, treatment type, or blood pressure).</li>



<li>β are the coefficients, similar to those in a linear regression, that the model estimates.</li>



<li>This component tells us how much an individual&#8217;s unique characteristics increase or decrease their risk compared to the baseline. Applying the exponential function, exp(), ensures this multiplier is always positive, as negative risk doesn&#8217;t make sense.</li>
</ul>
</li>
</ul>



<p>Notice that <strong>only the baseline hazard depends on time</strong>. The covariate effect, exp(βX), is constant over time. This separation is the key to the model&#8217;s power and leads directly to its most important assumption.</p>
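
<p>The two-part structure can be sketched in a few lines of Python. This is an illustration only: in a real Cox model the baseline hazard is left unspecified and estimated from data, so the linear <code>baseline_hazard</code> below, the coefficients in <code>betas</code>, and the example patient are all hypothetical.</p>

```python
import math


def baseline_hazard(t):
    """Hypothetical baseline h0(t); a real Cox model does not assume its shape."""
    return 0.01 * t


def cox_hazard(t, covariates, betas):
    """h(t, X) = h0(t) * exp(beta . X)"""
    linear_predictor = sum(b * x for b, x in zip(betas, covariates))
    return baseline_hazard(t) * math.exp(linear_predictor)


betas = [0.03, -0.5]   # hypothetical coefficients for (age, treated)
patient = [60, 1]      # a 60-year-old treated patient
reference = [0, 0]     # the "standard" individual with all covariates at zero

hazard_patient = cox_hazard(10, patient, betas)
hazard_reference = cox_hazard(10, reference, betas)

print(f"baseline hazard at t=10: {hazard_reference:.3f}")
print(f"patient hazard at t=10:  {hazard_patient:.3f}")
print(f"multiplier exp(beta.X):  {hazard_patient / hazard_reference:.3f}")
```

<p>For the reference individual the multiplier is exp(0) = 1, so the hazard equals the baseline. Each coefficient can also be read on its own: exp(−0.5) is the factor by which treatment scales the hazard when age is held fixed.</p>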



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>In the Cox proportional hazards model, a&nbsp;<strong>covariate</strong>&nbsp;is a predictor variable that you include in the model to explain or predict the risk (hazard) of a specific event occurring over time, such as death, failure, or relapse. These covariates can be continuous (e.g., age, blood pressure) or categorical (e.g., treatment group, gender).</p>



<p>What the program calculates:</p>



<ul class="wp-block-list">
<li>The model estimates the effect of each covariate on the hazard function—the instantaneous risk of the event happening at a certain time.</li>



<li>It computes coefficients (often denoted as <em>β</em>) for each covariate, quantifying how the risk changes with a one-unit increase in that variable, assuming other covariates remain constant.</li>



<li>These coefficients translate into <strong>hazard ratios</strong> (exponentiated coefficients), which tell how much the hazard (risk) is multiplied by when the covariate changes by one unit.</li>



<li>The baseline hazard function <em>h</em><sub>0</sub>(<em>t</em>) represents the hazard if all covariates were zero; the model then scales this baseline hazard depending on the values of the covariates for each individual.</li>



<li>Overall, the Cox model evaluates how your covariates modify the risk of the event over survival time, assuming that these effects multiply the baseline hazard proportionally and do not change with time (proportional hazards assumption).</li>
</ul>



<p>Formally, the hazard function is modeled as</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="403" height="48" src="https://mietwood.com/wp-content/uploads/2025/07/image-23.jpg" alt="" class="wp-image-3247" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-23.jpg 403w, https://mietwood.com/wp-content/uploads/2025/07/image-23-300x36.jpg 300w" sizes="auto, (max-width: 403px) 100vw, 403px" /></figure>



<p>where&nbsp;<em>x</em><sub>1</sub>, <em>x</em><sub>2</sub>, …, <em>x</em><sub>p</sub>&nbsp;are your covariates and&nbsp;<em>β</em><sub>1</sub>, <em>β</em><sub>2</sub>, …, <em>β</em><sub>p</sub>&nbsp;are their estimated effects.</p>



<p><strong>Summary:</strong></p>



<ul class="wp-block-list">
<li>Your <strong>covariate</strong> is any variable in your data that may affect the timing of the event.</li>



<li>The Cox model calculates how these covariates affect the <em>hazard</em> or instantaneous risk of the event happening, summarized through hazard ratios.</li>



<li>This helps you understand which factors increase or decrease risk while accounting for the survival time and censoring in your dataset.</li>
</ul>



<p>This explanation aligns with standard survival analysis literature and is consistent with how lifelines or R <code>coxph</code> functions use covariates in the Cox model.</p>



<h2 class="wp-block-heading">The &#8220;Proportional&#8221; in Proportional Hazards</h2>



<p>The model&#8217;s name comes from its core assumption: <strong>the effect of the covariates is constant over time.</strong> In other words, the hazard ratio between any two individuals remains proportional throughout the entire timeline.</p>



<p>Let&#8217;s make this concrete. Imagine we have a single covariate x (age) and two subjects, &#8216;a&#8217; and &#8216;b&#8217;. Their hazard rates are:</p>



<ul class="wp-block-list">
<li>Subject a: h(t,x<sub>a</sub>​)=h<sub>0​</sub>(t)⋅exp(βx<sub>a</sub>​)</li>



<li>Subject b: h(t,x<sub>b</sub>​)=h<sub>0</sub>​(t)⋅exp(βx<sub>b</sub>​)</li>
</ul>



<p>If we look at the ratio of their hazards, the baseline hazard h<sub>0</sub>​(t) cancels out:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="383" height="69" src="https://mietwood.com/wp-content/uploads/2025/07/image-22.jpg" alt="" class="wp-image-3238" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-22.jpg 383w, https://mietwood.com/wp-content/uploads/2025/07/image-22-300x54.jpg 300w" sizes="auto, (max-width: 383px) 100vw, 383px" /></figure>
</div>


<p>As you can see, time (t) has vanished from the right side of the equation. This means that if subject &#8216;a&#8217; has double the risk of subject &#8216;b&#8217; on day 1, they will also have double the risk on day 100 or day 500. The difference in risk comes entirely from the difference in age. Their <strong>relative risk is constant</strong>, or <strong>proportional</strong>. This powerful assumption is a direct consequence of the model&#8217;s structure.</p>
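
<p>The cancellation can be verified numerically. The sketch below assumes a hypothetical age coefficient <code>beta</code> and three invented baseline shapes; whatever the baseline looks like, the hazard ratio between the two subjects should equal exp(β(x<sub>a</sub> − x<sub>b</sub>)) at every time point.</p>

```python
import math

beta = 0.04              # hypothetical coefficient for age
age_a, age_b = 70, 50    # hypothetical ages of subjects a and b


def hazard(t, age, baseline):
    """h(t, x) = h0(t) * exp(beta * x) for a given baseline function."""
    return baseline(t) * math.exp(beta * age)


# Three invented baseline hazards with very different shapes
baselines = {
    "constant": lambda t: 0.02,
    "increasing": lambda t: 0.001 * t,
    "decreasing": lambda t: 0.5 / t,
}

expected_ratio = math.exp(beta * (age_a - age_b))

ratios = {}
for name, h0 in baselines.items():
    ratios[name] = [
        hazard(t, age_a, h0) / hazard(t, age_b, h0) for t in (1, 100, 500)
    ]
    print(name, [round(r, 4) for r in ratios[name]])

print("exp(beta * (age_a - age_b)) =", round(expected_ratio, 4))
```

<p>Every printed ratio matches the last line, because h<sub>0</sub>(t) appears in both the numerator and the denominator.</p>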



<p>If a dataset violates the proportional hazards assumption, the predictive power of the model is reduced. Even so, the hazard ratio produced by the model remains a useful tool for assessing the general magnitude of a covariate&#8217;s effect on survival time: it can be interpreted as a weighted average of the true hazard ratios over the observation period. Therefore, the data need not conform strictly to the proportional hazards assumption, but one should always check how well the dataset fits it.</p>



<p>You can check if your dataset meets the proportional hazards assumption of the Cox model using both visual plots and formal statistical tests.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Visual Inspection (Graphical Methods)</h3>



<p>Visual checks are often the first and most intuitive step.</p>



<h4 class="wp-block-heading">Log-Log Survival Plots</h4>



<p>This is a classic method for categorical covariates (like &#8220;treatment group&#8221; vs. &#8220;control group&#8221;).</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<ul class="wp-block-list">
<li><strong>How it works:</strong> You plot the logarithm of time, <code>log(t)</code>, on the x-axis against a special transformation of the survival probability, <code>-log(-log(S(t)))</code>, on the y-axis for each group.</li>



<li><strong>What to look for:</strong> If the proportional hazards assumption holds, the resulting curves for each group should be <strong>roughly parallel</strong> and not cross. If the lines cross or move closer or further apart in a systematic way, the assumption may be violated.</li>
</ul>
</blockquote>
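
<p>A small numeric sketch of why the curves come out parallel. It assumes two hypothetical groups with constant hazards, where the treated hazard is half the control hazard, and uses the transform log(−log S(t)); this is the mirror image of the <code>-log(-log(S(t)))</code> version mentioned above, so parallelism is unaffected.</p>

```python
import math


def loglog(survival_prob):
    """Complementary log-log transform: log(-log(S(t)))."""
    return math.log(-math.log(survival_prob))


# Hypothetical constant hazards; their ratio (2.0) is the same at all times
control_rate, treated_rate = 0.10, 0.05
times = [1, 2, 5, 10, 20]

gaps = []
for t in times:
    s_control = math.exp(-control_rate * t)   # S(t) for the control group
    s_treated = math.exp(-treated_rate * t)   # S(t) for the treated group
    gap = loglog(s_control) - loglog(s_treated)
    gaps.append(gap)
    print(f"log(t)={math.log(t):.2f}  vertical gap={gap:.4f}")
```

<p>The vertical gap equals the logarithm of the hazard ratio at every time point, so the two curves never cross. With real data the curves are built from Kaplan-Meier estimates and will only be roughly parallel.</p>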



<h4 class="wp-block-heading">Schoenfeld Residuals Plots</h4>



<p>This is the most common and powerful visual method, and it works for both continuous and categorical covariates.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<ul class="wp-block-list">
<li><strong>How it works:</strong> For each event that occurs, a residual is calculated as the difference between the observed covariate value of the individual who had the event and the expected covariate value among all individuals still at risk at that time. You then plot these residuals against time.</li>



<li><strong>What to look for:</strong> If the assumption holds, you should see a <strong>random scatter of points around a horizontal line at zero</strong>. If you see any clear pattern or trend (e.g., a line with a positive or negative slope), it suggests the effect of the covariate changes over time, violating the assumption.</li>
</ul>
</blockquote>
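
<p>The residual calculation can be sketched by hand for a single covariate. This is a simplified illustration, not how a library computes it: it assumes no tied event times, the data are invented, and <code>beta</code> is passed in as a fixed value (0.0 here) rather than taken from a fitted model.</p>

```python
import math


def schoenfeld_residuals(times, events, x, beta):
    """Return sorted (event_time, residual) pairs for a single covariate x.

    residual = covariate of the subject who had the event
               minus the risk-weighted mean covariate of everyone still at risk.
    Assumes there are no tied event times.
    """
    residuals = []
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored subjects contribute no residual
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        weights = [math.exp(beta * x[j]) for j in risk_set]
        expected_x = sum(w * x[j] for w, j in zip(weights, risk_set)) / sum(weights)
        residuals.append((t_i, x[i] - expected_x))
    return sorted(residuals)


# Hypothetical data: x is a binary covariate (for example, 1 = treated)
times = [2, 3, 4, 6, 7, 8]
events = [1, 0, 1, 1, 0, 1]
x = [1, 1, 0, 1, 0, 0]

residuals = schoenfeld_residuals(times, events, x, beta=0.0)
for t, r in residuals:
    print(f"t={t}: residual={r:+.3f}")
```

<p>Plotting these pairs against time gives the diagnostic plot described above. In practice the coefficient comes from the fitted model, and the residuals are usually scaled before plotting.</p>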






<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Formal Statistical Tests</h3>



<p>A formal test gives you a p-value to help you decide if any violations you see are statistically significant.</p>



<ul class="wp-block-list">
<li><strong>How it works:</strong> The most common test is based on the <strong>Schoenfeld residuals</strong>. It formally tests whether the slope of a line fitted to the Schoenfeld residuals-vs-time plot is significantly different from zero. This is often done using a function like <code>cox.zph()</code> in R or the <code>check_assumptions()</code> method in Python&#8217;s <code>lifelines</code> library.</li>



<li><strong>How to interpret the results:</strong>
<ul class="wp-block-list">
<li><strong>Null Hypothesis (H<sub>0</sub>):</strong> The effect of the covariate is constant over time (the proportional hazards assumption holds).</li>



<li>A <strong>low p-value (e.g., &lt; 0.05)</strong> suggests you should <strong>reject the null hypothesis</strong>. This is evidence that the assumption is violated for that specific covariate.</li>



<li>A <strong>high p-value</strong> means you <strong>fail to reject the null hypothesis</strong>: the test found no evidence against the proportional hazards assumption, so it&#8217;s reasonable to proceed as if it is met.</li>
</ul>
</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading"><strong>Example of check_assumptions(data) in Python</strong></h2>



<p><strong>The dataset</strong> includes:</p>



<ul class="wp-block-list">
<li><strong>Subject ID</strong></li>



<li><strong>Survival time in days</strong></li>



<li><strong>Event status</strong>&nbsp;(1 = event occurred, 0 = censored)</li>



<li><strong>Age group</strong> (1 &#8211; 4)</li>
</ul>



<p>To use&nbsp;<code>check_assumptions()</code>&nbsp;from&nbsp;<strong>lifelines</strong>, you need to&nbsp;<strong>fit a Cox Proportional Hazards model</strong>, which requires at least&nbsp;<strong>one covariate</strong>&nbsp;(a variable that might affect survival, e.g. age group).</p>



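<pre class="wp-block-code"><code>from lifelines import CoxPHFitter

# `data` is a pandas DataFrame holding the columns listed above.
# By default lifelines treats every column other than the duration and
# event columns as a covariate, so drop identifier columns
# (such as the subject ID) before fitting.

# Fit the Cox model
cph = CoxPHFitter()
cph.fit(data, duration_col="Survival_in_days", event_col="Status_bool")

# Check the proportional hazards assumption (prints a per-covariate report)
cph.check_assumptions(data)
# or set the significance threshold used to flag violations
cph.check_assumptions(data, p_value_threshold=0.05)
</code></pre>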



<h3 class="wp-block-heading">What to Do If the Assumption is Violated</h3>



<p>If you find that a key variable violates the assumption, you have options:</p>



<ol start="1" class="wp-block-list">
<li><strong>Stratification:</strong> You can stratify your model by the problematic covariate. This allows the baseline hazard function to be different for each level of that variable, resolving the violation. For example, if &#8220;gender&#8221; violates the assumption, you can stratify by it.</li>



<li><strong>Use Time-Dependent Covariates:</strong> You can modify the model to include an interaction term between the covariate and time. This explicitly models how the covariate&#8217;s effect changes over time.</li>



<li><strong>Choose a Different Model:</strong> If the violation is severe, a parametric model (like a Weibull or Log-Logistic model) that has a built-in shape for the hazard function might be a more appropriate choice for your data.</li>
</ol>



<h2 class="wp-block-heading">Parametric Models</h2>



<p>When the underlying probability distribution of the dataset is known, one can use parametric models to model the survival function of the dataset. Once the underlying model is specified either in terms of the survival times or the logarithm of survival times, the model can be fitted and estimated using the maximum likelihood estimator.</p>



<p>Parametric models in survival analysis assume that a subject&#8217;s survival time follows a specific, known statistical distribution (like the exponential or Weibull distribution). Unlike non-parametric models (e.g., Kaplan-Meier) which don&#8217;t make assumptions about the data&#8217;s distribution, parametric models are defined by a set of parameters that dictate the shape of the survival and hazard functions.</p>



<p>Understanding them means understanding the <strong>hazard function</strong> each model assumes. The hazard function describes the instantaneous risk of an event occurring at a specific time, given that it hasn&#8217;t occurred yet.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Exponential Model</h3>



<p>This is the simplest parametric model. It assumes the hazard rate is <strong>constant</strong> over time.</p>



<ul class="wp-block-list">
<li><strong>Core Idea:</strong> The risk of the event happening is the same every single day. If a component hasn&#8217;t failed by day 10, its risk of failing on day 11 is the same as it was on day 2.</li>



<li><strong>Hazard Function:</strong> h(t)=λ (a constant)</li>



<li><strong>How to Understand It:</strong> This model is defined by a single parameter, λ (lambda), the hazard rate. It&#8217;s best suited for events that don&#8217;t have a &#8220;memory&#8221; or aging process, such as the failure rate of certain electronic components or the occurrence of random external events.</li>
</ul>
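
<p>The &#8220;no memory&#8221; property can be checked directly. The sketch below assumes a hypothetical daily hazard <code>rate</code> and uses the exponential survival function S(t) = exp(−λt).</p>

```python
import math

rate = 0.05  # hypothetical constant hazard (lambda) per day


def survival(t):
    """S(t) = exp(-lambda * t) for the exponential model."""
    return math.exp(-rate * t)


def conditional_survival(extra, survived_so_far):
    """P(T > survived_so_far + extra | T > survived_so_far)."""
    return survival(survived_so_far + extra) / survival(survived_so_far)


fresh = conditional_survival(1, 0)             # a brand-new component
after_ten_days = conditional_survival(1, 10)   # one that already lasted 10 days
mean_lifetime = 1 / rate                       # expected survival time

print(f"P(survive one more day | new)         = {fresh:.6f}")
print(f"P(survive one more day | 10 days old) = {after_ten_days:.6f}")
print(f"mean lifetime = {mean_lifetime:.1f} days")
```

<p>Both conditional probabilities are identical: under this model, having survived so far says nothing about the risk on the next day.</p>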



<h3 class="wp-block-heading">Weibull Model</h3>



<p>The Weibull model is a more flexible and widely used model because it does not assume a constant hazard rate.</p>



<ul class="wp-block-list">
<li><strong>Core Idea:</strong> The risk of the event can <strong>increase, decrease, or remain constant</strong> over time. This makes it much more adaptable to real-world scenarios.</li>



<li><strong>Hazard Function:</strong> h(t)=λk(λt)<sup>k−1</sup></li>



<li><strong>How to Understand It:</strong> The model is defined by two main parameters:
<ul class="wp-block-list">
<li>λ (<strong>scale parameter</strong>): Stretches or compresses the curve.</li>



<li>k (<strong>shape parameter</strong>): This is the key. It dictates the nature of the hazard.
<ul class="wp-block-list">
<li>If <strong>k&gt;1</strong>, the hazard <strong>increases</strong> over time (e.g., aging, where the risk of failure grows).</li>



<li>If <strong>k&lt;1</strong>, the hazard <strong>decreases</strong> over time (e.g., post-surgery recovery, where the initial risk is high but drops).</li>



<li>If <strong>k=1</strong>, the Weibull model simplifies to the Exponential model with a <strong>constant</strong> hazard.</li>
</ul>
</li>
</ul>
</li>
</ul>
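
<p>The effect of the shape parameter is easy to see by evaluating the hazard formula for a few values of k. The value of <code>lam</code> and the time grid below are hypothetical and chosen only for illustration.</p>

```python
def weibull_hazard(t, lam, k):
    """h(t) = lam * k * (lam * t) ** (k - 1)"""
    return lam * k * (lam * t) ** (k - 1)


lam = 0.1                 # hypothetical value of the lambda parameter
times = [1, 5, 10, 20]    # hypothetical time points

hazards = {}
for k in (0.5, 1.0, 2.0):
    hazards[k] = [weibull_hazard(t, lam, k) for t in times]
    print(f"k={k}: " + ", ".join(f"{h:.4f}" for h in hazards[k]))
```

<p>The row for k = 0.5 falls over time, the row for k = 1 stays flat at λ (the Exponential model), and the row for k = 2 rises.</p>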



<h3 class="wp-block-heading">Log-Normal and Log-Logistic Models</h3>



<p>These models are useful for situations where the hazard rate is not monotonic (i.e., it doesn&#8217;t just go up or down).</p>



<ul class="wp-block-list">
<li><strong>Log-Normal Model</strong>
<ul class="wp-block-list">
<li><strong>Core Idea:</strong> Assumes that the <em>logarithm</em> of the survival time follows a normal (bell-shaped) distribution.</li>



<li><strong>Hazard Function:</strong> The hazard rate first <strong>increases</strong> to a peak and then <strong>decreases</strong>.</li>



<li><strong>How to Understand It:</strong> Think of situations where failure is most likely after a certain &#8220;wear-in&#8221; period, but if the subject survives past that peak, the immediate risk then declines. It&#8217;s often used in engineering for component fatigue.</li>
</ul>
</li>



<li><strong>Log-Logistic Model</strong>
<ul class="wp-block-list">
<li><strong>Core Idea:</strong> Similar to the Log-Normal model, it also assumes the logarithm of survival time follows a specific distribution (the logistic distribution).</li>



<li><strong>Hazard Function:</strong> The hazard rate can be <strong>hump-shaped</strong> (increasing then decreasing) or <strong>monotonically decreasing</strong>, depending on its parameters. It&#8217;s more flexible than the Log-Normal model.</li>



<li><strong>How to Understand It:</strong> This model is popular in medical research, especially when studying diseases like cancer where the risk of mortality might peak some time after diagnosis and then fall. It&#8217;s also notable because its survival function has a simple, explicit formula, which makes interpreting odds easier.</li>
</ul>
</li>
</ul>
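
<p>To make the two hazard shapes of the Log-Logistic model concrete, here is a minimal sketch using one common parametrization, with scale <code>alpha</code> and shape <code>beta</code>: h(t) = (β/α)(t/α)<sup>β−1</sup> / (1 + (t/α)<sup>β</sup>). The parameter values are hypothetical.</p>

```python
def loglogistic_hazard(t, alpha, beta):
    """h(t) = (beta/alpha) * (t/alpha)**(beta-1) / (1 + (t/alpha)**beta)"""
    ratio = t / alpha
    return (beta / alpha) * ratio ** (beta - 1) / (1 + ratio ** beta)


def loglogistic_survival(t, alpha, beta):
    """S(t) = 1 / (1 + (t/alpha)**beta), a simple closed form."""
    return 1 / (1 + (t / alpha) ** beta)


alpha = 10.0          # hypothetical scale parameter (also the median survival time)
times = [5, 10, 20]

hump = [loglogistic_hazard(t, alpha, beta=2.0) for t in times]
falling = [loglogistic_hazard(t, alpha, beta=1.0) for t in times]

print("beta=2 (hump-shaped):", [round(h, 4) for h in hump])
print("beta=1 (decreasing): ", [round(h, 4) for h in falling])
print("S(alpha) =", loglogistic_survival(alpha, alpha, 2.0))
```

<p>With a shape above 1 the hazard rises to a peak and then falls; with a shape of 1 or below it only falls. The survival function equals 0.5 at t = α, so α is the median survival time in this parametrization.</p>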



<h2 class="wp-block-heading">Accelerated Failure Time (AFT) models</h2>



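<p>In contrast to the Cox model, in which covariates multiply the hazard, AFT models assume that covariates act multiplicatively on the survival time itself: a covariate speeds up or slows down the time to the event by a constant factor.</p>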
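
<p>A minimal sketch of the idea, assuming the usual AFT form S(t | x) = S<sub>0</sub>(t / exp(βx)). The Weibull-shaped <code>baseline_survival</code> and the coefficient <code>beta</code> are hypothetical choices made only for this example.</p>

```python
import math


def baseline_survival(t, lam=0.1, k=1.5):
    """Hypothetical Weibull-shaped baseline: S0(t) = exp(-(lam * t) ** k)."""
    return math.exp(-((lam * t) ** k))


def aft_survival(t, x, beta):
    """AFT model: S(t | x) = S0(t / exp(beta * x))."""
    acceleration_factor = math.exp(beta * x)
    return baseline_survival(t / acceleration_factor)


beta = math.log(2)  # hypothetical effect: x = 1 stretches survival time by a factor of 2

for t in (5, 10, 20):
    print(
        f"t={t}: S(t | x=0)={aft_survival(t, 0, beta):.4f}  "
        f"S(t | x=1)={aft_survival(t, 1, beta):.4f}"
    )
```

<p>A subject with x = 1 reaches at time 20 the same survival probability that a baseline subject reaches at time 10: the whole time axis is stretched, rather than the hazard being scaled.</p>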






<h2 class="wp-block-heading">Metrics for Survival Analysis</h2>



<h3 class="wp-block-heading">Log-Rank Test</h3>



<p>The log-rank test is used to <strong>compare two or more survival functions </strong>with each other. In this sense, it is analogous to the t-test or Pearson’s chi-squared test for survival analysis. Like those tests, the log-rank test tests the null hypothesis H0 that there is no difference between the survival functions being compared in the probability of an event e occurring at any time t.</p>



<p>The <strong>Log-Rank Test</strong> is a statistical test used in survival analysis to <strong>compare the survival distributions of two or more groups</strong>. It&#8217;s particularly useful for determining if there are significant differences in the time it takes for an event (like death, failure, or relapse) to occur between groups.</p>



<h3 class="wp-block-heading" id="examplecalculation">Example Calculation</h3>



<p>Let&#8217;s consider a clinical trial comparing the survival times of patients using two different cancer treatments, Drug A and Drug B. Here&#8217;s a simplified dataset:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Time (months)</th><th>Event (1=death, 0=censored)</th><th>Group (A/B)</th></tr></thead><tbody><tr><td>2</td><td>1</td><td>A</td></tr><tr><td>3</td><td>0</td><td>A</td></tr><tr><td>4</td><td>1</td><td>B</td></tr><tr><td>6</td><td>1</td><td>A</td></tr><tr><td>7</td><td>0</td><td>B</td></tr><tr><td>8</td><td>1</td><td>B</td></tr></tbody></table></figure>



<h4 class="wp-block-heading" id="stepstocalculatethelogranktest">Steps to Calculate the Log-Rank Test:</h4>



<ol class="wp-block-list">
<li><strong>Combine the Data</strong>: List all unique time points from both groups.</li>



<li><strong>Calculate the Number at Risk</strong>: For each time point, determine how many patients are still at risk in each group.</li>



<li><strong>Calculate the Expected Events</strong>: For each time point, calculate the expected number of events (deaths) for each group.</li>



<li><strong>Compute the Test Statistic</strong>: Use the observed and expected events to calculate the test statistic.</li>
</ol>



<h4 class="wp-block-heading" id="detailedcalculation">Detailed Calculation:</h4>



<ol class="wp-block-list">
<li><strong>Combine the Data</strong>:
<ul class="wp-block-list">
<li>Time points: 2, 3, 4, 6, 7, 8</li>
</ul>
</li>



<li><strong>Number at Risk</strong>:
<ul class="wp-block-list">
<li>At time 2: Group A: 3, Group B: 3</li>



<li>At time 3: Group A: 2, Group B: 3</li>



<li>At time 4: Group A: 1, Group B: 3</li>



<li>At time 6: Group A: 1, Group B: 2</li>



<li>At time 7: Group A: 0, Group B: 2</li>



<li>At time 8: Group A: 0, Group B: 1</li>
</ul>
</li>



<li><strong>Expected Events</strong> (computed only at times where a death occurs):
<ul class="wp-block-list">
<li>At time 2: Expected events for Group A = (3/6) * 1 = 0.5, Group B = (3/6) * 1 = 0.5</li>



<li>At time 4: Expected events for Group A = (1/4) * 1 = 0.25, Group B = (3/4) * 1 = 0.75</li>



<li>At time 6: Expected events for Group A = (1/3) * 1 = 0.33, Group B = (2/3) * 1 = 0.67</li>



<li>At time 8: Expected events for Group A = 0, Group B = 1</li>
</ul>
</li>



<li><strong>Test Statistic</strong>:<ul><li>Sum the observed and expected events for each group.</li><li>Calculate the test statistic using the formula:</li></ul>$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $ where O<sub>i</sub> is the observed number of events and E<sub>i</sub> is the expected number of events in group i. For our example, O<sub>A</sub> = 2 and E<sub>A</sub> = 0.5 + 0.25 + 0.33 ≈ 1.08, while O<sub>B</sub> = 2 and E<sub>B</sub> = 0.5 + 0.75 + 0.67 + 1 ≈ 2.92. Using the unrounded expected values, χ<sup>2</sup> ≈ 1.06, which is below 3.84, the 5% critical value of the chi-square distribution with 1 degree of freedom.</li>
</ol>



<p>If the test statistic is greater than the critical value from the chi-square distribution table (with 1 degree of freedom), we reject the null hypothesis and conclude that there is a significant difference between the survival distributions of the two groups <a href="https://datatab.net/tutorial/log-rank-test" target="_blank" rel="noopener">[1]</a> <a href="https://real-statistics.com/survival-analysis/kaplan-meier-procedure/log-rank-test/" target="_blank" rel="noopener">[2]</a>.</p>
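
<p>The steps above can be sketched in plain Python on the six-row example table. This is a simplified illustration using the Σ(O − E)<sup>2</sup>/E form of the statistic shown above; statistical packages typically use a variance-based version of the log-rank statistic, so their values may differ slightly. The function name <code>logrank_two_groups</code> is made up for this example.</p>

```python
def logrank_two_groups(rows):
    """rows: list of (time, event, group) tuples.

    Returns (observed, expected, chi_square), where observed and expected are
    per-group totals of events accumulated over all event times.
    """
    event_times = sorted({t for t, e, _ in rows if e == 1})
    groups = sorted({g for _, _, g in rows})
    observed = {g: 0 for g in groups}
    expected = {g: 0.0 for g in groups}

    for t in event_times:
        # Subjects whose time is >= t are still at risk just before t
        at_risk = {
            g: sum(1 for tt, _, gg in rows if gg == g and tt >= t) for g in groups
        }
        deaths = {
            g: sum(1 for tt, e, gg in rows if gg == g and tt == t and e == 1)
            for g in groups
        }
        total_at_risk = sum(at_risk.values())
        total_deaths = sum(deaths.values())
        for g in groups:
            observed[g] += deaths[g]
            expected[g] += total_deaths * at_risk[g] / total_at_risk

    chi_square = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in groups)
    return observed, expected, chi_square


# The example table: (time in months, event, group)
rows = [(2, 1, "A"), (3, 0, "A"), (4, 1, "B"), (6, 1, "A"), (7, 0, "B"), (8, 1, "B")]

observed, expected, chi_square = logrank_two_groups(rows)
print("observed:", observed)
print("expected:", {g: round(v, 2) for g, v in expected.items()})
print("chi-square:", round(chi_square, 2))
```

<p>For this tiny dataset the statistic is about 1.06, well below the 5% critical value of 3.84, so the difference between Drug A and Drug B is not significant. With only six patients the chi-square approximation is rough, so the result is illustrative at best.</p>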



<h2 class="wp-block-heading">Example of Survival Analysis Models in Python</h2>



<p>The code below computes and plots the survival function (Kaplan-Meier estimate) and the cumulative hazard function (Nelson-Aalen estimate).</p>



<pre class="wp-block-code"><code>import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Create a Sample Dataset ---
data_dict = {
    'time': &#91;6, 7, 10, 15, 18, 22, 25, 30, 32, 38, 40, 45],
    'event': &#91;1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 1 = event, 0 = censored
}
data = pd.DataFrame(data_dict)
data = data.sort_values(by='time').reset_index(drop=True)

# --- 2. Calculate Kaplan-Meier Survival Function ---
km_data = data.copy()
unique_times = sorted(km_data&#91;'time'].unique())
survival_prob = 1.0
results = &#91;]

for t in unique_times:
    at_risk = (km_data&#91;'time'] &gt;= t).sum()
    events = km_data&#91;(km_data&#91;'time'] == t) &amp; (km_data&#91;'event'] == 1)].shape&#91;0]
    
    if at_risk &gt; 0:
        survival_prob *= (1 - events / at_risk)
    
    results.append({'time': t, 'survival': survival_prob})

km_results = pd.DataFrame(results)

# --- 3. Plot the Survival Function ---
plt.figure(figsize=(10, 6))
plt.step(km_results&#91;'time'], km_results&#91;'survival'], where='post', label='Kaplan-Meier Estimate')
# Look up the survival estimate at each censored time by matching on 'time'
censored = data&#91;data&#91;'event'] == 0].merge(km_results, on='time')
plt.scatter(censored&#91;'time'], censored&#91;'survival'],
            marker='+', color='red', s=100, label='Censored')
plt.title('Survival Function (Kaplan-Meier Estimate)')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.ylim(0, 1.05)
plt.xlim(0, max(data&#91;'time']) + 5)
plt.grid(True)
plt.legend()
plt.savefig('manual_survival_function_plot.png')
plt.close()

# --- 4. Calculate Nelson-Aalen Cumulative Hazard ---
na_data = data.copy()
hazard = 0.0
na_results_list = &#91;]

for t in unique_times:
    at_risk = (na_data&#91;'time'] &gt;= t).sum()
    events = na_data&#91;(na_data&#91;'time'] == t) &amp; (na_data&#91;'event'] == 1)].shape&#91;0]

    if at_risk &gt; 0:
        hazard += events / at_risk
    
    na_results_list.append({'time': t, 'cumulative_hazard': hazard})

na_results = pd.DataFrame(na_results_list)

# --- 5. Plot the Cumulative Hazard Function ---
plt.figure(figsize=(10, 6))
plt.step(na_results&#91;'time'], na_results&#91;'cumulative_hazard'], where='post', label='Nelson-Aalen Estimate')
plt.title('Cumulative Hazard Function (Nelson-Aalen Estimate)')
plt.xlabel('Time (Months)')
plt.ylabel('Cumulative Hazard')
plt.xlim(0, max(data&#91;'time']) + 5)
plt.grid(True)
plt.legend()
plt.savefig('manual_cumulative_hazard_plot.png')
plt.close()</code></pre>






<p>References</p>



<p>[1] <a href="https://datatab.net/tutorial/log-rank-test" target="_blank" rel="noopener">Log-Rank Test: A Beginner’s Guide &#8211; DATAtab</a></p>



<p>[2] <a href="https://real-statistics.com/survival-analysis/kaplan-meier-procedure/log-rank-test/" target="_blank" rel="noopener">Log-Rank Test &#8211; Real Statistics Using Excel</a></p>



<p>The post <a rel="nofollow" href="https://mietwood.com/survival-analysis-models">Survival Analysis Models</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
