<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>python &#8211; Customer Experience Management</title>
	<atom:link href="https://mietwood.com/tag/python/feed" rel="self" type="application/rss+xml" />
	<link>https://mietwood.com</link>
	<description>Customer Experience Can Be Managed</description>
	<lastBuildDate>Thu, 25 Sep 2025 19:19:21 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://mietwood.com/wp-content/uploads/2022/09/cropped-Fav7-32x32.png</url>
	<title>python &#8211; Customer Experience Management</title>
	<link>https://mietwood.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How to check virtual environments</title>
		<link>https://mietwood.com/virtual-environment</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 19:19:18 +0000</pubDate>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3321</guid>

					<description><![CDATA[<p>You can check virtual environments using a simple for loop in your terminal. This command finds all subdirectories in a specified parent folder, assumes each is a virtual environment, and then uses that environment&#8217;s pip to check for lifelines. On Windows, you can use a similar loop in either Command...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/virtual-environment">How to check virtual environments</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>You can check virtual environments using a simple <code>for</code> loop in your terminal. This command finds all subdirectories in a specified parent folder, assumes each is a virtual environment, and then uses that environment&#8217;s <code>pip</code> to check for <code>lifelines</code>.</p>



<ol start="1" class="wp-block-list">
<li><strong>Navigate</strong> to the folder that contains all your virtual environments.</li>



<li><strong>Run the following command</strong> (Bash): <code>for venv in */; do if [ -f "${venv}bin/pip" ]; then echo "--- Checking in '${venv%/}' ---"; "${venv}bin/pip" list | grep 'lifelines'; fi; done</code></li>
</ol>



<p><strong>How it works:</strong></p>



<ul class="wp-block-list">
<li>It loops through each subdirectory (e.g., <code>my_project_venv/</code>).</li>



<li>It checks if a <code>pip</code> executable exists inside the <code>bin</code> folder to confirm it&#8217;s likely a venv.</li>



<li>It then runs the <code>pip list</code> command from <em>within</em> that specific environment and uses <code>grep</code> to filter for the line containing &#8220;lifelines&#8221;.</li>



<li>If <code>lifelines</code> is installed, it will print the package name and its version. If not, it will print nothing for that environment.</li>
</ul>
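


<p>The same steps can be sketched in Python, which runs unchanged on Linux, macOS, and Windows. This is a minimal sketch, not the method used above: the function names <code>find_venv_pips</code> and <code>check_package</code> are hypothetical, and it assumes a folder is a virtual environment if it contains <code>bin/pip</code> or <code>Scripts\pip.exe</code>.</p>

```python
import subprocess
from pathlib import Path


def find_venv_pips(parent):
    """Return {folder_name: pip_path} for subfolders that look like virtual environments."""
    pips = {}
    for child in sorted(Path(parent).iterdir()):
        if not child.is_dir():
            continue
        # POSIX venvs keep pip in bin/, Windows venvs keep it in Scripts/
        for candidate in (child / "bin" / "pip", child / "Scripts" / "pip.exe"):
            if candidate.is_file():
                pips[child.name] = candidate
                break
    return pips


def check_package(pip_path, package="lifelines"):
    """Run that environment's own pip and keep the `pip list` lines that mention the package."""
    result = subprocess.run(
        [str(pip_path), "list"], capture_output=True, text=True, check=False
    )
    return [line for line in result.stdout.splitlines() if package in line.lower()]
```

<p>A possible way to use it: loop over <code>find_venv_pips(".").items()</code>, print the folder name, and then print whatever <code>check_package</code> returns for that environment. An empty list means the package was not found there.</p>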



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">For Windows 🪟</h3>



<p>You can use a similar loop in either Command Prompt (CMD) or PowerShell.</p>



<h4 class="wp-block-heading">In PowerShell:</h4>



<ol start="1" class="wp-block-list">
<li><strong>Open PowerShell</strong> and navigate to the folder containing your virtual environments.</li>



<li><strong>Run the following command</strong> (PowerShell): <code>Get-ChildItem -Directory | ForEach-Object { $pipPath = Join-Path $_.FullName "Scripts\pip.exe"; if (Test-Path $pipPath) { Write-Host "--- Checking in '$($_.Name)' ---"; &amp; $pipPath list | findstr "lifelines" } }</code></li>
</ol>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>Get-ChildItem -Directory | ForEach-Object {
    $pipPath = Join-Path $_.FullName "Scripts\pip.exe"
    if (Test-Path $pipPath) {
        Write-Host "--- Checking in '$($_.Name)' ---"
        &amp; $pipPath list | findstr "lifelines"
    }
}</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">Get</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">ChildItem</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Directory</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">|</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">ForEach</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Object</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">$pipPath</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Join</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Path</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">$_</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">FullName</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Scripts</span><span style="color: #EBCB8B">\p</span><span style="color: #A3BE8C">ip.exe</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> (</span><span style="color: #D8DEE9">Test</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Path</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">$pipPath</span><span style="color: #D8DEE9FF">) </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">Write</span><span style="color: #81A1C1">-</span><span style="color: #D8DEE9">Host</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">--- Checking in &#39;$($_.Name)&#39; ---</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">&amp;</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">$pipPath</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">list</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">|</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">findstr</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">lifelines</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #ECEFF4">}</span></span></code></pre></div>
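


<p>Both <code>grep</code> and <code>findstr</code> match substrings, so a package whose name merely contains &#8220;lifelines&#8221; would also be listed. If an exact match matters, one option is to ask pip for JSON with <code>pip list --format=json</code> and compare names in Python. The sketch below assumes that output has already been captured as a string; the helper name <code>installed_version</code> and the sample data (including the package <code>lifelines-extras</code>) are made up for illustration.</p>

```python
import json


def installed_version(pip_json, package="lifelines"):
    """Return the version of `package` from `pip list --format=json` output, or None."""
    wanted = package.lower().replace("_", "-")
    for entry in json.loads(pip_json):
        # Compare whole names, so "lifelines-extras" does not count as "lifelines"
        if entry["name"].lower().replace("_", "-") == wanted:
            return entry["version"]
    return None


# Hypothetical output of: pip list --format=json
sample = (
    '[{"name": "lifelines", "version": "0.27.8"},'
    ' {"name": "lifelines-extras", "version": "0.1"}]'
)
print(installed_version(sample))
```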



<p>Inspection results</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="969" height="458" src="https://mietwood.com/wp-content/uploads/2025/09/image-6.jpg" alt="" class="wp-image-3322" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-6.jpg 969w, https://mietwood.com/wp-content/uploads/2025/09/image-6-300x142.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-6-768x363.jpg 768w" sizes="(max-width: 969px) 100vw, 969px" /></figure>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What are independent variables?</title>
		<link>https://mietwood.com/independent-variables</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 18:44:48 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3316</guid>

					<description><![CDATA[<p>Independent variables, also called predictors, features, or explanatory variables, are the variables in a statistical or machine learning model that are used to explain or predict changes in another variable — the dependent variable, also called the outcome or target. Independent variables in simple terms: Example of independent variables in Customer Management (RFM Model): Suppose...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/independent-variables">What are independent variables?</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Independent variables</strong>, also called <strong>predictors</strong>, <strong>features</strong>, or <strong>explanatory variables</strong>, are the variables in a statistical or machine learning model that are used to <strong>explain or predict</strong> changes in another variable — the <strong>dependent variable</strong>, also called the outcome or target.</p>



<h3 class="wp-block-heading" id="insimpleterms"><strong>Independent variables</strong> in simple terms:</h3>



<ul class="wp-block-list">
<li><strong>Independent variables</strong>: Inputs you control or observe.</li>



<li><strong>Dependent variable</strong>: Output you want to understand or predict.</li>
</ul>



<h3 class="wp-block-heading" id="exampleincustomermanagementrfmmodel">Example of <strong>independent variables</strong> in Customer Management (RFM Model):</h3>



<p>Suppose you&#8217;re analyzing customer behavior to predict <strong>churn</strong> (whether a customer will stop buying).</p>



<ul class="wp-block-list">
<li><strong>Independent variables</strong>:
<ul class="wp-block-list">
<li><strong>Recency</strong>: How recently a customer made a purchase.</li>



<li><strong>Frequency</strong>: How often they purchase.</li>



<li><strong>Monetary</strong>: How much they spend.</li>
</ul>
</li>



<li><strong>Dependent variable</strong>:
<ul class="wp-block-list">
<li><strong>Churn</strong>: 1 if the customer churned, 0 if they stayed.</li>
</ul>
</li>
</ul>



<p>In this case, <strong>Recency</strong>, <strong>Frequency</strong>, and <strong>Monetary</strong> are independent variables used to predict the likelihood of <strong>churn</strong>. See also <a href="https://mietwood.com/python-for-business-analytics-2">Python for business analytics &#8211; rfm analysis</a></p>
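


<p>To make the three inputs concrete, here is a small sketch that derives them from a list of purchases. It assumes recency is measured in days since the last purchase, frequency is the number of purchases, and monetary is the total amount spent; the function name <code>rfm_features</code> and the purchase data are invented for illustration.</p>

```python
from datetime import date


def rfm_features(purchases, today):
    """Compute recency (days), freq (count) and monetary (total spend) per customer.

    `purchases` is a list of (customer_id, purchase_date, amount) tuples.
    """
    totals = {}
    for customer, day, amount in purchases:
        row = totals.setdefault(customer, {"last": day, "freq": 0, "monetary": 0.0})
        row["last"] = max(row["last"], day)  # keep the most recent purchase date
        row["freq"] += 1
        row["monetary"] += amount
    return {
        customer: {
            "recency": (today - row["last"]).days,
            "freq": row["freq"],
            "monetary": row["monetary"],
        }
        for customer, row in totals.items()
    }


# Made-up purchases, for illustration only
purchases = [
    ("A", date(2025, 9, 1), 40.0),
    ("A", date(2025, 9, 20), 60.0),
    ("B", date(2025, 6, 15), 25.0),
]
X = rfm_features(purchases, today=date(2025, 9, 25))
print(X["A"])
```

<p>Each customer then has one row of independent variables; the churn flag would be stored separately as the dependent variable.</p>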



<h3 class="wp-block-heading" id="whycheckforindependenceamongindependentvariables">Why check for independence among independent variables?</h3>



<p>If an independent variable is <strong>highly correlated with other independent variables</strong>, i.e., it is not truly independent, the result is <strong>multicollinearity</strong>, which makes model coefficients unstable, reduces interpretability, and can lead to misleading conclusions.</p>



<h2 class="wp-block-heading"><strong>Multicollinearity</strong></h2>



<p><strong>Multicollinearity</strong> refers to a statistical phenomenon in which two or more independent variables in a regression model are highly correlated. This makes it difficult to determine the individual effect of each variable on the dependent variable because they essentially carry overlapping information.</p>



<h3 class="wp-block-heading"><strong>Assessment of Multicollinearity</strong></h3>



<p>To assess multicollinearity, you can use the following methods:</p>



<ol class="wp-block-list">
<li><strong>Correlation Matrix</strong>
<ul class="wp-block-list">
<li>Check pairwise correlations between independent variables.</li>



<li>High correlation (e.g., > 0.8 or &lt; -0.8) may indicate multicollinearity.</li>
</ul>
</li>



<li><strong>Variance Inflation Factor (VIF)</strong>
<ul class="wp-block-list">
<li>Measures how much the variance of a regression coefficient is inflated due to multicollinearity.</li>



<li>A <strong>VIF above 5</strong> (or, by a looser rule, above 10) is often considered problematic.</li>
</ul>
</li>



<li><strong>Tolerance</strong>
<ul class="wp-block-list">
<li>Tolerance = 1 / VIF.</li>



<li>Low tolerance values (close to 0) indicate high multicollinearity.</li>
</ul>
</li>



<li><strong>Condition Index and Eigenvalues</strong>
<ul class="wp-block-list">
<li>Part of a more advanced diagnostic using matrix decomposition.</li>



<li>A <strong>condition index > 30</strong> may suggest serious multicollinearity.</li>
</ul>
</li>
</ol>



<h3 class="wp-block-heading"><strong>How to deal with multicollinearity?</strong></h3>



<ul class="wp-block-list">
<li><strong>Remove one of the correlated variables.</strong></li>



<li><strong>Combine variables</strong> (e.g., using PCA or creating an index).</li>



<li><strong>Regularization techniques</strong> like Ridge or Lasso regression.</li>



<li><strong>Centering variables</strong> (subtracting the mean) can help when the collinearity comes from polynomial or interaction terms built from the same variable.</li>
</ul>



<h2 class="wp-block-heading">Calculation example</h2>



<p>Assume you have data similar to this sample.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="488" height="337" src="https://mietwood.com/wp-content/uploads/2025/09/image-4.jpg" alt="" class="wp-image-3317" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-4.jpg 488w, https://mietwood.com/wp-content/uploads/2025/09/image-4-300x207.jpg 300w" sizes="(max-width: 488px) 100vw, 488px" /><figcaption class="wp-element-caption">RFM data sample &#8211; for testing independent variables </figcaption></figure>
</div>


<h2 class="wp-block-heading">Variable independence testing</h2>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd

# Load the data with the specified delimiter
df = pd.read_csv("RFM_analysis_614.csv", delimiter=",")

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Select the independent variables (RFM)
X = df[&#91;'recency', 'freq', 'monetary'&#93;]

# Add a constant for the VIF calculation (required by the statsmodels function)
X = add_constant(X)

# Create a DataFrame to hold the VIF results
vif_data = pd.DataFrame()
vif_data&#91;"Variable"&#93; = X.columns
vif_data&#91;"VIF"&#93; = [variance_inflation_factor(X.values, i) for i in range(X.shape&#91;1&#93;)]

# Exclude the constant row from the final output since it's not a true variable
vif_data = vif_data&#91;vif_data.Variable != 'const'&#93;.reset_index(drop=True)

print(vif_data)

# Save the VIF results to a CSV file
vif_data.to_csv("vif_results.csv", index=False)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Load</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">specified</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">delimiter</span></span>
<span class="line"><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">read_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">RFM_analysis_614.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">delimiter</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">,</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">stats</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">outliers_influence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">variance_inflation_factor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">tools</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">tools</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">add_constant</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Select</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">independent</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">variables</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">RFM</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">[&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">recency</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">freq</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">monetary</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Add</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">constant</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">calculation</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">required</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">add_constant</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">hold</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">results</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Variable</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93; = </span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">columns</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">VIF</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93; = [</span><span style="color: #8FBCBB">variance_inflation_factor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">values</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">range</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">shape</span><span style="color: #D8DEE9FF">&#91;1&#93;)]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Exclude</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">constant</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">final</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">output</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">since</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">it</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">s not a true variabl</span><span style="color: #D8DEE9">e</span></span>
<span class="line"><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">vif_data</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">Variable</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">!=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">const</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">reset_index</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">drop</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Save</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">results</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">CSV</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">file</span></span>
<span class="line"><span style="color: #D8DEE9">vif_data</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">to_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">vif_results.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">False</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p>Finally, the program prints the following results:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="179" height="85" src="https://mietwood.com/wp-content/uploads/2025/09/image-5.jpg" alt="" class="wp-image-3318"/></figure>
</div>


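<p>Based on the Variance Inflation Factor (VIF) calculation, the columns <strong>recency</strong>, <strong>freq</strong> (frequency), and <strong>monetary</strong> show <strong>no problematic multicollinearity</strong>: none of them is strongly linearly predictable from the other two, so their information <strong>overlaps only slightly</strong>. A low VIF rules out strong linear dependence; it does not prove full statistical independence.</p>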



<p>This means that you can use all three variables together as independent predictors in a statistical model, such as a Cox Proportional Hazards (Cox PH) model, without concern for severe multicollinearity.</p>



<h2 class="wp-block-heading">Variance Inflation Factor (VIF) Results</h2>



<p>The <a href="https://en.wikipedia.org/wiki/Variance_inflation_factor" target="_blank" rel="noopener">VIF</a> measures how much the variance of an estimated regression coefficient is inflated by collinearity with the other predictors. A common rule of thumb is that a VIF value <strong>less than 5</strong> (or, by a looser convention, less than 10) indicates that the correlation between the variables is not high enough to warrant concern.</p>



<p>The calculated VIF values for your RFM variables are very low:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Variable</td><td>VIF</td></tr></thead><tbody><tr><td><strong>recency</strong></td><td>1.437</td></tr><tr><td><strong>freq</strong></td><td>1.488</td></tr><tr><td><strong>monetary</strong></td><td>1.065</td></tr></tbody></table></figure>
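


<p>These values can be read more intuitively by converting them. Since Tolerance = 1 / VIF, and tolerance equals 1 minus the R² obtained when one predictor is regressed on the others, each VIF implies how much of a variable&#8217;s variance the other two explain. The short sketch below only does this arithmetic on the values from the table; the variable names are arbitrary.</p>

```python
# VIF values copied from the table above
vif = {"recency": 1.437, "freq": 1.488, "monetary": 1.065}

summary = {}
for name, value in vif.items():
    tolerance = 1.0 / value      # Tolerance = 1 / VIF
    r_squared = 1.0 - tolerance  # share of this variable's variance explained by the other two
    summary[name] = {"tolerance": tolerance, "r_squared": r_squared}
    print(f"{name}: tolerance={tolerance:.3f}, R^2={r_squared:.3f}")
```

<p>Roughly 30% of the variance in recency, 33% in freq, and 6% in monetary is shared with the other two variables, so some overlap exists but it is far from the levels that the rule of thumb treats as a problem.</p>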






<h2 class="wp-block-heading">Conclusion on Independence</h2>



<p>Since all VIF values are close to 1.0 and well below the 5.0 threshold:</p>



<ul class="wp-block-list">
<li><strong>Independent Variables:</strong> You can confidently treat <strong>recency</strong>, <strong>frequency</strong>, and <strong>monetary</strong> as independent variables for your statistical analysis (e.g., in a Cox PH model).</li>



<li><strong>No Informational Overlap:</strong> The variables are providing distinct, non-redundant information to the model. For instance, knowing a customer&#8217;s frequency does not allow the model to strongly predict their recency or monetary value.</li>
</ul>
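<p>To make "does not strongly predict" concrete, a VIF can be converted back into the share of a variable&#8217;s variance explained by the other predictors, using VIF = 1 / (1 - R&#178;). The short sketch below applies that formula to the three reported values. The function name <code>shared_variance</code> is an illustrative assumption.</p>

```python
def shared_variance(vif):
    """R^2 of one predictor regressed on the others, from VIF = 1 / (1 - R^2)."""
    if vif < 1:
        raise ValueError("VIF cannot be below 1")
    return 1.0 - 1.0 / vif


# VIF values reported in the table above
reported = {"recency": 1.437, "frequency": 1.488, "monetary": 1.065}
for name, value in reported.items():
    print(f"{name}: R^2 = {shared_variance(value):.3f}")
# recency: R^2 = 0.304
# frequency: R^2 = 0.328
# monetary: R^2 = 0.061
```

<p>In other words, roughly a third of the variance in recency and frequency is shared with the other two variables, and only about 6% of the variance in monetary. At the 5.0 threshold the shared variance would be 80%.</p>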



<p>The post <a rel="nofollow" href="https://mietwood.com/independent-variables">What are independent variables?</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Top IDEs for Python developers</title>
		<link>https://mietwood.com/ides-for-python-developers</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 09:11:31 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3312</guid>

					<description><![CDATA[<p>Here are the top graphical IDEs (Integrated Development Environments) for Python developers as of 2024: 1.&#160;PyCharm 2.&#160;Visual Studio Code (VS Code) 3.&#160;Spyder 4.&#160;Thonny 5.&#160;Wing IDE 6.&#160;Eric 7.&#160;IDLE Summary: For professional development,&#160;PyCharm&#160;and&#160;VS Code&#160;are the most popular. For data science,&#160;Spyder&#160;is widely used. For beginners,&#160;Thonny&#160;or&#160;IDLE&#160;are great choices.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/ides-for-python-developers">Top IDEs for Python developers</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Here are the top graphical IDEs (Integrated Development Environments) for Python developers</strong> as of 2024:</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">1.&nbsp;<strong>PyCharm</strong></h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="421" src="https://mietwood.com/wp-content/uploads/2025/09/image-3.jpg" alt="" class="wp-image-3313" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-3.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/09/image-3-300x123.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-3-768x316.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<ul class="wp-block-list">
<li><strong>Developer:</strong> JetBrains</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Advanced code analysis, smart code completion, integrated debugger, Git support, virtual environment management, Django support.</li>



<li><strong>Editions:</strong> Community (free) and Professional (paid).</li>



<li><strong>Website:</strong> <a href="https://www.jetbrains.com/pycharm/" target="_blank" rel="noreferrer noopener">PyCharm</a></li>
</ul>



<h3 class="wp-block-heading">2.&nbsp;<strong>Visual Studio Code (VS Code)</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Microsoft</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Lightweight but powerful; excellent Python extension; integrated terminal; rich plugin ecosystem; Git support; Jupyter notebook integration.</li>



<li><strong>Website:</strong> <a href="https://code.visualstudio.com/" target="_blank" rel="noreferrer noopener">VS Code</a></li>
</ul>



<h3 class="wp-block-heading">3.&nbsp;<strong>Spyder</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Spyder Project Contributors (Spyder stands for Scientific Python Development Environment)</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Focused on scientific computing and data science; variable explorer; integrated IPython console; plotting support.</li>



<li><strong>Website:</strong> <a href="https://www.spyder-ide.org/" target="_blank" rel="noreferrer noopener">Spyder</a></li>
</ul>



<h3 class="wp-block-heading">4.&nbsp;<strong>Thonny</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> University of Tartu</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Beginner-friendly; simple UI; built-in debugger; good for learning and education.</li>



<li><strong>Website:</strong> <a href="https://thonny.org/" target="_blank" rel="noreferrer noopener">Thonny</a></li>
</ul>



<h3 class="wp-block-heading">5.&nbsp;<strong>Wing IDE</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Wingware</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Powerful debugger; code intelligence; remote development support.</li>



<li><strong>Website:</strong> <a href="https://wingware.com/" target="_blank" rel="noreferrer noopener">Wing IDE</a></li>
</ul>



<h3 class="wp-block-heading">6.&nbsp;<strong>Eric</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Detlev Offenbach</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Full-featured Python and Ruby IDE; integrated debugger; plugin support.</li>



<li><strong>Website:</strong> <a href="https://eric-ide.python-projects.org/" target="_blank" rel="noreferrer noopener">Eric Python IDE</a></li>
</ul>



<h3 class="wp-block-heading">7.&nbsp;<strong>IDLE</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Python Software Foundation (bundled with Python)</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Basic, lightweight, good for quick scripts and learning.</li>



<li><strong>Website:</strong> <a href="https://docs.python.org/3/library/idle.html" target="_blank" rel="noreferrer noopener">IDLE Documentation</a></li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Summary:</strong><br>For professional development,&nbsp;<strong>PyCharm</strong>&nbsp;and&nbsp;<strong>VS Code</strong>&nbsp;are the most popular. For data science,&nbsp;<strong>Spyder</strong>&nbsp;is widely used. For beginners,&nbsp;<strong>Thonny</strong>&nbsp;or&nbsp;<strong>IDLE</strong>&nbsp;are great choices.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/ides-for-python-developers">Top IDEs for Python developers</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Scrape a Website and Search Inside PDFs with Python</title>
		<link>https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sat, 30 Aug 2025 09:13:30 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3292</guid>

					<description><![CDATA[<p>Ever found yourself on a webpage with dozens of PDF links, needing to find a specific piece of information buried in one of them? 😩 We will teach you how to scrape a website and search inside PDFs with Python. Manually downloading and searching each file is tedious, time-consuming, and prone to errors. What if...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python">How to Scrape a Website and Search Inside PDFs with Python</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Ever found yourself on a webpage with dozens of PDF links, needing to find a specific piece of information buried in one of them? 😩 We will teach you how to scrape a website and search inside PDFs with Python. Manually downloading and searching each file is tedious, time-consuming, and prone to errors. What if you could automate the entire process with just a few lines of code?</p>



<p>In this tutorial, we&#8217;ll show you exactly how to do that. We&#8217;ll build a simple Python script that automatically scans a webpage, finds all the PDF links, and searches for specific text inside each one. Using popular libraries like <strong>Requests</strong>, <strong>BeautifulSoup</strong>, and <strong>pypdf</strong>, you&#8217;ll learn a practical skill that can save you hours of manual work. Let&#8217;s get started!</p>






<h2 class="wp-block-heading">Scrape a Website</h2>



<p>The script works in six steps. It <strong>finds</strong> all links on the initial page and <strong>filters</strong> for links that end with <code>.pdf</code>. Then, for each PDF link, it <strong>downloads</strong> the PDF file into memory, <strong>extracts</strong> the text from every page, and <strong>searches</strong> the extracted text for your <code>search_string</code>. Finally, it <strong>reports</strong> which PDF files contain the phrase.</p>
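<p>The filtering step deserves a closer look, because a plain <code>endswith('.pdf')</code> check misses links such as <code>report.PDF</code> or <code>report.pdf?download=1</code>. The sketch below shows one possible way to make the check more tolerant. To stay runnable without extra packages it collects links with the standard library&#8217;s <code>HTMLParser</code> as a stand-in for BeautifulSoup; the names <code>LinkCollector</code>, <code>is_pdf_link</code>, and <code>pdf_links_in</code> are illustrative assumptions.</p>

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (a stand-in for BeautifulSoup here)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def is_pdf_link(href):
    """True if the path part of the link ends with .pdf, ignoring case and query strings."""
    return urlparse(href).path.lower().endswith(".pdf")


def pdf_links_in(html):
    """Return the hrefs of all PDF links found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if is_pdf_link(href)]


sample = (
    '<a href="/docs/a.pdf">A</a>'
    '<a href="b.PDF?download=1">B</a>'
    '<a href="/about">About</a>'
)
print(pdf_links_in(sample))
```

<p>The same <code>is_pdf_link</code> test can be dropped into a BeautifulSoup list comprehension in place of the <code>endswith</code> call.</p>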



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader
import io

def find_linked_pdfs(url):
    """
    Scans a webpage and returns the PDF links found on it.

    Args:
        url: The URL of the webpage to scan.
    """
    print(f"Scanning {url} for PDF links...")
    try:
        # 1. Get the main page to find all links
        base_url_parts = requests.utils.urlparse(url)
        base_url = f"{base_url_parts.scheme}://{base_url_parts.netloc}"
        
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        pdf_links = [a&#91;'href'&#93; \
          for a in soup.find_all('a', href=True) \
          if a&#91;'href'&#93;.endswith('.pdf')]
        
        if not pdf_links:
            print("No PDF links found on the page.")
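            return []

        print(f"Found {len(pdf_links)} PDF files. Now searching inside them...")

    except requests.exceptions.RequestException as e:
        print(f"An error occurred fetching the main URL: {e}")
        return []
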
    return pdf_links</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">requests</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bs4</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">BeautifulSoup</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pypdf</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">PdfReader</span></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">io</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">find_linked_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">url</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    Scans a webpage and returns the PDF links found on it.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #A3BE8C">    Args:</span></span>
<span class="line"><span style="color: #A3BE8C">        url: The URL of the webpage to scan.</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    print(f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #8FBCBB">Scanning</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span><span style="color: #8FBCBB">url</span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">links</span><span style="color: #D8DEE9FF">...</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">try</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        # 1. </span><span style="color: #8FBCBB">Get</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">main</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">page</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">find</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">all</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">links</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">base_url_parts</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">requests</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">utils</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">urlparse</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">base_url</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">{base_url_parts.scheme}://{base_url_parts.netloc}</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">requests</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">raise_for_status</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">soup</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">BeautifulSoup</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">text</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">html.parser</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">pdf_links</span><span style="color: #D8DEE9FF"> = [</span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">href</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; \</span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">soup</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">find_all</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">a</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">href</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">) \</span></span>
<span class="line"><span style="color: #D8DEE9FF">          </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">href</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;.</span><span style="color: #8FBCBB">endswith</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">.pdf</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)]</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pdf_links</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">No PDF links found on the page.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
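<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Found {len(pdf_links)} PDF files. Now searching inside them...</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">except</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">requests</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">exceptions</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">RequestException</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">e</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">An error occurred fetching the main URL: {e}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"></span>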
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pdf_links</span></span></code></pre></div>



<p><strong>Handling URLs</strong>: It constructs a full, absolute URL for each PDF, as many links on a page can be relative (e.g., <code>/path/to/file.pdf</code>). The links themselves are collected with BeautifulSoup: <a href="https://pypi.org/project/beautifulsoup4/" target="_blank" rel="noopener">https://pypi.org/project/beautifulsoup4/</a></p>
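<p>One robust way to build those absolute URLs is <code>urllib.parse.urljoin</code> from the standard library, which also copes with links that have no leading slash or that climb directories with <code>..</code>. The sketch below is a possible variation rather than the script&#8217;s own approach; the helper name <code>to_absolute</code> and the example URLs are illustrative assumptions.</p>

```python
from urllib.parse import urljoin


def to_absolute(page_url, href):
    """Resolve a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


page = "https://example.com/reports/2025/index.html"

print(to_absolute(page, "/path/to/file.pdf"))
# https://example.com/path/to/file.pdf
print(to_absolute(page, "q1.pdf"))
# https://example.com/reports/2025/q1.pdf
print(to_absolute(page, "../2024/q4.pdf"))
# https://example.com/reports/2024/q4.pdf
print(to_absolute(page, "https://cdn.example.org/x.pdf"))
# https://cdn.example.org/x.pdf
```

<p>Because already absolute links pass through unchanged, the helper can be applied to every link without checking for <code>http://</code> or <code>https://</code> first.</p>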



<p><strong>In-Memory Processing</strong>: Instead of saving each PDF to your disk, it uses <code>io.BytesIO</code> to treat the downloaded content as a file in your computer&#8217;s memory. This avoids writing temporary files and cleaning them up afterwards.</p>
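<p>To illustrate what "a file in memory" means, the sketch below wraps some bytes in <code>io.BytesIO</code> and reads from them like a file. It also adds an optional sanity check that is not part of the script: PDF files begin with the bytes <code>%PDF-</code>, so a quick look at the header can catch servers that return an HTML error page for a <code>.pdf</code> URL. The helper name <code>looks_like_pdf</code> and the sample bytes are illustrative assumptions.</p>

```python
import io


def looks_like_pdf(content: bytes) -> bool:
    """True if the downloaded bytes start with the PDF header."""
    return content.startswith(b"%PDF-")


# Stand-in for pdf_response.content; not a complete PDF
content = b"%PDF-1.7 example bytes"

buffer = io.BytesIO(content)   # behaves like a file opened in binary mode
header = buffer.read(5)        # read the first five bytes
buffer.seek(0)                 # rewind so a PDF parser starts at the beginning

print(header, looks_like_pdf(content))
```

<p>Rewinding with <code>seek(0)</code> only matters if you read from the buffer yourself before handing it to a parser.</p>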



<p><strong>Text Extraction</strong>: The <code>pypdf</code> library&#8217;s <code>PdfReader</code> opens this in-memory file. The script then loops through each page, calls <code>extract_text()</code>, and combines the text from all pages.</p>
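<p>The page loop can be sketched on its own as a small function. To keep the example runnable without pypdf installed, it works on any objects that offer an <code>extract_text()</code> method; <code>FakePage</code> is a hypothetical stand-in for a real page object, and <code>combine_page_text</code> is an illustrative name. Unlike the script, this version joins pages with a newline so the last word of one page is not glued to the first word of the next.</p>

```python
def combine_page_text(pages):
    """Join the text of all pages, treating a missing result as an empty string."""
    parts = []
    for page in pages:
        # Same defensive guard as in the script: fall back to "" if nothing comes back
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


class FakePage:
    """Hypothetical stand-in for a PDF page, for demonstration only."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("Hello"), FakePage(None), FakePage("World")]
print(repr(combine_page_text(pages)))
```

<p>With a real reader you would pass <code>reader.pages</code> instead of the fake list.</p>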



<p><strong>Searching and Reporting</strong>: Finally, it performs a case-insensitive search on the extracted text and prints the URL of any PDF that contains your search term.</p>
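<p>Text extracted from PDFs often contains line breaks and runs of spaces in the middle of a phrase, which can make a simple <code>in</code> test miss a match. A possible refinement, sketched below, is to collapse whitespace and fold case on both sides before comparing. The function names <code>normalize</code> and <code>contains_phrase</code> are illustrative assumptions, not part of the script.</p>

```python
import re


def normalize(text):
    """Collapse all whitespace runs to single spaces and fold case."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def contains_phrase(text, phrase):
    """Case-insensitive, whitespace-tolerant substring search."""
    return normalize(phrase) in normalize(text)


extracted = "Annual\nReport   2025\nPrepared by the finance team"
print(contains_phrase(extracted, "annual report"))   # True
print(contains_phrase(extracted, "quarterly"))       # False
```

<p>This still matches on substrings only, so a search for <code>port</code> would also match <code>Report</code>.</p>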



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>def find_text_in_pdfs(pdf_links, search_string, base_url=""):
    """
    Downloads each PDF and searches its text for a string.

    Args:
        pdf_links: PDF URLs (absolute, or relative to base_url).
        search_string: The string to search for inside the PDFs.
        base_url: URL of the scanned page, used to resolve relative links.
    """
    found_in_files = []

    # 2. Loop through each PDF link
    for pdf_path in pdf_links:
        # Construct absolute URL if the link is relative
        pdf_url = requests.compat.urljoin(base_url, pdf_path)

        try:
            # 3. Download the PDF content
            pdf_response = requests.get(pdf_url)
            pdf_response.raise_for_status()

            # Use an in-memory buffer to read the PDF
            pdf_file = io.BytesIO(pdf_response.content)
            reader = PdfReader(pdf_file)

            # 4. Extract text and search
            full_text = ""
            for page in reader.pages:
                full_text += page.extract_text() or ""

            if search_string.lower() in full_text.lower():
                print(f"✔️ Found '{search_string}' in: {pdf_url}")
                found_in_files.append(pdf_url)

        except Exception as e:
            print(f"⚠️ Could not process {pdf_url}. Reason: {e}")

    if not found_in_files:
        print(f"\nSearch complete. The string '{search_string}' was not found in any of the PDFs.")

    return found_in_files
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_text_in_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_links</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_string</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">base_url</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;&quot;</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        # </span><span style="color: #B48EAD">2.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Loop</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">through</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">each</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">link</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">found_in_files</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> pdf_links</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            # </span><span style="color: #D8DEE9">Construct</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">absolute</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">URL</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">link</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">relative</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">startswith</span><span style="color: #D8DEE9FF">((</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">http://</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">https://</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)):</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">{base_url}{pdf_path}</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">else</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">try</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #B48EAD">3.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Download</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">content</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">requests</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_response</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">raise_for_status</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #D8DEE9">Use</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">an</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in-</span><span style="color: #D8DEE9">memory</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">buffer</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">read</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_file</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">io</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">BytesIO</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_response</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">content</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">reader</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">PdfReader</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_file</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #B48EAD">4.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Extract</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">full_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">page</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reader</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">pages</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #D8DEE9">full_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">+=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">page</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">extract_text</span><span style="color: #D8DEE9FF">() </span><span style="color: #D8DEE9">or</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_string</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">() </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">full_text</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">✔️ Found &#39;{search_string}&#39; in: {pdf_url}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #D8DEE9">found_in_files</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">except</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Exception</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">⚠️ Could not process {pdf_url}. Reason: {e}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">not</span><span style="color: #D8DEE9FF"> found_in_files</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Search complete. The string &#39;{search_string}&#39; was not found in any of the PDFs.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">except</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">requests</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">exceptions</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">RequestException</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">An error occurred fetching the main URL: {e}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>






<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>if __name__ == "__main__":
    target_url = "https://www.umcs.pl/pl/plany-zajec,10795.htm"
    search_term = "programming"
    pdf_links = find_linked_pdfs(target_url)
    find_text_in_pdfs(pdf_links, search_term)
    </textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__name__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">==</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">__main__</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">target_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">https://www.umcs.pl/pl/plany-zajec,10795.htm</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">search_term</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">programming</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">pdf_links</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_linked_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">target_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">find_text_in_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_links</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_term</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span></span></code></pre></div>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="769" height="255" src="https://mietwood.com/wp-content/uploads/2025/08/image-8.jpg" alt="Scrape a Website and Search Inside PDFs with Python" class="wp-image-3293" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-8.jpg 769w, https://mietwood.com/wp-content/uploads/2025/08/image-8-300x99.jpg 300w" sizes="auto, (max-width: 769px) 100vw, 769px" /><figcaption class="wp-element-caption">Scrape a Website and Search Inside PDFs with Python</figcaption></figure>
</div>
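
<p>The URL-building step in <code>find_text_in_pdfs</code> joins <code>base_url</code> and <code>pdf_path</code> by plain concatenation, which works as long as every relative link starts with a slash. A minimal sketch of a more forgiving alternative, assuming you have the URL of the page the link was found on, is the standard library&#8217;s <code>urllib.parse.urljoin</code>. The helper name <code>resolve_pdf_url</code> and the sample paths are hypothetical and only illustrate the idea.</p>

<pre class="wp-block-code"><code>from urllib.parse import urljoin


def resolve_pdf_url(page_url, pdf_path):
    """Return an absolute URL for a link found on page_url (hypothetical helper)."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the page they were found on.
    return urljoin(page_url, pdf_path)


page = "https://www.umcs.pl/pl/plany-zajec,10795.htm"
print(resolve_pdf_url(page, "/files/plan.pdf"))               # root-relative link
print(resolve_pdf_url(page, "docs/plan.pdf"))                 # page-relative link
print(resolve_pdf_url(page, "https://example.com/plan.pdf"))  # already absolute
</code></pre>

<p>With this approach the <code>startswith</code> check becomes optional, because absolute links pass through unchanged.</p>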


<h3 class="wp-block-heading"><strong>Wrapping Up and Next Steps</strong></h3>



<p>Congratulations! You&#8217;ve built an automation script that bridges the gap between web scraping and document analysis. By combining <strong>Requests</strong>, <strong>BeautifulSoup</strong>, and <strong>PyPDF</strong>, you can now programmatically find information that was previously locked away inside PDF files linked from a web page. This saves a lot of manual work and opens up new possibilities for data collection and analysis. Feel free to adapt the code for your own projects and take your web scraping skills to the next level.</p>



<p>The applications for this technique extend far beyond a single use case. Imagine using this script for <strong>academic research</strong>, automatically scanning university archives for papers mentioning a specific topic. You could adapt it for <strong>financial analysis</strong> by pulling keywords from dozens of quarterly earnings reports, or for <strong>legal work</strong> by searching through court filings for a particular case name. Job seekers could even use it to scan company websites for PDF job descriptions that contain key skills. </p>



<p>To perform a statistical analysis of the overall economy, you can leverage a variety of online resources, including government and intergovernmental data portals, as well as academic publications. These sources often provide data in structured formats like CSVs and APIs, but also in less-structured formats like HTML tables and PDFs, which can be parsed using Python libraries like Beautiful Soup and pypdf.</p>



<h3 class="wp-block-heading"><strong>Government and Intergovernmental Data Sources</strong></h3>



<p>For raw, official economic data, these are your most reliable sources. They offer a wealth of information on everything from GDP and inflation to employment rates and international trade.</p>



<ul class="wp-block-list">
<li><strong>Federal Reserve Economic Data (FRED)</strong>: A fantastic resource from the St. Louis Fed, FRED offers over 800,000 economic time series from more than 100 sources. It&#8217;s a goldmine for anyone doing macroeconomic analysis.</li>



<li><strong>The World Bank Open Data</strong>: This portal provides comprehensive global development data, including indicators on economic policy, poverty, gender, and more, making it perfect for cross-country comparisons.</li>



<li><strong>Data.gov</strong>: The home of U.S. government open data, this site aggregates datasets from various federal agencies, including the Bureau of Economic Analysis (BEA) and the Bureau of Labor Statistics (BLS).</li>



<li><strong>United Nations Statistics Division (UNSD)</strong>: The UNSD offers a wide array of international statistics, including the UNdata portal which provides free access to over 60 million statistical records from various UN agencies.</li>



<li><strong>The Bureau of Economic Analysis (BEA)</strong>: The BEA produces some of the most critical U.S. economic statistics, such as GDP, personal income, and corporate profits.</li>
</ul>



<p>You can read about the business analyst career path <a href="https://mietwood.com/the-allure-of-business-analysis-as-a-career-path">here</a>.</p>



<p>The core principle remains the same: automate the discovery of information, no matter the format. Happy coding! 🚀</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python">How to Scrape a Website and Search Inside PDFs with Python</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python for analysts most important datetime functions</title>
		<link>https://mietwood.com/python-for-analysts-most-important-datetime-functions</link>
					<comments>https://mietwood.com/python-for-analysts-most-important-datetime-functions#comments</comments>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sun, 20 Jul 2025 16:18:38 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3211</guid>

					<description><![CDATA[<p>Python’s powerful date and time functions using the datetime and pandas libraries gives you a robust date table ready for Power BI and other business intelligence and analytical tools. Python for analysts most important datetime functions. Mastering Date and Time Functions in Python for Power BI Date Tables When working with Power BI, a well-structured...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/python-for-analysts-most-important-datetime-functions">Python for analysts most important datetime functions</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Python’s <strong>date and time functions</strong> in the <code>datetime</code> and <code>pandas</code> libraries let you build a robust date table ready for Power BI and other business intelligence and analytical tools.</p>



<h2 class="wp-block-heading" id="masteringdateandtimefunctionsinpythonforpowerbidatetables">Mastering Date and Time Functions in Python for Power BI Date Tables</h2>



<p>When working with Power BI, a well-structured <strong>Date Table</strong> is essential for time intelligence calculations like YTD, QTD, MTD, and custom period comparisons. While Power BI has built-in date table features, using <strong>Python</strong> to generate a custom date table gives you full control over the structure, granularity, and logic.</p>



<p>In this post, we’ll explore Python’s powerful <strong>date and time functions</strong> using the <code>datetime</code> and <code>pandas</code> libraries, and show how to create a robust date table ready for Power BI.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="1pythondateandtimebasics">Python for analysts: date and time basics</h2>



<p>Python provides the <a href="https://docs.python.org/3/library/datetime.html" target="_blank" rel="noopener"><code>datetime</code> module</a> to work with dates and times. Here&#8217;s a quick overview:</p>



<pre class="wp-block-code"><code>from datetime import datetime, timedelta, date

# Current date and time
now = datetime.now()
print("Now:", now)

# Just the date
today = date.today()
print("Today:", today)

# Add 7 days
next_week = today + timedelta(days=7)
print("Next week:", next_week)

# Subtract 30 days
last_month = today - timedelta(days=30)
print("30 days ago:", last_month)
</code></pre>



<p>These functions are the foundation for generating date ranges and calculating custom columns like fiscal periods or holidays.</p>
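
<p>Analysts also tend to need parsing, formatting, and date differences. The sketch below uses only the standard library; the sample timestamp and variable names are made up for illustration.</p>

<pre class="wp-block-code"><code>from datetime import datetime, date

# Parse a string into a datetime (the format must match the input exactly)
order_dt = datetime.strptime("2025-07-20 16:18", "%Y-%m-%d %H:%M")

# Format a datetime back into a string, here as a year-month label
label = order_dt.strftime("%Y-%m")
print("Label:", label)

# Subtracting two dates gives a timedelta; .days is the whole-day difference
days_open = (date(2025, 7, 20) - date(2025, 1, 1)).days
print("Days since Jan 1:", days_open)

# ISO calendar parts: ISO year, ISO week number, ISO weekday (Monday = 1)
iso_year, iso_week, iso_weekday = date(2025, 7, 20).isocalendar()
print("ISO week:", iso_week, "weekday:", iso_weekday)
</code></pre>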



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="2creatingadaterangewithpandas">Creating a Date Range with Pandas</h2>



<p>To build a date table, we need a continuous range of dates. <code>pandas.date_range()</code> is perfect for this:</p>



<pre class="wp-block-code"><code>import pandas as pd

# Generate a date range from 2020 to 2030
date_range = pd.date_range(start='2020-01-01', end='2030-12-31', freq='D')
df = pd.DataFrame({'Date': date_range})
</code></pre>



<p>This gives us a DataFrame with one row per day — the backbone of our date table.</p>
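
<p>If you would rather derive the range from your data than hard-code it, one option is to extend it to whole calendar years around the earliest and latest dates you actually have, since time intelligence calculations generally expect a date table without gaps. This is a sketch under that assumption; <code>sales_dates</code> is a hypothetical stand-in for the date column of your fact table.</p>

<pre class="wp-block-code"><code>import pandas as pd

# Hypothetical fact-table dates; in practice these come from your own data
sales_dates = pd.Series(pd.to_datetime(["2023-03-15", "2024-11-02", "2025-06-30"]))

# Extend the range to whole calendar years around the observed dates
start = pd.Timestamp(year=sales_dates.min().year, month=1, day=1)
end = pd.Timestamp(year=sales_dates.max().year, month=12, day=31)

date_range = pd.date_range(start=start, end=end, freq="D")
df = pd.DataFrame({"Date": date_range})
print(len(df), df["Date"].min().date(), df["Date"].max().date())
</code></pre>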



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="3enrichingthedatetable">Enriching the Date Table</h2>



<p>Now let’s add useful columns for Power BI:</p>



<pre class="wp-block-code"><code>df&#91;'Year'] = df&#91;'Date'].dt.year
df&#91;'Month'] = df&#91;'Date'].dt.month
df&#91;'MonthName'] = df&#91;'Date'].dt.strftime('%B')
df&#91;'Quarter'] = df&#91;'Date'].dt.quarter
df&#91;'Day'] = df&#91;'Date'].dt.day
df&#91;'Weekday'] = df&#91;'Date'].dt.weekday + 1  # Monday = 1
df&#91;'WeekdayName'] = df&#91;'Date'].dt.strftime('%A')
df&#91;'IsWeekend'] = df&#91;'Weekday'].isin(&#91;6, 7])
df&#91;'Week'] = df&#91;'Date'].dt.isocalendar().week
df&#91;'DayOfYear'] = df&#91;'Date'].dt.dayofyear
</code></pre>



<p>These columns allow for slicing and dicing your data in Power BI by year, month, weekday, and more.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="4fiscalcalendarsupport">Fiscal Calendar Support</h2>



<p>Many businesses use fiscal calendars that don’t align with the calendar year. Here’s how to add a fiscal year starting in July:</p>



<pre class="wp-block-code"><code>df&#91;'FiscalYear'] = df&#91;'Date'].apply(lambda x: x.year if x.month &lt; 7 else x.year + 1)
df&#91;'FiscalMonth'] = df&#91;'Date'].apply(lambda x: x.month - 6 if x.month &gt;= 7 else x.month + 6)
df&#91;'FiscalQuarter'] = ((df&#91;'FiscalMonth'] - 1) // 3) + 1
</code></pre>



<p>This logic adjusts the fiscal year, month, and quarter based on a July start.</p>
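
<p>If your fiscal year starts in a different month, the same rules can be wrapped in a small function. This is a sketch, not part of the original script: the name <code>fiscal_period</code> and its <code>start_month</code> parameter are hypothetical, and it assumes the fiscal year is named after the calendar year in which it ends, as in the July example.</p>

<pre class="wp-block-code"><code>from datetime import date


def fiscal_period(d, start_month=7):
    """Return (fiscal_year, fiscal_month, fiscal_quarter) for a date."""
    # Months elapsed since the fiscal year started, shifted to 1..12
    fiscal_month = (d.month - start_month) % 12 + 1
    # Dates in or after the start month fall into the next-named fiscal year
    # (unless the fiscal year simply equals the calendar year)
    if start_month != 1 and d.month in range(start_month, 13):
        fiscal_year = d.year + 1
    else:
        fiscal_year = d.year
    fiscal_quarter = (fiscal_month - 1) // 3 + 1
    return fiscal_year, fiscal_month, fiscal_quarter


print(fiscal_period(date(2025, 6, 30)))
print(fiscal_period(date(2025, 7, 1)))
print(fiscal_period(date(2025, 4, 1), start_month=4))
</code></pre>

<p>To use it on the date table, you could apply it per row, for example <code>df['Date'].apply(lambda x: fiscal_period(x)[0])</code> for the fiscal year.</p>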



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="5flagsfortimeintelligence">Flags for Time Intelligence</h2>



<p>Power BI benefits from flags that simplify DAX calculations:</p>



<pre class="wp-block-code"><code>today = pd.to_datetime('today').normalize()

df&#91;'IsToday'] = df&#91;'Date'] == today
df&#91;'IsCurrentMonth'] = (df&#91;'Date'].dt.month == today.month) &amp; (df&#91;'Date'].dt.year == today.year)
df&#91;'IsCurrentYear'] = df&#91;'Date'].dt.year == today.year
</code></pre>



<p>You can also add flags for holidays, fiscal periods, or custom business logic.</p>
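
<p>As an illustration of a holiday flag, the sketch below marks holidays from a hand-written list and derives a working-day flag from it. The two holiday dates and the <code>IsHoliday</code> and <code>IsWorkingDay</code> column names are assumptions for the example; in practice the list would come from an official calendar for your country.</p>

<pre class="wp-block-code"><code>import pandas as pd

df = pd.DataFrame({"Date": pd.date_range("2025-12-22", "2025-12-28", freq="D")})

# Hypothetical holiday list; replace with an official calendar
holidays = pd.to_datetime(["2025-12-25", "2025-12-26"])

df["IsHoliday"] = df["Date"].isin(holidays)
# Weekday numbering as in the date table: Monday = 1 ... Sunday = 7
df["IsWeekend"] = (df["Date"].dt.weekday + 1).isin([6, 7])
# A working day is neither a weekend day nor a listed holiday
df["IsWorkingDay"] = ~(df["IsWeekend"] | df["IsHoliday"])
print(df)
</code></pre>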



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="6exportingtocsvforpowerbi">Exporting to CSV for Power BI</h2>



<p>Once your date table is ready, export it:</p>



<pre class="wp-block-code"><code>df.to_csv('DateTable.csv', index=False)
</code></pre>



<p>You can now import this CSV into Power BI as a static date table.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="7sampleoutput">Sample Output</h2>



<p>Here’s a preview of what your date table might look like:</p>



<figure class="wp-block-table aligncenter is-style-regular has-small-font-size"><table><thead><tr><th>Date</th><th>Year</th><th class="has-text-align-center" data-align="center">Month</th><th>MonthName</th><th>Quarter</th><th>WeekdayName</th><th>IsWeekend</th><th>FiscalYear</th><th>IsToday</th></tr></thead><tbody><tr><td>2025-01-01</td><td>2025</td><td class="has-text-align-center" data-align="center">1</td><td>January</td><td>1</td><td>Wednesday</td><td>False</td><td>2025</td><td>False</td></tr><tr><td>2025-07-01</td><td>2025</td><td class="has-text-align-center" data-align="center">7</td><td>July</td><td>3</td><td>Tuesday</td><td>False</td><td>2026</td><td>False</td></tr><tr><td>2025-12-25</td><td>2025</td><td class="has-text-align-center" data-align="center">12</td><td>December</td><td>4</td><td>Thursday</td><td>False</td><td>2026</td><td>False</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Pandas to_datetime() function</h2>



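<p>Date columns loaded from a file or database often arrive as <code>object</code> (string) columns. The <code>info()</code> output below shows two such columns before and after converting them with <code>pd.to_datetime()</code>.</p>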



<pre class="wp-block-code"><code> #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   CustomerId              13483 non-null  int64         
 1   Dt_first_rew_income     2987 non-null   datetime64&#91;ns]
 2   Dt_first_purchase       13483 non-null  object        
 3   Dt_last_purchase        13483 non-null  object        

# The format string uses strftime codes (%Y-%m-%d), not "yyyy-mm-dd"
df_cust&#91;'Dt_first_purchase'] = pd.to_datetime(df_cust&#91;'Dt_first_purchase'], format="%Y-%m-%d")
df_cust&#91;'Dt_last_purchase'] = pd.to_datetime(df_cust&#91;'Dt_last_purchase'], format="%Y-%m-%d")

 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   CustomerId              13483 non-null  int64         
 1   Dt_first_rew_income     2987 non-null   datetime64&#91;ns]
 2   Dt_first_purchase       13483 non-null  datetime64&#91;ns]
 3   Dt_last_purchase        13483 non-null  datetime64&#91;ns]</code></pre>
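
<p>Real data often contains values that do not parse. A minimal sketch, assuming you prefer missing values over an exception, is to pass <code>errors="coerce"</code>, which turns unparseable entries into <code>NaT</code>. The three sample values below are made up, and one of them is deliberately invalid.</p>

<pre class="wp-block-code"><code>import pandas as pd

# Hypothetical sample with one malformed value (February 30 does not exist)
df_cust = pd.DataFrame({"Dt_first_purchase": ["2024-01-15", "2024-02-30", "2024-03-01"]})

# errors="coerce" turns values that do not match the format into NaT
df_cust["Dt_first_purchase"] = pd.to_datetime(
    df_cust["Dt_first_purchase"], format="%Y-%m-%d", errors="coerce"
)

print(df_cust.dtypes)
print("Unparseable rows:", df_cust["Dt_first_purchase"].isna().sum())
</code></pre>

<p>Counting the <code>NaT</code> values right after conversion is a cheap way to notice data quality problems before they reach a report.</p>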



<h2 class="wp-block-heading" id="8advancedtips">Advanced Tips</h2>



<ul class="wp-block-list">
<li><strong>Holidays</strong>: Use external APIs or CSVs to mark public holidays.</li>



<li><strong>Week Start</strong>: Adjust <code>Weekday</code> to match your locale (e.g., Monday vs. Sunday).</li>



<li><strong>Time Zones</strong>: Use <code>pytz</code> or <code>zoneinfo</code> for timezone-aware datetime handling.</li>



<li><strong>Dynamic Updates</strong>: Automate the script to regenerate the table monthly or yearly.</li>
</ul>
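
<p>For the time zone tip, here is a short sketch with the standard library&#8217;s <code>zoneinfo</code> module (Python 3.9 or later; on Windows it typically needs the <code>tzdata</code> package installed). The timestamp and the <code>Europe/Warsaw</code> zone are only examples.</p>

<pre class="wp-block-code"><code>from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timezone-aware timestamp in UTC
utc_dt = datetime(2025, 7, 20, 16, 18, tzinfo=timezone.utc)

# Convert to a local zone; the instant stays the same, only the wall clock changes
local_dt = utc_dt.astimezone(ZoneInfo("Europe/Warsaw"))
print(local_dt.isoformat())
</code></pre>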



<hr class="wp-block-separator has-alpha-channel-opacity"/>






<h2 class="wp-block-heading" id="sqldateandtimefunctionsforpowerbidatetables">SQL Date and Time Functions for Power BI Date Tables</h2>



<p>When building reports in Power BI, a comprehensive <strong>Date Table</strong> is essential for enabling time-based calculations like YTD, MTD, and custom period comparisons. While Power BI can auto-generate a date table, using <strong>SQL</strong> to create one gives you full control over its structure and logic.</p>



<p>Let’s explore key <strong>SQL Server date and time functions</strong> and how to use them to build a robust date table.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="1generatingadaterange">1. Generating a Date Range</h2>



<p>To create a date table, you need a continuous range of dates. In SQL Server, you can use a loop or a recursive CTE:</p>



<pre class="wp-block-code"><code>DECLARE @StartDate DATE = '2020-01-01';
DECLARE @EndDate DATE = '2030-12-31';

WITH DateCTE AS (
    SELECT @StartDate AS DateValue
    UNION ALL
    SELECT DATEADD(DAY, 1, DateValue)
    FROM DateCTE
    WHERE DateValue &lt; @EndDate
)
SELECT * INTO DateTable FROM DateCTE
OPTION (MAXRECURSION 32767);

select * from DateTable
</code></pre>



<p>Here is an example of the output:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="440" height="575" src="https://mietwood.com/wp-content/uploads/2025/07/image-17.jpg" alt="Python for analysts most important datetime functions in sql" class="wp-image-3213" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-17.jpg 440w, https://mietwood.com/wp-content/uploads/2025/07/image-17-230x300.jpg 230w" sizes="auto, (max-width: 440px) 100vw, 440px" /><figcaption class="wp-element-caption">Python for analysts most important datetime functions in sql</figcaption></figure>



<p>This creates a table with one row per day between 2020 and 2030.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="2addingdateattributes">2. Adding Date Attributes</h2>



<p>Once you have the base dates, enrich them with useful columns:</p>



<pre class="wp-block-code"><code>ALTER TABLE DateTable ADD 
    Year INT,
    Month INT,
    MonthName VARCHAR(20),
    Quarter INT,
    Weekday INT,
    WeekdayName VARCHAR(20);

UPDATE DateTable
SET 
    Year = YEAR(DateValue),
    Month = MONTH(DateValue),
    MonthName = DATENAME(MONTH, DateValue),
    Quarter = DATEPART(QUARTER, DateValue),
    Weekday = DATEPART(WEEKDAY, DateValue),
    WeekdayName = DATENAME(WEEKDAY, DateValue);
</code></pre>



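<p>These columns allow for flexible filtering and grouping in Power BI. Note that <code>DATEPART(WEEKDAY, ...)</code> depends on the session&#8217;s <code>SET DATEFIRST</code> setting, so the number assigned to each weekday can differ between servers.</p>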



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="3fiscalcalendarandflags">3. Fiscal Calendar and Flags</h2>



<p>You can also add fiscal logic and flags:</p>



<pre class="wp-block-code"><code>ALTER TABLE DateTable ADD FiscalYear INT;

UPDATE DateTable
SET FiscalYear = CASE 
    WHEN MONTH(DateValue) &gt;= 7 THEN YEAR(DateValue) + 1
    ELSE YEAR(DateValue)
END;
</code></pre>



<p>Add flags like <code>IsWeekend</code>, <code>IsToday</code>, or <code>IsCurrentMonth</code> to simplify DAX expressions.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>






<h2 class="wp-block-heading" id="conclusion">Conclusion</h2>



<p>Python offers a flexible and powerful way to create a <strong>custom date table</strong> for Power BI. With just a few lines of code, you can generate a rich dataset that supports advanced time intelligence and reporting needs.</p>



<p>Whether you&#8217;re working with fiscal calendars, custom flags, or multilingual support, Python gives you the tools to tailor your date table exactly to your business requirements.</p>



<p>SQL’s date and time functions like <code>DATEADD</code>, <code>DATEPART</code>, <code>DATENAME</code>, and <code>YEAR</code> are powerful tools for building a custom date table. Once created, export it to Power BI or use it as a view for dynamic reporting.</p>






<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Find more resources in our course of <a href="https://mietwood.com/programowanie-zaawansowane-w-analityce">Advanced programming for business analysts</a></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/python-for-analysts-most-important-datetime-functions">Python for analysts most important datetime functions</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://mietwood.com/python-for-analysts-most-important-datetime-functions/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Semantic Search with Elasticsearch</title>
		<link>https://mietwood.com/semantic-search-with-elasticsearch</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Fri, 11 Jul 2025 10:11:19 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Customer Experience Management]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3158</guid>

					<description><![CDATA[<p>Semantic search with Elasticsearch is must have for modern e-commerce. Elasticsearch is a powerful search engine, scalable data store, and vector database built on Apache Lucene. It’s optimized for speed and relevance on production-scale workloads. You can use Elasticsearch to index your product database and built beautiful Semantic search with Elasticsearch. How AI is changing...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search-with-elasticsearch">Semantic Search with Elasticsearch</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Semantic search with Elasticsearch is a must-have for modern e-commerce. Elasticsearch is a powerful search engine, scalable data store, and vector database built on Apache Lucene. It’s optimized for speed and relevance on production-scale workloads. You can use Elasticsearch to index your product database and build semantic search on top of it.</p>



<h2 class="wp-block-heading"><a href="https://www.nngroup.com/articles/ai-changing-search-behaviors" target="_blank" rel="noopener">How AI is changing search behaviors</a></h2>



<p><a href="https://www.nngroup.com/articles/ai-changing-search-behaviors" target="_blank" rel="noopener">https://www.nngroup.com/articles/ai-changing-search-behaviors</a></p>





<h2 class="wp-block-heading">Semantic search with Elasticsearch &#8211; intro</h2>



<p><a href="https://www.elastic.co/search-labs/blog/introduction-to-vector-search" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/introduction-to-vector-search</a></p>



<h2 class="wp-block-heading">Language model implementation</h2>



<p>First, we load a language model and prepare the product data as input. Semantic search with Elasticsearch requires the text to be converted into vectors by a sentence-transformer model before it is indexed.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd
import numpy as np

# function to normalize vectors
def normalize_embedding(embedding):
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm != 0 else embedding

# select products from sql database
def select_products():
    q = """
        SELECT 
         &#91;ProductId&#93;
        ,&#91;ProdIdx&#93;
        ,&#91;ProductName&#93;
        FROM &#91;DB_Products&#93; 
    """
    dfp = read_from_sql_server(q)
    return dfp

# import sentence transformers and initiate model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
print(model)

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

model = model.to(device)
print(model)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">numpy</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">np</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">normalize</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">vectors</span></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">normalize_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">np</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">linalg</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> != 0 </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">select</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sql</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">database</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">select_products</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">q</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">        SELECT</span><span style="color: #D8DEE9"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">         &#91;</span><span style="color: #D8DEE9">ProductId</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">ProdIdx</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">ProductName</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">DB_Products</span><span style="color: #D8DEE9FF">&#93; </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    dfp = read_from_sql_server(q)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">dfp</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sentence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">transformers</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">initiate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">model</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sentence_transformers</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SentenceTransformer</span></span>
<span class="line"><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">SentenceTransformer</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">all-MiniLM-L6-v2</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">torch</span></span>
<span class="line"><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">torch</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">cuda</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">torch</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">cuda</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">is_available</span><span style="color: #D8DEE9FF">() </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">cpu</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
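
<p>The indexing and search code in this post calls a <code>get_embedding</code> helper that is not listed here. The sketch below shows one way it might look, assuming it simply encodes the text and normalizes the result. The <code>encoder</code> parameter and the <code>FakeEncoder</code> class are hypothetical names added for illustration, and the sketch avoids NumPy so it can be dry-run on its own.</p>

```python
import math


def normalize_embedding(embedding):
    """Pure-Python variant of normalize_embedding above: scale to unit length."""
    values = [float(x) for x in embedding]
    norm = math.sqrt(sum(x * x for x in values))
    # a zero vector cannot be scaled, so it is returned unchanged
    return [x / norm for x in values] if norm != 0 else values


def get_embedding(text, encoder):
    """Hypothetical helper: encode text and return a unit-length list of floats.

    `encoder` is assumed to be any object with an encode(text) method,
    for example the SentenceTransformer instance named `model` above.
    """
    return normalize_embedding(encoder.encode(text))


class FakeEncoder:
    """Stand-in encoder so the sketch runs without downloading a model."""

    def encode(self, text):
        return [3.0, 4.0]


vector = get_embedding("anti-explosion device", FakeEncoder())
print(vector)
```

<p>With the real model you would pass the <code>model</code> object created above as <code>encoder</code>. Returning a plain list of floats keeps the vector easy to serialize into the JSON request body.</p>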



<h2 class="wp-block-heading">Data indexing for semantic search</h2>



<p>Now we can index the product data into an Elasticsearch index. We index each product name twice: as a vector and as plain text. Storing both makes hybrid search possible, because semantic and lexical matching can then be combined in one query.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># initiate Elasticsearch client
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch('http://localhost:9100')

# check es client
from pprint import pprint
pprint(es.info().body)

# ReCreate the index with dense_vector and text mappings
# step 1
es.indices.delete(index="prod_search_hybrid", ignore_unavailable=True)

# step 2 - mappings
es.indices.create(
    index="prod_search_hybrid",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "product_name": {
                "type": "text"
            }
        }
    },
)

# list indices and number of documents indexed 
def indices_list():
    indices = es.cat.indices(format='json')
    return [x&#91;'index'&#93; for x in indices]
# -------------------
print(indices_list())
# ------------------------------

# Create documents for embedding
documents = []
for i, r in df_docs.iterrows():
    documents.append({        
        'product_name': r&#91;'ProductName'&#93;&#91;:256&#93;.lower(),        
    })

print(f' Created table of {len(documents)} docs')

# Prepare bulk operations
from tqdm import tqdm                                         # for a progress bar
operations = []
for document in tqdm(documents, total=len(documents)):
    operations.append({'index': {'_index': 'prod_search_hybrid'}})
    operations.append({
        **document,
        'embedding': get_embedding(document&#91;'product_name'&#93;), # vectors for semantic search
        'product_name': document&#91;'product_name'&#93;              # the text field for hybrid search
    })

# Bulk insert the data into Elasticsearch
response = es.bulk(operations=operations)
print(f' Records indexed: {len(response&#91;"items"&#93;)}')</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">initiate</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">client</span></span>
<span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">helpers</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Elasticsearch</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">http://localhost:9100</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">check</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">client</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pprint</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pprint</span></span>
<span class="line"><span style="color: #8FBCBB">pprint</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">info</span><span style="color: #D8DEE9FF">().</span><span style="color: #8FBCBB">body</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">ReCreate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dense_vector</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">mappings</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">step</span><span style="color: #D8DEE9FF"> 1</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">delete</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ignore_unavailable</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">step</span><span style="color: #D8DEE9FF"> 2 - </span><span style="color: #8FBCBB">mappings</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">create</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">mappings</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &quot;</span><span style="color: #8FBCBB">properties</span><span style="color: #D8DEE9FF">&quot;: {</span></span>
<span class="line"><span style="color: #D8DEE9FF">            &quot;</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">&quot;: {</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">type</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">dense_vector</span><span style="color: #D8DEE9FF">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">dims</span><span style="color: #D8DEE9FF">&quot;: 384</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">&quot;: </span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">similarity</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">cosine</span><span style="color: #D8DEE9FF">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">: </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">type</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        }</span></span>
<span class="line"><span style="color: #D8DEE9FF">    }</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">list</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">number</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">of</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indexed</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices_list</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">cat</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">format</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">json</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> [</span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">index</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF"># -------------------</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">indices_list</span><span style="color: #D8DEE9FF">())</span></span>
<span class="line"><span style="color: #D8DEE9FF"># ------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span></span>
<span class="line"><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF"> = []</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">r</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">df_docs</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">iterrows</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">r</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">ProductName</span><span style="color: #D8DEE9FF">&#39;&#93;&#91;:256&#93;.</span><span style="color: #8FBCBB">lower</span><span style="color: #D8DEE9FF">()</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C"> Created table of {len(documents)} docs</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Prepare</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bulk</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">operations</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF">                                         # </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">progress</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bar</span></span>
<span class="line"><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF"> = []</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">documents</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">total</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">len</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF">)):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF">&#39;</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">&#39;: {&#39;</span><span style="color: #8FBCBB">_index</span><span style="color: #D8DEE9FF">&#39;: &#39;</span><span style="color: #8FBCBB">prod_search_hybrid</span><span style="color: #D8DEE9FF">&#39;</span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">})</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">**</span><span style="color: #8FBCBB">document</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">get_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;&#93;)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> # </span><span style="color: #8FBCBB">vectors</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">semantic</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;&#93;              # </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">field</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">hybrid</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Bulk</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">insert</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span></span>
<span class="line"><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bulk</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C"> Records indexed: {len(response&#91;&quot;items&quot;&#93;)}</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<h2 class="wp-block-heading">Hybrid search</h2>



<p>For hybrid search we combine a <strong>match</strong> query and a <strong>knn</strong> query inside a bool query. The <strong>_name</strong> parameter labels each clause, so every hit reports which part of the query matched it (in its <code>matched_queries</code> list). This makes it possible to build hybrid scoring. As you can see, we vectorize <strong>query_h</strong> into <strong>query_vector</strong> with the same <code>get_embedding</code> function that we used during indexing.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>query_h = "anti-explosion device"

# Vectorize the query with the same function used at indexing time
query_vector = get_embedding(query_h)
# print("Query Vector:", query_vector)

response = es.search(
    index='prod_search_hybrid',
    body={
        "query": {
            "bool": {
                "should": &#91;
                    {
                        "match": {
                            "product_name": {
                                "query": query_h,
                                "_name": "text_match"
                            }
                        }
                    },
                    {
                        "knn": {
                            "field": "embedding",
                            "query_vector":  query_vector,
                            "k": 10,
                            "num_candidates": 100,
                            "_name": "semantic_search"
                        }
                    }
                &#93;
            }
        }
        ,
        'size': 30
    }
)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">anti-explosion device</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># Vectorize the query with the same function used at indexing time</span></span>
<span class="line"><span style="color: #D8DEE9">query_vector</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">get_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Query Vector:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_vector</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">search</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">body</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">bool</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">should</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> &#91;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">match</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_h</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text_match</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">knn</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">field</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF">  </span><span style="color: #D8DEE9">query_vector</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">k</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">num_candidates</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">100</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">semantic_search</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">size</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">30</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



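<p>After that we can extract the product names and scores and build the result lists we need. Here we split the hits by the <strong>_name</strong> label reported in <code>matched_queries</code>, then build a top-10 list for the lexical match and another for semantic similarity. Note that <code>_score</code> is the combined score of all matching clauses, so a hit matched by both clauses appears in both lists with the same score. You can read more about similarity measures in the post <a href="https://mietwood.com/measuring-product-similarity">Measuring product similarity</a>.</p>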



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Extract product names, scores, and sources
products = [
    (hit&#91;"_source"&#93;&#91;"product_name"&#93;, hit&#91;"_score"&#93;, hit.get("matched_queries", []))
    for hit in response&#91;"hits"&#93;&#91;"hits"&#93;
]

# Separate products into text_match and semantic_search groups
text_match_products = [product for product in products if "text_match" in product&#91;2&#93;]
semantic_search_products = [product for product in products if "semantic_search" in product&#91;2&#93;]

# Sort each group by score in descending order
sorted_text_match_products = sorted(text_match_products, key=lambda x: x&#91;1&#93;, reverse=True)&#91;:10&#93;
sorted_semantic_search_products = sorted(semantic_search_products, key=lambda x: x&#91;1&#93;, reverse=True)&#91;:10&#93;

# Print top 10 text_match products
print("\nTop 10 Text Match Products:")
for product in sorted_text_match_products:
    print(f"Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}")

# Print top 10 semantic_search products
print("\nTop 10 Semantic Search Products:")
for product in sorted_semantic_search_products:
    print(f"Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}")</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Extract</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">names</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">scores</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">sources</span></span>
<span class="line"><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span></span>
<span class="line"><span style="color: #D8DEE9FF">    (</span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">matched_queries</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> []))</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Separate</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text_match</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">semantic_search</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">groups</span></span>
<span class="line"><span style="color: #D8DEE9">text_match_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text_match</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">2</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"><span style="color: #D8DEE9">semantic_search_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">semantic_search</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">2</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Sort</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">each</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">group</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">score</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">descending</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">order</span></span>
<span class="line"><span style="color: #D8DEE9">sorted_text_match_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">sorted</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">text_match_products</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">key</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">lambda</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">: </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reverse</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)&#91;:</span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9">sorted_semantic_search_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">sorted</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">semantic_search_products</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">key</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">lambda</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">: </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reverse</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)&#91;:</span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Print</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">top</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text_match</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Top 10 Text Match Products:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> sorted_text_match_products</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Print</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">top</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">semantic_search</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Top 10 Semantic Search Products:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> sorted_semantic_search_products</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
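
<p>One possible way to merge two ranked lists into a single hybrid ranking is reciprocal rank fusion (RRF). The sketch below is an illustration under assumptions, not the method used in this post: the helper name <code>rrf_fuse</code>, the constant <code>k=60</code> and the sample product names are all hypothetical. It only needs lists of tuples whose first element is the product name, so lists such as <code>sorted_text_match_products</code> and <code>sorted_semantic_search_products</code> could be passed in directly.</p>

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of (name, score, ...) tuples with reciprocal rank fusion.

    Each item contributes 1 / (k + rank) for every list it appears in,
    where rank starts at 1 for the first element of a list.
    """
    fused = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            fused[item[0]] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by name for a stable order
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample data standing in for the two top-10 lists
text_hits = [("explosion-proof valve", 7.1), ("flame arrester", 5.4)]
semantic_hits = [("flame arrester", 0.92), ("blast damper", 0.88)]

for name, score in rrf_fuse([text_hits, semantic_hits]):
    print(f"Product Name: {name}, RRF score: {score:.4f}")
```

<p>Because RRF uses only the position of each item in a list, it does not require the lexical and vector scores to be on the same scale. A product found by both searches ends up above products found by only one of them.</p>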



<p>Full-text search, also known as lexical search, is a technique for fast, efficient searching through text fields in documents. Documents and search queries are transformed to enable returning&nbsp;<a href="https://www.elastic.co/what-is/search-relevance" target="_blank" rel="noreferrer noopener">relevant</a>&nbsp;results instead of simply exact term matches. Fields of type&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/text#text-field-type" target="_blank" rel="noopener"><code>text</code></a>&nbsp;are analyzed and indexed for full-text search.</p>
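
<p>To make the idea of transforming documents and queries more concrete, here is a deliberately simplified sketch. It is not how Elasticsearch analyzers are implemented: the functions <code>toy_analyze</code> and <code>toy_match</code> are hypothetical and only lowercase the text and split it into alphanumeric tokens. Even this small step lets a query match a document that differs in case and punctuation.</p>

```python
import re


def toy_analyze(text):
    """Lowercase the text and split it into alphanumeric tokens (toy stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_match(query, document):
    """Return True if the document shares at least one analyzed token with the query."""
    return bool(set(toy_analyze(query)).intersection(toy_analyze(document)))


print(toy_analyze("Anti-Explosion Device"))
print(toy_match("anti-explosion device", "Explosion relief DEVICE, type B"))
print(toy_match("anti-explosion device", "Garden hose"))
```

<p>A real analyzer can add further steps such as stemming or stop-word removal, which is why full-text search returns relevant results and not only exact term matches.</p>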



<p>You can combine full-text search with&nbsp;<a href="https://www.elastic.co/docs/solutions/search/semantic-search" target="_blank" rel="noopener">semantic search using vectors</a>&nbsp;to build modern hybrid search applications. While vector search may require additional GPU resources, the full-text component remains cost-effective by leveraging existing CPU infrastructure.</p>



<p>Another example of vector indexing and semantic search can be found here: <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb" class="ek-link" target="_blank" rel="noopener">https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb</a></p>



<h2 class="wp-block-heading">Vector search setup and performing hybrid search</h2>



<p><a href="https://www.elastic.co/search-labs/blog/vector-search-set-up-elasticsearch" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/vector-search-set-up-elasticsearch</a></p>



<p><a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch</a></p>



<h2 class="wp-block-heading" id="Filtering">Filtering</h2>



<p>Filter context is mostly used for filtering structured data. For example, use filter context to answer questions like:</p>



<ul class="wp-block-list">
<li><em>Does this timestamp fall into the range 2015 to 2016?</em></li>



<li><em>Is the status field set to &#8220;published&#8221;?</em></li>
</ul>



<p>Filter context is in effect whenever a query clause is passed to a filter parameter, such as the&nbsp;<code>filter</code>&nbsp;or&nbsp;<code>must_not</code>&nbsp;parameters in a&nbsp;<code>bool</code>&nbsp;query. <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context" target="_blank" rel="noopener">Learn more</a>&nbsp;about filter context in the Elasticsearch docs.</p>
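
<p>The two questions above can be expressed as a <code>bool</code> query whose clauses all run in filter context. This is a minimal sketch under assumptions: the helper <code>build_filter_query</code>, the field names <code>status</code>, <code>timestamp</code> and <code>category</code>, and the index name in the final comment are hypothetical and not taken from the index used in this post.</p>

```python
import json


def build_filter_query(status, start, end):
    """Build a bool query whose clauses run only in filter context (no scoring)."""
    return {
        "query": {
            "bool": {
                # Both clauses must match, but neither contributes to _score
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"timestamp": {"gte": start, "lt": end}}},
                ],
                # Documents matching this clause are excluded
                "must_not": [
                    {"term": {"category": "archived"}},
                ],
            }
        }
    }


body = build_filter_query("published", "2015-01-01", "2017-01-01")
print(json.dumps(body, indent=2))

# With a connected client this could be sent as, for example:
# response = es.search(index="my_index", body=body)  # "my_index" is a placeholder
```

<p>The range uses an inclusive lower bound and an exclusive upper bound, so it covers all of 2015 and 2016.</p>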



<h3 class="wp-block-heading" id="Example:-Keyword-Filtering">Keyword Filtering</h3>



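<p>This is an example of adding a keyword filter to the query. The example retrieves the top books whose title vectors are most similar to &#8220;javascript books&#8221;, restricted to books published by Addison-Wesley. The snippet assumes an Elasticsearch <code>client</code>, an embedding <code>model</code> with an <code>encode</code> method, and <code>pprint</code> are already available.</p>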



<pre class="wp-block-code"><code>response = client.search(
    index="book_index",
    knn={
        "field": "title_vector",
        "query_vector": model.encode("javascript books"),
        "k": 10,
        "num_candidates": 100,
<span style="background-color:var(--global-palette1)" class="has-inline-background"><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-theme-palette-9-color">        "filter": {"term": {"publisher.keyword": "addison-wesley"}},</mark></span>
    },
)

pprint(response)</code></pre>



<h2 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#query-filter-context" target="_blank" rel="noopener">Query and filter context</a></h2>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#relevance-scores" target="_blank" rel="noopener">Relevance scores</a></h3>



<p>By default, Elasticsearch sorts matching search results by&nbsp;<strong>relevance score</strong>, which measures how well each document matches a query. The relevance score is a positive floating point number, returned in the&nbsp;<code>_score</code>&nbsp;metadata field of the&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search" target="_blank" rel="noreferrer noopener" class="ek-link">search</a>&nbsp;API. The higher the&nbsp;<code>_score</code>, the more relevant the document. While each query type can calculate relevance scores differently, score calculation also depends on whether the query clause is run in a&nbsp;<strong>query</strong>&nbsp;or&nbsp;<strong>filter</strong>&nbsp;context.</p>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#query-context" class="ek-link" target="_blank" rel="noopener">Query context</a></h3>



<p>In the query context, a query clause answers the question&nbsp;<em>How well does this document match this query clause?</em>&nbsp;Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the&nbsp;<code>_score</code>&nbsp;metadata field. Query context is in effect whenever a query clause is passed to a&nbsp;<code>query</code>&nbsp;parameter, such as the&nbsp;<code>query</code>&nbsp;parameter in the&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#request-body-search-query" target="_blank" rel="noreferrer noopener">search</a>&nbsp;API.</p>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#filter-context" target="_blank" rel="noopener">Filter context</a></h3>



<p>A filter answers the binary question “Does this document match this query clause?”. The answer is simply &#8220;yes&#8221; or &#8220;no&#8221;. Filtering has several benefits:</p>



<ol class="wp-block-list">
<li><strong>Simple binary logic</strong>: In a filter context, a query clause determines document matches based on a yes/no criterion, without score calculation.</li>



<li><strong>Performance</strong>: Because they don’t compute relevance scores, filters execute faster than queries.</li>



<li><strong>Caching</strong>: Elasticsearch automatically caches frequently used filters, speeding up subsequent search performance.</li>



<li><strong>Resource efficiency</strong>: Filters consume less CPU resources compared to full-text queries.</li>



<li><strong>Query combination</strong>: Filters can be combined with scored queries to refine result sets efficiently.</li>
</ol>



<p>Filters are particularly effective for querying structured data and implementing &#8220;must have&#8221; criteria in complex searches.</p>



<p>Structured data refers to information that is highly organized and formatted in a predefined manner. In the context of Elasticsearch, this typically includes:</p>



<ul class="wp-block-list">
<li>Numeric fields (integers, floating-point numbers)</li>



<li>Dates and timestamps</li>



<li>Boolean values</li>



<li>Keyword fields (exact match strings)</li>



<li>Geo-points and geo-shapes</li>
</ul>



<p>Unlike full-text fields, structured data has a consistent, predictable format, making it ideal for precise filtering operations.</p>



<p>Common filter applications include:</p>



<ul class="wp-block-list">
<li>Date range checks: for example, is the&nbsp;<code>timestamp</code>&nbsp;field between 2015 and 2016?</li>



<li>Specific field value checks: for example, is the&nbsp;<code>status</code>&nbsp;field equal to &#8220;published&#8221;, or is the&nbsp;<code>author</code>&nbsp;field equal to &#8220;John Doe&#8221;?</li>
</ul>



<p>Filter context applies when a query clause is passed to a&nbsp;<code>filter</code>&nbsp;parameter, such as:</p>



<ul class="wp-block-list">
<li><code>filter</code>&nbsp;or&nbsp;<code>must_not</code>&nbsp;parameters in&nbsp;<a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query" target="_blank" rel="noopener"><code>bool</code></a>&nbsp;queries</li>



<li><code>filter</code>&nbsp;parameter in&nbsp;<a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-constant-score-query" target="_blank" rel="noopener"><code>constant_score</code></a>&nbsp;queries</li>



<li><a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-filter-aggregation" target="_blank" rel="noopener"><code>filter</code></a>&nbsp;aggregations</li>
</ul>



<p>Filters optimize query performance and efficiency, especially for structured data queries and when combined with full-text searches.</p>



<pre class="wp-block-code"><code>GET /_search
{
  "query": {
    "bool": {
      "must": &#91;
        { "match": { "title":   "Search"        }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": &#91;
        { "term":  { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015-01-01" }}}
      ]
    }
  }
}</code></pre>



<p>Read more: <a href="https://mietwood.com/product-search-and-product-classification-for-e-commerce">Product Search and Product classification for E-commerce</a></p>



<h1 class="wp-block-heading">Reciprocal rank fusion</h1>



<p><a href="https://plg.uwaterloo.ca/%7Egvcormac/cormacksigir09-rrf.pdf" target="_blank" rel="noreferrer noopener">Reciprocal rank fusion (RRF)</a>&nbsp;is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results.</p>



<p>RRF uses the following formula to determine the score for ranking each document:</p>



<pre class="wp-block-code"><code>score = 0.0
for q in queries:
    if d in result(q):
        score += 1.0 / ( k + rank( result(q), d ) )
return score

# where
# k is a ranking constant
# q is a query in the set of queries
# d is a document in the result set of q
# result(q) is the result set of q
# rank( result(q), d ) is d's rank within the result(q) starting from 1</code></pre>
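<p>The pseudocode above can be turned into a small runnable sketch. This is only an illustration of the formula, not how Elasticsearch implements it internally; the function names <code>rrf_scores</code> and <code>rrf_rank</code> and the two result lists are made up for the example.</p>

```python
from collections import defaultdict


def rrf_scores(result_sets, k=60):
    """Return {doc_id: RRF score} for several ranked result lists.

    result_sets: list of lists of document ids, best match first.
    k: ranking constant (60 is the value used in the original RRF paper).
    """
    scores = defaultdict(float)
    for results in result_sets:
        # rank starts from 1, exactly as in the formula
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def rrf_rank(result_sets, k=60, size=10):
    """Return the top `size` (doc_id, score) pairs, highest score first."""
    scores = rrf_scores(result_sets, k)
    # Sort by descending score; break ties by doc id to keep the order stable
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:size]


# Hypothetical result lists from a lexical query and a kNN query
lexical = ["a", "b", "c"]
vector = ["c", "a", "d"]

ranking = rrf_rank([lexical, vector], k=20, size=3)
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

<p>With these made-up lists, document <code>a</code> ranks first because it sits near the top of both lists, <code>c</code> comes second, and <code>b</code>, which appears in only one list, comes third.</p>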



<p>You can use RRF as part of a&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search" target="_blank" rel="noreferrer noopener">search</a>&nbsp;to combine and rank documents using separate sets of top documents (result sets) from a combination of&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers" target="_blank" rel="noopener">child retrievers</a>&nbsp;using an&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers#rrf-retriever" target="_blank" rel="noopener">RRF retriever</a>. A minimum of&nbsp;<strong>two</strong>&nbsp;child retrievers is required for ranking.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="431" height="198" src="https://mietwood.com/wp-content/uploads/2025/07/image-1.jpg" alt="Semantic search with Elasticsearch. RRF retriever combined the results." class="wp-image-3163" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-1.jpg 431w, https://mietwood.com/wp-content/uploads/2025/07/image-1-300x138.jpg 300w" sizes="auto, (max-width: 431px) 100vw, 431px" /><figcaption class="wp-element-caption">Semantic search with Elasticsearch. RRF retriever combined the results.</figcaption></figure>
</div>


<p>We rank the documents based on the RRF formula with a&nbsp;<code>rank_window_size</code>&nbsp;of&nbsp;<code>5</code>&nbsp;truncating the bottom&nbsp;<code>2</code>&nbsp;docs in our RRF result set with a&nbsp;<code>size</code>&nbsp;of&nbsp;<code>3</code>. <strong>We end with&nbsp;<code>_id: 3</code>&nbsp;as&nbsp;<code>_rank: 1</code>,&nbsp;<code>_id: 2</code>&nbsp;as&nbsp;<code>_rank: 2</code>, and&nbsp;<code>_id: 4</code>&nbsp;as&nbsp;<code>_rank: 3</code>. </strong>This ranking matches the result set from the original RRF search as expected.</p>



<p>In this example, we execute the&nbsp;<code>knn</code>&nbsp;and&nbsp;<code>standard</code>&nbsp;retrievers independently of each other. Then we use the&nbsp;<code>rrf</code>&nbsp;retriever to combine the results.</p>



<ol class="wp-block-list">
<li>First, we execute the kNN search specified by the&nbsp;<code>knn</code>&nbsp;retriever to get its global top 50 results.</li>



<li>Second, we execute the query specified by the&nbsp;<code>standard</code>&nbsp;retriever to get its global top 50 results.</li>



<li>Then, on a coordinating node, we combine the kNN search top documents with the query top documents and rank them based on the RRF formula using parameters from the&nbsp;<code>rrf</code>&nbsp;retriever to get the combined top documents using the default&nbsp;<code>size</code>&nbsp;of&nbsp;<code>10</code>.</li>
</ol>



<p>Note that if&nbsp;<code>k</code>&nbsp;from a kNN search is larger than&nbsp;<code>rank_window_size</code>, the results are truncated to&nbsp;<code>rank_window_size</code>. If&nbsp;<code>k</code>&nbsp;is smaller than&nbsp;<code>rank_window_size</code>, only&nbsp;<code>k</code>&nbsp;results are returned.</p>



<pre class="wp-block-code"><code>GET example-index/_search
{
    "retriever": {
        "rrf": {
            "retrievers": &#91;
                {
                    "standard": {
                        "query": {
                            "term": {
                                "text": "shoes"
                            }
                        }
                    }
                },
                {
                    "knn": {
                        "field": "vector",
                        "query_vector": &#91;1.25, 2, 3.5],
                        "k": 50,
                        "num_candidates": 100
                    }
                }
            ],
            "rank_window_size": 50,
            "rank_constant": 20
        }
    }
}</code></pre>
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search-with-elasticsearch">Semantic Search with Elasticsearch</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Survival Analysis Models</title>
		<link>https://mietwood.com/survival-analysis-models</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Wed, 02 Jul 2025 10:30:53 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3115</guid>

					<description><![CDATA[<p>Survival analysis is&#160;a statistical method used to model the behavior of an object, depending on a set of variables (x1 .. xn), over a time-to-event period. It is especially useful for modeling the probability of an object&#8217;s survival under certain circumstances. One can analyze the timeline of event occurrences in relation to the variables influencing the time until a specific event occurs,&#160;like...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/survival-analysis-models">Survival Analysis Models</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Survival analysis is&nbsp;a statistical method used to model the behavior of an object, depending on a set of variables (x1 .. xn), over a time-to-event period. It is especially useful for modeling the probability of an object&#8217;s survival under certain circumstances. One can analyze the timeline of event occurrences in relation to the variables influencing the time until a specific event occurs,&nbsp;like death, failure, or a customer saying &#8220;Goodbye&#8221; to the company. Here&#8217;s a breakdown of the key terminology of survival analysis.</p>





<h2 class="wp-block-heading">Key terminology of survival analysis models</h2>



<ul class="wp-block-list">
<li><strong>Event</strong>: The outcome of interest, such as death, disease occurrence, customer churn, or equipment failure.</li>



<li><strong>Time</strong>: The duration from a defined starting point (e.g., start of treatment or customer acquisition) to the occurrence of the event or the end of observation (censoring).</li>



<li><strong>Censoring</strong>: When the event of interest is not observed for some individuals during the study period, making their exact survival time unknown.</li>



<li><strong>Survival Function</strong> S(t): The survival function&nbsp;<em>S</em>(<em>t</em>)&nbsp;is defined as the probability that a subject survives (i.e., does not experience the event) beyond time&nbsp;<em>t</em>.</li>



<li><strong>Hazard Function</strong> h(t): The hazard function&nbsp;<em>h</em>(<em>t</em>)&nbsp;represents the instantaneous rate at which the event occurs at time&nbsp;<em>t</em>, given that the subject has survived up to that time.</li>



</ul>



<h2 class="wp-block-heading"><strong>Linking the Survival and Hazard Functions</strong></h2>



<p>The survival and hazard functions are mathematically related. Knowing one allows you to derive the other, reflecting the duality between retention (survival) and churn (event occurrence).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="210" height="63" src="https://mietwood.com/wp-content/uploads/2025/06/image-13.jpg" alt="" class="wp-image-3132"/></figure>
</div>


<p class="has-text-align-center"><strong>In summary, the cumulative hazard rate of subject i at time t can also be defined as the negative logarithm of the survival function at time t.</strong></p>
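<p>The relation between the two functions can be checked numerically. The sketch below assumes a made-up, linearly increasing hazard h(t) = 0.02·t, integrates it to get the cumulative hazard H(t), and then recovers the survival function as S(t) = exp(−H(t)), so that H(t) = −log S(t). The function names are chosen for this example only.</p>

```python
import math


def hazard(t):
    # Hypothetical hazard that grows linearly with time
    return 0.02 * t


def cumulative_hazard(hazard_fn, t, steps=1000):
    """Integrate the hazard from 0 to t with the trapezoidal rule."""
    width = t / steps
    total = 0.5 * (hazard_fn(0.0) + hazard_fn(t))
    for i in range(1, steps):
        total += hazard_fn(i * width)
    return total * width


t = 10.0
H = cumulative_hazard(hazard, t)   # cumulative hazard at time t
S = math.exp(-H)                   # survival probability derived from it

# Going the other way: the negative log of S gives back the cumulative hazard
H_from_S = -math.log(S)
print(round(H, 4), round(S, 4), round(H_from_S, 4))
```

<p>For this hazard the integral has the closed form H(t) = 0.01·t², which makes it easy to confirm the numbers by hand.</p>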


<div class="kb-row-layout-wrap kb-row-layout-id3115_1f2f65-c8 alignnone wp-block-kadence-rowlayout"><div class="kt-row-column-wrap kt-has-2-columns kt-row-layout-equal kt-tab-layout-inherit kt-mobile-layout-row kt-row-valign-top">

<div class="wp-block-kadence-column kadence-column3115_17ced8-bb"><div class="kt-inside-inner-col">
<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="580" height="365" src="https://mietwood.com/wp-content/uploads/2025/07/image-20.jpg" alt="" class="wp-image-3236" style="width:433px;height:auto" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-20.jpg 580w, https://mietwood.com/wp-content/uploads/2025/07/image-20-300x189.jpg 300w" sizes="auto, (max-width: 580px) 100vw, 580px" /></figure>
</div></div>



<div class="wp-block-kadence-column kadence-column3115_5097c2-f7"><div class="kt-inside-inner-col">
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="573" height="372" src="https://mietwood.com/wp-content/uploads/2025/07/image-21.jpg" alt="" class="wp-image-3237" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-21.jpg 573w, https://mietwood.com/wp-content/uploads/2025/07/image-21-300x195.jpg 300w" sizes="auto, (max-width: 573px) 100vw, 573px" /></figure>
</div></div>

</div></div>


<h2 class="wp-block-heading">Stratification</h2>



<p>When a variable starts to exert its influence at some point in time (for example, when a new treatment is applied to a patient), we can divide the observation period into two periods: the time before the variable is applied and the time after. The moment of application constitutes a new starting point, and a shift in the survival function can then be observed.</p>



<h2 class="wp-block-heading"><strong>Kaplan-Meier Estimator</strong></h2>



<p>The Kaplan-Meier estimator is a non-parametric method for estimating the survival function, often used as a first step in survival analysis.&nbsp;It shows the probability of surviving beyond a given time as a function of time. The curve starts at its maximum value of 1 and decreases as time increases, approaching zero if nearly all subjects eventually experience the event.</p>



<h2 class="wp-block-heading"><strong>Cox Proportional Hazards Model</strong></h2>



<p>The Cox Proportional Hazards Model is a semi-parametric model that incorporates the influence of predictor variables on the hazard rate.&nbsp; Each subject has an observed survival time over the interval 0-t and an event variable xE that shows whether the event has occurred. The variables x1 .. xn are also observed over the interval 0-t and are assumed to influence the occurrence of event xE. The Cox model incorporates the factors x1 .. xn as n explanatory variables when estimating the hazard rate or the survival time, which means the model can also be used to predict survival time.</p>



<p>Disclaimer: the post is inspired by: <strong>Benjamin Lee</strong>, A Comparison Study of Parametric and Machine Learning Survival Analysis Models to Predict Customer Churn in the Edtech Sector, Vienna, 27th January, 2025<br>Benjamin Lee, Peter Filzmoser, link: <a href="https://repositum.tuwien.at/bitstream/20.500.12708/213329/1/Lee%20Benjamin%20-%202025%20-%20Survival%20Analysis%20Model%20to%20Predict%20Customer%20Churn%20in%20the...pdf" target="_blank" rel="noopener">here</a></p>



<h2 class="wp-block-heading">Dataset for <strong>Cox Proportional Hazards Model</strong></h2>



<p>Suppose we have a dataset like the one below:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="466" height="187" src="https://mietwood.com/wp-content/uploads/2025/06/image-7.jpg" alt="Survival Analysis Models. The general data input structure." class="wp-image-3116" srcset="https://mietwood.com/wp-content/uploads/2025/06/image-7.jpg 466w, https://mietwood.com/wp-content/uploads/2025/06/image-7-300x120.jpg 300w" sizes="auto, (max-width: 466px) 100vw, 466px" /></figure>



<p>The Kaplan-Meier estimator is the statistical tool used to estimate the true survival function from available data and can be considered the &#8216;best&#8217; estimator of survival probability when no parametric structure is assumed. This non-parametric estimator only requires the time-to-event (or time-to-censoring) t and the event status e for every subject. With this information, the survival function estimator S(t) is given by:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="280" height="85" src="https://mietwood.com/wp-content/uploads/2025/06/image-8.jpg" alt="survival function estimator" class="wp-image-3117"/></figure>
</div>


<p>where t<sub>j</sub> is a time at which at least one event occurred, e<sub>j</sub> is the number of events at t<sub>j</sub>, and n<sub>j</sub> is the number of subjects still at risk at t<sub>j</sub>, that is, those who have neither had the event nor been censored before t<sub>j</sub>.</p>
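<p>To make the estimator concrete, here is a minimal pure-Python sketch of the product-limit calculation. It assumes the usual convention that, at each distinct event time, n is the number of subjects still at risk and e is the number of events at that time; the function name <code>kaplan_meier</code> and the six-subject dataset are invented for illustration. In practice you would more likely use a library such as lifelines.</p>

```python
def kaplan_meier(durations, events):
    """Return [(event_time, estimated S(event_time))] at each distinct event time.

    durations: time to event or to censoring for every subject
    events:    1 if the event was observed, 0 if the subject was censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = 1.0
    curve = []
    for t_j in event_times:
        # Subjects still at risk: no event and no censoring before t_j
        n_j = sum(1 for t in durations if t >= t_j)
        # Number of events observed exactly at t_j
        e_j = sum(1 for t, e in zip(durations, events) if t == t_j and e == 1)
        survival *= 1.0 - e_j / n_j
        curve.append((t_j, survival))
    return curve


# Hypothetical data: survival time in days and event status per subject
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(durations, events)
for t_j, s in curve:
    print(t_j, round(s, 4))
```

<p>Note how the censored subjects (status 0) never cause a drop in the curve themselves, but they do reduce the number at risk for later event times.</p>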



<p>One of the main drawbacks of the Kaplan-Meier model is that it cannot take any subject covariates into account, which means that it does not explain how different variables affect survival within a group. </p>



<h2 class="wp-block-heading">Understanding the Cox Proportional Hazards Model</h2>



<p>When we want to understand how factors like age, treatment type, or blood pressure affect survival time, the <strong>Cox Proportional Hazards model</strong> is a go-to tool. Instead of trying to pin down the exact risk at every single moment, this model cleverly separates the hazard rate into two distinct parts.</p>



<p>The core idea is to model how specific factors (or <strong>covariates</strong>) have a <strong>multiplicative effect</strong> on an underlying hazard rate. The formula for the model looks like this:</p>



<p class="has-text-align-center">h(t,X)=h<sub>0</sub>​(t)⋅exp(βX)</p>



<p>Let&#8217;s break that down:</p>



<ul class="wp-block-list">
<li><strong>h<sub>0​</sub>(t) is the Baseline Hazard.</strong> Think of this as the underlying risk of the event for a &#8220;standard&#8221; individual (where all covariate values are zero) over time. It&#8217;s the part of the model that changes with time (t), but it&#8217;s &#8220;non-parametric,&#8221; meaning we don&#8217;t assume its shape. We let the data speak for itself.</li>



<li><strong>exp(βX) is the Covariate Effect.</strong> This is the parametric part of the model and it&#8217;s where your specific factors come in.
<ul class="wp-block-list">
<li>X is the set of your covariates (e.g., age, treatment type, or blood pressure).</li>



<li>β are the coefficients, similar to those in a linear regression, that the model estimates.</li>



<li>This component tells us how much an individual&#8217;s unique characteristics increase or decrease their risk compared to the baseline. Applying the exponential function, exp(), ensures this multiplier is always positive, as negative risk doesn&#8217;t make sense.</li>
</ul>
</li>
</ul>



<p>Notice that <strong>only the baseline hazard depends on time</strong>. The covariate effect, exp(βX), is constant over time. This separation is the key to the model&#8217;s power and leads directly to its most important assumption.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>In the Cox proportional hazards model, a&nbsp;<strong>covariate</strong>&nbsp;is a predictor variable that you include in the model to explain or predict the risk (hazard) of a specific event occurring over time, such as death, failure, or relapse. These covariates can be continuous (e.g., age, blood pressure) or categorical (e.g., treatment group, gender).</p>



<p>What the program calculates:</p>



<ul class="wp-block-list">
<li>The model estimates the effect of each covariate on the hazard function—the instantaneous risk of the event happening at a certain time.</li>



<li>It computes coefficients (often denoted as <em>β</em>) for each covariate, quantifying how the risk changes with a one-unit increase in that variable, assuming other covariates remain constant.</li>



<li>These coefficients translate into <strong>hazard ratios</strong> (exponentiated coefficients), which tell how much the hazard (risk) is multiplied by when the covariate changes by one unit.</li>



<li>The baseline hazard function h<sub>0</sub>(t) represents the hazard if all covariates were zero; the model then scales this baseline hazard depending on the values of the covariates for each individual.</li>



<li>Overall, the Cox model evaluates how your covariates modify the risk of the event over survival time, assuming that these effects multiply the baseline hazard proportionally and do not change with time (proportional hazards assumption).</li>
</ul>



<p>Formally, the hazard function is modeled as</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="403" height="48" src="https://mietwood.com/wp-content/uploads/2025/07/image-23.jpg" alt="" class="wp-image-3247" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-23.jpg 403w, https://mietwood.com/wp-content/uploads/2025/07/image-23-300x36.jpg 300w" sizes="auto, (max-width: 403px) 100vw, 403px" /></figure>



<p>where&nbsp;x<sub>1</sub>, x<sub>2</sub>, …, x<sub>p</sub>&nbsp;are your covariates and&nbsp;β<sub>1</sub>, β<sub>2</sub>, …, β<sub>p</sub>&nbsp;are their estimated effects.</p>
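<p>As a small worked example of how coefficients become hazard ratios, the sketch below exponentiates a set of coefficients and computes the multiplier exp(β<sub>1</sub>x<sub>1</sub> + … + β<sub>p</sub>x<sub>p</sub>) that scales the baseline hazard for one individual. The coefficient values and covariate names are hypothetical, not estimates from any dataset.</p>

```python
import math

# Hypothetical coefficients, as a fitted Cox model might report them
coefficients = {"age": 0.03, "treatment": -0.5}


def hazard_ratios(coefs):
    """Hazard ratio per one-unit increase of each covariate: exp(beta)."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


def relative_hazard(covariates, coefs):
    """Multiplier applied to the baseline hazard for one individual."""
    linear_predictor = sum(coefs[name] * covariates[name] for name in coefs)
    return math.exp(linear_predictor)


ratios = hazard_ratios(coefficients)

# Hypothetical individual: 10 units of age above the reference, treated
patient = {"age": 10, "treatment": 1}
multiplier = relative_hazard(patient, coefficients)

print({name: round(hr, 3) for name, hr in ratios.items()})
print(round(multiplier, 3))
```

<p>A hazard ratio above 1 (here, age) means the covariate increases risk, while a ratio below 1 (here, treatment) means it reduces risk. An individual with all covariates at zero gets a multiplier of exactly 1, which is the baseline hazard itself.</p>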



<p><strong>Summary:</strong></p>



<ul class="wp-block-list">
<li>Your <strong>covariate</strong> is any variable in your data that may affect the timing of the event.</li>



<li>The Cox model calculates how these covariates affect the <em>hazard</em> or instantaneous risk of the event happening, summarized through hazard ratios.</li>



<li>This helps you understand which factors increase or decrease risk while accounting for the survival time and censoring in your dataset.</li>
</ul>



<p>This explanation aligns with standard survival analysis literature and is consistent with how the lifelines library or R&#8217;s <code>coxph</code> function uses covariates in the Cox model.</p>



<h2 class="wp-block-heading">The &#8220;Proportional&#8221; in Proportional Hazards</h2>



<p>The model&#8217;s name comes from its core assumption: <strong>the effect of the covariates is constant over time.</strong> In other words, the hazard ratio between any two individuals remains proportional throughout the entire timeline.</p>



<p>Let&#8217;s make this concrete. Imagine we have a single covariate, x (age), and two subjects, &#8216;a&#8217; and &#8216;b&#8217;. Their hazard rates are:</p>



<ul class="wp-block-list">
<li>Subject a: h(t,x<sub>a</sub>​)=h<sub>0​</sub>(t)⋅exp(βx<sub>a</sub>​)</li>



<li>Subject b: h(t,x<sub>b</sub>​)=h<sub>0</sub>​(t)⋅exp(βx<sub>b</sub>​)</li>
</ul>



<p>If we look at the ratio of their hazards, the baseline hazard h<sub>0</sub>​(t) cancels out:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="383" height="69" src="https://mietwood.com/wp-content/uploads/2025/07/image-22.jpg" alt="" class="wp-image-3238" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-22.jpg 383w, https://mietwood.com/wp-content/uploads/2025/07/image-22-300x54.jpg 300w" sizes="auto, (max-width: 383px) 100vw, 383px" /></figure>
</div>


<p>As you can see, time (t) has vanished from the right side of the equation. This means that if subject &#8216;a&#8217; has double the risk of subject &#8216;b&#8217; on day 1, they will also have double the risk on day 100 or day 500. The difference in risk is due to the difference in age. Their <strong>relative risk is constant</strong>, or <strong>proportional</strong>. This powerful assumption is a direct consequence of the model&#8217;s structure.</p>
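<p>The cancellation can also be seen numerically. The sketch below uses an arbitrary, time-varying baseline hazard and a made-up coefficient for age; both are assumptions chosen only to show that the ratio of the two hazards stays the same on day 1, day 100 and day 500.</p>

```python
import math


def baseline_hazard(t):
    # Arbitrary baseline that changes over time (illustrative assumption)
    return 0.01 + 0.002 * t


def hazard(t, x, beta):
    """Cox-style hazard: baseline hazard times exp(beta * x)."""
    return baseline_hazard(t) * math.exp(beta * x)


beta = 0.05             # hypothetical coefficient for age
age_a, age_b = 60, 46   # hypothetical ages of subjects a and b

ratios = [hazard(t, age_a, beta) / hazard(t, age_b, beta) for t in (1, 100, 500)]
expected = math.exp(beta * (age_a - age_b))

print([round(r, 4) for r in ratios], round(expected, 4))
```

<p>Although the hazards themselves grow with time, every ratio equals exp(β(x<sub>a</sub> − x<sub>b</sub>)), which depends only on the age difference.</p>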



<p>If a dataset violates the proportional hazards assumption, the predictive power of the model is reduced. Even so, the hazard ratio produced by the model remains a useful tool for assessing the general magnitude of a covariate&#8217;s effect on survival time: it can be interpreted as a weighted average of the true hazard ratios over the observation period. Therefore, the data need not conform strictly to the proportional hazards assumption, but one should always check how well the dataset satisfies it.</p>



<p>You can check if your dataset meets the proportional hazards assumption of the Cox model using both visual plots and formal statistical tests.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Visual Inspection (Graphical Methods)</h3>



<p>Visual checks are often the first and most intuitive step.</p>



<h4 class="wp-block-heading">Log-Log Survival Plots</h4>



<p>This is a classic method for categorical covariates (like &#8220;treatment group&#8221; vs. &#8220;control group&#8221;).</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<ul class="wp-block-list">
<li><strong>How it works:</strong> You plot the logarithm of time, <code>log(t)</code>, on the x-axis against a special transformation of the survival probability, <code>-log(-log(S(t)))</code>, on the y-axis for each group.</li>



<li><strong>What to look for:</strong> If the proportional hazards assumption holds, the resulting curves for each group should be <strong>roughly parallel</strong> and not cross. If the lines cross or move closer or further apart in a systematic way, the assumption may be violated.</li>
</ul>
</blockquote>
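<p>Why should the curves be parallel? Under proportional hazards, the survival function of a group equals the baseline survival raised to the power exp(βx), so the log-log transform of each group differs from the baseline only by the constant βx. The sketch below checks this with a made-up baseline survival function and a hypothetical coefficient; real data would only be approximately parallel.</p>

```python
import math


def baseline_survival(t):
    # Hypothetical baseline survival function (illustrative assumption)
    return math.exp(-((t / 10.0) ** 1.5))


def group_survival(t, beta, x):
    """Survival under proportional hazards: S0(t) ** exp(beta * x)."""
    return baseline_survival(t) ** math.exp(beta * x)


def log_log(s):
    """The transformation used on the y-axis: -log(-log(S(t)))."""
    return -math.log(-math.log(s))


beta = 0.7                  # hypothetical effect of being in group x = 1
times = [1, 2, 5, 10, 20]

# Vertical gap between the control (x = 0) and treatment (x = 1) curves
gaps = [
    log_log(group_survival(t, beta, 0)) - log_log(group_survival(t, beta, 1))
    for t in times
]
print([round(g, 6) for g in gaps])
```

<p>The gap is the same at every time point and equals β, which is exactly what &#8220;roughly parallel&#8221; curves look like in the ideal case.</p>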



<h4 class="wp-block-heading">Schoenfeld Residuals Plots</h4>



<p>This is the most common and powerful visual method, and it works for both continuous and categorical covariates.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<ul class="wp-block-list">
<li><strong>How it works:</strong> For each event that occurs, a residual is calculated that represents the difference between the observed covariate value and the expected covariate value for the individual who had the event. You then plot these residuals against time.</li>



<li><strong>What to look for:</strong> If the assumption holds, you should see a <strong>random scatter of points around a horizontal line at zero</strong>. If you see any clear pattern or trend (e.g., a line with a positive or negative slope), it suggests the effect of the covariate changes over time, violating the assumption.</li>
</ul>
</blockquote>
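<p>For a single covariate, the residual described above can be sketched in a few lines. This simplified version assumes no tied event times and takes the coefficient β as given (in practice it comes from the fitted Cox model); the function name <code>schoenfeld_residuals</code> and the four-subject dataset are invented for the example.</p>

```python
import math


def schoenfeld_residuals(durations, events, x, beta):
    """Return [(event_time, residual)] for one covariate, sorted by time.

    residual = covariate value of the subject who had the event
               minus the risk-weighted mean covariate of everyone still at risk.
    """
    residuals = []
    for t_i, e_i, x_i in zip(durations, events, x):
        if e_i != 1:
            continue  # censored subjects produce no residual
        # Covariate values of all subjects still at risk at time t_i
        at_risk = [x_j for t_j, x_j in zip(durations, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in at_risk]
        expected = sum(w * x_j for w, x_j in zip(weights, at_risk)) / sum(weights)
        residuals.append((t_i, x_i - expected))
    return sorted(residuals)


# Hypothetical data: time, event status, and a binary group covariate
durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
group = [1, 0, 1, 0]

residuals = schoenfeld_residuals(durations, events, group, beta=0.4)
for t_i, r in residuals:
    print(t_i, round(r, 4))
```

<p>Plotting these residuals against time, and looking for a trend, is the visual check described above. With β set to 0, all weights are equal and the expected value is simply the plain average of the covariate among those at risk.</p>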






<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Formal Statistical Tests</h3>



<p>A formal test gives you a p-value to help you decide if any violations you see are statistically significant.</p>



<ul class="wp-block-list">
<li><strong>How it works:</strong> The most common test is based on the <strong>Schoenfeld residuals</strong>. It formally tests whether the slope of a line fitted to the Schoenfeld residuals-vs-time plot is significantly different from zero. This is often done using a function like <code>cox.zph()</code> in R or the <code>check_assumptions()</code> method in Python&#8217;s <code>lifelines</code> library.</li>



<li><strong>How to interpret the results:</strong>
<ul class="wp-block-list">
<li><strong>Null Hypothesis (H<sub>0</sub>):</strong> The effect of the covariate is constant over time (the proportional hazards assumption holds).</li>



<li>A <strong>low p-value (e.g., &lt; 0.05)</strong> suggests you should <strong>reject the null hypothesis</strong>. This is evidence that the assumption is violated for that specific covariate.</li>



<li>A <strong>high p-value</strong> means you <strong>fail to reject the null hypothesis</strong>, so it&#8217;s reasonable to assume the proportional hazards assumption is met.</li>
</ul>
</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading"><strong>Example of check_assumptions(data) in Python</strong></h2>



<p><strong>The dataset</strong> includes:</p>



<ul class="wp-block-list">
<li><strong>Subject ID</strong></li>



<li><strong>Survival time in days</strong></li>



<li><strong>Event status</strong>&nbsp;(1 = event occurred, 0 = censored)</li>



<li><strong>Age group</strong> (1 &#8211; 4)</li>
</ul>



<p>To use&nbsp;<code>check_assumptions()</code>&nbsp;from&nbsp;<strong>lifelines</strong>, you need to&nbsp;<strong>fit a Cox Proportional Hazards model</strong>, which requires at least&nbsp;<strong>one covariate</strong>&nbsp;(a variable that might affect survival, e.g., age group).</p>



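<pre class="wp-block-code"><code>from lifelines import CoxPHFitter

# `data` is a pandas DataFrame. Every column other than the duration and
# event columns is treated as a covariate, so drop identifier columns
# (such as the subject ID) before fitting.

# Fit the Cox model
cph = CoxPHFitter()
cph.fit(data, duration_col="Survival_in_days", event_col="Status_bool")

# Check the proportional hazards assumption
cph.check_assumptions(data)
# or set the p-value threshold explicitly
cph.check_assumptions(data, p_value_threshold=0.05)
</code></pre>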



<h3 class="wp-block-heading">What to Do If the Assumption is Violated</h3>



<p>If you find that a key variable violates the assumption, you have options:</p>



<ol start="1" class="wp-block-list">
<li><strong>Stratification:</strong> You can stratify your model by the problematic covariate. This allows the baseline hazard function to be different for each level of that variable, resolving the violation. For example, if &#8220;gender&#8221; violates the assumption, you can stratify by it.</li>



<li><strong>Use Time-Dependent Covariates:</strong> You can modify the model to include an interaction term between the covariate and time. This explicitly models how the covariate&#8217;s effect changes over time.</li>



<li><strong>Choose a Different Model:</strong> If the violation is severe, a parametric model (like a Weibull or Log-Logistic model) that has a built-in shape for the hazard function might be a more appropriate choice for your data.</li>
</ol>



<h2 class="wp-block-heading">Parametric Models</h2>



<p>When the underlying probability distribution of the survival times is known, or can reasonably be assumed, parametric models can be used to model the survival function. Once the model is specified, either in terms of the survival times or the logarithm of the survival times, its parameters can be estimated by maximum likelihood.</p>



<p>Parametric models in survival analysis assume that a subject&#8217;s survival time follows a specific, known statistical distribution (like the exponential or Weibull distribution). Unlike non-parametric models (e.g., Kaplan-Meier), which don&#8217;t make assumptions about the data&#8217;s distribution, parametric models are defined by a set of parameters that dictate the shape of the survival and hazard functions.</p>



<p>Understanding them means understanding the <strong>hazard function</strong> each model assumes. The hazard function describes the instantaneous risk of an event occurring at a specific time, given that it hasn&#8217;t occurred yet.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Exponential Model</h3>



<p>This is the simplest parametric model. It assumes the hazard rate is <strong>constant</strong> over time.</p>



<ul class="wp-block-list">
<li><strong>Core Idea:</strong> The risk of the event happening is the same every single day. If a component hasn&#8217;t failed by day 10, its risk of failing on day 11 is the same as it was on day 2.</li>



<li><strong>Hazard Function:</strong> h(t)=λ (a constant)</li>



<li><strong>How to Understand It:</strong> This model is defined by a single parameter, λ (lambda), the hazard rate. It&#8217;s best suited for events that don&#8217;t have a &#8220;memory&#8221; or aging process, such as the failure rate of certain electronic components or the occurrence of random external events.</li>
</ul>
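<p>The &#8220;no memory&#8221; property can be checked numerically. The sketch below assumes a hypothetical hazard rate of 0.1 per day and compares the chance of surviving one more day from day 1 and from day 10.</p>

```python
import math

def exponential_survival(t, lam):
    """S(t) = exp(-lam * t) for a constant hazard lam."""
    return math.exp(-lam * t)

def conditional_one_day_survival(s, lam):
    """Probability of surviving from day s to day s + 1, given survival to day s."""
    return exponential_survival(s + 1, lam) / exponential_survival(s, lam)

lam = 0.1  # hypothetical hazard rate per day

p_day2 = conditional_one_day_survival(1, lam)    # from day 1 to day 2
p_day11 = conditional_one_day_survival(10, lam)  # from day 10 to day 11

# Both probabilities equal exp(-lam): the past does not change the risk
print(p_day2, p_day11)
```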



<h3 class="wp-block-heading">Weibull Model</h3>



<p>The Weibull model is a more flexible and widely used model because it does not assume a constant hazard rate.</p>



<ul class="wp-block-list">
<li><strong>Core Idea:</strong> The risk of the event can <strong>increase, decrease, or remain constant</strong> over time. This makes it much more adaptable to real-world scenarios.</li>



<li><strong>Hazard Function:</strong> h(t) = λk(λt)<sup>k−1</sup></li>



<li><strong>How to Understand It:</strong> The model is defined by two main parameters:
<ul class="wp-block-list">
<li>λ (<strong>scale parameter</strong>): Stretches or compresses the curve along the time axis. In the form above λ acts as a rate, so 1/λ is the characteristic time scale.</li>



<li>k (<strong>shape parameter</strong>): This is the key. It dictates the nature of the hazard.
<ul class="wp-block-list">
<li>If <strong>k&gt;1</strong>, the hazard <strong>increases</strong> over time (e.g., aging, where the risk of failure grows).</li>



<li>If <strong>k&lt;1</strong>, the hazard <strong>decreases</strong> over time (e.g., post-surgery recovery, where the initial risk is high but drops).</li>



<li>If <strong>k=1</strong>, the Weibull model simplifies to the Exponential model with a <strong>constant</strong> hazard.</li>
</ul>
</li>
</ul>
</li>
</ul>
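<p>A short numerical sketch makes the role of the shape parameter concrete. It evaluates the hazard formula given above for three values of k, using a hypothetical λ of 0.2, and reports whether the hazard rises, falls, or stays flat.</p>

```python
import numpy as np

def weibull_hazard(t, lam, k):
    """h(t) = lam * k * (lam * t) ** (k - 1)."""
    return lam * k * (lam * t) ** (k - 1)

def weibull_survival(t, lam, k):
    """S(t) = exp(-(lam * t) ** k), the survival function implied by h(t)."""
    return np.exp(-(lam * t) ** k)

t = np.linspace(0.5, 10, 20)
lam = 0.2  # hypothetical value

for k in (0.5, 1.0, 2.0):
    h = weibull_hazard(t, lam, k)
    if np.allclose(h, h[0]):
        trend = "constant"
    elif np.all(np.diff(h) > 0):
        trend = "increasing"
    else:
        trend = "decreasing"
    print(f"k={k}: hazard is {trend}")
```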



<h3 class="wp-block-heading">Log-Normal and Log-Logistic Models</h3>



<p>These models are useful for situations where the hazard rate is not monotonic (i.e., it doesn&#8217;t just go up or down).</p>



<ul class="wp-block-list">
<li><strong>Log-Normal Model</strong>
<ul class="wp-block-list">
<li><strong>Core Idea:</strong> Assumes that the <em>logarithm</em> of the survival time follows a normal (bell-shaped) distribution.</li>



<li><strong>Hazard Function:</strong> The hazard rate first <strong>increases</strong> to a peak and then <strong>decreases</strong>.</li>



<li><strong>How to Understand It:</strong> Think of situations where failure is most likely after a certain &#8220;wear-in&#8221; period, but if the subject survives past that peak, the immediate risk then declines. It&#8217;s often used in engineering for component fatigue.</li>
</ul>
</li>



<li><strong>Log-Logistic Model</strong>
<ul class="wp-block-list">
<li><strong>Core Idea:</strong> Similar to the Log-Normal model, it also assumes the logarithm of survival time follows a specific distribution (the logistic distribution).</li>



<li><strong>Hazard Function:</strong> The hazard rate can be <strong>hump-shaped</strong> (increasing then decreasing) or <strong>monotonically decreasing</strong>, depending on its parameters. It&#8217;s more flexible than the Log-Normal model.</li>



<li><strong>How to Understand It:</strong> This model is popular in medical research, especially when studying diseases like cancer where the risk of mortality might peak some time after diagnosis and then fall. It&#8217;s also notable because its survival function has a simple, explicit formula, which makes interpreting odds easier.</li>
</ul>
</li>
</ul>
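<p>The hazard shapes described above can be verified with the standard textbook formulas for these two distributions. The parameter names (<code>mu</code>, <code>sigma</code>, <code>alpha</code>, <code>beta</code>) and the values used are assumptions chosen for illustration, not taken from a fitted model.</p>

```python
import math
import numpy as np

def lognormal_hazard(t, mu, sigma):
    """h(t) = f(t) / S(t) when log(T) is normal with mean mu and std sigma."""
    z = (math.log(t) - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2 * math.pi))
    survival = 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)
    return pdf / survival

def loglogistic_hazard(t, alpha, beta):
    """h(t) = (beta/alpha) * (t/alpha)**(beta-1) / (1 + (t/alpha)**beta)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1) / (1 + (t / alpha) ** beta)

t = np.linspace(0.05, 10, 400)

ln_h = np.array([lognormal_hazard(x, mu=0.0, sigma=1.0) for x in t])
ll_hump = loglogistic_hazard(t, alpha=1.0, beta=3.0)        # shape > 1
ll_decreasing = loglogistic_hazard(t, alpha=1.0, beta=0.8)  # shape <= 1

print("log-normal hazard peaks at t =", round(float(t[ln_h.argmax()]), 2))
print("log-logistic (beta=3) peaks at t =", round(float(t[ll_hump.argmax()]), 2))
print("log-logistic (beta=0.8) decreasing:", bool(np.all(np.diff(ll_decreasing) < 0)))
```

<p>For the log-logistic model, the shape parameter decides the form: a value above 1 gives the hump, while a value of 1 or below gives a hazard that only decreases.</p>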



<h2 class="wp-block-heading">Accelerated Failure Time (AFT) models</h2>



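<p>In contrast to the Cox model, where covariates act multiplicatively on the hazard, AFT models assume that covariates act multiplicatively on the survival time itself: a covariate speeds up or slows down the time to the event by a constant factor.</p>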
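<p>A minimal sketch of that idea, assuming a Weibull baseline and a single binary covariate x with a hypothetical coefficient: the covariate rescales time by exp(βx), so the median survival time is multiplied by the same factor.</p>

```python
import numpy as np

def weibull_survival(t, scale, shape):
    """Baseline survival S0(t) = exp(-(t / scale) ** shape)."""
    return np.exp(-(t / scale) ** shape)

def aft_survival(t, x, beta, scale, shape):
    """AFT model: S(t | x) = S0(t / exp(beta * x))."""
    return weibull_survival(t / np.exp(beta * x), scale, shape)

def aft_median(x, beta, scale, shape):
    """Median survival time, obtained by solving S(t | x) = 0.5 for t."""
    return np.exp(beta * x) * scale * np.log(2) ** (1 / shape)

beta, scale, shape = 0.7, 10.0, 1.5  # hypothetical parameter values

m0 = aft_median(0, beta, scale, shape)  # x = 0
m1 = aft_median(1, beta, scale, shape)  # x = 1
print("time ratio:", m1 / m0)           # equals exp(beta)
```

<p>A positive coefficient therefore means longer survival times, which is the opposite sign convention from a Cox model, where a positive coefficient means a higher hazard.</p>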






<h2 class="wp-block-heading">Metrics for Survival Analysis</h2>



<h3 class="wp-block-heading">Log-Rank Test</h3>



<p>The <strong>log-rank test</strong> is a statistical test used in survival analysis to <strong>compare the survival distributions of two or more groups</strong>. It plays the role in survival analysis that the t-test or Pearson&#8217;s chi-squared test plays elsewhere. Its null hypothesis H<sub>0</sub> is that the groups do not differ in the probability of an event occurring at any time t.</p>



<p>It is particularly useful for determining whether the time until an event (such as death, failure, or relapse) differs significantly between groups.</p>



<h3 class="wp-block-heading" id="examplecalculation">Example Calculation</h3>



<p>Let&#8217;s consider a clinical trial comparing the survival times of patients using two different cancer treatments, Drug A and Drug B. Here&#8217;s a simplified dataset:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Time (months)</th><th>Event (1=death, 0=censored)</th><th>Group (A/B)</th></tr></thead><tbody><tr><td>2</td><td>1</td><td>A</td></tr><tr><td>3</td><td>0</td><td>A</td></tr><tr><td>4</td><td>1</td><td>B</td></tr><tr><td>6</td><td>1</td><td>A</td></tr><tr><td>7</td><td>0</td><td>B</td></tr><tr><td>8</td><td>1</td><td>B</td></tr></tbody></table></figure>



<h4 class="wp-block-heading" id="stepstocalculatethelogranktest">Steps to Calculate the Log-Rank Test:</h4>



<ol class="wp-block-list">
<li><strong>Combine the Data</strong>: List all unique time points from both groups.</li>



<li><strong>Calculate the Number at Risk</strong>: For each time point, determine how many patients are still at risk in each group.</li>



<li><strong>Calculate the Expected Events</strong>: For each time point, calculate the expected number of events (deaths) for each group.</li>



<li><strong>Compute the Test Statistic</strong>: Use the observed and expected events to calculate the test statistic.</li>
</ol>



<h4 class="wp-block-heading" id="detailedcalculation">Detailed Calculation:</h4>



<ol class="wp-block-list">
<li><strong>Combine the Data</strong>:
<ul class="wp-block-list">
<li>Time points: 2, 3, 4, 6, 7, 8</li>
</ul>
</li>



<li><strong>Number at Risk</strong> (subjects whose recorded time is at least the given time point):
<ul class="wp-block-list">
<li>At time 2: Group A: 3, Group B: 3</li>



<li>At time 3: Group A: 2, Group B: 3</li>



<li>At time 4: Group A: 1, Group B: 3</li>



<li>At time 6: Group A: 1, Group B: 2</li>



<li>At time 7: Group A: 0, Group B: 2</li>



<li>At time 8: Group A: 0, Group B: 1</li>
</ul>
</li>



<li><strong>Expected Events</strong> (computed only at times where a death occurs, as each group&#8217;s share of the subjects at risk multiplied by the number of deaths):
<ul class="wp-block-list">
<li>At time 2: Expected events for Group A = (3/6) * 1 = 0.50, Group B = (3/6) * 1 = 0.50</li>



<li>At time 4: Expected events for Group A = (1/4) * 1 = 0.25, Group B = (3/4) * 1 = 0.75</li>



<li>At time 6: Expected events for Group A = (1/3) * 1 = 0.33, Group B = (2/3) * 1 = 0.67</li>



<li>At time 8: Expected events for Group A = 0, Group B = 1</li>
</ul>
</li>



<li><strong>Test Statistic</strong>:<ul><li>Sum the observed and expected events for each group.</li><li>Calculate the test statistic using the formula:</li></ul>$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $ where O<sub>i</sub> is the observed number of events and E<sub>i</sub> is the expected number of events in group i. For our example, O<sub>A</sub> = 2 and E<sub>A</sub> = 0.50 + 0.25 + 0.33 ≈ 1.083, while O<sub>B</sub> = 2 and E<sub>B</sub> = 0.50 + 0.75 + 0.67 + 1 ≈ 2.917, so χ<sup>2</sup> ≈ (2 − 1.083)<sup>2</sup>/1.083 + (2 − 2.917)<sup>2</sup>/2.917 ≈ 1.06. This is below 3.84, the critical value at the 0.05 level for 1 degree of freedom, so this small sample gives no evidence of a difference between the two drugs.</li>
</ol>



<p>If the test statistic is greater than the critical value from the chi-square distribution table (with 1 degree of freedom), we reject the null hypothesis and conclude that there is a significant difference between the survival distributions of the two groups <a href="https://datatab.net/tutorial/log-rank-test" target="_blank" rel="noopener">[1]</a> <a href="https://real-statistics.com/survival-analysis/kaplan-meier-procedure/log-rank-test/" target="_blank" rel="noopener">[2]</a>.</p>
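<p>The hand calculation can be reproduced in a few lines of pandas. This sketch uses the six patients from the table and the simple observed-versus-expected formula shown above. Statistical packages typically use a variance-based form of the statistic, so their value can differ somewhat from this one.</p>

```python
import pandas as pd

# The six patients from the example table
df = pd.DataFrame({
    "time":  [2, 3, 4, 6, 7, 8],
    "event": [1, 0, 1, 1, 0, 1],
    "group": ["A", "A", "B", "A", "B", "B"],
})
groups = ["A", "B"]

# Observed deaths per group
observed = (
    df[df["event"] == 1].groupby("group").size()
    .reindex(groups, fill_value=0).astype(float)
)
expected = pd.Series(0.0, index=groups)

# Walk through the distinct times at which at least one death occurs
for t in sorted(df.loc[df["event"] == 1, "time"].unique()):
    at_risk = df[df["time"] >= t].groupby("group").size()
    at_risk = at_risk.reindex(groups, fill_value=0)
    deaths = ((df["time"] == t) & (df["event"] == 1)).sum()
    # Each group's share of the risk set times the number of deaths
    expected += deaths * at_risk / at_risk.sum()

chi2 = (((observed - expected) ** 2) / expected).sum()
print(expected.round(3).to_dict(), round(float(chi2), 3))
```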



<h2 class="wp-block-heading">Example of Survival Analysis Models in Python</h2>



<p>The code below computes the Kaplan-Meier survival function and the Nelson-Aalen cumulative hazard by hand and plots both.</p>



<pre class="wp-block-code"><code>import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Create a Sample Dataset ---
data_dict = {
    'time': &#91;6, 7, 10, 15, 18, 22, 25, 30, 32, 38, 40, 45],
    'event': &#91;1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 1 = event, 0 = censored
}
data = pd.DataFrame(data_dict)
data = data.sort_values(by='time').reset_index(drop=True)

# --- 2. Calculate Kaplan-Meier Survival Function ---
km_data = data.copy()
unique_times = sorted(km_data&#91;'time'].unique())
survival_prob = 1.0
results = &#91;]

for t in unique_times:
    at_risk = (km_data&#91;'time'] &gt;= t).sum()
    events = km_data&#91;(km_data&#91;'time'] == t) &amp; (km_data&#91;'event'] == 1)].shape&#91;0]
    
    if at_risk &gt; 0:
        survival_prob *= (1 - events / at_risk)
    
    results.append({'time': t, 'survival': survival_prob})

km_results = pd.DataFrame(results)

# --- 3. Plot the Survival Function ---
plt.figure(figsize=(10, 6))
plt.step(km_results&#91;'time'], km_results&#91;'survival'], where='post', label='Kaplan-Meier Estimate')
# Look up each censored subject's survival estimate by time, not by row position
censored = data&#91;data&#91;'event'] == 0].merge(km_results, on='time')
plt.scatter(censored&#91;'time'], censored&#91;'survival'],
            marker='+', color='red', s=100, label='Censored')
plt.title('Survival Function (Kaplan-Meier Estimate)')
plt.xlabel('Time (Months)')
plt.ylabel('Survival Probability')
plt.ylim(0, 1.05)
plt.xlim(0, max(data&#91;'time']) + 5)
plt.grid(True)
plt.legend()
plt.savefig('manual_survival_function_plot.png')
plt.close()

# --- 4. Calculate Nelson-Aalen Cumulative Hazard ---
na_data = data.copy()
hazard = 0.0
na_results_list = &#91;]

for t in unique_times:
    at_risk = (na_data&#91;'time'] &gt;= t).sum()
    events = na_data&#91;(na_data&#91;'time'] == t) &amp; (na_data&#91;'event'] == 1)].shape&#91;0]

    if at_risk &gt; 0:
        hazard += events / at_risk
    
    na_results_list.append({'time': t, 'cumulative_hazard': hazard})

na_results = pd.DataFrame(na_results_list)

# --- 5. Plot the Cumulative Hazard Function ---
plt.figure(figsize=(10, 6))
plt.step(na_results&#91;'time'], na_results&#91;'cumulative_hazard'], where='post', label='Nelson-Aalen Estimate')
plt.title('Cumulative Hazard Function (Nelson-Aalen Estimate)')
plt.xlabel('Time (Months)')
plt.ylabel('Cumulative Hazard')
plt.xlim(0, max(data&#91;'time']) + 5)
plt.grid(True)
plt.legend()
plt.savefig('manual_cumulative_hazard_plot.png')
plt.close()</code></pre>






<p>References</p>



<p>[1] <a href="https://datatab.net/tutorial/log-rank-test" target="_blank" rel="noopener">Log-Rank Test: A Beginner’s Guide &#8211; DATAtab</a></p>



<p>[2] <a href="https://real-statistics.com/survival-analysis/kaplan-meier-procedure/log-rank-test/" target="_blank" rel="noopener">Log-Rank Test &#8211; Real Statistics Using Excel</a></p>



<p>The post <a rel="nofollow" href="https://mietwood.com/survival-analysis-models">Survival Analysis Models</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Advanced Programming in SQL and Python &#8211; the course for business analysts</title>
		<link>https://mietwood.com/advanced-programming-in-sql-and-python</link>
					<comments>https://mietwood.com/advanced-programming-in-sql-and-python#comments</comments>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Tue, 01 Jul 2025 07:45:06 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3144</guid>

					<description><![CDATA[<p>Advanced Programming in SQL and Python &#8211; the course for business analysts &#8211; this course aims to equip students with advanced skills in SQL and Python, focusing on their application in business analytics. Through seven mini analytical case studies, students will gain hands-on experience in solving real-world business problems. Course Structure: Week 1-2: Introduction and...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/advanced-programming-in-sql-and-python">Advanced Programming in SQL and Python &#8211; the course for business analysts</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Advanced Programming in SQL and Python &#8211; the course for business analysts &#8211; this course aims to equip students with advanced skills in SQL and Python, focusing on their application in business analytics. Through seven mini analytical case studies, students will gain hands-on experience in solving real-world business problems.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="249" src="https://mietwood.com/wp-content/uploads/2025/07/image.jpg" alt="python sql trend" class="wp-image-3148" srcset="https://mietwood.com/wp-content/uploads/2025/07/image.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/07/image-300x73.jpg 300w, https://mietwood.com/wp-content/uploads/2025/07/image-768x187.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">python sql trend</figcaption></figure>



<h4 class="wp-block-heading" id="coursestructure"><strong>Course Structure:</strong></h4>



<p><strong>Week 1-2: Introduction and Setup</strong></p>



<ul class="wp-block-list">
<li><strong>Lecture:</strong> Overview of SQL and Python in business analytics.</li>



<li><strong>Lab:</strong> Setting up the environment (SQL databases, Python IDEs).</li>



<li><strong>Project 1:</strong> Data Import and Cleaning
<ul class="wp-block-list">
<li><strong>SQL:</strong> Importing data from various sources, cleaning and preprocessing.</li>



<li><strong>Python:</strong> Using pandas for data cleaning and manipulation.</li>
</ul>
</li>
</ul>



<p><strong>Week 3-4: Data Exploration and Visualization</strong></p>



<ul class="wp-block-list">
<li><strong>Lecture:</strong> Techniques for data exploration and visualization.</li>



<li><strong>Lab:</strong> SQL queries for data exploration, Python libraries for visualization (matplotlib, seaborn).</li>



<li><strong>Project 2:</strong> Exploratory Data Analysis (EDA)
<ul class="wp-block-list">
<li><strong>SQL:</strong> Writing complex queries to explore data.</li>



<li><strong>Python:</strong> Visualizing data trends and patterns.</li>
</ul>
</li>
</ul>



<p><strong>Week 5-6: Statistical Analysis</strong></p>



<ul class="wp-block-list">
<li><strong>Lecture:</strong> Statistical methods for business analytics.</li>



<li><strong>Lab:</strong> SQL functions for statistical analysis, Python libraries (numpy, scipy).</li>



<li><strong>Project 3:</strong> Statistical Analysis of Sales Data
<ul class="wp-block-list">
<li><strong>SQL:</strong> Calculating statistical measures (mean, median, standard deviation).</li>



<li><strong>Python:</strong> Performing hypothesis testing and regression analysis.</li>
</ul>
</li>
</ul>



<p><strong>Week 7-8: Predictive Modeling</strong></p>



<ul class="wp-block-list">
<li><strong>Lecture:</strong> Introduction to predictive modeling techniques.</li>



<li><strong>Lab:</strong> SQL for data preparation, Python for model building (scikit-learn).</li>



<li><strong>Project 4:</strong> Predictive Sales Forecasting
<ul class="wp-block-list">
<li><strong>SQL:</strong> Preparing data for modeling.</li>



<li><strong>Python:</strong> Building and evaluating predictive models.</li>
</ul>
</li>
</ul>



<p><strong>Week 9-10: Time Series Analysis</strong></p>



<ul class="wp-block-list">
<li><strong>Lecture:</strong> Time series analysis and forecasting.</li>



<li><strong>Lab:</strong> SQL for time series data manipulation, Python libraries (statsmodels).</li>



<li><strong>Project 5:</strong> Time Series Forecasting
<ul class="wp-block-list">
<li><strong>SQL:</strong> Extracting and transforming time series data.</li>



<li><strong>Python:</strong> Building time series models and forecasting.</li>
</ul>
</li>
</ul>



<p><strong>Week 11-12: Machine Learning Integration</strong></p>



<ul class="wp-block-list">
<li><strong>Lecture:</strong> Integrating machine learning with SQL databases.</li>



<li><strong>Lab:</strong> SQL for data storage and retrieval, Python for machine learning (TensorFlow, Keras).</li>



<li><strong>Project 6:</strong> Customer Segmentation
<ul class="wp-block-list">
<li><strong>SQL:</strong> Storing and retrieving data for machine learning.</li>



<li><strong>Python:</strong> Building and deploying machine learning models.</li>
</ul>
</li>
</ul>



<p><strong>Week 13-14: Advanced Topics and Final Project</strong></p>



<ul class="wp-block-list">
<li><strong>Lecture:</strong> Advanced topics in SQL and Python (optimization, big data).</li>



<li><strong>Lab:</strong> SQL performance tuning, Python for big data (PySpark).</li>



<li><strong>Project 7:</strong> Comprehensive Business Analytics Project
<ul class="wp-block-list">
<li><strong>SQL:</strong> Optimizing queries for large datasets.</li>



<li><strong>Python:</strong> Analyzing and visualizing large datasets.</li>
</ul>
</li>
</ul>



<h4 class="wp-block-heading" id="assessment"><strong>Assessment:</strong></h4>



<ul class="wp-block-list">
<li><strong>Projects:</strong> Each mini project will be assessed based on accuracy, efficiency, and creativity.</li>



<li><strong>Final Project:</strong> A comprehensive project integrating all learned skills.</li>
</ul>



<h4 class="wp-block-heading" id="resources"><strong>Resources:</strong></h4>



<ul class="wp-block-list">
<li><strong>Books:</strong> &#8220;SQL for Data Analytics&#8221; by Upom Malik, Matt Goldwasser, and Benjamin Johnston; &#8220;Python for Data Analysis&#8221; by Wes McKinney.</li>



<li><strong>Online Tutorials:</strong> SQLZoo, DataCamp, Coursera.</li>
</ul>



<h2 class="wp-block-heading">Advanced Programming in SQL and Python</h2>



<p><a href="https://mietwood.com/advanced-programming-in-data-analysis">Advanced Programming in Data Analysis</a></p>



<p><a href="https://datascience.umcs.pl" target="_blank" rel="noopener">https://datascience.umcs.pl</a></p>



<p></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/advanced-programming-in-sql-and-python">Advanced Programming in SQL and Python &#8211; the course for business analysts</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://mietwood.com/advanced-programming-in-sql-and-python/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Python for analyst &#8211; string split</title>
		<link>https://mietwood.com/python-for-analyst-string-split</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Wed, 02 Apr 2025 12:20:54 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=2991</guid>

					<description><![CDATA[<p>The task is to extract a URL from an HTML field. We can get the raw data with this query. The data looks like this. We can use string_split to turn the HTML field into a URL. This gives the following result. Happy coding.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/python-for-analyst-string-split">Python for analyst &#8211; string split</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The task is to extract a URL from an HTML field.</p>



<p>We can get the raw data with this query:</p>



<pre class="wp-block-code"><code>select &#91;Id], &#91;Name], html
 FROM &#91;DBPromoHeader]
where Html is not null and Html like '%href%'</code></pre>



<p>The data looks like this:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="231" src="https://mietwood.com/wp-content/uploads/2025/04/image.jpg" alt="" class="wp-image-2992" srcset="https://mietwood.com/wp-content/uploads/2025/04/image.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/04/image-300x68.jpg 300w, https://mietwood.com/wp-content/uploads/2025/04/image-768x173.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



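<p>We can use <code>string_split</code> to break the HTML field into space-separated tokens, keep the token that starts with <code>href</code>, and turn it into a URL.</p>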



<pre class="wp-block-code"><code>select &#91;Id], &#91;Name], html, <mark style="background-color:var(--global-palette8)" class="has-inline-color">replace(replace(value,'href="','https://mietwood.com/'),'"','')</mark> Promo_url
 FROM &#91;DBPromoHeader]
<mark style="background-color:var(--global-palette8)" class="has-inline-color"> <strong>cross apply string_split(Html,' ')</strong></mark>
 where Html is not null and Html like '%href%' and value like 'href%'
 order by &#91;Id]</code></pre>
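<p>For readers who want to do the same step in Python after loading the rows, here is a rough counterpart using only the standard library. It is a sketch, not a translation of the query: the class and function names are made up for illustration, the sample HTML is hypothetical, and it parses the markup instead of splitting on spaces.</p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://mietwood.com/"  # same prefix the SQL query prepends

class HrefCollector(HTMLParser):
    """Collects the value of every href attribute found in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)

def extract_promo_urls(html, base_url=BASE_URL):
    """Return absolute URLs for all href values in an HTML fragment."""
    collector = HrefCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]

# Hypothetical row content; the real Html column values are not shown in the post
sample = '<p>See <a href="promo/summer-sale">the offer</a></p>'
print(extract_promo_urls(sample))
```

<p>One difference from the SQL version: <code>urljoin</code> leaves an href that is already an absolute URL unchanged, whereas the <code>replace</code> call always prepends the site prefix.</p>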



<p>This gives the following result:</p>



<figure class="wp-block-image size-full"><img decoding="async" src="https://mietwood.com/wp-content/uploads/2025/04/image-1.jpg" alt="" class="wp-image-2993"/></figure>



<p>Happy coding.</p>



<p></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/python-for-analyst-string-split">Python for analyst &#8211; string split</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Matplotlib vs Seaborn</title>
		<link>https://mietwood.com/matplotlib-vs-seaborn</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 23 Jan 2025 20:11:44 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=2847</guid>

					<description><![CDATA[<p>Creating data visualizations in Python: Matplotlib versus Seaborn. Data visualization is a key element of data analysis because it makes it possible to quickly understand the patterns, relationships, and trends hidden in data. In Python, the two most popular libraries for creating visualizations are Matplotlib and Seaborn. Each has its own strengths and uses, so in this article we compare them to help...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/matplotlib-vs-seaborn">Matplotlib vs Seaborn</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Creating data visualizations in Python: Matplotlib versus Seaborn</strong></p>



<p>Data visualization is a key element of data analysis because it makes it possible to quickly understand the patterns, relationships, and trends hidden in data. In Python, the two most popular libraries for creating visualizations are <strong>Matplotlib</strong> and <strong>Seaborn</strong>. Each has its own strengths and uses, so in this article we compare them to help you choose the right tool for your needs.</p>



<p>This article was written by Natalia Kustra, Economic Analytics (Analityka gospodarcza), first year of the second-cycle degree programme, 2025.</p>



<h2 class="wp-block-heading"><strong>1. Introduction to Matplotlib and Seaborn</strong></h2>



<h3 class="wp-block-heading"><strong>Matplotlib</strong></h3>



<p>Matplotlib is the most versatile and fundamental plotting library in Python. It offers enormous flexibility, making it possible to create almost any kind of chart, from simple lines to complex 3D plots. That flexibility, however, often means writing more code.</p>



<p>An example of a simple line chart in Matplotlib:</p>



<pre class="wp-block-code"><code>import matplotlib.pyplot as plt

x = &#91;1, 2, 3, 4, 5]
y = &#91;10, 12, 8, 15, 10]

plt.plot(x, y, label='Data')
plt.title('Line chart')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()</code></pre>



<h3 class="wp-block-heading"><strong>Seaborn</strong></h3>



<p>Seaborn is a library built on top of Matplotlib that focuses on simplifying the process of creating visualizations. It is especially useful for statistical data analysis and offers more attractive default plot styles.</p>



<p>An example of a scatter plot in Seaborn:</p>



<pre class="wp-block-code"><code>import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Sample data
data = pd.DataFrame({
    'X': np.random.rand(50),
    'Y': np.random.rand(50)
})

sns.scatterplot(x='X', y='Y', data=data)
plt.title('Scatter plot')
plt.show()</code></pre>



<h2 class="wp-block-heading"><strong>2. Comparison of key features</strong></h2>



<h3 class="wp-block-heading"><strong>Flexibility</strong></h3>



<ul class="wp-block-list">
<li><strong>Matplotlib</strong>:
<ul class="wp-block-list">
<li>Gives full control over every aspect of a plot.</li>



<li>Works well for custom and advanced visualizations.</li>



<li>The code tends to be longer and more complicated.</li>
</ul>
</li>



<li><strong>Seaborn</strong>:
<ul class="wp-block-list">
<li>Is optimized for standard statistical plots such as box plots, histograms, and density plots.</li>



<li>Is ideal for quick data exploration.</li>



<li>Is less flexible than Matplotlib.</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading"><strong>Ease of use</strong></h3>



<ul class="wp-block-list">
<li><strong>Matplotlib</strong>:
<ul class="wp-block-list">
<li>Requires more detailed plot configuration.</li>



<li>Can be harder for beginners to use.</li>
</ul>
</li>



<li><strong>Seaborn</strong>:
<ul class="wp-block-list">
<li>Its intuitive API makes it possible to create complex visualizations in a few lines of code.</li>



<li>Integrates very well with the Pandas library.</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading"><strong>Plot appearance</strong></h3>



<ul class="wp-block-list">
<li><strong>Matplotlib</strong>:
<ul class="wp-block-list">
<li>The default plot styles are basic and need adjusting.</li>



<li>Requires manual tuning to achieve attractive results.</li>
</ul>
</li>



<li><strong>Seaborn</strong>:
<ul class="wp-block-list">
<li>Offers more attractive, modern styles by default.</li>



<li>Supports color palettes that are useful in statistical data analysis.</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading"><strong>Performance</strong></h3>



<ul class="wp-block-list">
<li><strong>Matplotlib</strong>:
<ul class="wp-block-list">
<li>Can be somewhat slower when creating large plots, especially with complex animations.</li>
</ul>
</li>



<li><strong>Seaborn</strong>:
<ul class="wp-block-list">
<li>Can feel faster for simple visualizations, but it is built on Matplotlib, so it largely shares Matplotlib&#8217;s performance limitations.</li>
</ul>
</li>
</ul>



<h2 class="wp-block-heading"><strong>3. Practical examples</strong></h2>



<h3 class="wp-block-heading"><strong>Bar chart: Matplotlib vs. Seaborn</strong></h3>



<p><strong>Matplotlib:</strong></p>



<pre class="wp-block-code"><code>import matplotlib.pyplot as plt

categories = &#91;'A', 'B', 'C', 'D']
values = &#91;5, 7, 3, 8]

plt.bar(categories, values, color='skyblue')
plt.title('Bar chart in Matplotlib')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()</code></pre>



<p><strong>Seaborn:</strong></p>



<pre class="wp-block-code"><code>import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data = pd.DataFrame({'Category': &#91;'A', 'B', 'C', 'D'], 'Value': &#91;5, 7, 3, 8]})

# Recent seaborn releases warn when palette is passed without hue
sns.barplot(x='Category', y='Value', data=data, palette='Blues')
plt.title('Bar chart in Seaborn')
plt.show()</code></pre>



<h3 class="wp-block-heading"><strong>Box plot:</strong></h3>



<p><strong>Matplotlib:</strong></p>



<pre class="wp-block-code"><code>import matplotlib.pyplot as plt

values = &#91;7, 8, 5, 6, 9, 10, 6, 7, 8]

plt.boxplot(values)
plt.title('Box plot in Matplotlib')
plt.show()</code></pre>



<p><strong>Seaborn:</strong></p>



<pre class="wp-block-code"><code>import matplotlib.pyplot as plt
import seaborn as sns

values = &#91;7, 8, 5, 6, 9, 10, 6, 7, 8]

sns.boxplot(data=values, color='lightblue')
plt.title('Box plot in Seaborn')
plt.show()</code></pre>



<h2 class="wp-block-heading"><strong>4. When to choose Matplotlib and when Seaborn?</strong></h2>



<ul class="wp-block-list">
<li><strong>Choose Matplotlib if:</strong>
<ul class="wp-block-list">
<li>You need full control over the appearance of your plots.</li>



<li>You create custom visualizations or animations.</li>



<li>You want to use 3D features.</li>
</ul>
</li>



<li><strong>Choose Seaborn if:</strong>
<ul class="wp-block-list">
<li>You want to quickly create attractive plots from data in Pandas.</li>



<li>You analyze statistical data.</li>



<li>You need plots that look good by default.</li>
</ul>
</li>
</ul>
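<p>The two libraries can also be combined. Seaborn plotting functions such as <code>scatterplot</code> draw onto a Matplotlib <code>Axes</code>, so a plot started in Seaborn can be fine-tuned with ordinary Matplotlib calls. The sketch below uses made-up random data and a non-interactive backend so that it runs without a display.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up sample data
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(50), "y": rng.random(50)})

fig, ax = plt.subplots(figsize=(6, 4))

# Seaborn draws the statistical plot onto the given Matplotlib Axes
sns.scatterplot(data=df, x="x", y="y", ax=ax)

# Matplotlib handles the detailed customization
ax.set_title("Seaborn plot, Matplotlib styling")
ax.axhline(df["y"].mean(), linestyle="--", color="gray", label="mean of y")
ax.legend()
fig.tight_layout()
```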



<h2 class="wp-block-heading"><strong>5. Summary</strong></h2>



<p>Matplotlib and Seaborn are powerful data visualization tools, but each has its own distinct uses. Matplotlib provides flexibility and control, while Seaborn lets you quickly create attractive visualizations. The choice between them depends on your needs and experience.</p>



<p>If you are just starting out, it is worth beginning with Seaborn to save time and focus on the data analysis. As you gain experience, Matplotlib will let you create more advanced plots.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/matplotlib-vs-seaborn">Matplotlib vs Seaborn</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
