<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>data science &#8211; Customer Experience Management</title>
	<atom:link href="https://mietwood.com/tag/data-science/feed" rel="self" type="application/rss+xml" />
	<link>https://mietwood.com</link>
	<description>Customer Experience Can Be Managed</description>
	<lastBuildDate>Sat, 11 Oct 2025 16:12:04 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://mietwood.com/wp-content/uploads/2022/09/cropped-Fav7-32x32.png</url>
	<title>data science &#8211; Customer Experience Management</title>
	<link>https://mietwood.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>15 Essential SQL Tips You Can&#8217;t Live Without</title>
		<link>https://mietwood.com/sql-tips</link>
					<comments>https://mietwood.com/sql-tips#comments</comments>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sat, 11 Oct 2025 15:54:17 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Business Analytics]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3353</guid>

					<description><![CDATA[<p>Whether you&#8217;re optimizing performance or simplifying your queries, these SQL tips from mietwood.com will help you write cleaner, faster, and more efficient code. SQL is the backbone of data-driven decision-making, and mastering it can dramatically improve how you interact with databases. Whether you&#8217;re a seasoned developer or just starting out, writing efficient, readable, and scalable...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/sql-tips">15 Essential SQL Tips You Can&#8217;t Live Without</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Whether you&#8217;re optimizing performance or simplifying your queries, these SQL tips from <a href="https://mietwood.com/10-tips-how-to-make-sql-lighter" target="_blank" rel="noreferrer noopener">mietwood.com</a> will help you write cleaner, faster, and more efficient code.</p>



<p>SQL is the backbone of data-driven decision-making, and mastering it can dramatically improve how you interact with databases. Whether you&#8217;re a seasoned developer or just starting out, writing efficient, readable, and scalable SQL queries is a skill that pays off daily. In this post, I’ve compiled fifteen essential tips that will help you write smarter SQL: tips that I’ve learned, refined, and shared over time. These aren’t just theoretical best practices; they’re practical techniques that can make your queries faster, your code cleaner, and your debugging easier.</p>



<p>You can download Microsoft SQL Server here: <a href="https://www.microsoft.com/pl-pl/sql-server/sql-server-downloads" target="_blank" rel="noopener">https://www.microsoft.com/pl-pl/sql-server/sql-server-downloads</a>. The Developer edition is available <a href="https://go.microsoft.com/fwlink/p/?linkid=2215158&amp;clcid=0x415&amp;culture=pl-pl&amp;country=pl" target="_blank" rel="noopener">here</a>.</p>



<p>From avoiding <code>SELECT *</code> to choosing the right join types, each tip is designed to help you think critically about how your queries perform and how they scale. You’ll also learn how to use indexes effectively, filter data early, and make smart choices between <code>EXISTS</code> and <code>IN</code>. Each section includes a short summary and a link to a full post where you can dive deeper into the topic. Whether you&#8217;re optimizing a legacy system or building something new, these tips will help you get the most out of SQL—and avoid common pitfalls that slow down your work.</p>



<h3 class="wp-block-heading"><strong>Select Only What You Need</strong></h3>



<p>Avoid <code>SELECT *</code> and specify only the columns you need. This reduces data transfer and memory usage and can improve query speed. For example, instead of pulling all employee data, select only <code>employee_id</code>, <code>first_name</code>, and <code>last_name</code>. <a href="https://mietwood.com/10-tips-how-to-make-sql-lighter" target="_blank" rel="noreferrer noopener">Read more</a></p>



<h3 class="wp-block-heading" id="1optimizejoinconditionsforperformance"><strong>Optimize Join Conditions for Performance</strong></h3>



<p>Avoid non-SARGable joins that prevent index usage. Instead of applying functions to columns in join conditions, restructure the logic to preserve index efficiency. This dramatically improves query speed. <a href="https://mietwood.com/query-optimization-with-join-condition">Read the full guide</a></p>



<h3 class="wp-block-heading" id="2usethepivotoperatorforbetterreporting"><strong>Use the PIVOT Operator for Better Reporting</strong></h3>



<p>Transform row-based data into columnar format using <code>PIVOT</code>. This is ideal for cross-tab reports and trend analysis, especially when comparing metrics across time or categories. <a href="https://mietwood.com/the-pivot-operator-in-sql">Explore the PIVOT tutorial</a></p>



<h3 class="wp-block-heading" id="3masterrecursivectesforhierarchicaldata"><strong>Master Recursive CTEs for Hierarchical Data</strong></h3>



<p>Recursive Common Table Expressions (CTEs) allow you to elegantly query hierarchical or tree-structured data. They’re powerful for tasks like organizational charts or category trees. <a href="https://mietwood.com/blog">Learn about recursive CTEs</a></p>
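
<p>As a minimal sketch of the idea, the example below walks a small, hypothetical <code>employees</code> table with a recursive CTE. It runs on SQLite through Python&#8217;s built-in <code>sqlite3</code> module purely so that it is self-contained; the table, column names, and data are assumptions for illustration. SQLite expects the <code>RECURSIVE</code> keyword, whereas SQL Server uses a plain <code>WITH</code>.</p>

```python
import sqlite3

# Hypothetical org chart: each row points at its manager (NULL for the root).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL),
        (2, 'Ben', 1),
        (3, 'Cy', 1),
        (4, 'Dee', 2);
""")

# Anchor member: the root row. Recursive member: direct reports of rows found so far.
query = """
    WITH RECURSIVE org_chart(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, oc.depth + 1
        FROM employees AS e
        JOIN org_chart AS oc ON e.manager_id = oc.id
    )
    SELECT name, depth FROM org_chart ORDER BY depth, name;
"""
org_rows = conn.execute(query).fetchall()

# Print the hierarchy with two spaces of indentation per level.
for name, depth in org_rows:
    print("  " * depth + name)
```

<p>The <code>depth</code> column is optional, but it makes it easy to indent or filter the hierarchy by level.</p>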



<h3 class="wp-block-heading" id="4setthefirstdayoftheweekwithdatefirst"><strong>Set the First Day of the Week with DATEFIRST</strong></h3>



<p>Use <code>SET DATEFIRST</code> to control how SQL Server interprets weekday numbers. This is crucial for accurate time-based reporting and week-based aggregations. <a href="https://mietwood.com/category/sql">See how to use DATEFIRST</a></p>



<h3 class="wp-block-heading" id="5updatemultipletableswithconditions"><strong>Update Multiple Tables with Conditions</strong></h3>



<p>Learn how to structure multi-table updates using joins and conditional logic. This technique is essential for synchronizing data across related tables. <a href="https://mietwood.com/category/sql">Read the multi-table update example</a></p>



<h3 class="wp-block-heading" id="7filterearlywithwhereclauses"><strong>Filter Early with WHERE Clauses</strong></h3>



<p>Apply filters as early as possible to reduce the number of rows processed in joins and aggregations. <a href="https://mietwood.com/10-tips-how-to-make-sql-lighter">Optimize your filtering</a></p>



<h3 class="wp-block-heading" id="8useunionallinsteadofunion"><strong>Use UNION ALL Instead of UNION</strong></h3>



<p><code>UNION ALL</code> is faster than <code>UNION</code> because it skips duplicate elimination. Use it when duplicates aren’t a concern. <a href="https://mietwood.com/10-tips-how-to-make-sql-lighter">Performance tip explained</a></p>
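
<p>The difference in results is easy to see on a toy example. The sketch below uses two hypothetical tables, <code>online_orders</code> and <code>store_orders</code>, in an in-memory SQLite database; the names and data are assumptions chosen for illustration, and it shows the behaviour of the two operators, not their timing.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE online_orders (customer_id INTEGER);
    CREATE TABLE store_orders (customer_id INTEGER);
    INSERT INTO online_orders VALUES (1), (2), (3);
    INSERT INTO store_orders VALUES (2), (3), (4);
""")

# UNION removes duplicate rows across both inputs.
union_rows = conn.execute(
    "SELECT customer_id FROM online_orders "
    "UNION SELECT customer_id FROM store_orders"
).fetchall()

# UNION ALL simply concatenates the two result sets.
union_all_rows = conn.execute(
    "SELECT customer_id FROM online_orders "
    "UNION ALL SELECT customer_id FROM store_orders"
).fetchall()

print(len(union_rows))      # number of distinct customers
print(len(union_all_rows))  # every row from both tables
```

<p>Customers 2 and 3 appear in both tables, so <code>UNION</code> has to compare rows to drop the repeats, while <code>UNION ALL</code> returns them as they are.</p>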



<h3 class="wp-block-heading" id="9avoidfunctionsonindexedcolumns"><strong>Avoid Functions on Indexed Columns</strong></h3>



<p>Wrapping an indexed column in a function such as <code>LOWER()</code> or <code>DATEADD()</code> usually prevents the optimizer from using an index seek on that column. Rewrite the condition so that the column stands alone on one side of the comparison. <a href="https://mietwood.com/query-optimization-with-join-condition">Join optimization example</a></p>
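
<p>One way to check this is to compare query plans. The sketch below uses SQLite&#8217;s <code>EXPLAIN QUERY PLAN</code> as a stand-in for SQL Server&#8217;s execution plans; the <code>users</code> table and <code>idx_users_email</code> index are hypothetical, and the exact plan wording varies between SQLite versions.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE INDEX idx_users_email ON users (email);
""")

def plan(sql):
    """Return the plan description strings SQLite reports for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Function applied to the indexed column: the index cannot be used for a lookup.
wrapped_plan = plan("SELECT id FROM users WHERE LOWER(email) = 'test@example.com'")

# Bare column: the index can be searched directly.
direct_plan = plan("SELECT id FROM users WHERE email = 'test@example.com'")

print(wrapped_plan)
print(direct_plan)
```

<p>In SQLite, the first plan is reported as a scan and the second as a search on the index. Other engines word their plans differently, so treat this as an illustration of the principle.</p>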



<h3 class="wp-block-heading" id="10exploresqlforbusinessanalytics"><strong>Explore SQL for Business Analytics</strong></h3>



<p>Advanced SQL techniques like statistical analysis, predictive modeling, and time series forecasting are essential for business analysts. Learn how to combine SQL with Python for deeper insights. <a href="https://mietwood.com/advanced-programming-in-sql-and-python">Check out the full course</a></p>



<h2 class="wp-block-heading"><strong>5 additional SQL tips</strong></h2>



<h3 class="wp-block-heading"><strong>Use CTEs for Readability and Reuse</strong></h3>



<p>Common Table Expressions (CTEs) make complex queries easier to read and maintain. They allow you to define temporary result sets that can be referenced multiple times within a query.</p>



<pre class="wp-block-code"><code>-- PostgreSQL-style date arithmetic; adjust for your database
WITH recent_orders AS (
  SELECT customer_id, order_date
  FROM orders
  WHERE order_date > CURRENT_DATE - INTERVAL '30 days'
)
SELECT customer_id, COUNT(*) AS order_count
FROM recent_orders
GROUP BY customer_id;</code></pre>



<h3 class="wp-block-heading" id="12avoidfunctionsonindexedcolumns"><strong>Avoid Functions on Indexed Columns</strong></h3>



<p>Applying a function to an indexed column typically prevents the index from being used for a seek, which slows the query down. Instead, transform the value you compare against, not the column.</p>



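<pre class="wp-block-code"><code>-- Avoid: the function call on the column blocks an index seek
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';
-- Better: compare the bare column (assumes emails are stored lowercased
-- or the column uses a case-insensitive collation)
SELECT * FROM users WHERE email = 'test@example.com';</code></pre>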



<h3 class="wp-block-heading" id="13usecaseforconditionallogic"><strong>Use CASE for Conditional Logic</strong></h3>



<p><code>CASE</code> lets you embed conditional logic directly in your queries, useful for categorizing or transforming data.</p>



<pre class="wp-block-code"><code>SELECT name,
  CASE
    WHEN score >= 90 THEN 'Excellent'
    WHEN score >= 75 THEN 'Good'
    ELSE 'Needs Improvement'
  END AS performance
FROM students;</code></pre>



<h3 class="wp-block-heading" id="14optimizeaggregationswithgroupby"><strong>Optimize Aggregations with GROUP BY</strong></h3>



<p>When aggregating data, ensure you&#8217;re grouping only necessary columns to avoid performance hits and incorrect results.</p>



<pre class="wp-block-code"><code>SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;</code></pre>



<h3 class="wp-block-heading" id="15useparameterizedqueriestopreventsqlinjection"><strong>Use Parameterized Queries to Prevent SQL Injection</strong></h3>



<p>Always use parameterized queries in application code to protect against SQL injection.</p>



<pre class="wp-block-code"><code>-- Example in Python with psycopg2
cursor.execute("SELECT * FROM users WHERE username = %s", (username,))</code></pre>



<h2 class="wp-block-heading">Remember &#8211; tips summary</h2>



<h3 class="wp-block-heading"><strong>Use Joins Efficiently</strong></h3>



<p>Choose the right join type—<code>INNER JOIN</code> for matched rows, and avoid <code>CROSS JOIN</code> unless necessary. Efficient joins reduce unnecessary data processing and improve clarity.</p>



<h3 class="wp-block-heading"><strong>Filter Data Early</strong></h3>



<p>Apply filters as soon as possible in your query using <code>WHERE</code> clauses. This minimizes the number of rows processed in joins and aggregations, leading to faster execution.</p>



<h3 class="wp-block-heading"><strong>Use Indexes Wisely</strong></h3>



<p>Indexes speed up data retrieval, especially in <code>WHERE</code>, <code>JOIN</code>, and <code>ORDER BY</code> clauses. But don’t over-index—too many can slow down <code>INSERT</code>, <code>UPDATE</code>, and <code>DELETE</code> operations.</p>



<h3 class="wp-block-heading"><strong>Avoid Subqueries in WHERE Clauses</strong></h3>



<p>Correlated subqueries can be slow because they may be evaluated once per outer row. Replace them with joins when possible to improve performance and readability.</p>



<h3 class="wp-block-heading"><strong>Use UNION ALL Instead of UNION</strong></h3>



<p><code>UNION</code> removes duplicates, which is costly. If duplicates aren’t a concern, use <code>UNION ALL</code> for faster results.</p>



<h3 class="wp-block-heading"><strong>Limit Your Results</strong></h3>



<p>Use <code>LIMIT</code> (MySQL, PostgreSQL, SQLite) or <code>TOP</code> (SQL Server) to restrict the number of rows returned. This is especially useful for pagination or sampling large datasets.</p>



<h3 class="wp-block-heading"><strong>Be Cautious with LIKE and Functions</strong></h3>



<p>Avoid leading wildcards in <code>LIKE</code> and functions in <code>WHERE</code> clauses—they prevent index usage. Instead, use indexed columns and consistent casing.</p>



<h3 class="wp-block-heading"><strong>Use EXISTS Instead of IN</strong></h3>



<p><code>EXISTS</code> is often faster than <code>IN</code> with a subquery because it can stop scanning as soon as a match is found, although many modern optimizers produce the same plan for both. Use it for subqueries that check whether a row exists.</p>
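
<p>To make the two forms concrete, here is a small sketch with hypothetical <code>customers</code> and <code>orders</code> tables in an in-memory SQLite database. It shows only that both queries return the same customers; which one is faster depends on the engine, the data, and the indexes.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# EXISTS: the subquery only has to find one matching order per customer.
exists_sql = """
    SELECT c.name FROM customers AS c
    WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.name
"""

# IN: the subquery produces the list of customer ids that have orders.
in_sql = """
    SELECT c.name FROM customers AS c
    WHERE c.id IN (SELECT o.customer_id FROM orders AS o)
    ORDER BY c.name
"""

exists_rows = conn.execute(exists_sql).fetchall()
in_rows = conn.execute(in_sql).fetchall()
print(exists_rows)
print(in_rows)
```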



<h3 class="wp-block-heading"><strong>Use Appropriate Data Types</strong></h3>



<p>Choosing the right data type can save space and improve performance: for example, <code>TINYINT</code> instead of <code>INT</code> for small value ranges, or <code>CHAR</code> instead of <code>VARCHAR</code> for fixed-length values.</p>



<h2 class="wp-block-heading">SQL Server Management Studio</h2>



<p><strong>SSMS as a Comprehensive SQL Environment</strong><br>SQL Server Management Studio (<a href="https://learn.microsoft.com/en-us/ssms/" target="_blank" rel="noopener">https://learn.microsoft.com/en-us/ssms/</a>) is a powerful, integrated environment for managing SQL Server infrastructure. It provides tools for writing, executing, and debugging SQL queries, as well as managing databases, tables, views, and stored procedures. SSMS supports both on-premises and cloud-based SQL Server instances, making it versatile for hybrid environments. Its intuitive interface includes Object Explorer for navigating server components and Query Editor for crafting and testing SQL scripts. Whether you&#8217;re a database administrator or developer, SSMS offers a unified workspace that streamlines daily tasks and enhances productivity through built-in templates, syntax highlighting, and error diagnostics.</p>



<p><strong>Security, Performance, and Monitoring Tools</strong><br>SSMS includes robust features for security management, such as configuring roles, permissions, and auditing access. It also provides performance tuning tools like the Database Engine Tuning Advisor and graphical execution plans to help identify bottlenecks. With Activity Monitor, users can track real-time server performance, view active sessions, and analyze resource usage. These tools empower teams to maintain optimal database health and ensure compliance with organizational policies. SSMS also integrates with SQL Server Agent for scheduling jobs and alerts, making it a central hub for automation and proactive monitoring across enterprise environments.</p>



<p><strong>Integration, Extensibility, and Collaboration</strong><br>SSMS supports integration with source control systems like Git, enabling versioning and collaborative development. It allows exporting and importing data via wizards, scripting database objects, and generating reports for documentation. Users can extend SSMS functionality through add-ins and connect to Azure services for cloud-based analytics and storage. Its support for multiple query windows and tabbed editing enhances multitasking, while customizable keyboard shortcuts and themes improve user experience. SSMS continues to evolve with regular updates, ensuring compatibility with the latest SQL Server features and providing a stable platform for modern data operations.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://mietwood.com/sql-tips/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Student Profiling for your lecture</title>
		<link>https://mietwood.com/profiling</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Tue, 30 Sep 2025 08:55:38 +0000</pubDate>
				<category><![CDATA[Customer Experience Management]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3334</guid>

					<description><![CDATA[<p>People profiling problem can be approached as an analysis of co-occurrence, how often lectures are chosen together and correlation, the strength and direction of the relationship between choosing lecture X and choosing other lectures. Frequency Analysis for Profiling This is the most direct approach to identify the most and least selected lectures by your L6...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/profiling">Student Profiling for your lecture</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The student profiling problem can be approached as an analysis of <strong>co-occurrence</strong> (how often lectures are chosen together) and <strong>correlation</strong> (the strength and direction of the relationship between choosing lecture X and choosing other lectures).</p>



<h2 class="wp-block-heading">Frequency Analysis for Profiling</h2>



<p>This is the most direct way to identify which lectures your L6 students select most and least often, where L6 stands for your own lecture.</p>



<ol start="1" class="wp-block-list">
<li><strong>Filter Data:</strong> Isolate the rows (students) who selected <strong>L6</strong>.</li>



<li><strong>Calculate Frequencies:</strong> For this subset of L6 students, count how many of them selected each of the <em>other</em> available lectures (L1, L2, L3, etc.).
<ul class="wp-block-list">
<li><strong>Most Selected:</strong> The lectures with the highest counts.</li>



<li><strong>Least Selected:</strong> The lectures with the lowest counts (or those not selected at all).</li>
</ul>
</li>
</ol>
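
<p>The two steps above can be sketched in a few lines of plain Python. The <code>selections</code> list below is made-up sample data, not the dataset analysed later in this post, and the sketch uses only the standard library instead of pandas.</p>

```python
from collections import Counter

# Hypothetical selections: one set of lecture codes per student.
selections = [
    {"L3", "L4", "L6"},
    {"L4", "L5", "L6"},
    {"L3", "L4", "L6", "L12"},
    {"L3", "L5"},
    {"L4", "L7"},
]

target = "L6"

# Step 1: keep only the students who selected the target lecture.
cohort = [s for s in selections if target in s]

# Step 2: count how often each other lecture appears in that cohort.
counts = Counter(lec for s in cohort for lec in s if lec != target)
ranked = counts.most_common()  # highest count first

for lecture, n in ranked:
    print(f"{lecture}: {n}/{len(cohort)} ({n / len(cohort):.0%})")
```

<p>Lectures that nobody in the cohort selected do not appear in <code>counts</code> at all, so compare the result against the full lecture list to find them.</p>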



<h2 class="wp-block-heading">Association Rule Mining &#8211; Co-occurrence</h2>



<p>This sophisticated approach, often used in market basket analysis, can determine which lectures are most <strong>frequently chosen together</strong> with L6.</p>



<ul class="wp-block-list">
<li><strong>Support:</strong> The proportion of all students who selected both L6 and a specific lecture (e.g., L3).</li>



<li><strong>Confidence:</strong> The likelihood that a student selected L6 <em>given</em> that they selected another lecture (e.g., L3 → L6), or vice versa.</li>



<li><strong>Lift:</strong> A measure of how much more likely a student is to select L3 if they also selected L6, compared to the overall likelihood of selecting L3. A Lift >1 suggests a <strong>positive association</strong> (students who take one tend to take the other).</li>
</ul>



<h2 class="wp-block-heading">Correlation Analysis &#8211; Strength and Direction of the Relationship in Profiling</h2>



<p>This method quantifies the relationship between selecting L6 and selecting any other lecture (Lx). Since the selection data is <strong>binary</strong> (0 for not selected, 1 for selected), you would use a correlation measure suitable for binary variables.</p>



<ul class="wp-block-list">
<li><strong>Phi Coefficient (ϕ):</strong> This is a measure of association for two binary variables. It ranges from −1 to +1.
<ul class="wp-block-list">
<li><strong>Strong Positive Correlation (ϕ≈+1):</strong> Students who select <strong>L6</strong> are highly likely to also select <strong>Lx</strong>. This suggests the lectures are perhaps complementary or targeted at the same student group.</li>



<li><strong>Strong Negative Correlation (ϕ≈−1):</strong> Students who select <strong>L6</strong> are highly likely to <strong>not</strong> select <strong>Lx</strong>. This suggests the lectures might be alternatives, require conflicting time slots, or appeal to entirely different student interests.</li>



<li><strong>Weak/No Correlation (ϕ≈0):</strong> Selection of L6 has little to no impact on the selection of Lx.</li>
</ul>
</li>
</ul>
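
<p>For reference, the phi coefficient can be computed directly from the 2×2 table of counts. The function below is a minimal sketch using the standard formula, (n11·n00 − n10·n01) divided by the square root of the product of the four marginal totals. The two example vectors are made up, and returning 0.0 when a lecture is selected by everyone or by no one is a convention chosen here, since phi is undefined in that case.</p>

```python
from math import sqrt

def phi_coefficient(x, y):
    """Phi coefficient for two equal-length sequences of 0/1 values."""
    pairs = list(zip(x, y))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)  # both selected
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)  # only x selected
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)  # only y selected
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)  # neither selected
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Phi is undefined when either variable is constant; 0.0 is used by convention.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical 0/1 selection vectors for six students.
l6 = [1, 1, 1, 0, 0, 0]
lx = [1, 1, 0, 0, 0, 1]

phi = phi_coefficient(l6, lx)
print(round(phi, 3))
```

<p>For binary data this gives the same value as the Pearson correlation between the two 0/1 columns.</p>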



<h2 class="wp-block-heading">Dimensionality Reduction &#8211; Clustering</h2>



<p>For a very large number of lectures, you could use methods like <strong>Principal Component Analysis (PCA)</strong> or <strong>clustering algorithms</strong> to group similar students or lectures together. This can identify underlying student profiles (e.g., &#8220;The Data Science Crowd&#8221; or &#8220;The Humanities Enthusiasts&#8221;) that include L6 as part of their typical selection.</p>



<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="956" height="647" src="https://mietwood.com/wp-content/uploads/2025/09/image-8.jpg" alt="data sample for profiling" class="wp-image-3336" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-8.jpg 956w, https://mietwood.com/wp-content/uploads/2025/09/image-8-300x203.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-8-768x520.jpg 768w" sizes="(max-width: 956px) 100vw, 956px" /><figcaption class="wp-element-caption">Dataset for profiling</figcaption></figure>



<p>In this table, each row is a student and each column is a lecture; a cell marks whether the student selected that lecture. My lecture is L6. I would like to know the profile of my students: which lectures they selected most often, which least often, and how strong the relationships are, both positive and negative (the latter when L6 students tend to skip a lecture).</p>



<h2 class="wp-block-heading">Profile of L6 Students: Most and Least Selected Lectures</h2>



<p>This is a <strong>Frequency Analysis</strong> of the lectures selected by the 15 students who chose L6. The percentages are based on that total number of L6 students (15).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="271" height="357" src="https://mietwood.com/wp-content/uploads/2025/09/image-9.jpg" alt="frequency analysis for profiling" class="wp-image-3337" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-9.jpg 271w, https://mietwood.com/wp-content/uploads/2025/09/image-9-228x300.jpg 228w" sizes="(max-width: 271px) 100vw, 271px" /><figcaption class="wp-element-caption">Frequency analysis</figcaption></figure>
</div>


<p><strong>Most Selected Lectures &#8211; The &#8220;Typical Package&#8221;:</strong> Your L6 students most frequently select <strong>L4</strong> (67%), <strong>L3</strong> (53%), and <strong>L5</strong> (53%). These three form the core lecture package with L6. <strong>Least Selected Lecture:</strong> <strong>L12</strong> is the least popular choice, selected by only 3 out of 15 students (20%).</p>



<h2 class="wp-block-heading">Strength of Relation: Correlation Analysis</h2>



<p>The <strong>Phi Coefficient</strong> quantifies the strength and direction of the relationship between choosing L6 and choosing any other lecture, using all 25 students in the dataset.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="716" height="513" src="https://mietwood.com/wp-content/uploads/2025/09/image-10.jpg" alt="Strength of Relation in profiling: Correlation Analysis" class="wp-image-3338" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-10.jpg 716w, https://mietwood.com/wp-content/uploads/2025/09/image-10-300x215.jpg 300w" sizes="(max-width: 716px) 100vw, 716px" /><figcaption class="wp-element-caption">Strength of Relation: Correlation Analysis</figcaption></figure>
</div>


<p><strong>Strongest Positive Relations (Complementary):</strong> <strong>L4</strong> (ϕ=0.263) and <strong>L5</strong> (ϕ=0.230) show the strongest positive correlations with L6, although both are modest in absolute terms. This suggests that students interested in L6 are often those who also select L4 and L5.</p>



<p><strong>Strongest Negative Relation (Alternative/Avoided):</strong> <strong>L12</strong> (ϕ=−0.218) shows the only notable negative correlation. This is consistent with the frequency finding and suggests that L12 may be an alternative path, or may have a time slot or prerequisite that conflicts with L6.</p>



<p><strong>Weak/No Relation:</strong> Lectures like L3 and L7 have a high selection frequency but a very weak (L3) or zero (L7) correlation. This indicates that while many L6 students <em>do</em> take these, they are likely popular lectures chosen by many students across the board, and the choice of L6 is not a significant predictor of their selection.</p>



<h2 class="wp-block-heading"><strong>Association Rule Mining</strong></h2>



<p>By analyzing the entire student population, we can discover <strong>general student curriculum patterns</strong> that exist beyond your specific L6 cohort. I used <strong>Association Rule Mining</strong> metrics (<strong>Support</strong>, <strong>Confidence</strong>, and <strong>Lift</strong>) to find lecture pairs that are frequently selected together.</p>



<ul class="wp-block-list">
<li><strong>Support:</strong> The percentage of all 25 students who selected both lectures.</li>



<li><strong>Lift:</strong> A measure of how much the selection of one lecture <em>increases</em> the chance of selecting the other. Here, a Lift&gt;1.2 is treated as a strong, meaningful positive association.</li>
</ul>



<p>Here are the top co-selected lecture groups (pairs) among the entire student population, filtered for those selected by at least 16% of students and showing a strong positive association (Lift&gt;1.2):</p>



<p><strong>Python</strong></p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd
from io import StringIO
import itertools

# Read the data
df = pd.read_csv(StringIO(csv_data), sep=';')
N_students = len(df)

lecture_cols = &#91;col for col in df.columns if col.startswith('L')&#93;
lecture_df = df&#91;lecture_cols&#93;.fillna(0).astype(int)

# --- Association Rule Mining (Pairs) ---
# 1. Calculate Support for individual lectures
support_single = lecture_df.sum() / N_students

# 2. Calculate Support and Lift for all pairs
association_rules = []
for l1, l2 in itertools.combinations(lecture_cols, 2):
    # Calculate Support for the pair: count students who selected both
    co_selection_count = (lecture_df&#91;l1&#93; * lecture_df&#91;l2&#93;).sum()
    support_pair = co_selection_count / N_students

    # Calculate Lift: (Support(L1 and L2)) / (Support(L1) * Support(L2))
    # Handle division by zero if single support is 0, though unlikely here
    if support_single&#91;l1&#93; > 0 and support_single&#91;l2&#93; > 0:
        lift = support_pair / (support_single&#91;l1&#93; * support_single&#91;l2&#93;)
    else:
        lift = 0

    # Calculate Confidence (L1 -> L2)
    confidence_l1_to_l2 = support_pair / support_single&#91;l1&#93; if support_single&#91;l1&#93; > 0 else 0

    association_rules.append({
        'Antecedent': l1,
        'Consequent': l2,
        'Support': support_pair,
        'Confidence (L1 -> L2)': confidence_l1_to_l2,
        'Lift': lift
    })

# Convert to DataFrame
rules_df = pd.DataFrame(association_rules)

# Filter for meaningful associations:
# 1. Minimum Support: Selected by at least 4 students (4/25 = 0.16)
# 2. Lift > 1.2: A strong positive relationship
min_support = 4 / N_students  # 0.16

# Filter and sort the results by Lift
top_associations = rules_df[
    (rules_df&#91;'Support'&#93; >= min_support) &amp;
    (rules_df&#91;'Lift'&#93; > 1.2)
].sort_values(by='Lift', ascending=False).reset_index(drop=True)

# Add the reverse rules (L2 -> L1) to the table where Lift is high.
# Since Lift is symmetrical, only one direction needs to be calculated, but Confidence is not.

# Helper function to get Confidence (L2 -> L1) for presentation
def get_confidence_l2_to_l1(row):
    l1 = row&#91;'Antecedent'&#93;
    l2 = row&#91;'Consequent'&#93;
    support_pair = row&#91;'Support'&#93;
    return support_pair / support_single&#91;l2&#93; if support_single&#91;l2&#93; > 0 else 0

top_associations&#91;'Confidence (L2 -> L1)'&#93; = top_associations.apply(get_confidence_l2_to_l1, axis=1)

# Reorder columns for presentation
top_associations = top_associations[&#91;'Antecedent', 'Consequent', 'Support', 'Confidence (L1 -> L2)', 'Confidence (L2 -> L1)', 'Lift'&#93;]

print("Top Co-Selected Lecture Groups (Pairs):")
print(top_associations.to_markdown(index=False, floatfmt=".3f"))</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">io</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">StringIO</span></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">itertools</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Read</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span></span>
<span class="line"><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">read_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">StringIO</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">csv_data</span><span style="color: #D8DEE9FF">)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sep</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">N_students</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">len</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">lecture_cols</span><span style="color: #D8DEE9FF"> = &#91;</span><span style="color: #8FBCBB">col</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">col</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">columns</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">col</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">startswith</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">L</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)&#93;</span></span>
<span class="line"><span style="color: #8FBCBB">lecture_df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">lecture_cols</span><span style="color: #D8DEE9FF">&#93;.</span><span style="color: #8FBCBB">fillna</span><span style="color: #D8DEE9FF">(0).</span><span style="color: #8FBCBB">astype</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">int</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># --- </span><span style="color: #8FBCBB">Association</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Rule</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Mining</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">Pairs</span><span style="color: #D8DEE9FF">) ---</span></span>
<span class="line"><span style="color: #D8DEE9FF"># 1. </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">individual</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">lectures</span></span>
<span class="line"><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">lecture_df</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">sum</span><span style="color: #D8DEE9FF">() / </span><span style="color: #8FBCBB">N_students</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># 2. </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">all</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pairs</span></span>
<span class="line"><span style="color: #8FBCBB">association_rules</span><span style="color: #D8DEE9FF"> = []</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">l1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">itertools</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">combinations</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">lecture_cols</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> 2):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pair</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">count</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">students</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">who</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">selected</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">both</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">co_selection_count</span><span style="color: #D8DEE9FF"> = (</span><span style="color: #8FBCBB">lecture_df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">lecture_df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93;).</span><span style="color: #8FBCBB">sum</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">co_selection_count</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">N_students</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF">: (</span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF">)) / (</span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF">) </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF">))</span></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">Handle</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">division</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">zero</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">single</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> 0</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">though</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">unlikely</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">here</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; &gt; 0 </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93; &gt; 0:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">lift</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> / (</span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #81A1C1">*</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93;)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #8FBCBB">lift</span><span style="color: #D8DEE9FF"> = 0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    # </span><span style="color: #8FBCBB">Calculate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Confidence</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF"> -&gt; </span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">confidence_l1_to_l2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF">&#93; &gt; 0 </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> 0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">association_rules</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Antecedent</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">l1</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Consequent</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">l2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">support_pair</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Confidence</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF"> -&gt; </span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF">)&#39;: </span><span style="color: #8FBCBB">confidence_l1_to_l2</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">lift</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Convert</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DataFrame</span></span>
<span class="line"><span style="color: #8FBCBB">rules_df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">association_rules</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Filter</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">meaningful</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">associations</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF"># 1. </span><span style="color: #8FBCBB">Minimum</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Support</span><span style="color: #D8DEE9FF">: </span><span style="color: #8FBCBB">Selected</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">at</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">least</span><span style="color: #D8DEE9FF"> 4 </span><span style="color: #8FBCBB">students</span><span style="color: #D8DEE9FF"> (4/25 = 0.16)</span></span>
<span class="line"><span style="color: #D8DEE9FF"># 2. </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF"> &gt; 1.2: </span><span style="color: #8FBCBB">A</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">strong</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">positive</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">relationship</span></span>
<span class="line"><span style="color: #8FBCBB">min_support</span><span style="color: #D8DEE9FF"> = 4 / </span><span style="color: #8FBCBB">N_students</span><span style="color: #D8DEE9FF">  # 0.16</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Filter</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sort</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">results</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span></span>
<span class="line"><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">rules_df</span><span style="color: #D8DEE9FF">[</span></span>
<span class="line"><span style="color: #D8DEE9FF">    (</span><span style="color: #8FBCBB">rules_df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Support</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; &gt;= </span><span style="color: #8FBCBB">min_support</span><span style="color: #D8DEE9FF">) &amp;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    (</span><span style="color: #8FBCBB">rules_df</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Lift</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; &gt; 1.2)</span></span>
<span class="line"><span style="color: #D8DEE9FF">].</span><span style="color: #8FBCBB">sort_values</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Lift</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ascending</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">False</span><span style="color: #D8DEE9FF">).</span><span style="color: #8FBCBB">reset_index</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">drop</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Add</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">reverse</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">rules</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF"> -&gt; </span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">table</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">where</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">high</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Since</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Lift</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">symmetrical</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">only</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">one</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">direction</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">needs</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">be</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">calculated</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">but</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Confidence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">not</span><span style="color: #D8DEE9FF">.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Helper</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">get</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Confidence</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">L2</span><span style="color: #D8DEE9FF"> -&gt; </span><span style="color: #8FBCBB">L1</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">presentation</span></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">get_confidence_l2_to_l1</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">l1</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Antecedent</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Consequent</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Support</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_pair</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">support_single</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #8FBCBB">l2</span><span style="color: #D8DEE9FF">&#93; &gt; 0 </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> 0</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Confidence (L2 -&gt; L1)</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; = </span><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">apply</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">get_confidence_l2_to_l1</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">axis</span><span style="color: #D8DEE9FF">=1)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Reorder</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">columns</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">presentation</span></span>
<span class="line"><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF">[&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Antecedent</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Consequent</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Support</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Confidence (L1 -&gt; L2)</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Confidence (L2 -&gt; L1)</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">Lift</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Top Co-Selected Lecture Groups (Pairs):</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">top_associations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">to_markdown</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">False</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">floatfmt</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">.3f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">))</span></span></code></pre></div>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="655" height="384" src="https://mietwood.com/wp-content/uploads/2025/09/image-11.jpg" alt="" class="wp-image-3339" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-11.jpg 655w, https://mietwood.com/wp-content/uploads/2025/09/image-11-300x176.jpg 300w" sizes="auto, (max-width: 655px) 100vw, 655px" /></figure>
</div>


<h2 class="wp-block-heading">Interpretation of Lecture Groups</h2>



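<p>The <strong>Lift</strong> values indicate the strength of each relationship: a Lift above 1 means two lectures are selected together more often than would be expected if the choices were independent.</p>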



<h3 class="wp-block-heading">The Strongest Cohort (The 2.2+ Lift Groups)</h3>



<p>These are the three strongest, non-obvious combinations. A Lift of about 2.2 means that students who take one lecture are more than <strong>twice as likely</strong> to take the associated lecture as the general student population.</p>
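<p>As a rough illustration of the arithmetic behind a Lift of about 2.2, the sketch below computes Lift from raw counts. The counts (25 students, 5 and 9 selections, 4 co-selections) are hypothetical values chosen to land near 2.2 and are not taken from the dataset; <code>lift</code> is an illustrative helper, not part of the script above.</p>

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift = support(A and B) / (support(A) * support(B))."""
    support_pair = n_both / n_total
    support_a = n_a / n_total
    support_b = n_b / n_total
    if support_a == 0 or support_b == 0:
        return 0.0  # same fallback as the script: avoid dividing by zero
    return support_pair / (support_a * support_b)


# Hypothetical counts: 25 students, A chosen by 5, B chosen by 9, both by 4.
example_lift = lift(n_both=4, n_a=5, n_b=9, n_total=25)

# Hypothetical independent case: 10 and 10 selections, 4 co-selections.
independent_lift = lift(n_both=4, n_a=10, n_b=10, n_total=25)

print(f"example lift: {example_lift:.3f}")
print(f"independent lift: {independent_lift:.3f}")
```

<p>In the second call the co-selection count equals what independence would predict (25 × 0.4 × 0.4 = 4), so the Lift comes out at roughly 1. That is the baseline against which the values in this section should be read.</p>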



<ul class="wp-block-list">
<li><strong>L9 ↔ L12 ↔ L8:</strong> These three lectures form a tight cluster. Specifically, <strong>L8 → L9</strong> and <strong>L9 → L12</strong> have extremely high association scores. This suggests a dedicated academic track or a highly correlated set of topics.</li>
</ul>



<h3 class="wp-block-heading">High Confidence Groups (The 1.5+ Lift Groups)</h3>



<p>These are groups where the selection of one lecture is a very strong predictor for the other:</p>



<ul class="wp-block-list">
<li><strong>L2 → L4 (87.5% Confidence):</strong> If a student selects L2, there is an <strong>87.5% chance</strong> that they also select L4. This suggests L2 might be a prerequisite, a foundational course, or a direct complement to L4.</li>



<li><strong>L2 → L3 (75.0% Confidence):</strong> Similarly, L2 and L3 are frequently chosen together, indicating a strong connection.</li>



<li><strong>L1 ↔ L2:</strong> These two lectures are highly associated, suggesting they are often taken in tandem.</li>
</ul>
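<p>Unlike Lift, Confidence depends on the direction of the rule. The minimal sketch below shows this for the L2 and L4 pair. It assumes 8 students chose L2 and 7 of them also chose L4 (counts that reproduce the 87.5% above), and the L4 total of 16 is purely hypothetical; <code>confidence</code> is an illustrative helper.</p>

```python
def confidence(n_both, n_antecedent):
    """Confidence(A -> B) = count(A and B) / count(A)."""
    return n_both / n_antecedent if n_antecedent else 0.0


# Assumed counts: 8 students chose L2, 7 of them also chose L4 (7/8 = 87.5%).
# The L4 total of 16 is hypothetical and only illustrates the asymmetry.
conf_l2_to_l4 = confidence(n_both=7, n_antecedent=8)
conf_l4_to_l2 = confidence(n_both=7, n_antecedent=16)

print(f"L2 -> L4: {conf_l2_to_l4:.3f}")
print(f"L4 -> L2: {conf_l4_to_l2:.3f}")
```

<p>The same pair of lectures produces two different Confidence values because each direction divides by a different lecture's headcount. A popular consequent can make a rule look strong in one direction only.</p>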



<pre class="wp-block-verse">Does this mean I should go to the lecturers of L5 and L2 and ask them to promote my L6, because their students are the most likely to have a satisfying experience in my lecture?</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Data-Driven Strategy for L6 Promotion</h2>



<h3 class="wp-block-heading">1. <strong>Prioritize L5 (High Co-Selection &amp; High Confidence)</strong></h3>



<p>You should absolutely focus on the L5 lecturer.</p>



<ul class="wp-block-list">
<li><strong>L5 → L6 Confidence: 72.7%</strong>
<ul class="wp-block-list">
<li>This means nearly 3 out of every 4 students who take L5 also choose your L6 lecture.</li>



<li><strong>Action:</strong> The L5 lecturer is teaching the same target audience as you. You could ask them to mention L6 as a <strong>natural follow-up</strong> or <strong>complementary course</strong> to their students.</li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading">2. <strong>Prioritize L2 (High Predictive Power)</strong></h3>



<p>The relationship with L2 is even more predictive of a student landing in your lecture.</p>



<ul class="wp-block-list">
<li><strong>L2 → L6 Confidence: 75.0%</strong>
<ul class="wp-block-list">
<li>This means 3 out of every 4 students who take L2 end up in L6.</li>



<li><strong>Action:</strong> The L2 lecturer is essentially teaching a foundational course for a majority of your class. Ask them to promote L6 as the <strong>direct next step</strong> or <strong>most relevant application course</strong> for their content.</li>
</ul>
</li>
</ul>
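<p>The two priorities above can be compared side by side with a short sketch. It uses 25 students in total (the figure in the script's 4/25 support threshold) and the 15 students in L6. The per-lecture counts (11 L5 students with 8 in L6, 8 L2 students with 6 in L6) are assumptions chosen because they reproduce 72.7% and 75.0%; <code>promotion_scores</code> is an illustrative helper.</p>

```python
N_STUDENTS = 25  # total students, per the script's 4/25 support threshold
N_L6 = 15        # size of the L6 lecture

# lecture: (students who chose it, of whom also chose L6); assumed counts
candidates = {"L5": (11, 8), "L2": (8, 6)}


def promotion_scores(candidates, n_target, n_total):
    """Return (lecture, confidence, lift) tuples, highest confidence first."""
    support_target = n_target / n_total
    rows = []
    for name, (n_lecture, n_both) in candidates.items():
        conf = n_both / n_lecture
        # Lift equals confidence divided by the support of the consequent.
        rows.append((name, conf, conf / support_target))
    return sorted(rows, key=lambda row: row[1], reverse=True)


scores = promotion_scores(candidates, N_L6, N_STUDENTS)
for name, conf, lift_value in scores:
    print(f"{name} -> L6: confidence {conf:.3f}, lift {lift_value:.2f}")
```

<p>Under these assumed counts both lecturers sit above the 60% base rate of L6, which is why either conversation is worth having. The gap between them is small, so the ordering should not be over-interpreted with a class of this size.</p>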



<h3 class="wp-block-heading">3. <strong>The Importance of Satisfaction (What the Data Doesn&#8217;t Say)</strong></h3>



<p>The data confirms a strong <strong>selection link</strong>, but not a <strong>satisfaction link</strong>.</p>



<ul class="wp-block-list">
<li><strong>Selection:</strong> L5 and L2 students <em>are</em> your target market.</li>



<li><strong>Satisfaction:</strong> To confirm they find your lecture satisfactory, you&#8217;d need student feedback/evaluation data. A student who disliked L6 might still be highly likely to take it if it&#8217;s a required course for a specific program, for example.</li>
</ul>



<p>The strongest rationale for promotion is simply the high overlap: you are addressing a student cohort that already has a demonstrated interest pattern (L5/L2 → L6).</p>



<h3 class="wp-block-heading">Secondary Focus: L4 and L3</h3>



<p>While L4 and L3 are taken by a large share of L6 students (L4: 66.7%, L3: 53.3%), their Lift and correlation scores are weaker. These are generally popular lectures: many of your students take them, but the choice of L4 or L3 <strong>does not strongly predict</strong> the choice of L6.</p>
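<p>The difference between "many L6 students take L4" and "L4 predicts L6" is a matter of which headcount goes in the denominator. The sketch below makes that explicit. The 10 shared students follow from 66.7% of the 15 L6 students; the L4 total of 16 is a hypothetical value used only for illustration, and <code>share_and_confidence</code> is an illustrative helper.</p>

```python
def share_and_confidence(n_both, n_other, n_target):
    """Return (share of target students who took the other lecture,
    confidence of the rule other -> target)."""
    share_of_target = n_both / n_target  # P(other | target)
    conf_to_target = n_both / n_other    # P(target | other)
    return share_of_target, conf_to_target


N_STUDENTS = 25  # total students, per the script's 4/25 support threshold
N_L6 = 15        # size of the L6 lecture

# 10 of the 15 L6 students also chose L4; the L4 total of 16 is hypothetical.
share, conf = share_and_confidence(n_both=10, n_other=16, n_target=N_L6)
lift_l4_l6 = conf / (N_L6 / N_STUDENTS)

print(f"share of L6 students who took L4: {share:.3f}")
print(f"confidence L4 -> L6: {conf:.3f}")
print(f"lift: {lift_l4_l6:.2f}")
```

<p>With a popular lecture in the antecedent, the overlap looks large from the L6 side while the Lift stays close to 1, meaning L4 students pick L6 at roughly the base rate.</p>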



<p>Focusing your limited promotional time on the lecturers for <strong>L5 and L2</strong> will yield the highest return.</p>



<h2 class="wp-block-heading"><strong>Country</strong> and <strong>University</strong> as Moderating Factors</h2>



<p>The factors of <strong>Country</strong> and <strong>University</strong> do show clear patterns in lecture selection among your L6 students, suggesting that existing relationships or shared academic paths likely influence their choices.</p>



<p>Here is the analysis of the moderating factors, based on the 15 students in your L6 lecture:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="724" height="195" src="https://mietwood.com/wp-content/uploads/2025/09/image-12.jpg" alt="" class="wp-image-3340" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-12.jpg 724w, https://mietwood.com/wp-content/uploads/2025/09/image-12-300x81.jpg 300w" sizes="auto, (max-width: 724px) 100vw, 724px" /></figure>
</div>


<p>Analyzing the two largest university groups shows even sharper differences, which is expected as they are likely organized groups of students who know each other.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="716" height="256" src="https://mietwood.com/wp-content/uploads/2025/09/image-13.jpg" alt="" class="wp-image-3341" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-13.jpg 716w, https://mietwood.com/wp-content/uploads/2025/09/image-13-300x107.jpg 300w" sizes="auto, (max-width: 716px) 100vw, 716px" /></figure>
</div>


<h2 class="wp-block-heading">Theoretical Background for Moderating Factors</h2>



<h3 class="wp-block-heading">The Influence of Country: Cultural and Institutional Homophily</h3>



<p>The tendency for students from the same country (e.g., Spain or Morocco) to share similar lecture profiles can be explained by <strong>Homophily</strong> and <strong>Institutional Alignment</strong>.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Theory/Concept</td><td>Explanation</td><td>Application to Your Data</td></tr></thead><tbody><tr><td><strong>Cultural Homophily</strong></td><td>The principle that <em>&#8220;birds of a feather flock together.&#8221;</em> Individuals prefer to associate and bond with others who are similar to themselves (e.g., same nationality, language, cultural background).</td><td>Students from the same country are likely to <strong>communicate about their choices</strong> primarily in their shared native language (e.g., Spanish for Spain, Arabic/French for Morocco). This exchange promotes the selection of a common set of lectures (e.g., Spanish students favoring <strong>L4</strong> and <strong>L1</strong>).</td></tr><tr><td><strong>Institutional Alignment / Mobility Programs</strong></td><td>The structured academic agreements between home and host institutions dictate which courses are approved for credit.</td><td>Exchange programs often pre-approve specific &#8220;study packages.&#8221; If the University of Malaga exchange agreement primarily covers a field requiring <strong>L4</strong> and <strong>L7</strong>, those students will select that bundle. Your finding that Malaga students disproportionately select <strong>L7</strong> strongly supports this institutional influence.</td></tr><tr><td><strong>Country-Level Curriculum/Prerequisites</strong></td><td>Students from the same country may have completed similar foundational courses at home, making a certain set of lectures (like L6) compatible.</td><td>If Spanish universities standardize a curriculum where L4 is a logical next step to a prerequisite, those Spanish students will follow that path, leading to the high <strong>L4</strong> selection.</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">The Influence of University/Peer Group: Social Network Effects</h3>



<p>The even stronger, more granular influence of the specific university groups (like the unique L7 selection by U. Malaga students) is supported by <strong>Social Influence Theory</strong> and <strong>Bounded Rationality</strong>.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Theory/Concept</td><td>Explanation</td><td>Application to Your Data</td></tr></thead><tbody><tr><td><strong>Social Proof / Herd Behavior</strong></td><td>A type of conformity where individuals assume the actions of a large group in an attempt to reflect correct behavior. When students face <strong>uncertainty</strong> in a new academic environment, they look to their trusted peers for guidance.</td><td>Students arriving from the same university (e.g., U. Cadi Ayyad) form a <strong>tight local network</strong>. When one or two students select a lecture (like <strong>L5</strong>), the rest of the group follows to reduce the perceived risk and workload associated with taking classes alone.</td></tr><tr><td><strong>Network Constraint / Information Asymmetry</strong></td><td>The limited size of a local network dictates the information flow. Students only trust information from their close peers, ignoring broader student trends.</td><td>The U. Malaga students may not interact with the U. Cadi Ayyad students. Therefore, they only hear about the &#8220;safe&#8221; lecture bundle (e.g., L4, <strong>L7</strong>, L1) from their U. Malaga friends, leading to the highly specific, unique selection of <strong>L7</strong>.</td></tr><tr><td><strong>Convenience/Logistics</strong></td><td>At a practical level, coordinating lecture selection allows peers to have the same class schedule, making group projects, social planning, and travel easier.</td><td>The high co-selection of <strong>L4, L7, L1, and L2</strong> among U. Malaga students suggests they are coordinating their choices to minimize logistical friction, effectively treating the selection as a <strong>group decision</strong>.</td></tr></tbody></table></figure>



<pre class="wp-block-verse">By applying these concepts, you can elevate your analysis from mere correlation to a discussion of <strong>student decision-making under conditions of uncertainty and social constraint.</strong></pre>



<h2 class="wp-block-heading">The informational role of lecture titles</h2>



<p>Assuming that students read only the lecture titles, can you analyze the whole population and discover which lectures students select as a group (L1 &amp; L2 … etc.) based on the informational role of the title?</p>



<p>This is an excellent way to use the data to understand the <strong>thematic logic</strong> that guides students&#8217; decisions, rather than just the frequency. By analyzing the corpus of words in the high-Lift lecture titles, we can discover the &#8220;driving wording&#8221; that creates strong, predictable co-selection patterns.</p>



<p>The primary driving themes are <strong>Economics</strong>, <strong>Management</strong>, and <strong>Business</strong>, but the strongest associations occur when these words are combined with <strong>qualifying terms</strong> like <em>International</em>, <em>Applied</em>, or <em>Process</em>.</p>
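<p>The idea of a shared &#8220;driving word&#8221; can be illustrated with a short Python sketch that lists the words two titles have in common. The titles are the ones quoted in this post; the stop-word list and the function names are assumptions made for the illustration and are not part of the original analysis.</p>

```python
import re
from itertools import combinations

# Lecture titles as quoted in this post (L1 is left out because its title is abbreviated).
titles = {
    "L5": "Business process management",
    "L6": "Customer Experience Management",
    "L8": "International Competitiveness",
    "L9": "International Economics",
    "L11": "People management in the digital economy",
    "L12": "Political economy",
}

STOP_WORDS = {"in", "the", "of", "for", "and"}  # assumed minimal list


def keywords(title):
    """Lower-case words of a title without stop words."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOP_WORDS}


def shared_keywords(titles):
    """Map each lecture pair to the title words both lectures share."""
    shared = {}
    for (a, title_a), (b, title_b) in combinations(titles.items(), 2):
        common = keywords(title_a).intersection(keywords(title_b))
        if common:
            shared[(a, b)] = common
    return shared


for pair, words in shared_keywords(titles).items():
    print(pair, sorted(words))
```

<p>Exact word matching finds &#8220;management&#8221; for L5 and L6 and &#8220;international&#8221; for L8 and L9, but it misses the link between &#8220;Economics&#8221; (L9) and &#8220;economy&#8221; (L12). A fuller analysis would need stemming or a manual grouping of related words.</p>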



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Analysis of Driving Wording in Lecture Titles</h2>



<p>Based on the highest Lift scores (strongest association patterns), the lecture groups cluster into three distinct thematic tracks driven by specific keywords:</p>



<h3 class="wp-block-heading">1. Driving Theme: International &amp; Political Economy 🌍</h3>



<p>This is the strongest thematic driver in the entire dataset, creating three of the top four co-selection groups.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Lecture Group</td><td>Titles &amp; Key Wording</td><td>Driving Wording Logic</td></tr></thead><tbody><tr><td><strong>L9 &amp; L12</strong> (Lift&nbsp;2.232)</td><td>L9: <strong>International Economics</strong> / L12: <strong>Political economy</strong></td><td>Students seek a deep understanding of how <strong>global power (Political)</strong> and <strong>global markets (International)</strong> interact. The co-selection is driven by the desire to merge theoretical macroeconomics with political strategy.</td></tr><tr><td><strong>L8 &amp; L9</strong> (Lift&nbsp;2.232)</td><td>L8: <strong>International Competitiveness</strong> / L9: <strong>International Economics</strong></td><td>The term <strong>&#8220;International&#8221;</strong> is the central driver. Students are selecting a specialized track in global trade, where L9 provides the foundational theory and L8 provides the policy application (Competitiveness).</td></tr></tbody></table></figure>






<p><strong>Driving Wording:</strong> <strong>International</strong>, <strong>Economy</strong>, <strong>Political</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">2. Driving Theme: Applied Economic Analysis</h3>



<p>This theme links foundational economic knowledge with quantitative skills and real-world application.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Lecture Group</td><td>Titles &amp; Key Wording</td><td>Driving Wording Logic</td></tr></thead><tbody><tr><td><strong>L1 &amp; L2</strong> (Lift&nbsp;2.083)</td><td>L1: <strong>Analysis</strong> of&#8230; <strong>Economic</strong> and Social Indicators / L2: <strong>Applied Economics</strong> Real-World Challenges</td><td>The core terms <strong>&#8220;Analysis&#8221;</strong> and <strong>&#8220;Applied&#8221;</strong> signal a curriculum path focused on practical data skills (L1) to solve real-world problems (L2), appealing to students who want measurable, deployable skills.</td></tr><tr><td><strong>L2 &amp; L3</strong> (Lift&nbsp;1.562)</td><td>L2: <strong>Applied Economics</strong> / L3: <strong>Business Analytics</strong> for <strong>Financial Decisions</strong></td><td>The combination of <strong>&#8220;Applied&#8221;</strong> and <strong>&#8220;Analytics&#8221;</strong> defines a quantitative financial student. They select L2 for the general economic context and L3 for the specific financial toolset.</td></tr></tbody></table></figure>






<p><strong>Driving Wording:</strong> <strong>Applied</strong>, <strong>Analysis</strong>, <strong>Economics</strong>, <strong>Decisions</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">3. Driving Theme: Business Management &amp; Strategy</h3>



<p>This group is driven by a focus on business processes and the organizational changes brought by technology.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Lecture Group</td><td>Titles &amp; Key Wording</td><td>Driving Wording Logic</td></tr></thead><tbody><tr><td><strong>L7 &amp; L11</strong> (Lift&nbsp;1.786)</td><td>L7: <strong>Economics of Innovation</strong> / L11: <strong>People management</strong> in the <strong>digital economy</strong></td><td>The terms <strong>&#8220;Innovation&#8221;</strong> and <strong>&#8220;Digital&#8221;</strong> are the semantic link. Students are building a profile focused on managing organizations in a rapidly changing, technology-driven environment, linking macro strategy (L7) with HR/people skills (L11).</td></tr><tr><td><strong>L5 &amp; L6</strong> (Lift&nbsp;1.212)</td><td>L5: <strong>Business process management</strong> / L6: <strong>Customer Experience Management</strong></td><td>The recurring term <strong>&#8220;Management&#8221;</strong> creates the link. L5 focuses on the <strong>internal</strong> view (Process) and L6 focuses on the <strong>external</strong> view (Customer), showing students seek comprehensive skills for managing the entire value chain.</td></tr></tbody></table></figure>






<p><strong>Driving Wording:</strong> <strong>Management</strong>, <strong>Process</strong>, <strong>Digital</strong>, <strong>Innovation</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Summary of Driving Wording</h2>



<p>The strongest selection drivers are not single words but rather <strong>qualified thematic phrases</strong> that define an academic specialization:</p>



<p><strong>Modern Business Specialization:</strong> Driven by <strong>Management</strong> combined with <strong>Process</strong> or <strong>Digital/Innovation</strong> (L5, L6, L7, L11).</p>



<p><strong>Macro/Global Specialization:</strong> Driven by <strong>International</strong> and <strong>Political Economy</strong> (L9, L12, L8).</p>



<p><strong>Quantitative Specialization:</strong> Driven by <strong>Applied Economics</strong> and <strong>Business Analytics</strong> (L2, L3, L1).</p>



<p>You can read more about wording analysis here: <a href="https://mietwood.com/organizational-aspiration-for-social-impact">Organizational aspiration for social impact</a></p>



<h2 class="wp-block-heading">Wording recommendations for L6: how can I modify or extend the title of L6 to experiment with attracting more students in the next period?</h2>



<p>The goal is to maximize enrollment by aligning the L6 title with the <strong>thematic drivers</strong> of the lectures most closely linked to it (L5, L2, L4). Your current title is <strong>L6: Customer Experience Management</strong>.</p>



<p>Here are the recommendations for wording modifications, structured by the target student profile you want to attract:</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Recommended Title Modifications for L6</h2>



<h3 class="wp-block-heading">Strategy 1: Attract <strong>L5</strong> Students (Strongest Existing Link)</h3>



<p>The L5 title is <em>Business process management</em>. These students seek <strong>internal efficiency</strong> as a foundation for external success.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Recommendation</td><td>Suggested Title (Experiment 1)</td><td>Rationale</td></tr></thead><tbody><tr><td><strong>Integrate &#8220;Process&#8221;</strong></td><td><strong>L6: Customer Experience Management and Service Design Process</strong></td><td>By including the word <strong>&#8220;Process,&#8221;</strong> you explicitly link L6 to the operational skills L5 students value, making it the logical <strong>next step</strong> for their expertise.</td></tr><tr><td><strong>Focus on Value</strong></td><td><strong>L6: Managing Business Processes for Customer Value and Experience</strong></td><td>This title frames L6 as the <strong>culmination</strong> of L5, showing how mastering L5&#8217;s internal processes directly leads to the high-value outcome of great customer experience.</td></tr></tbody></table></figure>






<h3 class="wp-block-heading">Strategy 2: Attract <strong>L2/L4</strong> Students (Applied &amp; Analytical) </h3>



<p>The L2 title is <em>Applied Economics Real-World Challenges and Solutions</em>. L4 is <em>Business plan</em>. These students are <strong>practical and analytical</strong>.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Recommendation</td><td>Suggested Title (Experiment 2)</td><td>Rationale</td></tr></thead><tbody><tr><td><strong>Integrate &#8220;Analytics&#8221;</strong></td><td><strong>L6: Customer Experience Analytics: Data-Driven Strategies for Service Improvement</strong></td><td>The term <strong>&#8220;Analytics&#8221;</strong> strongly attracts L2/L4 students (who also take L3: <em>Business Analytics&#8230;</em>). This signals that L6 is a <strong>quantitative course</strong>, not just a soft skill.</td></tr><tr><td><strong>Focus on &#8220;Metrics/KPIs&#8221;</strong></td><td><strong>L6: Customer Experience Management: Measuring and Optimizing Key Service Metrics</strong></td><td>This appeals to the <strong>Applied/Solutions</strong> mindset, promising tools to measure CX performance and directly influence business outcomes, fitting the L2/L4 focus on solutions and planning.</td></tr></tbody></table></figure>






<h3 class="wp-block-heading">Strategy 3: Attract <strong>L7/L11</strong> Students (Future/Digital Focus) </h3>



<p>The L7/L11 titles feature <strong>Innovation</strong> and <strong>Digital Economy</strong>.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Recommendation</td><td>Suggested Title (Experiment 3)</td><td>Rationale</td></tr></thead><tbody><tr><td><strong>Integrate &#8220;Digital&#8221;</strong></td><td><strong>L6: Digital Customer Experience (DCX) Management</strong></td><td>The term <strong>&#8220;Digital&#8221;</strong> is a powerful modern driver. If your course includes any digital touchpoints (apps, online service, social media), using &#8220;DCX&#8221; will immediately pull in the students from the L7/L11 &#8220;Innovation&#8221; track.</td></tr></tbody></table></figure>






<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Final Recommendation: The Best Title to Test</h3>



<p>The most balanced and powerful title that leverages multiple co-selection themes is: <strong>Customer Experience Management: Data, Process, and Digital Strategy</strong></p>



<p>This phrase:</p>



<ul class="wp-block-list">
<li>Includes <strong>Management</strong> (L5, L6, L11 theme).</li>



<li>Includes <strong>Process</strong> (L5 link).</li>



<li>Includes <strong>Data</strong> (L2/L3 Analytics link).</li>



<li>Includes <strong>Digital Strategy</strong> (L7/L11 Innovation link).</li>
</ul>



<p>Thank you for reading.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/profiling">Student Profiling for your lecture</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Fuzzy-Set Qualitative Comparative Analysis</title>
		<link>https://mietwood.com/qualitative-comparative-analysis</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sun, 28 Sep 2025 12:16:51 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3329</guid>

					<description><![CDATA[<p>Qualitative comparative analysis (QCA) is an asymmetric data analysis technique that combines the logic and empirical intensity of qualitative approaches. The symmetric data analysis (e.g., correlation and multiple regression analysis) &#8230; The asymmetric data analysis (i.e., individual case outcome forecasts) &#8230; Based on: Fuzzy-set Qualitative Comparative Analysis (fsQCA): Guidelines for research practice in Information Systems...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/qualitative-comparative-analysis">Fuzzy-Set Qualitative Comparative Analysis</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Qualitative comparative analysis (QCA) is an asymmetric data analysis technique that combines the logic and empirical intensity of qualitative approaches.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="461" src="https://mietwood.com/wp-content/uploads/2025/09/image-7.jpg" alt="Qualitative comparative analysis (QCA) is an asymmetric data analysis technique that combines the logic and empirical intensity of qualitative approaches" class="wp-image-3330" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-7.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/09/image-7-300x135.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-7-768x346.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Qualitative comparative analysis (QCA) is an asymmetric data analysis technique that combines the logic and empirical intensity of qualitative approaches</figcaption></figure>



<p>The symmetric data analysis (e.g., correlation and multiple regression analysis) &#8230;</p>



<p>The asymmetric data analysis (i.e., individual case outcome forecasts) &#8230;</p>



<p>Based on: <a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037" target="_blank" rel="noopener">Fuzzy-set Qualitative Comparative Analysis (fsQCA): Guidelines for research practice in Information Systems and marketing</a> (<a href="https://www.sciencedirect.com/author/55387371600/ilias-o-pappas" target="_blank" rel="noopener">Ilias O. Pappas</a>, <a href="https://www.sciencedirect.com/author/7006553735/arch-g-woodside" target="_blank" rel="noopener">Arch G. Woodside</a>)</p>



<p><strong>Qualitative</strong> inductive reasoning, with data being analyzed “by case” and not “by variable”, is combined with quantitative empirical testing, as sufficient and necessary conditions identify outcomes through statistical methods. In most cases, QCA is useful in quantitative studies, as it allows researchers to get a deep view of their data through a quantitative analysis that also has several characteristics of qualitative analysis.</p>



<p><strong>Case studies</strong> focus on describing, explaining, and forecasting single and combinatorial conditional antecedents of outcomes, while variable studies focus on the similarities of variances of two or more variables. A “condition” is a point or interval range of an antecedent or outcome; a “variable” is a characteristic that varies.</p>



<p>Here are a few examples of conditions versus variables: “Male” is a condition; “gender” is a variable. “Swedish” is a condition; “nationality” is a variable. “Expert” is a condition; “expertise” is a variable.</p>



<p>The <strong>goal of QCA</strong> is to explain causality in complex real-life phenomena. QCA supports “multiple-conjunctural causation”, which refers to a “nonlinear, nonadditive, non-probabilistic conception that rejects any form of permanent causality” and stresses that different paths can lead to the same outcome. QCA investigates complex combinations of conditions and diversity. It uses Boolean algebra and Boolean minimization algorithms to capture patterns of multiple-conjunctural causation and to simplify complex data structures.</p>
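<p>As a rough illustration of Boolean minimization, the Python sketch below applies one reduction rule repeatedly: two configurations that lead to the same outcome and differ in exactly one condition can be merged, and that condition is dropped. The conditions A, B and C and the observed configurations are hypothetical. This is a simplified sketch of the idea, not the full procedure implemented in QCA software, and it omits the final step of selecting a minimal set of terms.</p>

```python
from itertools import combinations

# Hypothetical crisp-set configurations (conditions A, B, C) that all show the outcome.
# 1 = condition present, 0 = condition absent.
configs = {(1, 1, 1), (1, 1, 0), (0, 1, 1)}


def merge(p, q):
    """Merge two terms that differ in exactly one position; '-' marks a dropped condition."""
    diff = [i for i, (x, y) in enumerate(zip(p, q)) if x != y]
    if len(diff) == 1 and "-" not in (p[diff[0]], q[diff[0]]):
        return p[:diff[0]] + ("-",) + p[diff[0] + 1:]
    return None


def prime_implicants(terms):
    """Repeat pairwise merging until no term can be simplified further."""
    terms, primes = set(terms), set()
    while terms:
        merged, used = set(), set()
        for p, q in combinations(sorted(terms, key=str), 2):
            m = merge(p, q)
            if m is not None:
                merged.add(m)
                used.update({p, q})
        primes.update(terms - used)
        terms = merged
    return primes


def label(term, names="ABC"):
    """Render a term such as (1, '-', 0) as 'A*~C'."""
    parts = [n if v == 1 else "~" + n for n, v in zip(names, term) if v != "-"]
    return "*".join(parts)


print(sorted(label(t) for t in prime_implicants(configs)))
```

<p>The three observed configurations reduce to two simpler paths, A*B and B*C. Both lead to the same outcome, which is the &#8220;different paths, same outcome&#8221; idea described above.</p>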



<h2 class="wp-block-heading" id="sect0025">Types of qualitative comparative analysis (QCA)</h2>



<h3 class="wp-block-heading" id="sect0030">CsQCA and mvQCA</h3>



<p><strong>CsQCA</strong> (<strong>crisp-set</strong> QCA) is the first variation of QCA. It was created as a tool to deal with complex sets of binary data. Because it relies on Boolean algebra, csQCA takes binary data (0 or 1) as input and applies logical operations to them. It is therefore very important to dichotomize the variables in a useful and meaningful manner.</p>



<p><strong>mvQCA</strong> treats variables as <strong>multi-valued</strong> instead of dichotomous. MvQCA retains the idea of performing a synthesis of the dataset, so that cases with the same value on the outcome variable are explained by a solution, which contains combinations of variables that explain a number of cases with the outcome.</p>



<p><strong>FsQCA</strong> addresses an important limitation of csQCA: because variables are binary, the analysis cannot fully capture the complexity in cases that naturally vary by level or degree. This restriction of csQCA is likely an important reason that QCA has not been widely adopted in multiple contexts, including IS and marketing research. FsQCA extends csQCA by integrating fuzzy-set and fuzzy-logic principles with QCA, so variables can take any value in the range 0–1. FsQCA is able to overcome several limitations of both csQCA and mvQCA and has received increased attention recently. Applied together with complexity theory, it provides the opportunity to gain deeper and richer insight into data.</p>
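<p>The difference between crisp and fuzzy sets can be sketched in a few lines of Python. The membership scores and set names below are hypothetical. The sketch uses the standard fuzzy-set operators (minimum for AND, maximum for OR) and a 0.5 cut-off for the crisp view, which is an assumption made for the example.</p>

```python
# Hypothetical fuzzy membership scores (0 = fully out, 1 = fully in) for three cases.
cases = {
    "case_1": {"high_trust": 0.9, "high_quality": 0.6},
    "case_2": {"high_trust": 0.4, "high_quality": 0.8},
    "case_3": {"high_trust": 0.1, "high_quality": 0.2},
}


def crisp(score, cutoff=0.5):
    """Crisp-set view: a case is either in (1) or out (0) of the set."""
    return 1 if score > cutoff else 0


def fuzzy_and(*scores):
    """Fuzzy intersection: membership in 'A AND B' is the minimum score."""
    return min(scores)


def fuzzy_or(*scores):
    """Fuzzy union: membership in 'A OR B' is the maximum score."""
    return max(scores)


for name, m in cases.items():
    both = fuzzy_and(m["high_trust"], m["high_quality"])
    either = fuzzy_or(m["high_trust"], m["high_quality"])
    print(name, crisp(m["high_trust"]), crisp(m["high_quality"]), both, either)
```

<p>The crisp view treats a score of 0.6 and a score of 0.9 as the same &#8220;1&#8221;, while the fuzzy view keeps the degree of membership, which is the limitation of csQCA described above.</p>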



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-1024x576.jpg" alt="fsQCA and cluster analysis" class="wp-image-3331" srcset="https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-1024x576.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-300x169.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-768x432.jpg 768w, https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-1536x864.jpg 1536w, https://mietwood.com/wp-content/uploads/2025/09/dynamic-abstract-image-with-mathematical-symbols-on-floating-papers-vibrant-and-conceptual.-18069230-2048x1152.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">fsQCA and cluster analysis</figcaption></figure>



<h3 class="wp-block-heading" id="sect0040">FsQCA and cluster analysis</h3>



<p>Case-based techniques, such as fsQCA and cluster analysis, have been employed as a way of moving beyond variance-based methods. The two techniques are similar in that both employ multidimensional spaces, and people often ask how fsQCA differs from cluster analysis and why we need it. A main difference between the two methods is the kind of research questions they are able to address.</p>



<p>Specifically, <strong>cluster analysis</strong> answers questions such as which cases are more similar to each other, while fsQCA can identify the different configurations that constitute sufficient and/or necessary conditions for the outcome of interest. Depending on the focus of the study the researcher should choose the most appropriate method. Their differences stem from the fact that <em>“QCA addresses the positioning of cases in [multidimensional] spaces via set theoretic operations while cluster analysis relies on geometric distance measures and concepts of variance minimization”</em>. To this end, prior studies compare fsQCA with cluster analysis and show how fsQCA can handle causal complexity with fine-grained level data, or how it can identify more solutions compared to cluster analysis. A discussion exists in the literature regarding QCA and cluster analysis, and both approaches have differences making them suitable for different types of studies.</p>



<p>Read an example of cluster analysis here: <a href="https://mietwood.com/hierarchical-agglomerative-clustering-for-product-grouping">Hierarchical Agglomerative Clustering for Product Grouping</a></p>
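<p>The contrast between set-theoretic operations and geometric distance can be sketched in Python. The membership scores are hypothetical. The consistency and coverage formulas are the ones commonly used in the fsQCA literature for testing sufficiency; they are not spelled out in this post, so treat them as an added assumption.</p>

```python
import math

# Hypothetical fuzzy membership scores for five cases.
condition = [0.9, 0.8, 0.6, 0.3, 0.1]   # membership in the condition X
outcome = [1.0, 0.9, 0.7, 0.6, 0.4]     # membership in the outcome Y


def consistency(x, y):
    """Degree to which X is a subset of Y: sum(min(x, y)) / sum(x)."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(x)


def coverage(x, y):
    """Share of the outcome accounted for by X: sum(min(x, y)) / sum(y)."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(y)


def euclidean(p, q):
    """Geometric distance between two cases, as used by cluster analysis."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print("consistency:", round(consistency(condition, outcome), 3))
print("coverage:", round(coverage(condition, outcome), 3))
# Distance between the first two cases in the (X, Y) space.
print("distance:", round(euclidean((0.9, 1.0), (0.8, 0.9)), 3))
```

<p>In this made-up sample every case scores at least as high on the outcome as on the condition, so the condition is a consistent subset of the outcome (consistency 1.0, coverage 0.75). The distance function only says how close two cases are to each other; it says nothing about sufficiency.</p>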



<h2 class="wp-block-heading" id="sect0045">Adoption of fsQCA in relevant studies</h2>



<p>Configurational approaches are becoming more popular over the past few years in different areas, with fsQCA playing a large part in this as most studies will prefer fuzzy-set over crisp-set and multi-value QCA (<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0365" target="_blank" rel="noopener">Thiem &amp; Dusa, 2013</a>). In detail, fsQCA has been employed in&nbsp;<a href="https://www.sciencedirect.com/topics/computer-science/information-system" target="_blank" rel="noopener">information systems</a>&nbsp;(<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0095" target="_blank" rel="noopener">Fedorowicz, Sawyer, &amp; Tomasino, 2018</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0165" target="_blank" rel="noopener">Liu et al., 2017</a>), online business and marketing (<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0235" target="_blank" rel="noopener">Pappas et al., 2016</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0225" target="_blank" rel="noopener">Pappas, 2018</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0420" target="_blank" rel="noopener">Woodside, 2017</a>),&nbsp;<a href="https://www.sciencedirect.com/topics/psychology/consumer-psychology" target="_blank" rel="noopener">consumer psychology</a>&nbsp;(<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0325" target="_blank" rel="noopener">Schmitt, Grawe, &amp; Woodside, 2017</a>), strategy and organizational research (<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0105" target="_blank" rel="noopener">Fiss, 2011</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0135" target="_blank" rel="noopener">Greckhamer et al., 2018</a>), education (<a 
href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0245" target="_blank" rel="noopener">Pappas, Giannakos et al., 2017</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0270" target="_blank" rel="noopener">Plewa, Ho, Conduit, &amp; Karpen, 2016</a>),&nbsp;<a href="https://www.sciencedirect.com/topics/social-sciences/data-science" target="_blank" rel="noopener">data science</a>&nbsp;(<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0375" target="_blank" rel="noopener">Vatrapu, Mukkamala, Hussain, &amp; Flesch, 2016</a>) and&nbsp;<a href="https://www.sciencedirect.com/topics/computer-science/learning-analytics" target="_blank" rel="noopener">learning analytics</a>&nbsp;(<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0215" target="_blank" rel="noopener">Papamitsiou et al., 2018</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0335" target="_blank" rel="noopener">Sergis, Sampson, &amp; Giannakos, 2018</a>). This tutorial aims to increase the adoption of fsQCA in IS and marketing studies following the call for more empirical work in the area (<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0085" target="_blank" rel="noopener">El Sawy, Malhotra, Park, &amp; Pavlou, 2010</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0105" target="_blank" rel="noopener">Fiss, 2011</a>;&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0415" target="_blank" rel="noopener">Woodside, 2014</a>,&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0268401221000037#bib0420" target="_blank" rel="noopener">2017</a>).</p>



<h2 class="wp-block-heading">Benefits of fsQCA</h2>



<p>FsQCA is useful for both inductive and deductive reasoning for theory building, elaboration, and testing. The analysis makes it possible to identify specific cases in the sample. With this knowledge, the researcher can go back to the cases and use contextual information not included in the analysis to further explain and discuss the findings.</p>



<p>A typical variance-based analysis would identify a single best solution, thus limiting the results. FsQCA studies can compare the findings between different data analysis techniques to describe how different stories are hidden in the same dataset. It is recommended to combine fsQCA with other data analysis techniques if possible.</p>



<h2 class="wp-block-heading">How to use fsQCA in a typical e-commerce study</h2>



<h3 class="wp-block-heading">Sampling</h3>



<p>The study examined cognitive and affective perceptions as antecedents of online shopping behavior in personalized e-commerce environments. We used a typical snowball sampling methodology to recruit participants and controlled for respondents’ previous experience with both online shopping and personalized services. The final sample comprises 582 individuals with experience in online shopping and personalized services. We collected data through a questionnaire built with measures adopted from the literature. Appendix A (as presented in the original study) lists construct definitions, the questionnaire items used to measure each construct, along with descriptive statistics and loadings.</p>



<h3 class="wp-block-heading">Evaluate constructs for reliability and validity</h3>



<p>As is typical in similar quantitative studies, we first <strong>evaluate constructs for reliability and validity</strong>. This step should always be performed when appropriate, and it is not directly related to the fsQCA analysis, as it depends on the type of variables used in the study. Construct reliability and validity, as the name implies, refer to the construct itself and not to the method of analysis used to examine relations between constructs.</p>



<p>Reliability testing, based on the Cronbach alpha indicator, showed acceptable indices of internal consistency since all constructs exceed the cut-off threshold of 0.70. The AVE for all constructs ranged between 0.55 and 0.84, all correlations were lower than 0.80, and square root AVEs for all constructs were larger than their correlations. The findings in detail for the confirmatory analysis may be found in the original paper.</p>
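
<p>To make the reliability check concrete, here is a minimal sketch of how Cronbach's alpha can be computed from raw item scores. The item data below are hypothetical and are not taken from the original study; the function name <code>cronbach_alpha</code> is ours.</p>

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for one construct.

    items: list of item columns, each a list with one score per respondent.
    """
    k = len(items)
    # Sum of the sample variances of the individual items
    item_variance_sum = sum(variance(column) for column in items)
    # Variance of each respondent's total score across all items
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 7-point Likert answers: 3 items, 6 respondents
items = [
    [6, 5, 7, 3, 4, 6],
    [6, 4, 7, 2, 4, 5],
    [7, 5, 6, 3, 5, 6],
]

alpha = cronbach_alpha(items)
print(f"Cronbach's alpha: {alpha:.3f}")
print("Above the 0.70 cut-off:", alpha > 0.70)
```

<p>A value above the 0.70 cut-off mentioned above would be read as acceptable internal consistency for that construct.</p>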



<h3 class="wp-block-heading" id="sect0070">Contrarian case analysis</h3>



<p>Contrarian case analysis is performed outside fsQCA, but we present it here because it can serve as an easy and quick way to examine how many cases in our sample are not explained by main effects, and thus they would not be included in the outcome of a typical variance-based approach, e.g., correlation or regression analysis.</p>
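
<p>One common way to run such a check is to split two variables into quintiles and cross-tabulate them. The sketch below assumes that approach; the scores are hypothetical, and ties are split by position in the list, which a real analysis may want to handle differently.</p>

```python
def quintile(values):
    """Return the rank-based quintile (1..5) of each value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * 5 // len(values) + 1
    return bins


def crosstab(x, y):
    """5x5 table of counts: rows are quintiles of x, columns are quintiles of y."""
    table = [[0] * 5 for _ in range(5)]
    for qx, qy in zip(quintile(x), quintile(y)):
        table[qx - 1][qy - 1] += 1
    return table


# Hypothetical scores for 10 respondents (antecedent x, outcome y)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 9, 3, 4, 5, 6, 7, 8, 1, 10]

table = crosstab(x, y)

# Cases that contradict a positive main effect:
# low x (quintiles 1-2) with high y (quintiles 4-5), or high x with low y
contrarian = (
    sum(table[r][c] for r in (0, 1) for c in (3, 4))
    + sum(table[r][c] for r in (3, 4) for c in (0, 1))
)

for row in table:
    print(row)
print("Contrarian cases:", contrarian)
```

<p>Cases that land in the off-diagonal corners of the table run against the main effect, which is exactly the group a regression would fail to describe.</p>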



<h3 class="wp-block-heading">Data Calibration</h3>



<p>In fsQCA, unlike traditional methods, instead of working with probabilities, <strong>data are transformed from ordinal or interval scales into degrees of membership in the target set</strong>, which shows whether and to what degree a case belongs to a specific set. “In essence, a fuzzy membership score attaches a truth value, not a probability, to a statement”.</p>



<p>For example, the variable intention to purchase can be coded as “high intention to purchase”, and we will be looking for the presence or absence of the condition high intention to purchase (“intention to purchase” is the variable; “high intention to purchase” is a condition). Similarly, we code the rest of the variables. </p>



<p>The method computes the presence of a condition or its opposite (i.e., negation). The negation of a condition is referred to in the literature as the absence of a condition, and the two terms have been used interchangeably based on how the absence is computed. The term absence has also been used to describe a condition that is irrelevant in a configuration. This is similar to the “do not care” term that is also often used in the literature.</p>



<p>This distinction is not often addressed or clarified, so we suggest that researchers clearly define these terms in future work to avoid misunderstandings.</p>
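
<p>In standard fuzzy-set algebra, the negation of a condition is computed as one minus the membership score, and membership in a combination of conditions (logical AND) is the minimum of the individual scores. The short sketch below illustrates this under that assumption; the condition names and scores are hypothetical.</p>

```python
def negate(membership):
    """Membership in the negation (absence) of a condition."""
    return 1 - membership


def conjunction(*memberships):
    """Membership in a configuration that requires all conditions (logical AND)."""
    return min(memberships)


# Hypothetical membership scores for one case
high_intention = 0.8
high_quality = 0.3

print(negate(high_intention))                        # membership in "not high intention"
print(conjunction(high_intention, high_quality))     # both conditions present
print(conjunction(high_intention, negate(high_quality)))  # presence AND negation
```

<p>A “do not care” condition is different: it is simply left out of the configuration, so it does not enter the minimum at all.</p>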



<h3 class="wp-block-heading" id="sect0085">Transform data into fuzzy-sets</h3>



<p>In fsQCA we need to calibrate our variables to form fuzzy sets with values ranging from 0 to 1. Think of a fuzzy set as a group: the values from 0 to 1 define whether and to what degree a case belongs to this group. A case with a fuzzy membership score of 1 is a <em>full member</em> of a fuzzy set (fully in the set), and a case with a membership score of 0 is a <em>full non-member</em> of the set (fully out of the set). A membership score of 0.5 is exactly in the middle, so a case would be both a member of the fuzzy set and a non-member, and is therefore a member of what is known as the <em>intermediate</em> set. The intermediate-set point is the value where there is maximum ambiguity as to whether a case is more in or more out of the target set.</p>



<p>Data calibration may be either <strong>direct or indirect</strong>. In <strong>direct calibration</strong> the researcher chooses exactly three qualitative breakpoints, which define the level of membership in the fuzzy set for each case (fully in, intermediate, fully out). In the <strong>indirect method</strong>, the measurements are rescaled based on qualitative assessments. The researcher may choose to calibrate a measure differently, depending on what is being investigated. Either method may be chosen, depending on the researcher’s substantive knowledge of both data and underlying theory. The direct method is recommended and more common. It can lead to more rigorous studies that are easier to replicate and validate, since it is clearer how the thresholds have been chosen.</p>



<p>Percentiles allow the calibration of any measure regardless of its original values. In detail, we can compute the 95th, 50th, and 5th percentiles of our measures and use these values as the three thresholds (full membership, intermediate, full non-membership) in the fsQCA software.</p>



<p>Especially in the case of the widely used seven-point Likert scales (1=Not at all, 7=Very much), previous studies suggest that the values of 6, 4, and 2 can be used as thresholds. Similarly, for a five-point Likert scale the thresholds could be 4, 3, and 2.</p>
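
<p>The direct method with three thresholds can be sketched as a logistic transformation of the distance from the crossover point. This is a minimal illustration that assumes the full-membership and full-non-membership thresholds map to scores of 0.95 and 0.05; the exact scaling used by the fsQCA software may differ slightly, and the function name <code>calibrate</code> is ours.</p>

```python
import math


def calibrate(x, full_out, crossover, full_in):
    """Map a raw value to a fuzzy membership score between 0 and 1.

    Assumption: full_in corresponds to a score of 0.95, crossover to 0.5,
    and full_out to 0.05.
    """
    scale = math.log(0.95 / 0.05)  # log-odds of a 0.95 membership score
    if x >= crossover:
        log_odds = scale * (x - crossover) / (full_in - crossover)
    else:
        log_odds = scale * (x - crossover) / (crossover - full_out)
    return 1 / (1 + math.exp(-log_odds))


# Seven-point Likert scale with thresholds 6, 4, and 2
for raw in range(1, 8):
    score = calibrate(raw, full_out=2, crossover=4, full_in=6)
    print(raw, round(score, 2))
```

<p>Values beyond the outer thresholds move closer to 1 or 0 but never reach them exactly, while a raw value at the crossover point receives the maximally ambiguous score of 0.5.</p>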



<h3 class="wp-block-heading" id="sect0105">Interpreting and presenting the solutions</h3>



<p>FsQCA software provides all three solutions every time. Complex and parsimonious solutions are computed regardless of any simplifying assumptions employed by the researcher (e.g., choosing the presence or absence/negation of a variable), while the intermediate solution depends on these assumptions. While the intermediate solution includes both core and peripheral conditions, we need an easy way to make the distinction that will help us interpret and present the solutions more clearly.</p>



<p>To improve the presentation of the findings we can transform the solutions from fsQCA output into a table that is easier to read. Typically, </p>



<ol class="wp-block-list">
<li>the presence of a condition is indicated with a black circle (●), </li>



<li>the absence/negation with a crossed-out circle (⊗), </li>



<li>and the “do not care” condition with a blank space. </li>
</ol>



<p>The negation of a condition is also referred to in the literature as absence, and the two terms have been used interchangeably. The distinction between core and peripheral is made by using large and small circles, respectively. The researcher needs to present the overall solution consistency and the overall solution coverage. The overall coverage describes the extent to which the outcome of interest may be explained by the configurations, and is comparable with the R-square reported in regression-based methods. In our example, the results indicate an overall solution coverage of 0.84, which suggests that a substantial proportion of the outcome is covered by the nine solutions.</p>
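
<p>For readers who want to see what consistency and coverage measure, the sketch below uses the standard fuzzy-set formulas for a sufficiency relation: the sum of the minimum of configuration and outcome membership, divided by the sum of configuration membership (consistency) or outcome membership (coverage). The membership scores are hypothetical and unrelated to the 0.84 reported above.</p>

```python
def consistency(config, outcome):
    """Degree to which membership in the configuration is a subset of the outcome."""
    overlap = sum(min(x, y) for x, y in zip(config, outcome))
    return overlap / sum(config)


def coverage(config, outcome):
    """Share of the outcome that is accounted for by the configuration."""
    overlap = sum(min(x, y) for x, y in zip(config, outcome))
    return overlap / sum(outcome)


# Hypothetical membership scores for four cases
config = [0.9, 0.7, 0.2, 0.6]
outcome = [1.0, 0.8, 0.4, 0.5]

print(f"consistency = {consistency(config, outcome):.3f}")
print(f"coverage    = {coverage(config, outcome):.3f}")
```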



<p>All graphics and further explanation can be found in: Pappas, I. O., &amp; Woodside, A. G. (2021). Fuzzy-set Qualitative Comparative Analysis (fsQCA): Guidelines for research practice in Information Systems and marketing. <em>International journal of information management</em>, <em>58</em>, 102310.</p>



<p></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/qualitative-comparative-analysis">Fuzzy-Set Qualitative Comparative Analysis</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What are independent variables?</title>
		<link>https://mietwood.com/independent-variables</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 18:44:48 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3316</guid>

					<description><![CDATA[<p>Independent variables, also called predictors, features, or explanatory variables, are the variables in a statistical or machine learning model that are used to explain or predict changes in another variable — the dependent variable, also called the outcome or target. Independent variables in simple terms: Example of independent variables in Customer Management (RFM Model): Suppose...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/independent-variables">What are independent variables?</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Independent variables, </strong>also called <strong>predictors</strong>, <strong>features</strong>, or <strong>explanatory variables</strong>, are the variables in a statistical or machine learning model that are used to <strong>explain or predict</strong> changes in another variable — the <strong>dependent variable</strong>, also called the outcome or target.</p>



<h3 class="wp-block-heading" id="insimpleterms"><strong>Independent variables</strong> in simple terms:</h3>



<ul class="wp-block-list">
<li><strong>Independent variables</strong>: Inputs you control or observe.</li>



<li><strong>Dependent variable</strong>: Output you want to understand or predict.</li>
</ul>



<h3 class="wp-block-heading" id="exampleincustomermanagementrfmmodel">Example of i<strong>ndependent variables</strong> in Customer Management (RFM Model):</h3>



<p>Suppose you&#8217;re analyzing customer behavior to predict <strong>churn</strong> (whether a customer will stop buying).</p>



<ul class="wp-block-list">
<li><strong>Independent variables</strong>:
<ul class="wp-block-list">
<li><strong>Recency</strong>: How recently a customer made a purchase.</li>



<li><strong>Frequency</strong>: How often they purchase.</li>



<li><strong>Monetary</strong>: How much they spend.</li>
</ul>
</li>



<li><strong>Dependent variable</strong>:
<ul class="wp-block-list">
<li><strong>Churn</strong>: 1 if the customer churned, 0 if they stayed.</li>
</ul>
</li>
</ul>



<p>In this case, <strong>Recency</strong>, <strong>Frequency</strong>, and <strong>Monetary</strong> are independent variables used to predict the likelihood of <strong>churn</strong>. See also <a href="https://mietwood.com/python-for-business-analytics-2">Python for business analytics &#8211; rfm analysis</a></p>



<h3 class="wp-block-heading" id="whycheckforindependenceamongindependentvariables">Why check for independence among independent variables?</h3>



<p>If an independent variable is <strong>highly correlated with other independent variables, </strong>i.e., it is not truly independent, it can cause <strong>multicollinearity</strong>, which makes model coefficients unstable, reduces interpretability, and can lead to misleading conclusions.</p>



<h2 class="wp-block-heading"><strong>Multicollinearity</strong></h2>



<p><strong>Multicollinearity</strong> refers to a statistical phenomenon in which two or more independent variables in a regression model are highly correlated. This makes it difficult to determine the individual effect of each variable on the dependent variable because they essentially carry overlapping information.</p>



<h3 class="wp-block-heading"><strong>Assessment of Multicollinearity</strong></h3>



<p>To assess multicollinearity, you can use the following methods:</p>



<ol class="wp-block-list">
<li><strong>Correlation Matrix</strong>
<ul class="wp-block-list">
<li>Check pairwise correlations between independent variables.</li>



<li>High correlation (e.g., > 0.8 or &lt; -0.8) may indicate multicollinearity.</li>
</ul>
</li>



<li><strong>Variance Inflation Factor (VIF)</strong>
<ul class="wp-block-list">
<li>Measures how much the variance of a regression coefficient is inflated due to multicollinearity.</li>



<li><strong>VIF > 5 or 10</strong> is often considered problematic.</li>
</ul>
</li>



<li><strong>Tolerance</strong>
<ul class="wp-block-list">
<li>Tolerance = 1 / VIF.</li>



<li>Low tolerance values (close to 0) indicate high multicollinearity.</li>
</ul>
</li>



<li><strong>Condition Index and Eigenvalues</strong>
<ul class="wp-block-list">
<li>Part of a more advanced diagnostic using matrix decomposition.</li>



<li>A <strong>condition index > 30</strong> may suggest serious multicollinearity.</li>
</ul>
</li>
</ol>
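
<p>The first three diagnostics above can be sketched without any external library. The example relies on the fact that the VIF of each variable equals the corresponding diagonal element of the inverse correlation matrix. The RFM values are hypothetical, and the helper names (<code>pearson</code>, <code>invert</code>, <code>vif</code>) are ours. The condition index is not covered here.</p>

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination.

    Perfectly collinear columns make the matrix singular and this will fail.
    """
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per variable: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical RFM values for 8 customers
recency = [10, 40, 5, 60, 25, 90, 15, 30]
freq = [12, 4, 15, 2, 7, 1, 9, 6]
monetary = [300, 120, 80, 500, 260, 90, 410, 150]

names = ["recency", "freq", "monetary"]
vifs = vif([recency, freq, monetary])
for name, value in zip(names, vifs):
    print(f"{name}: VIF = {value:.3f}, tolerance = {1 / value:.3f}")
```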



<h3 class="wp-block-heading"><strong>How to deal with multicollinearity?</strong></h3>



<ul class="wp-block-list">
<li><strong>Remove one of the correlated variables.</strong></li>



<li><strong>Combine variables</strong> (e.g., using PCA or creating an index).</li>



<li><strong>Regularization techniques</strong> like Ridge or Lasso regression.</li>



<li><strong>Centering variables</strong> (subtracting the mean) can help in some cases.</li>
</ul>
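
<p>Centering mainly helps when the collinearity comes from polynomial or interaction terms built from the same variable. The sketch below illustrates this with a hypothetical variable and its square; it does not apply to two genuinely distinct variables that happen to be correlated.</p>

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical predictor and its squared term
x = list(range(1, 11))
x_squared = [v ** 2 for v in x]

# Centered predictor (mean subtracted) and its squared term
mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]
x_centered_squared = [v ** 2 for v in x_centered]

raw_corr = pearson(x, x_squared)
centered_corr = pearson(x_centered, x_centered_squared)

print(f"corr(x, x^2) before centering: {raw_corr:.3f}")
print(f"corr(x, x^2) after centering:  {centered_corr:.3f}")
```

<p>Because the example values are symmetric around their mean, the correlation between the centered variable and its square drops to essentially zero. With skewed data the reduction is smaller.</p>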



<h2 class="wp-block-heading">Calculation example</h2>



<p>Assume you have data similar to this sample.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="488" height="337" src="https://mietwood.com/wp-content/uploads/2025/09/image-4.jpg" alt="" class="wp-image-3317" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-4.jpg 488w, https://mietwood.com/wp-content/uploads/2025/09/image-4-300x207.jpg 300w" sizes="auto, (max-width: 488px) 100vw, 488px" /><figcaption class="wp-element-caption">RFM data sample &#8211; for testing independent variables </figcaption></figure>
</div>


<h2 class="wp-block-heading">Variable independence testing</h2>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd

# Load the data with the specified delimiter
df = pd.read_csv("RFM_analysis_614.csv", delimiter=",")

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Select the independent variables (RFM)
X = df[&#91;'recency', 'freq', 'monetary'&#93;]

# Add a constant for the VIF calculation (required by the statsmodels function)
X = add_constant(X)

# Create a DataFrame to hold the VIF results
vif_data = pd.DataFrame()
vif_data&#91;"Variable"&#93; = X.columns
vif_data&#91;"VIF"&#93; = [variance_inflation_factor(X.values, i) for i in range(X.shape&#91;1&#93;)]

# Exclude the constant row from the final output since it's not a true variable
vif_data = vif_data&#91;vif_data.Variable != 'const'&#93;.reset_index(drop=True)

print(vif_data)

# Save the VIF results to a CSV file
vif_data.to_csv("vif_results.csv", index=False)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Load</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">specified</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">delimiter</span></span>
<span class="line"><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">read_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">RFM_analysis_614.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">delimiter</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">,</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">stats</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">outliers_influence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">variance_inflation_factor</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">tools</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">tools</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">add_constant</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Select</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">independent</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">variables</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">RFM</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">df</span><span style="color: #D8DEE9FF">[&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">recency</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">freq</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">monetary</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Add</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">constant</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">calculation</span><span style="color: #D8DEE9FF"> (</span><span style="color: #8FBCBB">required</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">statsmodels</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">add_constant</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">hold</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">results</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">pd</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">DataFrame</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Variable</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93; = </span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">columns</span></span>
<span class="line"><span style="color: #8FBCBB">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">VIF</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93; = [</span><span style="color: #8FBCBB">variance_inflation_factor</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">values</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF">) </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">range</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">X</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">shape</span><span style="color: #D8DEE9FF">&#91;1&#93;)]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Exclude</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">constant</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">row</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">final</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">output</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">since</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">it</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">s not a true variabl</span><span style="color: #D8DEE9">e</span></span>
<span class="line"><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">vif_data</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">Variable</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">!=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">const</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">reset_index</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">drop</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">vif_data</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Save</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">VIF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">results</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">CSV</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">file</span></span>
<span class="line"><span style="color: #D8DEE9">vif_data</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">to_csv</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">vif_results.csv</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">False</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p>Finally, the program prints the following results:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="179" height="85" src="https://mietwood.com/wp-content/uploads/2025/09/image-5.jpg" alt="" class="wp-image-3318"/></figure>
</div>


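<p>Based on the Variance Inflation Factor (VIF) calculation, the columns <strong>recency</strong>, <strong>frequency</strong> (<code>freq</code> in the data), and <strong>monetary</strong> show <strong>no problematic multicollinearity</strong>: they are only weakly linearly related to each other and carry <strong>largely non-overlapping information</strong>. Note that a low VIF rules out strong linear dependence among the predictors; it does not prove full statistical independence.</p>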



<p>This means that you can use all three variables together as independent predictors in a statistical model, such as a Cox Proportional Hazards (Cox PH) model, without concern for severe multicollinearity.</p>



<h2 class="wp-block-heading">Variance Inflation Factor (VIF) Results</h2>



<p>The VIF (<a href="https://en.wikipedia.org/wiki/Variance_inflation_factor" target="_blank" rel="noopener">https://en.wikipedia.org/wiki/Variance_inflation_factor</a>)  is a measure of how much the variance of an estimated regression coefficient is increased due to collinearity. A common rule of thumb is that a VIF value <strong>less than 5</strong> or sometimes 10 indicates that the correlation between the variables is not high enough to warrant concern.</p>



<p>The calculated VIF values for your RFM variables are very low:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Variable</td><td>VIF</td></tr></thead><tbody><tr><td><strong>recency</strong></td><td>1.437</td></tr><tr><td><strong>freq</strong></td><td>1.488</td></tr><tr><td><strong>monetary</strong></td><td>1.065</td></tr></tbody></table></figure>
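
<p>Since VIF = 1 / (1 &minus; R&sup2;), where R&sup2; comes from regressing one predictor on the others, the VIF values in the table can be translated back into tolerance and R&sup2;. The short sketch below does that arithmetic for the three reported values.</p>

```python
# VIF values as reported in the table above
reported_vif = {"recency": 1.437, "freq": 1.488, "monetary": 1.065}

tolerance = {name: 1 / value for name, value in reported_vif.items()}
# R^2 of each predictor regressed on the other two predictors
r_squared = {name: 1 - t for name, t in tolerance.items()}

for name in reported_vif:
    print(f"{name}: tolerance = {tolerance[name]:.3f}, R^2 = {r_squared[name]:.3f}")
# recency: tolerance = 0.696, R^2 = 0.304
# freq: tolerance = 0.672, R^2 = 0.328
# monetary: tolerance = 0.939, R^2 = 0.061
```

<p>In other words, roughly a third of the variance in recency and frequency is shared with the other predictors, which is well within the range usually considered harmless.</p>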



<p></p>



<h2 class="wp-block-heading">Conclusion on Independence</h2>



<p>Since all VIF values are close to 1.0 and well below the 5.0 threshold:</p>



<ul class="wp-block-list">
<li><strong>Independent Variables:</strong> You can confidently treat <strong>recency</strong>, <strong>frequency</strong>, and <strong>monetary</strong> as independent variables for your statistical analysis (e.g., in a Cox PH model).</li>



<li><strong>Little Informational Overlap:</strong> The variables provide largely distinct, non-redundant information to the model. For instance, knowing a customer&#8217;s frequency does not allow the model to strongly predict their recency or monetary value.</li>
</ul>



<p></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/independent-variables">What are independent variables?</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Top IDEs for Python developers</title>
		<link>https://mietwood.com/ides-for-python-developers</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Thu, 25 Sep 2025 09:11:31 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3312</guid>

					<description><![CDATA[<p>Here are the top graphical IDEs (Integrated Development Environments) for Python developers as of 2024: 1.&#160;PyCharm 2.&#160;Visual Studio Code (VS Code) 3.&#160;Spyder 4.&#160;Thonny 5.&#160;Wing IDE 6.&#160;Eric 7.&#160;IDLE Summary: For professional development,&#160;PyCharm&#160;and&#160;VS Code&#160;are the most popular. For data science,&#160;Spyder&#160;is widely used. For beginners,&#160;Thonny&#160;or&#160;IDLE&#160;are great choices.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/ides-for-python-developers">Top IDEs for Python developers</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><strong>Here are the top graphical IDEs (Integrated Development Environments) for Python developers</strong> as of 2024:</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">1.&nbsp;<strong>PyCharm</strong></h3>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="421" src="https://mietwood.com/wp-content/uploads/2025/09/image-3.jpg" alt="" class="wp-image-3313" srcset="https://mietwood.com/wp-content/uploads/2025/09/image-3.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/09/image-3-300x123.jpg 300w, https://mietwood.com/wp-content/uploads/2025/09/image-3-768x316.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<ul class="wp-block-list">
<li><strong>Developer:</strong> JetBrains</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Advanced code analysis, smart code completion, integrated debugger, Git support, virtual environment management, Django support.</li>



<li><strong>Community (free) and Professional (paid) editions.</strong></li>



<li><strong>Website:</strong> <a href="https://www.jetbrains.com/pycharm/" target="_blank" rel="noreferrer noopener">PyCharm</a></li>
</ul>



<h3 class="wp-block-heading">2.&nbsp;<strong>Visual Studio Code (VS Code)</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Microsoft</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Lightweight but powerful; excellent Python extension; integrated terminal; rich plugin ecosystem; Git support; Jupyter notebook integration.</li>



<li><strong>Website:</strong> <a href="https://code.visualstudio.com/" target="_blank" rel="noreferrer noopener">VS Code</a></li>
</ul>



<h3 class="wp-block-heading">3.&nbsp;<strong>Spyder</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Spyder project contributors (the name stands for Scientific Python Development Environment)</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Focused on scientific computing and data science; variable explorer; integrated IPython console; plotting support.</li>



<li><strong>Website:</strong> <a href="https://www.spyder-ide.org/" target="_blank" rel="noreferrer noopener">Spyder</a></li>
</ul>



<h3 class="wp-block-heading">4.&nbsp;<strong>Thonny</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> University of Tartu</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Beginner-friendly; simple UI; built-in debugger; good for learning and education.</li>



<li><strong>Website:</strong> <a href="https://thonny.org/" target="_blank" rel="noreferrer noopener">Thonny</a></li>
</ul>



<h3 class="wp-block-heading">5.&nbsp;<strong>Wing IDE</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Wingware</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Powerful debugger; code intelligence; remote development support.</li>



<li><strong>Website:</strong> <a href="https://wingware.com/" target="_blank" rel="noreferrer noopener">Wing IDE</a></li>
</ul>



<h3 class="wp-block-heading">6.&nbsp;<strong>Eric</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Detlev Offenbach</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Full-featured Python and Ruby IDE; integrated debugger; plugin support.</li>



<li><strong>Website:</strong> <a href="https://eric-ide.python-projects.org/" target="_blank" rel="noreferrer noopener">Eric Python IDE</a></li>
</ul>



<h3 class="wp-block-heading">7.&nbsp;<strong>IDLE</strong></h3>



<ul class="wp-block-list">
<li><strong>Developer:</strong> Python Software Foundation (bundled with Python)</li>



<li><strong>Platforms:</strong> Windows, macOS, Linux</li>



<li><strong>Features:</strong> Basic, lightweight, good for quick scripts and learning.</li>



<li><strong>Website:</strong> <a href="https://docs.python.org/3/library/idle.html" target="_blank" rel="noreferrer noopener">IDLE Documentation</a></li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Summary:</strong><br>For professional development,&nbsp;<strong>PyCharm</strong>&nbsp;and&nbsp;<strong>VS Code</strong>&nbsp;are the most popular. For data science,&nbsp;<strong>Spyder</strong>&nbsp;is widely used. For beginners,&nbsp;<strong>Thonny</strong>&nbsp;or&nbsp;<strong>IDLE</strong>&nbsp;are great choices.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/ides-for-python-developers">Top IDEs for Python developers</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Scrape a Website and Search Inside PDFs with Python</title>
		<link>https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sat, 30 Aug 2025 09:13:30 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3292</guid>

					<description><![CDATA[<p>Ever found yourself on a webpage with dozens of PDF links, needing to find a specific piece of information buried in one of them? 😩 We will teach you how to scrape a website and search inside PDFs with Python. Manually downloading and searching each file is tedious, time-consuming, and prone to errors. What if...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python">How to Scrape a Website and Search Inside PDFs with Python</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Ever found yourself on a webpage with dozens of PDF links, needing to find a specific piece of information buried in one of them? 😩 Manually downloading and searching each file is tedious, time-consuming, and prone to errors. What if you could automate the entire process with just a few lines of code?</p>



<p>In this tutorial, we&#8217;ll show you exactly how to do that. We’ll build a powerful yet simple Python script that automatically scans a webpage, finds all the PDF links, and searches for specific text inside each one. Using popular libraries like <strong>Requests</strong>, <strong>BeautifulSoup</strong>, and <strong>PyPDF</strong>, you&#8217;ll learn a practical skill that can save you hours of manual work. Let&#8217;s get started!</p>






<h2 class="wp-block-heading">Scrape a Website</h2>



<p>The script works in six steps. It <strong>finds</strong> all links on the initial page and <strong>filters</strong> for links that end with <code>.pdf</code>. For each PDF link, it <strong>downloads</strong> the file into memory, <strong>extracts</strong> text from every page, and <strong>searches</strong> the extracted text for your <code>search_string</code>. Finally, it <strong>reports</strong> which PDF files contain the phrase.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import io
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader


# Scrape a website
def find_linked_pdfs(url):
    """
    Scans a webpage and returns the PDF links found on it.

    Args:
        url: The URL of the webpage to scan.

    Returns:
        A list of absolute PDF URLs (empty if none are found or the page cannot be fetched).
    """
    print(f"Scanning {url} for PDF links...")
    try:
        # 1. Get the main page to find all links
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"An error occurred fetching the main URL: {e}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')

    # Keep only links ending in .pdf; urljoin turns relative links into absolute URLs
    pdf_links = [urljoin(url, a&#91;'href'&#93;)
                 for a in soup.find_all('a', href=True)
                 if a&#91;'href'&#93;.lower().endswith('.pdf')]

    if not pdf_links:
        print("No PDF links found on the page.")
        return []

    print(f"Found {len(pdf_links)} PDF files. Now searching inside them...")
    return pdf_links</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> io</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> urllib.parse </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> urljoin</span></span>
<span class="line"></span>
<span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> requests</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> bs4 </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> BeautifulSoup</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> pypdf </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> PdfReader</span></span>
<span class="line"></span>
<span class="line"></span>
<span class="line"><span style="color: #616E88"># Scrape a website</span></span>
<span class="line"><span style="color: #81A1C1">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_linked_pdfs</span><span style="color: #D8DEE9FF">(url):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #A3BE8C">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    Scans a webpage and returns the PDF links found on it.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #A3BE8C">    Args:</span></span>
<span class="line"><span style="color: #A3BE8C">        url: The URL of the webpage to scan.</span></span>
<span class="line"></span>
<span class="line"><span style="color: #A3BE8C">    Returns:</span></span>
<span class="line"><span style="color: #A3BE8C">        A list of absolute PDF URLs (empty if none are found or the page cannot be fetched).</span></span>
<span class="line"><span style="color: #A3BE8C">    &quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #81A1C1">f</span><span style="color: #A3BE8C">&quot;Scanning {url} for PDF links...&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">try</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #616E88"># 1. Get the main page to find all links</span></span>
<span class="line"><span style="color: #D8DEE9FF">        response </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> requests.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(url, timeout</span><span style="color: #81A1C1">=</span><span style="color: #B48EAD">30</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        response.</span><span style="color: #88C0D0">raise_for_status</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">except</span><span style="color: #D8DEE9FF"> requests.exceptions.RequestException </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #81A1C1">f</span><span style="color: #A3BE8C">&quot;An error occurred fetching the main URL: {e}&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    soup </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">BeautifulSoup</span><span style="color: #D8DEE9FF">(response.text, </span><span style="color: #A3BE8C">&#39;html.parser&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #616E88"># Keep only links ending in .pdf; urljoin turns relative links into absolute URLs</span></span>
<span class="line"><span style="color: #D8DEE9FF">    pdf_links </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span><span style="color: #88C0D0">urljoin</span><span style="color: #D8DEE9FF">(url, a&#91;</span><span style="color: #A3BE8C">&#39;href&#39;</span><span style="color: #D8DEE9FF">&#93;)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                 </span><span style="color: #81A1C1">for</span><span style="color: #D8DEE9FF"> a </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> soup.</span><span style="color: #88C0D0">find_all</span><span style="color: #D8DEE9FF">(</span><span style="color: #A3BE8C">&#39;a&#39;</span><span style="color: #D8DEE9FF">, href</span><span style="color: #81A1C1">=True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                 </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> a&#91;</span><span style="color: #A3BE8C">&#39;href&#39;</span><span style="color: #D8DEE9FF">&#93;.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">().</span><span style="color: #88C0D0">endswith</span><span style="color: #D8DEE9FF">(</span><span style="color: #A3BE8C">&#39;.pdf&#39;</span><span style="color: #D8DEE9FF">)]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">not</span><span style="color: #D8DEE9FF"> pdf_links:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #A3BE8C">&quot;No PDF links found on the page.&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #81A1C1">f</span><span style="color: #A3BE8C">&quot;Found {len(pdf_links)} PDF files. Now searching inside them...&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> pdf_links</span></span></code></pre></div>



<p><strong>Handling URLs</strong>: The script builds a full, absolute URL for each PDF, because many links on a page are relative (e.g., <code>/path/to/file.pdf</code>). The link parsing relies on BeautifulSoup, which is available on PyPI: <a href="https://pypi.org/project/beautifulsoup4/" target="_blank" rel="noopener">https://pypi.org/project/beautifulsoup4/</a></p>
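<p>As a rough illustration of resolving relative links, here is a minimal sketch using the standard library's <code>urljoin</code>. The page URL and the hrefs are made-up examples, not taken from a real site.</p>

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, used only for illustration.
page_url = "https://example.com/reports/index.html"
hrefs = ["/files/annual.pdf", "q1.pdf", "https://cdn.example.com/guide.pdf"]

# urljoin resolves each href against the page URL;
# links that are already absolute pass through unchanged.
absolute_links = [urljoin(page_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

<p>Resolving against the full page URL, rather than only the scheme and host, also covers links such as <code>q1.pdf</code> that are relative to the current folder.</p>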



<p><strong>In-Memory Processing</strong>: Instead of saving each PDF to your disk, it uses <code>io.BytesIO</code> to treat the downloaded content as a file in your computer&#8217;s memory. This is faster and cleaner.</p>
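<p>To see what treating bytes as a file looks like in isolation, here is a small sketch. The byte string is a made-up stand-in for <code>pdf_response.content</code> and is not a complete PDF.</p>

```python
import io

# Stand-in for pdf_response.content: raw bytes of a downloaded file (made-up sample).
downloaded_bytes = b"%PDF-1.7\n% sample bytes, not a complete PDF\n"

# BytesIO wraps the bytes in a file-like object, so nothing is written to disk.
pdf_file = io.BytesIO(downloaded_bytes)

header = pdf_file.read(5)  # read the first five bytes, as you would from a file
pdf_file.seek(0)           # rewind so a reader can start from the beginning
print(header)
```

<p>Any library that accepts an open binary file, <code>PdfReader</code> included, can read from such a buffer.</p>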



<p><strong>Text Extraction</strong>: The <code>pypdf</code> library&#8217;s <code>PdfReader</code> opens this in-memory file. The script then loops through each page, calls <code>extract_text()</code>, and combines the text from all pages.</p>



<p><strong>Searching and Reporting</strong>: Finally, it performs a case-insensitive search on the extracted text and prints the URL of any PDF that contains your search term.</p>
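<p>The search step on its own can be sketched as follows. The helper name <code>contains_phrase</code> and the sample texts are hypothetical, chosen only to show the case-insensitive comparison.</p>

```python
def contains_phrase(full_text, search_string):
    """Return True if search_string occurs in full_text, ignoring case."""
    return search_string.lower() in full_text.lower()


# Hypothetical extracted text keyed by PDF URL (illustrative only).
extracted = {
    "https://example.com/a.pdf": "Annual Report 2024: Revenue grew.",
    "https://example.com/b.pdf": "Meeting minutes.",
}

# Collect the URLs whose text contains the phrase.
found_in_files = [
    url for url, text in extracted.items()
    if contains_phrase(text, "annual report")
]
print(found_in_files)
```

<p>Lowercasing both sides is a simple approach that matches what the script does.</p>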



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Search inside PDFs
def find_text_in_pdfs(pdf_links, search_string, base_url=""):

        # 2. Loop through each PDF link

        found_in_files = []
        for pdf_path in pdf_links:
            # Construct absolute URL if the link is relative
            if not pdf_path.startswith(('http://', 'https://')):
                pdf_url = f"{base_url}{pdf_path}"
            else:
                pdf_url = pdf_path

            try:
                # 3. Download the PDF content
                pdf_response = requests.get(pdf_url)
                pdf_response.raise_for_status()

                # Use an in-memory buffer to read the PDF
                pdf_file = io.BytesIO(pdf_response.content)
                reader = PdfReader(pdf_file)
                
                # 4. Extract text and search
                full_text = ""
                for page in reader.pages:
                    full_text += page.extract_text() or ""
                
                if search_string.lower() in full_text.lower():
                    print(f"✔️ Found '{search_string}' in: {pdf_url}")
                    found_in_files.append(pdf_url)

            except Exception as e:
                print(f"⚠️ Could not process {pdf_url}. Reason: {e}")
        
        if not found_in_files:
            print(f"\nSearch complete. The string '{search_string}' was not found in any of the PDFs.")

        return found_in_files
</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #616E88"># Search inside PDFs</span></span>
<span class="line"><span style="color: #81A1C1">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_text_in_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_links</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_string</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">base_url</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&quot;&quot;</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        # </span><span style="color: #B48EAD">2.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Loop</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">through</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">each</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">link</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">found_in_files</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> []</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> pdf_links</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            # </span><span style="color: #D8DEE9">Construct</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">absolute</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">URL</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">link</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">is</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">relative</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">not</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">startswith</span><span style="color: #D8DEE9FF">((</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">http://</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">https://</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)):</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">{base_url}{pdf_path}</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">else</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pdf_path</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #81A1C1">try</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #B48EAD">3.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Download</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">content</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">requests</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_response</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">raise_for_status</span><span style="color: #D8DEE9FF">()</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #D8DEE9">Use</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">an</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in-</span><span style="color: #D8DEE9">memory</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">buffer</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">read</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">PDF</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">pdf_file</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">io</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">BytesIO</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_response</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">content</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">reader</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">PdfReader</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_file</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span></span>
<span class="line"><span style="color: #D8DEE9FF">                # </span><span style="color: #B48EAD">4.</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Extract</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">full_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">page</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reader</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">pages</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #D8DEE9">full_text</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">+=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">page</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">extract_text</span><span style="color: #D8DEE9FF">() </span><span style="color: #D8DEE9">or</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_string</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">() </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">full_text</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">lower</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">✔️ Found &#39;{search_string}&#39; in: {pdf_url}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #D8DEE9">found_in_files</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #D8DEE9">except</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Exception</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">⚠️ Could not process {pdf_url}. Reason: {e}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">not</span><span style="color: #D8DEE9FF"> found_in_files</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Search complete. The string &#39;{search_string}&#39; was not found in any of the PDFs.</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">except</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">requests</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">exceptions</span><span style="color: #ECEFF4">.</span><span style="color: #D8DEE9">RequestException</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> e:</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">An error occurred fetching the main URL: {e}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span></code></pre></div>






<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>if __name__ == "__main__":
    target_url = "https://www.umcs.pl/pl/plany-zajec,10795.htm"
    search_term = "programming"
    pdf_links = find_linked_pdfs(target_url)
    find_text_in_pdfs(pdf_links, search_term)
    </textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">__name__</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">==</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">__main__</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">target_url</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">https://www.umcs.pl/pl/plany-zajec,10795.htm</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">search_term</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">programming</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">pdf_links</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">find_linked_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">target_url</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">find_text_in_pdfs</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">pdf_links</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">search_term</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span></span></code></pre></div>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="769" height="255" src="https://mietwood.com/wp-content/uploads/2025/08/image-8.jpg" alt="Scrape a Website and Search Inside PDFs with Python" class="wp-image-3293" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-8.jpg 769w, https://mietwood.com/wp-content/uploads/2025/08/image-8-300x99.jpg 300w" sizes="auto, (max-width: 769px) 100vw, 769px" /><figcaption class="wp-element-caption">Scrape a Website and Search Inside PDFs with Python</figcaption></figure>
</div>


<h3 class="wp-block-heading"><strong>Wrapping Up and Next Steps</strong></h3>



<p>Congratulations! You&#8217;ve built an automation script that bridges the gap between web scraping and document analysis. By combining <strong>Requests</strong>, <strong>BeautifulSoup</strong>, and <strong>pypdf</strong>, you can now programmatically find information that was previously locked away inside PDF files linked from a web page. This saves a great deal of manual searching and opens up new possibilities for data collection and analysis. Feel free to adapt the code for your own projects.</p>



<p>The applications for this technique extend far beyond a single use case. Imagine using this script for <strong>academic research</strong>, automatically scanning university archives for papers mentioning a specific topic. You could adapt it for <strong>financial analysis</strong> by pulling keywords from dozens of quarterly earnings reports, or for <strong>legal work</strong> by searching through court filings for a particular case name. Job seekers could even use it to scan company websites for PDF job descriptions that contain key skills. </p>



<p>To perform a statistical analysis of the overall economy, you can leverage a variety of online resources, including government and intergovernmental data portals, as well as academic publications. These sources often provide data in structured formats like CSVs and APIs, but also in less-structured formats like HTML tables and PDFs, which can be parsed using Python libraries like Beautiful Soup and pypdf.</p>



<h3 class="wp-block-heading"><strong>Government and Intergovernmental Data Sources</strong></h3>



<p>For raw, official economic data, these are your most reliable sources. They offer a wealth of information on everything from GDP and inflation to employment rates and international trade.</p>



<ul class="wp-block-list">
<li><strong>Federal Reserve Economic Data (FRED)</strong>: A fantastic resource from the St. Louis Fed, FRED offers over 800,000 economic time series from more than 100 sources. It&#8217;s a goldmine for anyone doing macroeconomic analysis.</li>



<li><strong>The World Bank Open Data</strong>: This portal provides comprehensive global development data, including indicators on economic policy, poverty, gender, and more, making it perfect for cross-country comparisons.</li>



<li><strong>Data.gov</strong>: The home of U.S. government open data, this site aggregates datasets from various federal agencies, including the Bureau of Economic Analysis (BEA) and the Bureau of Labor Statistics (BLS).</li>



<li><strong>United Nations Statistics Division (UNSD)</strong>: The UNSD offers a wide array of international statistics, including the UNdata portal which provides free access to over 60 million statistical records from various UN agencies.</li>



<li><strong>The Bureau of Economic Analysis (BEA)</strong>: The BEA produces some of the most critical U.S. economic statistics, such as GDP, personal income, and corporate profits.</li>
</ul>



<p>You can read about the business analyst career path <a href="https://mietwood.com/the-allure-of-business-analysis-as-a-career-path">here</a>.</p>



<p>The core principle remains the same: automate the discovery of information, no matter the format. Happy coding! 🚀</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/scrape-a-website-and-search-inside-pdfs-with-python">How to Scrape a Website and Search Inside PDFs with Python</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Google Ads Conversion and GA4 Revenue Difference</title>
		<link>https://mietwood.com/google-ads-conversion-and-ga4-revenue-difference</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Wed, 13 Aug 2025 12:54:53 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Customer Experience Management]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3257</guid>

					<description><![CDATA[<p>A gap between Google Ads conversion value and GA4 revenue is a common and frustrating issue for digital marketers. A difference between Google Ads Total Conversion Value and Google Analytics (GA4) CPC Revenue can really ruin your day. While a small discrepancy (10-20%) is often considered normal, a large difference signals a need for investigation. Why Google Ads Conversion...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/google-ads-conversion-and-ga4-revenue-difference">Google Ads Conversion and GA4 Revenue Difference</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>A gap between Google Ads conversion value and GA4 revenue is a common and frustrating issue for digital marketers. A difference between Google Ads Total Conversion Value and Google Analytics (GA4) CPC Revenue can really ruin your day. While a small discrepancy (10-20%) is often considered normal, a large difference signals a need for investigation. The sections below explain why the two numbers differ.</p>
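<p>To decide whether a gap falls inside that 10-20% band, it helps to compute it the same way every time. The snippet below is a minimal sketch: the function name <code>discrepancy_pct</code>, the choice of the Google Ads figure as the denominator, and the sample totals are all assumptions made for illustration, not values from either platform.</p>

```python
def discrepancy_pct(ads_value: float, ga4_value: float) -> float:
    """Relative gap between the platforms, as a percentage of the Google Ads figure."""
    if ads_value == 0:
        raise ValueError("Google Ads conversion value must be non-zero")
    return abs(ads_value - ga4_value) / ads_value * 100


# Hypothetical monthly totals, for illustration only.
ads_total = 12_000.0
ga4_total = 10_200.0

gap = discrepancy_pct(ads_total, ga4_total)
needs_review = gap > 20  # threshold taken from the 10-20% rule of thumb
print(f"Gap: {gap:.1f}% - investigate: {needs_review}")
```

<p>Using the GA4 figure as the denominator instead would give a slightly different percentage, so pick one convention and keep it across reports.</p>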



<h3 class="wp-block-heading">Attribution Models and Credit</h3>



<p><strong>Google Ads Attribution:</strong> By default, Google Ads uses a data-driven attribution model, which gives credit to various touchpoints along the conversion path. The default conversion window is 30 days, so Google Ads takes credit for a conversion as long as the user interacted with one of your ads within the previous 30 days. You can define a shorter lookback window.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1024" height="385" src="https://mietwood.com/wp-content/uploads/2025/08/image-1.jpg" alt="" class="wp-image-3258" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-1.jpg 1024w, https://mietwood.com/wp-content/uploads/2025/08/image-1-300x113.jpg 300w, https://mietwood.com/wp-content/uploads/2025/08/image-1-768x289.jpg 768w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Why Google Ads Conversion Value and GA4 CPC Revenue Difference &#8211; credit days</figcaption></figure>



<p><strong>GA4 Attribution:</strong> GA4&#8217;s default is also data-driven, but it&#8217;s &#8220;cross-channel.&#8221; This means it considers all marketing channels (organic search, direct, social, email, etc.) in the customer journey, not just Google Ads. If a user clicks a Google ad, then later comes back to your site via organic search and converts, GA4&#8217;s data-driven model will distribute credit to both channels, while Google Ads will likely claim all or most of the conversion value.</p>



<h3 class="wp-block-heading">Time-Based Reporting</h3>



<p>The way each platform logs a conversion can create a significant reporting gap, especially when analyzing recent data.</p>



<ul class="wp-block-list">
<li><strong>Google Ads:</strong> Attributes a conversion to the <strong>date of the ad click</strong> or impression.</li>



<li><strong>GA4:</strong> Attributes a conversion to the <strong>date of the actual transaction</strong> or conversion event.</li>
</ul>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="810" height="452" src="https://mietwood.com/wp-content/uploads/2025/08/image-3.jpg" alt="" class="wp-image-3260" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-3.jpg 810w, https://mietwood.com/wp-content/uploads/2025/08/image-3-300x167.jpg 300w, https://mietwood.com/wp-content/uploads/2025/08/image-3-768x429.jpg 768w" sizes="auto, (max-width: 810px) 100vw, 810px" /><figcaption class="wp-element-caption"><a href="https://www.youtube.com/watch?v=kJSxckE3E6k" target="_blank" rel="noopener">https://www.youtube.com/watch?v=kJSxckE3E6k</a></figcaption></figure>



<p>For example, if a user clicks a Google ad on September 20th but makes a purchase on October 5th, Google Ads will report the conversion value in September, while GA4 will report it in October. This &#8220;conversion lag&#8221; can cause large differences when comparing monthly or weekly reports.</p>
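<p>The effect of conversion lag is easy to reproduce with a few rows of data. The sketch below buckets the same conversions by click date (the Google Ads view) and by purchase date (the GA4 view). The sample rows and the helper <code>monthly_totals</code> are hypothetical and only illustrate the date logic described above.</p>

```python
from collections import defaultdict
from datetime import date

# Hypothetical conversions: (ad click date, purchase date, value).
conversions = [
    (date(2025, 9, 20), date(2025, 10, 5), 120.0),
    (date(2025, 9, 28), date(2025, 9, 29), 80.0),
    (date(2025, 10, 2), date(2025, 10, 2), 50.0),
]


def monthly_totals(rows, use_click_date):
    """Sum conversion value per (year, month), bucketed by click or purchase date."""
    totals = defaultdict(float)
    for click_date, purchase_date, value in rows:
        d = click_date if use_click_date else purchase_date
        totals[(d.year, d.month)] += value
    return dict(totals)


ads_view = monthly_totals(conversions, use_click_date=True)   # Google Ads style
ga4_view = monthly_totals(conversions, use_click_date=False)  # GA4 style
print("By click date:   ", ads_view)
print("By purchase date:", ga4_view)
```

<p>The grand total is identical in both views; only the month it lands in changes, which is why longer date ranges reduce the apparent gap.</p>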



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="816" height="418" src="https://mietwood.com/wp-content/uploads/2025/08/image-4.jpg" alt="Google Ads Conversion and GA4 Revenue Difference" class="wp-image-3261" srcset="https://mietwood.com/wp-content/uploads/2025/08/image-4.jpg 816w, https://mietwood.com/wp-content/uploads/2025/08/image-4-300x154.jpg 300w, https://mietwood.com/wp-content/uploads/2025/08/image-4-768x393.jpg 768w" sizes="auto, (max-width: 816px) 100vw, 816px" /><figcaption class="wp-element-caption">Google Ads Conversion and GA4 Revenue Difference</figcaption></figure>



<h3 class="wp-block-heading">Conversion Counting and Definitions</h3>



<p>How you set up conversion counting and conversion values can lead to different numbers. Conversion values for e-commerce purchases are especially important.</p>



<p><strong>Conversion Counting:</strong> In Google Ads, you can choose to count a conversion &#8220;once&#8221; or &#8220;every&#8221; time it happens. For example, if a user submits a form multiple times, you might want to count it only once. However, for a purchase, you&#8217;d want to count every transaction. If these settings are not aligned between Google Ads and GA4, your numbers will not match.</p>
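<p>A small sketch shows how the counting setting alone changes the total. It assumes a simplified model in which &#8220;one&#8221; means at most one conversion per ad click and &#8220;every&#8221; means all of them. The event list, the click IDs, and the function <code>count_conversions</code> are hypothetical.</p>

```python
# Hypothetical conversion events: (click_id, conversion_action).
events = [
    ("click-1", "form_submit"),
    ("click-1", "form_submit"),
    ("click-2", "form_submit"),
    ("click-3", "purchase"),
    ("click-3", "purchase"),
]


def count_conversions(rows, action, mode):
    """Count conversions for one action, either 'every' time or 'one' per click."""
    matching = [click_id for click_id, name in rows if name == action]
    if mode == "every":
        return len(matching)
    if mode == "one":
        return len(set(matching))  # duplicates from the same click collapse
    raise ValueError("mode must be 'every' or 'one'")


print("Forms, every:", count_conversions(events, "form_submit", "every"))
print("Forms, one:  ", count_conversions(events, "form_submit", "one"))
print("Purchases, every:", count_conversions(events, "purchase", "every"))
```

<p>If one platform effectively counts forms with &#8220;every&#8221; and the other with &#8220;one&#8221;, the same traffic produces two different conversion totals.</p>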



<p><strong>Conversion Actions:</strong> If you have different conversion actions set up in Google Ads and GA4, or if they are not correctly linked, you will see a discrepancy. For example, if you track a &#8220;purchase&#8221; in GA4 but have a different conversion action for &#8220;leads&#8221; in Google Ads, the numbers will naturally be different.</p>



<h3 class="wp-block-heading">Technical and User-Based Factors</h3>



<p>These are often smaller but can add up to a substantial difference.</p>



<ul class="wp-block-list">
<li><strong>Ad Blockers and User Consent:</strong> Some ad blockers and privacy settings can prevent GA4&#8217;s tracking code from firing, meaning a session and conversion might not be recorded in GA4. However, the Google Ads conversion tag is often less affected, so Google Ads may still report the conversion. Similarly, if a user opts out of tracking via a consent banner, GA4 may not receive the data.</li>



<li><strong>Quick Exits:</strong> A user may click a Google ad, be charged for the click, and then hit the back button before the GA4 tracking tag has a chance to load. Google Ads will count the click, but GA4 won&#8217;t record a session, leading to a discrepancy between clicks and sessions, and ultimately, conversions.</li>



<li><strong>Cross-Device Conversions:</strong> Google Ads uses modeled conversions to account for users who start their journey on one device and finish it on another. This can lead to a higher conversion count in Google Ads than in GA4, especially if Google Signals is not enabled in GA4.</li>



<li><strong>View-Through Conversions:</strong> Google Ads counts &#8220;view-through conversions,&#8221; which are conversions that happen after a user sees a display ad but doesn&#8217;t click on it. GA4 does not track these by default, which can cause Google Ads to report more conversions and higher conversion value.</li>
</ul>



<h3 class="wp-block-heading">What to Do About It</h3>



<p><strong>Align Your Attribution Models:</strong> Consider using the same attribution model in both platforms for a more direct comparison. While Google Ads defaults to data-driven, you can use the &#8220;Model Comparison&#8221; tool in GA4 to see how your data would look under a different model, like &#8220;Last Click.&#8221;</p>



<p>The Model Comparison report, also referred to as the Attribution models report, in Google Analytics 4 (GA4) is a tool that allows you to compare how different attribution models distribute credit for conversions. It helps you understand how various marketing channels, like paid search, social media, and organic search, contribute to a user&#8217;s conversion path. This is a crucial report for understanding the true value of your marketing efforts beyond just the last touchpoint.</p>



<p>The report lets you select a conversion event and then view how its value would be allocated to different channels under two different attribution models. You can then see a percentage change, which highlights which channels are being over or undervalued depending on the model you use.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Key Attribution Models to Compare</h3>



<p>GA4 offers a few attribution models to choose from, with <strong>Data-driven</strong> being the default. Comparing this with other models is where the tool&#8217;s value really shines.</p>



<ul class="wp-block-list">
<li><strong>Data-driven:</strong> This model uses machine learning to analyze all the touchpoints in a user&#8217;s journey, including both converting and non-converting paths. It then assigns credit based on the actual impact of each touchpoint. It&#8217;s considered the most accurate and sophisticated model.</li>



<li><strong>Paid and organic last click:</strong> This is a rule-based model that gives 100% of the conversion credit to the last channel a user clicked on before converting, ignoring any direct traffic.</li>



<li><strong>Google paid channels last click:</strong> This model gives 100% of the credit to the last Google Ads click before a conversion. If there was no Google Ads click, it defaults to the &#8220;Paid and organic last click&#8221; model.</li>
</ul>
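<p>The two rule-based models can be sketched in a few lines. This is a simplified illustration of the rules as described above, not GA4&#8217;s actual implementation. The channel labels (<code>google_ads</code>, <code>organic_search</code>, <code>direct</code>, <code>email</code>) and both function names are assumptions. The data-driven model is left out because it relies on machine learning that cannot be reproduced here.</p>

```python
def paid_and_organic_last_click(path):
    """Return the channel that gets 100% credit: the last non-direct touchpoint."""
    non_direct = [channel for channel in path if channel != "direct"]
    # If every touchpoint is direct, direct keeps the credit.
    return non_direct[-1] if non_direct else "direct"


def google_paid_last_click(path):
    """Credit Google Ads if it appears in the path; otherwise fall back."""
    if "google_ads" in path:
        return "google_ads"
    return paid_and_organic_last_click(path)


# Hypothetical conversion path, oldest touchpoint first.
path = ["google_ads", "organic_search", "direct"]
print(paid_and_organic_last_click(path))
print(google_paid_last_click(path))
```

<p>For this path the first model credits organic search while the second credits Google Ads, which is the same kind of disagreement that shows up between the two platforms.</p>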



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading">Why is this important?</h3>



<p>Understanding how different attribution models impact your data is critical for making smart business decisions. For instance, a <strong>&#8220;last click&#8221;</strong> model might make your paid search campaigns look like they&#8217;re driving all your conversions, leading you to undervalue channels like social media or display ads that introduce users to your brand much earlier in their journey. By using the Model Comparison report, you can see how a channel&#8217;s value changes when you shift from a last-click to a data-driven model, which can lead to better budgeting and optimization.</p>






<p><strong>Check Your Conversion Settings:</strong> Ensure that your conversion actions are correctly set up and linked, and that the counting method is consistent across both platforms.</p>



<p><strong>Check Your Date Ranges:</strong> When comparing, make sure you&#8217;re using a long enough date range (e.g., a full month or longer) to account for any conversion lag.</p>



<p><strong>Enable Auto-Tagging:</strong> Make sure auto-tagging is enabled in your Google Ads account so that GA4 can properly attribute traffic back to your campaigns.</p>



<p><strong>Don&#8217;t Panic:</strong> It&#8217;s normal to have some level of discrepancy. The key is to understand <em>why</em> the differences exist and use both platforms for what they&#8217;re best at: Google Ads for optimizing your paid campaigns and GA4 for understanding the full, multi-channel customer journey.</p>



<p>Read more <a href="https://mietwood.com/e-commerce-manager-dashboard">here</a></p>
<p>The post <a rel="nofollow" href="https://mietwood.com/google-ads-conversion-and-ga4-revenue-difference">Google Ads Conversion and GA4 Revenue Difference</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Python for analysts most important datetime functions</title>
		<link>https://mietwood.com/python-for-analysts-most-important-datetime-functions</link>
					<comments>https://mietwood.com/python-for-analysts-most-important-datetime-functions#comments</comments>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sun, 20 Jul 2025 16:18:38 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3211</guid>

					<description><![CDATA[<p>Python’s powerful date and time functions in the datetime and pandas libraries give you a robust date table ready for Power BI and other business intelligence and analytical tools. Mastering Date and Time Functions in Python for Power BI Date Tables When working with Power BI, a well-structured...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/python-for-analysts-most-important-datetime-functions">Python for analysts most important datetime functions</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Python’s powerful <strong>date and time functions</strong> in the <code>datetime</code> and <code>pandas</code> libraries give you a robust date table ready for Power BI and other business intelligence and analytical tools.</p>



<h2 class="wp-block-heading" id="masteringdateandtimefunctionsinpythonforpowerbidatetables">Mastering Date and Time Functions in Python for Power BI Date Tables</h2>



<p>When working with Power BI, a well-structured <strong>Date Table</strong> is essential for time intelligence calculations like YTD, QTD, MTD, and custom period comparisons. While Power BI has built-in date table features, using <strong>Python</strong> to generate a custom date table gives you full control over the structure, granularity, and logic.</p>



<p>In this post, we’ll explore Python’s powerful <strong>date and time functions</strong> using the <code>datetime</code> and <code>pandas</code> libraries, and show how to create a robust date table ready for Power BI.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="1pythondateandtimebasics">Python for Analysts: Date and Time Basics</h2>



<p>Python provides the <a href="https://docs.python.org/3/library/datetime.html" target="_blank" rel="noopener"><code>datetime</code> module</a> to work with dates and times. Here&#8217;s a quick overview:</p>



<pre class="wp-block-code"><code>from datetime import datetime, timedelta, date

# Current date and time
now = datetime.now()
print("Now:", now)

# Just the date
today = date.today()
print("Today:", today)

# Add 7 days
next_week = today + timedelta(days=7)
print("Next week:", next_week)

# Subtract 30 days
last_month = today - timedelta(days=30)
print("30 days ago:", last_month)
</code></pre>



<p>These functions are the foundation for generating date ranges and calculating custom columns like fiscal periods or holidays.</p>
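<p>To see how <code>timedelta</code> leads to a date range, here is a minimal sketch that builds one using only the standard library. The helper name <code>daily_range</code> is hypothetical.</p>

```python
from datetime import date, timedelta


def daily_range(start, end):
    """Return every date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


first_week = daily_range(date(2025, 1, 1), date(2025, 1, 7))
print(len(first_week), first_week[0], first_week[-1])
```

<p>This works for small ranges, but a DataFrame-based approach is more convenient once you need extra columns per day.</p>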



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="2creatingadaterangewithpandas">Creating a Date Range with Pandas</h2>



<p>To build a date table, we need a continuous range of dates. <code>pandas.date_range()</code> is perfect for this:</p>



<pre class="wp-block-code"><code>import pandas as pd

# Generate a date range from 2020 to 2030
date_range = pd.date_range(start='2020-01-01', end='2030-12-31', freq='D')
df = pd.DataFrame({'Date': date_range})
</code></pre>



<p>This gives us a DataFrame with one row per day — the backbone of our date table.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="3enrichingthedatetable">Enriching the Date Table</h2>



<p>Now let’s add useful columns for Power BI:</p>



<pre class="wp-block-code"><code>df&#91;'Year'] = df&#91;'Date'].dt.year
df&#91;'Month'] = df&#91;'Date'].dt.month
df&#91;'MonthName'] = df&#91;'Date'].dt.strftime('%B')
df&#91;'Quarter'] = df&#91;'Date'].dt.quarter
df&#91;'Day'] = df&#91;'Date'].dt.day
df&#91;'Weekday'] = df&#91;'Date'].dt.weekday + 1  # Monday = 1
df&#91;'WeekdayName'] = df&#91;'Date'].dt.strftime('%A')
df&#91;'IsWeekend'] = df&#91;'Weekday'].isin(&#91;6, 7])
df&#91;'Week'] = df&#91;'Date'].dt.isocalendar().week
df&#91;'DayOfYear'] = df&#91;'Date'].dt.dayofyear
</code></pre>



<p>These columns allow for slicing and dicing your data in Power BI by year, month, weekday, and more.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="4fiscalcalendarsupport">Fiscal Calendar Support</h2>



<p>Many businesses use fiscal calendars that don’t align with the calendar year. Here’s how to add a fiscal year starting in July:</p>



<pre class="wp-block-code"><code>df&#91;'FiscalYear'] = df&#91;'Date'].apply(lambda x: x.year if x.month &lt; 7 else x.year + 1)
df&#91;'FiscalMonth'] = df&#91;'Date'].apply(lambda x: x.month - 6 if x.month &gt;= 7 else x.month + 6)
df&#91;'FiscalQuarter'] = ((df&#91;'FiscalMonth'] - 1) // 3) + 1
</code></pre>



<p>This logic adjusts the fiscal year, month, and quarter based on a July start.</p>
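<p>If your fiscal year starts in a month other than July, the same logic can be generalized. The function below is a sketch under two assumptions: the name <code>fiscal_period</code> and its <code>start_month</code> parameter are hypothetical, and the fiscal year is labeled by the calendar year in which it ends, matching the July example above.</p>

```python
from datetime import date


def fiscal_period(d, start_month=7):
    """Return (fiscal_year, fiscal_month, fiscal_quarter) for a given start month."""
    # Shift months so that start_month becomes fiscal month 1.
    fiscal_month = (d.month - start_month) % 12 + 1
    # Dates on or after the start month belong to the fiscal year ending next year,
    # unless the fiscal year simply equals the calendar year (start_month == 1).
    if start_month > 1 and d.month >= start_month:
        fiscal_year = d.year + 1
    else:
        fiscal_year = d.year
    fiscal_quarter = (fiscal_month - 1) // 3 + 1
    return fiscal_year, fiscal_month, fiscal_quarter


print(fiscal_period(date(2025, 7, 1)))                 # July start
print(fiscal_period(date(2025, 3, 31), start_month=4)) # April start
```

<p>Because pandas timestamps also expose <code>.year</code> and <code>.month</code>, the function can be used with <code>apply</code> on the <code>Date</code> column to fill the three fiscal columns.</p>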



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="5flagsfortimeintelligence">Flags for Time Intelligence</h2>



<p>Power BI benefits from flags that simplify DAX calculations:</p>



<pre class="wp-block-code"><code>today = pd.to_datetime('today').normalize()

df&#91;'IsToday'] = df&#91;'Date'] == today
df&#91;'IsCurrentMonth'] = (df&#91;'Date'].dt.month == today.month) &amp; (df&#91;'Date'].dt.year == today.year)
df&#91;'IsCurrentYear'] = df&#91;'Date'].dt.year == today.year
</code></pre>



<p>You can also add flags for holidays, fiscal periods, or custom business logic.</p>
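<p>A holiday flag can be as simple as a set lookup. The sketch below is a minimal illustration; the <code>HOLIDAYS</code> set is a hypothetical placeholder with two sample dates, and the helper names are assumptions. In practice the set would be loaded from a CSV file or another source you trust.</p>

```python
from datetime import date

# Hypothetical holiday list for illustration only; replace with real data.
HOLIDAYS = {date(2025, 1, 1), date(2025, 12, 25)}


def is_holiday(d: date) -> bool:
    """True if the date is in the holiday set."""
    return d in HOLIDAYS


def is_business_day(d: date) -> bool:
    """True for Monday-Friday dates that are not holidays."""
    # date.weekday(): Monday = 0 ... Sunday = 6
    return d.weekday() < 5 and not is_holiday(d)


for sample in (date(2025, 7, 1), date(2025, 7, 5), date(2025, 12, 25)):
    print(sample, is_holiday(sample), is_business_day(sample))
```

<p>For the date table, the same functions could be mapped over the dates of the <code>Date</code> column to produce <code>IsHoliday</code> and <code>IsBusinessDay</code> columns.</p>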



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="6exportingtocsvforpowerbi">Exporting to CSV for Power BI</h2>



<p>Once your date table is ready, export it:</p>



<pre class="wp-block-code"><code>df.to_csv('DateTable.csv', index=False)
</code></pre>



<p>You can now import this CSV into Power BI as a static date table.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="7sampleoutput">Sample Output</h2>



<p>Here’s a preview of what your date table might look like:</p>



<figure class="wp-block-table aligncenter is-style-regular has-small-font-size"><table><thead><tr><th>Date</th><th>Year</th><th class="has-text-align-center" data-align="center">Month</th><th>MonthName</th><th>Quarter</th><th>WeekdayName</th><th>IsWeekend</th><th>FiscalYear</th><th>IsToday</th></tr></thead><tbody><tr><td>2025-01-01</td><td>2025</td><td class="has-text-align-center" data-align="center">1</td><td>January</td><td>1</td><td>Wednesday</td><td>False</td><td>2025</td><td>False</td></tr><tr><td>2025-07-01</td><td>2025</td><td class="has-text-align-center" data-align="center">7</td><td>July</td><td>3</td><td>Tuesday</td><td>False</td><td>2026</td><td>False</td></tr><tr><td>2025-12-25</td><td>2025</td><td class="has-text-align-center" data-align="center">12</td><td>December</td><td>4</td><td>Thursday</td><td>False</td><td>2026</td><td>False</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Pandas to_datetime() function</h2>



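<p><code>pd.to_datetime()</code> converts text (<code>object</code>) columns to <code>datetime64</code>. The example below shows the column types before and after converting two purchase-date columns.</p>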



<pre class="wp-block-code"><code> #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   CustomerId              13483 non-null  int64         
 1   Dt_first_rew_income     2987 non-null   datetime64&#91;ns]
 2   Dt_first_purchase       13483 non-null  object        
 3   Dt_last_purchase        13483 non-null  object        

# Convert the text columns to datetime; format takes strftime directives (%Y-%m-%d)
df_cust&#91;'Dt_first_purchase'] = pd.to_datetime(df_cust&#91;'Dt_first_purchase'], format="%Y-%m-%d")
df_cust&#91;'Dt_last_purchase'] = pd.to_datetime(df_cust&#91;'Dt_last_purchase'], format="%Y-%m-%d")

 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   CustomerId              13483 non-null  int64         
 1   Dt_first_rew_income     2987 non-null   datetime64&#91;ns]
 2   Dt_first_purchase       13483 non-null  datetime64&#91;ns]
 3   Dt_last_purchase        13483 non-null  datetime64&#91;ns]</code></pre>
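<p>The <code>format</code> argument of <code>pd.to_datetime()</code> takes strftime-style directives such as <code>%Y-%m-%d</code>, the same notation used by Python's built-in <code>datetime.strptime</code>. The short sketch below uses only the standard library to show the difference between a directive string and a pattern such as <code>yyyy-mm-dd</code>; the variable names are illustrative.</p>

```python
from datetime import datetime

raw = "2025-07-01"

# %Y = four-digit year, %m = zero-padded month, %d = zero-padded day
parsed = datetime.strptime(raw, "%Y-%m-%d")
print(parsed)

# A pattern such as "yyyy-mm-dd" contains no directives, so it is treated
# as literal text and does not match the input.
try:
    datetime.strptime(raw, "yyyy-mm-dd")
    literal_pattern_ok = True
except ValueError:
    literal_pattern_ok = False

print(literal_pattern_ok)
```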



<h2 class="wp-block-heading" id="8advancedtips">Advanced Tips</h2>



<ul class="wp-block-list">
<li><strong>Holidays</strong>: Use external APIs or CSVs to mark public holidays.</li>



<li><strong>Week Start</strong>: Adjust <code>Weekday</code> to match your locale (e.g., Monday vs. Sunday).</li>



<li><strong>Time Zones</strong>: Use <code>pytz</code> or <code>zoneinfo</code> for timezone-aware datetime handling.</li>



<li><strong>Dynamic Updates</strong>: Automate the script to regenerate the table monthly or yearly.</li>
</ul>
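<p>As an illustration of the week-start tip, the sketch below renumbers weekdays for a chosen first day of the week. The function name <code>weekday_number</code> and the <code>week_start</code> parameter are assumptions; it relies only on the standard library convention that <code>date.weekday()</code> returns 0 for Monday and 6 for Sunday.</p>

```python
from datetime import date


def weekday_number(d: date, week_start: int = 0) -> int:
    """Return 1..7, where 1 is the first day of the week.

    week_start follows Python's convention: 0 = Monday ... 6 = Sunday.
    """
    return (d.weekday() - week_start) % 7 + 1


wednesday = date(2025, 1, 1)
print(weekday_number(wednesday))                # week starting on Monday
print(weekday_number(wednesday, week_start=6))  # week starting on Sunday
```

<p>The same arithmetic should carry over to the <code>Weekday</code> column of the date table by subtracting the chosen start from the pandas weekday values before taking the remainder.</p>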



<hr class="wp-block-separator has-alpha-channel-opacity"/>






<h2 class="wp-block-heading" id="sqldateandtimefunctionsforpowerbidatetables">SQL Date and Time Functions for Power BI Date Tables</h2>



<p>When building reports in Power BI, a comprehensive <strong>Date Table</strong> is essential for enabling time-based calculations like YTD, MTD, and custom period comparisons. While Power BI can auto-generate a date table, using <strong>SQL</strong> to create one gives you full control over its structure and logic.</p>



<p>Let’s explore key <strong>SQL Server date and time functions</strong> and how to use them to build a robust date table.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="1generatingadaterange">1. Generating a Date Range</h2>



<p>To create a date table, you need a continuous range of dates. In SQL Server, you can use a loop or a recursive CTE:</p>



<pre class="wp-block-code"><code>DECLARE @StartDate DATE = '2020-01-01';
DECLARE @EndDate DATE = '2030-12-31';

WITH DateCTE AS (
    SELECT @StartDate AS DateValue
    UNION ALL
    SELECT DATEADD(DAY, 1, DateValue)
    FROM DateCTE
    WHERE DateValue &lt; @EndDate
)
SELECT * INTO DateTable FROM DateCTE
OPTION (MAXRECURSION 32767);

SELECT * FROM DateTable;
</code></pre>



<p>Here is an example of the result:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="440" height="575" src="https://mietwood.com/wp-content/uploads/2025/07/image-17.jpg" alt="Python for analysts most important datetime functions in sql" class="wp-image-3213" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-17.jpg 440w, https://mietwood.com/wp-content/uploads/2025/07/image-17-230x300.jpg 230w" sizes="auto, (max-width: 440px) 100vw, 440px" /><figcaption class="wp-element-caption">Python for analysts most important datetime functions in sql</figcaption></figure>



<p>This creates a table with one row per day from 2020-01-01 through 2030-12-31.</p>
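<p>A recursive CTE uses one level of recursion per generated row. SQL Server stops at 100 levels by default, and 32767 is the largest explicit value <code>MAXRECURSION</code> accepts, so it is worth checking how many rows a range needs. The sketch below does that check in Python with the standard library; the helper name <code>date_range</code> is an assumption.</p>

```python
from datetime import date, timedelta


def date_range(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


dates = list(date_range(date(2020, 1, 1), date(2030, 12, 31)))
row_count = len(dates)

# The anchor row is free; each further row costs one level of recursion.
recursion_levels = row_count - 1
print(row_count, recursion_levels <= 32767)
```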



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="2addingdateattributes">2. Adding Date Attributes</h2>



<p>Once you have the base dates, enrich them with useful columns:</p>



<pre class="wp-block-code"><code>ALTER TABLE DateTable ADD 
    Year INT,
    Month INT,
    MonthName VARCHAR(20),
    Quarter INT,
    Weekday INT,
    WeekdayName VARCHAR(20);

UPDATE DateTable
SET 
    Year = YEAR(DateValue),
    Month = MONTH(DateValue),
    MonthName = DATENAME(MONTH, DateValue),
    Quarter = DATEPART(QUARTER, DateValue),
    Weekday = DATEPART(WEEKDAY, DateValue),  -- numbering depends on SET DATEFIRST
    WeekdayName = DATENAME(WEEKDAY, DateValue);
</code></pre>



<p>These columns allow for flexible filtering and grouping in Power BI.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="3fiscalcalendarandflags">3. Fiscal Calendar and Flags</h2>



<p>You can also add fiscal logic and flags:</p>



<pre class="wp-block-code"><code>ALTER TABLE DateTable ADD FiscalYear INT;

UPDATE DateTable
SET FiscalYear = CASE 
    WHEN MONTH(DateValue) &gt;= 7 THEN YEAR(DateValue) + 1
    ELSE YEAR(DateValue)
END;
</code></pre>



<p>Add flags like <code>IsWeekend</code>, <code>IsToday</code>, or <code>IsCurrentMonth</code> to simplify DAX expressions.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>






<h2 class="wp-block-heading" id="conclusion">Conclusion</h2>



<p>Python offers a flexible and powerful way to create a <strong>custom date table</strong> for Power BI. With just a few lines of code, you can generate a rich dataset that supports advanced time intelligence and reporting needs.</p>



<p>Whether you&#8217;re working with fiscal calendars, custom flags, or multilingual support, Python gives you the tools to tailor your date table exactly to your business requirements.</p>



<p>SQL’s date and time functions like <code>DATEADD</code>, <code>DATEPART</code>, <code>DATENAME</code>, and <code>YEAR</code> are powerful tools for building a custom date table. Once created, export it to Power BI or use it as a view for dynamic reporting.</p>






<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Find more resources in our course <a href="https://mietwood.com/programowanie-zaawansowane-w-analityce">Advanced programming for business analysts</a>.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/python-for-analysts-most-important-datetime-functions">Python for analysts most important datetime functions</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://mietwood.com/python-for-analysts-most-important-datetime-functions/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Biłgoraj &#8211; Zalew Bojary &#8211; 100 km swim</title>
		<link>https://mietwood.com/bilgoraj-zalew-bojary-plywanie-100-km</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Sun, 13 Jul 2025 13:56:53 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3173</guid>

					<description><![CDATA[<p>Damian Błaszczyk will swim in Zalew Bojary (the Bojary Reservoir) with the goal of covering 100 km. We have prepared a math problem showing what such a swim could look like and how the 100 km can be covered. The solution appears later in the article. Purpose of the event: the Sports and Recreation Centre in Biłgoraj &#8220;Na Fali&#8221; has the honor of inviting you to a unique event,...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/bilgoraj-zalew-bojary-plywanie-100-km">Biłgoraj &#8211; Zalew Bojary &#8211; 100 km swim</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Damian Błaszczyk will swim in Zalew Bojary (the Bojary Reservoir) with the goal of covering 100 km. We have prepared a math problem showing what such a swim could look like and how the 100 km can be covered. The solution appears later in the article.</p>



<p><strong>Purpose of the event</strong></p>



<p><strong>The Sports and Recreation Centre in Biłgoraj &#8220;Na Fali&#8221; has the honor of inviting you to a unique event that will, for the second time, unite us around a noble cause! On 25-27 July 2025, Zalew Bojary in Biłgoraj will become the arena for the second edition of the campaign &#8220;Na Fali Nadziei – Przekraczając Granice&#8221; (On the Wave of Hope: Crossing Borders)!</strong> &#8211; <a href="https://osir.lbl.pl/aktualnosci/2025/3428" target="_blank" rel="noopener">read more here</a></p>



<p>This year we are supporting <strong>Adaś Iwanejko</strong>, a 12-year-old from Wola Mała who is bravely fighting Duchenne muscular dystrophy, a fatal disease. His only chance of recovery is a costly gene therapy in the USA, with an estimated cost of more than 16 million złoty. We believe that by mobilizing together we can work wonders and help Adaś fulfill his dream of a normal life!</p>



<h2 class="wp-block-heading"><strong>Task</strong> 1. Biłgoraj &#8211; Zalew Bojary &#8211; 100 km swim</h2>



<p>The swimmer Damian Błaszczyk is to cover a distance of 100 km by swimming in a circle, tethered by a 100 m rope to a pole of negligible thickness. The rope rotates around the pole and stays 100 m long at all times. This is feasible on Zalew Bojary because the reservoir is about 230 m across. We therefore assume that the swimmer circles the reservoir until he has covered D = 100 km. The question is: how many laps must the swimmer complete?</p>



<p>The calculations are shown in the table below. The swimmer completes 8 rounds of 10 laps to the right (80 laps) and 8 rounds of 10 laps to the left (80 laps), 160 laps in total. Each lap covers d = 2πr ≈ 628 m, so one round of 10 laps is 6,280 m and 8 rounds are 50,240 m. Swimming 100 km therefore takes 16 rounds, that is, 160 laps of the reservoir. At a pace of 4 km/h this takes about 25 hours. The press release mentions that the swimmer intends to swim for 48 hours, so covering 100 km in that time is realistic.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="755" height="462" src="https://mietwood.com/wp-content/uploads/2025/07/image-8.jpg" alt="" class="wp-image-3178" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-8.jpg 755w, https://mietwood.com/wp-content/uploads/2025/07/image-8-300x184.jpg 300w" sizes="auto, (max-width: 755px) 100vw, 755px" /></figure>



<div class="wp-block-kadence-accordion alignnone"><div class="kt-accordion-wrap kt-accordion-id3173_02a6a7-f3 kt-accordion-has-2-panes kt-active-pane-0 kt-accordion-block kt-pane-header-alignment-left kt-accodion-icon-style-basic kt-accodion-icon-side-right" style="max-width:none"><div class="kt-accordion-inner-wrap" data-allow-multiple-open="false" data-start-open="0">
<div class="wp-block-kadence-pane kt-accordion-pane kt-accordion-pane-1 kt-pane3173_d3b4d2-ba"><div class="kt-accordion-header-wrap"><button class="kt-blocks-accordion-header kt-acccordion-button-label-show" type="button"><span class="kt-blocks-accordion-title-wrap"><span class="kt-blocks-accordion-title">How do you calculate the time needed to cover 100 km when swimming at 4 km/h?</span></span><span class="kt-blocks-accordion-icon-trigger"></span></button></div><div class="kt-accordion-panel kt-accordion-panel-hidden"><div class="kt-accordion-panel-inner">
<p>The time needed to cover 100 km follows from the formula t = d/v, where d is the distance (100 km) and v is the speed (4 km/h). Swimming 100 km at 4 km/h takes about 25 hours. "About" means that the swimmer's speed may be slightly higher or lower, in which case the time changes accordingly.</p>
</div></div></div>
</div></div></div>
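<p>The arithmetic of Task 1 can be checked with a few lines of Python. This is a minimal sketch using the numbers from the task (100 m rope, 100 km target, 4 km/h); the function names are illustrative.</p>

```python
import math


def laps_needed(target_m: float, rope_m: float) -> int:
    """Whole laps of radius rope_m needed to cover at least target_m."""
    lap_m = 2 * math.pi * rope_m      # circumference of one lap
    return math.ceil(target_m / lap_m)


def swim_time_hours(distance_km: float, speed_kmh: float) -> float:
    """t = d / v"""
    return distance_km / speed_kmh


laps = laps_needed(100_000, 100)
hours = swim_time_hours(100, 4)
print(f"{laps} laps, {hours:.0f} hours")
```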



<h2 class="wp-block-heading">Task 2</h2>



<p>Now we assume that the pole is 20 cm in diameter, which is more realistic. The rope winds onto the pole so that after every 10 laps it moves onto a new layer, which increases the effective diameter of the pole. The question is: how many laps must the swimmer complete to cover 100 km?</p>



<p>Solution:</p>



<ul class="wp-block-list">
<li>Initial rope length = initial radius of the circle = 100 m.</li>



<li>Rope diameter = 1 cm (rl = 0.01 m).</li>



<li>Pole diameter = 20 cm (pole radius r0 = 10 cm = 0.1 m).</li>
</ul>



<p>Stage 1. On the first lap the swimmer covers d = 2πr ≈ 628 m. During that lap the rope winds once around the pole, so its length decreases by d0 = 2π · r0 = 0.628 m. The second lap is therefore swum at a radius of 100 m − 0.628 m = 99.372 m and is about 624 m long, and so on up to the 10th lap. In laps 1-10 the rope thus shortens by d0 = 2π · r0 = 6.28 · 0.1 m = 0.628 m per lap, 6.28 m in total. After 10 laps the rope moves onto the top of the previously wound layer, which increases the radius of the pole by the thickness of the rope, 1 cm (0.01 m). In laps 11-20 the rope therefore shortens by d2 = 2π · (r0 + rl) = 6.28 · 0.11 m = 0.69 m per lap.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="926" height="721" src="https://mietwood.com/wp-content/uploads/2025/07/image-5.jpg" alt="Biłgoraj - Zalew Bojary - pływanie 100 km" class="wp-image-3175" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-5.jpg 926w, https://mietwood.com/wp-content/uploads/2025/07/image-5-300x234.jpg 300w, https://mietwood.com/wp-content/uploads/2025/07/image-5-768x598.jpg 768w" sizes="auto, (max-width: 926px) 100vw, 926px" /></figure>



<p>The full calculation is shown in the table below. The swimmer swims on an ever shorter rope, so eventually the whole rope is wound onto the pole. After how many laps does that happen, and what distance has the swimmer covered by then?</p>



<p>If the first 10 laps shorten the rope by 6.28 m, the next 10 laps by 6.91 m, and so on (see the fourth column, Ubytek*10, the loss per 10 laps), then 100 m of rope is enough for 10 rounds (stages) of 10 laps plus 6 laps of an 11th round, 106 laps in total. This gives a distance of 37,655 m. If the swimmer swam to the right, he should now swim to the left, which gives another 37,655 m. That is still short of 100 km: another 4 rounds and 8 laps are needed. To sum up, the swimmer completes 106 laps to the right, 106 laps to the left, and then another 48 laps to the right, 260 laps in total. Because the radius of the circle changes, the distance covered changes with every lap, and reaching 100 km takes 260 laps, 100 more than in the previous method.</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="796" height="685" src="https://mietwood.com/wp-content/uploads/2025/07/image-6.jpg" alt="" class="wp-image-3176" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-6.jpg 796w, https://mietwood.com/wp-content/uploads/2025/07/image-6-300x258.jpg 300w, https://mietwood.com/wp-content/uploads/2025/07/image-6-768x661.jpg 768w" sizes="auto, (max-width: 796px) 100vw, 796px" /></figure>



<div class="wp-block-kadence-accordion alignnone"><div class="kt-accordion-wrap kt-accordion-id3173_7c2f57-98 kt-accordion-has-2-panes kt-active-pane-0 kt-accordion-block kt-pane-header-alignment-left kt-accodion-icon-style-basic kt-accodion-icon-side-right" style="max-width:none"><div class="kt-accordion-inner-wrap" data-allow-multiple-open="false" data-start-open="0">
<div class="wp-block-kadence-pane kt-accordion-pane kt-accordion-pane-1 kt-pane3173_aebbfa-ed"><div class="kt-accordion-header-wrap"><button class="kt-blocks-accordion-header kt-acccordion-button-label-show" type="button"><span class="kt-blocks-accordion-title-wrap"><span class="kt-blocks-accordion-title">How do you calculate the length of a lap around a circle of radius R when R changes as the rope winds onto a circle (pole) of radius r?</span></span><span class="kt-blocks-accordion-icon-trigger"></span></button></div><div class="kt-accordion-panel kt-accordion-panel-hidden"><div class="kt-accordion-panel-inner">
<ul class="wp-block-list">
<li>The length D of one lap around a circle is D = 2πR, where R is the radius of the circle (100 m in our case) and 2π ≈ 6.28. One lap around a circle of radius 100 m is 628 m.</li>



<li>If the rope winds onto a pole of radius r = 0.1 m, it shortens by 2πr = 0.628 m during one lap.</li>
</ul>
</div></div></div>
</div></div></div>
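<p>The winding process of Task 2 can also be simulated lap by lap. The sketch below is one possible model, not the spreadsheet behind the table: it assumes each lap is swum at the rope length available at the start of that lap, counts only full laps, and starts a new layer after every 10 laps. The function and parameter names are illustrative. Under these assumptions the result lands close to the table's 106 laps and 37,655 m, though not necessarily on the same figures, because the outcome depends on how partial laps and within-lap shortening are handled.</p>

```python
import math


def simulate_winding(rope_m=100.0, pole_radius_m=0.10,
                     rope_thickness_m=0.01, laps_per_layer=10):
    """Count full laps and distance swum until the rope is wound up."""
    laps = 0
    distance_m = 0.0
    wrap_radius = pole_radius_m            # radius the rope currently wraps around
    while True:
        loss = 2 * math.pi * wrap_radius   # rope consumed by one lap
        if rope_m < loss:                  # not enough rope for another full lap
            break
        distance_m += 2 * math.pi * rope_m  # lap length at the current rope length
        rope_m -= loss
        laps += 1
        if laps % laps_per_layer == 0:     # next layer sits on top of the rope
            wrap_radius += rope_thickness_m
    return laps, distance_m


laps, distance_m = simulate_winding()
print(f"{laps} full laps, about {distance_m / 1000:.1f} km in one direction")
```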



<h2 class="wp-block-heading">Task 3</h2>



<p>An interesting problem arises if the rope keeps shortening as it winds onto the previous layers, forming a spiral. In such a spiral each lap is shorter by 0.628 m, so the lap lengths form an arithmetic sequence. To solve this problem we used Google Gemini <a href="https://gemini.google.com" target="_blank" rel="noopener">https://gemini.google.com</a>, as the image below shows:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="684" height="481" src="https://mietwood.com/wp-content/uploads/2025/07/image-10.jpg" alt="Biłgoraj - Zalew Bojary - pływanie 100 km" class="wp-image-3182" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-10.jpg 684w, https://mietwood.com/wp-content/uploads/2025/07/image-10-300x211.jpg 300w" sizes="auto, (max-width: 684px) 100vw, 684px" /></figure>



<p>This equation can be simplified to the following form:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="291" height="117" src="https://mietwood.com/wp-content/uploads/2025/07/image-14.jpg" alt="" class="wp-image-3202"/></figure>



<p>and then</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="473" height="193" src="https://mietwood.com/wp-content/uploads/2025/07/image-13.jpg" alt="" class="wp-image-3201" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-13.jpg 473w, https://mietwood.com/wp-content/uploads/2025/07/image-13-300x122.jpg 300w" sizes="auto, (max-width: 473px) 100vw, 473px" /></figure>



<p>which, after multiplying out, gives a quadratic equation</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="303" height="38" src="https://mietwood.com/wp-content/uploads/2025/07/image-15.jpg" alt="" class="wp-image-3203" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-15.jpg 303w, https://mietwood.com/wp-content/uploads/2025/07/image-15-300x38.jpg 300w" sizes="auto, (max-width: 303px) 100vw, 303px" /></figure>



<p>and then</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="276" height="69" src="https://mietwood.com/wp-content/uploads/2025/07/image-16.jpg" alt="Równanie - Biłgoraj - Zalew Bojary - pływanie 100 km" class="wp-image-3204"/></figure>



<p>Here we have a quadratic equation of the form ax^2 + bx + c = 0. Using the quadratic formula x = (−b ± √(b^2 − 4ac)) / (2a), we get n = 160.43. In this scenario the swimmer therefore completes 160 full laps on a continuously shortening rope.</p>
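<p>The quadratic can be checked numerically. If the first lap has length a and each following lap is shorter by d, the sum of n laps is n/2 · (2a − (n − 1)d); setting it equal to the target distance D gives (d/2)·n^2 − (a + d/2)·n + D = 0. The sketch below solves for the smaller root. It assumes a = 2π · 100 m and a per-lap decrement of d = 2π · 0.01 m ≈ 0.0628 m, the value that reproduces the quoted n ≈ 160.43; a larger decrement gives a larger n. The function name is illustrative.</p>

```python
import math


def laps_for_distance(target_m: float, first_lap_m: float, decrement_m: float) -> float:
    """Solve n/2 * (2a - (n - 1)d) = target for n (smaller root)."""
    a, d = first_lap_m, decrement_m
    # Rearranged: (d/2) n^2 - (a + d/2) n + target = 0
    qa, qb, qc = d / 2, -(a + d / 2), target_m
    disc = qb * qb - 4 * qa * qc
    if disc < 0:
        raise ValueError("target distance cannot be reached before the laps shrink to zero")
    return (-qb - math.sqrt(disc)) / (2 * qa)


first_lap = 2 * math.pi * 100    # about 628 m
decrement = 2 * math.pi * 0.01   # assumed per-lap decrement, about 0.0628 m
n = laps_for_distance(100_000, first_lap, decrement)
print(round(n, 2))
```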



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>For Python enthusiasts, the problem can be solved with the procedure below.</p>



<pre class="wp-block-code"><code>def calculate_turns(rope_length, rope_thickness, bar_radius, shorten_per_turn, turns_per_layer, layer_increase):
    turns = 0
    while rope_length &gt; 0:
        turns += 1
        rope_length -= shorten_per_turn
        if turns % turns_per_layer == 0:
            bar_radius += layer_increase / 2
            shorten_per_turn = 2 * 3.14 * bar_radius  # bar_radius already includes the wound layers
    return turns
# Given values
rope_length = 100  # meters
rope_thickness = 0.01  # meters (1 cm)
bar_radius = 0.1  # meters
shorten_per_turn = 0.628  # meters
turns_per_layer = 10
layer_increase = 0.02  # meters (2 cm)

# Calculate the number of turns
turns = calculate_turns(rope_length, rope_thickness, bar_radius, shorten_per_turn, turns_per_layer, layer_increase)

print(f"The swimmer can make {turns} turns before the rope ends.")
</code></pre>



<p>More Python <a href="https://mietwood.com/advanced-programming-in-sql-and-python">here</a></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h3 class="wp-block-heading" id="100kilometrwpywaniadlanadzieizalewbojarywbigoraju"><strong>Swimming for a good cause – </strong>Biłgoraj &#8211; Zalew Bojary &#8211; 100 km swim</h3>



<p>The article describes an extraordinary endurance challenge taken on by swimmer <strong>Damian Błaszczyk</strong>, who set out to swim <strong>100 kilometers</strong> in the <strong>Zalew Bojary</strong> reservoir in Biłgoraj. The event, part of the initiative <strong>„Na Fali Nadziei – Przekraczając Granice”</strong>, is a sporting effort but above all a charitable one. This year's edition aims to support <strong>Adaś Iwanejko</strong>, a 12-year-old boy with Duchenne muscular dystrophy. The funds raised are meant to help pay for a costly gene therapy in the USA, which costs more than <strong>16 million złoty</strong>.</p>



<h3 class="wp-block-heading" id="matematykapywaniamodelowaniedystansu"><strong>The mathematics of swimming: modeling the distance</strong></h3>



<p>The blog's author presents mathematical models of how 100 km can be swum in a single body of water. In the first scenario Damian swims in a perfect circle, tethered to a pole by a <strong>100-meter</strong> rope. Each lap is then about <strong>628 meters</strong> (from the circumference formula 2πr). Reaching 100 km takes <strong>160 laps</strong>, with a change of direction every 10 laps to load the body evenly. At a steady pace of <strong>4 km/h</strong> the whole swim would take about <strong>25 hours</strong>; the event allows <strong>48 hours</strong>, which makes the goal realistic.</p>



<h3 class="wp-block-heading" id="realizmlinanawijajcasinasup"><strong>Realism: a rope winding onto the pole</strong></h3>



<p>The second scenario adds a more realistic detail: the rope winds onto the pole, shortening the swimming radius with every lap. The pole is <strong>20 cm in diameter</strong> and the rope is <strong>1 cm thick</strong>. After every 10 laps the rope starts a new layer, which increases the effective radius of the pole, so each successive lap is shorter. The calculations show that Damian can complete <strong>106 laps in one direction</strong>, covering <strong>37.7 km</strong>. Repeating this in the opposite direction gives another 37.7 km. Reaching 100 km takes <strong>48 additional laps</strong>, for a total of <strong>260 laps</strong>, 100 more than in the first model.</p>



<h3 class="wp-block-heading" id="modelspiralnymatematycznaelegancja"><strong>The spiral model: mathematical elegance</strong></h3>



<p>The third model assumes the rope winds in a spiral, with each lap shorter than the previous one by a constant amount. The lap lengths then form an arithmetic sequence. Using the formula for the sum of such a sequence, the author calculates that about <strong>160.43 laps</strong> are needed to reach 100 km. This model is close to the first one, but more realistic and mathematically elegant.</p>



<h3 class="wp-block-heading" id="podsumowanie"><strong>Summary</strong></h3>



<p>The article combines <strong>sport, charity, and mathematics</strong> into an inspiring story. It shows not only an extraordinary physical challenge but also how mathematics can help to understand and plan real-world activities. The event at Zalew Bojary is proof of human perseverance and solidarity: every meter swum is a step toward hope for a sick boy. It is also an example of how passion and science can together serve a higher purpose.</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/bilgoraj-zalew-bojary-plywanie-100-km">Biłgoraj &#8211; Zalew Bojary &#8211; 100 km swim</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Semantic Search with Elasticsearch</title>
		<link>https://mietwood.com/semantic-search-with-elasticsearch</link>
		
		<dc:creator><![CDATA[Maki Pa]]></dc:creator>
		<pubDate>Fri, 11 Jul 2025 10:11:19 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Customer Experience Management]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://mietwood.com/?p=3158</guid>

					<description><![CDATA[<p>Semantic search with Elasticsearch is a must-have for modern e-commerce. Elasticsearch is a powerful search engine, scalable data store, and vector database built on Apache Lucene. It’s optimized for speed and relevance on production-scale workloads. You can use Elasticsearch to index your product database and build semantic search on top of it. How AI is changing...</p>
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search-with-elasticsearch">Semantic Search with Elasticsearch</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Semantic search with Elasticsearch is a must-have for modern e-commerce. Elasticsearch is a powerful search engine, scalable data store, and vector database built on Apache Lucene. It’s optimized for speed and relevance on production-scale workloads. You can use Elasticsearch to index your product database and build semantic search on top of it.</p>



<h2 class="wp-block-heading">How AI is <a href="https://www.nngroup.com/articles/ai-changing-search-behaviors" target="_blank" rel="noopener">changing search behaviors</a></h2>



<p><a href="https://www.nngroup.com/articles/ai-changing-search-behaviors" target="_blank" rel="noopener">https://www.nngroup.com/articles/ai-changing-search-behaviors</a></p>





<h2 class="wp-block-heading">Semantic search with Elasticsearch &#8211; intro</h2>



<p><a href="https://www.elastic.co/search-labs/blog/introduction-to-vector-search" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/introduction-to-vector-search</a></p>



<h2 class="wp-block-heading">Language model implementation</h2>



<p>First, we initialize a language model and prepare the product data as input. Semantic search with Elasticsearch requires indexing the data as vectors produced by a transformer model.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>import pandas as pd
import numpy as np

# function to normalize vectors
def normalize_embedding(embedding):
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm != 0 else embedding

# select products from sql database
def select_products():
    q = """
        SELECT 
         &#91;ProductId&#93;
        ,&#91;ProdIdx&#93;
        ,&#91;ProductName&#93;
        FROM &#91;DB_Products&#93; 
    """
    dfp = read_from_sql_server(q)
    return dfp

# import sentence transformers and initiate model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
print(model)

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

model = model.to(device)
print(model)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">pandas</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pd</span></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">numpy</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">as</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">np</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">function</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">normalize</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">vectors</span></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">normalize_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">np</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">linalg</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF"> / </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">norm</span><span style="color: #D8DEE9FF"> != 0 </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">select</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sql</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">database</span></span>
<span class="line"><span style="color: #D8DEE9">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">select_products</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">q</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">        SELECT</span><span style="color: #D8DEE9"> </span></span>
<span class="line"><span style="color: #D8DEE9FF">         &#91;</span><span style="color: #D8DEE9">ProductId</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">ProdIdx</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #D8DEE9">ProductName</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #D8DEE9">FROM</span><span style="color: #D8DEE9FF"> &#91;</span><span style="color: #D8DEE9">DB_Products</span><span style="color: #D8DEE9FF">&#93; </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">&quot;&quot;&quot;</span></span>
<span class="line"><span style="color: #A3BE8C">    dfp = read_from_sql_server(q)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #81A1C1">return</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">dfp</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sentence</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">transformers</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">initiate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">model</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">sentence_transformers</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">SentenceTransformer</span></span>
<span class="line"><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">SentenceTransformer</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">all-MiniLM-L6-v2</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">torch</span></span>
<span class="line"><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">torch</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">cuda</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">torch</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">cuda</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">is_available</span><span style="color: #D8DEE9FF">() </span><span style="color: #8FBCBB">else</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">cpu</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">to</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">device</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">model</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
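
<p>The code later in this article calls <strong>get_embedding</strong>, which does not appear in any listing. Below is a minimal sketch of how such a helper could look, assuming it encodes the text with the model and returns a unit-length vector as a plain list of floats, which is a form that can be sent to Elasticsearch as JSON. The <strong>encoder</strong> parameter and the <strong>FakeEncoder</strong> stand-in are hypothetical and only keep the sketch runnable without downloading the model; this is not necessarily the author's implementation.</p>

```python
import math


def get_embedding(text, encoder):
    """Encode text and return a unit-length vector as a plain list of floats."""
    # Assumption: encoder exposes .encode(text) and returns a sequence of numbers,
    # as SentenceTransformer.encode does for a single string.
    vector = [float(x) for x in encoder.encode(text)]
    norm = math.sqrt(sum(x * x for x in vector))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in vector] if norm != 0 else vector


class FakeEncoder:
    """Hypothetical stand-in for the real model, used only for this demo."""

    def encode(self, text):
        return [3.0, 4.0]


demo_vector = get_embedding("safety valve", FakeEncoder())
print(demo_vector)
```

<p>With the real model the call would look the same, for example <strong>get_embedding(text, model)</strong>. If the one-argument form used in the indexing code is preferred, <strong>functools.partial(get_embedding, encoder=model)</strong> is one way to bind the model once.</p>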



<h2 class="wp-block-heading">Data indexing for semantic search</h2>



<p>Now we can index the product data into an Elasticsearch index. We index each product name twice: as a dense vector for semantic search and as a text field for lexical search. Storing both makes hybrid search possible.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># initiate Elasticsearch client
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch('http://localhost:9100')

# check es client
from pprint import pprint
pprint(es.info().body)

# ReCreate the index with dense_vector and text mappings
# step 1
es.indices.delete(index="prod_search_hybrid", ignore_unavailable=True)

# step 2 - mappings
es.indices.create(
    index="prod_search_hybrid",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "product_name": {
                "type": "text"
            }
        }
    },
)

# list indices and number of documents indexed 
def indices_list():
    indices = es.cat.indices(format='json')
    return [x&#91;'index'&#93; for x in indices]
# -------------------
print(indices_list())
# ------------------------------

# Create documents for embedding
documents = []
for i, r in df_docs.iterrows():
    documents.append({        
        'product_name': r&#91;'ProductName'&#93;&#91;:256&#93;.lower(),        
    })

print(f' Created table of {len(documents)} docs')

# Prepare bulk operations
from tqdm import tqdm                                         # for a progress bar
operations = []
for document in tqdm(documents, total=len(documents)):
    operations.append({'index': {'_index': 'prod_search_hybrid'}})
    operations.append({
        **document,
        'embedding': get_embedding(document&#91;'product_name'&#93;), # vectors for semantic search
        'product_name': document&#91;'product_name'&#93;              # the text field for hybrid search
    })

# Bulk insert the data into Elasticsearch
response = es.bulk(operations=operations)
print(f' Records indexed: {len(response&#91;"items"&#93;)}')</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">initiate</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">Elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">client</span></span>
<span class="line"><span style="color: #D8DEE9">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">elasticsearch</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">helpers</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">Elasticsearch</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">http://localhost:9100</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">check</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">client</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pprint</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">pprint</span></span>
<span class="line"><span style="color: #8FBCBB">pprint</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">info</span><span style="color: #D8DEE9FF">().</span><span style="color: #8FBCBB">body</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">ReCreate</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">with</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">dense_vector</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">mappings</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">step</span><span style="color: #D8DEE9FF"> 1</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">delete</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">ignore_unavailable</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">True</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">step</span><span style="color: #D8DEE9FF"> 2 - </span><span style="color: #8FBCBB">mappings</span></span>
<span class="line"><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">create</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">mappings</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &quot;</span><span style="color: #8FBCBB">properties</span><span style="color: #D8DEE9FF">&quot;: {</span></span>
<span class="line"><span style="color: #D8DEE9FF">            &quot;</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">&quot;: {</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">type</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">dense_vector</span><span style="color: #D8DEE9FF">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">dims</span><span style="color: #D8DEE9FF">&quot;: 384</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">&quot;: </span><span style="color: #8FBCBB">True</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">similarity</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">cosine</span><span style="color: #D8DEE9FF">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">: </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &quot;</span><span style="color: #8FBCBB">type</span><span style="color: #D8DEE9FF">&quot;: &quot;</span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        }</span></span>
<span class="line"><span style="color: #D8DEE9FF">    }</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">list</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">number</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">of</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indexed</span><span style="color: #D8DEE9FF"> </span></span>
<span class="line"><span style="color: #8FBCBB">def</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices_list</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">cat</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">format</span><span style="color: #D8DEE9FF">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">json</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">return</span><span style="color: #D8DEE9FF"> [</span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">index</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">&#93; </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">x</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">indices</span><span style="color: #D8DEE9FF">]</span></span>
<span class="line"><span style="color: #D8DEE9FF"># -------------------</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">indices_list</span><span style="color: #D8DEE9FF">())</span></span>
<span class="line"><span style="color: #D8DEE9FF"># ------------------------------</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Create</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">embedding</span></span>
<span class="line"><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF"> = []</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">i</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">r</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">df_docs</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">iterrows</span><span style="color: #D8DEE9FF">():</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">r</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">ProductName</span><span style="color: #D8DEE9FF">&#39;&#93;&#91;:256&#93;.</span><span style="color: #8FBCBB">lower</span><span style="color: #D8DEE9FF">()</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF">        </span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C"> Created table of {len(documents)} docs</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Prepare</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bulk</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">operations</span></span>
<span class="line"><span style="color: #81A1C1">from</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">import</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF">                                         # </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">a</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">progress</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">bar</span></span>
<span class="line"><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF"> = []</span></span>
<span class="line"><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">tqdm</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">documents</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">total</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">len</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">documents</span><span style="color: #D8DEE9FF">)):</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span><span style="color: #D8DEE9FF">&#39;</span><span style="color: #8FBCBB">index</span><span style="color: #D8DEE9FF">&#39;: {&#39;</span><span style="color: #8FBCBB">_index</span><span style="color: #D8DEE9FF">&#39;: &#39;</span><span style="color: #8FBCBB">prod_search_hybrid</span><span style="color: #D8DEE9FF">&#39;</span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">})</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">append</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #81A1C1">**</span><span style="color: #8FBCBB">document</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">embedding</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">get_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;&#93;)</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> # </span><span style="color: #8FBCBB">vectors</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">semantic</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">        &#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;: </span><span style="color: #8FBCBB">document</span><span style="color: #D8DEE9FF">&#91;&#39;</span><span style="color: #8FBCBB">product_name</span><span style="color: #D8DEE9FF">&#39;&#93;              # </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">text</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">field</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">hybrid</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">search</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #8FBCBB">Bulk</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">insert</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">the</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">data</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #8FBCBB">Elasticsearch</span></span>
<span class="line"><span style="color: #8FBCBB">response</span><span style="color: #D8DEE9FF"> = </span><span style="color: #8FBCBB">es</span><span style="color: #D8DEE9FF">.</span><span style="color: #8FBCBB">bulk</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">=</span><span style="color: #8FBCBB">operations</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #8FBCBB">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #8FBCBB">f</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C"> Records indexed: {len(response&#91;&quot;items&quot;&#93;)}</span><span style="color: #ECEFF4">&#39;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
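
<p>The number printed by the last line counts the actions Elasticsearch processed, not the ones that succeeded, because the bulk API reports failures per item. Here is a small sketch of how the response could be checked, assuming the standard bulk response shape: an "items" list in which a failed entry carries an "error" key. The helper name <strong>summarize_bulk_response</strong> and the sample response are made up for illustration.</p>

```python
def summarize_bulk_response(response):
    """Return the number of succeeded actions and a list of (id, error) failures."""
    succeeded = 0
    failures = []
    for item in response["items"]:
        # Each item has a single key naming the action, e.g. "index".
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append((result.get("_id"), result["error"]))
        else:
            succeeded += 1
    return succeeded, failures


# Hypothetical response body with one success and one failure.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception"}}},
    ],
}

ok_count, failed = summarize_bulk_response(sample_response)
print(f"Records indexed: {ok_count}, failed: {len(failed)}")
```

<p>For a real run, the same function could be applied to the <strong>response</strong> returned by <strong>es.bulk</strong>, since that response can be indexed by key like a dictionary.</p>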



<h2 class="wp-block-heading">Hybrid search</h2>



<p>For hybrid search we combine a <strong>match</strong> query and a <strong>knn</strong> query inside a bool query. The <strong>_name</strong> parameter labels each clause, so every hit reports which part of the query matched it (in its matched_queries list). This allows us to build hybrid scoring. Note that we vectorize <strong>query_h</strong> into <strong>query_vector</strong> with the same get_embedding procedure that we used during indexing.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly>query_h = "anti-explosion device"

# Vectorize the query with the same function used during indexing
query_vector = get_embedding(query_h)
# print("Query Vector:", query_vector)

response = es.search(
    index='prod_search_hybrid',
    body={
        "query": {
            "bool": {
                "should": &#91;
                    {
                        "match": {
                            "product_name": {
                                "query": query_h,
                                "_name": "text_match"
                            }
                        }
                    },
                    {
                        "knn": {
                            "field": "embedding",
                            "query_vector":  query_vector,
                            "k": 10,
                            "num_candidates": 100,
                            "_name": "semantic_search"
                        }
                    }
                &#93;
            }
        }
        ,
        'size': 30
    }
)</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">anti-explosion device</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># Vectorize the query with the same function used during indexing</span></span>
<span class="line"><span style="color: #D8DEE9">query_vector</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">get_embedding</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">query_h</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Query Vector:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_vector</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">es</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">search</span><span style="color: #D8DEE9FF">(</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">index</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">prod_search_hybrid</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">body</span><span style="color: #81A1C1">=</span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">bool</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">should</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> &#91;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">match</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">query_h</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                                </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text_match</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">},</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">knn</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">{</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">field</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">embedding</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">query_vector</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF">  </span><span style="color: #D8DEE9">query_vector</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">k</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">num_candidates</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">100</span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">                            </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">semantic_search</span><span style="color: #ECEFF4">&quot;</span></span>
<span class="line"><span style="color: #D8DEE9FF">                        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">                &#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">            </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">,</span></span>
<span class="line"><span style="color: #D8DEE9FF">        </span><span style="color: #ECEFF4">&#39;</span><span style="color: #A3BE8C">size</span><span style="color: #ECEFF4">&#39;</span><span style="color: #ECEFF4">:</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">30</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #ECEFF4">}</span></span>
<span class="line"><span style="color: #D8DEE9FF">)</span></span></code></pre></div>



<p>After that we can extract the product names and scores and arrange them as needed. Each hit's <code>_score</code> is the combined score of the bool query, and its <code>matched_queries</code> list holds the <strong>_name</strong> of every clause that matched. Here we split the products by those names and then build a top-10 list for lexical matches and another for semantic similarity. You can read more about similarity measures in the post <a href="https://mietwood.com/measuring-product-similarity">Measuring product similarity</a>.</p>



<div class="wp-block-kevinbatdorf-code-block-pro" data-code-block-pro-font-family="Code-Pro-JetBrains-Mono" style="font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)"><span style="display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff"><svg xmlns="http://www.w3.org/2000/svg" width="54" height="14" viewBox="0 0 54 14"><g fill="none" fill-rule="evenodd" transform="translate(1 1)"><circle cx="6" cy="6" r="6" fill="#FF5F56" stroke="#E0443E" stroke-width=".5"></circle><circle cx="26" cy="6" r="6" fill="#FFBD2E" stroke="#DEA123" stroke-width=".5"></circle><circle cx="46" cy="6" r="6" fill="#27C93F" stroke="#1AAB29" stroke-width=".5"></circle></g></svg></span><span role="button" tabindex="0" style="color:#d8dee9ff;display:none" aria-label="Copy" class="code-block-pro-copy-button"><pre class="code-block-pro-copy-button-pre" aria-hidden="true"><textarea class="code-block-pro-copy-button-textarea" tabindex="-1" aria-hidden="true" readonly># Extract product names, scores, and sources
products = [
    (hit&#91;"_source"&#93;&#91;"product_name"&#93;, hit&#91;"_score"&#93;, hit.get("matched_queries", []))
    for hit in response&#91;"hits"&#93;&#91;"hits"&#93;
]

# Separate products into text_match and semantic_search groups
text_match_products = [product for product in products if "text_match" in product&#91;2&#93;]
semantic_search_products = [product for product in products if "semantic_search" in product&#91;2&#93;]

# Sort each group by score in descending order
sorted_text_match_products = sorted(text_match_products, key=lambda x: x&#91;1&#93;, reverse=True)&#91;:10&#93;
sorted_semantic_search_products = sorted(semantic_search_products, key=lambda x: x&#91;1&#93;, reverse=True)&#91;:10&#93;

# Print top 10 text_match products
print("\nTop 10 Text Match Products:")
for product in sorted_text_match_products:
    print(f"Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}")

# Print top 10 semantic_search products
print("\nTop 10 Semantic Search Products:")
for product in sorted_semantic_search_products:
    print(f"Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}")</textarea></pre><svg xmlns="http://www.w3.org/2000/svg" style="width:24px;height:24px" fill="none" viewBox="0 0 24 24" stroke="currentColor" stroke-width="2"><path class="with-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4"></path><path class="without-check" stroke-linecap="round" stroke-linejoin="round" d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"></path></svg></span><pre class="shiki nord" style="background-color: #2e3440ff" tabindex="0"><code><span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Extract</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">names</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">scores</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">sources</span></span>
<span class="line"><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span></span>
<span class="line"><span style="color: #D8DEE9FF">    (</span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_source</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">product_name</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">_score</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #ECEFF4">.</span><span style="color: #88C0D0">get</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">matched_queries</span><span style="color: #ECEFF4">&quot;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> []))</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">hit</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">response</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;&#91;</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">hits</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9FF">]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Separate</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">into</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text_match</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">and</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">semantic_search</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">groups</span></span>
<span class="line"><span style="color: #D8DEE9">text_match_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">text_match</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">2</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"><span style="color: #D8DEE9">semantic_search_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> [</span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">if</span><span style="color: #D8DEE9FF"> </span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">semantic_search</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">2</span><span style="color: #D8DEE9FF">&#93;]</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Sort</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">each</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">group</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">by</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">score</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">descending</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">order</span></span>
<span class="line"><span style="color: #D8DEE9">sorted_text_match_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">sorted</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">text_match_products</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">key</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">lambda</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">: </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reverse</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)&#91;:</span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"><span style="color: #D8DEE9">sorted_semantic_search_products</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9FF"> </span><span style="color: #88C0D0">sorted</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">semantic_search_products</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">key</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">lambda</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">: </span><span style="color: #D8DEE9">x</span><span style="color: #D8DEE9FF">&#91;</span><span style="color: #B48EAD">1</span><span style="color: #D8DEE9FF">&#93;</span><span style="color: #ECEFF4">,</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">reverse</span><span style="color: #81A1C1">=</span><span style="color: #D8DEE9">True</span><span style="color: #D8DEE9FF">)&#91;:</span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF">&#93;</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Print</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">top</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">text_match</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Top 10 Text Match Products:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> sorted_text_match_products</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"></span>
<span class="line"><span style="color: #D8DEE9FF"># </span><span style="color: #D8DEE9">Print</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">top</span><span style="color: #D8DEE9FF"> </span><span style="color: #B48EAD">10</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">semantic_search</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">products</span></span>
<span class="line"><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #ECEFF4">&quot;</span><span style="color: #EBCB8B">\n</span><span style="color: #A3BE8C">Top 10 Semantic Search Products:</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span>
<span class="line"><span style="color: #D8DEE9">for</span><span style="color: #D8DEE9FF"> </span><span style="color: #D8DEE9">product</span><span style="color: #D8DEE9FF"> </span><span style="color: #81A1C1">in</span><span style="color: #D8DEE9FF"> sorted_semantic_search_products</span><span style="color: #ECEFF4">:</span></span>
<span class="line"><span style="color: #D8DEE9FF">    </span><span style="color: #88C0D0">print</span><span style="color: #D8DEE9FF">(</span><span style="color: #D8DEE9">f</span><span style="color: #ECEFF4">&quot;</span><span style="color: #A3BE8C">Product Name: {product&#91;0&#93;}, Score: {product&#91;1&#93;}</span><span style="color: #ECEFF4">&quot;</span><span style="color: #D8DEE9FF">)</span></span></code></pre></div>
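


<p>The two lists above keep lexical and semantic results apart. One common way to merge such lists into a single hybrid ranking is reciprocal rank fusion (RRF), which uses only the position of each product in each list, so the two kinds of score do not need to share a scale. The sketch below is not part of the original notebook: the helper name <code>rrf_fuse</code>, the constant <code>k=60</code>, and the sample tuples are illustrative assumptions. It expects lists of <code>(name, score, matched_queries)</code> tuples that are already sorted by score, in the same shape as the lists built above.</p>



<pre class="wp-block-code"><code>from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Merge ranked lists of (name, score, matched_queries) tuples with reciprocal rank fusion."""
    fused = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank) for the product it places at this rank
            fused[item[0]] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by name for a stable order
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample data in the same shape as the lists built above
text_hits = [
    ("explosion-proof lamp", 7.1, ["text_match"]),
    ("anti-explosion valve", 6.4, ["text_match", "semantic_search"]),
]
semantic_hits = [
    ("anti-explosion valve", 6.4, ["text_match", "semantic_search"]),
    ("flame arrester", 0.9, ["semantic_search"]),
]

hybrid_ranking = rrf_fuse([text_hits, semantic_hits])
for name, fused_score in hybrid_ranking:
    print(f"Product Name: {name}, RRF score: {fused_score:.4f}")</code></pre>



<p>A product that appears in both lists collects a contribution from each, so it tends to rise above products found by only one clause. A larger <code>k</code> flattens the difference between neighbouring ranks.</p>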



<p>Full-text search, also known as lexical search, is a technique for fast, efficient searching through text fields in documents. Documents and search queries are transformed to enable returning&nbsp;<a href="https://www.elastic.co/what-is/search-relevance" target="_blank" rel="noreferrer noopener">relevant</a>&nbsp;results instead of simply exact term matches. Fields of type&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/text#text-field-type" target="_blank" rel="noopener"><code>text</code></a>&nbsp;are analyzed and indexed for full-text search.</p>
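


<p>To make the idea of analysis concrete, here is a toy sketch in plain Python. It is not the Elasticsearch analyzer: the <code>analyze</code> and <code>matches</code> helpers are hypothetical and only imitate lowercasing and word splitting. They do show why a query can find a document without being an exact string match.</p>



<pre class="wp-block-code"><code>import re


def analyze(text):
    """Toy analysis step: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, document):
    """Return True when the query and the document share at least one token."""
    return bool(set(analyze(query)).intersection(analyze(document)))


print(analyze("Anti-Explosion Device"))
print(matches("anti-explosion device", "Explosion protection DEVICE, type B"))
print(matches("anti-explosion device", "Garden hose 20 m"))</code></pre>



<p>Because both sides go through the same transformation, differences in case and punctuation no longer prevent a match.</p>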



<p>You can combine full-text search with&nbsp;<a href="https://www.elastic.co/docs/solutions/search/semantic-search" target="_blank" rel="noopener">semantic search using vectors</a>&nbsp;to build modern hybrid search applications. While vector search may require additional GPU resources, the full-text component remains cost-effective by leveraging existing CPU infrastructure.</p>



<p>Another example of vector indexing and semantic search can be found here: <a href="https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb" class="ek-link" target="_blank" rel="noopener">https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb</a></p>



<h2 class="wp-block-heading">Vector search setup and performing hybrid search</h2>



<p><a href="https://www.elastic.co/search-labs/blog/vector-search-set-up-elasticsearch" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/vector-search-set-up-elasticsearch</a></p>



<p><a href="https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch" target="_blank" rel="noopener">https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch</a></p>



<h2 class="wp-block-heading" id="Filtering">Filtering</h2>



<p>Filter context is mostly used for filtering structured data. For example, use filter context to answer questions like:</p>



<ul class="wp-block-list">
<li><em>Does this timestamp fall into the range 2015 to 2016?</em></li>



<li><em>Is the status field set to &#8220;published&#8221;?</em></li>
</ul>



<p>Filter context is in effect whenever a query clause is passed to a filter parameter, such as the&nbsp;<code>filter</code>&nbsp;or&nbsp;<code>must_not</code>&nbsp;parameters in a&nbsp;<code>bool</code>&nbsp;query. <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context" target="_blank" rel="noopener">Learn more</a>&nbsp;about filter context in the Elasticsearch docs.</p>



<h3 class="wp-block-heading" id="Example:-Keyword-Filtering">Keyword Filtering</h3>



<p>This is an example of adding a keyword filter to the query. The example retrieves the top books that are similar to &#8220;javascript books&#8221; based on their title vectors and that have Addison-Wesley as the publisher.</p>



<pre class="wp-block-code"><code>response = client.search(
    index="book_index",
    knn={
        "field": "title_vector",
        "query_vector": model.encode("javascript books"),
        "k": 10,
        "num_candidates": 100,
<span style="background-color:var(--global-palette1)" class="has-inline-background"><mark style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-theme-palette-9-color">        "filter": {"term": {"publisher.keyword": "addison-wesley"}},</mark></span>
    },
)

pprint(response)</code></pre>



<h2 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#query-filter-context" target="_blank" rel="noopener">Query and filter context</a></h2>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#relevance-scores" target="_blank" rel="noopener">Relevance scores</a></h3>



<p>By default, Elasticsearch sorts matching search results by&nbsp;<strong>relevance score</strong>, which measures how well each document matches a query. The relevance score is a positive floating point number, returned in the&nbsp;<code>_score</code>&nbsp;metadata field of the&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search" target="_blank" rel="noreferrer noopener" class="ek-link">search</a>&nbsp;API. The higher the&nbsp;<code>_score</code>, the more relevant the document. While each query type can calculate relevance scores differently, score calculation also depends on whether the query clause is run in a&nbsp;<strong>query</strong>&nbsp;or&nbsp;<strong>filter</strong>&nbsp;context.</p>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#query-context" class="ek-link" target="_blank" rel="noopener">Query context</a></h3>



<p>In the query context, a query clause answers the question&nbsp;<em>How well does this document match this query clause?</em>&nbsp;Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the&nbsp;<code>_score</code>&nbsp;metadata field. Query context is in effect whenever a query clause is passed to a&nbsp;<code>query</code>&nbsp;parameter, such as the&nbsp;<code>query</code>&nbsp;parameter in the&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#request-body-search-query" target="_blank" rel="noreferrer noopener">search</a>&nbsp;API.</p>



<h3 class="wp-block-heading"><a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/querydsl#filter-context" target="_blank" rel="noopener">Filter context</a></h3>



<p>A filter answers the binary question “Does this document match this query clause?”. The answer is simply &#8220;yes&#8221; or &#8220;no&#8221;. Filtering has several benefits:</p>



<ol class="wp-block-list">
<li><strong>Simple binary logic</strong>: In a filter context, a query clause determines document matches based on a yes/no criterion, without score calculation.</li>



<li><strong>Performance</strong>: Because they don’t compute relevance scores, filters execute faster than queries.</li>



<li><strong>Caching</strong>: Elasticsearch automatically caches frequently used filters, which speeds up subsequent searches that reuse them.</li>



<li><strong>Resource efficiency</strong>: Filters consume fewer CPU resources than full-text queries.</li>



<li><strong>Query combination</strong>: Filters can be combined with scored queries to refine result sets efficiently.</li>
</ol>



<p>Filters are particularly effective for querying structured data and implementing &#8220;must have&#8221; criteria in complex searches.</p>



<p>Structured data refers to information that is highly organized and formatted in a predefined manner. In the context of Elasticsearch, this typically includes:</p>



<ul class="wp-block-list">
<li>Numeric fields (integers, floating-point numbers)</li>



<li>Dates and timestamps</li>



<li>Boolean values</li>



<li>Keyword fields (exact match strings)</li>



<li>Geo-points and geo-shapes</li>
</ul>



<p>Unlike full-text fields, structured data has a consistent, predictable format, making it ideal for precise filtering operations.</p>



<p>Common filter applications include:</p>



<ul class="wp-block-list">
<li>Date range checks: for example, is the&nbsp;<code>timestamp</code>&nbsp;field between 2015 and 2016?</li>



<li>Specific field value checks: for example, is the&nbsp;<code>status</code>&nbsp;field equal to &#8220;published&#8221;, or is the&nbsp;<code>author</code>&nbsp;field equal to &#8220;John Doe&#8221;?</li>
</ul>
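


<p>As a rough illustration of the two checks above, the Python sketch below builds a search request body that uses only filter clauses. The helper name&nbsp;<code>build_filter_query</code>&nbsp;and the field names&nbsp;<code>timestamp</code>&nbsp;and&nbsp;<code>status</code>&nbsp;are assumptions made for this example, and "between 2015 and 2016" is read here as covering both calendar years. The result is a plain dictionary that could be sent to the search API with any HTTP client.</p>



<pre class="wp-block-code"><code>import json


def build_filter_query(start_date, end_date, status):
    """Build a search body whose clauses all run in filter context.

    The field names ("timestamp", "status") are illustrative assumptions.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    # Date range check: start_date inclusive, end_date exclusive
                    {"range": {"timestamp": {"gte": start_date, "lt": end_date}}},
                    # Exact value check, best suited to a keyword field
                    {"term": {"status": status}},
                ]
            }
        }
    }


# Assumed reading of "between 2015 and 2016": both calendar years
body = build_filter_query("2015-01-01", "2017-01-01", "published")
print(json.dumps(body, indent=2))</code></pre>



<p>Because no clause runs in query context, the documents are only included or excluded; no relevance ranking is computed for them.</p>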



<p>Filter context applies when a query clause is passed to a&nbsp;<code>filter</code>&nbsp;parameter, such as:</p>



<ul class="wp-block-list">
<li><code>filter</code>&nbsp;or&nbsp;<code>must_not</code>&nbsp;parameters in&nbsp;<a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-bool-query" target="_blank" rel="noopener"><code>bool</code></a>&nbsp;queries</li>



<li><code>filter</code>&nbsp;parameter in&nbsp;<a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-constant-score-query" target="_blank" rel="noopener"><code>constant_score</code></a>&nbsp;queries</li>



<li><a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-filter-aggregation" target="_blank" rel="noopener"><code>filter</code></a>&nbsp;aggregations</li>
</ul>
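


<p>The second and third forms in the list above might look as follows when written as Python dictionaries. This is a minimal sketch: the field&nbsp;<code>status</code>, the numeric field&nbsp;<code>views</code>, and the aggregation names&nbsp;<code>published_docs</code>&nbsp;and&nbsp;<code>avg_views</code>&nbsp;are hypothetical and only serve the illustration.</p>



<pre class="wp-block-code"><code>import json

# constant_score: the wrapped clause runs in filter context and every
# matching document receives the same score (the boost value).
constant_score_body = {
    "query": {
        "constant_score": {
            "filter": {"term": {"status": "published"}},
            "boost": 1.0,
        }
    }
}

# filter aggregation: narrows the set of documents that feed the
# sub-aggregation; the clause is evaluated as a yes/no match.
filter_agg_body = {
    "size": 0,  # return aggregation results only, no hits
    "aggs": {
        "published_docs": {
            "filter": {"term": {"status": "published"}},
            "aggs": {"avg_views": {"avg": {"field": "views"}}},
        }
    },
}

print(json.dumps(constant_score_body, indent=2))
print(json.dumps(filter_agg_body, indent=2))</code></pre>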



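<p>Filters are most useful when combined with full-text search. In the following request, the two&nbsp;<code>match</code>&nbsp;clauses under&nbsp;<code>must</code>&nbsp;run in query context and contribute to the relevance score, while the&nbsp;<code>term</code>&nbsp;and&nbsp;<code>range</code>&nbsp;clauses under&nbsp;<code>filter</code>&nbsp;run in filter context: they exclude documents that do not match but do not affect the score of those that do.</p>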



<pre class="wp-block-code"><code>GET /_search
{
  "query": {
    "bool": {
      "must": &#91;
        { "match": { "title":   "Search"        }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": &#91;
        { "term":  { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015-01-01" }}}
      ]
    }
  }
}</code></pre>



<p>Read more: <a href="https://mietwood.com/product-search-and-product-classification-for-e-commerce">Product Search and Product classification for E-commerce</a></p>



<h1 class="wp-block-heading">Reciprocal rank fusion</h1>



<p><a href="https://plg.uwaterloo.ca/%7Egvcormac/cormacksigir09-rrf.pdf" target="_blank" rel="noreferrer noopener">Reciprocal rank fusion (RRF)</a>&nbsp;is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results.</p>



<p>RRF uses the following formula to determine the score for ranking each document:</p>



<pre class="wp-block-code"><code>score = 0.0
for q in queries:
    if d in result(q):
        score += 1.0 / ( k + rank( result(q), d ) )
return score

# where
# k is a ranking constant
# q is a query in the set of queries
# d is a document in the result set of q
# result(q) is the result set of q
# rank( result(q), d ) is d's rank within the result(q) starting from 1</code></pre>
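


<p>The pseudocode above translates almost directly into Python. The sketch below is not Elasticsearch's implementation; the function name&nbsp;<code>rrf_scores</code>, the document ids, and the value&nbsp;<code>k=60</code>&nbsp;are assumptions chosen for illustration. It scores every document that appears in at least one ranked list and then sorts by the fused score.</p>



<pre class="wp-block-code"><code>def rrf_scores(result_lists, k):
    """Return a dict mapping each document id to its RRF score.

    result_lists: one ranked list of document ids per query, best first.
    k: the ranking constant from the formula.
    Assumes a document appears at most once in each list.
    """
    scores = {}
    for results in result_lists:
        # Ranks start from 1, as in the formula
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Two hypothetical result sets for the same search, best match first
lexical = ["a", "b", "c"]
vector = ["c", "a", "d"]

scores = rrf_scores([lexical, vector], k=60)  # k=60 is an illustrative value
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['a', 'c', 'b', 'd']</code></pre>



<p>Documents that rank well in both lists (<code>a</code>&nbsp;and&nbsp;<code>c</code>) rise above documents that appear in only one, even though the two lists never share a common score scale.</p>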



<p>You can use RRF as part of a&nbsp;<a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search" target="_blank" rel="noreferrer noopener">search</a>&nbsp;to combine and rank documents using separate sets of top documents (result sets) from a combination of&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers" target="_blank" rel="noopener">child retrievers</a>&nbsp;using an&nbsp;<a href="https://www.elastic.co/docs/reference/elasticsearch/rest-apis/retrievers#rrf-retriever" target="_blank" rel="noopener">RRF retriever</a>. A minimum of&nbsp;<strong>two</strong>&nbsp;child retrievers is required for ranking.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="431" height="198" src="https://mietwood.com/wp-content/uploads/2025/07/image-1.jpg" alt="Semantic search with Elasticsearch. RRF retriever combined the results." class="wp-image-3163" srcset="https://mietwood.com/wp-content/uploads/2025/07/image-1.jpg 431w, https://mietwood.com/wp-content/uploads/2025/07/image-1-300x138.jpg 300w" sizes="auto, (max-width: 431px) 100vw, 431px" /><figcaption class="wp-element-caption">Semantic search with Elasticsearch. RRF retriever combined the results.</figcaption></figure>
</div>


<p>We rank the documents based on the RRF formula with a&nbsp;<code>rank_window_size</code>&nbsp;of&nbsp;<code>5</code>, truncating the bottom&nbsp;<code>2</code>&nbsp;docs in our RRF result set with a&nbsp;<code>size</code>&nbsp;of&nbsp;<code>3</code>. <strong>We end with&nbsp;<code>_id: 3</code>&nbsp;as&nbsp;<code>_rank: 1</code>,&nbsp;<code>_id: 2</code>&nbsp;as&nbsp;<code>_rank: 2</code>, and&nbsp;<code>_id: 4</code>&nbsp;as&nbsp;<code>_rank: 3</code>.</strong> This ranking matches the result set from the original RRF search as expected.</p>



<p>In the following example, we execute the&nbsp;<code>knn</code>&nbsp;and&nbsp;<code>standard</code>&nbsp;retrievers independently of each other. Then we use the&nbsp;<code>rrf</code>&nbsp;retriever to combine the results.</p>



<ol class="wp-block-list">
<li>First, we execute the kNN search specified by the&nbsp;<code>knn</code>&nbsp;retriever to get its global top 50 results.</li>



<li>Second, we execute the query specified by the&nbsp;<code>standard</code>&nbsp;retriever to get its global top 50 results.</li>



<li>Then, on a coordinating node, we combine the kNN search top documents with the query top documents and rank them based on the RRF formula using parameters from the&nbsp;<code>rrf</code>&nbsp;retriever to get the combined top documents using the default&nbsp;<code>size</code>&nbsp;of&nbsp;<code>10</code>.</li>
</ol>



<p>Note that if&nbsp;<code>k</code>&nbsp;from a kNN search is larger than&nbsp;<code>rank_window_size</code>, the results are truncated to&nbsp;<code>rank_window_size</code>. If&nbsp;<code>k</code>&nbsp;is smaller than&nbsp;<code>rank_window_size</code>, the kNN search contributes only&nbsp;<code>k</code>&nbsp;results.</p>



<pre class="wp-block-code"><code>GET example-index/_search
{
    "retriever": {
        "rrf": {
            "retrievers": &#91;
                {
                    "standard": {
                        "query": {
                            "term": {
                                "text": "shoes"
                            }
                        }
                    }
                },
                {
                    "knn": {
                        "field": "vector",
                        "query_vector": &#91;1.25, 2, 3.5],
                        "k": 50,
                        "num_candidates": 100
                    }
                }
            ],
            "rank_window_size": 50,
            "rank_constant": 20
        }
    }
}</code></pre>
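<p>To make the combining step of the request above more concrete, here is a small offline simulation in Python. It is a sketch under stated assumptions, not Elasticsearch's implementation: the two input lists stand in for the results of the&nbsp;<code>standard</code>&nbsp;and&nbsp;<code>knn</code>&nbsp;retrievers, the document ids are made up, the helper name&nbsp;<code>rrf_retrieve</code>&nbsp;is hypothetical, and ties are broken by document id purely to keep the output stable. The parameter names mirror&nbsp;<code>rank_window_size</code>,&nbsp;<code>rank_constant</code>&nbsp;and&nbsp;<code>size</code>.</p>



<pre class="wp-block-code"><code>def rrf_retrieve(retriever_results, rank_window_size=50, rank_constant=20, size=10):
    """Fuse ranked id lists with RRF and return the top `size` ids."""
    scores = {}
    for results in retriever_results:
        # Each child retriever contributes at most rank_window_size documents
        window = results[:rank_window_size]
        for rank, doc_id in enumerate(window, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by id (an assumption of this sketch)
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:size]


# Made-up results from the two child retrievers, best match first
standard_hits = ["doc1", "doc2", "doc3", "doc4"]
knn_hits = ["doc3", "doc1", "doc5", "doc6"]

top = rrf_retrieve(
    [standard_hits, knn_hits], rank_window_size=3, rank_constant=20, size=3
)
print(top)  # ['doc1', 'doc3', 'doc2']</code></pre>



<p>With a window of 3,&nbsp;<code>doc4</code>&nbsp;and&nbsp;<code>doc6</code>&nbsp;never enter the fusion step, and&nbsp;<code>doc5</code>&nbsp;is scored but falls outside the final&nbsp;<code>size</code>&nbsp;of 3.</p>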
<p>The post <a rel="nofollow" href="https://mietwood.com/semantic-search-with-elasticsearch">Semantic Search with Elasticsearch</a> appeared first on <a rel="nofollow" href="https://mietwood.com">Customer Experience Management</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
