Semantic Search
Since the advent of ChatGPT in November 2022, not a single day goes by without hearing or reading about vector or semantic search. It is everywhere, and so prevalent that we often get the impression it is a brand-new, cutting-edge technology.
Vector vs lexical search
An easy way to introduce vector search is by comparing it to the more conventional lexical search that you are probably used to. Vector search, also commonly known as semantic search, works very differently from lexical search.
Lexical search is the kind of search we have all been using for years. To summarize it very briefly, it does not try to understand the real meaning of what is indexed and queried. Instead, it works hard to lexically match the literal words the user types in a query, or variants of them (stemmed forms, synonyms, and so on), against the literals that have been previously indexed into the database. Similarity is expressed by a ranking algorithm such as TF-IDF or BM25.
Documents are tokenized and analyzed, and the resulting terms are stored in an inverted index, which simply maps each analyzed term to the documents containing it. Searching for “yellow texas roses” will match every document that contains at least one of those terms, each with a different score.
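As a toy illustration of an inverted index (plain Python, no Elasticsearch involved; the documents and tokenization are made up for the example):

# Toy inverted index: analyzed terms -> IDs of the documents containing them
docs = {
    1: "Yellow roses from Texas",
    2: "Red roses bouquet",
    3: "Texas wildflower seeds",
}
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():   # stand-in for real tokenization/analysis
        inverted_index.setdefault(term, set()).add(doc_id)

# Every document sharing at least one query term matches;
# a ranking algorithm (TF-IDF/BM25) then orders them by score.
query_terms = "yellow texas roses".lower().split()
matches = {doc_id for term in query_terms if term in inverted_index for doc_id in inverted_index[term]}
print(matches)   # {1, 2, 3}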
Semantic search, in contrast, has one purpose: to index data in such a way that it can be searched based on the meaning it represents.
What’s the difference between semantic search and lexical search?
Lexical search doesn’t try to understand the real meaning of what is indexed and queried; it matches the literal words or their variants. In contrast, vector search indexes data in a way that allows it to be searched based on the meaning it represents.
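To make "searched based on the meaning" concrete, here is a minimal, self-contained sketch (the all-MiniLM-L6-v2 model and the example sentences are illustrative choices, not part of the setup used later): sentences with similar meaning end up close together in vector space even when they share no words.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "yellow texas rose",
    "a golden flower from the Lone Star State",
    "cordless power drill",
]
embeddings = model.encode(sentences)
# Cosine similarity: the two flower sentences should score noticeably
# higher with each other than either does with the power drill.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())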
Read more about vector similarity https://www.elastic.co/search-labs/blog/introduction-to-vector-search
Semantic search with Elasticsearch
Load the embedding model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")
VECTOR_DIMENSION = model.get_sentence_embedding_dimension()
print(f"Model loaded successfully. VECTOR_DIMENSION = {VECTOR_DIMENSION}")
Connect to the Elasticsearch server
from elasticsearch import Elasticsearch

ES_HOST = "http://localhost:9000"   # adjust to your cluster (Elasticsearch's default HTTP port is 9200)
es = Elasticsearch(hosts=[ES_HOST])
print(f"Connection successful: {es.info().body['cluster_name']}")
Index product data into Elasticsearch
# Fetch products from the SQL database
def select_products():
    q = """
    SELECT *
    FROM [DB_Products]
    WHERE [BrandName] IN ('PCE', 'De Walt')
    """
    # read_from_sql_server is the author's own helper for querying SQL Server via ODBC
    dfp = read_from_sql_server(q, odbc_conect='DSN=SQLxx')
    return dfp
# >-------------------------------------------------
df_docs = select_products()
print(df_docs.info())
# -------------------------------------------------# Index product names into elastic
index_mapping = {
    "properties": {
        "embedding": {
            "type": "dense_vector",
            "dims": VECTOR_DIMENSION,
            "index": True,
            "similarity": "cosine"
        },
        "product_name": {
            "type": "text",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "product_name_org": {
            "type": "keyword",
            "ignore_above": 256
        },
        "prodidx": {
            "type": "keyword",
            "ignore_above": 256
        },
        "img_url": {
            "type": "keyword",
            "ignore_above": 512
        },
    }
}
# --- Create the index ---------------------------------------------------------
INDEX_NAME = 'prod_names_for_search_hybrid'
import time

es.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME, ignore_unavailable=True)
time.sleep(3)   # give the index deletion a moment to propagate

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, mappings=index_mapping)
    print("Index created.")
else:
    print("Index already exists.")
# --- Index the product data ----------------------------------------------------
product_names = df_docs['ProductName'].apply(lambda x: x.lower()[:256]).to_list()

for i, name in enumerate(product_names, start=1):
    print(f"Indexing: {i} {name}")
    # E5 models expect the "passage: " prefix for indexed text
    vector = model.encode(f"passage: {name}").tolist()
    doc = {
        "product_name": name,   # field names must match the mapping above
        "embedding": vector
    }
    es.index(index=INDEX_NAME, document=doc, refresh=True)
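Indexing one document per request with refresh=True is fine for a small demo, but it is slow at scale because every call forces a refresh. A hedged sketch of the same loop using the elasticsearch.helpers.bulk helper instead (assuming the same product_names, model, and INDEX_NAME as above):

from elasticsearch.helpers import bulk

def generate_actions(names):
    # One bulk action per product; encoding in batches with model.encode(list_of_names)
    # would speed this up further.
    for name in names:
        yield {
            "_index": INDEX_NAME,
            "_source": {
                "product_name": name,
                "embedding": model.encode(f"passage: {name}").tolist(),
            },
        }

success, errors = bulk(es, generate_actions(product_names))
print(f"Indexed {success} documents, {len(errors)} errors")
es.indices.refresh(index=INDEX_NAME)   # make the documents searchable once, at the end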
# -------------------------------------------------------------------------------
You can check your index using the diagnostic functions below.
# --- Diagnostic functions -------------------------------------------------------
def indices_list():
    return [index['index'] for index in es.cat.indices(format='json')]

def count_records(index_name):
    return es.count(index=index_name)['count']

# --- Get indices and record counts -----------------------------------------------
for i in indices_list():
    print(i, count_records(i))
# --- Get mappings ------------------------------------------------------------------
mapping = es.indices.get_mapping(index=INDEX_NAME)
fields = mapping[INDEX_NAME]['mappings']['properties']
for field, details in fields.items():
    print(f"Index {INDEX_NAME}, field {field}: {details['type']}")
# ------------------------------------------------------------------------------------
Semantic search
Now that you have an index, you can search. You can read more in my post Semantic Search with Elasticsearch.
# --- Semantic search ---------------------------------------------------------------
query_text = "yellow texas rose"
query_vector = model.encode(f"query: {query_text}").tolist()   # E5 "query: " prefix

knn_query = {
    "field": "embedding",
    "query_vector": query_vector,
    "k": 10,
    "num_candidates": 50
}

response = es.search(index=INDEX_NAME, knn=knn_query, source=["product_name"])
for hit in response['hits']['hits']:
    print(f" - Product: {hit['_source']['product_name']} (Score: {hit['_score']:.4f})")
You can also perform a plain lexical search or, finally, a hybrid search that combines both.
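For reference, a plain lexical (BM25) query against the same index looks like this (a minimal sketch reusing query_text and the product_name field from above):

# Lexical search: BM25 scoring over the analyzed product_name field
lexical_query = {
    "match": {
        "product_name": {
            "query": query_text,
            "operator": "or"   # "and" would require every term to be present
        }
    }
}
response = es.search(index=INDEX_NAME, query=lexical_query, source=["product_name"])
for hit in response["hits"]["hits"]:
    print(f" - Product: {hit['_source']['product_name']} (Score: {hit['_score']:.4f})")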
# Hybrid search: a BM25 match clause plus a kNN clause in one bool "should";
# both clauses are named so we can see which one matched each hit.
def search_score_script(query_h):
    response = es.search(
        index=INDEX_NAME,
        body={
            "_source": ['product_name', 'prodidx', 'img_url'],
            "query": {
                "function_score": {
                    "query": {
                        "bool": {
                            "should": [
                                {
                                    "match": {
                                        "product_name": {
                                            "query": query_h,
                                            "operator": "and",
                                            "_name": "text_match"
                                        }
                                    }
                                },
                                {
                                    "knn": {
                                        "field": "embedding",
                                        "query_vector": model.encode(f"query: {query_h}").tolist(),
                                        "k": 30,
                                        "num_candidates": 300,
                                        "_name": "semantic_search"
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            "size": 100
        }
    )
    return response
response = search_score_script(query_text)

products = [
    {
        "product_name": hit["_source"]["product_name"],
        "score": hit["_score"],
        "matched_queries": hit.get("matched_queries", []),
        # prodidx is only present if you indexed it alongside product_name
        "prodidx": hit["_source"].get("prodidx"),
    }
    for hit in response["hits"]["hits"]
]
print([(p["product_name"], p["score"]) for p in products])
Sentence Transformers (e.g., MiniLM, BERT variants) are a strong choice of model for semantic search:
- Type: Dense vector models.
- Pros:
- Rich semantic understanding.
- Multilingual support.
- Fine-tuning possible for domain-specific needs.
- Popular Models:
- msmarco-MiniLM-L-12-v3: optimized for asymmetric search (short queries vs. long product descriptions).
- all-MiniLM-L6-v2: fast and lightweight for general semantic tasks.
- Use Case: Ideal for large-scale product catalogs and multilingual e-commerce platforms.
Hybrid Search (BM25 + Semantic)
- Combine BM25 (keyword relevance) with semantic embeddings using Reciprocal Rank Fusion (RRF).
- Delivers highly relevant results by balancing literal matches and contextual meaning (see the sketch below).
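Recent Elasticsearch versions expose RRF natively (availability depends on version and license), but the idea is simple enough to sketch client-side: each document earns 1 / (k + rank) from every result list it appears in, and the fused score decides the final order. The helper below is a hypothetical illustration, reusing query_text and knn_query from earlier, not code from the original post.

def rrf_fuse(lexical_hits, semantic_hits, k=60):
    # Reciprocal Rank Fusion: sum 1 / (k + rank) over every list a document appears in
    scores = {}
    for hits in (lexical_hits, semantic_hits):
        for rank, hit in enumerate(hits, start=1):
            scores[hit["_id"]] = scores.get(hit["_id"], 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Fuse a BM25 result list with a kNN result list (both produced by es.search)
lexical_hits = es.search(index=INDEX_NAME, query={"match": {"product_name": query_text}})["hits"]["hits"]
semantic_hits = es.search(index=INDEX_NAME, knn=knn_query)["hits"]["hits"]
for doc_id, score in rrf_fuse(lexical_hits, semantic_hits)[:10]:
    print(doc_id, round(score, 4))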
