Unleashing the Power of Sentence Transformers: Revolutionising Semantic Search and Sentence Similarity

Sakil Ansari
Mar 31, 2023


PART III: Your Guide to Building Your First Efficient and Effective Search Engine

In Part I, I discussed sentence transformer fundamentals, the architecture, and the various dataset formats used for training the model. In Part II, the sentence transformer was fine-tuned for Scenario 1. In this part, I am going to build a semantic search engine with sentence transformers and Faiss.

Facebook AI Similarity Search (Faiss)

Facebook AI Similarity Search (Faiss) is a library created by Facebook AI to make similarity search efficient, and it is one of the most popular implementations of efficient similarity search.

We can use Faiss to index a set of vectors and then use another vector (the query vector) to look up the most similar vectors in the index. We will see that Faiss not only lets us build an index and search it, but also accelerates search times dramatically.
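To make the idea concrete, here is a minimal, self-contained sketch that indexes a few random vectors and searches them (the real pipeline with our headline embeddings follows below):

import faiss
import numpy as np

dimension = 64  # toy vector size
vectors = np.random.random((1000, dimension)).astype("float32")

index = faiss.IndexFlatL2(dimension)  # exact L2 index
index.add(vectors)  # index the set of vectors

query = np.random.random((1, dimension)).astype("float32")
distances, indices = index.search(query, 5)  # 5 nearest neighbours
print(indices)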

Sentence transformer and FAISS-based similarity search architecture

Dataset

This article uses a publicly accessible news-headlines dataset (india-news-headlines.csv). The dataset has the following three columns:

  • publish_date
  • headline_category
  • headline_text

Data Preparation

I have used the headline_text column for this task.

# importing necessary libraries
import numpy as np
import pandas as pd


def get_data_prepared():
    """
    Function to prepare the data: load the CSV and return the list of
    headlines together with the original DataFrame
    """
    data = pd.read_csv("india-news-headlines.csv")
    print("data loaded successfully")
    sentences = data.headline_text.to_list()
    # remove duplicates and NaN
    sentences = [word for word in list(set(sentences)) if type(word) is str]
    return sentences, data

Creating Embedding

The embedding for the text is created using a sentence transformer.

# installing sentence transformer by using the following command
# pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

# initialize sentence transformer model
model = SentenceTransformer('bert-base-nli-mean-tokens')

def create_embedding():
    """
    Function to create embedding
    """
    data, original_data = get_data_prepared()
    # create sentence embeddings
    sentence_embeddings = model.encode(data)
    print("<<<<<<The shape of the text embedding is>>>>>", sentence_embeddings.shape)
    return sentence_embeddings
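Encoding the whole corpus takes time, so it is worth persisting the embeddings instead of recomputing them on every run. A minimal sketch (the file name headline_embeddings.npy is just an example):

import numpy as np

# save the embeddings once...
sentence_embeddings = create_embedding()
np.save("headline_embeddings.npy", sentence_embeddings)

# ...and reload them later without re-encoding
sentence_embeddings = np.load("headline_embeddings.npy")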

Creating Index using FAISS

Once the embeddings have been created and saved, an index is built so that related content can be searched.

# install FAISS by using the following command
# pip install faiss-cpu

import faiss

def get_faiss_index():
    """
    Function to create faiss index and search it with a query
    """
    sentence_embeddings = create_embedding()
    number_dimension = sentence_embeddings.shape[1]
    index = faiss.IndexFlatL2(number_dimension)
    index.add(sentence_embeddings)
    top_related_content = 5  # modify based on your requirements
    xquery = model.encode(["For bigwigs; it is destination Goa"])
    D, related_indices = index.search(xquery, top_related_content)
    print(related_indices)
    return related_indices
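Building the index is also a one-off cost, so you may want to persist it as well. Faiss provides write_index and read_index for this. A small sketch, assuming index is the IndexFlatL2 object built above and the file name is just an example:

# persist the index so it does not have to be rebuilt on every run
faiss.write_index(index, "headlines.index")

# reload it later
index = faiss.read_index("headlines.index")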

Query Text

The next step is to build a function that returns the most relevant content for a user's search query.

def get_similar_content():
    """
    Function to get the similar news content from the dataset
    """
    related_indices = get_faiss_index()
    # Faiss returns one row of indices per query; take the first (and only) row
    top_indices = related_indices[0].tolist()
    data, original_data = get_data_prepared()
    print("<<<<<The recommended similar contents are>>>>>>>>>")
    print('\n')
    # the indices refer to positions in the deduplicated sentence list
    print(pd.Series(data, name='headline_text').iloc[top_indices])

get_similar_content()
OUTPUT

<<<<<The recommended similar contents are>>>>>>>>>


90467 President gives away Padma awards
21042 Annual conference on Banga Sahitya
62837 Welcome change
41766 AMC readies for bio-ammo
77320 Point; Counterpoint

Indexes

Flat indexes

IndexFlatL2 measures the L2 (Euclidean) distance between our query vector and every vector stored in the index. It is very accurate and straightforward, but not fast: because the query is compared against every other vector (an exhaustive search), IndexFlatL2 is computationally expensive and does not scale well. With a large dataset, the index quickly becomes too slow.
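Conceptually, IndexFlatL2 does nothing more than the following brute-force NumPy routine, which is why the cost grows linearly with the number of indexed vectors (a sketch of the idea, not how Faiss is implemented internally):

import numpy as np

def brute_force_l2(query_vec, all_vecs, k=5):
    # L2 distance from the query to every stored vector
    distances = np.linalg.norm(all_vecs - query_vec, axis=1)
    # indices of the k closest vectors
    return np.argsort(distances)[:k]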

Cell-probe methods (IndexIVFFlat, IndexFlatL2)

A common way to speed up the search, at the expense of the guarantee of finding the exact nearest neighbour, is a partitioning technique known as cell-probe methods. This approach partitions the vectors and relies on multi-probing, and it includes the following steps:

  • Specify the number of partitions (nlist)
  • Identify the partition the query vector belongs to
  • Use IndexFlatL2 (or another metric) to search between the query vector and all other vectors belonging to that specific partition.

The scope of the search is reduced, producing an approximate answer rather than an exact one (as produced by IndexFlatL2). The implementation is given below:

nlist = 50  # number of partitions

# initialize index using IndexFlatL2 -- this is the quantizer step
quantizer = faiss.IndexFlatL2(number_dimension)

# feed the L2 quantizer into the partitioning index: IndexIVFFlat
index = faiss.IndexIVFFlat(quantizer, number_dimension, nlist)
index.train(sentence_embeddings)
index.add(sentence_embeddings)

# search similar content for the query text
D, related_indices = index.search(xquery, top_related_content)
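By default only the single closest partition is visited (index.nprobe = 1). Increasing nprobe probes more partitions, which improves recall at the cost of speed:

index.nprobe = 10  # visit the 10 closest partitions instead of just one
D, related_indices = index.search(xquery, top_related_content)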

Indexes based on Product Quantization

Product Quantization (PQ) compresses vectors so that distance/similarity calculations become cheaper. PQ consists of three steps:

  • Split the original vector into many subvectors
  • For each set of subvectors, create multiple centroids by using clustering
  • Replace each sub-vector in our vector of sub-vectors with the ID of the nearest set-specific centroid.

The implementation is given below:

m = 8  # number of centroid IDs in final compressed vectors
bits = 8 # number of bits in each centroid

quantizer = faiss.IndexFlatL2(number_dimension) # we keep the same L2 distance flat index
index = faiss.IndexIVFPQ(quantizer, number_dimension, nlist, m, bits)
index.train(sentence_embeddings)
index.add(sentence_embeddings)

# search similar content for the query text
D, related_indices = index.search(xquery, top_related_content)
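To see why PQ matters, consider the memory footprint: bert-base-nli-mean-tokens produces 768-dimensional float32 embeddings, so each uncompressed vector occupies 768 × 4 = 3072 bytes, whereas the PQ codes above need only m × bits / 8 = 8 bytes per vector (plus index overhead), roughly a 384× reduction.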

Conclusion

IVF and PQ perform better in terms of speed, whereas the exhaustive L2 search is superior in terms of accuracy.
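A quick way to verify this on your own data is to time the same query against each index. A sketch, assuming flat_index, ivf_index and pq_index hold the three indexes built earlier:

import time

for name, idx in [("FlatL2", flat_index), ("IVFFlat", ivf_index), ("IVFPQ", pq_index)]:
    start = time.perf_counter()
    idx.search(xquery, top_related_content)
    print(f"{name}: {(time.perf_counter() - start) * 1000:.2f} ms")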

Notebook link

(PART IV…..Coming)
