Unleashing the Power of Sentence Transformers: Revolutionising Semantic Search and Sentence Similarity

Sakil Ansari
5 min read · Mar 5, 2023


PART II: Fine-tuning the Sentence Transformer

In my previous article, PART I, I discussed sentence transformer fundamentals, the architecture, and the various dataset formats for training the model. In this part, I am going to fine-tune a sentence transformer using the data format from Scenario 1.

Semantic similarity (image source)

Training Data Preparation

You can find the dataset here. Since Scenario 1 is being used, the dataset format should be two sentences with a label indicating how similar they are. I used an open-source fake news dataset. The dataset comprises a number of fields, but I selected only the title and content (text) of each news article. The sample data is shown below:

Original sample data
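
For illustration only, a training row in the Scenario 1 format pairs the title and article text with a float similarity label. The values below are invented, not taken from the dataset:

# A hypothetical training row in the Scenario 1 format (invented values)
row = {
    "text_a": "Stocks rally after strong earnings reports",  # news title
    "text_b": "Markets rose sharply on Thursday as companies beat expectations.",  # article text
    "label": 0.82,  # similarity score between the two texts
}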

Since the dataset requires labels, I calculated the cosine similarity between each news article's title and content to indicate how similar they are. I have not cleaned the data, although it is highly advisable to do so before calculating the cosine similarity. Cleaning the data includes removing stop words, links, ampersands, and unnecessary whitespace, and converting the text to lower case (a minimal sketch of such a step is shown after the sample below).

dataset sample with cosine similarity
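
As a minimal sketch of what such a cleaning step could look like (the clean_text helper below is my own illustration, not part of the original pipeline):

import re

def clean_text(text: str) -> str:
    """Minimal cleaning: links, ampersands, extra whitespace, lower-casing."""
    text = re.sub(r"https?://\S+", " ", text)            # remove links
    text = text.replace("&amp;", " ").replace("&", " ")  # remove ampersands
    text = re.sub(r"\s+", " ", text)                     # collapse whitespace
    return text.strip().lower()                          # lower-case the text

# Stop-word removal could be layered on top, e.g. with sklearn's or NLTK's
# stop-word lists.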

The code for preparing training data is given below.

# importing necessary libraries
# the code is saved as data_preparation.py
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore")

def get_data_preparation():
    """
    Prepare the data: select the title and text columns and drop missing rows.
    """
    data = pd.read_csv(r"fake_dataset.csv")
    data_select = data[['title', 'text']].copy()
    data_select.dropna(inplace=True)
    return data_select

def cosine_sim(row):
    """
    Compute the cosine similarity between the two texts of a row
    using their TF-IDF vectors.
    """
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(row.values)
    cos_sim = cosine_similarity(tfidf)
    return cos_sim[0][1]

def get_cosine_similarity():
    """
    Call get_data_preparation and apply cosine_sim row-wise to build
    the (text_a, text_b, label) training frame.
    """
    # apply the function to each row of the DataFrame
    data_select = get_data_preparation()
    data_select['cosine_sim'] = data_select[['title', 'text']].apply(cosine_sim, axis=1)

    # renaming the columns to the format the trainer expects
    data_select.columns = ['text_a', 'text_b', 'label']
    return data_select
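
To sanity-check the prepared data before training, you could run a quick usage sketch like this (assuming data_preparation.py and fake_dataset.csv sit in the working directory):

from data_preparation import get_cosine_similarity

train_data = get_cosine_similarity()
print(train_data.head())                # a few (text_a, text_b, label) rows
print(train_data['label'].describe())   # labels should fall in [0, 1]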

Great news! We now have the dataset in the required format.

sample data in required format

The Python snippet shown above is saved as data_preparation.py.

Fine-tuning the Model

The dataset has been prepared in the required format; now it is time to fine-tune the sentence transformer model. First, we need to install sentence-transformers on our system.

# ! pip install -U sentence-transformers

# importing the necessary libraries

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# calling the get_cosine_similarity method from data_preparation.py
# to get the data in the required format
from data_preparation import get_cosine_similarity

def get_input_examples():
    """
    Create InputExamples for fine-tuning the sentence transformer.
    """
    train_data = get_cosine_similarity()
    train_examples = []
    for i, row in train_data.iterrows():
        text_a = row['text_a']
        text_b = row['text_b']
        label = float(row['label'])  # CosineSimilarityLoss expects a float label
        example = InputExample(texts=[text_a, text_b], label=label)
        train_examples.append(example)
    return train_examples

# Load a pretrained base model to fine-tune. The base checkpoint is not named
# in the article; 'all-MiniLM-L6-v2' is a common choice and an assumption here.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define the training loss
train_loss = losses.CosineSimilarityLoss(model=model)

train_examples = get_input_examples()
# Define the training dataloader
train_dataloader = DataLoader(train_examples, batch_size=32, shuffle=True)

# Fine-tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=10, warmup_steps=100)

# saving your fine-tuned model
model.save("sentence_similarity_semantic_search")

The sentence transformer model has now been fine-tuned and saved. Once your fine-tuned model is saved, you will see the files below.

Fine-tuned model
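
The saved model can be reloaded from the local directory for a quick smoke test (a minimal sketch):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence_similarity_semantic_search")  # local path
embedding = model.encode("A quick smoke-test sentence.")
print(embedding.shape)  # dimensionality of the sentence embedding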

This fine-tuned model can be used for inference. I have hosted the fine-tuned model on Hugging Face.

Inferencing

I’ll now demonstrate how we can use our fine-tuned model for inference.

Problem Statement:

Develop a system to determine the top five most similar pairs of sentences from a given corpus based on semantic and contextual meaning.

Solution

I’ll be using the fine-tuned model that is hosted on Hugging Face.

# Installing the required libraries
# ! pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util

# downloading the fine-tuned model from Hugging Face
model_name = "Sakil/sentence_similarity_semantic_search"
model = SentenceTransformer(model_name)

sentences = ['I like super heroes movies.',
             'Batman is my favorite character.',
             'Joker is a great villain.',
             'Joker is laughing.',
             'Batman is driving his favorite car.',
             'Super heroes movies are fantastic.',
             'Avatar 2 is a great movie.',
             'I love eating while watching movies.',
             "I don't like to watch serious movies.",
             'Marvel has great characters.',
             'DC has great VFX.']

# Encode all sentences
embeddings = model.encode(sentences)

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort the list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))
Inference output

Conclusion

In conclusion, fine-tuned sentence transformer models have become a powerful tool for natural language processing tasks. These models have the ability to encode textual inputs into high-dimensional vector representations, which can capture the semantic meaning of the input text.
With the ability to fine-tune these models on specific tasks, they have shown remarkable performance on tasks such as semantic textual similarity, semantic search, recommendation systems, and question answering. These models have great potential for a wide range of applications in various industries, including healthcare, finance, and e-commerce. As research in the field of natural language processing continues to progress, we can expect to see even more powerful and efficient sentence transformer models being developed, opening up new possibilities for language-based applications and improving the overall quality of communication.

PART III (Coming…)
