Custom Models
Although PolyFuzz has several models implemented, what if you have developed your own?
What if you want a different similarity/distance measure that is not defined in PolyFuzz?
That is where custom models come in. If you follow the structure of PolyFuzz's BaseMatcher, you can quickly implement any model you would like.
You simply create a class that inherits from BaseMatcher and make sure it has a function match that takes two lists as input and returns a pandas dataframe. That's it!
We start by creating our own model that implements the ratio similarity measure from RapidFuzz:
import numpy as np
import pandas as pd
from rapidfuzz import fuzz
from polyfuzz import PolyFuzz
from polyfuzz.models import BaseMatcher

class MyModel(BaseMatcher):
    def match(self, from_list, to_list, **kwargs):
        # Calculate distances
        matches = [[fuzz.ratio(from_string, to_string) / 100
                    for to_string in to_list] for from_string in from_list]

        # Get best matches
        mappings = [to_list[index] for index in np.argmax(matches, axis=1)]
        scores = np.max(matches, axis=1)

        # Prepare dataframe
        matches = pd.DataFrame({'From': from_list,
                                'To': mappings,
                                'Similarity': scores})
        return matches

MyModel can now be used within PolyFuzz and runs like every other model:
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
custom_matcher = MyModel()
model = PolyFuzz(custom_matcher).match(from_list, to_list)
Now we can visualize the results:
model.visualize_precision_recall(kde=True)
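
To see what the match function of MyModel computes without installing anything, the selection step can be sketched with the standard library alone. This is only an illustration: difflib.SequenceMatcher is a stand-in for RapidFuzz's ratio (its scores are not guaranteed to be identical), the helper name best_matches is hypothetical, and it returns a list of dictionaries rather than the pandas dataframe that PolyFuzz expects.

```python
from difflib import SequenceMatcher


def best_matches(from_list, to_list):
    """Return one row per `from_list` entry with its highest-scoring `to_list` entry."""
    rows = []
    for from_string in from_list:
        # Score this string against every candidate (values lie in [0, 1]).
        # SequenceMatcher is an assumption here, standing in for fuzz.ratio.
        scores = [SequenceMatcher(None, from_string, to_string).ratio()
                  for to_string in to_list]
        # Keep the candidate with the highest score (the first one wins ties).
        best_index = max(range(len(to_list)), key=scores.__getitem__)
        rows.append({"From": from_string,
                     "To": to_list[best_index],
                     "Similarity": scores[best_index]})
    return rows


from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
rows = best_matches(from_list, to_list)
for row in rows:
    print(row)
```

Each row has the same three fields (From, To, Similarity) that the dataframe returned by MyModel contains, so swapping in another similarity measure only means changing the scoring line.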
fit, transform, fit_transform
Although the above model can be used in production using fit, it does not track its state between fit and transform.
That is not needed here, since edit distances have to be recalculated anyway. If you have embeddings that you do not want to re-calculate, however, it is helpful to track the state between fit and transform. To do so, we can use the re_train parameter to define what happens when we re-train a model (for example when using fit) and what happens when we do not re-train a model (for example when using transform).
In the example below, when we set re_train=True, we calculate the embeddings from both the from_list and the to_list (if it is defined) and save the embeddings to the self.embeddings_to variable. Then, when we set re_train=False, we can prevent redoing the fit by leveraging the pre-calculated self.embeddings_to variable.
import pandas as pd
from sentence_transformers import SentenceTransformer
from polyfuzz.models import BaseMatcher
# Private helper; the relative imports (`._utils`, `._base`) only work inside the polyfuzz package
from polyfuzz.models._utils import cosine_similarity


class SentenceEmbeddings(BaseMatcher):
    def __init__(self, model_id):
        super().__init__(model_id)
        self.type = "Embeddings"
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.embeddings_to = None

    def match(self, from_list, to_list=None, re_train=True) -> pd.DataFrame:
        # Extract embeddings from the `from_list`
        embeddings_from = self.embedding_model.encode(from_list, show_progress_bar=False)

        # Reuse the stored embeddings unless we are re-training
        if not re_train:
            embeddings_to = self.embeddings_to
        elif to_list is None:
            embeddings_to = self.embedding_model.encode(from_list, show_progress_bar=False)
        else:
            embeddings_to = self.embedding_model.encode(to_list, show_progress_bar=False)

        # Extract matches
        matches = cosine_similarity(embeddings_from, embeddings_to, from_list, to_list)

        # Save the embeddings so that `transform` does not have to recalculate them
        self.embeddings_to = embeddings_to
        return matches

Then, we can use it as follows:
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
custom_matcher = SentenceEmbeddings("sentence-embeddings")
model = PolyFuzz(custom_matcher).fit(from_list)
By using the .fit function, embeddings are created from the from_list variable and saved. Then, when we run model.transform(to_list), the embeddings created from the from_list variable do not need to be recalculated.
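
The caching behaviour controlled by re_train can also be checked in isolation. The following is a minimal sketch under stated assumptions: CachingMatcher is a hypothetical toy class that does not inherit from BaseMatcher, its _encode method is a fake "embedding" (the set of characters in each string) standing in for an expensive model, and similarity is a simple Jaccard overlap. Unlike the example above, the toy also stores the reference strings itself, since there is no PolyFuzz wrapper to pass them back in.

```python
class CachingMatcher:
    """Toy matcher that mimics the `re_train` pattern with a fake encoder."""

    def __init__(self):
        self.to_list = None        # reference strings kept between calls
        self.embeddings_to = None  # reference "embeddings" kept between calls
        self.encode_calls = 0      # how often the (pretend) expensive encoder ran

    def _encode(self, strings):
        # Hypothetical stand-in for an embedding model: one character set per string.
        self.encode_calls += 1
        return [set(string) for string in strings]

    def match(self, from_list, to_list=None, re_train=True):
        embeddings_from = self._encode(from_list)

        if re_train:
            # Fit-like path: (re)compute and store the reference embeddings.
            self.to_list = list(from_list if to_list is None else to_list)
            self.embeddings_to = self._encode(self.to_list)
        elif self.embeddings_to is None:
            raise ValueError("Call match with re_train=True before re_train=False.")

        rows = []
        for string, emb in zip(from_list, embeddings_from):
            # Jaccard overlap of character sets (assumes non-empty strings).
            scores = [len(emb & ref) / len(emb | ref) for ref in self.embeddings_to]
            best = max(range(len(scores)), key=scores.__getitem__)
            rows.append((string, self.to_list[best], scores[best]))
        return rows


matcher = CachingMatcher()
matcher.match(["apple", "apples", "mouse"])                # fit-like call
calls_after_fit = matcher.encode_calls
result = matcher.match(["appl", "house"], re_train=False)  # transform-like call
print(result)
print(matcher.encode_calls - calls_after_fit)              # encoder runs for new strings only
```

The second call runs the encoder once, for the new strings, while the stored reference embeddings are reused. That is the saving the re_train flag is meant to provide when the encoder is a real embedding model.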