Models

Currently, the following models are implemented in PolyFuzz:
1. TF-IDF
2. EditDistance with RapidFuzz
3. FastText and GloVe
4. 🤗 Transformers
5. SentenceTransformers
6. Gensim
7. Spacy
8. Universal Sentence Encoder (USE)

With Flair, we can use all 🤗 Transformers that are publicly available. We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzz.

All models listed above can be found in polyfuzz.models and can be used to create and compare different matchers.

TF-IDF

Although the terms in TF-IDF are usually words, here we apply TF-IDF at the character level. We extract n-grams from each string and count how often these n-grams occur across all input strings. For example, with 3-grams, the word "hotel" is split into "hot", "ote", and "tel".
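
To make the n-gram step concrete, here is a minimal sketch in plain Python. The helper names `char_ngrams` and `ngram_frequencies` are made up for this illustration and are not part of PolyFuzz:

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all overlapping character n-grams of `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequencies(strings: list[str], n: int = 3) -> Counter:
    """Count how often each n-gram occurs across all input strings."""
    counts = Counter()
    for s in strings:
        counts.update(char_ngrams(s, n))
    return counts


print(char_ngrams("hotel"))  # ['hot', 'ote', 'tel']
print(ngram_frequencies(["hotel", "hot"])["hot"])  # 2
```

Strings shorter than `n` simply produce no n-grams in this sketch.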

After generating the n-grams and applying TF-IDF on these "terms", we can use cosine similarity to compare the generated TF-IDF vectors against each other.
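
The idea can be sketched without any dependencies. The code below is an illustration only, not PolyFuzz's implementation: the function names are hypothetical, and the smoothed IDF formula is an assumption chosen so that n-grams occurring in every string keep a non-zero weight.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build one sparse TF-IDF vector (dict: n-gram -> weight) per string."""
    grams = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings does each n-gram occur?
    df = Counter(g for counts in grams for g in counts)
    total = len(strings)
    # Smoothed IDF (an assumption of this sketch).
    return [
        {g: tf * (math.log((1 + total) / (1 + df[g])) + 1) for g, tf in counts.items()}
        for counts in grams
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


apple, apples, mouse = tfidf_vectors(["apple", "apples", "mouse"])
print(cosine_similarity(apple, apples))  # high: shares "app", "ppl", "ple"
print(cosine_similarity(apple, mouse))   # 0.0: no shared 3-grams
```

Strings that share many 3-grams score close to 1, while strings without any shared 3-gram score exactly 0.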

We simply load in TFIDF from polyfuzz.models and pass it to our PolyFuzz instance:

from polyfuzz.models import TFIDF
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

tfidf = TFIDF(n_gram_range=(3, 3), min_similarity=0, matcher_id="TF-IDF")
model = PolyFuzz(tfidf).match(from_list, to_list)

And that's it! You can play around with the TFIDF matcher until you get the results you are looking for. Note that if you increase min_similarity, some strings may not be matched at all.
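
The effect of a similarity threshold can be illustrated with a small sketch. The function `best_matches` and the scores below are hypothetical and only show the general idea; how PolyFuzz represents unmatched strings internally may differ.

```python
def best_matches(scores, min_similarity=0.0):
    """Pick, per source string, the best-scoring candidate at or above the threshold.

    `scores` maps each source string to a dict of {candidate: similarity}.
    Strings whose best score is below `min_similarity` are mapped to None.
    """
    result = {}
    for source, candidates in scores.items():
        target, similarity = max(candidates.items(), key=lambda item: item[1])
        result[source] = target if similarity >= min_similarity else None
    return result


# Hypothetical similarity scores, for illustration only.
scores = {
    "appl": {"apple": 0.78, "apples": 0.66, "mouse": 0.0},
    "house": {"apple": 0.0, "apples": 0.0, "mouse": 0.58},
    "recal": {"apple": 0.12, "apples": 0.10, "mouse": 0.0},
}

print(best_matches(scores, min_similarity=0.0))   # every string gets a match
print(best_matches(scores, min_similarity=0.75))  # only "appl" keeps its match
```

A low threshold keeps weak matches such as "recal" to "apple", whereas a high threshold leaves such strings unmatched.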

EditDistance

There are many edit distance functions one could use and the EditDistance model from polyfuzz.models allows you to pass in any distance function. As long as that distance function takes in two strings and spits out a float, you can pass anything!

In the example below, we are going to be using Jaro Winkler Similarity from the jellyfish package to create our custom scorer:

from polyfuzz import PolyFuzz
from polyfuzz.models import EditDistance
from jellyfish import jaro_winkler_similarity

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

jellyfish_matcher = EditDistance(n_jobs=1, scorer=jaro_winkler_similarity)
model = PolyFuzz(jellyfish_matcher).match(from_list, to_list)
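
A scorer does not have to come from a dedicated string-matching package. As a sketch, the function below wraps `difflib.SequenceMatcher` from the Python standard library; the name `sequence_ratio` is made up for this example:

```python
from difflib import SequenceMatcher


def sequence_ratio(first: str, second: str) -> float:
    """Similarity in [0, 1]: twice the number of matching characters
    divided by the total length of both strings."""
    return SequenceMatcher(None, first, second).ratio()


print(sequence_ratio("apple", "apple"))   # 1.0
print(sequence_ratio("apple", "apples"))  # 2 * 5 / 11, roughly 0.91
```

Since it takes two strings and returns a float, it should be usable in the same way as the jellyfish scorer above, for example `EditDistance(n_jobs=1, scorer=sequence_ratio)`.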

RapidFuzz

Edit distance measures are typically quite slow. Moreover, the most heavily used implementation, fuzzywuzzy, has a restrictive licence (GPL). Instead, I decided to create a RapidFuzz matcher. RapidFuzz is a fast implementation of fuzzywuzzy and has a less restrictive licence (MIT):

from polyfuzz import PolyFuzz
from polyfuzz.models import RapidFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

rapidfuzz_matcher = RapidFuzz(n_jobs=1)
model = PolyFuzz(rapidfuzz_matcher).match(from_list, to_list)

Embeddings

With Flair, we can use all 🤗 Transformers that are publicly available. The embeddings that are created are compared with cosine similarity in order to understand how similar the created embeddings are to each other.

We simply have to instantiate any Flair WordEmbedding method and pass it through PolyFuzz:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import TransformerWordEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

bert = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert_matcher = Embeddings(bert, min_similarity=0)

models = PolyFuzz(bert_matcher).match(from_list, to_list)
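
The matching step itself, picking the target whose embedding has the highest cosine similarity, can be sketched in plain Python. The 3-dimensional vectors below are made up for illustration; a real model such as BERT produces vectors with hundreds of dimensions, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest(from_vectors, to_vectors):
    """For each source string, return (best target, cosine similarity)."""
    return {
        source: max(
            ((target, cosine_similarity(vec, to_vec)) for target, to_vec in to_vectors.items()),
            key=lambda pair: pair[1],
        )
        for source, vec in from_vectors.items()
    }


# Made-up toy "embeddings", for illustration only.
from_vectors = {"appl": [0.9, 0.1, 0.0], "house": [0.1, 0.8, 0.3]}
to_vectors = {"apple": [1.0, 0.0, 0.0], "mouse": [0.0, 0.9, 0.4]}

for source, (target, score) in closest(from_vectors, to_vectors).items():
    print(f"{source} -> {target} ({score:.3f})")
```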

Flair allows you to pool word embeddings to create more powerful word embeddings. Below, we pool FastText and BERT to create a single embedding representation from which we can calculate the similarity between strings:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import TransformerWordEmbeddings, WordEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

bert = TransformerWordEmbeddings('bert-base-multilingual-cased')
fasttext = WordEmbeddings('en-crawl')

# Passing a list of Flair embeddings pools them into a single representation
pooled_matcher = Embeddings([fasttext, bert], min_similarity=0)

model = PolyFuzz(pooled_matcher).match(from_list, to_list)
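
Conceptually, pooling embeddings from several models can be thought of as concatenating each token's vectors and then averaging over the tokens. The sketch below shows that idea with toy lookup tables. It is an assumption about the general technique, not Flair's actual code, and `pool`, `model_a`, and `model_b` are hypothetical names.

```python
def pool(tokens, *lookups):
    """Concatenate each token's vectors from all lookups, then average over tokens."""
    stacked = [
        [value for lookup in lookups for value in lookup[token]]
        for token in tokens
    ]
    dimensions = len(stacked[0])
    return [sum(vec[d] for vec in stacked) / len(stacked) for d in range(dimensions)]


# Toy lookup tables standing in for two embedding models (values are made up).
model_a = {"red": [1.0, 0.0], "apple": [0.0, 1.0]}
model_b = {"red": [0.5], "apple": [1.5]}

print(pool(["red", "apple"], model_a, model_b))  # [0.5, 0.5, 1.0]
```

The pooled vector has as many dimensions as both models combined, so each model contributes its own view of the string to the similarity score.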

SentenceTransformers

We can use sentence-transformers to generate embeddings from our input list and find the closest matching entities using cosine similarity. We simply have to instantiate our SentenceEmbeddings class and pass it to PolyFuzz:

from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model).match(from_list, to_list)

For a full list of possible models, see the pretrained models page of the sentence-transformers documentation.

You can also use a custom SentenceTransformer model:

from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings
from sentence_transformers import SentenceTransformer

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
distance_model = SentenceEmbeddings(embedding_model)
model = PolyFuzz(distance_model).match(from_list, to_list)

Gensim

We can use gensim to load in a word embedding model to generate embeddings from our input list and find the closest matching entities using cosine similarity. We simply have to instantiate our GensimEmbeddings class and pass it to PolyFuzz:

from polyfuzz import PolyFuzz
from polyfuzz.models import GensimEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model).match(from_list, to_list)

For a full list of possible models, see the gensim-data repository.

You can also use a custom Gensim model:

from polyfuzz import PolyFuzz
from polyfuzz.models import GensimEmbeddings
import gensim.downloader as api

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

embedding_model = api.load("glove-twitter-25")
distance_model = GensimEmbeddings(embedding_model)
model = PolyFuzz(distance_model).match(from_list, to_list)

Spacy

We can use spacy to load in an embedding model to generate embeddings from our input list and find the closest matching entities using cosine similarity. We simply have to instantiate our SpacyEmbeddings class and pass it to PolyFuzz:

from polyfuzz import PolyFuzz
from polyfuzz.models import SpacyEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model).match(from_list, to_list)

For a full list of possible models, see the models overview in the spaCy documentation.

You can also use a custom Spacy model:

from polyfuzz import PolyFuzz
from polyfuzz.models import SpacyEmbeddings
import spacy


from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

embedding_model = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
distance_model = SpacyEmbeddings(embedding_model)
model = PolyFuzz(distance_model).match(from_list, to_list)

Universal Sentence Encoder (USE)

The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here for embedding the strings. The model is trained and optimized for greater-than-word length text, such as sentences, phrases, or short paragraphs.

We simply have to instantiate our USEEmbeddings class and pass it to PolyFuzz:

from polyfuzz import PolyFuzz
from polyfuzz.models import USEEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
model = PolyFuzz(distance_model).match(from_list, to_list)

For a full list of possible models, see TensorFlow Hub.

You can also use a custom USE model:

from polyfuzz import PolyFuzz
from polyfuzz.models import USEEmbeddings
import tensorflow_hub

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
distance_model = USEEmbeddings(embedding_model)
model = PolyFuzz(distance_model).match(from_list, to_list)

Using Multiple Models

You can also pass a list of matchers to PolyFuzz in order to run several models on the same input and compare their results:

from polyfuzz import PolyFuzz
from polyfuzz.models import EditDistance, TFIDF, Embeddings
from flair.embeddings import TransformerWordEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

bert = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert_matcher = Embeddings(bert, min_similarity=0, matcher_id="BERT")
tfidf_matcher = TFIDF(min_similarity=0)
edit_matcher = EditDistance()

matchers = [bert_matcher, tfidf_matcher, edit_matcher]
models = PolyFuzz(matchers).match(from_list, to_list)

To access the results, we can call get_matches. Since we have multiple models, it returns a dictionary of dataframes, one per model.
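
The shape of that result can be illustrated with a dependency-free sketch: every scorer is run on the same lists and its matches are stored under its id. This only mirrors the idea; the scorers, the `match_all` helper, and the use of tuples instead of dataframes are assumptions of this example.

```python
from difflib import SequenceMatcher


def sequence_ratio(a, b):
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def trigram_jaccard(a, b):
    """Share of character 3-grams that both strings have in common."""
    grams_a = {a[i:i + 3] for i in range(len(a) - 2)}
    grams_b = {b[i:i + 3] for i in range(len(b) - 2)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def match_all(scorers, from_list, to_list):
    """Run every scorer and collect (from, to, similarity) rows under its id."""
    results = {}
    for scorer_id, scorer in scorers.items():
        results[scorer_id] = [
            (source, *max(((t, scorer(source, t)) for t in to_list), key=lambda p: p[1]))
            for source in from_list
        ]
    return results


results = match_all(
    {"sequence": sequence_ratio, "trigram": trigram_jaccard},
    ["appl", "house"],
    ["apple", "apples", "mouse"],
)
print(results["trigram"])  # rows for one specific scorer, selected by id
```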

In order to access the results of a specific model, call get_matches with the correct id:

>>> models.get_matches("BERT")
         From      To  Similarity
0       apple   apple    1.000000
1      apples  apples    1.000000
2        appl   apple    0.928045
3       recal  apples    0.825268
4       house   mouse    0.887524
5  similarity   mouse    0.791548

Finally, visualize the results to compare the models:

models.visualize_precision_recall(kde=True)