  • Added new models (SentenceTransformers, Gensim, USE, spaCy)
  • Added .fit, .transform, and .fit_transform methods
  • Added .save and PolyFuzz.load()


from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings
distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model)


from polyfuzz import PolyFuzz
from polyfuzz.models import GensimEmbeddings
distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model)


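from polyfuzz import PolyFuzz
from polyfuzz.models import USEEmbeddings
# Pass the TensorFlow Hub URL of a Universal Sentence Encoder model here
distance_model = USEEmbeddings("")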
model = PolyFuzz(distance_model)


from polyfuzz import PolyFuzz
from polyfuzz.models import SpacyEmbeddings
distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model)

fit, transform, fit_transform
Add fit, transform, and fit_transform in order to use PolyFuzz in production #34

from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit the model on the training words
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)

In the code above, we fit our TF-IDF model on train_words and use .transform() to match the words in unseen_words to the words that we trained on in train_words.
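To make the fit/transform split concrete, here is a minimal standard-library sketch of the same pattern. It is not PolyFuzz's implementation: `TinyMatcher` is a hypothetical class, and `difflib.SequenceMatcher` stands in for the TF-IDF similarity.

```python
from difflib import SequenceMatcher


class TinyMatcher:
    """Hypothetical stand-in for a fitted matcher; not PolyFuzz's implementation."""

    def __init__(self):
        self.known_words = []

    def fit(self, words):
        # Remember the vocabulary that later inputs will be matched against.
        self.known_words = list(words)
        return self

    def transform(self, words):
        # For each incoming word, find the closest remembered word and its score.
        results = []
        for word in words:
            best = max(
                self.known_words,
                key=lambda known: SequenceMatcher(None, word, known).ratio(),
            )
            score = SequenceMatcher(None, word, best).ratio()
            results.append((word, best, round(score, 3)))
        return results


train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

matcher = TinyMatcher().fit(train_words)
matches = matcher.transform(unseen_words)
```

The point of the split is that `fit` runs once on the reference vocabulary, while `transform` can be called repeatedly on new inputs without refitting.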

After fitting our model, we can save it as follows:

model.save("my_model")

Then, we can load our model to be used elsewhere:

from polyfuzz import PolyFuzz

model = PolyFuzz.load("my_model")


  • Make sure that when you use two lists that are exactly the same, it will return 1 for identical terms:
from polyfuzz import PolyFuzz

from_list = ["apple", "house"]
model = PolyFuzz("TF-IDF")
model.match(from_list, from_list)

This matches each word in from_list to itself with a score of 1: apple is matched to apple and house to house. However, if you pass only a single list, each word is matched against the other words in that list and never against itself:

from polyfuzz import PolyFuzz

from_list = ["apple", "apples"]
model = PolyFuzz("TF-IDF")
model.match(from_list)

In the example above, apple will be mapped to apples and not to apple. Here, we assume that the user wants to find the most similar words within a list without mapping any word to itself.
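The difference between the two calling modes can be sketched with the standard library alone. This is an illustration of the described behavior, not PolyFuzz's code: `best_matches` is a hypothetical helper, and `difflib.SequenceMatcher` replaces the TF-IDF similarity.

```python
from difflib import SequenceMatcher


def best_matches(from_list, to_list=None):
    """Hypothetical helper illustrating the two calling modes described above."""
    # With a single list, compare the list against itself but skip self-matches.
    skip_self = to_list is None
    candidates = from_list if to_list is None else to_list

    matches = {}
    for i, word in enumerate(from_list):
        best_word, best_score = None, -1.0
        for j, candidate in enumerate(candidates):
            if skip_self and i == j:
                continue  # a word may not be matched to its own entry
            score = SequenceMatcher(None, word, candidate).ratio()
            if score > best_score:
                best_word, best_score = candidate, score
        matches[word] = (best_word, best_score)
    return matches


# Two identical lists: every word finds itself with a score of 1.
two_lists = best_matches(["apple", "house"], ["apple", "house"])

# One list only: self-matches are excluded, so apple pairs with apples.
one_list = best_matches(["apple", "apples"])
```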


  • Update NumPy to "numpy>=1.20.0" to prevent two reported issues
  • Update PyTorch to "torch>=1.4.0,<1.7.1" to prevent the save_state_warning error


  • Fix exploding memory usage when using top_n


  • Use top_n in polyfuzz.models.TFIDF and polyfuzz.models.Embeddings
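The top_n entries above do not describe how the option is implemented. As a rough illustration of what a top_n setting means, the sketch below keeps only the n best candidates per word instead of storing every pairwise score. `top_n_matches` is a hypothetical helper, and `difflib.SequenceMatcher` is an assumed stand-in for the library's similarity measures.

```python
import heapq
from difflib import SequenceMatcher


def top_n_matches(word, candidates, top_n=2):
    """Hypothetical helper: return the top_n most similar candidates for one word."""
    # Score candidates lazily so that the full score list is never materialized.
    scored = ((SequenceMatcher(None, word, c).ratio(), c) for c in candidates)
    # heapq.nlargest consumes the generator and retains only top_n items.
    return [(c, round(s, 3)) for s, c in heapq.nlargest(top_n, scored)]


candidates = ["apple", "apples", "appl", "house"]
top_two = top_n_matches("apple", candidates, top_n=2)
```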


  • Update grouping to include all strings only if identical lists of strings are compared


  • Update naming convention matcher --> model
  • Update documentation
  • Add basic models to grouper
  • Fix issues with vector order in cosine similarity
  • Update naming of cosine similarity function


  • Additional tests
  • More thorough documentation
  • Prepare for public release


  • First release of PolyFuzz
  • Matching through:
    • Edit Distance
    • TF-IDF
    • Embeddings
    • Custom models
  • Grouping of results with custom models
  • Evaluation through precision-recall curves