Releases
v0.4.0¶
- Added new models (SentenceTransformers, Gensim, USE, Spacy)
- Added
.fit
,.transform
, and.fit_transform
methods - Added
.save
andPolyFuzz.load()
SentenceTransformers
from polyfuzz.models import SentenceEmbeddings
distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model)
Gensim
from polyfuzz.models import GensimEmbeddings
distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model)
USE
from polyfuzz.models import USEEmbeddings
distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
model = PolyFuzz(distance_model)
Spacy
from polyfuzz.models import SpacyEmbeddings
distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model)
fit, transform, fit_transform
Add fit
, transform
, and fit_transform
in order to use PolyFuzz in production #34
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from polyfuzz import PolyFuzz
train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]
# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)
# Transform
results = model.transform(unseen_words)
In the code above, we fit our TF-IDF model on train_words
and use .transform()
to match the words in unseen_words
to the words that we trained on in train_words
.
After fitting our model, we can save it as follows:
model.save("my_model")
Then, we can load our model to be used elsewhere:
from polyfuzz import PolyFuzz
model = PolyFuzz.load("my_model")
v0.3.4¶
- Make sure that when you use two lists that are exactly the same, it will return 1 for identical terms:
from polyfuzz import PolyFuzz
from_list = ["apple", "house"]
model = PolyFuzz("TF-IDF")
model.match(from_list, from_list)
This will match each word in from_list
to itself and give it a score of 1. Thus, apple
will be matched to apple
and
house
will be mapped to house
. However, if you input just a single list, it will try to map them within the list without
mapping to itself:
from polyfuzz import PolyFuzz
from_list = ["apple", "apples"]
model = PolyFuzz("TF-IDF")
model.match(from_list)
In the example above, apple
will be mapped to apples
and not to apple
. Here, we assume that the user wants to
find the most similar words within a list without mapping to itself.
v0.3.3¶
- Update numpy to "numpy>=1.20.0" to prevent this and this issue
- Update pytorch to "torch>=1.4.0,<1.7.1" to prevent save_state_warning error
v0.3.2¶
- Fix exploding memory usage when using
top_n
v0.3.0¶
- Use
top_n
inpolyfuzz.models.TFIDF
andpolyfuzz.models.Embeddings
v0.2.2¶
- Update grouping to include all strings only if identical lists of strings are compared
v0.2.0¶
- Update naming convention matcher --> model
- Update documentation
- Add basic models to grouper
- Fix issues with vector order in cosine similarity
- Update naming of cosine similarity function
v0.1.0¶
- Additional tests
- More thorough documentation
- Prepare for public release
v0.0.1¶
- First release of
PolyFuzz
- Matching through:
- Edit Distance
- TF-IDF
- Embeddings
- Custom models
- Grouping of results with custom models
- Evaluation through precision-recall curves