polyfuzz.models.TFIDF
A character-based n-gram TF-IDF model that approximates edit distance.
Each string is split into character n-grams, typically of length 3. For example, the 3-grams of "hotel" are ['hot', 'ote', 'tel']. These n-grams are used as input to a TfidfVectorizer, which creates a vector for each string. The most similar strings are then found by cosine similarity between those vectors, computed with the method selected through cosine_method (for example, k-NN).
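The pipeline described above can be sketched with scikit-learn alone. This is a minimal illustration, not the PolyFuzz implementation: the helper `char_ngrams`, the example strings, and the choice to fit the vectorizer on the target list only are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text` (illustrative helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("hotel"))  # ['hot', 'ote', 'tel']

from_list = ["hotel"]
to_list = ["hotels", "apple"]

# analyzer="char" with ngram_range=(3, 3) makes the vectorizer build
# character 3-grams itself, so char_ngrams is only shown for intuition.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
to_vecs = vectorizer.fit_transform(to_list)    # assumption: fit on targets only
from_vecs = vectorizer.transform(from_list)

# One row per string in from_list, one column per string in to_list.
similarity = cosine_similarity(from_vecs, to_vecs)
best_index = int(similarity[0].argmax())
best_match = to_list[best_index]
best_score = float(similarity[0, best_index])
print(best_match, round(best_score, 3))
```

"hotel" shares three of its 3-grams with "hotels" and none with "apple", so "hotels" is the best match with a similarity strictly between 0 and 1.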
Parameters
Name | Type | Description | Default |
---|---|---|---|
n_gram_range | Tuple[int, int] | The n_gram_range on a character-level | (3, 3) |
clean_string | bool | Whether to clean the string such that only alphanumerical characters are kept | True |
min_similarity | float | The minimum similarity between strings, otherwise return 0 similarity | 0.75 |
top_n | int | The number of matches you want returned | 1 |
cosine_method | str | The method/package for calculating the cosine similarity. Options: sparse, sklearn, knn | 'sparse' |
* sparse is the fastest and most memory efficient, but requires a package that might be difficult to install.
* sklearn is a bit slower than sparse and requires significantly more memory, as the distance matrix is not sparse.
* knn uses 1-nearest neighbor to extract the most similar strings; it is significantly slower than both other methods but requires little memory.

model_id: The name of the particular instance, used when comparing models
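To make the trade-off between the sklearn and knn options more concrete, here is a rough sketch of the two strategies using scikit-learn directly. It is an assumption about what each option does conceptually, not the library's code; the sparse option is left out because it depends on an extra package. The variable names and the way the min_similarity threshold is applied are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

from_list = ["apple", "apples", "mouse"]
to_list = ["apple", "apples", "house", "mouse"]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
to_vecs = vectorizer.fit_transform(to_list)
from_vecs = vectorizer.transform(from_list)

# "sklearn"-style: build the full len(from_list) x len(to_list) matrix.
# Simple, but the dense matrix grows with the product of the list sizes.
dense = cosine_similarity(from_vecs, to_vecs)
dense_idx = dense.argmax(axis=1)
dense_sim = dense.max(axis=1)

# "knn"-style: ask for the single nearest neighbour of each query vector.
# Only one distance per query is kept, so memory use stays small.
knn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(to_vecs)
distances, indices = knn.kneighbors(from_vecs)
knn_idx = indices[:, 0]
knn_sim = 1.0 - distances[:, 0]  # cosine similarity = 1 - cosine distance

# Illustrative min_similarity handling: scores below the threshold become 0.
min_similarity = 0.75
thresholded = np.where(dense_sim >= min_similarity, dense_sim, 0.0)
```

Both strategies pick the same target for every query here; they differ in how much of the similarity matrix is materialized along the way.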
Usage:
from polyfuzz.models import TFIDF
model = TFIDF(n_gram_range=(3, 3), clean_string=True, cosine_method="sparse")
match(self, from_list, to_list=None, re_train=True)
Match two lists of strings to each other and return the most similar strings
Parameters
Name | Type | Description | Default |
---|---|---|---|
from_list | List[str] | The list from which you want mappings | required |
to_list | List[str] | The list where you want to map to | None |
re_train | bool | Whether to re-train the model with new embeddings. Set this to False if you want to use this model in production | True |
Returns
Type | Description |
---|---|
DataFrame | matches: The best matches between the lists of strings |
Usage:
from polyfuzz.models import TFIDF
model = TFIDF()
matches = model.match(["string_one", "string_two"],
["string_three", "string_four"])
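The re_train flag is easiest to understand as "fit once, then only transform". The class below is a hypothetical stand-in written for illustration; it is not PolyFuzz's implementation, returns a list of tuples rather than a DataFrame, and assumes that re_train=False reuses the vectorizer and target vectors stored by an earlier call.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TinyTfidfMatcher:
    """Hypothetical sketch of fit-once/transform-later matching."""

    def __init__(self):
        self.vectorizer = None
        self.to_list = None
        self.to_vecs = None

    def match(self, from_list, to_list=None, re_train=True):
        if re_train:
            if to_list is None:
                # This sketch does not cover the to_list=None case.
                raise ValueError("to_list is required when re_train=True")
            # Fit vocabulary and IDF weights on the targets and keep them.
            self.vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
            self.to_list = list(to_list)
            self.to_vecs = self.vectorizer.fit_transform(self.to_list)
        elif self.vectorizer is None:
            raise ValueError("call match(..., re_train=True) before re_train=False")

        # Transform only: n-grams never seen during fitting are ignored.
        from_vecs = self.vectorizer.transform(from_list)
        sims = cosine_similarity(from_vecs, self.to_vecs)
        best = sims.argmax(axis=1)
        return [
            (query, self.to_list[j], float(sims[i, j]))
            for i, (query, j) in enumerate(zip(from_list, best))
        ]


matcher = TinyTfidfMatcher()
matcher.match(["apple"], ["apple", "mouse"])        # fits and stores the targets
result = matcher.match(["mouse"], re_train=False)   # reuses the fitted state
print(result)
```

Keeping the fitted state fixed means the same input always maps to the same vector, which is the usual reason to avoid re-training in production.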