polyfuzz.models.Embeddings
Embed words into vectors and use cosine similarity to find the best matches between two lists of strings.
Parameters
Name | Type | Description | Default |
---|---|---|---|
embedding_method | Optional[List] | List of Flair embeddings to use | None |
min_similarity | float | The minimum similarity between strings; matches below this value are returned with a similarity of 0 | 0.75 |
top_n | int | The number of best matches you want returned | 1 |
cosine_method | str | The method/package for calculating the cosine similarity. Options: "sparse", "sklearn", "knn". "sparse" is the fastest and most memory-efficient, but requires a package that might be difficult to install. "sklearn" is a bit slower than "sparse" and requires significantly more memory because the distance matrix is not sparse. "knn" uses 1-nearest neighbor to extract the most similar strings; it is significantly slower than the other two methods but requires little memory. | 'sparse' |
model_id | str | The name of the particular instance, used when comparing models | None |
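The interplay of `min_similarity` and `top_n` can be illustrated with a small standard-library sketch. This is not PolyFuzz's implementation: the helper names `cosine` and `best_matches` are hypothetical, the vectors are made up, and the choice to report a below-threshold match as `(None, 0.0)` is an assumption based only on the parameter description above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matches(from_vecs, to_vecs, min_similarity=0.75, top_n=1):
    """Hypothetical helper: for each 'from' vector, keep the top_n most
    similar 'to' vectors as (to_index, similarity) pairs.

    Assumption: a candidate scoring below min_similarity is reported
    as (None, 0.0) instead of being dropped.
    """
    results = []
    for u in from_vecs:
        # Score every candidate, highest similarity first.
        scored = sorted(
            ((cosine(u, v), j) for j, v in enumerate(to_vecs)),
            reverse=True,
        )[:top_n]
        results.append(
            [(j, s) if s >= min_similarity else (None, 0.0) for s, j in scored]
        )
    return results


# Toy 2-dimensional "embeddings", purely illustrative.
from_vecs = [[1.0, 0.0], [0.0, 1.0]]
to_vecs = [[1.0, 0.1], [-1.0, 0.0]]
matches = best_matches(from_vecs, to_vecs, min_similarity=0.5, top_n=1)
```

Here the first vector finds a close match at index 0, while the second has no candidate above the threshold and so falls back to a zero similarity.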
Usage:
from polyfuzz.models import Embeddings
model = Embeddings(min_similarity=0.5)
To use a custom word embedding model, pass it in as a list:
from flair.embeddings import WordEmbeddings
embedding_model = WordEmbeddings('news')
model = Embeddings([embedding_model], min_similarity=0.5)
You can also pass multiple word embedding models, in which case the results will be averaged:
from flair.embeddings import WordEmbeddings, TransformerWordEmbeddings
fasttext_embedding = WordEmbeddings('news')
glove_embedding = WordEmbeddings('glove')
bert_embedding = TransformerWordEmbeddings('bert-base-multilingual-cased')
model = Embeddings([glove_embedding,
                    fasttext_embedding,
                    bert_embedding], min_similarity=0.5)
match(self, from_list, to_list=None, embeddings_from=None, embeddings_to=None, re_train=True)
Matches the two lists of strings to each other and returns the best mapping.
Parameters
Name | Type | Description | Default |
---|---|---|---|
from_list | List[str] | The list from which you want mappings | required |
to_list | List[str] | The list to which you want to map | None |
embeddings_from | ndarray | Embeddings you created yourself from the from_list | None |
embeddings_to | ndarray | Embeddings you created yourself from the to_list | None |
re_train | bool | Whether to re-train the model with new embeddings. Set this to False if you want to use this model in production | True |
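One way to read `re_train` is that, when it is False, the embeddings of the `to_list` computed in an earlier call are reused instead of being recomputed. The sketch below illustrates that caching pattern with standard-library Python only. It rests on that assumption and is not PolyFuzz's code: `CachedMatcher` and `toy_embed` are hypothetical names, and the "embedding" is a toy feature vector.

```python
class CachedMatcher:
    """Hypothetical helper (not part of PolyFuzz) that caches target embeddings."""

    def __init__(self, embed):
        self.embed = embed         # callable: list of str -> list of vectors
        self.embeddings_to = None  # cached embeddings of the target list

    def match(self, from_list, to_list=None, re_train=True):
        if re_train:
            # Recompute and cache the target embeddings; fall back to
            # matching from_list against itself when no to_list is given.
            targets = to_list if to_list is not None else from_list
            self.embeddings_to = self.embed(targets)
        elif self.embeddings_to is None:
            raise ValueError("re_train=False requires an earlier call with re_train=True")
        # The incoming strings are always embedded fresh.
        embeddings_from = self.embed(from_list)
        return embeddings_from, self.embeddings_to


calls = []


def toy_embed(strings):
    """Toy stand-in for an embedding model; records each invocation."""
    calls.append(list(strings))
    return [[len(s), s.count("a")] for s in strings]


matcher = CachedMatcher(toy_embed)
matcher.match(["apple"], ["apples", "banana"])  # embeds both lists
matcher.match(["apricot"], re_train=False)      # embeds only the new from_list
```

Under this reading, the second call avoids re-embedding the target list, which is the kind of saving that matters when the same reference list is matched against repeatedly.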
Returns
Type | Description |
---|---|
DataFrame | matches: The best matches between the lists of strings |
Usage:
model = Embeddings(min_similarity=0.5)
matches = model.match(["string_one", "string_two"],
                      ["string_three", "string_four"])