polyfuzz.models.USEEmbeddings
Embed strings into vectors with a Universal Sentence Encoder (USE) model and use cosine similarity to find the best matches between two lists of strings.
Parameters
Name | Type | Description | Default |
---|---|---|---|
embedding_model | | The USE model to use; this can be either a string or the model itself | 'https://tfhub.dev/google/universal-sentence-encoder/4' |
min_similarity | float | The minimum similarity between strings; below this threshold a similarity of 0 is returned | 0.75 |
top_n | int | The number of best matches you want returned | 1 |
cosine_method | str | The method/package for calculating the cosine similarity. Options: "sparse", "sklearn", "knn". "sparse" is the fastest and most memory-efficient option but requires a package that might be difficult to install. "sklearn" is a bit slower than "sparse" and requires significantly more memory because the distance matrix is not sparse. "knn" uses 1-nearest neighbor to extract the most similar strings; it is significantly slower than the other two methods but requires little memory | 'sparse' |
model_id | str | The name of the particular instance, used when comparing models | None |
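To make the roles of `min_similarity` and `top_n` concrete, here is a minimal NumPy sketch of cosine-similarity matching over a dense similarity matrix. It is an illustration under stated assumptions, not the library's implementation: the function name `cosine_top_n`, the toy 2-D vectors, and the choice to report a below-threshold match as `(None, 0.0)` are all hypothetical.

```python
import numpy as np


def cosine_top_n(from_vecs, to_vecs, to_list, min_similarity=0.75, top_n=1):
    """Return, for each row of from_vecs, up to top_n (match, similarity) pairs."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    # Assumes no row is the zero vector.
    a = from_vecs / np.linalg.norm(from_vecs, axis=1, keepdims=True)
    b = to_vecs / np.linalg.norm(to_vecs, axis=1, keepdims=True)
    sims = a @ b.T  # dense (n_from, n_to) similarity matrix

    results = []
    for row in sims:
        best = np.argsort(row)[::-1][:top_n]  # highest similarities first
        results.append([
            # Assumption: candidates below the threshold are reported as (None, 0.0).
            (to_list[j], float(row[j])) if row[j] >= min_similarity else (None, 0.0)
            for j in best
        ])
    return results


# Toy 2-D vectors standing in for real sentence embeddings.
to_list = ["apple", "banana"]
to_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
from_vecs = np.array([[0.9, 0.1], [0.6, 0.8]])

matches = cosine_top_n(from_vecs, to_vecs, to_list, min_similarity=0.9)
```

With these toy vectors the first query clears the 0.9 threshold and maps to "apple", while the second query's best candidate only reaches 0.8 and is therefore reported with a similarity of 0. Because the full matrix is materialised, memory grows with the product of the two list lengths, which is the trade-off the `cosine_method` description refers to.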
Usage:
from polyfuzz.models import USEEmbeddings
distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4", min_similarity=0.5)
Or if you want to directly pass a USE model:
import tensorflow_hub
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
distance_model = USEEmbeddings(embedding_model, min_similarity=0.5)
match(self, from_list, to_list=None, embeddings_from=None, embeddings_to=None, re_train=True)
Matches the two lists of strings to each other and returns the best mapping.
Parameters
Name | Type | Description | Default |
---|---|---|---|
from_list | List[str] | The list from which you want mappings | required |
to_list | List[str] | The list you want to map to | None |
embeddings_from | ndarray | Embeddings you created yourself from the from_list | None |
embeddings_to | ndarray | Embeddings you created yourself from the to_list | None |
re_train | bool | Whether to re-train the model with new embeddings. Set this to False if you want to use this model in production | True |
Returns
Type | Description |
---|---|
DataFrame | matches: The best matches between the lists of strings |
Usage:
model = USEEmbeddings(min_similarity=0.5)
matches = model.match(["string_one", "string_two"],
                      ["string_three", "string_four"])
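The `embeddings_from` and `embeddings_to` parameters accept arrays you computed yourself. The sketch below shows one way such arrays might be prepared and sanity-checked before being passed in. It assumes that row `i` of each array corresponds to string `i` of the matching list; `toy_embed` and `check_aligned` are hypothetical helpers, and the hashed character-bigram vectors are only a stand-in for a real embedding model.

```python
import numpy as np


def toy_embed(strings, dim=16):
    """Hypothetical stand-in for an embedding model: hashed character-bigram counts."""
    out = np.zeros((len(strings), dim))
    for i, s in enumerate(strings):
        for a, b in zip(s, s[1:]):
            # Deterministic bucket for each adjacent character pair.
            out[i, (ord(a) * 31 + ord(b)) % dim] += 1.0
    return out


def check_aligned(strings, embeddings):
    """Raise ValueError unless embeddings is 2-D with one row per string."""
    if embeddings.ndim != 2 or embeddings.shape[0] != len(strings):
        raise ValueError(
            f"expected shape ({len(strings)}, dim), got {embeddings.shape}"
        )


from_list = ["string_one", "string_two"]
to_list = ["string_three", "string_four"]

embeddings_from = toy_embed(from_list)
embeddings_to = toy_embed(to_list)
check_aligned(from_list, embeddings_from)
check_aligned(to_list, embeddings_to)
```

Following the signature above, the arrays would then be supplied as `model.match(from_list, to_list, embeddings_from=embeddings_from, embeddings_to=embeddings_to)`. Both arrays should come from the same embedding model so that their cosine similarities are meaningful.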