polyfuzz.models.cosine_similarity
Calculate similarity between two matrices/vectors and return best matches
Parameters
Name | Type | Description | Default |
---|---|---|---|
`from_vector` | `ndarray` | The matrix or vector representing the embedded strings to map from | required |
`to_vector` | `ndarray` | The matrix or vector representing the embedded strings to map to | required |
`from_list` | `List[str]` | The list from which you want mappings | required |
`to_list` | `List[str]` | The list you want to map to | required |
`min_similarity` | `float` | The minimum similarity between strings; pairs below this value are returned with a similarity of 0 | `0.75` |
`top_n` | `int` | The number of best matches you want returned | `1` |
`method` | `str` | The method/package for calculating the cosine similarity. Options: "sparse", "sklearn", "knn". "sparse" is the fastest and most memory-efficient, but requires a package that might be difficult to install. "sklearn" is a bit slower than "sparse" and requires significantly more memory, as the distance matrix is not sparse. "knn" uses 1-nearest neighbor to extract the most similar strings; it is significantly slower than both other methods but requires little memory. | `'sparse'` |
Returns
Type | Description |
---|---|
`DataFrame` | matches: The best matches between the lists of strings |
Usage:
Make sure to fill the `to_vector` and `from_vector` with vector representations of `to_list` and `from_list` respectively:
from polyfuzz.models import cosine_similarity
matches = cosine_similarity(from_vector, to_vector, from_list, to_list, method="sparse")
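To make the roles of the vectors, the lists, and `min_similarity` more concrete, here is a minimal NumPy-only sketch of the general idea: compute a dense cosine similarity matrix, take the single best match per string, and report a similarity of 0 when the best score falls below the threshold. This is not PolyFuzz's implementation. The function name `best_matches_sketch`, the toy embeddings, and the tuple-based return value are assumptions made for illustration only (the documented function returns a DataFrame).

```python
import numpy as np


def best_matches_sketch(from_vector, to_vector, from_list, to_list, min_similarity=0.75):
    """Illustrative only: dense cosine similarity with one best match per string."""
    # L2-normalise the rows so that a dot product equals the cosine similarity.
    # (Assumes no all-zero rows; those would cause a division by zero.)
    from_unit = from_vector / np.linalg.norm(from_vector, axis=1, keepdims=True)
    to_unit = to_vector / np.linalg.norm(to_vector, axis=1, keepdims=True)

    # Dense similarity matrix of shape (len(from_list), len(to_list)).
    similarity = from_unit @ to_unit.T

    # Index and score of the most similar target for every source string.
    best_idx = similarity.argmax(axis=1)
    best_sim = similarity[np.arange(len(from_list)), best_idx]

    matches = []
    for source, idx, sim in zip(from_list, best_idx, best_sim):
        if sim >= min_similarity:
            matches.append((source, to_list[idx], round(float(sim), 3)))
        else:
            # Below the threshold: no match, similarity reported as 0.
            matches.append((source, None, 0.0))
    return matches


from_list = ["apple", "apples", "pear"]
to_list = ["apple", "house"]

# Hypothetical toy embeddings; real ones would come from an embedding model.
from_vector = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 1.0]])
to_vector = np.array([[1.0, 0.0], [0.0, 1.0]])

matches = best_matches_sketch(from_vector, to_vector, from_list, to_list)
for row in matches:
    print(row)
```

The dense matrix here has one entry per (from, to) pair, which is the memory cost the "sklearn" option is described as paying; the "sparse" and "knn" options avoid materialising it in full.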