polyfuzz.models.cosine_similarity

Calculate similarity between two matrices/vectors and return best matches

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| from_vector | ndarray | The matrix or vector representing the embedded strings to map from | required |
| to_vector | ndarray | The matrix or vector representing the embedded strings to map to | required |
| from_list | List[str] | The list from which you want mappings | required |
| to_list | List[str] | The list you want to map to | required |
| min_similarity | float | The minimum similarity between strings; matches below this threshold are returned with a similarity of 0 | 0.75 |
| top_n | int | The number of best matches you want returned | 1 |
| method | str | The method/package used to calculate the cosine similarity. Options: "sparse", "sklearn", "knn". "sparse" is the fastest and most memory-efficient option but requires a package that might be difficult to install. "sklearn" is a bit slower than "sparse" and requires significantly more memory because the distance matrix is not sparse. "knn" uses 1-nearest neighbor to extract the most similar strings; it is significantly slower than both other methods but requires little memory. | 'sparse' |

Returns

| Type | Description |
| --- | --- |
| DataFrame | matches: The best matches between the lists of strings |
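
To make the roles of min_similarity and top_n concrete, here is a minimal pure-Python sketch of the matching logic described above. It is not the library's implementation: `cosine` and `best_matches` are hypothetical helpers, plain lists stand in for the ndarray inputs, and the (from, to, similarity) row layout and the `None` placeholder for a rejected match are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matches(from_vector, to_vector, from_list, to_list,
                 min_similarity=0.75, top_n=1):
    """Hypothetical stand-in: return (from, to, similarity) rows, best first."""
    rows = []
    for source, u in zip(from_list, from_vector):
        # Score this source string against every candidate target.
        scored = sorted(
            ((cosine(u, v), target) for target, v in zip(to_list, to_vector)),
            key=lambda pair: pair[0],
            reverse=True,
        )
        for score, target in scored[:top_n]:
            if score >= min_similarity:
                rows.append((source, target, round(score, 3)))
            else:
                # Assumption: below the threshold, report no match and similarity 0.
                rows.append((source, None, 0.0))
    return rows


# Toy inputs: row i of each vector list represents item i of the matching string list.
from_list = ["apple", "house"]
to_list = ["apples", "mouse", "car"]
from_vector = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2]]
to_vector = [[0.9, 0.1, 0.0], [0.0, 0.6, 0.8], [0.0, 0.0, 1.0]]

matches = best_matches(from_vector, to_vector, from_list, to_list)
```

With these toy vectors, "apple" maps to "apples", while the best candidate for "house" ("mouse", at roughly 0.745) falls just under the default threshold of 0.75 and is reported with a similarity of 0. Lowering min_similarity to 0.7 would keep that match.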

Usage:

Make sure to fill the to_vector and from_vector with vector representations of to_list and from_list respectively:

from polyfuzz.models import cosine_similarity
matches = cosine_similarity(from_vector, to_vector, from_list, to_list, method="sparse")
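
The requirement that to_vector and from_vector represent to_list and from_list "respectively" means two things: row i of each matrix must correspond to item i of its list, and both matrices must share the same columns. The sketch below shows one way to satisfy both, using simple character trigram counts. The helpers `char_ngrams` and `embed` are hypothetical, and the embedding itself is only an illustrative choice; in practice you would more likely pass NumPy arrays or sparse matrices produced by a vectorizer or embedding model.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams (padded with spaces)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def embed(strings, vocabulary, n=3):
    """Turn each string into a count vector over a fixed, shared vocabulary."""
    index = {gram: col for col, gram in enumerate(vocabulary)}
    vectors = []
    for text in strings:
        counts = Counter(char_ngrams(text, n))
        row = [0.0] * len(vocabulary)
        for gram, count in counts.items():
            if gram in index:
                row[index[gram]] = float(count)
        vectors.append(row)
    return vectors


from_list = ["apple", "house"]
to_list = ["apples", "mouse", "car"]

# Build ONE vocabulary from both lists so that each column means the same
# n-gram in from_vector and in to_vector.
vocabulary = sorted({g for s in from_list + to_list for g in char_ngrams(s)})

from_vector = embed(from_list, vocabulary)  # row i represents from_list[i]
to_vector = embed(to_list, vocabulary)      # row j represents to_list[j]
```

Building the vocabulary separately for each list would give the two matrices different, incompatible columns, and any similarity computed between them would be meaningless.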