polyfuzz.models.SentenceEmbeddings

Embed strings into vectors and use cosine similarity to find the best matches between two lists of strings.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `embedding_model` | `Union[str, sentence_transformers.SentenceTransformer.SentenceTransformer]` | The SBERT model to use, given either as a model name (string) or as a model instance | `'all-MiniLM-L6-v2'` |
| `min_similarity` | `float` | The minimum similarity between strings; matches below this threshold are returned with a similarity of 0 | `0.75` |
| `top_n` | `int` | The number of best matches to return | `1` |
| `cosine_method` | `str` | The method/package used to calculate the cosine similarity. Options: "sparse", "sklearn", "knn". "sparse" is the fastest and most memory-efficient, but requires a package that might be difficult to install. "sklearn" is a bit slower than "sparse" and requires significantly more memory because the distance matrix is not sparse. "knn" uses 1-nearest neighbor to extract the most similar strings; it is significantly slower than both other methods but requires little memory. | `'sparse'` |
| `model_id` | `str` | The name of this particular instance, used when comparing models | `None` |
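
To make the roles of `min_similarity` and `top_n` concrete, here is a minimal pure-Python sketch of thresholded cosine matching. It is not PolyFuzz's implementation: the helper names `cosine` and `best_matches` are hypothetical, the two-dimensional vectors stand in for real sentence embeddings, and reporting a rejected match as `(None, 0.0)` is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_matches(from_vecs, to_vecs, to_list, min_similarity=0.75, top_n=1):
    """Return, per source vector, up to `top_n` (target, score) pairs.

    Candidates scoring below `min_similarity` are reported as (None, 0.0),
    which is this sketch's way of saying "0 similarity".
    """
    results = []
    for vec in from_vecs:
        # Score every target, then keep the top_n highest-scoring ones.
        scored = sorted(
            ((cosine(vec, target), name) for target, name in zip(to_vecs, to_list)),
            reverse=True,
        )
        results.append([
            (name, score) if score >= min_similarity else (None, 0.0)
            for score, name in scored[:top_n]
        ])
    return results


# Toy "embeddings": hypothetical values chosen only for illustration.
from_vecs = [[1.0, 0.0], [0.0, 1.0]]
to_vecs = [[0.9, 0.1], [-1.0, 0.2]]
to_list = ["apple", "banana"]

strict = best_matches(from_vecs, to_vecs, to_list, min_similarity=0.75, top_n=1)
loose = best_matches(from_vecs, to_vecs, to_list, min_similarity=0.1, top_n=2)
```

With the strict threshold the second source vector has no acceptable target, while lowering `min_similarity` and raising `top_n` returns both candidates for it, ranked by score.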

Usage:

from polyfuzz.models import SentenceEmbeddings
distance_model = SentenceEmbeddings("all-MiniLM-L6-v2", min_similarity=0.5)

Or, to pass an SBERT model directly:

from polyfuzz.models import SentenceEmbeddings
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
distance_model = SentenceEmbeddings(embedding_model, min_similarity=0.5)

match(self, from_list, to_list=None, embeddings_from=None, embeddings_to=None, re_train=True)

Source code in models\_sbert.py
    def match(self,
              from_list: List[str],
              to_list: List[str] = None,
              embeddings_from: np.ndarray = None,
              embeddings_to: np.ndarray = None,
              re_train: bool = True) -> pd.DataFrame:
        """ Matches the two lists of strings to each other and returns the best mapping

        Arguments:
            from_list: The list from which you want mappings
            to_list: The list where you want to map to
            embeddings_from: Embeddings you created yourself from the `from_list`
            embeddings_to: Embeddings you created yourself from the `to_list`
            re_train: Whether to re-train the model with new embeddings
                      Set this to False if you want to use this model in production

        Returns:
            matches: The best matches between the lists of strings

        Usage:

        ```python
        model = SentenceEmbeddings(min_similarity=0.5)
        matches = model.match(["string_one", "string_two"],
                              ["string_three", "string_four"])
        ```
        """
        # Extract embeddings from the `from_list` unless they were supplied by the caller
        if not isinstance(embeddings_from, np.ndarray):
            embeddings_from = self.embedding_model.encode(from_list, show_progress_bar=False)

        # Extract embeddings from the `to_list` if it exists
        if not isinstance(embeddings_to, np.ndarray):
            if not re_train:
                embeddings_to = self.embeddings_to
            elif to_list is None:
                embeddings_to = self.embedding_model.encode(from_list, show_progress_bar=False)
            else:
                embeddings_to = self.embedding_model.encode(to_list, show_progress_bar=False)

        matches = cosine_similarity(embeddings_from, embeddings_to,
                                    from_list, to_list,
                                    self.min_similarity,
                                    top_n=self.top_n,
                                    method=self.cosine_method)

        self.embeddings_to = embeddings_to

        return matches
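
The order in which `match` chooses the target embeddings can be easier to follow in isolation. The sketch below reproduces only that selection logic, under stated assumptions: `CachingMatcher`, `resolve_targets`, and `fake_encode` are hypothetical names, the encoder is a stub rather than an SBERT model, and a plain `is None` check replaces the `isinstance(..., np.ndarray)` test so the example needs no third-party packages.

```python
class CachingMatcher:
    """Hypothetical stand-in that mimics only the embedding-selection step."""

    def __init__(self, encode):
        self.encode = encode        # any callable: list of str -> list of vectors
        self.embeddings_to = None   # cache, filled by the first call

    def resolve_targets(self, from_list, to_list=None,
                        embeddings_to=None, re_train=True):
        if embeddings_to is None:
            if not re_train:
                # Production mode: reuse the cached target embeddings.
                embeddings_to = self.embeddings_to
            elif to_list is None:
                # No target list: match the source list against itself.
                embeddings_to = self.encode(from_list)
            else:
                embeddings_to = self.encode(to_list)
        # Whatever was chosen becomes the cache for later calls.
        self.embeddings_to = embeddings_to
        return embeddings_to


calls = []


def fake_encode(strings):
    """Stub encoder: records its input and embeds each string as [len(s)]."""
    calls.append(list(strings))
    return [[float(len(s))] for s in strings]


matcher = CachingMatcher(fake_encode)
first = matcher.resolve_targets(["aa"], ["bbb", "c"])        # encodes to_list
second = matcher.resolve_targets(["dddd"], re_train=False)   # reuses the cache
```

Explicitly supplied `embeddings_to` take precedence, then the cache when `re_train=False`, then a fresh encoding of `to_list` (or of `from_list` when no target list is given). In this sketch the second call triggers no encoding of target strings at all, which is the point of setting `re_train=False` in production.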

Matches the two lists of strings to each other and returns the best mapping

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `from_list` | `List[str]` | The list from which you want mappings | required |
| `to_list` | `List[str]` | The list where you want to map to | `None` |
| `embeddings_from` | `ndarray` | Embeddings you created yourself from the `from_list` | `None` |
| `embeddings_to` | `ndarray` | Embeddings you created yourself from the `to_list` | `None` |
| `re_train` | `bool` | Whether to re-train the model with new embeddings. Set this to `False` if you want to use this model in production. | `True` |

Returns

| Type | Description |
| --- | --- |
| `DataFrame` | matches: The best matches between the lists of strings |

Usage:

from polyfuzz.models import SentenceEmbeddings
model = SentenceEmbeddings(min_similarity=0.5)
matches = model.match(["string_one", "string_two"],
                      ["string_three", "string_four"])