polyfuzz.models.Embeddings

Embed words into vectors and use cosine similarity to find the best matches between two lists of strings.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| embedding_method | Optional[List] | List of Flair embeddings to use | None |
| min_similarity | float | The minimum similarity between strings; matches below it are returned with a similarity of 0 | 0.75 |
| top_n | int | The number of best matches to return | 1 |
| cosine_method | str | The method/package for calculating the cosine similarity. Options: "sparse", "sklearn", "knn". "sparse" is the fastest and most memory efficient but requires a package that might be difficult to install. "sklearn" is a bit slower than "sparse" and requires significantly more memory because the distance matrix is not sparse. "knn" uses 1-nearest neighbor to extract the most similar strings; it is significantly slower than both other methods but requires little memory. | 'sparse' |
| model_id | str | The name of the particular instance, used when comparing models | None |
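
To make the interaction of `min_similarity` and `top_n` concrete, here is a minimal pure-Python sketch of cosine matching with a threshold. It is not the library's implementation: the helper names `cosine` and `best_matches`, the toy two-dimensional vectors, and the use of `None` as the placeholder for a rejected match are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Match = Tuple[Optional[str], float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matches(from_vecs: Sequence[Sequence[float]],
                 to_vecs: Sequence[Sequence[float]],
                 to_list: Sequence[str],
                 min_similarity: float = 0.75,
                 top_n: int = 1) -> List[List[Match]]:
    """Hypothetical helper: the top_n candidates for every source vector.

    Candidates scoring below min_similarity are reported as (None, 0.0).
    """
    results = []
    for u in from_vecs:
        # Score every candidate, then keep the top_n highest scores.
        scored = [(name, cosine(u, v)) for name, v in zip(to_list, to_vecs)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        row = [(name, score) if score >= min_similarity else (None, 0.0)
               for name, score in scored[:top_n]]
        results.append(row)
    return results


# Toy vectors standing in for real embeddings (assumption for illustration).
to_list = ["apple", "banana"]
to_vecs = [[1.0, 0.0], [0.0, 1.0]]
from_vecs = [[0.9, 0.1], [0.5, 0.5]]
matches = best_matches(from_vecs, to_vecs, to_list, min_similarity=0.75, top_n=1)
```

In this sketch the first source vector points almost exactly at "apple" and is kept, while the second sits halfway between both candidates, scores about 0.71, and is therefore reported with a similarity of 0.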

Usage:

from polyfuzz.models import Embeddings

model = Embeddings(min_similarity=0.5)

To use a custom word embedding model instead, pass it in as a list:

from flair.embeddings import WordEmbeddings

embedding_model = WordEmbeddings('news')
model = Embeddings([embedding_model], min_similarity=0.5)

As you might have guessed, you can pass along multiple word embedding models and the results will be averaged:

from flair.embeddings import WordEmbeddings, TransformerWordEmbeddings

fasttext_embedding = WordEmbeddings('news')
glove_embedding = WordEmbeddings('glove')
bert_embedding = TransformerWordEmbeddings('bert-base-multilingual-cased')
model = Embeddings([glove_embedding,
                    fasttext_embedding,
                    bert_embedding], min_similarity=0.5)

match(self, from_list, to_list=None, embeddings_from=None, embeddings_to=None, re_train=True)

Show source code in models\_embeddings.py
    def match(self,
              from_list: List[str],
              to_list: List[str] = None,
              embeddings_from: np.ndarray = None,
              embeddings_to: np.ndarray = None,
              re_train: bool = True) -> pd.DataFrame:
        """ Matches the two lists of strings to each other and returns the best mapping

        Arguments:
            from_list: The list from which you want mappings
            to_list: The list where you want to map to
            embeddings_from: Embeddings you created yourself from the `from_list`
            embeddings_to: Embeddings you created yourself from the `to_list`
            re_train: Whether to re-train the model with new embeddings
                      Set this to False if you want to use this model in production

        Returns:
            matches: The best matches between the lists of strings

        Usage:

        ```python
        model = Embeddings(min_similarity=0.5)
        matches = model.match(["string_one", "string_two"],
                              ["string_three", "string_four"])
        ```
        """
        # Extract embeddings from the `from_list`
        if not isinstance(embeddings_from, np.ndarray):
            embeddings_from = self._embed(from_list)

        # Extract embeddings from the `to_list` if it exists
        if not isinstance(embeddings_to, np.ndarray):
            if not re_train:
                embeddings_to = self.embeddings_to
            elif to_list is None:
                embeddings_to = self._embed(from_list)
            else:
                embeddings_to = self._embed(to_list)

        matches = cosine_similarity(embeddings_from, embeddings_to,
                                    from_list, to_list,
                                    self.min_similarity,
                                    top_n=self.top_n,
                                    method=self.cosine_method)

        self.embeddings_to = embeddings_to

        return matches

Matches the two lists of strings to each other and returns the best mapping.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| from_list | List[str] | The list from which you want mappings | required |
| to_list | List[str] | The list where you want to map to | None |
| embeddings_from | ndarray | Embeddings you created yourself from the from_list | None |
| embeddings_to | ndarray | Embeddings you created yourself from the to_list | None |
| re_train | bool | Whether to re-train the model with new embeddings. Set this to False if you want to use this model in production. | True |
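
The effect of `re_train` is easiest to see in isolation. The sketch below mirrors only the embedding-selection branches of `match` shown in the source above; `TinyMatcher`, `select_embeddings`, the `embed_calls` counter, and the toy embedding function are hypothetical stand-ins, and the `is None` checks replace the `isinstance(..., np.ndarray)` checks of the original so that the example runs without NumPy.

```python
from typing import List, Optional, Tuple

Vectors = List[List[float]]


class TinyMatcher:
    """Hypothetical stand-in for the embedding-selection logic of match."""

    def __init__(self) -> None:
        self.embeddings_to: Optional[Vectors] = None
        self.embed_calls = 0  # how many times a list of strings was embedded

    def _embed(self, strings: List[str]) -> Vectors:
        # Toy embedding (string length, vowel count); not a real model.
        self.embed_calls += 1
        return [[float(len(s)), float(sum(c in "aeiou" for c in s))]
                for s in strings]

    def select_embeddings(self,
                          from_list: List[str],
                          to_list: Optional[List[str]] = None,
                          embeddings_from: Optional[Vectors] = None,
                          embeddings_to: Optional[Vectors] = None,
                          re_train: bool = True) -> Tuple[Vectors, Optional[Vectors]]:
        # Embed the source strings unless the caller supplied vectors.
        if embeddings_from is None:
            embeddings_from = self._embed(from_list)

        if embeddings_to is None:
            if not re_train:
                # Reuse the target embeddings stored by the previous call.
                embeddings_to = self.embeddings_to
            elif to_list is None:
                # No target list: match the source list against itself.
                embeddings_to = self._embed(from_list)
            else:
                embeddings_to = self._embed(to_list)

        # Remember the target embeddings for later re_train=False calls.
        self.embeddings_to = embeddings_to
        return embeddings_from, embeddings_to


matcher = TinyMatcher()
_, first_to = matcher.select_embeddings(["apple"], ["apples", "pear"])
calls_after_first = matcher.embed_calls
_, reused_to = matcher.select_embeddings(["appel"], re_train=False)
calls_after_second = matcher.embed_calls
```

In this sketch the second call embeds only the new source strings and reuses the stored target embeddings, which is why `re_train=False` suits a setting where the target list is fixed.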

Returns

| Type | Description |
|------|-------------|
| DataFrame | matches: The best matches between the lists of strings |

Usage:

model = Embeddings(min_similarity=0.5)
matches = model.match(["string_one", "string_two"],
                      ["string_three", "string_four"])