polyfuzz.models.TFIDF

A character-based n-gram TF-IDF model to approximate edit distance

We turn a string into character n-grams, typically of length 3. For example, the 3-grams of "hotel" are ['hot', 'ote', 'tel']. These n-grams are then used as input for a TfidfVectorizer in order to create a vector for each word. Then, we simply apply cosine similarity to these vectors through k-NN.
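The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names (`char_ngrams`, `tfidf_vectors`, `cosine`) are hypothetical, the smoothed IDF formula is one common choice, and a plain pairwise cosine stands in for the k-NN search.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(words, n=3):
    """Build one sparse TF-IDF vector (dict of n-gram -> weight) per word."""
    grams = [Counter(char_ngrams(word, n)) for word in words]
    # Document frequency: in how many words does each n-gram occur?
    doc_freq = Counter(gram for counts in grams for gram in counts)
    total = len(words)
    # Smoothed IDF (an assumption of this sketch), so weights never reach zero.
    idf = {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in counts.items()} for counts in grams]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


words = ["hotel", "hotels", "motel", "apple"]
vectors = tfidf_vectors(words)

print(char_ngrams("hotel"))            # ['hot', 'ote', 'tel']
print(cosine(vectors[0], vectors[1]))  # "hotel" vs "hotels": shares three 3-grams
print(cosine(vectors[0], vectors[2]))  # "hotel" vs "motel": shares two 3-grams
print(cosine(vectors[0], vectors[3]))  # "hotel" vs "apple": no shared 3-grams
```

Strings that differ by a small edit share most of their n-grams, which is why the cosine similarity of these vectors behaves like an approximation of edit distance.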

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_gram_range | Tuple[int, int] | The character-level n-gram range | (3, 3) |
| clean_string | bool | Whether to clean the string such that only alphanumeric characters are kept | True |
| min_similarity | float | The minimum similarity between strings; matches below it are returned with a similarity of 0 | 0.75 |
| top_n | int | The number of matches you want returned | 1 |
| cosine_method | str | The method/package for calculating the cosine similarity. Options: sparse, sklearn, knn | 'sparse' |

Options for cosine_method:

* sparse is the fastest and most memory-efficient option, but it requires a package that might be difficult to install.
* sklearn is a bit slower than sparse and requires significantly more memory, as the distance matrix is not sparse.
* knn uses 1-nearest neighbor to extract the most similar strings. It is significantly slower than both other methods but requires little memory.

model_id: The name of the particular instance, used when comparing models
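To make the interaction of min_similarity and top_n concrete, here is a small sketch. The function `select_matches` is hypothetical and not part of the library. It assumes the threshold is inclusive and that a match below it is reported as `None` with a similarity of 0.

```python
def select_matches(scores, candidates, min_similarity=0.75, top_n=1):
    """Keep the `top_n` highest-scoring candidates for one input string."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    selected = []
    for candidate, score in ranked[:top_n]:
        if score >= min_similarity:
            selected.append((candidate, score))
        else:
            # Assumption: a match that is too weak is reported with similarity 0.
            selected.append((None, 0.0))
    return selected


candidates = ["hotels", "motel", "apple"]
scores = [0.77, 0.50, 0.0]

print(select_matches(scores, candidates))           # [('hotels', 0.77)]
print(select_matches(scores, candidates, top_n=2))  # [('hotels', 0.77), (None, 0.0)]
```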

Usage:

from polyfuzz.models import TFIDF
model = TFIDF(n_gram_range=(3, 3), clean_string=True, cosine_method="sparse")

match(self, from_list, to_list=None, re_train=True)

Show source code in models\_tfidf.py
    def match(self,
              from_list: List[str],
              to_list: List[str] = None,
              re_train: bool = True) -> pd.DataFrame:
        """ Match two lists of strings to each other and return the most similar strings

        Arguments:
            from_list: The list from which you want mappings
            to_list: The list where you want to map to
            re_train: Whether to re-train the model with new embeddings
                      Set this to False if you want to use this model in production

        Returns:
            matches: The best matches between the lists of strings

        Usage:

        ```python
        from polyfuzz.models import TFIDF
        model = TFIDF()
        matches = model.match(["string_one", "string_two"],
                              ["string_three", "string_four"])
        ```
        """

        tf_idf_from, tf_idf_to = self._extract_tf_idf(from_list, to_list, re_train)
        matches = cosine_similarity(tf_idf_from, tf_idf_to,
                                    from_list, to_list,
                                    self.min_similarity,
                                    top_n=self.top_n,
                                    method=self.cosine_method)

        return matches

Match two lists of strings to each other and return the most similar strings

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| from_list | List[str] | The list from which you want mappings | required |
| to_list | List[str] | The list where you want to map to | None |
| re_train | bool | Whether to re-train the model with new embeddings. Set this to False if you want to use this model in production | True |

Returns

| Type | Description |
| --- | --- |
| DataFrame | matches: The best matches between the lists of strings |

Usage:

from polyfuzz.models import TFIDF
model = TFIDF()
matches = model.match(["string_one", "string_two"],
                      ["string_three", "string_four"])
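
The effect of re_train can be illustrated with a toy matcher. This is a sketch under assumptions, not the library's code: `TinyTfidfMatcher` and `_cosine` are hypothetical names, the matcher returns plain tuples instead of a DataFrame, and it assumes that re_train=False means "reuse the n-gram weights learned in an earlier call and only transform the new strings".

```python
import math
from collections import Counter


def _cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


class TinyTfidfMatcher:
    """Hypothetical stand-in for a TF-IDF matcher with a re_train switch."""

    def __init__(self, n=3):
        self.n = n
        self.idf = None  # n-gram weights learned during fitting

    def _ngrams(self, text):
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    def _fit(self, strings):
        doc_freq = Counter(g for s in strings for g in set(self._ngrams(s)))
        total = len(strings)
        self.idf = {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}

    def _vector(self, text):
        # n-grams that were not seen during fitting are ignored.
        tf = Counter(g for g in self._ngrams(text) if g in self.idf)
        return {g: count * self.idf[g] for g, count in tf.items()}

    def match(self, from_list, to_list, re_train=True):
        if re_train or self.idf is None:
            self._fit(from_list + to_list)
        to_vectors = [self._vector(t) for t in to_list]
        results = []
        for source in from_list:
            scores = [_cosine(self._vector(source), t) for t in to_vectors]
            best = max(range(len(to_list)), key=scores.__getitem__)
            results.append((source, to_list[best], scores[best]))
        return results


matcher = TinyTfidfMatcher()
matcher.match(["hotel"], ["hotels", "apple"])                        # learns weights
print(matcher.match(["hotl"], ["hotels", "apple"], re_train=False))  # reuses them
```

Keeping the weights fixed means the same input always produces the same vector, which is the usual reason to disable re-training once a model is deployed.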