polyfuzz.models.EditDistance

Calculate the edit distance between lists of strings using any distance- or similarity-based scorer.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| n_jobs | int | Number of parallel processes; use -1 to use all cores | 1 |
| scorer | Callable | The scorer function used to calculate the edit distance. It should return a float between 0 and 1 and be callable as `scorer("string_one", "string_two")` | `fuzz.ratio` |
| model_id | str | The name of this particular instance, used when comparing models | None |

Usage:

from polyfuzz.models import EditDistance
from rapidfuzz import fuzz

model = EditDistance(n_jobs=-1, scorer=fuzz.WRatio)
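
The scorer contract described above (two strings in, a float between 0 and 1 out) can be illustrated with a minimal sketch. `difflib_scorer` is a hypothetical name, and it relies on the standard library's `difflib.SequenceMatcher` rather than on rapidfuzz; the `**kwargs` catch-all is an assumption, added in case the caller forwards extra options to the scorer.

```python
from difflib import SequenceMatcher


def difflib_scorer(string_one: str, string_two: str, **kwargs) -> float:
    """Hypothetical scorer returning a similarity in [0, 1].

    Extra keyword arguments are accepted and ignored; whether a caller
    passes any is an assumption, not something this page confirms.
    """
    return SequenceMatcher(None, string_one, string_two).ratio()


# Identical strings score 1.0; strings with no characters in common score 0.0.
print(difflib_scorer("string_one", "string_one"))
print(difflib_scorer("abc", "xyz"))
```

If the contract above is all that is required, such a function could be passed as `EditDistance(scorer=difflib_scorer)`; this page does not confirm that a pure-Python scorer works unchanged.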

match(self, from_list, to_list=None, **kwargs)

Source code in models\_distance.py
    def match(self,
              from_list: List[str],
              to_list: List[str] = None,
              **kwargs) -> pd.DataFrame:
        """ Calculate the edit distances between two lists of strings
        by parallelizing the calculation and passing the lists in
        batches.

        Arguments:
            from_list: The list from which you want mappings
            to_list: The list where you want to map to

        Returns:
            matches: The best matches between the lists of strings

        Usage:

        ```python
        from rapidfuzz import fuzz
        model = EditDistance(n_jobs=-1, scorer=fuzz.WRatio)
        matches = model.match(["string_one", "string_two"],
                              ["string_three", "string_four"])
        ```
        """
        if to_list is None:
            self.equal_lists = True
            expected_iterations = int(len(from_list)/2)
            to_list = from_list.copy()
        else:
            expected_iterations = len(from_list)

        matches = Parallel(n_jobs=self.n_jobs)(delayed(self._calculate_edit_distance)
                                               (from_string, to_list)
                                               for from_string in tqdm(from_list, total=expected_iterations,
                                                                       disable=True))
        matches = pd.DataFrame(matches, columns=['From', "To", "Similarity"])

        if self.normalize:
            matches["Similarity"] = (matches["Similarity"] -
                                     matches["Similarity"].min()) / (matches["Similarity"].max() -
                                                                     matches["Similarity"].min())
        return matches

Calculate the edit distances between two lists of strings by parallelizing the calculation and passing the lists in batches.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| from_list | List[str] | The list from which you want mappings | required |
| to_list | List[str] | The list where you want to map to | None |

Returns

| Type | Description |
|---|---|
| DataFrame | matches: The best matches between the lists of strings |
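
As a rough illustration of what "best matches" means here, the sketch below picks, for each string in `from_list`, the highest-scoring string in `to_list` and returns (From, To, Similarity) rows, mirroring the column names in the source listing. It is not PolyFuzz's implementation: `similarity` and `best_matches` are hypothetical helpers, the scorer is `difflib` from the standard library, and nothing is parallelized.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in scorer (difflib, not rapidfuzz) returning a value in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def best_matches(from_list: List[str],
                 to_list: List[str]) -> List[Tuple[str, str, float]]:
    """For each string in from_list, keep the highest-scoring string in to_list."""
    rows = []
    for from_string in from_list:
        # Pick the candidate with the highest similarity to from_string.
        best = max(to_list, key=lambda candidate: similarity(from_string, candidate))
        rows.append((from_string, best, similarity(from_string, best)))
    return rows


rows = best_matches(["apple", "house"], ["apples", "mouse"])
for row in rows:
    print(row)
```

Each returned tuple corresponds to one row of the From/To/Similarity table.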

Usage:

from polyfuzz.models import EditDistance
from rapidfuzz import fuzz

model = EditDistance(n_jobs=-1, scorer=fuzz.WRatio)
matches = model.match(["string_one", "string_two"],
                      ["string_three", "string_four"])
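
When `self.normalize` is set, the source listing for `match` rescales the Similarity column with min-max normalization, (x - min) / (max - min). A minimal sketch of that arithmetic in plain Python follows; `min_max_normalize` is a hypothetical helper, and the zero-range guard is an addition of this sketch, not something the listing does.

```python
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] as (x - min) / (max - min)."""
    low, high = min(scores), max(scores)
    if high == low:
        # Guard added for this sketch: with identical scores the range is
        # zero and the division below would raise ZeroDivisionError.
        return [0.0 for _ in scores]
    return [(score - low) / (high - low) for score in scores]


normalized = min_max_normalize([50.0, 75.0, 100.0])
print(normalized)  # [0.0, 0.5, 1.0]
```

Note that after this rescaling the lowest-scoring match always ends up at 0 and the highest at 1, whatever the raw scores were.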