Datasets

There are two datasets prepared for you to play around with:

  • Company Names
  • Movie Titles

Movie Titles

This data is retrieved from:

  • https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset
  • https://www.kaggle.com/shivamb/netflix-shows

It contains Netflix and IMDB movie titles that can be matched against each other: 80852 titles from IMDB and 6172 titles from Netflix.

You can use them as follows:

from polyfuzz import PolyFuzz
from polyfuzz.datasets import load_movie_titles

data = load_movie_titles()
model = PolyFuzz("TF-IDF").match(data["Netflix"], data["IMDB"])
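To make concrete what matching one list of titles against another produces, here is a minimal sketch that uses only the standard library's difflib. It is not how PolyFuzz computes TF-IDF similarity; the `match_lists` helper, the `cutoff` value, and the short sample lists are assumptions made up for illustration and are not taken from the dataset.

```python
from difflib import get_close_matches


def match_lists(from_list, to_list, cutoff=0.6):
    """Map each entry of from_list to its closest entry in to_list, or None."""
    matches = {}
    for title in from_list:
        # n=1 keeps only the single best candidate above the cutoff
        candidates = get_close_matches(title, to_list, n=1, cutoff=cutoff)
        matches[title] = candidates[0] if candidates else None
    return matches


# Hypothetical sample titles, standing in for data["Netflix"] and data["IMDB"]
netflix_sample = ["The Irishman", "Bird Box", "Okja"]
imdb_sample = ["Irishman", "Birdbox", "Parasite"]

matches = match_lists(netflix_sample, imdb_sample)
for source, target in matches.items():
    print(f"{source!r} -> {target!r}")
```

Entries with no sufficiently similar counterpart map to None, which is roughly the kind of outcome you would want to inspect when a Netflix title has no IMDB equivalent.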

Company Names

This data is retrieved from here and contains 100_000 company names to be matched against each other.

This is a different use case from the typical one, in which two different lists are compared with each other. Here, you can compare the company names with themselves in order to clean them up.

You can use them as follows:

from polyfuzz import PolyFuzz
from polyfuzz.datasets import load_company_names

data = load_company_names()
model = PolyFuzz("TF-IDF").match(data)

When you pass only a single list, PolyFuzz recognizes that you want to match the names with themselves. It ignores any comparison of a string with itself; otherwise, every string would simply be mapped to itself.
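The idea of matching a list against itself while skipping each string's comparison with itself can be sketched with the standard library alone. This is an illustrative approximation using difflib, not PolyFuzz's TF-IDF implementation; the `self_match` helper and the sample company names are hypothetical.

```python
from difflib import SequenceMatcher


def self_match(names):
    """For each name, find the most similar *other* entry in the same list."""
    matches = {}
    for i, name in enumerate(names):
        best_name, best_score = None, 0.0
        for j, candidate in enumerate(names):
            if i == j:
                # Skip the identity comparison; it would always score 1.0
                continue
            score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_name, best_score = candidate, score
        matches[name] = (best_name, best_score)
    return matches


# Hypothetical company names with near-duplicate spellings
companies = ["Acme Corp", "Acme Corporation", "Globex Inc", "Globex Incorporated"]

result = self_match(companies)
for name, (match, score) in result.items():
    print(f"{name!r} -> {match!r} ({score:.2f})")
```

Without the `i == j` check, every name would match itself with a perfect score, which is exactly the degenerate result that ignoring self-comparisons avoids. The nested loop is quadratic in the list size, so it is only suitable for small illustrations, not for a list of 100_000 names.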