
Quick Start

Installation

Installation, with sentence-transformers, can be done using PyPI:

pip install bertopic

Depending on the embedding and language backends that you will be using, you may want to install additional dependencies. The possible installations are:

# Choose an embedding backend
pip install bertopic[flair,gensim,spacy,use]

# Topic modeling with images
pip install bertopic[vision]

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which comprises English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics, we can access the frequent topics that were generated:

>>> topic_model.get_topic_info()

Topic   Count   Name
-1      4630    -1_can_your_will_any
49      693     49_windows_drive_dos_file
32      466     32_jesus_bible_christian_faith
2       441     2_space_launch_orbit_lunar
22      381     22_key_encryption_keys_encrypted

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> topic_model.get_topic(49)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
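
Once fitted, the model can also assign topics to documents it has not seen before. A minimal sketch (the example document is made up):

# Assign topics to new, unseen documents
new_docs = ["Windows cannot detect my external drive anymore"]
topics, probs = topic_model.transform(new_docs)
print(topics)  # e.g., [49]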

Using .get_document_info, we can also extract information at the document level, such as each document's assigned topic, its probability, and whether it is a representative document for its topic:

>>> topic_model.get_document_info(docs)

Document                               Topic   Name                         Top_n_words                  Probability   ...
I am sure some bashers of Pens...      0       0_game_team_games_season     game - team - games...       0.200010      ...
My brother is in the market for...     -1      -1_can_your_will_any         can - your - will...         0.420668      ...
Finally you said what you dream...     -1      -1_can_your_will_any         can - your - will...         0.807259      ...
Think! It is the SCSI card doing...    49      49_windows_drive_dos_file    windows - drive - dos...     0.071746      ...
1) I have an old Jasmine drive...      49      49_windows_drive_dos_file    windows - drive - dos...     0.038983      ...
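
Since .get_document_info returns a regular pandas DataFrame, it can be filtered like any other DataFrame. A minimal sketch that selects the representative documents of topic 49 (the Representative_document column name assumes a recent BERTopic version):

doc_info = topic_model.get_document_info(docs)

# Keep only the documents that are representative of topic 49
representative = doc_info[(doc_info.Topic == 49) & (doc_info.Representative_document)]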

Multilingual

Use BERTopic(language="multilingual") to select a model that supports 50+ languages.
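
For example:

from bertopic import BERTopic

# Selects a multilingual sentence-transformers embedding model under the hood
topic_model = BERTopic(language="multilingual")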

Fine-tune Topic Representations

In BERTopic, there are a number of different topic representations to choose from. They are all quite different from one another and offer interesting perspectives on and variations of the topic representations. A great starting point is KeyBERTInspired, which for many users improves the coherence of the resulting topic representations and reduces the number of stopwords in them:

from bertopic.representation import KeyBERTInspired

# Fine-tune your topic representations
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)

However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:

import openai
from bertopic.representation import OpenAI

# Fine-tune topic representations with GPT
client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True)
topic_model = BERTopic(representation_model=representation_model)

Multi-aspect Topic Modeling

Instead of iterating over all of these different topic representations, you can model them simultaneously with multi-aspect topic representations in BERTopic.
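
A minimal sketch of the dictionary-based API, where each key names one aspect:

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# Each key becomes a separately stored aspect of the same topics
representation_model = {
    "Main": KeyBERTInspired(),
    "Aspect1": MaximalMarginalRelevance(diversity=0.3),
}
topic_model = BERTopic(representation_model=representation_model)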

Visualizations

After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the many visualization options in BERTopic. For example, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()
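
The visualizations are interactive Plotly figures, so they can be shown in a notebook or written to a standalone HTML file:

fig = topic_model.visualize_topics()
fig.write_html("topics.html")  # standalone, interactive HTML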

Save/Load BERTopic model

There are three methods for saving BERTopic:

  1. A light model with .safetensors and config files
  2. A light model with pytorch .bin and config files
  3. A full model with .pickle

Method 3 allows for saving the entire topic model but has several drawbacks:

  • Arbitrary code can be run from .pickle files
  • The resulting model is rather large (often > 500MB) since all sub-models need to be saved
  • Explicit and strict version control is needed, as a pickled model typically only runs if the environment is exactly the same

It is advised to use methods 1 or 2 for saving.

These methods have a number of advantages:

  • .safetensors is a relatively safe format
  • The resulting model can be very small (often < 20MB) since no sub-models need to be saved
  • Although version control is important, there is a bit more flexibility with respect to specific versions of packages
  • More easily used in production
  • Share models with the HuggingFace Hub

Tip

For more detail on how to load a custom vectorizer, representation model, and more, it is highly advised to check out the serialization page. It contains more examples, details, and some tips and tricks for saving and loading your environment.

The methods are used as follows:

topic_model = BERTopic().fit(my_docs)

# Method 1 - safetensors
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 2 - pytorch
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 3 - pickle
topic_model.save("my_model", serialization="pickle")

To load a model:

# Load from directory
loaded_model = BERTopic.load("path/to/my/model_dir")

# Load from file
loaded_model = BERTopic.load("my_model")

# Load from HuggingFace
loaded_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")

Warning

When saving the model, make sure to also keep track of the versions of the dependencies and of Python itself. Loading and saving the model should be done using the same dependencies and Python version. Moreover, models saved in one version of BERTopic should not be loaded in other versions.