# Quickstart

## Installation

Installation, with sentence-transformers, can be done through PyPI:

```bash
pip install bertopic
```
You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:

```bash
pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]
```
## Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```
After generating topics, we can access the most frequent topics that were generated:

```python
>>> topic_model.get_topic_info()

Topic   Count   Name
-1      4630    -1_can_your_will_any
0       693     0_windows_drive_dos_file
1       466     1_jesus_bible_christian_faith
2       441     2_space_launch_orbit_lunar
3       381     3_key_encryption_keys_encrypted
```
-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:
```python
>>> topic_model.get_topic(0)
[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
```
**Tip!** Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
## Visualize Topics

After having trained our BERTopic model, we can iteratively go through perhaps a hundred topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

```python
topic_model.visualize_topics()
```
## Save/Load BERTopic model

We can easily save a trained BERTopic model by calling `save`:

```python
from bertopic import BERTopic

topic_model = BERTopic()
topic_model.save("my_model")
```
Then, we can load the model in one line:

```python
topic_model = BERTopic.load("my_model")
```
**Tip!** If you do not want to save the embedding model because it is loaded from the cloud, simply run `topic_model.save("my_model", save_embedding_model=False)` instead. Then, you can load in the model with `BERTopic.load("my_model", embedding_model="whatever_model_you_used")`.