Update Topic Representations

The topics that BERTopic extracts are represented by words. These words are taken from the documents assigned to each topic using a class-based TF-IDF (c-TF-IDF). This allows us to extract words that are characteristic of one topic but less so of the others.
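To give a feel for how a class-based TF-IDF scores words, here is a minimal pure-Python sketch. It is not BERTopic's implementation: it assumes whitespace tokenization, joins all documents of a topic into one "class document", and uses the weighting tf * log(1 + A / f), where A is the average number of words per topic and f is the frequency of the word across all topics. The function name c_tf_idf and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic with a simplified class-based TF-IDF.

    docs_per_topic maps a topic ID to the list of documents assigned to it.
    """
    # Treat all documents of a topic as one long "class document".
    # Whitespace tokenization is a simplification for this sketch.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # Frequency of each word across all topics.
    total_freq = Counter()
    for counter in counts.values():
        total_freq.update(counter)

    # Average number of words per topic.
    avg_words = sum(total_freq.values()) / len(counts)

    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            # Term frequency within the topic times log(1 + A / f).
            word: (freq / n_words) * math.log(1 + avg_words / total_freq[word])
            for word, freq in counter.items()
        }
    return scores


# Hypothetical toy data: two topics with two short documents each.
toy_topics = {
    0: ["the rocket reached orbit", "nasa launched the rocket"],
    1: ["the chip uses encryption", "the encryption key is secret"],
}
scores = c_tf_idf(toy_topics)
top_words = {
    topic: sorted(word_scores, key=word_scores.get, reverse=True)[:3]
    for topic, word_scores in scores.items()
}
print(top_words)
```

In this toy input, "the" occurs as often as "rocket" within topic 0 but receives a lower score because it also appears in topic 1.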

Update Topic Representation after Training

When you have trained a model and viewed the topics and the words that represent them, you might not be satisfied with the representation. Perhaps you forgot to remove stop_words or you want to try out a different n_gram_range. We can use the function update_topics to update the topic representation with new parameters for c-TF-IDF:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Create topics
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic(n_gram_range=(2, 3))
topics, probs = topic_model.fit_transform(docs)

From the model created above, one of the most frequent topics is the following:

>>> topic_model.get_topic(31)[:10]
[('clipper chip', 0.007240771542316232),
 ('key escrow', 0.004601603973377443),
 ('law enforcement', 0.004277247929596332),
 ('intercon com', 0.0035961920238955824),
 ('amanda walker', 0.003474856425297157),
 ('serial number', 0.0029876119137150358),
 ('com amanda', 0.002789303096817983),
 ('intercon com amanda', 0.0027386688593327084),
 ('amanda intercon', 0.002585262048515583),
 ('amanda intercon com', 0.002585262048515583)]

Although there does seem to be some relation between the words, it is difficult, at least for me, to intuitively understand what the topic is about. Instead, let's simplify the topic representation by setting n_gram_range to (1, 3) to also allow for single words.

>>> topic_model.update_topics(docs, n_gram_range=(1, 3))
>>> topic_model.get_topic(31)[:10]
[('encryption', 0.008021846079148017),
 ('clipper', 0.00789642647602742),
 ('chip', 0.00637127942464045),
 ('key', 0.006363124787175884),
 ('escrow', 0.005030980365244285),
 ('clipper chip', 0.0048271268437973395),
 ('keys', 0.0043245812747907545),
 ('crypto', 0.004311198708675516),
 ('intercon', 0.0038772934659295076),
 ('amanda', 0.003516026493904586)]

To me, the combination of the words above seems a bit more intuitive than the words we previously had! You can play around with n_gram_range or pass your own custom sklearn.feature_extraction.text.CountVectorizer instead:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 5))
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
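To make the effect of the n-gram range concrete, the sketch below lists the candidate terms that different ranges would produce for a short phrase. It is a simplified illustration: extract_ngrams is a hypothetical helper that splits on whitespace, whereas CountVectorizer applies its own tokenization and stop-word handling.

```python
def extract_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of `text` whose length lies in ngram_range."""
    tokens = text.lower().split()  # simplification: whitespace tokenization
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the text.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


sentence = "clipper chip key escrow"

# Only two- and three-word terms, as with n_gram_range=(2, 3).
bigrams_and_trigrams = extract_ngrams(sentence, ngram_range=(2, 3))

# Single words are included as well, as with n_gram_range=(1, 3).
with_single_words = extract_ngrams(sentence, ngram_range=(1, 3))

print(bigrams_and_trigrams)
print(with_single_words)
```

With the lower bound at 2, a single word such as "clipper" can never appear in a topic representation; lowering it to 1 adds those words as candidates alongside the longer phrases.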

Tip!

If you want to change the topics to something else, whether that is merging them or removing outliers, you can pass a custom list of topics to update them: topic_model.update_topics(docs, topics=my_updated_topics)
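As a rough sketch of how such a list could be built, the snippet below merges one topic into another by rewriting the per-document topic IDs. The helper merge_topic_ids, the mapping, and the toy assignments are hypothetical; only the update_topics call in the final comment comes from the tip above.

```python
def merge_topic_ids(topics, mapping):
    """Return a new list of topic IDs with some IDs replaced.

    topics  : one topic ID per document, in document order
    mapping : hypothetical dict of {old_topic_id: new_topic_id}
    """
    return [mapping.get(topic, topic) for topic in topics]


# Toy assignments standing in for the `topics` returned by fit_transform.
topics = [0, 1, 2, -1, 1, 2]

# Fold topic 2 into topic 1; all other IDs stay unchanged.
my_updated_topics = merge_topic_ids(topics, {2: 1})
print(my_updated_topics)

# With a fitted model and the original documents, this list could then be
# passed on as described in the tip:
# topic_model.update_topics(docs, topics=my_updated_topics)
```

The list must keep one entry per document, in the same order as docs, so that each document is still matched with its topic.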

Custom labels

The topic labels are currently generated automatically by taking the top 3 words and combining them using the _ separator. Although this gives an informative label, in practice it is neither the prettiest nor necessarily the most accurate one. For example, although the topic label 1_space_nasa_orbit is informative, we would prefer a more intuitive label, such as space travel. The difficulty with creating such topic labels is that much of the interpretation is left to the user. Would space travel be more accurate, or perhaps space exploration? To truly understand which labels are most suitable, reading some of the documents in each topic is especially helpful.

Although we can go through every single topic ourselves and try to label them, we can start by creating an overview of labels that have the length and number of words that we are looking for. To do so, we can generate our list of topic labels with .generate_topic_labels and define the number of words, the separator, the word length, etc.:

topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 word_length=10,
                                                 separator=", ")

Tip

If you created multiple topic representations or aspects, you can choose one of these aspects with aspect="Aspect1" or whatever you named the aspect.

In the above example, 1_space_nasa_orbit would turn into space, nasa, orbit since we selected 3 words, no topic prefix, and the ", " separator. We can then either change our topic_labels to whatever we want or directly pass them to .set_topic_labels so that they can be used across most visualization functions:

topic_model.set_topic_labels(topic_labels)
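The effect of the label parameters described above can be illustrated with a small stand-alone function. This is only a sketch of the idea, not the library's implementation: make_label is a hypothetical helper, and it assumes that word_length truncates each word to at most that many characters.

```python
def make_label(topic_id, words, nr_words=3, topic_prefix=True,
               word_length=None, separator="_"):
    """Build a topic label from a topic's top words (illustrative only)."""
    selected = words[:nr_words]
    if word_length is not None:
        # Truncate long words so that labels stay similar in length.
        selected = [word[:word_length] for word in selected]
    label = separator.join(selected)
    if topic_prefix:
        # Prepend the topic ID, as in labels like 1_space_nasa_orbit.
        label = f"{topic_id}{separator}{label}"
    return label


# Hypothetical top words for topic 1.
top_words = ["space", "nasa", "orbit", "launch", "shuttle"]

default_label = make_label(1, top_words)
custom_label = make_label(1, top_words, topic_prefix=False,
                          word_length=10, separator=", ")
print(default_label)
print(custom_label)
```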

It is also possible to only change a few topic labels at a time by passing a dictionary where the key represents the topic ID and the value is the topic label:

topic_model.set_topic_labels({1: "Space Travel", 7: "Religion"})

Then, to make use of those custom topic labels across visualizations, such as .visualize_hierarchy(), we can use the custom_labels=True parameter that is found in most visualizations.

fig = topic_model.visualize_barchart(custom_labels=True)

Optimize labels

The great advantage of passing custom labels to BERTopic is that when more accurate zero-shot classifiers are released, we can simply use those on top of BERTopic to further fine-tune the labeling. For example, let's say you have a set of potential topic labels that you want to use instead of the ones generated by BERTopic. You could use the bart-large-mnli model to find which user-defined labels best represent the BERTopic-generated labels:

from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# A selected topic representation
# 'god jesus atheists atheism belief atheist believe exist beliefs existence'
sequence_to_classify = " ".join([word for word, _ in topic_model.get_topic(1)])

# Our set of potential topic labels
candidate_labels = ['cooking', 'dancing', 'religion']
classifier(sequence_to_classify, candidate_labels)

# {'labels': ['religion', 'cooking', 'dancing'],
#  'scores': [0.850, 0.086, 0.063],
#  'sequence': 'god jesus atheists atheism belief atheist believe exist beliefs existence'}
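One way to turn such a result into a custom label is sketched below. The helper best_label and the 0.5 cut-off are assumptions made for illustration. The result dictionary reuses the values from the example above, and the helper does not rely on the order of the labels.

```python
def best_label(result):
    """Return the (label, score) pair with the highest score."""
    return max(zip(result["labels"], result["scores"]),
               key=lambda pair: pair[1])


# Labels and scores taken from the example above.
result = {
    "sequence": "god jesus atheists atheism belief atheist believe exist beliefs existence",
    "labels": ["cooking", "dancing", "religion"],
    "scores": [0.086, 0.063, 0.850],
}
label, score = best_label(result)

# Only accept the label if the classifier is reasonably confident.
# The 0.5 cut-off is an arbitrary choice for illustration.
if score >= 0.5:
    new_labels = {1: label.capitalize()}
    print(new_labels)
    # With a fitted model, the dictionary could then be passed on:
    # topic_model.set_topic_labels(new_labels)
```

Looping over all topics in the same way would give a dictionary of labels covering the whole model, while topics whose best score stays below the cut-off keep their generated label.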