Seed Words

When performing topic modeling, you are often faced with data that you are familiar with to a certain extent or that speaks a very specific language. In those cases, topic modeling techniques might have difficulties capturing and representing the semantic nature of domain-specific abbreviations, slang, short forms, acronyms, etc. For example, the "TNM" classification is a method for identifying the stage of most cancers. The abbreviation "TNM" might not be correctly captured in generic embedding models.

To make sure that certain domain-specific words are weighted higher and used more often in topic representations, you can set any number of seed_words in bertopic.vectorizers.ClassTfidfTransformer. The ClassTfidfTransformer is the base representation of BERTopic and essentially represents each topic as a bag of words. As such, we can choose to increase the importance of certain words, such as "TNM".
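As a minimal sketch, boosting "TNM" might look as follows (lowercased here, since the default vectorizer lowercases its vocabulary):

from bertopic.vectorizers import ClassTfidfTransformer

# Double the c-TF-IDF weight of the domain-specific term "tnm"
ctfidf_model = ClassTfidfTransformer(seed_words=["tnm"], seed_multiplier=2)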

To do so, let's take a look at an example. We have a dataset of article abstracts and want to perform some topic modeling. Since we might be familiar with the data, there are certain words that we know should be generally important. Let's assume that we have in-depth knowledge about reinforcement learning and know that words like "agent" and "robot" should be important in such a topic, were one to be found. Using the ClassTfidfTransformer, we can define those seed_words and also choose by how much their values are multiplied.

The full example is then as follows:

from umap import UMAP
from datasets import load_dataset
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Let's take a subset of ArXiv abstracts as the training data
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:5_000]

# For illustration purposes, we make sure the output is fixed when running this code multiple times
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# We can choose any number of seed words whose representation we want
# strengthened. We increase the importance of these words so they are more
# likely to end up in the topic representations.
ctfidf_model = ClassTfidfTransformer(
    seed_words=["agent", "robot", "behavior", "policies", "environment"],
    seed_multiplier=2
)

# We run the topic model with the seeded words
topic_model = BERTopic(
    umap_model=umap_model,
    min_topic_size=15,
    ctfidf_model=ctfidf_model,
).fit(abstracts)

Then, when we run topic_model.get_topic(0), we get the following output:

[('policy', 0.023413102511982354),
 ('reinforcement', 0.021796126795834238),
 ('agent', 0.021131601305431902),
 ('policies', 0.01888385271486409),
 ('environment', 0.017819874593917057),
 ('learning', 0.015321710504308708),
 ('robot', 0.013881115279230468),
 ('control', 0.013297705894983875),
 ('the', 0.013247933839985382),
 ('to', 0.013058208312484141)]

As we can see, the output includes several of the seed words that we assigned. However, if a word is not found to be important in a topic, then multiplying its importance will still leave its value relatively low. This is a great feature as it allows you to boost the importance of seed words with less risk of making words important in topics where they really should not be.
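To illustrate with back-of-the-envelope numbers (not BERTopic internals), doubling a word that barely occurs in a topic still leaves it far below that topic's top words:

# Hypothetical c-TF-IDF weight of a seed word in an unrelated topic
weight = 0.001
boosted = weight * 2   # seed_multiplier=2
print(boosted)         # 0.002, still well below the ~0.02 of the top words above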

A benefit of this method is that it often influences all other representation methods, like KeyBERTInspired and OpenAI. The reason for this is that each representation model uses the words generated by the ClassTfidfTransformer as candidate words to be further optimized. In many cases, words like "TNM" might not end up in the candidate words. By increasing their importance, they are more likely to end up as candidate words in representation models.
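As a sketch of how this fits together, we could pass a representation model alongside the seeded c-TF-IDF model; the representation model then fine-tunes the boosted candidate words:

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# KeyBERTInspired fine-tunes the candidate words produced by the
# seeded ClassTfidfTransformer defined earlier
topic_model = BERTopic(
    ctfidf_model=ctfidf_model,
    representation_model=KeyBERTInspired(),
).fit(abstracts)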

Another benefit of using this method is that it artificially increases the interpretability of topics. Sure, some words might be more important than others, but they might not mean anything to a domain expert. To them, certain words, like "TNM", are highly descriptive, and that is something difficult to capture using any method (embedding model, large language model, etc.).

Moreover, these seed_words can be defined together with the domain expert, as they can decide which types of words are generally important and might need a nudge from you, the algorithmic developer.