Topics per Class

In some cases, you might be interested in how certain topics are represented across different categories. For example, you may want to see how specific groups of users talk about a given topic.

Instead of running the topic model per class, we can simply create a topic model and then extract, for each topic, its representation per class. This allows you to see how certain topics, calculated over all documents, are represented for certain subgroups.


[Figure: documents are split by topic, then by topic and class, and the pre-fitted c-TF-IDF is applied to each subset of documents.]
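
The steps listed above can be sketched in plain Python on a toy corpus. This is a simplified illustration under stated assumptions, not BERTopic's implementation: the toy documents, the `fit_idf` helper, and the `top_words_per_topic_and_class` function are hypothetical, and a plain smoothed IDF stands in for the pre-fitted c-TF-IDF weighting. The point it shows is that the weights are fitted once on all documents and then reused on each topic-and-class subset.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical): one global topic whose documents come from two classes.
docs = [
    "bike engine ride",
    "bike helmet ride",
    "car engine oil",
    "car bike dealer",
]
topics = [0, 0, 0, 0]
classes = ["rec.motorcycles", "rec.motorcycles", "rec.autos", "rec.autos"]


def fit_idf(docs):
    """Stand-in for the pre-fitted weighting: smoothed IDF over all documents."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    return {word: math.log(1 + n_docs / count) for word, count in doc_freq.items()}


def top_words_per_topic_and_class(docs, topics, classes, idf, top_n=2):
    # Split documents by topic and class, counting terms within each subset.
    counts = defaultdict(Counter)
    for doc, topic, cls in zip(docs, topics, classes):
        counts[(topic, cls)].update(doc.split())

    # Apply the global weights to each subset; nothing is re-fitted here.
    result = {}
    for key, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {w: (c / total) * idf[w] for w, c in term_counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[key] = ranked[:top_n]
    return result


per_class = top_words_per_topic_and_class(docs, topics, classes, fit_idf(docs))
for (topic, cls), words in sorted(per_class.items()):
    print(topic, cls, words)
```

In this toy setup the same topic receives different top words in each class, which is the effect the per-class representation is meant to expose.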


To do so, we use the 20 Newsgroups dataset to see how the topics that we uncover are represented in the 20 categories of documents.

First, let's prepare the data:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load all 20 Newsgroups posts without headers, footers, and quoted replies
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = data["data"]
targets = data["target"]
target_names = data["target_names"]

# Map each document's integer target to its category name
classes = [target_names[i] for i in targets]

Next, we want to extract the topics across all documents without taking the categories into account:

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

Now that we have created our global topic model, let us calculate the topic representations for each category:

topics_per_class = topic_model.topics_per_class(docs, classes=classes)

The classes variable contains the class for each document. Then, we simply visualize these topics per class:

topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)

You can hover over the bars to see the topic representation per class.
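
As a rough intuition for what a per-class chart like this summarizes, the sketch below counts how many documents of each class fall into each topic. It is a toy example: the document list, topic assignments, and class labels are made up, and it assumes the bar lengths correspond to per-class topic counts. It also makes explicit that `classes` must hold exactly one label per document, in the same order as the documents.

```python
from collections import Counter

# Toy stand-ins (hypothetical) for the real docs, topics, and classes.
docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
topics = [58, 58, 93, 58, 93]
classes = [
    "rec.motorcycles",
    "rec.autos",
    "talk.politics.misc",
    "rec.motorcycles",
    "soc.religion.christian",
]

# Each document needs exactly one topic and one class label, matched by position.
if not (len(docs) == len(topics) == len(classes)):
    raise ValueError("docs, topics, and classes must have the same length")

# Count documents per (topic, class) pair.
frequency = Counter(zip(topics, classes))

for (topic, cls), count in sorted(frequency.items()):
    print(f"topic {topic:>3} | {cls:<24} | {count}")
```

A topic that shows up under several classes here is one whose per-class representations are worth comparing.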

As you can see in the visualization above, the topics 93_homosexual_homosexuality_sex and 58_bike_bikes_motorcycle are somewhat distributed over all classes.

For 58_bike_bikes_motorcycle, the topic representation in rec.motorcycles clearly differs from the one in rec.autos. It seems that BERTopic has tried to combine those two categories into a single topic. However, since they do contain two separate topics, the topic is represented differently in each category.

We see something similar for 93_homosexual_homosexuality_sex, where the topic is distributed among several categories and is represented slightly differently.

Thus, although a topic may be similar across certain categories, the way it is represented in each category can differ.