Topics per Class
In some cases, you might be interested in how certain topics are represented across certain categories. For example, you may want to see how specific groups of users talk about certain topics.
Instead of running the topic model per class, we can simply create a topic model and then extract, for each topic, its representation per class. This allows you to see how certain topics, calculated over all documents, are represented for certain subgroups.
To do so, we use the 20 Newsgroups dataset to see how the topics that we uncover are represented in the 20 categories of documents.
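Before turning to the real data, the idea of a per-class representation can be sketched in plain Python. The snippet below is only an illustration under simplifying assumptions: the documents, topic assignments, and class labels are made up, and raw word counts stand in for the c-TF-IDF weighting that BERTopic uses. The helper name words_per_topic_and_class is hypothetical.

```python
from collections import Counter, defaultdict


def words_per_topic_and_class(docs, topics, classes, top_n=3):
    """Return the most frequent words for every (topic, class) pair.

    Hypothetical helper: plain word counts stand in for c-TF-IDF.
    """
    counts = defaultdict(Counter)
    for doc, topic, cls in zip(docs, topics, classes):
        # Pool the words of all documents sharing a topic and a class.
        counts[(topic, cls)].update(doc.lower().split())
    return {
        key: [word for word, _ in counter.most_common(top_n)]
        for key, counter in counts.items()
    }


# Made-up toy data: one global topic (0) shared by two classes.
toy_docs = [
    "bike helmet bike ride",
    "bike ride ride ride",
    "car engine car bike",
    "car car engine engine",
]
toy_topics = [0, 0, 0, 0]
toy_classes = ["rec.motorcycles", "rec.motorcycles", "rec.autos", "rec.autos"]

per_class = words_per_topic_and_class(toy_docs, toy_topics, toy_classes, top_n=2)
for (topic, cls), words in sorted(per_class.items()):
    print(topic, cls, words)
```

In this toy example the single topic is described by car and engine within rec.autos and by ride and bike within rec.motorcycles, which is the kind of difference a per-class representation is meant to surface.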
First, let's prepare the data:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = data["data"]
targets = data["target"]
target_names = data["target_names"]
classes = [data["target_names"][i] for i in data["target"]]
```
Next, we want to extract the topics across all documents without taking the categories into account:
```python
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)
```
Now that we have created our global topic model, let us calculate the topic representations across each category:
```python
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
```
The classes variable contains the class label for each document. Then, we simply visualize these topics per class by passing the result to topic_model.visualize_topics_per_class.
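The visualization step can be sketched as a small helper. visualize_topics_per_class is the BERTopic method for this plot; the wrapper name plot_topics_per_class, the html_path argument, and the choice of ten topics are assumptions made for illustration.

```python
def plot_topics_per_class(topic_model, topics_per_class, top_n_topics=10, html_path=None):
    """Hypothetical wrapper around BERTopic's visualize_topics_per_class."""
    # Plot the frequency of the selected topics within each class.
    fig = topic_model.visualize_topics_per_class(
        topics_per_class, top_n_topics=top_n_topics
    )
    # The result is a Plotly figure, so it can optionally be saved as HTML.
    if html_path is not None:
        fig.write_html(html_path)
    return fig
```

With the objects from this walkthrough, plot_topics_per_class(topic_model, topics_per_class).show() would display the interactive chart.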
You can hover over the bars to see the topic representation per class.
As you can see in the visualization above, the topics are somewhat distributed over all classes.
You can see that the topic representations for rec.motorcycles and rec.autos differ from one another, even though both categories fall under the same topic. It seems that BERTopic has combined those two categories into a single topic. However, since they do contain two separate subjects, the topic representation differs between the two categories.
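To inspect such a difference outside the chart, the rows for one topic can be compared between two classes. The sketch below assumes that topics_per_class is a pandas DataFrame with Topic, Words, and Class columns, that Words holds a comma-separated string, and that the frame has been converted with topics_per_class.to_dict("records"). The helper name compare_classes and the sample rows are made up for illustration.

```python
def compare_classes(rows, topic, class_a, class_b):
    """Split a topic's words into shared and class-specific sets.

    Hypothetical helper; `rows` is assumed to be a list of dicts with
    "Topic", "Words" (comma-separated string), and "Class" keys.
    """
    def words_for(cls):
        for row in rows:
            if row["Topic"] == topic and row["Class"] == cls:
                return {w.strip() for w in row["Words"].split(",")}
        return set()  # no row for this topic/class combination

    a, b = words_for(class_a), words_for(class_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Made-up rows in the assumed shape; real values come from the fitted model.
sample_rows = [
    {"Topic": 1, "Words": "bike, ride, helmet", "Class": "rec.motorcycles"},
    {"Topic": 1, "Words": "car, bike, engine", "Class": "rec.autos"},
]
diff = compare_classes(sample_rows, 1, "rec.motorcycles", "rec.autos")
print(sorted(diff["shared"]), sorted(diff["only_a"]), sorted(diff["only_b"]))
```

Words that appear only on one side hint at how the same topic is framed differently in each category.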
We see something similar for 93_homosexual_homosexuality_sex, where the topic is distributed among several categories and is represented slightly differently.
Thus, although a topic can be shared across several categories, the way it is represented can differ from one category to the next.