Dynamic Topic Modeling
Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how the representation of a topic changes across different periods. For example, people in 1995 may have talked differently about environmental awareness than people in 2015. Although the topic itself remains the same, environmental awareness, the exact representation of that topic might differ.
BERTopic allows for DTM by calculating the topic representation at each timestep without needing to run the entire model several times. To do this, we first fit BERTopic as if there were no temporal aspect in the data, which creates a general topic model. We use this global representation as the basis for the topics that can, most likely, be found at different timesteps. Then, for each topic and timestep, we calculate the c-TF-IDF representation. This results in a specific topic representation at each timestep without having to create clusters from embeddings, as those were already created.
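To illustrate the idea, a rough sketch of this step, not BERTopic's actual internals, could look as follows, using a few dummy documents, topic assignments, and timesteps:
import pandas as pd
# Conceptual sketch, not BERTopic's actual code: documents are grouped per
# assigned topic and per timestep, after which a c-TF-IDF representation
# would be computed over each group of concatenated documents.
docs = ["climate change is real", "lower taxes now", "protect the environment"]
topics = [0, 1, 0]              # topic assignments from the global model
timesteps = [1995, 1995, 2015]
df = pd.DataFrame({"doc": docs, "topic": topics, "timestep": timesteps})
docs_per_group = df.groupby(["topic", "timestep"])["doc"].apply(" ".join)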
Next, there are two main ways to further fine-tune these specific topic representations, namely global and evolutionary tuning.
A topic representation at timestep t can be fine-tuned globally by averaging its c-TF-IDF representation with that of the global representation. This allows each topic representation to move slightly towards the global representation whilst still keeping some of its specific words.
A topic representation at timestep t can be fine-tuned evolutionarily by averaging its c-TF-IDF representation with the c-TF-IDF representation at timestep t-1. This is done for each topic representation, allowing the representations to evolve over time.
Both fine-tuning methods are set to True by default and allow for interesting representations to be created.
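Both tuning steps boil down to a simple averaging of c-TF-IDF vectors. A minimal sketch, not BERTopic's internal code, where c_t, c_prev, and c_global stand in for a topic's c-TF-IDF vectors at timestep t, at timestep t-1, and globally:
import numpy as np
# Minimal sketch of both tuning steps (not BERTopic's internal code)
c_t = np.array([0.2, 0.5, 0.1])       # topic at timestep t
c_prev = np.array([0.3, 0.4, 0.2])    # same topic at timestep t-1
c_global = np.array([0.1, 0.6, 0.3])  # global topic representation
evolution_tuned = (c_t + c_prev) / 2  # average with timestep t-1
global_tuned = (c_t + c_global) / 2   # average with the global representation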
Example¶
To demonstrate DTM in BERTopic, we first need to prepare our data. A good example of where DTM is useful is topic modeling on Twitter data. We can analyze how certain people have talked about certain topics in the years they have been on Twitter. Due to the controversial nature of his tweets, we are going to be using all tweets by Donald Trump.
First, we need to load the data and do some very basic cleaning. For example, I am not interested in his retweets for this use case:
import re
import pandas as pd

# Prepare data
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')

# Remove URLs and lowercase the text
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)

# Remove @mentions
trump.text = trump.apply(lambda row: " ".join(filter(lambda x: x[0] != "@", row.text.split())), 1)

# Keep alphabetic characters only and normalize whitespace
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)

# Drop retweets and empty tweets
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]

timestamps = trump.date.to_list()
tweets = trump.text.to_list()
Then, we need to extract the global topic representations by simply creating and training a BERTopic model:
from bertopic import BERTopic
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)
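Before adding the temporal dimension, it can be useful to verify that sensible global topics were found:
# Inspect the global topic representations first
topic_model.get_topic_info().head()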
From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this by simply calling topics_over_time and passing in the tweets and their corresponding timestamps:
topics_over_time = topic_model.topics_over_time(tweets, timestamps, nr_bins=20)
And that is it! Aside from what you always need for BERTopic, you now only need to add timestamps
to quickly calculate the topics over time.
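The result is a dataframe that contains, for each topic at each timestep, its time-specific words and its frequency. You can inspect it directly:
# Each row holds a topic's words and frequency at a specific timestep
topics_over_time.head()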
Parameters¶
There are a few parameters of interest, which are discussed below.
Tuning¶
Both global_tuning and evolution_tuning are set to True by default but can easily be changed. Perhaps you do not want the representations to be influenced by the global representation and would merely like to see how they evolved over time:
topics_over_time = topic_model.topics_over_time(tweets, timestamps,
                                                global_tuning=False, evolution_tuning=True, nr_bins=20)
Bins¶
If you have more than 100 unique timestamps, then topic representations will be created for each of those timestamps, which can negatively affect the quality of those representations. It is advised to keep the number of unique timestamps below 50. To do this, you can simply set the number of bins that are created when calculating the topic representations. The timestamps will then be put into equal-sized bins:
topics_over_time = topic_model.topics_over_time(tweets, timestamps, nr_bins=20)
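If you are unsure whether binning is necessary for your data, you can first count the unique timestamps:
# Check how many unique timestamps there are before deciding on binning
len(set(timestamps))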
Datetime format¶
If you are passing strings (dates) instead of integers, then BERTopic will try to automatically detect
which datetime format your strings have. Unfortunately, this will not always work if they are in an unexpected format.
We can use datetime_format
to pass the format the timestamps have:
topics_over_time = topic_model.topics_over_time(tweets, timestamps, datetime_format="%b%Y", nr_bins=20)
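Alternatively, you can parse the strings yourself before passing them on, for example with pandas:
import pandas as pd
# Convert the date strings to datetime objects up front instead of
# relying on automatic detection
timestamps = pd.to_datetime(trump.date).to_list()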
Visualization¶
To me, DTM becomes truly interesting when you have a good way of visualizing how topics have changed over time.
A nice way of doing so is by leveraging the interactive abilities of Plotly. Plotly allows us to show the frequency
of topics over time whilst giving the option of hovering over the points to show the time-specific topic representations.
Simply call visualize_topics_over_time
with the newly created topics over time:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)
I used top_n_topics to show only the 20 most frequent topics. If I were to visualize all topics, which is possible by leaving top_n_topics empty, the plot could quickly be filled with hundreds of lines.
You can also use topics
to show specific topics:
topic_model.visualize_topics_over_time(topics_over_time, topics=[9, 10, 72, 83, 87, 91])
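Since the output is a Plotly figure, you can also save the visualization as an interactive HTML file to share it:
# Store the interactive figure so it can be shared or embedded
fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)
fig.write_html("topics_over_time.html")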