4. Vectorizers
In topic modeling, the quality of the topic representations is key for interpreting the topics, communicating results, and understanding patterns. It is of utmost importance to make sure that the topic representations fit with your use case.
In practice, there is no single correct way of creating topic representations. Some use cases might call for longer n-grams, whereas others might focus on single words without any stop words. This diversity also means that BERTopic needs enough flexibility to support most use cases.
In this section, we will go through several examples of vectorization algorithms and how they can be implemented.
CountVectorizer¶
One often underestimated component of BERTopic is the CountVectorizer
and c-TF-IDF
calculation. Together, they are responsible for creating the topic representations and luckily
can be quite flexible in parameter tuning. Here, we will go through tips and tricks for tuning your CountVectorizer
and see how they might affect the topic representations.
Before starting, it should be noted that you can pass the CountVectorizer
before and after training your topic model. Passing it before training allows you to
minimize the size of the resulting c-TF-IDF
matrix:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
# Train BERTopic with a custom CountVectorizer
vectorizer_model = CountVectorizer(min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)
Passing it after training allows you to fine-tune the topic representations by using .update_topics()
:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
# Train a BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Fine-tune topic representations after training BERTopic
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
The great thing about using .update_topics()
is that it allows you to tweak the topic representations without re-training your model! Thus, here we will be focusing
on fine-tuning our topic representations after training our model.
Note
The great thing about processing our topic representations with the CountVectorizer
is that it does not influence the quality of the clusters, as clustering is performed before the topic representations are generated.
Basic Usage¶
First, let's start with defining our documents and training our topic model:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Prepare documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Train a BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
Now, let's see the top 10 most frequent topics that have been generated:
>>> topic_model.get_topic_info()[1:11]
Topic Count Name
1 0 1822 0_game_team_games_he
2 1 580 1_key_clipper_chip_encryption
3 2 532 2_ites_hello_cheek_hi
4 3 493 3_israel_israeli_jews_arab
5 4 453 4_card_monitor_video_drivers
6 5 438 5_you_your_post_jim
7 6 314 6_car_cars_engine_ford
8 7 279 7_health_newsgroup_cancer_1993
9 8 218 8_fbi_koresh_fire_gas
10 9 174 9_amp_audio_condition_asking
The topic representations generated already seem quite interpretable! However, I am quite sure we can do much better without having
to re-train our model. Next, we will go through common parameters in CountVectorizer
and focus on the effects that they might have. As a baseline, we will be comparing
them to the topic representation above.
Parameters¶
There are several basic parameters in the CountVectorizer that we can use to improve upon the quality of the resulting topic representations.
ngram_range¶
The ngram_range
parameter allows us to decide how many tokens each entity in a topic representation may contain. For example, we have words like
game
and team
with a length of 1 in a topic but it would also make sense to have words like hockey league
with a length of 2. To allow for these words to be generated,
we can set the ngram_range
parameter:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english")
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
As you might have noticed, I also added stop_words="english"
. This is necessary as longer n-grams tend to contain many stop words, and removing them allows
for nicer topic representations:
>>> topic_model.get_topic_info()[1:11]
Topic Count Name
1 0 1822 0_game_team_games_players
2 1 580 1_key_clipper_chip_encryption
3 2 532 2_hello ites_forget hello_ites 15_huh hi
4 3 493 3_israel_israeli_jews_arab
5 4 453 4_card_monitor_video_drivers
6 5 438 5_post_jim_context_forged
7 6 314 6_car_cars_engine_ford
8 7 279 7_health_newsgroup_cancer_1993
9 8 218 8_fbi_koresh_gas_compound
10 9 174 9_amp_audio_condition_asking
Although they look very similar, if we zoom in on topic 8, we can see longer words in our representation:
>>> topic_model.get_topic(8)
[('fbi', 0.019637149205975653),
('koresh', 0.019054514637064403),
('gas', 0.014156057632897179),
('compound', 0.012381224868591681),
('batf', 0.010349992314076047),
('children', 0.009336408916322387),
('tear gas', 0.008941747802855279),
('tear', 0.008446786597564537),
('davidians', 0.007911119583253022),
('started', 0.007398687505638955)]
tear
and gas
have now been combined into a single representation. This helps us understand what those individual words might have been representing.
stop_words¶
In some of the topics, we can see stop words appearing like he
or the
.
Stop words are something we typically want to prevent in our topic representations as they do not give additional information to the topic.
To prevent those stop words, we can use the stop_words
parameter in the CountVectorizer
to remove them from the representations:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english")
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
After running the above, we get the following output:
>>> topic_model.get_topic_info()[1:11]
Topic Count Name
1 0 1822 0_game_team_games_players
2 1 580 1_key_clipper_chip_encryption
3 2 532 2_ites_cheek_hello_hi
4 3 493 3_israel_israeli_jews_arab
5 4 453 4_monitor_card_video_vga
6 5 438 5_post_jim_context_forged
7 6 314 6_car_cars_engine_ford
8 7 279 7_health_newsgroup_cancer_tobacco
9 8 218 8_fbi_koresh_gas_compound
10 9 174 9_amp_audio_condition_stereo
As you can see, the topic representations already look much better! Stop words are removed and the representations are more interpretable. You can also pass in a list of stop words if you have multiple languages to take into account, as sketched below.
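For example, a minimal sketch of combining stop words from multiple languages into a single list (the short word lists here are purely illustrative and far from complete):
from sklearn.feature_extraction.text import CountVectorizer
# Illustrative, incomplete stop word lists for two languages
english_stop_words = ["the", "and", "of", "to", "in"]
dutch_stop_words = ["de", "het", "een", "en", "van"]
# Combine the lists and pass them to the CountVectorizer
vectorizer_model = CountVectorizer(stop_words=english_stop_words + dutch_stop_words)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)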
min_df¶
One important parameter to keep in mind is the min_df
. This is typically an integer representing how frequent a word must be before
being added to our representation. You can imagine that if we have a million documents and a certain word only appears a single time across all of them, then
it would be highly unlikely to be representative of a topic. Typically, the c-TF-IDF
calculation removes that word from the topic representation but when
you have millions of documents, that will also lead to a very large topic-term matrix. To prevent a huge vocabulary, we can set the min_df
to only accept
words that have a minimum frequency.
When you have millions of documents, or run into memory errors, I would advise increasing the value of min_df
as long as the topic representations still make sense:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(min_df=10)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
With the following topic representation:
>>> topic_model.get_topic_info()[1:11]
Topic Count Name
1 0 1822 0_game_team_games_he
2 1 580 1_key_clipper_chip_encryption
3 2 532 2_hello_hi_yep_huh
4 3 493 3_israel_jews_jewish_peace
5 4 453 4_card_monitor_video_drivers
6 5 438 5_you_your_post_jim
7 6 314 6_car_cars_engine_ford
8 7 279 7_health_newsgroup_cancer_1993
9 8 218 8_fbi_koresh_fire_gas
10 9 174 9_audio_condition_stereo_asking
As you can see, the output is nearly the same, which is what we would like to achieve. All words that appear fewer than 10 times are now removed
from our topic-term matrix (i.e., the c-TF-IDF
matrix), which drastically reduces its size.
max_features¶
A parameter similar to min_df
is max_features
which allows you to select the top n most frequent words to be used in the topic representation.
Setting this, for example, to 10_000
creates a topic-term matrix with 10_000
terms. This helps you control the size of the topic-term matrix
directly without having to fiddle around with the min_df
parameter:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(max_features=10_000)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
With the following representation:
>>> topic_model.get_topic_info()[1:11]
Topic Count Name
1 0 1822 0_game_team_games_he
2 1 580 1_key_clipper_chip_encryption
3 2 532 2_hello_hi_yep_huh
4 3 493 3_israel_israeli_jews_arab
5 4 453 4_card_monitor_video_drivers
6 5 438 5_you_your_post_jim
7 6 314 6_car_cars_engine_ford
8 7 279 7_health_newsgroup_cancer_1993
9 8 218 8_fbi_koresh_fire_gas
10 9 174 9_amp_audio_condition_asking
As with min_df
, we would like the topic representations to be very similar.
tokenizer¶
The default tokenizer in the CountVectorizer works well for western languages but fails to tokenize some non-western languages, like Chinese.
Fortunately, we can use the tokenizer
parameter in the CountVectorizer to plug in jieba
, which is a package
for Chinese text segmentation. Using it is straightforward:
from sklearn.feature_extraction.text import CountVectorizer
import jieba
def tokenize_zh(text):
words = jieba.lcut(text)
return words
vectorizer_model = CountVectorizer(tokenizer=tokenize_zh)
Then, we can simply pass the vectorizer to update our topic representations:
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
OnlineCountVectorizer¶
When using the online/incremental variant of BERTopic, we need a CountVectorizer
that can incrementally update its representation. For that purpose, OnlineCountVectorizer
was created. It not only updates out-of-vocabulary words but also implements decay and cleaning functions to prevent the sparse bag-of-words matrix from becoming too large. It is a class that can be found in bertopic.vectorizers
which extends sklearn.feature_extraction.text.CountVectorizer
. In other words, you can use the exact same parameters in OnlineCountVectorizer
as found in Scikit-Learn's CountVectorizer
. We can use it as follows:
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
# Train BERTopic with a custom OnlineCountVectorizer
vectorizer_model = OnlineCountVectorizer()
topic_model = BERTopic(vectorizer_model=vectorizer_model)
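As a rough sketch of how this fits into an online workflow, assuming documents arrive in chunks and that the sub-models support incremental learning (the IncrementalPCA and MiniBatchKMeans choices and the doc_chunks variable below are illustrative):
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
# Incremental counterparts for dimensionality reduction and clustering
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english")
topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model
)
# doc_chunks is assumed to be an iterable of lists of documents
for docs_chunk in doc_chunks:
    topic_model.partial_fit(docs_chunk)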
Parameters¶
Other than parameters found in CountVectorizer
, such as stop_words
and ngram_range
, we can use two parameters in OnlineCountVectorizer
to adjust the way old data is processed and kept.
decay¶
At each iteration, we sum the bag-of-words representation of the new documents with the bag-of-words representation of all documents processed thus far. In other words, the bag-of-words matrix keeps increasing with each iteration. However, especially in a streaming setting, older documents might become less and less relevant as time goes on. Therefore, a decay
parameter was implemented that decays the bag-of-words' frequencies at each iteration before adding the document frequencies of new documents. The decay
parameter is a value between 0 and 1 and indicates the percentage by which the frequencies in the previous bag-of-words matrix should be reduced. For example, a value of .1
will decrease the frequencies in the bag-of-words matrix by 10% at each iteration before adding the new bag-of-words matrix. This will make sure that recent data has more weight than previous iterations.
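For example, a minimal sketch that matches the 10% reduction described above:
from bertopic.vectorizers import OnlineCountVectorizer
# Reduce the accumulated bag-of-words frequencies by 10% at each iteration
vectorizer_model = OnlineCountVectorizer(decay=.1)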
delete_min_df¶
In BERTopic, we might want to remove words from the topic representation that appear infrequently. The min_df
in the CountVectorizer
works quite well for that. However, when we have a streaming setting, the min_df
does not work as well since a word's frequency might start below min_df
but will end up higher than that over time. Setting that value high might not always be advised.
As a result, the vocabulary of the resulting bag-of-words matrix can become quite large. Similarly, if we implement the decay
parameter, then some values will decrease over time until they are below min_df
. For these reasons, the delete_min_df
parameter was implemented. The parameter takes a positive integer and determines, at each iteration, which words will be removed. If the value is set to 5, then after each iteration every word whose total frequency has fallen below that value is removed in its entirety from the bag-of-words matrix. This helps to keep the bag-of-words matrix at a manageable size.
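Continuing the sketch above, decay and delete_min_df can be combined:
from bertopic.vectorizers import OnlineCountVectorizer
# Decay old frequencies and, after each iteration, drop words whose
# total frequency has fallen below 5
vectorizer_model = OnlineCountVectorizer(decay=.1, delete_min_df=5)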
Note
Although the delete_min_df
parameter removes words from the bag-of-words matrix, it is not permanent. If new documents come in where those previously deleted words are used frequently, they get added back to the matrix.