Merge Multiple Fitted Models¶
After you have trained a new BERTopic model on your data, new data might still be coming in. Although you can use online BERTopic, you might prefer to use the default HDBSCAN and UMAP models since they do not support incremental learning out of the box.
Instead, we you can train a new BERTopic on incoming data and merge it with your base model to detect whether new topics have appeared in the unseen documents. This is a great way of detecting whether your new model contains information that was not previously found in your base topic model.
Similarly, you might want to train multiple BERTopic models using different sets of settings, even though they might all be using the same underlying embedding model. Merging these models would also allow for a single model that you can use throughout your use cases.
Lastly, this methods also allows for a degree of federated learning
where each node trains a topic model that are aggregated in a central server.
Example¶
To demonstrate merging different topic models with BERTopic, we use the ArXiv paper abstracts to see which topics they generally contain.
First, we train three separate models on different parts of the data:
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
# Extract abstracts to train on and corresponding titles
abstracts_1 = dataset["abstract"][:5_000]
abstracts_2 = dataset["abstract"][5_000:10_000]
abstracts_3 = dataset["abstract"][10_000:15_000]
# Create topic models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model_1 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=umap_model, min_topic_size=20).fit(abstracts_3)
Then, we can combine all three models into one with .merge_models
:
# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])
When we inspect the first model, we can see it has 52 topics:
>>> len(topic_model_1.get_topic_info())
52
Now, we inspect the merged model, we can see it has 57 topics:
>>> len(merged_model.get_topic_info())
57
It seems that by merging these three models, there were 6 undiscovered topics that we could add to the very first model.
Note
Note that the models are merged sequentially. This means that the comparison starts with topic_model_1
and that
each new topic from topic_model_2
and topic_model_3
will be added to topic_model_1
.
We can check the newly added topics in the merged_model
by simply looking at the 6 latest topics that were added. The order of topics from topic_model_1
remains the same. All new topics are simply added on top of them.
Let's inspect them:
>>> merged_model.get_topic_info().tail(5)
Topic | Count | Name | Representation | Representative_Docs | |
---|---|---|---|---|---|
52 | 51 | 47 | 50_activity_mobile_wearable_sensors | ['activity', 'mobile', 'wearable', 'sensors', 'falls', 'human', 'phone', 'recognition', 'activities', 'accelerometer'] | nan |
53 | 52 | 48 | 25_music_musical_audio_chord | ['music', 'musical', 'audio', 'chord', 'and', 'we', 'to', 'that', 'of', 'for'] | nan |
54 | 53 | 32 | 36_fairness_discrimination_fair_groups | ['fairness', 'discrimination', 'fair', 'groups', 'protected', 'decision', 'we', 'of', 'classifier', 'to'] | nan |
55 | 54 | 30 | 38_traffic_driver_prediction_flow | ['traffic', 'driver', 'prediction', 'flow', 'trajectory', 'the', 'and', 'congestion', 'of', 'transportation'] | nan |
56 | 55 | 22 | 50_spiking_neurons_networks_learning | ['spiking', 'neurons', 'networks', 'learning', 'neural', 'snn', 'dynamics', 'plasticity', 'snns', 'of'] | nan |
It seems that topics about activity, music, fairness, traffic, and spiking networks were added to the base topic model! Two things that you might have noticed. First,
the representative documents were not added to the model. This is because of privacy reasons, you might want to combine models that were trained on different stations which
would allow for a degree of federated learning
. Second, the names of the new topics contain topic ids that refer to one of the old models. They were purposefully left this way
so that the user can identify which topics were newly added which you could inspect in the original models.
min_similarity¶
The way the models are merged is through comparison of their topic embeddings. If topics between models are similar enough, then they will be regarded as the same topics and the topic of the first model in the list will be chosen. However, if topics between models are dissimilar enough, then the topic of the latter model will be added to the former.
This (dis)similarity is can be tweaked using the min_similarity
parameter. Increasing this value will increase the chance of adding new topics. In contrast, decreasing this value
will make it more strict and threfore decrease the chance of adding new topics. The value is set to 0.7
by default, so let's see what happens if we were to increase this value to
`0.9``:
# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3], min_similarity=0.9)
When we inspect the number of topics in our new model, we can see that they have increased quite a bit:
>>> len(merged_model.get_topic_info())
102
This demonstrates the influence of min_similarity
on the number of new topics that are added to the base model.