6A. Representation Models
One of the core components of BERTopic is its Bag-of-Words representation weighted with c-TF-IDF. This method is fast and can quickly generate a number of keywords for a topic without depending on the clustering task. As a result, topics can easily and quickly be updated after training the model without the need to re-train it. Although c-TF-IDF already produces good topic representations, we may want to fine-tune them further.
As such, a number of representation models are implemented in BERTopic that allow for further fine-tuning of the topic representations. These are optional and are not used by default. You are not restricted in how the representations can be fine-tuned: anything from GPT-like models to fast keyword extraction with KeyBERT-like models is possible.
For each model below, an example will be shown on how it may change or improve upon the default topic keywords that are generated. The dataset used in these examples can be found here.
If you want to have multiple representations of a single topic, it might be worthwhile to also check out multi-aspect topic modeling with BERTopic.
KeyBERTInspired
After having generated our topics with c-TF-IDF, we might want to do some fine-tuning based on the semantic relationship between keywords/keyphrases and the set of documents in each topic. Although we can use a centroid-based technique for this, it can be costly and does not take the structure of a cluster into account. Instead, we leverage c-TF-IDF to create a set of representative documents per topic and use those as our updated topic embedding. Then, we calculate the similarity between candidate keywords and the topic embedding using the same embedding model that embedded the documents.
Thus, the algorithm follows some principles of KeyBERT but does some optimization in order to speed up inference. Usage is straightforward:
```python
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

# Create your representation model
representation_model = KeyBERTInspired()

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
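Since representation models only refine the candidate keywords, they can also be applied after training, without re-running the clustering. A minimal sketch using BERTopic's `update_topics`, where `docs` stands in for your own list of documents:

```python
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

# Train with the default c-TF-IDF representation first
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)  # docs: your list of documents

# Fine-tune the topic keywords afterwards, without re-training the model
topic_model.update_topics(docs, representation_model=KeyBERTInspired())
```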
PartOfSpeech
Our candidate topics, as extracted with c-TF-IDF, do not take a keyword's part of speech into account, as extracting noun phrases from all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part-of-speech tagging on a subset of keywords and documents that best represent a topic.
More specifically, we find documents that contain the keywords from our candidate topics as calculated with c-TF-IDF. These documents serve as the representative set of documents from which the spaCy model can extract a set of candidate keywords for each topic. These candidate keywords are then run through spaCy's POS module to see whether they match the DEFAULT_PATTERNS:
```python
DEFAULT_PATTERNS = [
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'NOUN'}],
    [{'POS': 'ADJ'}]
]
```
These patterns follow spaCy's Rule-Based Matching. The resulting keywords are then sorted by their respective c-TF-IDF values.
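To make the pattern syntax concrete, here is a small standalone sketch of spaCy's Matcher applying the ADJ-followed-by-NOUN pattern from DEFAULT_PATTERNS (the example sentence is purely illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# The ADJ followed by NOUN pattern from DEFAULT_PATTERNS
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

doc = nlp("The digital camera produces sharp images.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "digital camera", "sharp images"
```

In BERTopic, the PartOfSpeech model is then used as follows: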
```python
from bertopic.representation import PartOfSpeech
from bertopic import BERTopic

# Create your representation model
representation_model = PartOfSpeech("en_core_web_sm")

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
You can define custom POS patterns to be extracted:
```python
pos_patterns = [
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'NOUN'}],
    [{'POS': 'ADJ'}]
]
representation_model = PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns)
```
MaximalMarginalRelevance
When we calculate the weights of keywords, we typically do not consider whether similar keywords already appear in the topic. Words like "car" and "cars" essentially represent the same information and are often redundant.

To decrease this redundancy and improve the diversity of keywords, we can use an algorithm called Maximal Marginal Relevance (MMR). MMR considers the similarity of keywords/keyphrases to the document, along with their similarity to the keywords and keyphrases that have already been selected. This results in a selection of keywords that are relevant to the document while remaining diverse among themselves.
```python
from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic

# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
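For intuition, the selection step behind MMR can be sketched in a few lines of numpy/scikit-learn. This is an illustrative re-implementation, not BERTopic's exact code; a higher `diversity` penalizes keywords that resemble ones already picked:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, word_embeddings, words, diversity=0.3, top_n=5):
    """Pick top_n keywords that are relevant to the document yet dissimilar to each other."""
    # Relevance: similarity of each candidate keyword to the document/topic embedding
    relevance = cosine_similarity(word_embeddings, doc_embedding.reshape(1, -1))[:, 0]
    # Redundancy: pairwise similarity between the candidate keywords themselves
    word_word_sim = cosine_similarity(word_embeddings)

    selected = [int(np.argmax(relevance))]
    candidates = [i for i in range(len(words)) if i not in selected]

    for _ in range(min(top_n, len(words)) - 1):
        # MMR score: reward relevance, penalize similarity to already-selected keywords
        redundancy = word_word_sim[np.ix_(candidates, selected)].max(axis=1)
        scores = (1 - diversity) * relevance[candidates] - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)

    return [words[i] for i in selected]

# Hypothetical usage, given embeddings for a topic and its candidate keywords:
# keywords = mmr(topic_embedding, keyword_embeddings, candidate_words, diversity=0.3)
```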
Zero-Shot Classification
For some use cases, you might already have a set of candidate labels that you would like to automatically assign to some of the topics. Although we can use guided or supervised BERTopic for that, we can also use zero-shot classification to assign labels to our topics. For that, we can make use of 🤗 transformers and the models on its model hub.
To perform this classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels. If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.
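Under the hood, this is conceptually similar to running a 🤗 transformers zero-shot pipeline over the topic keywords. A minimal standalone sketch of the idea, where the keywords and the 0.8 threshold are illustrative rather than BERTopic's internals:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Illustrative c-TF-IDF keywords of a single topic, joined into one sequence
keywords = "space launch nasa orbit shuttle"
candidate_labels = ["space and nasa", "bicycles", "sports"]

result = classifier(keywords, candidate_labels=candidate_labels)

# Assign the best-scoring label only if it is similar enough;
# otherwise, keep the original c-TF-IDF keywords (0.8 is an illustrative threshold)
if result["scores"][0] >= 0.8:
    label = result["labels"][0]
```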
We use it in BERTopic as follows:
```python
from bertopic.representation import ZeroShotClassification
from bertopic import BERTopic

# Create your representation model
candidate_topics = ["space and nasa", "bicycles", "sports"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
Chain Models
All of the above models can make use of the candidate topics, as generated by c-TF-IDF, to further fine-tune the topic representations. For example, MaximalMarginalRelevance takes the keywords in the candidate topics and re-ranks them. Similarly, the keywords in a candidate topic can be used as input for GPT prompts in OpenAI.
Although the default candidate topics are generated by c-TF-IDF, what if we were to chain these models? For example, we could use MaximalMarginalRelevance to improve upon the keywords in each topic before passing them to OpenAI.
This is supported in BERTopic by simply passing a list of representation models when instantiating the topic model:
```python
from bertopic.representation import MaximalMarginalRelevance, OpenAI
from bertopic import BERTopic
import openai

# Create your representation models
client = openai.OpenAI(api_key="sk-...")
openai_generator = OpenAI(client)
mmr = MaximalMarginalRelevance(diversity=0.3)
representation_models = [mmr, openai_generator]

# Use the chained models
topic_model = BERTopic(representation_model=representation_models)
```
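The models are applied in the order they are listed: here, MMR first re-ranks the keywords and OpenAI then builds its prompt from the result. The same mechanism works for any combination of the models above; for instance, a chain that needs no API key (a sketch):

```python
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic import BERTopic

# KeyBERTInspired refines the c-TF-IDF keywords first; MMR then diversifies them
representation_models = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)]
topic_model = BERTopic(representation_model=representation_models)
```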
Custom Model
Although several representation models have been implemented in BERTopic, new technologies get released often and we should not have to wait until they get implemented in BERTopic. Therefore, you can create your own representation model and use that to fine-tune the topics.
The following is the basic structure for creating your custom model. Note that it returns the same topics as those calculated with c-TF-IDF:
```python
from typing import List, Mapping, Tuple

from bertopic.representation._base import BaseRepresentation


class CustomRepresentationModel(BaseRepresentation):
    def extract_topics(self, topic_model, documents, c_tf_idf, topics
                       ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics

        Arguments:
            topic_model: The BERTopic model
            documents: A dataframe of documents with their related topics
            c_tf_idf: The c-TF-IDF matrix
            topics: The candidate topics as calculated with c-TF-IDF

        Returns:
            updated_topics: Updated topic representations
        """
        updated_topics = topics.copy()
        return updated_topics
```
Then, we can use that model as follows:
```python
from bertopic import BERTopic

# Create our custom representation model
representation_model = CustomRepresentationModel()

# Pass our custom representation model to BERTopic
topic_model = BERTopic(representation_model=representation_model)
```
There are a few things to take into account when creating your custom model:

- It needs to have the exact same parameter input: `topic_model`, `documents`, `c_tf_idf`, `topics`.
- Make sure that `updated_topics` has the exact same structure as `topics`:
```python
updated_topics = {
    "1": [("space", 0.9), ("nasa", 0.7)],
    "2": [("science", 0.66), ("article", 0.6)]
}
```
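As a concrete illustration, a hypothetical custom model that trims every topic to its five best keywords could look like this (the class name and behavior are made up for the example):

```python
from typing import List, Mapping, Tuple

from bertopic.representation._base import BaseRepresentation


class TopFiveKeywords(BaseRepresentation):
    """Hypothetical example: keep only the five best keywords per topic."""

    def extract_topics(self, topic_model, documents, c_tf_idf, topics
                       ) -> Mapping[str, List[Tuple[str, float]]]:
        # Preserve the structure of `topics`: {topic_id: [(keyword, score), ...]}
        return {topic_id: keywords[:5] for topic_id, keywords in topics.items()}
```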
Tip
You can change the `__init__` however you want; it does not influence the underlying structure. This also means that you can save data/embeddings/representations/sentiment in your custom representation model.