Dimensionality Reduction
One important aspect of BERTopic is dimensionality reduction of the embeddings. Typically, embeddings have at least 384 dimensions, and many clustering algorithms have difficulty operating in such a high-dimensional space. A solution is to reduce the dimensionality of the embeddings to a workable size (e.g., 5 dimensions) for clustering algorithms to work with.
In BERTopic, we typically use UMAP, as it preserves both the local and global structure of the high-dimensional space in lower dimensions. However, there are other solutions out there, such as PCA, that users might be interested in trying out.
Developments in artificial intelligence move quickly, and whatever might be state-of-the-art now could be superseded a year or even months later. Therefore, BERTopic allows you to use any dimensionality reduction algorithm you would like.
As a result, the `umap_model` parameter in BERTopic now accepts a variety of dimensionality reduction models. To be used, a class should implement the following methods:
* `.fit(X)`
    * A method that fits the model on the input embeddings
* `.transform(X)`
    * A method that transforms the input to a lower-dimensional space
In other words, it should have the following structure:
```python
class DimensionalityReduction:
    """A no-op model: the embeddings are passed through unchanged."""

    def fit(self, X):
        return self

    def transform(self, X):
        return X
```
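A useful side effect of this structure is that a no-op model like the one above lets you skip dimensionality reduction entirely, assuming your clustering algorithm can cope with the original embedding size. A minimal sketch:

```python
from bertopic import BERTopic

# Passing the no-op model defined above leaves the embeddings untouched,
# effectively skipping the dimensionality reduction step.
empty_dimensionality_model = DimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)
```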
In this tutorial, I will show you how to use several dimensionality reduction algorithms in BERTopic.
UMAP
As a default, BERTopic uses UMAP to perform its dimensionality reduction. To use a UMAP model with custom parameters, we simply define it and pass it to BERTopic:
```python
from bertopic import BERTopic
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
topic_model = BERTopic(umap_model=umap_model)
```
Here, we can set any of UMAP's parameters to optimize performance based on whatever validation metrics you are using.
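For example, UMAP is stochastic by default, so topics can differ slightly between runs. If reproducibility matters, one option is to fix its `random_state` (note that this disables some of UMAP's parallelism and can slow down training):

```python
from umap import UMAP

# A fixed random_state makes UMAP deterministic, so repeated runs on the
# same data produce the same reduced embeddings and, thus, the same topics.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42)
```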
PCA
Although UMAP works quite well in BERTopic and is typically advised, you might want to use PCA instead. It can be faster to train and to run inference with. To use PCA, we simply import it from `sklearn` and pass it to the `umap_model` parameter:
```python
from bertopic import BERTopic
from sklearn.decomposition import PCA

dim_model = PCA(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
```
As a small note, PCA and k-Means have worked quite well in my experiments and might be an interesting combination to use instead of UMAP and HDBSCAN; a sketch of that setup follows below.
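The sketch passes a k-Means model through BERTopic's `hdbscan_model` parameter, which, like `umap_model`, accepts any model exposing the expected methods. The number of clusters is an arbitrary value for illustration:

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# PCA for dimensionality reduction, k-Means for clustering. Unlike HDBSCAN,
# k-Means assigns every document to a topic, so no outlier topic (-1) is created.
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=50)  # arbitrary example value
topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model)
```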
Note
As you might have noticed, the `dim_model` is passed to `umap_model`, which might be a bit confusing considering you are not passing a UMAP model. For now, the name of the parameter is kept the same to adhere to the current state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.
Truncated SVD
Like PCA, `sklearn` contains many more dimensionality reduction techniques that you can use. Here, we will demonstrate `TruncatedSVD`, but any model can be used as long as it has both a `.fit()` and a `.transform()` method:
```python
from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD

dim_model = TruncatedSVD(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
```
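Whichever model you choose, fitting works the same way. A small usage sketch, here using the 20 Newsgroups dataset from `sklearn` as an example corpus:

```python
from sklearn.datasets import fetch_20newsgroups

# Fit the topic model; the custom dimensionality reduction model is applied
# to the embeddings before clustering.
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topics, probs = topic_model.fit_transform(docs)
```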