2. Dimensionality Reduction

An important aspect of BERTopic is the dimensionality reduction of the input embeddings. As embeddings are often high in dimensionality, clustering becomes difficult due to the curse of dimensionality.

A solution is to reduce the dimensionality of the embeddings to a workable dimensional space (e.g., 5) for clustering algorithms to work with. UMAP is used as a default in BERTopic since it can capture both the local and global structure of the high-dimensional space in lower dimensions. However, there are other solutions out there, such as PCA, that users might be interested in trying out. Since BERTopic assumes some independence between its steps, we can use any other dimensionality reduction algorithm. The image below illustrates this modularity:

(Figure: the modular BERTopic pipeline, in which the dimensionality reduction step can be swapped out)

As a result, the umap_model parameter in BERTopic now allows for a variety of dimensionality reduction models. To do so, the model should have the following methods:

  • .fit(X)
    • A function that can be used to fit the model
  • .transform(X)
    • A function that transforms the input embeddings to a lower-dimensional space

In other words, it should have the following structure:

class DimensionalityReduction:
    def fit(self, X):
        # Learn anything needed from the embeddings (a no-op in this skeleton)
        return self

    def transform(self, X):
        # Return a lower-dimensional representation of X (identity in this skeleton)
        return X
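
As a quick illustration, here is a hypothetical reducer that follows this structure by simply keeping the first n_components dimensions of each embedding. The class and its behavior are made up purely for demonstration:

import numpy as np
from bertopic import BERTopic

class TruncateDimensions:
    """Hypothetical reducer that keeps only the first `n_components` dimensions."""

    def __init__(self, n_components=5):
        self.n_components = n_components

    def fit(self, X):
        # Nothing to learn; the slicing happens in .transform()
        return self

    def transform(self, X):
        return np.asarray(X)[:, :self.n_components]

topic_model = BERTopic(umap_model=TruncateDimensions(n_components=5))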

In this section, we will go through several examples of dimensionality reduction techniques and how they can be implemented.

UMAP

As a default, BERTopic uses UMAP to perform its dimensionality reduction. To use a UMAP model with custom parameters, we simply define it and pass it to BERTopic:

from bertopic import BERTopic
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
topic_model = BERTopic(umap_model=umap_model)

Here, you can set any UMAP parameters you like and optimize them for the best performance based on whatever validation metrics you are using.
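
For example, a minimal fitting sketch could look as follows, where docs is assumed to be a list of documents:

# `docs` is assumed to be a list of strings; fit_transform assigns a topic to each document
topics, probs = topic_model.fit_transform(docs)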

PCA

Although UMAP works quite well in BERTopic and is typically advised, you might want to use PCA instead, as it can be faster to train and run inference with. To use PCA, we can simply import it from sklearn and pass it to the umap_model parameter:

from bertopic import BERTopic
from sklearn.decomposition import PCA

dim_model = PCA(n_components=5)
topic_model = BERTopic(umap_model=dim_model)

As a small note, PCA and k-Means have worked quite well in my experiments and might be an interesting combination to use instead of UMAP and HDBSCAN.
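
A minimal sketch of that combination, assuming k-Means can be passed through the hdbscan_model parameter (the clustering counterpart to umap_model), could look like this:

from bertopic import BERTopic
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# PCA for dimensionality reduction, k-Means for clustering
# The number of clusters is an arbitrary example value
dim_model = PCA(n_components=5)
cluster_model = KMeans(n_clusters=50)

topic_model = BERTopic(umap_model=dim_model, hdbscan_model=cluster_model)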

Note

As you might have noticed, the dim_model is passed to umap_model which might be a bit confusing considering you are not passing a UMAP model. For now, the name of the parameter is kept the same to adhere to the current state of the API. Changing the name could lead to deprecation issues, which I want to prevent as much as possible.

Truncated SVD

Besides PCA, sklearn offers a number of other dimensionality reduction techniques that you can use. Here, we will demonstrate Truncated SVD, but any model can be used as long as it has both a .fit() and a .transform() method:

from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD

dim_model = TruncatedSVD(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
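
Since anything that exposes .fit() and .transform() qualifies, you could even chain several sklearn steps together. The example below is only a sketch; it assumes an sklearn Pipeline, which exposes both methods, is accepted wherever a single reducer is:

from bertopic import BERTopic
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Chain Truncated SVD with L2 normalization of the reduced embeddings
dim_model = Pipeline([
    ("svd", TruncatedSVD(n_components=5)),
    ("norm", Normalizer(copy=False)),
])
topic_model = BERTopic(umap_model=dim_model)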

cuML UMAP

Although the original UMAP implementation is an amazing piece of work, it may have difficulty handling large amounts of data. Instead, we can use cuML to speed up UMAP through GPU acceleration:

from bertopic import BERTopic
from cuml.manifold import UMAP

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
topic_model = BERTopic(umap_model=umap_model)

Note

If you want to install cuML together with BERTopic using Google Colab, you can run the following code:

!pip install bertopic
!pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install --upgrade cupy-cuda11x -f https://pip.cupy.dev/aarch64

Skip dimensionality reduction

Although BERTopic applies dimensionality reduction by default in its pipeline, this is a step you might want to skip. For that, we use an "empty" model that simply returns the data passed to it:

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

# Fit BERTopic without actually performing any dimensionality reduction
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)

In other words, we go from this pipeline:


SBERT (Embeddings) → UMAP (Dimensionality reduction) → HDBSCAN (Clustering) → c-TF-IDF (Topic representation)


To the following pipeline:


SBERT (Embeddings) → HDBSCAN (Clustering) → c-TF-IDF (Topic representation)
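
A rough usage sketch of this reduced pipeline, assuming docs is a list of documents and sentence-transformers is installed, could look like this:

from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from sentence_transformers import SentenceTransformer

# Pre-compute the document embeddings; `docs` is assumed to be a list of strings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=False)

# The empty model passes the embeddings through untouched,
# so clustering runs directly on the full-dimensional embeddings
empty_dimensionality_model = BaseDimensionalityReduction()
topic_model = BERTopic(umap_model=empty_dimensionality_model)

topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)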