OnlineCountVectorizer

Bases: CountVectorizer

An online variant of the CountVectorizer with updating vocabulary.

At each .partial_fit, its vocabulary is updated based on any out-of-vocabulary (OOV) words it might find. Then, .update_bow can be used to track and update the Bag-of-Words representation. These functions are separated so that the vectorizer can be used iteratively without updating the Bag-of-Words representation, which can speed up the fitting process. However, the .update_bow function is used in BERTopic to track changes in the topic representations and allow for decay.

This class inherits its parameters and attributes from
sklearn.feature_extraction.text.CountVectorizer.

Parameters:

    decay (float, default None):
        A value between [0, 1] indicating the percentage by which the
        frequencies in the previous bag-of-words matrix are decreased at
        each iteration. For example, a value of .1 decreases the frequencies
        in the bag-of-words matrix by 10% at each iteration.

    delete_min_df (float, default None):
        Delete words from the vocabulary at each iteration if they fall
        below this minimum frequency. This keeps the resulting bag-of-words
        matrix small so that it does not explode in size as the vocabulary
        grows. If decay is None, then this equals min_df.

    **kwargs (default {}):
        Set of parameters inherited from
        sklearn.feature_extraction.text.CountVectorizer. In practice, this
        means that you can still use parameters from the original
        CountVectorizer, like stop_words and ngram_range.
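To make the decay behavior concrete, here is a minimal sketch of the update rule, using a toy dense array rather than the sparse matrix the class keeps internally:

```python
import numpy as np

decay = 0.1
previous_bow = np.array([[10, 4, 0]])  # counts accumulated so far
new_counts = np.array([[2, 0, 1]])     # counts from the current batch

# Each `.update_bow` call scales the old counts by (1 - decay)
# before adding the new ones:
updated_bow = previous_bow * (1 - decay) + new_counts
# 10*0.9 + 2 = 11.0, 4*0.9 + 0 = 3.6, 0*0.9 + 1 = 1.0
```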

Attributes:

    X_ (scipy.sparse.csr_matrix):
        The Bag-of-Words representation

Examples:

from bertopic.vectorizers import OnlineCountVectorizer
vectorizer = OnlineCountVectorizer(stop_words="english")

for index, doc in enumerate(my_docs):  # each `doc` is a batch (list) of documents
    vectorizer.partial_fit(doc)

    # Update and clean the bow every 100 iterations:
    if index % 100 == 0:
        X = vectorizer.update_bow(doc)

To use the model in BERTopic:

from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

vectorizer_model = OnlineCountVectorizer(stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)
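For online topic modeling, the vectorizer is typically paired with other components that support incremental learning. A minimal sketch, assuming a BERTopic version that supports online learning via `.partial_fit`, and using scikit-learn's `IncrementalPCA` and `MiniBatchKMeans` as incremental stand-ins for the dimensionality-reduction and clustering steps (`doc_chunks` is a placeholder for an iterable of lists of documents):

```python
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

# Incremental counterparts for dimensionality reduction and clustering
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# `doc_chunks` is assumed: an iterable of lists of documents
for docs in doc_chunks:
    topic_model.partial_fit(docs)
```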
References

Adapted from: https://github.com/idoshlomo/online_vectorizers

Source code in bertopic/vectorizers/_online_cv.py
class OnlineCountVectorizer(CountVectorizer):
    """ An online variant of the CountVectorizer with updating vocabulary.

    At each `.partial_fit`, its vocabulary is updated based on any OOV words
    it might find. Then, `.update_bow` can be used to track and update
    the Bag-of-Words representation. These functions are separated so that
    the vectorizer can be used iteratively without updating the Bag-of-Words
    representation, which can speed up the fitting process. However, the
    `.update_bow` function is used in BERTopic to track changes in the
    topic representations and allow for decay.

    This class inherits its parameters and attributes from:
        `sklearn.feature_extraction.text.CountVectorizer`

    Arguments:
        decay: A value between [0, 1] indicating the percentage by which the
               frequencies in the previous bag-of-words matrix are decreased
               at each iteration. For example, a value of `.1` decreases the
               frequencies in the bag-of-words matrix by 10% at each iteration.
        delete_min_df: Delete words from the vocabulary at each iteration
                       if they fall below this minimum frequency.
                       This keeps the resulting bag-of-words matrix small
                       so that it does not explode in size as the
                       vocabulary grows. If `decay` is None then this equals `min_df`.
        **kwargs: Set of parameters inherited from:
                  `sklearn.feature_extraction.text.CountVectorizer`
                  In practice, this means that you can still use parameters
                  from the original CountVectorizer, like `stop_words` and
                  `ngram_range`.

    Attributes:
        X_ (scipy.sparse.csr_matrix): The Bag-of-Words representation

    Examples:

    ```python
    from bertopic.vectorizers import OnlineCountVectorizer
    vectorizer = OnlineCountVectorizer(stop_words="english")

    for index, doc in enumerate(my_docs):  # each `doc` is a batch (list) of documents
        vectorizer.partial_fit(doc)

        # Update and clean the bow every 100 iterations:
        if index % 100 == 0:
            X = vectorizer.update_bow(doc)
    ```

    To use the model in BERTopic:

    ```python
    from bertopic import BERTopic
    from bertopic.vectorizers import OnlineCountVectorizer

    vectorizer_model = OnlineCountVectorizer(stop_words="english")
    topic_model = BERTopic(vectorizer_model=vectorizer_model)
    ```

    References:
        Adapted from: https://github.com/idoshlomo/online_vectorizers
    """
    def __init__(self,
                 decay: float = None,
                 delete_min_df: float = None,
                 **kwargs):
        self.decay = decay
        self.delete_min_df = delete_min_df
        super(OnlineCountVectorizer, self).__init__(**kwargs)

    def partial_fit(self, raw_documents: List[str]) -> "OnlineCountVectorizer":
        """ Perform a partial fit and update vocabulary with OOV tokens

        Arguments:
            raw_documents: A list of documents
        """
        if not hasattr(self, 'vocabulary_'):
            return self.fit(raw_documents)

        analyzer = self.build_analyzer()
        analyzed_documents = [analyzer(doc) for doc in raw_documents]
        new_tokens = set(chain.from_iterable(analyzed_documents))
        oov_tokens = new_tokens.difference(set(self.vocabulary_.keys()))

        if oov_tokens:
            max_index = max(self.vocabulary_.values())
            oov_vocabulary = dict(zip(oov_tokens, list(range(max_index + 1, max_index + 1 + len(oov_tokens), 1))))
            self.vocabulary_.update(oov_vocabulary)

        return self

    def update_bow(self, raw_documents: List[str]) -> csr_matrix:
        """ Create or update the bag-of-words matrix

        Update the bag-of-words matrix by adding the newly transformed
        documents. This may add empty columns if new words are found and/or
        add empty rows if new topics are found.

        During this process, the previous bag-of-words matrix might be
        decayed if `self.decay` has been set during init. Similarly, words
        that do not exceed `self.delete_min_df` are removed from its
        vocabulary and bag-of-words matrix.

        Arguments:
            raw_documents: A list of documents

        Returns:
            X_: Bag-of-words matrix
        """
        if hasattr(self, "X_"):
            X = self.transform(raw_documents)

            # Add empty columns if new words are found
            columns = csr_matrix((self.X_.shape[0], X.shape[1] - self.X_.shape[1]), dtype=int)
            self.X_ = sparse.hstack([self.X_, columns])

            # Add empty rows if new topics are found
            rows = csr_matrix((X.shape[0] - self.X_.shape[0], self.X_.shape[1]), dtype=int)
            self.X_ = sparse.vstack([self.X_, rows])

            # Decay of BoW matrix
            if self.decay is not None:
                self.X_ = self.X_ * (1 - self.decay)

            self.X_ += X
        else:
            self.X_ = self.transform(raw_documents)

        if self.delete_min_df is not None:
            self._clean_bow()

        return self.X_

    def _clean_bow(self) -> None:
        """ Remove words that do not exceed `self.delete_min_df` """
        # Only keep words with a minimum frequency
        indices = np.where(self.X_.sum(0) >= self.delete_min_df)[1]
        indices_dict = {index: index for index in indices}
        self.X_ = self.X_[:, indices]

        # Update vocabulary with new words
        new_vocab = {}
        vocabulary_dict = {v: k for k, v in self.vocabulary_.items()}
        for i, index in enumerate(indices):
            if indices_dict.get(index) is not None:
                new_vocab[vocabulary_dict[index]] = i

        self.vocabulary_ = new_vocab

partial_fit(raw_documents)

Perform a partial fit and update vocabulary with OOV tokens

Parameters:

    raw_documents (List[str], required):
        A list of documents
Source code in bertopic/vectorizers/_online_cv.py
def partial_fit(self, raw_documents: List[str]) -> "OnlineCountVectorizer":
    """ Perform a partial fit and update vocabulary with OOV tokens

    Arguments:
        raw_documents: A list of documents
    """
    if not hasattr(self, 'vocabulary_'):
        return self.fit(raw_documents)

    analyzer = self.build_analyzer()
    analyzed_documents = [analyzer(doc) for doc in raw_documents]
    new_tokens = set(chain.from_iterable(analyzed_documents))
    oov_tokens = new_tokens.difference(set(self.vocabulary_.keys()))

    if oov_tokens:
        max_index = max(self.vocabulary_.values())
        oov_vocabulary = dict(zip(oov_tokens, list(range(max_index + 1, max_index + 1 + len(oov_tokens), 1))))
        self.vocabulary_.update(oov_vocabulary)

    return self
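A minimal sketch of how the vocabulary grows across calls (the documents below are placeholders):

```python
from bertopic.vectorizers import OnlineCountVectorizer

vectorizer = OnlineCountVectorizer()

# The first call falls back to a regular fit and builds the vocabulary:
vectorizer.partial_fit(["the first batch of documents"])
print(len(vectorizer.vocabulary_))

# A later batch appends any unseen (OOV) tokens to the vocabulary:
vectorizer.partial_fit(["a second batch with new words"])
print(len(vectorizer.vocabulary_))
```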

update_bow(raw_documents)

Create or update the bag-of-words matrix

Update the bag-of-words matrix by adding the newly transformed documents. This may add empty columns if new words are found and/or add empty rows if new topics are found.

During this process, the previous bag-of-words matrix might be decayed if self.decay has been set during init. Similarly, words that do not exceed self.delete_min_df are removed from its vocabulary and bag-of-words matrix.

Parameters:

    raw_documents (List[str], required):
        A list of documents

Returns:

    X_ (csr_matrix):
        Bag-of-words matrix

Source code in bertopic/vectorizers/_online_cv.py
def update_bow(self, raw_documents: List[str]) -> csr_matrix:
    """ Create or update the bag-of-words matrix

    Update the bag-of-words matrix by adding the newly transformed
    documents. This may add empty columns if new words are found and/or
    add empty rows if new topics are found.

    During this process, the previous bag-of-words matrix might be
    decayed if `self.decay` has been set during init. Similarly, words
    that do not exceed `self.delete_min_df` are removed from its
    vocabulary and bag-of-words matrix.

    Arguments:
        raw_documents: A list of documents

    Returns:
        X_: Bag-of-words matrix
    """
    if hasattr(self, "X_"):
        X = self.transform(raw_documents)

        # Add empty columns if new words are found
        columns = csr_matrix((self.X_.shape[0], X.shape[1] - self.X_.shape[1]), dtype=int)
        self.X_ = sparse.hstack([self.X_, columns])

        # Add empty rows if new topics are found
        rows = csr_matrix((X.shape[0] - self.X_.shape[0], self.X_.shape[1]), dtype=int)
        self.X_ = sparse.vstack([self.X_, rows])

        # Decay of BoW matrix
        if self.decay is not None:
            self.X_ = self.X_ * (1 - self.decay)

        self.X_ += X
    else:
        self.X_ = self.transform(raw_documents)

    if self.delete_min_df is not None:
        self._clean_bow()

    return self.X_
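A minimal end-to-end sketch of `update_bow` together with `delete_min_df`, using toy documents for illustration:

```python
from bertopic.vectorizers import OnlineCountVectorizer

vectorizer = OnlineCountVectorizer(decay=0.1, delete_min_df=2)

docs = ["the cat sat", "the dog ran"]
vectorizer.partial_fit(docs)
X = vectorizer.update_bow(docs)

# Columns whose summed frequency falls below delete_min_df=2 are pruned,
# so only "the" (frequency 2) survives in the vocabulary and BoW matrix:
print(X.shape)                 # (2, 1)
print(vectorizer.vocabulary_)  # {'the': 0}
```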