Vectorizers
Bases: CountVectorizer
An online variant of the CountVectorizer with an updating vocabulary.
At each .partial_fit, its vocabulary is updated based on any out-of-vocabulary (OOV) words it finds. Then, .update_bow can be used to track and update the Bag-of-Words representation. These functions are separated so that the vectorizer can be used in iteration without updating the Bag-of-Words representation, which can speed up the fitting process. However, the .update_bow function is used in BERTopic to track changes in the topic representations and to allow for decay.
This class inherits its parameters and attributes from sklearn.feature_extraction.text.CountVectorizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
decay | float | A value in [0, 1] giving the fraction by which the frequencies of the previous bag-of-words should be decreased. For example, a value of | None |
delete_min_df | float | Delete words at each iteration from its vocabulary that are below a minimum frequency. This will keep the resulting bag-of-words matrix small such that it does not explode in size with increasing vocabulary. If | None |
**kwargs | | Set of parameters inherited from: | {} |
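To make the effect of decay and delete_min_df more concrete, here is a minimal sketch in plain Python. It is not the library's implementation: it assumes that decay multiplies the previous counts by (1 - decay) before the new counts are added, and that words whose total count falls below the threshold are dropped. Plain dicts stand in for the sparse matrix, and the helper names `decay_and_add` and `drop_rare` are hypothetical.

```python
from typing import Dict


def decay_and_add(previous: Dict[str, float], new: Dict[str, float], decay: float) -> Dict[str, float]:
    """Shrink the previous counts by `decay`, then add the new counts (assumed behaviour)."""
    updated = {word: count * (1 - decay) for word, count in previous.items()}
    for word, count in new.items():
        updated[word] = updated.get(word, 0.0) + count
    return updated


def drop_rare(counts: Dict[str, float], min_count: float) -> Dict[str, float]:
    """Keep only the words whose count is at least `min_count` (assumed behaviour)."""
    return {word: count for word, count in counts.items() if count >= min_count}


previous = {"topic": 10.0, "model": 4.0, "rare": 1.0}
new = {"topic": 2.0, "vector": 3.0}

counts = decay_and_add(previous, new, decay=0.5)  # old counts are halved first
counts = drop_rare(counts, min_count=1.0)         # "rare" decays to 0.5 and is removed
print(counts)
```

With a decay of 0.5, words that stop appearing lose half their weight per update, so they eventually fall under the minimum frequency and leave the vocabulary.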
Attributes:
Name | Type | Description |
---|---|---|
X_ | scipy.sparse.csr_matrix | The Bag-of-Words representation |
Examples:
from bertopic.vectorizers import OnlineCountVectorizer
vectorizer = OnlineCountVectorizer(stop_words="english")
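for index, doc in enumerate(my_docs):
    # partial_fit expects a list of documents, so wrap the single document
    vectorizer.partial_fit([doc])
    # Update and clean the bow every 100 iterations:
    if index % 100 == 0:
        # update_bow requires the documents to transform
        X = vectorizer.update_bow([doc])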
To use the model in BERTopic:
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
vectorizer_model = OnlineCountVectorizer(stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)
References
Adapted from: https://github.com/idoshlomo/online_vectorizers
Source code in bertopic\vectorizers\_online_cv.py
partial_fit(raw_documents)
Perform a partial fit and update vocabulary with OOV tokens.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_documents | List[str] | A list of documents | required |
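The vocabulary update described for partial_fit can be sketched as follows. This is an illustration under assumptions, not the library's code: documents are taken as already tokenized, new tokens receive the next free indices, and the helper name `extend_vocabulary` is hypothetical. Tokens are sorted only to make the result deterministic.

```python
from typing import Dict, Iterable, List


def extend_vocabulary(vocabulary: Dict[str, int], tokenized_docs: Iterable[List[str]]) -> Dict[str, int]:
    """Add out-of-vocabulary tokens to `vocabulary`, giving each the next free index."""
    seen = {token for doc in tokenized_docs for token in doc}
    oov_tokens = sorted(seen.difference(vocabulary))  # tokens not yet in the vocabulary
    next_index = max(vocabulary.values(), default=-1) + 1
    for offset, token in enumerate(oov_tokens):
        vocabulary[token] = next_index + offset
    return vocabulary


vocabulary = {"online": 0, "topic": 1}
extend_vocabulary(vocabulary, [["topic", "model"], ["vector", "online"]])
print(vocabulary)
```

Existing tokens keep their indices, so columns of a previously built bag-of-words matrix stay valid after the vocabulary grows.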
update_bow(raw_documents)
Create or update the bag-of-words matrix.
Update the bag-of-words matrix by adding the newly transformed documents. This may add empty columns if new words are found and/or add empty rows if new topics are found.
During this process, the previous bag-of-words matrix might be decayed if self.decay has been set during init. Similarly, words that do not exceed self.delete_min_df are removed from the vocabulary and the bag-of-words matrix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_documents | List[str] | A list of documents | required |
Returns:
Name | Type | Description |
---|---|---|
X_ | csr_matrix | Bag-of-words matrix |
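The step of adding empty columns and rows before merging new counts can be illustrated with a small sketch. It uses nested lists instead of a sparse matrix and assumes the new matrix is at least as large as the previous one in both dimensions. The helpers `pad_to_shape` and `add_counts` are hypothetical names used for illustration only.

```python
from typing import List

Matrix = List[List[int]]


def pad_to_shape(matrix: Matrix, n_rows: int, n_cols: int) -> Matrix:
    """Return a copy of `matrix` padded with zeros up to (n_rows, n_cols)."""
    # Empty columns for newly found words
    padded = [row + [0] * (n_cols - len(row)) for row in matrix]
    # Empty rows for newly found topics
    padded += [[0] * n_cols for _ in range(n_rows - len(padded))]
    return padded


def add_counts(previous: Matrix, new: Matrix) -> Matrix:
    """Pad `previous` to the shape of `new`, then add the two element-wise."""
    n_rows, n_cols = len(new), len(new[0])
    previous = pad_to_shape(previous, n_rows, n_cols)
    return [
        [p + n for p, n in zip(prev_row, new_row)]
        for prev_row, new_row in zip(previous, new)
    ]


previous = [[2, 1], [0, 3]]               # 2 topics x 2 words
new = [[1, 0, 4], [0, 1, 0], [5, 0, 1]]   # 3 topics x 3 words
updated = add_counts(previous, new)
print(updated)
```

Padding first means the addition never has to special-case new words or topics: they simply start from zero counts.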