## KeyBERTInspired
Bases: BaseRepresentation
Source code in bertopic\representation\_keybert.py
### __init__(top_n_words=10, nr_repr_docs=5, nr_samples=500, nr_candidate_words=100, random_state=42)
Use a KeyBERT-like model to fine-tune the topic representations.

The algorithm follows KeyBERT but applies several optimizations to speed up inference. The steps are as follows:

1. Extract the top n representative documents per topic. To do so, we randomly sample a number of candidate documents per cluster, controlled by the `nr_samples` parameter, and calculate their c-TF-IDF representations. The candidates closest, by cosine similarity, to the topic c-TF-IDF representation become the representative documents; their number is controlled by the `nr_repr_docs` parameter.
2. Extract the top candidate words per topic based on their c-TF-IDF representation; the number of candidates is controlled by the `nr_candidate_words` parameter.
3. Extract embeddings for the candidate words and the representative documents, and create a topic embedding by averaging the representative document embeddings.
4. Select the most similar words for each topic by calculating the cosine similarity between the word embeddings and the topic embedding.
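The final two steps above can be sketched with plain NumPy. This is a minimal illustration, not BERTopic's actual implementation; the function name and toy 2-D embeddings are invented for the example:

```python
import numpy as np

def rank_candidate_words(word_embeddings, repr_doc_embeddings, words, top_n_words=3):
    """Average the representative-document embeddings into a topic embedding,
    then rank candidate words by cosine similarity to that embedding."""
    topic_embedding = repr_doc_embeddings.mean(axis=0)
    # Cosine similarity between each candidate word and the topic embedding
    sims = word_embeddings @ topic_embedding / (
        np.linalg.norm(word_embeddings, axis=1) * np.linalg.norm(topic_embedding)
    )
    order = np.argsort(sims)[::-1][:top_n_words]
    return [(words[i], float(sims[i])) for i in order]

# Toy example: two representative "documents" and three candidate words
# embedded in a 2-D space where "cat" and "kitten" align with the topic
docs = np.array([[1.0, 0.0], [0.9, 0.1]])
word_vecs = np.array([[1.0, 0.0], [0.7, 0.3], [0.0, 1.0]])
top = rank_candidate_words(word_vecs, docs, ["cat", "kitten", "car"], top_n_words=2)
```

In the real model the embeddings come from the topic model's embedding backend rather than being hand-crafted, but the ranking logic is the same.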
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `top_n_words` | `int` | The top n words to extract per topic. | `10` |
| `nr_repr_docs` | `int` | The number of representative documents to extract per cluster. | `5` |
| `nr_samples` | `int` | The number of candidate documents to extract per cluster. | `500` |
| `nr_candidate_words` | `int` | The number of candidate words per cluster. | `100` |
| `random_state` | `int` | The random state for randomly sampling candidate documents. | `42` |
Usage:

```python
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

# Create your representation model
representation_model = KeyBERTInspired()

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
Source code in bertopic\representation\_keybert.py
### extract_topics(topic_model, documents, c_tf_idf, topics)

Extract topics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `topic_model` | | A BERTopic model. | required |
| `documents` | `DataFrame` | All input documents. | required |
| `c_tf_idf` | `csr_matrix` | The topic c-TF-IDF representation. | required |
| `topics` | `Mapping[str, List[Tuple[str, float]]]` | The candidate topics as calculated with c-TF-IDF. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `updated_topics` | `Mapping[str, List[Tuple[str, float]]]` | Updated topic representations. |
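To make the `topics` argument and the `updated_topics` return value concrete, here is a hypothetical example of the mapping shape (topic id mapped to a ranked list of `(word, score)` pairs); the topic ids, words, and scores are illustrative only:

```python
# Hypothetical candidate topics as produced by c-TF-IDF: each key is a
# topic id, each value a ranked list of (word, score) pairs.
topics = {
    "0": [("learning", 0.41), ("neural", 0.32), ("network", 0.25)],
    "1": [("soccer", 0.51), ("league", 0.33), ("match", 0.21)],
}

# extract_topics returns a mapping of the same shape, with the word
# rankings fine-tuned by the KeyBERT-inspired procedure.
```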
Source code in bertopic\representation\_keybert.py