Representations
BaseRepresentation
Bases: BaseEstimator
The base representation model for fine-tuning topic representations.
Source code in bertopic\representation\_base.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
BERTopic automatically passes the arguments (topic_model, documents, c_tf_idf, topics) to every representation model that inherits from this class. A representation model therefore only has access to the topic information carried by those arguments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | The BERTopic model that is fitted until topic representations are calculated. | required |
documents | DataFrame | A dataframe with columns "Document" and "Topic" that contains all documents with each corresponding topic. | required |
c_tf_idf | csr_matrix | A c-TF-IDF representation that is typically identical to | required |
topics | Mapping[str, List[Tuple[str, float]]] | A dictionary with topic (key) and tuple of word and weight (value) as calculated by c-TF-IDF. These are the default topics that are returned if no representation model is used. | required |
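As a rough illustration of this interface, the sketch below defines a hypothetical representation model, `TopWordsUppercase` (not part of BERTopic), that only uses the `topics` argument. It assumes `BaseRepresentation` can be imported from `bertopic.representation._base`, the module path shown on this page, and falls back to a plain `object` base so the sketch still runs without BERTopic installed.

```python
try:  # Use the real base class when BERTopic is installed
    from bertopic.representation._base import BaseRepresentation
except ImportError:  # Fallback so the sketch still runs on its own
    BaseRepresentation = object


class TopWordsUppercase(BaseRepresentation):
    """Hypothetical model: keep the top-n c-TF-IDF words and uppercase them."""

    def __init__(self, top_n_words: int = 3):
        self.top_n_words = top_n_words

    def extract_topics(self, topic_model, documents, c_tf_idf, topics):
        # Only `topics` is used here; the other arguments are ignored
        updated_topics = {}
        for topic, words in topics.items():
            kept = words[: self.top_n_words]
            updated_topics[topic] = [(word.upper(), weight) for word, weight in kept]
        return updated_topics


# Toy candidate topics in the documented (word, weight) format
candidate_topics = {"0": [("car", 0.5), ("engine", 0.4), ("wheel", 0.3), ("road", 0.2)]}
model = TopWordsUppercase(top_n_words=2)
updated = model.extract_topics(None, None, None, candidate_topics)
```

In real use, an instance like this would be passed as `representation_model` to BERTopic rather than called by hand.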
Source code in bertopic\representation\_base.py
Cohere
Bases: BaseRepresentation
Use the Cohere API to generate topic labels based on their generative model.
Find more about their models here: https://docs.cohere.ai/docs
Parameters:
Name | Type | Description | Default |
---|---|---|---|
client | | A | required |
model | str | Model to use within Cohere, defaults to | 'command-r' |
prompt | str | The prompt to be used in the model. If no prompt is given, | None |
system_prompt | str | The system prompt to be used in the model. If no system prompt is given, | None |
delay_in_seconds | float | The delay in seconds between consecutive prompts in order to prevent RateLimitErrors. | None |
nr_docs | int | The number of documents to pass to Cohere if a prompt with the | 4 |
diversity | float | The diversity of documents to pass to Cohere. Accepts values between 0 and 1. Higher values pass more diverse documents, whereas lower values pass more similar documents. | None |
doc_length | int | The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed. | None |
tokenizer | Union[str, Callable] | The tokenizer used to split the document into segments, which are counted to determine the length of a document. * If tokenizer is 'char', then the document is split up into characters which are counted to adhere to | None |
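To make the interaction between `doc_length` and `tokenizer` more concrete, here is a minimal sketch of document truncation. The helper `truncate_document` is hypothetical and only covers the two cases visible in the table above: the `'char'` option and a callable tokenizer. Re-joining tokens with single spaces is an assumption made for illustration, not necessarily what the library does.

```python
from typing import Callable, Optional, Union


def truncate_document(
    document: str,
    doc_length: Optional[int] = None,
    tokenizer: Union[str, Callable[[str], list], None] = None,
) -> str:
    """Hypothetical helper: shorten `document` to at most `doc_length` units."""
    if doc_length is None:
        return document  # None means the entire document is passed
    if tokenizer == "char":
        return document[:doc_length]  # count individual characters
    if callable(tokenizer):
        # Assumption: the callable returns a list of string tokens,
        # which are re-joined with single spaces after truncation.
        return " ".join(tokenizer(document)[:doc_length])
    raise ValueError("This sketch only handles 'char' or a callable tokenizer")


text = "Topic models summarize large document collections"
by_chars = truncate_document(text, doc_length=12, tokenizer="char")
by_words = truncate_document(text, doc_length=3, tokenizer=str.split)
```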
Usage:
To use this, you will need to install cohere first:
pip install cohere
Then, get yourself an API key and use Cohere's API as follows:
import cohere
from bertopic.representation import Cohere
from bertopic import BERTopic
# Create your representation model
co = cohere.Client(my_api_key)
representation_model = Cohere(co)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
You can also use a custom prompt:
prompt = "I have the following documents: [DOCUMENTS]. What topic do they contain?"
representation_model = Cohere(co, prompt=prompt)
Source code in bertopic\representation\_cohere.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | Not used | required |
documents | DataFrame | Not used | required |
c_tf_idf | csr_matrix | Not used | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
Source code in bertopic\representation\_cohere.py
KeyBERTInspired
Bases: BaseRepresentation
Source code in bertopic\representation\_keybert.py
__init__(top_n_words=10, nr_repr_docs=5, nr_samples=500, nr_candidate_words=100, random_state=42)
Use a KeyBERT-like model to fine-tune the topic representations.
The algorithm follows KeyBERT but does some optimization in order to speed up inference.
The steps are as follows. First, we extract the top n representative documents per topic, where n is controlled by the nr_repr_docs parameter. To extract the representative documents, we randomly sample a number of candidate documents per cluster, which is controlled by the nr_samples parameter. Then, the top n representative documents are extracted by calculating the c-TF-IDF representation for the candidate documents and finding, through cosine similarity, which are closest to the topic c-TF-IDF representation. Next, the top n candidate words per topic are extracted based on their c-TF-IDF representation, which is controlled by the nr_candidate_words parameter.
Then, we extract the embeddings for words and representative documents and create topic embeddings by averaging the representative documents. Finally, the most similar words to each topic are extracted by calculating the cosine similarity between word and topic embeddings.
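The final two steps can be sketched in plain Python. The example below is a toy illustration, not the library's implementation: the embeddings are made-up 2-dimensional vectors, and `rank_words`, `mean_vector`, and `cosine` are hypothetical helper names. It averages the representative document embeddings into a topic embedding and ranks candidate words by cosine similarity to it.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_words(
    word_embeddings: Dict[str, List[float]],
    doc_embeddings: List[List[float]],
    top_n_words: int = 2,
) -> List[Tuple[str, float]]:
    # Topic embedding = average of the representative document embeddings
    topic_embedding = mean_vector(doc_embeddings)
    scored = [(word, cosine(vec, topic_embedding)) for word, vec in word_embeddings.items()]
    # Most similar words first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n_words]


# Toy data: two representative documents and three candidate words
doc_embeddings = [[1.0, 0.0], [0.8, 0.2]]
word_embeddings = {"engine": [1.0, 0.1], "wheel": [0.7, 0.3], "banana": [0.0, 1.0]}
top_words = rank_words(word_embeddings, doc_embeddings)
```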
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_n_words | int | The top n words to extract per topic. | 10 |
nr_repr_docs | int | The number of representative documents to extract per cluster. | 5 |
nr_samples | int | The number of candidate documents to extract per cluster. | 500 |
nr_candidate_words | int | The number of candidate words per cluster. | 100 |
random_state | int | The random state for randomly sampling candidate documents. | 42 |
Usage:
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
# Create your representation model
representation_model = KeyBERTInspired()
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
Source code in bertopic\representation\_keybert.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | A BERTopic model | required |
documents | DataFrame | All input documents | required |
c_tf_idf | csr_matrix | The topic c-TF-IDF representation | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
Source code in bertopic\representation\_keybert.py
LangChain
Bases: BaseRepresentation
Using chains in langchain to generate topic labels.
The classic example uses langchain.chains.question_answering.load_qa_chain.
This returns a chain that takes a list of documents and a question as input.
You can also use Runnables such as those composed using the LangChain Expression Language.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chain | | The langchain chain or Runnable with a | required |
prompt | str | The prompt to be used in the model. If no prompt is given, | None |
nr_docs | int | The number of documents to pass to LangChain | 4 |
diversity | float | The diversity of documents to pass to LangChain. Accepts values between 0 and 1. Higher values pass more diverse documents, whereas lower values pass more similar documents. | None |
doc_length | int | The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed. | None |
tokenizer | Union[str, Callable] | The tokenizer used to split the document into segments, which are counted to determine the length of a document. * If tokenizer is 'char', then the document is split up into characters which are counted to adhere to | None |
chain_config | | The configuration for the langchain chain. Can be used to set options like max_concurrency to avoid rate limiting errors. | None |
Usage:
To use this, you will need to install the langchain package first. Additionally, you will need an underlying LLM to support langchain, like openai:
pip install langchain
pip install openai
Then, you can create your chain as follows:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
Finally, you can pass the chain to BERTopic as follows:
from bertopic import BERTopic
from bertopic.representation import LangChain
# Create your representation model
representation_model = LangChain(chain)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
You can also use a custom prompt:
prompt = "What are these documents about? Please give a single label."
representation_model = LangChain(chain, prompt=prompt)
You can also use a Runnable instead of a chain. The example below uses the LangChain Expression Language:
from bertopic.representation import LangChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatAnthropic
from langchain.schema.document import Document
from langchain.schema.runnable import RunnablePassthrough
from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer
# Fill in your own prompt and LLM here
representation_prompt = ...
representation_llm = ...
# We will construct a special privacy-preserving chain using Microsoft Presidio
pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])
chain = (
    {
        # Anonymize every input document before it reaches the LLM
        "input_documents": (
            lambda inp: [
                Document(
                    page_content=pii_handler.anonymize(
                        d.page_content,
                        language="en",
                    ),
                )
                for d in inp["input_documents"]
            ]
        ),
        "question": RunnablePassthrough(),
    }
    | load_qa_chain(representation_llm, chain_type="stuff")
    # Restore the original names in the generated label
    | (lambda output: {"output_text": pii_handler.deanonymize(output["output_text"])})
)
representation_model = LangChain(chain, prompt=representation_prompt)
Source code in bertopic\representation\_langchain.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | A BERTopic model | required |
documents | DataFrame | All input documents | required |
c_tf_idf | csr_matrix | The topic c-TF-IDF representation | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, int]]] | Updated topic representations |
Source code in bertopic\representation\_langchain.py
LiteLLM
Bases: BaseRepresentation
Using the LiteLLM API to generate topic labels.
For an overview of models see:
https://docs.litellm.ai/docs/providers
Arguments:
    model: Model to use. Defaults to OpenAI's "gpt-3.5-turbo".
    generator_kwargs: Kwargs passed to `litellm.completion`.
    prompt: The prompt to be used in the model. If no prompt is given,
        `self.default_prompt_` is used instead.
        NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
        to decide where the keywords and documents need to be
        inserted.
    delay_in_seconds: The delay in seconds between consecutive prompts
        in order to prevent RateLimitErrors.
    exponential_backoff: Retry requests with a random exponential backoff.
        A short sleep is used when a rate limit error is hit,
        then the request is retried. The sleep length increases
        with each error, up to 10 unsuccessful requests.
        If True, overrides `delay_in_seconds`.
    nr_docs: The number of documents to pass to LiteLLM if a prompt
        with the `"[DOCUMENTS]"` tag is used.
    diversity: The diversity of documents to pass to LiteLLM.
        Accepts values between 0 and 1. Higher values pass
        more diverse documents, whereas lower values pass
        more similar documents.
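The role of the `"[KEYWORDS]"` and `"[DOCUMENTS]"` tags can be illustrated with a small sketch. The helper `fill_prompt` below is hypothetical, and the way keywords and documents are rendered (comma-separated keywords, one `- ` line per document) is an assumption for illustration; the library's exact formatting may differ.

```python
from typing import List


def fill_prompt(prompt: str, keywords: List[str], documents: List[str]) -> str:
    """Hypothetical helper: substitute the [KEYWORDS] and [DOCUMENTS] tags."""
    if "[KEYWORDS]" in prompt:
        prompt = prompt.replace("[KEYWORDS]", ", ".join(keywords))
    if "[DOCUMENTS]" in prompt:
        # Assumption: each document is rendered as its own "- ..." line
        doc_block = "\n".join(f"- {doc}" for doc in documents)
        prompt = prompt.replace("[DOCUMENTS]", doc_block)
    return prompt


template = "Documents:\n[DOCUMENTS]\nKeywords: [KEYWORDS]\nTopic label:"
filled = fill_prompt(
    template,
    keywords=["engine", "wheel"],
    documents=["My car broke down.", "New tires fitted."],
)
```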
Usage:
To use this, you will need to install the litellm package first:
`pip install litellm`
Then, get yourself an API key of any provider (for instance OpenAI) and use it as follows:
```python
import os
from bertopic.representation import LiteLLM
from bertopic import BERTopic
# set ENV variables
os.environ["OPENAI_API_KEY"] = "your-openai-key"
# Create your representation model
representation_model = LiteLLM(model="gpt-3.5-turbo")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
You can also use a custom prompt:
```python
prompt = "I have the following documents: [DOCUMENTS] \nThese documents are about the following topic: '"
representation_model = LiteLLM(model="gpt", prompt=prompt)
```
Source code in bertopic\representation\_litellm.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | A BERTopic model | required |
documents | DataFrame | All input documents | required |
c_tf_idf | csr_matrix | The topic c-TF-IDF representation | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
Source code in bertopic\representation\_litellm.py
LlamaCPP
Bases: BaseRepresentation
A llama.cpp implementation to use as a representation model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | Union[str, Llama] | Either a string pointing towards a local LLM or a | required |
prompt | str | The prompt to be used in the model. If no prompt is given, | None |
system_prompt | str | The system prompt to be used in the model. If no system prompt is given, | None |
pipeline_kwargs | Mapping[str, Any] | Kwargs that you can pass to the | {} |
nr_docs | int | The number of documents to pass to the model if a prompt with the | 4 |
diversity | float | The diversity of documents to pass to the model. Accepts values between 0 and 1. Higher values pass more diverse documents, whereas lower values pass more similar documents. | None |
doc_length | int | The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed. | None |
tokenizer | Union[str, Callable] | The tokenizer used to split the document into segments, which are counted to determine the length of a document. * If tokenizer is 'char', then the document is split up into characters which are counted to adhere to | None |
Usage:
To use llama.cpp, first download the LLM:
wget https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q4_K_M.gguf
Then, we can use the model with BERTopic in just a couple of lines:
from bertopic import BERTopic
from bertopic.representation import LlamaCPP
# Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
representation_model = LlamaCPP("zephyr-7b-alpha.Q4_K_M.gguf")
# Create our BERTopic model
topic_model = BERTopic(representation_model=representation_model, verbose=True)
If you want to have more control over the LLM's parameters, you can run it like so:
from bertopic import BERTopic
from bertopic.representation import LlamaCPP
from llama_cpp import Llama
# Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
llm = Llama(model_path="zephyr-7b-alpha.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=4096, stop="Q:")
representation_model = LlamaCPP(llm)
# Create our BERTopic model
topic_model = BERTopic(representation_model=representation_model, verbose=True)
Source code in bertopic\representation\_llamacpp.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topic representations and return a single label.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | A BERTopic model | required |
documents | DataFrame | Not used | required |
c_tf_idf | csr_matrix | Not used | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
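Because a generative model returns a single label while the return type is a list of (word, weight) tuples per topic, the label has to be wrapped somehow. The sketch below shows one plausible way to do that. The helper `label_to_topic`, the weight of 1 for the label, and the padding to ten entries are all assumptions for illustration; this page does not state how the library shapes its output.

```python
from typing import Dict, List, Tuple


def label_to_topic(label: str, total_entries: int = 10) -> List[Tuple[str, float]]:
    """Hypothetical helper: wrap one generated label as (word, weight) pairs."""
    # Assumption: the label gets weight 1 and the rest is padded with empty entries
    padding = [("", 0.0)] * (total_entries - 1)
    return [(label.strip(), 1.0)] + padding


# Toy labels as an LLM might return them, keyed by topic
generated_labels = {"0": " Car maintenance ", "1": "Space exploration"}
updated_topics: Dict[str, List[Tuple[str, float]]] = {
    topic: label_to_topic(label) for topic, label in generated_labels.items()
}
```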
Source code in bertopic\representation\_llamacpp.py
MaximalMarginalRelevance
Bases: BaseRepresentation
Calculate Maximal Marginal Relevance (MMR) between candidate keywords and the document.
MMR weighs the similarity of each candidate keyword/keyphrase to the document against its similarity to the keywords and keyphrases that were already selected. This results in a selection of keywords that are relevant to the document while remaining diverse among themselves.
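The trade-off can be sketched as a greedy selection loop. The example below is a generic MMR sketch over precomputed toy similarities, not the library's code: `mmr_select` is a hypothetical name, and the scoring rule `(1 - diversity) * relevance - diversity * redundancy` is the common textbook form, assumed here for illustration.

```python
from typing import Dict, FrozenSet, List


def mmr_select(
    doc_similarity: Dict[str, float],
    word_similarity: Dict[FrozenSet[str], float],
    diversity: float = 0.1,
    top_n_words: int = 2,
) -> List[str]:
    """Greedy MMR over precomputed (toy) similarities."""
    candidates = list(doc_similarity)
    # Start with the keyword most similar to the document
    selected = [max(candidates, key=lambda word: doc_similarity[word])]
    candidates.remove(selected[0])

    while candidates and len(selected) < top_n_words:

        def mmr_score(word: str) -> float:
            # Redundancy = highest similarity to any already selected keyword
            redundancy = max(word_similarity[frozenset((word, s))] for s in selected)
            return (1 - diversity) * doc_similarity[word] - diversity * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy similarities: "cars" is relevant but nearly duplicates "car"
doc_similarity = {"car": 0.9, "cars": 0.88, "road": 0.6}
word_similarity = {
    frozenset(("car", "cars")): 0.95,
    frozenset(("car", "road")): 0.3,
    frozenset(("cars", "road")): 0.3,
}
low_diversity = mmr_select(doc_similarity, word_similarity, diversity=0.1)
high_diversity = mmr_select(doc_similarity, word_similarity, diversity=0.7)
```

With a low `diversity`, the near-duplicate is kept because relevance dominates; with a high value, the less similar keyword wins.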
Parameters:
Name | Type | Description | Default |
---|---|---|---|
diversity | float | How diverse the selected keywords/keyphrases are. Values range between 0 and 1, with 0 being not diverse at all and 1 being most diverse. | 0.1 |
top_n_words | int | The number of keywords/keyphrases to return | 10 |
Usage:
from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic
# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
Source code in bertopic\representation\_mmr.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topic representations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | The BERTopic model | required |
documents | DataFrame | Not used | required |
c_tf_idf | csr_matrix | Not used | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
Source code in bertopic\representation\_mmr.py
OpenAI
Bases: BaseRepresentation
Using the OpenAI API to generate topic labels based on one of their Completion or ChatCompletion models.
For an overview see: https://platform.openai.com/docs/models
Parameters:
Name | Type | Description | Default |
---|---|---|---|
client | | A | required |
model | str | Model to use within OpenAI, defaults to | 'gpt-4o-mini' |
generator_kwargs | Mapping[str, Any] | Kwargs passed to | {} |
prompt | str | The prompt to be used in the model. If no prompt is given, | None |
system_prompt | str | The system prompt to be used in the model. If no system prompt is given, | None |
delay_in_seconds | float | The delay in seconds between consecutive prompts in order to prevent RateLimitErrors. | None |
exponential_backoff | bool | Retry requests with a random exponential backoff. A short sleep is used when a rate limit error is hit, then the request is retried. The sleep length increases with each error, up to 10 unsuccessful requests. If True, overrides | False |
nr_docs | int | The number of documents to pass to OpenAI if a prompt with the | 4 |
diversity | float | The diversity of documents to pass to OpenAI. Accepts values between 0 and 1. Higher values pass more diverse documents, whereas lower values pass more similar documents. | None |
doc_length | int | The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed. | None |
tokenizer | Union[str, Callable] | The tokenizer used to split the document into segments, which are counted to determine the length of a document. * If tokenizer is 'char', then the document is split up into characters which are counted to adhere to | None |
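The retry behaviour described for `exponential_backoff` can be sketched generically. The code below is not the library's implementation: `call_with_backoff` and the stand-in `RateLimitError` class are hypothetical, and the initial delay and growth factor are assumed values. The `sleep` argument is injectable so the demo records delays instead of actually waiting.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def call_with_backoff(
    request: Callable[[], T],
    initial_delay: float = 1.0,
    max_retries: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `request` on RateLimitError, sleeping longer after each failure."""
    delay = initial_delay
    failures = 0
    while True:
        try:
            return request()
        except RateLimitError:
            failures += 1
            if failures >= max_retries:
                raise  # give up after `max_retries` unsuccessful requests
            # Grow the delay by a random factor between 2x and 4x
            delay *= 2 * (1 + random.random())
            sleep(delay)


# Demo: a fake request that is rate limited twice before succeeding
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("slow down")
    return "topic label"


waits: List[float] = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_request, sleep=waits.append)
```

The random factor spreads retries out so that parallel clients do not all retry at the same moment.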
Usage:
To use this, you will need to install the openai package first:
pip install openai
Then, get yourself an API key and use OpenAI's API as follows:
import openai
from bertopic.representation import OpenAI
from bertopic import BERTopic
# Create your representation model
client = openai.OpenAI(api_key=MY_API_KEY)
representation_model = OpenAI(client, delay_in_seconds=5)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
You can also use a custom prompt:
prompt = "I have the following documents: [DOCUMENTS] \nThese documents are about the following topic: '"
representation_model = OpenAI(client, prompt=prompt, delay_in_seconds=5)
To choose a model:
representation_model = OpenAI(client, model="gpt-4o-mini", delay_in_seconds=10)
Source code in bertopic\representation\_openai.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | A BERTopic model | required |
documents | DataFrame | All input documents | required |
c_tf_idf | csr_matrix | The topic c-TF-IDF representation | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
Source code in bertopic\representation\_openai.py
PartOfSpeech
Bases: BaseRepresentation
Extract Topic Keywords based on their Part-of-Speech.
DEFAULT_PATTERNS = [ [{'POS': 'ADJ'}, {'POS': 'NOUN'}], [{'POS': 'NOUN'}], [{'POS': 'ADJ'}] ]
From candidate topics, as extracted with c-TF-IDF, find documents that contain keywords found in the candidate topics. These candidate documents then serve as the representative set of documents from which the Spacy model can extract a set of candidate keywords for each topic.
These candidate keywords are first judged by whether they fall within the DEFAULT_PATTERNS or the user-defined pattern. Then, the resulting keywords are sorted by their respective c-TF-IDF values.
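The filter-then-sort idea can be sketched without spaCy. The example below is a simplified stand-in, not the library's implementation: tokens arrive already tagged as (text, POS) pairs, patterns are plain tuples of POS tags mirroring DEFAULT_PATTERNS, and `extract_keywords` is a hypothetical name. Dropping candidates that have no c-TF-IDF value is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

# POS sequences mirroring DEFAULT_PATTERNS above, written as plain tuples
PATTERNS = [("ADJ", "NOUN"), ("NOUN",), ("ADJ",)]


def extract_keywords(
    tagged_tokens: List[Tuple[str, str]],
    ctfidf_values: Dict[str, float],
    top_n_words: int = 10,
) -> List[str]:
    """Collect spans whose POS tags match a pattern, then sort by c-TF-IDF value."""
    candidates = set()
    for pattern in PATTERNS:
        size = len(pattern)
        # Slide a window of the pattern's length over the tokens
        for start in range(len(tagged_tokens) - size + 1):
            window = tagged_tokens[start : start + size]
            if tuple(pos for _, pos in window) == pattern:
                candidates.add(" ".join(text for text, _ in window))
    # Assumption: keep only candidates that have a c-TF-IDF value
    scored = [word for word in candidates if word in ctfidf_values]
    return sorted(scored, key=lambda word: ctfidf_values[word], reverse=True)[:top_n_words]


# Toy input: pre-tagged tokens and made-up c-TF-IDF values
tagged = [("the", "DET"), ("electric", "ADJ"), ("car", "NOUN"), ("drives", "VERB")]
values = {"electric car": 0.9, "car": 0.7, "electric": 0.2}
keywords = extract_keywords(tagged, values)
```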
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | Union[str, Language] | The spaCy model to use | 'en_core_web_sm' |
top_n_words | int | The top n words to extract | 10 |
pos_patterns | List[str] | Patterns for spaCy to use. See https://spacy.io/usage/rule-based-matching | None |
Usage:
from bertopic.representation import PartOfSpeech
from bertopic import BERTopic
# Create your representation model
representation_model = PartOfSpeech("en_core_web_sm")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
You can define custom POS patterns to be extracted:
pos_patterns = [
[{'POS': 'ADJ'}, {'POS': 'NOUN'}],
[{'POS': 'NOUN'}], [{'POS': 'ADJ'}]
]
representation_model = PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns)
Source code in bertopic\representation\_pos.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | A BERTopic model | required |
documents | DataFrame | All input documents | required |
c_tf_idf | csr_matrix | Not used | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
Source code in bertopic\representation\_pos.py
TextGeneration
Bases: BaseRepresentation
Text2Text or text generation with transformers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | Union[str, pipeline] | A transformers pipeline that should be initialized as "text-generation" for gpt-like models or "text2text-generation" for T5-like models. For example, | required |
prompt | str | The prompt to be used in the model. If no prompt is given, | None |
pipeline_kwargs | Mapping[str, Any] | Kwargs that you can pass to the transformers.pipeline when it is called. | {} |
random_state | int | A random state to be passed to | 42 |
nr_docs | int | The number of documents to pass to the model if a prompt with the | 4 |
diversity | float | The diversity of documents to pass to the model. Accepts values between 0 and 1. Higher values pass more diverse documents, whereas lower values pass more similar documents. | None |
doc_length | int | The maximum length of each document. If a document is longer, it will be truncated. If None, the entire document is passed. | None |
tokenizer | Union[str, Callable] | The tokenizer used to split the document into segments, which are counted to determine the length of a document. * If tokenizer is 'char', then the document is split up into characters which are counted to adhere to | None |
Usage:
To use a gpt-like model:
from transformers import pipeline
from bertopic.representation import TextGeneration
from bertopic import BERTopic
# Create your representation model
generator = pipeline('text-generation', model='gpt2')
representation_model = TextGeneration(generator)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
You can use a custom prompt and decide where the keywords should be inserted by using the [KEYWORDS] tag, or where the documents should be inserted by using the [DOCUMENTS] tag:
from transformers import pipeline
from bertopic.representation import TextGeneration
prompt = "I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"
# Create your representation model and pass it the custom prompt
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator, prompt=prompt)
Source code in bertopic\representation\_textgeneration.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topic representations and return a single label.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | A BERTopic model | required |
documents | DataFrame | Not used | required |
c_tf_idf | csr_matrix | Not used | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
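Because the method returns a single label per topic while the return type is a list of (word, weight) tuples, the label has to be wrapped in that structure. The sketch below shows one way this could look; the `label_to_topic` helper, the weight of 1.0, and the padding with empty entries are assumptions for illustration, and the topic identifiers are made up.

```python
from typing import List, Tuple


def label_to_topic(label: str, nr_words: int = 10) -> List[Tuple[str, float]]:
    """Hypothetical helper: wrap one generated label as a topic representation."""
    # Assumption: the label gets weight 1.0 and the remaining slots are empty.
    return [(label, 1.0)] + [("", 0.0)] * (nr_words - 1)


# Illustrative generated labels, keyed by topic identifier.
generated = {0: "space exploration", 1: "cycling"}
updated_topics = {topic: label_to_topic(label) for topic, label in generated.items()}
print(updated_topics[0][:2])
```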
Source code in bertopic\representation\_textgeneration.py
VisualRepresentation
Bases: BaseRepresentation
From a collection of representative documents, extract images to represent topics. These topics are represented by a collage of images.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
nr_repr_images | int | Number of representative images to extract | 9 |
nr_samples | int | The number of candidate documents to extract per cluster. | 500 |
image_height | Tuple[int, int] | The height of the resulting collage | 600 |
image_square | | Whether to resize each image in the collage to a square. This can be visually more appealing if the input images are all almost square. | required |
image_to_text_model | Union[str, Pipeline] | The model to caption images. | None |
batch_size | int | The number of images to pass to the | 32 |
Usage:
from bertopic.representation import VisualRepresentation
from bertopic import BERTopic
# The visual representation is typically not a core representation,
# so it is advised to pass it to BERTopic as an additional aspect.
# Aspects can be labeled with dictionaries as shown below:
representation_model = {
"Visual_Aspect": VisualRepresentation()
}
# Use the representation model in BERTopic as a separate aspect
topic_model = BERTopic(representation_model=representation_model)
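To give a feel for how nr_repr_images and image_height could relate in a collage, here is a purely illustrative layout calculation. The `collage_layout` function, the square grid, and the tile arithmetic are assumptions and do not describe how the library actually builds its collage.

```python
import math
from typing import List, Tuple


def collage_layout(nr_images: int = 9, image_height: int = 600) -> List[Tuple[int, int, int]]:
    """Hypothetical helper: return (x, y, tile_size) for each image in a square grid."""
    # Assumption: images are placed row by row in the smallest square grid that fits.
    per_side = math.ceil(math.sqrt(nr_images))
    tile = image_height // per_side
    return [
        ((i % per_side) * tile, (i // per_side) * tile, tile)
        for i in range(nr_images)
    ]


layout = collage_layout()
print(layout[0], layout[4], layout[8])
```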
Source code in bertopic\representation\_visual.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | A BERTopic model | required |
documents | DataFrame | All input documents | required |
c_tf_idf | csr_matrix | The topic c-TF-IDF representation | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
representative_images | Mapping[str, List[Tuple[str, float]]] | Representative images per topic |
Source code in bertopic\representation\_visual.py
image_to_text(documents, embeddings)
Convert images to text.
Source code in bertopic\representation\_visual.py
ZeroShotClassification
Bases: BaseRepresentation
Zero-shot Classification on topic keywords with candidate labels.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
candidate_topics | List[str] | A list of labels to assign to the topics if they exceed | required |
model | str | A transformers pipeline that should be initialized as "zero-shot-classification". For example, | 'facebook/bart-large-mnli' |
pipeline_kwargs | Mapping[str, Any] | Kwargs that you can pass to the transformers.pipeline when it is called. NOTE: Use | {} |
min_prob | float | The minimum probability to assign a candidate label to a topic | 0.8 |
Usage:
from bertopic.representation import ZeroShotClassification
from bertopic import BERTopic
# Create your representation model
candidate_topics = ["space and nasa", "bicycles", "sports"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
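The effect of min_prob can be sketched without loading a model. In the example below, `relabel` and `fake_classify` are hypothetical: the stub stands in for a zero-shot classification pipeline, and the comparison against the threshold (here `>=`) is an assumption about how the cut-off is applied.

```python
from typing import Callable, Dict, List, Tuple

Topic = List[Tuple[str, float]]


def fake_classify(text: str, labels: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a zero-shot pipeline: 0.9 on a word match, else 0.1."""
    return {
        label: 0.9 if any(word in text for word in label.split()) else 0.1
        for label in labels
    }


def relabel(
    topics: Dict[int, Topic],
    candidate_topics: List[str],
    classify: Callable[[str, List[str]], Dict[str, float]],
    min_prob: float = 0.8,
) -> Dict[int, Topic]:
    """Assign the best candidate label to a topic only if it is confident enough."""
    updated = {}
    for topic_id, words in topics.items():
        keywords = ", ".join(word for word, _ in words)
        scores = classify(keywords, candidate_topics)
        best_label = max(scores, key=scores.get)
        if scores[best_label] >= min_prob:  # assumption: inclusive threshold
            updated[topic_id] = [(best_label, scores[best_label])]
        else:
            # Below the threshold, keep the original c-TF-IDF keywords.
            updated[topic_id] = words
    return updated


topics = {0: [("nasa", 0.5), ("orbit", 0.3)], 1: [("recipe", 0.4), ("oven", 0.2)]}
candidates = ["space and nasa", "bicycles", "sports"]
updated = relabel(topics, candidates, fake_classify)
print(updated)
```

Topics whose best score stays below min_prob keep their keyword representation, which is why a strict threshold such as 0.8 leaves many topics unchanged.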
Source code in bertopic\representation\_zeroshot.py
extract_topics(topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_model | | Not used | required |
documents | DataFrame | Not used | required |
c_tf_idf | csr_matrix | Not used | required |
topics | Mapping[str, List[Tuple[str, float]]] | The candidate topics as calculated with c-TF-IDF | required |
Returns:
Name | Type | Description |
---|---|---|
updated_topics | Mapping[str, List[Tuple[str, float]]] | Updated topic representations |
Source code in bertopic\representation\_zeroshot.py