Backends
BaseEmbedder
The Base Embedder used for creating embedding models.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
embedding_model | | The main embedding model to be used for extracting document and word embeddings | None |
word_embedding_model | | The embedding model used for extracting word embeddings only. If this model is selected, then the `embedding_model` is only used for creating document embeddings. | None |
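A custom backend can be built by subclassing `BaseEmbedder` and overriding `embed` so that it returns an (n, m) embedding matrix as described below. The sketch is illustrative: the wrapped sentence-transformers model is only an example of any model that maps texts to vectors.

```python
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

from bertopic.backend import BaseEmbedder


class CustomEmbedder(BaseEmbedder):
    """Minimal custom backend wrapping any model that maps texts to vectors."""

    def __init__(self, embedding_model):
        super().__init__()
        self.embedding_model = embedding_model

    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        # Return an (n, m) matrix: one m-dimensional embedding per document/word
        return self.embedding_model.encode(documents, show_progress_bar=verbose)


custom_embedder = CustomEmbedder(SentenceTransformer("all-MiniLM-L6-v2"))
embeddings = custom_embedder.embed(["a first document", "a second document"])
```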
Source code in bertopic\backend\_base.py
embed(documents, verbose=False)
Embed a list of n documents/words into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
documents | List[str] | A list of documents or words to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Document/words embeddings with shape (n, m) with `n` documents/words that each have an embedding size of `m` |
Source code in bertopic\backend\_base.py
embed_documents(document, verbose=False)
Embed a list of n documents into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
document | List[str] | A list of documents to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Document embeddings with shape (n, m) with `n` documents that each have an embedding size of `m` |
Source code in bertopic\backend\_base.py
embed_words(words, verbose=False)
Embed a list of n words into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
words | List[str] | A list of words to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Word embeddings with shape (n, m) with `n` words that each have an embedding size of `m` |
Source code in bertopic\backend\_base.py
CohereBackend
Bases: BaseEmbedder

Cohere Embedding Model.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | | A `cohere` client. | required |
embedding_model | str | A Cohere model. Default is "large". For an overview of models see: https://docs.cohere.ai/docs/generation-card | 'large' |
delay_in_seconds | float | If a `batch_size` is given, this sets the delay in seconds between consecutive batches. | None |
batch_size | int | The size of each batch. | None |
embed_kwargs | Mapping[str, Any] | Kwargs passed to the Cohere client's `embed` call, for example `input_type`. | {} |
Examples:
import cohere
from bertopic.backend import CohereBackend
client = cohere.Client("APIKEY")
cohere_model = CohereBackend(client)
If you want to specify `input_type`:
cohere_model = CohereBackend(
client,
embedding_model="embed-english-v3.0",
embed_kwargs={"input_type": "clustering"}
)
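If you run into rate limits, the `batch_size` and `delay_in_seconds` parameters described above can be combined; the model name and values in this sketch are only illustrative:

```python
import cohere
from bertopic.backend import CohereBackend

client = cohere.Client("APIKEY")

# Embed documents in batches of 64 and wait one second between batches
cohere_model = CohereBackend(
    client,
    embedding_model="embed-english-v3.0",
    batch_size=64,
    delay_in_seconds=1.0,
)
```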
Source code in bertopic\backend\_cohere.py
embed(documents, verbose=False)
Embed a list of n documents/words into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
documents | List[str] | A list of documents or words to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Document/words embeddings with shape (n, m) with `n` documents/words that each have an embedding size of `m` |
Source code in bertopic\backend\_cohere.py
Model2VecBackend
Bases: BaseEmbedder

Model2Vec embedding model.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
embedding_model | Union[str, StaticModel] | Either a model2vec model or a string pointing to a model2vec model | required |
distill | bool | Indicates whether to distill a sentence-transformers compatible model. The distillation will happen during fitting of the topic model. NOTE: Only works if `embedding_model` is a string. | False |
distill_kwargs | dict | Keyword arguments to pass to the distillation process of `model2vec`. | {} |
distill_vectorizer | str | A CountVectorizer used for creating a custom vocabulary based on the same documents used for topic modeling. NOTE: If "vocabulary" is passed in `distill_kwargs`, this vectorizer is ignored. | None |
Examples: To create a model, you can load in a string pointing to a model2vec model:
from bertopic.backend import Model2VecBackend
sentence_model = Model2VecBackend("minishlab/potion-base-8M")
or you can instantiate a model yourself:
from bertopic.backend import Model2VecBackend
from model2vec import StaticModel
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")
sentence_model = Model2VecBackend(embedding_model)
If you want to distill a sentence-transformers model with the vocabulary of the documents, run the following:
from bertopic.backend import Model2VecBackend
sentence_model = Model2VecBackend("sentence-transformers/all-MiniLM-L6-v2", distill=True)
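If you want to steer the distillation itself, the `distill_kwargs` and `distill_vectorizer` parameters described above can be passed along. This is only a sketch: it assumes `pca_dims` is a valid model2vec distillation option and that a CountVectorizer instance can be passed as `distill_vectorizer`, as the parameter description suggests:

```python
from sklearn.feature_extraction.text import CountVectorizer

from bertopic.backend import Model2VecBackend

sentence_model = Model2VecBackend(
    "sentence-transformers/all-MiniLM-L6-v2",
    distill=True,
    # Assumed model2vec distillation option: reduce the output dimensionality
    distill_kwargs={"pca_dims": 256},
    # Vocabulary for the distillation is built from the topic-modeling documents
    distill_vectorizer=CountVectorizer(ngram_range=(1, 2)),
)
```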
Source code in bertopic\backend\_model2vec.py
embed(documents, verbose=False)
Embed a list of n documents/words into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
documents | List[str] | A list of documents or words to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Document/words embeddings with shape (n, m) with `n` documents/words that each have an embedding size of `m` |
Source code in bertopic\backend\_model2vec.py
MultiModalBackend
Bases: BaseEmbedder

Multimodal backend using Sentence-transformers.

The sentence-transformers embedding model used for generating word, document, and image embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
embedding_model | Union[str, SentenceTransformer] | A sentence-transformers embedding model that can either embed both images and text or only text. If it only embeds text, then `image_model` needs to be used to embed the images. | required |
image_model | Union[str, SentenceTransformer] | A sentence-transformers embedding model that is used to embed only images. | None |
batch_size | int | The size of the image batches to pass to the embedding model. | 32 |
Examples: To create a model, you can load in a string pointing to a sentence-transformers model:
from bertopic.backend import MultiModalBackend
sentence_model = MultiModalBackend("clip-ViT-B-32")
or you can instantiate a model yourself:
from bertopic.backend import MultiModalBackend
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("clip-ViT-B-32")
sentence_model = MultiModalBackend(embedding_model)
Source code in bertopic\backend\_multimodal.py
embed(documents, images=None, verbose=False)
Embed a list of n documents/words or images into an n-dimensional matrix of embeddings.

Either documents, images, or both can be provided. If both are provided, then the embeddings are averaged.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
documents | List[str] | A list of documents or words to be embedded | required |
images | List[str] | A list of image paths to be embedded | None |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Document/words embeddings with shape (n, m) with `n` documents/words that each have an embedding size of `m` |
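A short usage sketch of the averaging behaviour described above; the image paths are placeholders:

```python
from bertopic.backend import MultiModalBackend

model = MultiModalBackend("clip-ViT-B-32", batch_size=32)

docs = ["a photo of a cat", "a photo of a dog"]
images = ["cat.jpg", "dog.jpg"]  # hypothetical image paths

# Text-only embeddings
doc_embeddings = model.embed(docs)

# Text and image embeddings, averaged into a single (n, m) matrix
joint_embeddings = model.embed(docs, images=images)
```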
Source code in bertopic\backend\_multimodal.py
embed_documents(documents, verbose=False)
Embed a list of n documents/words into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
documents | List[str] | A list of documents or words to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Document embeddings with shape (n, m) with `n` documents that each have an embedding size of `m` |
Source code in bertopic\backend\_multimodal.py
embed_words(words, verbose=False)
Embed a list of n words into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
words | List[str] | A list of words to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Word embeddings with shape (n, m) with `n` words that each have an embedding size of `m` |
Source code in bertopic\backend\_multimodal.py
OpenAIBackend
Bases: BaseEmbedder

OpenAI Embedding Model.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
client | OpenAI | An `openai.OpenAI` client. | required |
embedding_model | str | An OpenAI model. Default is "text-embedding-ada-002". For an overview of models see: https://platform.openai.com/docs/models/embeddings | 'text-embedding-ada-002' |
delay_in_seconds | float | If a `batch_size` is given, this sets the delay in seconds between consecutive batches. | None |
batch_size | int | The size of each batch. | None |
generator_kwargs | Mapping[str, Any] | Kwargs passed to the OpenAI client when generating the embeddings. | {} |
Examples:
import openai
from bertopic.backend import OpenAIBackend
client = openai.OpenAI(api_key="sk-...")
openai_embedder = OpenAIBackend(client, "text-embedding-ada-002")
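As with the Cohere backend, `batch_size` and `delay_in_seconds` can be used to stay within rate limits; the values in this sketch are only illustrative:

```python
import openai
from bertopic.backend import OpenAIBackend

client = openai.OpenAI(api_key="sk-...")

# Embed documents in batches of 100 and wait two seconds between batches
openai_embedder = OpenAIBackend(
    client,
    "text-embedding-ada-002",
    batch_size=100,
    delay_in_seconds=2,
)
```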
Source code in bertopic\backend\_openai.py
embed(documents, verbose=False)
Embed a list of n documents/words into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
documents | List[str] | A list of documents or words to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Document/words embeddings with shape (n, m) with `n` documents/words that each have an embedding size of `m` |
Source code in bertopic\backend\_openai.py
WordDocEmbedder
Bases: BaseEmbedder
Combine a document- and word-level embedder.
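A sketch of combining two sentence-transformers models, one for documents and one for words; it assumes the constructor takes the same `embedding_model` and `word_embedding_model` arguments described for `BaseEmbedder`:

```python
from sentence_transformers import SentenceTransformer

from bertopic.backend import WordDocEmbedder

# One model for document embeddings, a lighter one for word embeddings
doc_model = SentenceTransformer("all-mpnet-base-v2")
word_model = SentenceTransformer("all-MiniLM-L6-v2")

word_doc_embedder = WordDocEmbedder(
    embedding_model=doc_model,
    word_embedding_model=word_model,
)
```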
Source code in bertopic\backend\_word_doc.py
embed_documents(document, verbose=False)
Embed a list of n documents into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
document | List[str] | A list of documents to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Document embeddings with shape (n, m) with `n` documents that each have an embedding size of `m` |
Source code in bertopic\backend\_word_doc.py
embed_words(words, verbose=False)
Embed a list of n words into an n-dimensional matrix of embeddings.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
words | List[str] | A list of words to be embedded | required |
verbose | bool | Controls the verbosity of the process | False |

Returns:

Type | Description |
---|---|
ndarray | Word embeddings with shape (n, m) with `n` words that each have an embedding size of `m` |
Source code in bertopic\backend\_word_doc.py