
LangChain

Bases: BaseRepresentation

Using chains in langchain to generate topic labels.

The classic example uses langchain.chains.question_answering.load_qa_chain. This returns a chain that takes a list of documents and a question as input.

You can also use Runnables such as those composed using the LangChain Expression Language.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `chain` | | The langchain chain or Runnable with a `batch` method. Input keys must be `input_documents` and `question`. Output key must be `output_text`. | required |
| `prompt` | `str` | The prompt to be used in the model. If no prompt is given, `self.default_prompt_` is used instead. NOTE: Use `"[KEYWORDS]"` in the prompt to decide where the keywords need to be inserted; keywords won't be included unless indicated. Unlike other representation models, LangChain does not use the `"[DOCUMENTS]"` tag to insert documents into the prompt, as the `load_qa_chain` function formats the representative documents within the prompt. | `None` |
| `nr_docs` | `int` | The number of documents to pass to LangChain. | `4` |
| `diversity` | `float` | The diversity of documents to pass to LangChain. Accepts values between 0 and 1; higher values pass more diverse documents, whereas lower values pass more similar documents. | `None` |
| `doc_length` | `int` | The maximum length of each document. If a document is longer, it will be truncated. If `None`, the entire document is passed. | `None` |
| `tokenizer` | `Union[str, Callable]` | The tokenizer used to split the document into segments when counting its length. If `'char'`, the document is split into characters, which are counted to adhere to `doc_length`. If `'whitespace'`, the document is split into words separated by whitespace, which are counted and truncated depending on `doc_length`. If `'vectorizer'`, the internal `CountVectorizer` is used to tokenize the document; these tokens are counted, truncated depending on `doc_length`, and decoded with whitespace. If a callable, that callable is used to tokenize the document; these tokens are counted and truncated depending on `doc_length`. | `None` |
| `chain_config` | | The configuration for the langchain chain. Can be used to set options like `max_concurrency` to avoid rate-limiting errors. | `None` |
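The sketch below shows how these parameters might be combined in a single call; the values are illustrative only, and `chain` is assumed to be built as shown under Usage below.

```python
from bertopic.representation import LangChain

# Illustrative values only; `chain` is assumed to be created as in the Usage section
representation_model = LangChain(
    chain,
    nr_docs=8,                            # pass the 8 most representative documents per topic
    diversity=0.2,                        # sample slightly more diverse representative documents
    doc_length=100,                       # truncate each document to at most 100 tokens
    tokenizer="whitespace",               # count length in whitespace-separated words
    chain_config={"max_concurrency": 4},  # limit parallel chain calls to avoid rate limits
)
```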

Usage:

To use this, you will need to install the langchain package first. Additionally, you will need an underlying LLM to support LangChain, like openai:

```bash
pip install langchain
pip install openai
```

Then, you can create your chain as follows:

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
```

Finally, you can pass the chain to BERTopic as follows:

```python
from bertopic import BERTopic
from bertopic.representation import LangChain

# Create your representation model
representation_model = LangChain(chain)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
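From here, the model is fitted as usual; a minimal sketch, assuming `docs` is your list of input documents:

```python
topics, probs = topic_model.fit_transform(docs)
```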

You can also use a custom prompt:

prompt = "What are these documents about? Please give a single label."
representation_model = LangChain(chain, prompt=prompt)
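Because keywords are only inserted where the `"[KEYWORDS]"` tag appears, a prompt that leverages them could look like the following sketch (the wording is illustrative):

```python
keyword_prompt = "What are these documents about? The keywords [KEYWORDS] describe them. Give a single short label."
representation_model = LangChain(chain, prompt=keyword_prompt)
```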

You can also use a Runnable instead of a chain. The example below uses the LangChain Expression Language:

```python
from bertopic.representation import LangChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatAnthropic
from langchain.schema.document import Document
from langchain.schema.runnable import RunnablePassthrough
from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer

prompt = ...
llm = ...

# We will construct a special privacy-preserving chain using Microsoft Presidio

pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])

chain = (
    {
        "input_documents": (
            lambda inp: [
                Document(
                    page_content=pii_handler.anonymize(
                        d.page_content,
                        language="en",
                    ),
                )
                for d in inp["input_documents"]
            ]
        ),
        "question": RunnablePassthrough(),
    }
    | load_qa_chain(llm, chain_type="stuff")
    | (lambda output: {"output_text": pii_handler.deanonymize(output["output_text"])})
)

representation_model = LangChain(chain, prompt=prompt)
```
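If you run into rate limits, `chain_config` is forwarded to the chain's `batch` call as its `config` argument; a minimal sketch, assuming the chain defined above:

```python
representation_model = LangChain(
    chain,
    prompt=prompt,
    chain_config={"max_concurrency": 2},  # label at most 2 topics in parallel
)
```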
Source code in bertopic\representation\_langchain.py
class LangChain(BaseRepresentation):
    """ Using chains in langchain to generate topic labels.

    The classic example uses `langchain.chains.question_answering.load_qa_chain`.
    This returns a chain that takes a list of documents and a question as input.

    You can also use Runnables such as those composed using the LangChain Expression Language.

    Arguments:
        chain: The langchain chain or Runnable with a `batch` method.
               Input keys must be `input_documents` and `question`.
               Output key must be `output_text`.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                 NOTE: Use `"[KEYWORDS]"` in the prompt
                 to decide where the keywords need to be
                 inserted. Keywords won't be included unless
                 indicated. Unlike other representation models,
                 Langchain does not use the `"[DOCUMENTS]"` tag
                 to insert documents into the prompt. The load_qa_chain function
                 formats the representative documents within the prompt.
        nr_docs: The number of documents to pass to LangChain
        diversity: The diversity of documents to pass to LangChain.
                   Accepts values between 0 and 1. Higher
                   values result in passing more diverse documents,
                   whereas lower values pass more similar documents.
        doc_length: The maximum length of each document. If a document is longer,
                    it will be truncated. If None, the entire document is passed.
        tokenizer: The tokenizer used to split the document into segments
                   when counting the length of a document.
                       * If tokenizer is 'char', then the document is split up 
                         into characters which are counted to adhere to `doc_length`
                       * If tokenizer is 'whitespace', the document is split up
                         into words separated by whitespaces. These words are counted
                         and truncated depending on `doc_length`
                       * If tokenizer is 'vectorizer', then the internal CountVectorizer
                         is used to tokenize the document. These tokens are counted
                         and truncated depending on `doc_length`. They are decoded with
                         whitespaces.
                       * If tokenizer is a callable, then that callable is used to tokenize
                         the document. These tokens are counted and truncated depending
                         on `doc_length`
        chain_config: The configuration for the langchain chain. Can be used to set options
                      like max_concurrency to avoid rate limiting errors.
    Usage:

    To use this, you will need to install the langchain package first.
    Additionally, you will need an underlying LLM to support langchain,
    like openai:

    `pip install langchain`
    `pip install openai`

    Then, you can create your chain as follows:

    ```python
    from langchain.chains.question_answering import load_qa_chain
    from langchain.llms import OpenAI
    chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
    ```

    Finally, you can pass the chain to BERTopic as follows:

    ```python
    from bertopic import BERTopic
    from bertopic.representation import LangChain

    # Create your representation model
    representation_model = LangChain(chain)

    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```

    You can also use a custom prompt:

    ```python
    prompt = "What are these documents about? Please give a single label."
    representation_model = LangChain(chain, prompt=prompt)
    ```

    You can also use a Runnable instead of a chain.
    The example below uses the LangChain Expression Language:

    ```python
    from bertopic.representation import LangChain
    from langchain.chains.question_answering import load_qa_chain
    from langchain.chat_models import ChatAnthropic
    from langchain.schema.document import Document
    from langchain.schema.runnable import RunnablePassthrough
    from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer

    prompt = ...
    llm = ...

    # We will construct a special privacy-preserving chain using Microsoft Presidio

    pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])

    chain = (
        {
            "input_documents": (
                lambda inp: [
                    Document(
                        page_content=pii_handler.anonymize(
                            d.page_content,
                            language="en",
                        ),
                    )
                    for d in inp["input_documents"]
                ]
            ),
            "question": RunnablePassthrough(),
        }
        | load_qa_chain(llm, chain_type="stuff")
        | (lambda output: {"output_text": pii_handler.deanonymize(output["output_text"])})
    )

    representation_model = LangChain(chain, prompt=prompt)
    ```
    """
    def __init__(self,
                 chain,
                 prompt: str = None,
                 nr_docs: int = 4,
                 diversity: float = None,
                 doc_length: int = None,
                 tokenizer: Union[str, Callable] = None,
                 chain_config = None,
                 ):
        self.chain = chain
        self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
        self.default_prompt_ = DEFAULT_PROMPT
        self.chain_config = chain_config
        self.nr_docs = nr_docs
        self.diversity = diversity
        self.doc_length = doc_length
        self.tokenizer = tokenizer

    def extract_topics(self,
                       topic_model,
                       documents: pd.DataFrame,
                       c_tf_idf: csr_matrix,
                       topics: Mapping[str, List[Tuple[str, float]]]
                       ) -> Mapping[str, List[Tuple[str, int]]]:
        """ Extract topics

        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF

        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top `nr_docs` representative documents per topic
        repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
            c_tf_idf=c_tf_idf,
            documents=documents,
            topics=topics,
            nr_samples=500,
            nr_repr_docs=self.nr_docs,
            diversity=self.diversity
        )

        # Generate label using langchain's batch functionality
        chain_docs: List[List[Document]] = [
            [
                Document(
                    page_content=truncate_document(
                        topic_model,
                        self.doc_length,
                        self.tokenizer,
                        doc
                    )
                )
                for doc in docs
            ]
            for docs in repr_docs_mappings.values()
        ]

        # `self.chain` must take `input_documents` and `question` as input keys
        # Use a custom prompt that leverages keywords, using the tag: [KEYWORDS]
        if "[KEYWORDS]" in self.prompt:
            prompts = []
            for topic in topics:
                keywords = list(zip(*topics[topic]))[0]
                prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
                prompts.append(prompt)

            inputs = [
                {"input_documents": docs, "question": prompt}
                for docs, prompt in zip(chain_docs, prompts)
            ]

        else:
            inputs = [
                {"input_documents": docs, "question": self.prompt}
                for docs in chain_docs
            ]

        # `self.chain` must return a dict with an `output_text` key
        # same output key as the `StuffDocumentsChain` returned by `load_qa_chain`
        outputs = self.chain.batch(inputs=inputs, config=self.chain_config)
        labels = [output["output_text"].strip() for output in outputs]

        updated_topics = {
            topic: [(label, 1)] + [("", 0) for _ in range(9)]
            for topic, label in zip(repr_docs_mappings.keys(), labels)
        }

        return updated_topics

extract_topics(topic_model, documents, c_tf_idf, topics)

Extract topics

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `topic_model` | | A BERTopic model | required |
| `documents` | `DataFrame` | All input documents | required |
| `c_tf_idf` | `csr_matrix` | The topic c-TF-IDF representation | required |
| `topics` | `Mapping[str, List[Tuple[str, float]]]` | The candidate topics as calculated with c-TF-IDF | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `updated_topics` | `Mapping[str, List[Tuple[str, int]]]` | Updated topic representations |
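For reference, each value in the returned mapping holds the generated label at weight 1, padded with nine empty placeholders so its shape matches BERTopic's usual top-10 keyword representation; a sketch of one entry (topic id and label are illustrative):

```python
# One entry of `updated_topics` for a hypothetical topic 0 labeled "Sports"
{0: [("Sports", 1), ("", 0), ("", 0), ("", 0), ("", 0),
     ("", 0), ("", 0), ("", 0), ("", 0), ("", 0)]}
```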

Source code in bertopic\representation\_langchain.py
def extract_topics(self,
                   topic_model,
                   documents: pd.DataFrame,
                   c_tf_idf: csr_matrix,
                   topics: Mapping[str, List[Tuple[str, float]]]
                   ) -> Mapping[str, List[Tuple[str, int]]]:
    """ Extract topics

    Arguments:
        topic_model: A BERTopic model
        documents: All input documents
        c_tf_idf: The topic c-TF-IDF representation
        topics: The candidate topics as calculated with c-TF-IDF

    Returns:
        updated_topics: Updated topic representations
    """
    # Extract the top `nr_docs` representative documents per topic
    repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
        c_tf_idf=c_tf_idf,
        documents=documents,
        topics=topics,
        nr_samples=500,
        nr_repr_docs=self.nr_docs,
        diversity=self.diversity
    )

    # Generate label using langchain's batch functionality
    chain_docs: List[List[Document]] = [
        [
            Document(
                page_content=truncate_document(
                    topic_model,
                    self.doc_length,
                    self.tokenizer,
                    doc
                )
            )
            for doc in docs
        ]
        for docs in repr_docs_mappings.values()
    ]

    # `self.chain` must take `input_documents` and `question` as input keys
    # Use a custom prompt that leverages keywords, using the tag: [KEYWORDS]
    if "[KEYWORDS]" in self.prompt:
        prompts = []
        for topic in topics:
            keywords = list(zip(*topics[topic]))[0]
            prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
            prompts.append(prompt)

        inputs = [
            {"input_documents": docs, "question": prompt}
            for docs, prompt in zip(chain_docs, prompts)
        ]

    else:
        inputs = [
            {"input_documents": docs, "question": self.prompt}
            for docs in chain_docs
        ]

    # `self.chain` must return a dict with an `output_text` key
    # same output key as the `StuffDocumentsChain` returned by `load_qa_chain`
    outputs = self.chain.batch(inputs=inputs, config=self.chain_config)
    labels = [output["output_text"].strip() for output in outputs]

    updated_topics = {
        topic: [(label, 1)] + [("", 0) for _ in range(9)]
        for topic, label in zip(repr_docs_mappings.keys(), labels)
    }

    return updated_topics