LangChain
Using chains in langchain to generate topic labels.

The classic example uses `langchain.chains.question_answering.load_qa_chain`. This returns a chain that takes a list of documents and a question as input.

You can also use Runnables such as those composed using the LangChain Expression Language.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `chain` | | The langchain chain or Runnable with a `batch` method. Input keys must be `input_documents` and `question`; the output key must be `output_text`. | *required* |
| `prompt` | `str` | The prompt to be used in the model. If no prompt is given, `self.default_prompt_` is used instead. Use `"[KEYWORDS]"` in the prompt to decide where the keywords need to be inserted; keywords won't be included unless indicated. Unlike other representation models, LangChain does not use the `"[DOCUMENTS]"` tag to insert documents into the prompt: the `load_qa_chain` function formats the representative documents within the prompt itself. | `None` |
| `nr_docs` | `int` | The number of documents to pass to LangChain. | `4` |
| `diversity` | `float` | The diversity of documents to pass to LangChain. Accepts values between 0 and 1. A higher value results in passing more diverse documents, whereas lower values pass more similar documents. | `None` |
| `doc_length` | `int` | The maximum length of each document. If a document is longer, it will be truncated. If `None`, the entire document is passed. | `None` |
| `tokenizer` | `Union[str, Callable]` | The tokenizer used to split the document into segments when counting its length. If `'char'`, the document is split into characters, which are counted to adhere to `doc_length`. If `'whitespace'`, the document is split into whitespace-separated words, which are counted and truncated depending on `doc_length`. If `'vectorizer'`, the internal CountVectorizer is used to tokenize the document; these tokens are counted, truncated depending on `doc_length`, and decoded with whitespaces. If a callable, that callable is used to tokenize the document; its tokens are counted and truncated depending on `doc_length`. | `None` |
| `chain_config` | | The configuration for the langchain chain. Can be used to set options like `max_concurrency` to avoid rate limiting errors. | `None` |
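For instance, long representative documents can be truncated before they are handed to the chain. A minimal sketch, assuming a `chain` has already been created as in the Usage section below; the specific values are illustrative:

```python
from bertopic.representation import LangChain

# Pass 8 diverse representative documents per topic, each truncated
# to 100 whitespace-separated words (values chosen for illustration)
representation_model = LangChain(
    chain,
    nr_docs=8,
    diversity=0.2,
    doc_length=100,
    tokenizer="whitespace",
)
```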
Usage:
To use this, you will need to install the langchain package first. Additionally, you will need an underlying LLM to support langchain, like openai:
```bash
pip install langchain
pip install openai
```
Then, you can create your chain as follows:
```python
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
```
Finally, you can pass the chain to BERTopic as follows:
```python
from bertopic.representation import LangChain

# Create your representation model
representation_model = LangChain(chain)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
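From here, fitting works as with any BERTopic model. A minimal sketch, where `docs` is assumed to be your list of input documents:

```python
# `docs` is a hypothetical list of strings, one per document;
# topic labels are generated by the LangChain chain during fitting
topics, probs = topic_model.fit_transform(docs)
```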
You can also use a custom prompt:
```python
prompt = "What are these documents about? Please give a single label."
representation_model = LangChain(chain, prompt=prompt)
```
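Since the `"[KEYWORDS]"` tag is supported (see the `prompt` parameter above), the prompt can also include each topic's c-TF-IDF keywords. A sketch with a hypothetical prompt:

```python
# [KEYWORDS] is replaced per topic with its comma-separated c-TF-IDF keywords;
# the representative documents themselves are inserted by load_qa_chain
prompt = "These documents contain the keywords: [KEYWORDS]. What are they about? Please give a single label."
representation_model = LangChain(chain, prompt=prompt)
```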
You can also use a Runnable instead of a chain. The example below uses the LangChain Expression Language:

```python
from bertopic.representation import LangChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatAnthropic
from langchain.schema.document import Document
from langchain.schema.runnable import RunnablePassthrough
from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer

prompt = ...
llm = ...

# We will construct a special privacy-preserving chain using Microsoft Presidio
pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])

chain = (
    {
        "input_documents": (
            lambda inp: [
                Document(
                    page_content=pii_handler.anonymize(
                        d.page_content,
                        language="en",
                    ),
                )
                for d in inp["input_documents"]
            ]
        ),
        "question": RunnablePassthrough(),
    }
    | load_qa_chain(llm, chain_type="stuff")
    | (lambda output: {"output_text": pii_handler.deanonymize(output["output_text"])})
)

representation_model = LangChain(chain, prompt=prompt)
```
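More generally, any Runnable works as long as it honors the contract described above: it must accept the `input_documents` and `question` keys, expose a `batch` method, and return a dict with an `output_text` key. A minimal sketch using `RunnableLambda`, where the labeling logic is a placeholder rather than a real LLM call:

```python
from langchain.schema.runnable import RunnableLambda

def placeholder_chain(inp):
    # inp["input_documents"] is a list of Documents, inp["question"] the prompt;
    # a real implementation would call an LLM here instead
    first_words = inp["input_documents"][0].page_content.split()[:3]
    return {"output_text": " ".join(first_words)}

chain = RunnableLambda(placeholder_chain)  # Runnables provide `batch` automatically
representation_model = LangChain(chain)
```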
Source code in `bertopic/representation/_langchain.py`

````python
class LangChain(BaseRepresentation):
    """Using chains in langchain to generate topic labels.

    The classic example uses `langchain.chains.question_answering.load_qa_chain`.
    This returns a chain that takes a list of documents and a question as input.

    You can also use Runnables such as those composed using the LangChain Expression Language.

    Arguments:
        chain: The langchain chain or Runnable with a `batch` method.
               Input keys must be `input_documents` and `question`.
               Output key must be `output_text`.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                NOTE: Use `"[KEYWORDS]"` in the prompt
                to decide where the keywords need to be
                inserted. Keywords won't be included unless
                indicated. Unlike other representation models,
                LangChain does not use the `"[DOCUMENTS]"` tag
                to insert documents into the prompt. The load_qa_chain function
                formats the representative documents within the prompt.
        nr_docs: The number of documents to pass to LangChain
        diversity: The diversity of documents to pass to LangChain.
                   Accepts values between 0 and 1. A higher
                   value results in passing more diverse documents
                   whereas lower values pass more similar documents.
        doc_length: The maximum length of each document. If a document is longer,
                    it will be truncated. If None, the entire document is passed.
        tokenizer: The tokenizer used to split the document into segments
                   when counting the length of a document.
                   * If tokenizer is 'char', then the document is split up
                     into characters which are counted to adhere to `doc_length`
                   * If tokenizer is 'whitespace', the document is split up
                     into words separated by whitespaces. These words are counted
                     and truncated depending on `doc_length`
                   * If tokenizer is 'vectorizer', then the internal CountVectorizer
                     is used to tokenize the document. These tokens are counted
                     and truncated depending on `doc_length`. They are decoded with
                     whitespaces.
                   * If tokenizer is a callable, then that callable is used to tokenize
                     the document. These tokens are counted and truncated depending
                     on `doc_length`
        chain_config: The configuration for the langchain chain. Can be used to set options
                      like max_concurrency to avoid rate limiting errors.

    Usage:

    To use this, you will need to install the langchain package first.
    Additionally, you will need an underlying LLM to support langchain,
    like openai:

    `pip install langchain`
    `pip install openai`

    Then, you can create your chain as follows:

    ```python
    from langchain.chains.question_answering import load_qa_chain
    from langchain.llms import OpenAI
    chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
    ```

    Finally, you can pass the chain to BERTopic as follows:

    ```python
    from bertopic.representation import LangChain

    # Create your representation model
    representation_model = LangChain(chain)

    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```

    You can also use a custom prompt:

    ```python
    prompt = "What are these documents about? Please give a single label."
    representation_model = LangChain(chain, prompt=prompt)
    ```

    You can also use a Runnable instead of a chain.
    The example below uses the LangChain Expression Language:

    ```python
    from bertopic.representation import LangChain
    from langchain.chains.question_answering import load_qa_chain
    from langchain.chat_models import ChatAnthropic
    from langchain.schema.document import Document
    from langchain.schema.runnable import RunnablePassthrough
    from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer

    prompt = ...
    llm = ...

    # We will construct a special privacy-preserving chain using Microsoft Presidio
    pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])

    chain = (
        {
            "input_documents": (
                lambda inp: [
                    Document(
                        page_content=pii_handler.anonymize(
                            d.page_content,
                            language="en",
                        ),
                    )
                    for d in inp["input_documents"]
                ]
            ),
            "question": RunnablePassthrough(),
        }
        | load_qa_chain(llm, chain_type="stuff")
        | (lambda output: {"output_text": pii_handler.deanonymize(output["output_text"])})
    )

    representation_model = LangChain(chain, prompt=prompt)
    ```
    """

    def __init__(
        self,
        chain,
        prompt: str = None,
        nr_docs: int = 4,
        diversity: float = None,
        doc_length: int = None,
        tokenizer: Union[str, Callable] = None,
        chain_config=None,
    ):
        self.chain = chain
        self.prompt = prompt if prompt is not None else DEFAULT_PROMPT
        self.default_prompt_ = DEFAULT_PROMPT
        self.chain_config = chain_config
        self.nr_docs = nr_docs
        self.diversity = diversity
        self.doc_length = doc_length
        self.tokenizer = tokenizer

    def extract_topics(
        self,
        topic_model,
        documents: pd.DataFrame,
        c_tf_idf: csr_matrix,
        topics: Mapping[str, List[Tuple[str, float]]],
    ) -> Mapping[str, List[Tuple[str, int]]]:
        """Extract topics.

        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF

        Returns:
            updated_topics: Updated topic representations
        """
        # Extract at most `nr_docs` representative documents per topic
        repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
            c_tf_idf=c_tf_idf,
            documents=documents,
            topics=topics,
            nr_samples=500,
            nr_repr_docs=self.nr_docs,
            diversity=self.diversity,
        )

        # Generate label using langchain's batch functionality
        chain_docs: List[List[Document]] = [
            [
                Document(page_content=truncate_document(topic_model, self.doc_length, self.tokenizer, doc))
                for doc in docs
            ]
            for docs in repr_docs_mappings.values()
        ]

        # `self.chain` must take `input_documents` and `question` as input keys
        # Use a custom prompt that leverages keywords, using the tag: [KEYWORDS]
        if "[KEYWORDS]" in self.prompt:
            prompts = []
            for topic in topics:
                keywords = list(zip(*topics[topic]))[0]
                prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
                prompts.append(prompt)

            inputs = [{"input_documents": docs, "question": prompt} for docs, prompt in zip(chain_docs, prompts)]
        else:
            inputs = [{"input_documents": docs, "question": self.prompt} for docs in chain_docs]

        # `self.chain` must return a dict with an `output_text` key,
        # the same output key as the `StuffDocumentsChain` returned by `load_qa_chain`
        outputs = self.chain.batch(inputs=inputs, config=self.chain_config)
        labels = [output["output_text"].strip() for output in outputs]

        updated_topics = {
            topic: [(label, 1)] + [("", 0) for _ in range(9)] for topic, label in zip(repr_docs_mappings.keys(), labels)
        }

        return updated_topics
````
extract_topics(self, topic_model, documents, c_tf_idf, topics)
Extract topics.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `topic_model` | | A BERTopic model | *required* |
| `documents` | `DataFrame` | All input documents | *required* |
| `c_tf_idf` | `csr_matrix` | The topic c-TF-IDF representation | *required* |
| `topics` | `Mapping[str, List[Tuple[str, float]]]` | The candidate topics as calculated with c-TF-IDF | *required* |

Returns:

| Type | Description |
| --- | --- |
| `updated_topics` | Updated topic representations |
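The updated representation keeps BERTopic's usual ten-slot shape per topic, with the generated label in the first slot and empty padding tuples after it (see the source below). Illustratively, with hypothetical topic labels:

```python
# Hypothetical return value for two topics: the chain's label fills the
# first slot, the remaining nine slots are padded with ("", 0)
updated_topics = {
    0: [("space and astronomy", 1)] + [("", 0)] * 9,
    1: [("hockey and baseball", 1)] + [("", 0)] * 9,
}
```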
Source code in `bertopic/representation/_langchain.py`

```python
def extract_topics(
    self,
    topic_model,
    documents: pd.DataFrame,
    c_tf_idf: csr_matrix,
    topics: Mapping[str, List[Tuple[str, float]]],
) -> Mapping[str, List[Tuple[str, int]]]:
    """Extract topics.

    Arguments:
        topic_model: A BERTopic model
        documents: All input documents
        c_tf_idf: The topic c-TF-IDF representation
        topics: The candidate topics as calculated with c-TF-IDF

    Returns:
        updated_topics: Updated topic representations
    """
    # Extract at most `nr_docs` representative documents per topic
    repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
        c_tf_idf=c_tf_idf,
        documents=documents,
        topics=topics,
        nr_samples=500,
        nr_repr_docs=self.nr_docs,
        diversity=self.diversity,
    )

    # Generate label using langchain's batch functionality
    chain_docs: List[List[Document]] = [
        [
            Document(page_content=truncate_document(topic_model, self.doc_length, self.tokenizer, doc))
            for doc in docs
        ]
        for docs in repr_docs_mappings.values()
    ]

    # `self.chain` must take `input_documents` and `question` as input keys
    # Use a custom prompt that leverages keywords, using the tag: [KEYWORDS]
    if "[KEYWORDS]" in self.prompt:
        prompts = []
        for topic in topics:
            keywords = list(zip(*topics[topic]))[0]
            prompt = self.prompt.replace("[KEYWORDS]", ", ".join(keywords))
            prompts.append(prompt)

        inputs = [{"input_documents": docs, "question": prompt} for docs, prompt in zip(chain_docs, prompts)]
    else:
        inputs = [{"input_documents": docs, "question": self.prompt} for docs in chain_docs]

    # `self.chain` must return a dict with an `output_text` key,
    # the same output key as the `StuffDocumentsChain` returned by `load_qa_chain`
    outputs = self.chain.batch(inputs=inputs, config=self.chain_config)
    labels = [output["output_text"].strip() for output in outputs]

    updated_topics = {
        topic: [(label, 1)] + [("", 0) for _ in range(9)] for topic, label in zip(repr_docs_mappings.keys(), labels)
    }

    return updated_topics
```