FAQ
Which embedding model works best for which language?
Unfortunately, there is no definitive list of the best models for each language; this highly depends
on your data, the model, and your specific use case. However, the default model in KeyBERT
("all-MiniLM-L6-v2") works great for English documents. In contrast, for multilingual
documents or any other language, "paraphrase-multilingual-MiniLM-L12-v2"
has shown great performance.
If you want to use a model that provides higher quality but takes more compute time, then I would advise using "paraphrase-mpnet-base-v2"
and "paraphrase-multilingual-mpnet-base-v2"
instead.
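For example, a different sentence-transformers model can be selected when creating the KeyBERT instance (a minimal sketch; doc stands in for your own document):
from keybert import KeyBERT
# Pass the name of any sentence-transformers model; it is downloaded on first use.
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(doc)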
Should I preprocess the data?
No. When using document embeddings, there is typically no need to preprocess the data, as all parts of a document are important in understanding its general topic. Although this holds true in 99% of cases, if your data contains a lot of noise, for example HTML tags, then it is best to remove them. HTML tags typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply keyword extraction to HTML code, for example to extract keywords that describe the code itself, then those tags do become important.
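As a rough illustration of such a cleaning step (a minimal sketch; strip_html is a hypothetical helper and not part of KeyBERT, and a dedicated parser such as BeautifulSoup would be more robust):
import re
def strip_html(text):
    # Crude heuristic: replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)
clean_docs = [strip_html(doc) for doc in my_list_of_documents]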
How can I speed up the model?
Since KeyBERT uses large language models as its backend, a GPU is typically preferred when using this package. Although it is possible to use KeyBERT without a dedicated GPU, inference will be significantly slower.
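For example, assuming a CUDA-capable GPU is available, the embedding model can be loaded on the GPU and then passed to KeyBERT (a minimal sketch):
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
# Load the sentence-transformers model on the GPU and hand it to KeyBERT.
sentence_model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
kw_model = KeyBERT(model=sentence_model)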
A second way to speed up KeyBERT is to pass it multiple documents at once. That way, words only need to be embedded a single time, which can result in a major speed-up.
This is faster:
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(my_list_of_documents)
This is slower:
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = []
for document in my_list_of_documents:
    keyword = kw_model.extract_keywords(document)
    keywords.append(keyword)
How can I use KeyBERT with Chinese documents?
You need to make sure you use a tokenizer in KeyBERT that supports tokenization of Chinese. I suggest installing jieba
for this:
from sklearn.feature_extraction.text import CountVectorizer
import jieba
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words
vectorizer = CountVectorizer(tokenizer=tokenize_zh)
Then, simply pass the vectorizer to your KeyBERT instance:
from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
It also supports highlighting:
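A minimal example, assuming the same doc and vectorizer as above and using the highlight argument of extract_keywords:
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer, highlight=True)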