Quickstart
Installation
Installation can be done from PyPI:
pip install keybert
You may want to install more packages depending on the transformer and language backends you will be using. The possible installations are:
pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]
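If you installed one of these backends, you can, for example, pass a Flair document embedding to KeyBERT instead of the default sentence-transformers model. This is a minimal sketch; the "roberta-base" model name is only an illustration:
from flair.embeddings import TransformerDocumentEmbeddings
from keybert import KeyBERT
# Load a Flair document embedding and pass it to KeyBERT as the backend
roberta = TransformerDocumentEmbeddings("roberta-base")
kw_model = KeyBERT(model=roberta)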
Basic usage
The most minimal example of keyword extraction can be seen below:
from keybert import KeyBERT
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
You can set keyphrase_ngram_range to control the length of the resulting keywords/keyphrases:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
('algorithm', 0.4556),
('training', 0.4487),
('class', 0.4086),
('mapping', 0.3700)]
To extract keyphrases, simply set keyphrase_ngram_range
to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
('machine learning', 0.6305),
('supervised learning', 0.5985),
('algorithm analyzes', 0.5860),
('learning function', 0.5850)]
We can highlight the keywords in the document by simply setting highlight:
keywords = kw_model.extract_keywords(doc, highlight=True)
NOTE
For a full overview of all possible transformer models, see sentence-transformers. I would advise either "all-MiniLM-L6-v2" for English documents or "paraphrase-multilingual-MiniLM-L12-v2" for multilingual documents or documents in any other language.
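For example, to select one of these models, you can pass its name when creating the KeyBERT instance (a minimal sketch):
from keybert import KeyBERT
# Use a specific sentence-transformers model by name
kw_model = KeyBERT(model="all-MiniLM-L6-v2")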
Fine-tuning
By default, KeyBERT simply compares the documents and candidate keywords/keyphrases based on their cosine similarity. However, this might lead to very similar words ending up in the list of most accurate keywords/keyphrases. To make sure the results are a bit more diverse, there are two approaches we can take to fine-tune the output: Max Sum Distance and Maximal Marginal Relevance.
Max Sum Distance
To diversify the results, we take the 2 x top_n words/phrases most similar to the document. Then, we take all top_n combinations from those 2 x top_n words and extract the combination whose members are the least similar to each other by cosine similarity.
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
('generalize training data', 0.7727),
('requires learning algorithm', 0.5050),
('supervised learning algorithm', 0.3779),
('learning machine learning', 0.2891)]
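To make the selection step more concrete, below is a rough, simplified sketch of the idea. It is not KeyBERT's actual implementation and assumes you already have a document embedding and candidate word embeddings as NumPy arrays:
import itertools
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_sum_distance(doc_embedding, word_embeddings, words, top_n=5, nr_candidates=10):
    # Similarity of each candidate word/phrase to the document
    doc_sim = cosine_similarity(word_embeddings, doc_embedding.reshape(1, -1)).flatten()
    # Pairwise similarity between candidates
    word_sim = cosine_similarity(word_embeddings)
    # Keep the nr_candidates words closest to the document
    candidate_idx = np.argsort(doc_sim)[-nr_candidates:]
    # Among those, pick the top_n combination with the lowest mutual similarity
    best_combo, best_score = None, float("inf")
    for combo in itertools.combinations(candidate_idx, top_n):
        score = sum(word_sim[i][j] for i, j in itertools.combinations(combo, 2))
        if score < best_score:
            best_combo, best_score = combo, score
    return [words[i] for i in best_combo]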
Maximal Marginal Relevance
To diversify the results, we can use Maximal Marginal Relevance (MMR) to create keywords/keyphrases, which is also based on cosine similarity. The results with high diversity:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
('labels unseen instances', 0.1649),
('new examples optimal', 0.4185),
('determine class labels', 0.4774),
('supervised learning algorithm', 0.7502)]
The results with low diversity:
>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
('supervised learning algorithm', 0.7502),
('learning machine learning', 0.7577),
('learning algorithm analyzes', 0.7587),
('learning algorithm generalize', 0.7514)]
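For reference, the core MMR loop can be sketched as follows. This is a simplified illustration rather than KeyBERT's actual code, and it assumes the diversity parameter weights the penalty for being similar to keywords that were already selected:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, word_embeddings, words, top_n=5, diversity=0.7):
    # Similarity of each candidate to the document and to the other candidates
    doc_sim = cosine_similarity(word_embeddings, doc_embedding.reshape(1, -1)).flatten()
    word_sim = cosine_similarity(word_embeddings)
    # Start with the candidate most similar to the document
    selected = [int(np.argmax(doc_sim))]
    remaining = [i for i in range(len(words)) if i not in selected]
    for _ in range(top_n - 1):
        # Reward document similarity, penalize similarity to already-selected keywords
        scores = [(1 - diversity) * doc_sim[i] - diversity * word_sim[i][selected].max()
                  for i in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [words[i] for i in selected]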
Candidate Keywords/Keyphrases
In some cases, one might want to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases. In KeyBERT, you can easily use those candidate keywords to perform keyword extraction:
import yake
from keybert import KeyBERT
# Create candidates
kw_extractor = yake.KeywordExtractor(top=50)
candidates = kw_extractor.extract_keywords(doc)
candidates = [candidate[0] for candidate in candidates]
# Pass candidates to KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=candidates)
Guided KeyBERT
Guided KeyBERT is similar to Guided Topic Modeling in that it tries to steer the extraction towards a set of seeded terms. When applying KeyBERT, it automatically extracts the keywords most related to a specific document. However, there are times when stakeholders and users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article. However, there might be a specific topic in the article that you would like to be extracted through the keywords. To achieve this, we simply give KeyBERT a set of related seeded keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seeded keywords.
Using this feature is as simple as defining a list of seeded keywords and passing them to KeyBERT:
from keybert import KeyBERT
kw_model = KeyBERT()
# Define our seeded term
seed_keywords = ["information"]
keywords = kw_model.extract_keywords(doc, seed_keywords=seed_keywords)
Prepare embeddings
When you have a large dataset and want to fine-tune parameters such as diversity, it can take quite a while to re-calculate the document and word embeddings each time you change a parameter. Instead, we can pre-calculate these embeddings and pass them to .extract_keywords so that we only have to calculate them once:
from keybert import KeyBERT
kw_model = KeyBERT()
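# docs is assumed to be a list of documents (strings) defined elsewhere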
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)
You can then pass these embeddings to .extract_keywords to speed up tuning the model:
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
There are several parameters in .extract_embeddings
that define how the list of candidate keywords/keyphrases is generated:
candidates
keyphrase_ngram_range
stop_words
min_df
vectorizer
The values of these parameters need to be exactly the same in .extract_embeddings as they are in .extract_keywords.
In other words, the following will work as they use the same parameter subset:
from keybert import KeyBERT
kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=1, stop_words="english")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
The following, however, will throw an error since we did not use the same values for min_df and stop_words:
from keybert import KeyBERT
kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=3, stop_words="dutch")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)