Max Sum Distance
Calculate Max Sum Distance for the extraction of keywords.

We take the `nr_candidates` words/phrases most similar to the document. Then, we take all `top_n` combinations from those candidates and extract the combination whose members are the least similar to each other by cosine similarity.

The search enumerates every `top_n`-sized combination of the candidates, so its cost grows combinatorially; it is not advised with a large `top_n`.
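To get a feel for the cost: the search scores binomial(`nr_candidates`, `top_n`) combinations. A quick standard-library illustration (pairing `nr_candidates = 2 * top_n` here is just a common choice, not something the function requires):

```python
from math import comb

# How many combinations the exhaustive search must score.
for top_n in (5, 10, 15):
    nr_candidates = 2 * top_n
    print(top_n, comb(nr_candidates, top_n))
# 5 252
# 10 184756
# 15 155117520
```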
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`doc_embedding` | `ndarray` | The document embeddings | required |
`word_embeddings` | `ndarray` | The embeddings of the selected candidate keywords/phrases | required |
`words` | `List[str]` | The selected candidate keywords/keyphrases | required |
`top_n` | `int` | The number of keywords/keyphrases to return | required |
`nr_candidates` | `int` | The number of candidates to consider | required |
Returns:

Type | Description |
---|---|
`List[Tuple[str, float]]` | The selected keywords/keyphrases with their distances |
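A minimal usage sketch (not taken from the KeyBERT docs; it assumes `sentence-transformers` for the embeddings and an arbitrary model name, but any encoder that returns 2-D numpy arrays works the same way):

```python
from sentence_transformers import SentenceTransformer

from keybert._maxsum import max_sum_distance

doc = "Supervised learning is the task of learning a function that maps an input to an output."
candidates = ["supervised learning", "learning a function", "input", "output", "task"]

# Any embedding model works; this model name is just an example choice.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embedding = model.encode([doc])         # shape (1, dim)
word_embeddings = model.encode(candidates)  # shape (len(candidates), dim)

keywords = max_sum_distance(
    doc_embedding,
    word_embeddings,
    words=candidates,
    top_n=3,
    nr_candidates=5,
)
print(keywords)  # three (phrase, similarity-to-document) tuples
```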
Source code in `keybert/_maxsum.py`:
```python
import itertools
from typing import List, Tuple

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def max_sum_distance(
    doc_embedding: np.ndarray,
    word_embeddings: np.ndarray,
    words: List[str],
    top_n: int,
    nr_candidates: int,
) -> List[Tuple[str, float]]:
    """Calculate Max Sum Distance for the extraction of keywords.

    We take the `nr_candidates` words/phrases most similar to the document.
    Then, we take all `top_n` combinations from those candidates and extract
    the combination whose members are the least similar to each other by
    cosine similarity.

    The search enumerates every `top_n`-sized combination of the candidates,
    so it is not advised with a large `top_n`.

    Arguments:
        doc_embedding: The document embeddings
        word_embeddings: The embeddings of the selected candidate keywords/phrases
        words: The selected candidate keywords/keyphrases
        top_n: The number of keywords/keyphrases to return
        nr_candidates: The number of candidates to consider

    Returns:
        List[Tuple[str, float]]: The selected keywords/keyphrases with their distances
    """
    if nr_candidates < top_n:
        raise Exception(
            "Make sure that the number of candidates exceeds the number "
            "of keywords to return."
        )
    elif top_n > len(words):
        return []

    # Calculate distances between the document and the words,
    # and among the words themselves
    distances = cosine_similarity(doc_embedding, word_embeddings)
    distances_words = cosine_similarity(word_embeddings, word_embeddings)

    # Get the nr_candidates words most similar to the document
    words_idx = list(distances.argsort()[0][-nr_candidates:])
    words_vals = [words[index] for index in words_idx]
    candidates = distances_words[np.ix_(words_idx, words_idx)]

    # Find the combination of words that are the least similar to each other
    min_sim = 100_000
    candidate = None
    for combination in itertools.combinations(range(len(words_idx)), top_n):
        sim = sum(
            [candidates[i][j] for i in combination for j in combination if i != j]
        )
        if sim < min_sim:
            candidate = combination
            min_sim = sim

    return [
        (words_vals[idx], round(float(distances[0][words_idx[idx]]), 4))
        for idx in candidate
    ]
```
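The heart of the function is the exhaustive loop near the end. A standalone sketch of just that step, using a hand-built 4x4 word-word similarity matrix (illustrative values, not KeyBERT output):

```python
import itertools

import numpy as np

# Hand-built symmetric similarity matrix for 4 candidate words.
# Words 0 and 1 are near-duplicates (0.9); words 1 and 2 barely overlap (0.1).
candidates = np.array([
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.5],
    [0.3, 0.4, 0.5, 1.0],
])
top_n = 2

best, min_sim = None, float("inf")
for combination in itertools.combinations(range(len(candidates)), top_n):
    # Sum the pairwise similarities inside the combination (i != j).
    sim = sum(candidates[i][j] for i in combination for j in combination if i != j)
    if sim < min_sim:
        best, min_sim = combination, sim

print(best)  # (1, 2) -- the least mutually similar pair
```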