c-TF-IDF
Bases: TfidfTransformer
A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base.
c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes by joining all documents per class. Thus, each class is converted to a single document instead of a set of documents. The frequency of each word x is extracted for each class c and is L1-normalized. This constitutes the term frequency.
Then, the term frequency is multiplied by the IDF, which is the logarithm of 1 plus the average number of words per class A divided by the frequency of word x across all classes.
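As a worked illustration of the two steps described above, the following minimal NumPy sketch computes the term frequency and the IDF on a toy class-by-word count matrix. The matrix values and variable names are made up for the example, and the code follows the textual description rather than the library's internals.

```python
import numpy as np

# Toy class-by-word count matrix: one row per class (all documents of a
# class joined into one), one column per word. Values are illustrative.
counts = np.array([
    [3.0, 0.0, 1.0],   # class 0
    [1.0, 2.0, 1.0],   # class 1
])

# Term frequency: L1-normalize each class row so that it sums to 1.
tf = counts / counts.sum(axis=1, keepdims=True)

# A: average number of words per class.
avg_words_per_class = counts.sum() / counts.shape[0]

# Frequency of each word x across all classes.
word_freq = counts.sum(axis=0)

# IDF: logarithm of 1 plus A divided by the word frequency.
idf = np.log(1 + avg_words_per_class / word_freq)

# c-TF-IDF: term frequency multiplied by IDF (broadcast over the rows).
ctfidf = tf * idf
```

In this toy case A is 4 and the word frequencies are 4, 2 and 2, so the IDF weights are log(2), log(3) and log(3): the word that occurs most often across the classes receives the smallest weight.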
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`bm25_weighting` | `bool` | Uses a BM25-inspired idf-weighting procedure instead of the procedure defined in the c-TF-IDF formula. | `False` |
`reduce_frequent_words` | `bool` | Takes the square root of the bag-of-words after normalizing the matrix. Helps to reduce the impact of words that appear too frequently. | `False` |
`seed_words` | `List[str]` | Specific words that will have their idf value increased by the value of `seed_multiplier`. | `None` |
`seed_multiplier` | `float` | The value with which the idf values of the words in `seed_words` are multiplied. | `2` |
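To make the effect of `reduce_frequent_words` concrete, here is a small NumPy sketch of the square-root step as the description states it: normalize first, then take the square root. The counts are invented for the example, and the snippet illustrates the idea rather than reproducing the library's code.

```python
import numpy as np

# Toy counts for a single class in which the first word dominates.
counts = np.array([[81.0, 9.0, 10.0]])

# Regular term-frequency step: L1-normalize the row.
tf = counts / counts.sum(axis=1, keepdims=True)

# With reduce_frequent_words enabled (as described): square root of the
# normalized bag-of-words.
tf_reduced = np.sqrt(tf)

# How far the dominant word leads the second word, before and after.
ratio_before = tf[0, 0] / tf[0, 1]
ratio_after = tf_reduced[0, 0] / tf_reduced[0, 1]
```

The dominant word's lead over the second word shrinks from a factor of 9 to a factor of 3, which is the dampening the parameter is meant to provide.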
Examples:
from bertopic.vectorizers import ClassTfidfTransformer
transformer = ClassTfidfTransformer()
Source code in bertopic\vectorizers\_ctfidf.py
fit(X, multiplier=None)
Learn the idf vector (global term weights).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`X` | `csr_matrix` | A matrix of term/token counts. | required |
`multiplier` | `ndarray` | A multiplier for increasing/decreasing certain IDF scores. | `None` |
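The role of `multiplier` can be sketched as follows. The helper `learn_idf`, the vocabulary, and the way the multiplier array is built from seed words are all assumptions made for illustration; the sketch follows the IDF description given earlier on this page and is not the library's implementation.

```python
import numpy as np

def learn_idf(X, multiplier=None):
    """Hypothetical helper: return one IDF weight per word (column of X)."""
    # Frequency of each word across all classes.
    word_freq = np.asarray(X.sum(axis=0), dtype=float).ravel()
    # Average number of words per class.
    avg_words_per_class = X.sum() / X.shape[0]
    idf = np.log(1 + avg_words_per_class / word_freq)
    # Optionally scale selected IDF scores up or down.
    if multiplier is not None:
        idf = idf * multiplier
    return idf

vocabulary = ["apple", "banana", "cherry"]  # made-up vocabulary
X = np.array([[3, 0, 1],
              [1, 2, 1]])

# Assumed link to seed words: double the IDF of the seed word only.
seed_words = ["cherry"]
seed_multiplier = 2
multiplier = np.array(
    [seed_multiplier if word in seed_words else 1 for word in vocabulary],
    dtype=float,
)

plain_idf = learn_idf(X)
boosted_idf = learn_idf(X, multiplier=multiplier)
```

Only the entry for the seed word changes between `plain_idf` and `boosted_idf`; every other IDF score is multiplied by 1 and stays the same.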
transform(X)
Transform a count-based matrix to c-TF-IDF.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`X` | sparse matrix | A matrix of term/token counts. | required |
Returns:
Name | Type | Description |
---|---|---|
`X` | sparse matrix | A c-TF-IDF matrix. |
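A rough picture of what such a transformation involves, assuming the IDF weights have already been learned: L1-normalize each class row of the sparse count matrix, then scale each column by its IDF weight. The helper `to_ctfidf` and the input values are hypothetical, and the sketch uses SciPy sparse matrices directly instead of calling the library.

```python
import numpy as np
from scipy import sparse

def to_ctfidf(X, idf):
    """Hypothetical helper: weight the L1-normalized rows of X by idf."""
    X = sparse.csr_matrix(X, dtype=float)
    # L1-normalize each class row; rows without any counts stay all-zero.
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    inverse = np.divide(1.0, row_sums,
                        out=np.zeros_like(row_sums), where=row_sums > 0)
    tf = sparse.diags(inverse) @ X
    # Scale every column (word) by its IDF weight.
    return (tf @ sparse.diags(idf)).tocsr()

# Toy count matrix (2 classes, 3 words) and example IDF weights.
counts = sparse.csr_matrix(np.array([[3, 0, 1],
                                     [1, 2, 1]]))
idf = np.log([2.0, 3.0, 3.0])

ctfidf = to_ctfidf(counts, idf)
```

The result stays sparse and has the same shape as the input, with one row per class and one column per word.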