c-TF-IDF
A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base.
c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes by joining all documents per class. Thus, each class is converted to a single document instead of a set of documents. Then, the frequency of each word t is extracted for each class i and divided by the total number of words w in that class. Next, the total, unjoined, number of documents across all classes m is divided by the total frequency of word t across all classes. Note that the `fit` implementation listed on this page uses the average number of words per class in place of m, and adds 1 inside the logarithm so that every score stays positive.
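The computation can be sketched without any dependencies. The following is a minimal illustration, assuming the weighting used in the `fit` and `transform` listings on this page (the average number of words per class divided by a term's total frequency, plus one, inside the logarithm). The function name `ctfidf` and the toy counts are hypothetical and not part of the library.

```python
import math


def ctfidf(counts):
    """Compute c-TF-IDF scores for a class-by-term count matrix (list of rows).

    Assumes every row and every column has a non-zero sum.
    """
    n_terms = len(counts[0])
    # Total frequency of each term across all classes (column sums).
    term_freq = [sum(row[j] for row in counts) for j in range(n_terms)]
    # Average number of words per class (mean of the row sums), truncated to int.
    avg_words = int(sum(sum(row) for row in counts) / len(counts))
    # IDF-like weight: log(average words per class / term frequency + 1).
    idf = [math.log(avg_words / f + 1) for f in term_freq]
    # L1-normalise each class row to get term frequencies, then weight by idf.
    scores = []
    for row in counts:
        total = sum(row)
        scores.append([(c / total) * w for c, w in zip(row, idf)])
    return scores


# Hypothetical toy data: 2 classes (rows), 3 terms (columns).
counts = [
    [4, 0, 4],  # class 0: term 0 is exclusive to this class, term 2 is shared
    [0, 6, 2],  # class 1
]
scores = ctfidf(counts)
```

In class 0, terms 0 and 2 occur equally often, but term 0 receives the higher score because it does not appear in any other class.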
Source code in bertopic\_ctfidf.py
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import normalize
from sklearn.utils import check_array


class ClassTFIDF(TfidfTransformer):
    """
    A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base.

    c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes
    by joining all documents per class. Thus, each class is converted to a single document
    instead of a set of documents. Then, the frequency of each word **t** is extracted for
    each class **i** and divided by the total number of words **w**.
    Next, the total, unjoined, number of documents across all classes **m** is divided by the total
    frequency of word **t** across all classes.
    """
    def __init__(self, *args, **kwargs):
        super(ClassTFIDF, self).__init__(*args, **kwargs)

    def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
        """Learn the idf vector (global term weights).

        Arguments:
            X: A matrix of term/token counts.
            multiplier: A multiplier for increasing/decreasing certain IDF scores
        """
        X = check_array(X, accept_sparse=('csr', 'csc'))
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = np.float64

        if self.use_idf:
            _, n_features = X.shape

            # Calculate the frequency of words across all classes
            df = np.squeeze(np.asarray(X.sum(axis=0)))

            # Average number of words per class (mean of the row sums),
            # used as regularization
            avg_nr_samples = int(X.sum(axis=1).mean())

            # Divide the average number of words per class by the word frequency
            # +1 is added to force values to be positive
            idf = np.log((avg_nr_samples / df)+1)

            # Multiplier to increase/decrease certain idf scores
            if multiplier is not None:
                idf = idf * multiplier

            self._idf_diag = sp.diags(idf, offsets=0,
                                      shape=(n_features, n_features),
                                      format='csr',
                                      dtype=dtype)

        return self

    def transform(self, X: sp.csr_matrix):
        """Transform a count-based matrix to c-TF-IDF

        Arguments:
            X (sparse matrix): A matrix of term/token counts.

        Returns:
            X (sparse matrix): A c-TF-IDF matrix
        """
        if self.use_idf:
            # L1-normalise each class row, then scale every term by its idf weight
            X = normalize(X, axis=1, norm='l1', copy=False)
            X = X * self._idf_diag

        return X
fit(self, X, multiplier=None)
Learn the idf vector (global term weights).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | csr_matrix | A matrix of term/token counts. | required |
multiplier | ndarray | A multiplier for increasing/decreasing certain IDF scores | None |
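The effect of `multiplier` can be illustrated in isolation. This is a minimal sketch, assuming the multiplier is applied element-wise to the idf vector, as in the `fit` source on this page, and that it holds one factor per term (a scalar would also broadcast). The idf values and factors below are hypothetical.

```python
import numpy as np

# Hypothetical idf vector for three terms, computed the same way as in `fit`:
# log(average words per class / term frequency + 1).
idf = np.log(np.array([8.0, 8.0, 8.0]) / np.array([4.0, 6.0, 6.0]) + 1)

# One factor per term: 1.0 leaves a term unchanged, values above 1.0 boost it,
# values below 1.0 dampen it.
multiplier = np.array([1.0, 2.0, 0.5])

# Element-wise product, the same operation `fit` applies before building
# the diagonal idf matrix.
weighted_idf = idf * multiplier
```

Because the factors are applied to the idf vector, they affect a term's score in every class equally.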
Source code in bertopic\_ctfidf.py
def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
    """Learn the idf vector (global term weights).

    Arguments:
        X: A matrix of term/token counts.
        multiplier: A multiplier for increasing/decreasing certain IDF scores
    """
    X = check_array(X, accept_sparse=('csr', 'csc'))
    if not sp.issparse(X):
        X = sp.csr_matrix(X)
    dtype = np.float64

    if self.use_idf:
        _, n_features = X.shape

        # Calculate the frequency of words across all classes
        df = np.squeeze(np.asarray(X.sum(axis=0)))

        # Average number of words per class (mean of the row sums),
        # used as regularization
        avg_nr_samples = int(X.sum(axis=1).mean())

        # Divide the average number of words per class by the word frequency
        # +1 is added to force values to be positive
        idf = np.log((avg_nr_samples / df)+1)

        # Multiplier to increase/decrease certain idf scores
        if multiplier is not None:
            idf = idf * multiplier

        self._idf_diag = sp.diags(idf, offsets=0,
                                  shape=(n_features, n_features),
                                  format='csr',
                                  dtype=dtype)

    return self
transform(self, X)
Transform a count-based matrix to c-TF-IDF
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | sparse matrix | A matrix of term/token counts. | required |
Returns:
Type | Description |
---|---|
sparse matrix | A c-TF-IDF matrix |
Source code in bertopic\_ctfidf.py
def transform(self, X: sp.csr_matrix):
    """Transform a count-based matrix to c-TF-IDF

    Arguments:
        X (sparse matrix): A matrix of term/token counts.

    Returns:
        X (sparse matrix): A c-TF-IDF matrix
    """
    if self.use_idf:
        X = normalize(X, axis=1, norm='l1', copy=False)
        X = X * self._idf_diag

    return X
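The two steps in `transform` (L1 normalisation of each row, then multiplication by a diagonal idf matrix) can be reproduced on a small dense array. This is a minimal sketch with hypothetical counts and idf values; it uses plain NumPy instead of the sparse matrices in the listing, purely for readability.

```python
import numpy as np

# Hypothetical class-by-term counts (2 classes, 3 terms) and idf weights.
counts = np.array([[4.0, 0.0, 4.0],
                   [0.0, 6.0, 2.0]])
idf = np.array([1.1, 0.85, 0.85])

# Step 1: L1-normalise each row so the entries of every class sum to 1.
tf = counts / counts.sum(axis=1, keepdims=True)

# Step 2: multiply by the diagonal idf matrix (dense here, sparse in the listing).
ctfidf_diag = tf @ np.diag(idf)

# Equivalent result: scale each column by its idf weight via broadcasting.
ctfidf_scaled = tf * idf
```

Multiplying by a diagonal matrix only rescales columns, which is why the sparse diagonal form keeps the operation cheap when the vocabulary is large.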