
c-TF-IDF

A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base.

c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes by joining all documents per class. Each class is thus converted to a single document instead of a set of documents. The frequency of each word t is then extracted for each class i and divided by the total number of words w in that class. Next, the total, unjoined number of documents across all classes m is divided by the total frequency of word t across all classes. Note that the fit method in the source code on this page uses the average number of words per class, rather than m, as the numerator of that ratio.
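
As a reading aid, the weighting that the fit and transform source on this page computes can be written out as a single formula. The notation is an assumption made for this sketch, not taken from the library: f_{t,i} is the count of word t in class i, w_i is the total number of words in class i, and A is the average number of words per class, truncated to an integer.

```latex
W_{t,i} = \frac{f_{t,i}}{w_i} \cdot \log\left(1 + \frac{A}{\sum_{j} f_{t,j}}\right)
```

The first factor is the L1-normalized term frequency within the class, and the second is the IDF-like term, where the added 1 keeps the logarithm positive.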

Source code in bertopic\_ctfidf.py
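import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import normalize
from sklearn.utils import check_array


class ClassTFIDF(TfidfTransformer):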
    """
    A class-based TF-IDF procedure using scikit-learn's TfidfTransformer as a base.

    ![](../img/ctfidf.png)

    c-TF-IDF can best be explained as a TF-IDF formula adapted for multiple classes
    by joining all documents per class. Each class is thus converted to a single document
    instead of a set of documents. The frequency of each word **t** is then extracted for
    each class **i** and divided by the total number of words **w** in that class.
    Next, the total, unjoined number of documents across all classes **m** is divided by the
    total frequency of word **t** across all classes.
    """
    def __init__(self, *args, **kwargs):
        super(ClassTFIDF, self).__init__(*args, **kwargs)

    def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
        """Learn the idf vector (global term weights).

        Arguments:
            X: A matrix of term/token counts.
            multiplier: A multiplier for increasing/decreasing certain IDF scores
        """
        X = check_array(X, accept_sparse=('csr', 'csc'))
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = np.float64

        if self.use_idf:
            _, n_features = X.shape

            # Calculate the frequency of words across all classes
            df = np.squeeze(np.asarray(X.sum(axis=0)))

            # Calculate the average number of samples as regularization
            avg_nr_samples = int(X.sum(axis=1).mean())

            # Divide the average number of samples by the word frequency
            # +1 is added to force values to be positive
            idf = np.log((avg_nr_samples / df)+1)

            # Multiplier to increase/decrease certain idf scores
            if multiplier is not None:
                idf = idf * multiplier

            self._idf_diag = sp.diags(idf, offsets=0,
                                      shape=(n_features, n_features),
                                      format='csr',
                                      dtype=dtype)

        return self

    def transform(self, X: sp.csr_matrix):
        """Transform a count-based matrix to c-TF-IDF

        Arguments:
            X (sparse matrix): A matrix of term/token counts.

        Returns:
            X (sparse matrix): A c-TF-IDF matrix
        """
        if self.use_idf:
            X = normalize(X, axis=1, norm='l1', copy=False)
            X = X * self._idf_diag

        return X
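
The class expects a matrix of term counts with one row per class, which presupposes that the documents of each class have already been joined. A minimal sketch of that preparation step, in pure Python, might look as follows. The toy corpus, the labels, and the whitespace tokenizer are all assumptions made for illustration; in practice a vectorizer such as scikit-learn's CountVectorizer would typically produce the sparse count matrix.

```python
from collections import Counter

# Toy corpus: each document carries a class label. All data is illustrative.
docs = ["the cat sat", "the cat purred", "stocks fell sharply", "stocks rose"]
labels = [0, 0, 1, 1]

# Join all documents that share a class into one long document.
joined = {}
for doc, label in zip(docs, labels):
    joined.setdefault(label, []).append(doc)
class_docs = {label: " ".join(parts) for label, parts in joined.items()}

# Count words per class with a naive whitespace tokenizer (an assumption).
class_counts = {label: Counter(text.split()) for label, text in class_docs.items()}

# A fixed, sorted vocabulary defines the columns of the count matrix.
vocab = sorted({word for counts in class_counts.values() for word in counts})

# One row per class, one column per vocabulary word.
count_matrix = [
    [class_counts[label][word] for word in vocab]
    for label in sorted(class_counts)
]
```

Each row of `count_matrix` now describes a whole class rather than a single document, which is the shape of input that `fit` and `transform` operate on.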

fit(self, X, multiplier=None)

Learn the idf vector (global term weights).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | csr_matrix | A matrix of term/token counts. | required |
| multiplier | ndarray | A multiplier for increasing/decreasing certain IDF scores | None |

Source code in bertopic\_ctfidf.py
def fit(self, X: sp.csr_matrix, multiplier: np.ndarray = None):
    """Learn the idf vector (global term weights).

    Arguments:
        X: A matrix of term/token counts.
        multiplier: A multiplier for increasing/decreasing certain IDF scores
    """
    X = check_array(X, accept_sparse=('csr', 'csc'))
    if not sp.issparse(X):
        X = sp.csr_matrix(X)
    dtype = np.float64

    if self.use_idf:
        _, n_features = X.shape

        # Calculate the frequency of words across all classes
        df = np.squeeze(np.asarray(X.sum(axis=0)))

        # Calculate the average number of samples as regularization
        avg_nr_samples = int(X.sum(axis=1).mean())

        # Divide the average number of samples by the word frequency
        # +1 is added to force values to be positive
        idf = np.log((avg_nr_samples / df)+1)

        # Multiplier to increase/decrease certain idf scores
        if multiplier is not None:
            idf = idf * multiplier

        self._idf_diag = sp.diags(idf, offsets=0,
                                  shape=(n_features, n_features),
                                  format='csr',
                                  dtype=dtype)

    return self
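
To make the IDF step concrete, the following dense NumPy sketch repeats the arithmetic of `fit` on a tiny count matrix and then applies a multiplier. It is not the library code: the count values and the `multiplier` array are hypothetical, and dense arrays stand in for the sparse matrices used above.

```python
import numpy as np

# Toy class-by-term counts: 2 classes (rows), 3 terms (columns). Illustrative values.
X = np.array([[4, 1, 0],
              [4, 0, 3]])

# Frequency of each term across all classes.
df = X.sum(axis=0)

# Average number of words per class, truncated to an integer as in fit().
avg_nr_samples = int(X.sum(axis=1).mean())

# IDF-like weight; the +1 keeps the logarithm positive.
idf = np.log(avg_nr_samples / df + 1)

# Hypothetical multiplier: halve the weight of term 0, leave the others unchanged.
multiplier = np.array([0.5, 1.0, 1.0])
weighted_idf = idf * multiplier
```

Term 0 occurs most often across classes and therefore receives the smallest weight, and the multiplier scales each term's weight element-wise. A term that never occurs would have a frequency of zero and lead to a division by zero here, so this sketch assumes every column has at least one count.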

transform(self, X)

Transform a count-based matrix to c-TF-IDF

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | sparse matrix | A matrix of term/token counts. | required |

Returns:

| Type | Description |
| --- | --- |
| X (sparse matrix) | A c-TF-IDF matrix |

Source code in bertopic\_ctfidf.py
def transform(self, X: sp.csr_matrix):
    """Transform a count-based matrix to c-TF-IDF

    Arguments:
        X (sparse matrix): A matrix of term/token counts.

    Returns:
        X (sparse matrix): A c-TF-IDF matrix
    """
    if self.use_idf:
        X = normalize(X, axis=1, norm='l1', copy=False)
        X = X * self._idf_diag

    return X
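
The two lines inside `transform` amount to an L1 normalization of each row followed by a column-wise scaling. The dense NumPy sketch below illustrates that under stated assumptions: the counts and the `idf` vector are made-up values standing in for the result of a previous `fit`, and every row is assumed to contain at least one count.

```python
import numpy as np

# Toy class-by-term counts: 2 classes (rows), 3 terms (columns). Illustrative values.
X = np.array([[4.0, 1.0, 0.0],
              [4.0, 0.0, 3.0]])

# Assumed IDF weights, standing in for the vector learned by a previous fit.
idf = np.log(np.array([1.75, 7.0, 3.0]))

# L1 normalization: each row is divided by its total word count.
tf = X / X.sum(axis=1, keepdims=True)

# Scaling every column by its IDF weight; equivalent to multiplying
# by a diagonal matrix with idf on the diagonal.
ctfidf = tf * idf
```

In the second class, term 0 is the most frequent word, yet term 2 ends up with the highest score because term 0 is common to both classes and is weighted down accordingly.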