# Concept
Concept is a technique that leverages CLIP and BERTopic-based methods to perform Concept Modeling on images.

Since topics typically refer to conversations and text, they do not represent the content of images well. Therefore, these clusters of images are referred to as 'Concepts' instead of the traditional 'Topics'.

Concept Modeling thus takes inspiration from topic modeling techniques to cluster images, find common concepts, and model them both visually, using images, and textually, using topic representations.
Usage:

```python
from concept import ConceptModel

concept_model = ConceptModel()
concept_clusters = concept_model.fit_transform(images)
```
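Here, `images` is simply a list of paths to image files. A minimal sketch of collecting them, assuming the images live in a local `images/` folder (the folder name is purely illustrative):

```python
import glob

from concept import ConceptModel

# Collect paths to all jpg images in a local folder (illustrative path)
images = glob.glob("images/*.jpg")

concept_model = ConceptModel()
concept_clusters = concept_model.fit_transform(images)
```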
## __init__(self, min_concept_size=30, diversity=0.3, embedding_model='clip-ViT-B-32', vectorizer_model=None, umap_model=None, hdbscan_model=None, ctfidf=False)

Concept Model initialization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `min_concept_size` | `int` | The minimum size of concepts. Increasing this value will lead to a lower number of concept clusters. | `30` |
| `diversity` | `float` | How diverse the images within a concept are. Values between 0 and 1, with 0 being not diverse at all and 1 being most diverse. | `0.3` |
| `embedding_model` | `str` | The CLIP model to use. Current options include `clip-ViT-B-32` and `clip-ViT-B-32-multilingual-v1`. | `'clip-ViT-B-32'` |
| `vectorizer_model` | `CountVectorizer` | Pass in a CountVectorizer instead of the default. | `None` |
| `umap_model` | `UMAP` | Pass in a UMAP model to be used instead of the default. | `None` |
| `hdbscan_model` | `HDBSCAN` | Pass in a hdbscan.HDBSCAN model to be used instead of the default. | `None` |
| `ctfidf` | `bool` | Whether to use c-TF-IDF to create the textual concept representation. | `False` |
Source code in `concept/_model.py`:

```python
def __init__(self,
             min_concept_size: int = 30,
             diversity: float = 0.3,
             embedding_model: str = "clip-ViT-B-32",
             vectorizer_model: CountVectorizer = None,
             umap_model: UMAP = None,
             hdbscan_model: hdbscan.HDBSCAN = None,
             ctfidf: bool = False):
    """ Concept Model Initialization

    Arguments:
        min_concept_size: The minimum size of concepts. Increasing this value will lead
                          to a lower number of concept clusters.
        diversity: How diverse the images within a concept are.
                   Values between 0 and 1 with 0 being not diverse at all
                   and 1 being most diverse.
        embedding_model: The CLIP model to use. Current options include:
                            * clip-ViT-B-32
                            * clip-ViT-B-32-multilingual-v1
        vectorizer_model: Pass in a CountVectorizer instead of the default
        umap_model: Pass in a UMAP model to be used instead of the default
        hdbscan_model: Pass in a hdbscan.HDBSCAN model to be used instead of the default
        ctfidf: Whether to use c-TF-IDF to create the textual concept representation
    """
    self.diversity = diversity
    self.min_concept_size = min_concept_size

    # Embedding model
    self.embedding_model = SentenceTransformer(embedding_model)

    # Vectorizer
    self.vectorizer_model = vectorizer_model or CountVectorizer()

    # UMAP
    self.umap_model = umap_model or UMAP(n_neighbors=15,
                                         n_components=5,
                                         min_dist=0.0,
                                         metric='cosine')

    # HDBSCAN
    self.hdbscan_model = hdbscan_model or hdbscan.HDBSCAN(min_cluster_size=self.min_concept_size,
                                                          metric='euclidean',
                                                          cluster_selection_method='eom',
                                                          prediction_data=True)

    self.frequency = None
    self.topics = None
    self.cluster_embeddings = None
    self.ctfidf = ctfidf
```
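The UMAP, HDBSCAN, and CountVectorizer sub-models can each be replaced with a custom instance. A minimal sketch, assuming the umap-learn, hdbscan, and scikit-learn packages are installed (all hyperparameter values below are illustrative):

```python
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

from concept import ConceptModel

# Illustrative hyperparameters; tune them for your own dataset
umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric='cosine')
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=50,
                                metric='euclidean',
                                cluster_selection_method='eom',
                                prediction_data=True)  # needed for transform() on new images
vectorizer_model = CountVectorizer(stop_words="english")

concept_model = ConceptModel(umap_model=umap_model,
                             hdbscan_model=hdbscan_model,
                             vectorizer_model=vectorizer_model)
```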
## find_concepts(self, search_term)

Based on a search term, find the top 5 related concepts.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `search_term` | `str` | The search term to search for. | required |
Returns:

| Type | Description |
|---|---|
| `results` | The top 5 related concepts with their similarity scores. |

Usage:

```python
results = concept_model.find_concepts(search_term="dog")
```
Source code in `concept/_model.py`:

````python
def find_concepts(self, search_term: str) -> List[Tuple[int, float]]:
    """ Based on a search term, find the top 5 related concepts

    Arguments:
        search_term: The search term to search for

    Returns:
        results: The top 5 related concepts with their similarity scores

    Usage:

    ```python
    results = concept_model.find_concepts(search_term="dog")
    ```
    """
    embedding = self.embedding_model.encode(search_term)
    sim_matrix = cosine_similarity(embedding.reshape(1, -1), np.array(self.cluster_embeddings)[:, 0, :])
    related_concepts = np.argsort(sim_matrix)[0][::-1][:5]
    vals = list(np.sort(sim_matrix)[0][::-1][:5])
    results = [(concept, val) for concept, val in zip(related_concepts, vals)]
    return results
````
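The returned list pairs each related concept ID with its cosine-similarity score, so it can be unpacked directly. A small sketch:

```python
# Find the five concepts most similar to the search term and print their scores
results = concept_model.find_concepts(search_term="dog")
for concept_id, score in results:
    print(f"Concept {concept_id}: similarity {score:.3f}")
```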
## fit(self, images, image_names=None, image_embeddings=None)

Fit the model on a collection of images and return concepts.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `images` | `List[str]` | A list of paths to each image. | required |
| `image_names` | `List[str]` | The names of the images for easier reading of concept clusters. | `None` |
| `image_embeddings` | `ndarray` | Pre-trained image embeddings to use instead of generating them in Concept. | `None` |
Usage:

```python
from concept import ConceptModel

concept_model = ConceptModel()
concepts = concept_model.fit(images)
```
Source code in `concept/_model.py`:

````python
def fit(self,
        images: List[str],
        image_names: List[str] = None,
        image_embeddings: np.ndarray = None):
    """ Fit the model on a collection of images and return concepts

    Arguments:
        images: A list of paths to each image
        image_names: The names of the images for easier
                     reading of concept clusters
        image_embeddings: Pre-trained image embeddings to use
                          instead of generating them in Concept

    Usage:

    ```python
    from concept import ConceptModel

    concept_model = ConceptModel()
    concepts = concept_model.fit(images)
    ```
    """
    self.fit_transform(images, image_names=image_names, image_embeddings=image_embeddings)
    return self
````
## fit_transform(self, images, docs=None, image_names=None, image_embeddings=None)

Fit the model on a collection of images and return concepts.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `images` | `List[str]` | A list of paths to each image. | required |
| `docs` | `List[str]` | The documents from which to extract the textual concept representation. | `None` |
| `image_names` | `List[str]` | The names of the images for easier reading of concept clusters. | `None` |
| `image_embeddings` | `ndarray` | Pre-trained image embeddings to use instead of generating them in Concept. | `None` |
Returns:

| Type | Description |
|---|---|
| `predictions` | Concept prediction for each image. |

Usage:

```python
from concept import ConceptModel

concept_model = ConceptModel()
concepts = concept_model.fit_transform(images)
```
Source code in `concept/_model.py`:

````python
def fit_transform(self,
                  images: List[str],
                  docs: List[str] = None,
                  image_names: List[str] = None,
                  image_embeddings: np.ndarray = None) -> List[int]:
    """ Fit the model on a collection of images and return concepts

    Arguments:
        images: A list of paths to each image
        docs: The documents from which to extract textual concept representation
        image_names: The names of the images for easier
                     reading of concept clusters
        image_embeddings: Pre-trained image embeddings to use
                          instead of generating them in Concept

    Returns:
        predictions: Concept prediction for each image

    Usage:

    ```python
    from concept import ConceptModel

    concept_model = ConceptModel()
    concepts = concept_model.fit_transform(images)
    ```
    """
    # Calculate image embeddings if not already generated
    if image_embeddings is None:
        image_embeddings = self._embed_images(images)

    # Reduce dimensionality and cluster images into concepts
    reduced_embeddings = self._reduce_dimensionality(image_embeddings)
    predictions = self._cluster_embeddings(reduced_embeddings)

    # Extract representative images through exemplars
    representative_images = self._extract_exemplars(image_names)
    exemplar_embeddings = self._extract_cluster_embeddings(image_embeddings,
                                                           representative_images)
    selected_exemplars = self._extract_exemplar_subset(exemplar_embeddings,
                                                       representative_images)

    # Create collective representation of images
    self._cluster_representation(images, selected_exemplars)

    # Find the best words for each concept cluster
    if docs is not None:
        if self.ctfidf:
            self._extract_ctfidf_representation(docs, image_embeddings)
        else:
            self._extract_textual_representation(docs)

    return predictions
````
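Passing `docs` adds a textual representation to each concept cluster; with `ctfidf=True` the words are selected via c-TF-IDF instead of the default embedding-based matching. A minimal sketch, assuming a local folder of images and a small list of candidate words (both the paths and the word list are purely illustrative):

```python
import glob

from concept import ConceptModel

# Illustrative inputs: image paths plus candidate words for the textual representation
images = glob.glob("images/*.jpg")
docs = ["dog", "cat", "bicycle", "beach", "mountain", "skyscraper"]

concept_model = ConceptModel(ctfidf=True)
concepts = concept_model.fit_transform(images, docs=docs)

# After fitting with docs, the model's `topics` attribute holds the keywords
# that visualize_concepts() uses in its titles
print(concept_model.topics)
```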
## load(path) (classmethod)

Loads the model from the specified path.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The location and name of the ConceptModel file you want to load. | required |
Usage:

```python
ConceptModel.load("my_model")
```
Source code in `concept/_model.py`:

````python
@classmethod
def load(cls,
         path: str):
    """ Loads the model from the specified path

    Arguments:
        path: the location and name of the ConceptModel file you want to load

    Usage:

    ```python
    ConceptModel.load("my_model")
    ```
    """
    with open(path, 'rb') as file:
        concept_model = joblib.load(file)
        return concept_model
````
## save(self, path)

Saves the model to the specified path.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The location and name of the file you want to save. | required |
Usage:

```python
concept_model.save("my_model")
```
Source code in `concept/_model.py`:

````python
def save(self,
         path: str) -> None:
    """ Saves the model to the specified path

    Arguments:
        path: the location and name of the file you want to save

    Usage:

    ```python
    concept_model.save("my_model")
    ```
    """
    with open(path, 'wb') as file:
        joblib.dump(self, file)
````
## transform(self, images, image_embeddings=None)

After having fit a model, use transform to predict new instances.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `images` | `Union[List[str], str]` | A single image or a list of images to predict. | required |
| `image_embeddings` | `ndarray` | Pre-trained image embeddings. These can be used instead of the sentence-transformer model. | `None` |
Returns:

| Type | Description |
|---|---|
| `predictions` | Concept predictions for each image. |

Usage:

```python
concept_model = ConceptModel()
concepts = concept_model.fit(images)
new_concepts = concept_model.transform(new_images)
```
Source code in `concept/_model.py`:

````python
def transform(self,
              images: Union[List[str], str],
              image_embeddings: np.ndarray = None):
    """ After having fit a model, use transform to predict new instances

    Arguments:
        images: A single image or a list of images to predict
        image_embeddings: Pre-trained image embeddings. These can be used
                          instead of the sentence-transformer model.

    Returns:
        predictions: Concept predictions for each image

    Usage:

    ```python
    concept_model = ConceptModel()
    concepts = concept_model.fit(images)
    new_concepts = concept_model.transform(new_images)
    ```
    """
    if image_embeddings is None:
        if isinstance(images, str):
            images = [images]
        image_embeddings = self._embed_images(images)

    umap_embeddings = self.umap_model.transform(image_embeddings)
    predictions, _ = hdbscan.approximate_predict(self.hdbscan_model, umap_embeddings)
    return predictions
````
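Because transform only needs embeddings, new images can also be scored with embeddings computed elsewhere. A minimal sketch, assuming sentence-transformers is installed and that `new_images` is a list of image file paths (the paths are illustrative; the model name mirrors the default `embedding_model`):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# Encode new images with the same CLIP model used during fitting
clip_model = SentenceTransformer("clip-ViT-B-32")
new_images = ["imgs/new_photo_1.jpg", "imgs/new_photo_2.jpg"]  # hypothetical paths
embeddings = clip_model.encode([Image.open(path) for path in new_images],
                               show_progress_bar=True)

# Predict concepts directly from the pre-computed embeddings
new_concepts = concept_model.transform(new_images, image_embeddings=embeddings)
```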
## visualize_concepts(self, top_n=9, concepts=None, figsize=(20, 15))

Visualize concepts using merged exemplars.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `top_n` | `int` | The top_n concepts to visualize. | `9` |
| `concepts` | `List[int]` | The concept clusters to visualize. | `None` |
| `figsize` | `Tuple[int, int]` | The size of the figure. | `(20, 15)` |
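A minimal usage sketch in the style of the other methods; the returned object is a standard matplotlib Figure and the concept IDs below are illustrative:

Usage:

```python
# Plot the nine largest concept clusters
fig = concept_model.visualize_concepts()

# Or plot a specific set of concept clusters (IDs are illustrative)
fig = concept_model.visualize_concepts(concepts=[1, 5, 7])

# Save the resulting matplotlib figure
fig.savefig("concepts.png", dpi=150, bbox_inches="tight")
```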
Source code in `concept/_model.py`:

```python
def visualize_concepts(self,
                       top_n: int = 9,
                       concepts: List[int] = None,
                       figsize: Tuple[int, int] = (20, 15)):
    """ Visualize concepts using merged exemplars

    Arguments:
        top_n: The top_n concepts to visualize
        concepts: The concept clusters to visualize
        figsize: The size of the figure
    """
    if not concepts:
        concepts = [self.frequency.index[index] for index in range(top_n)]
        images = [self.cluster_images[index] for index in concepts]
    else:
        images = [self.cluster_images[index] for index in concepts]

    nr_columns = 3 if len(images) >= 3 else len(images)
    nr_rows = int(np.ceil(len(concepts) / nr_columns))

    fig, axs = plt.subplots(nr_rows, nr_columns, figsize=figsize)

    # visualize multiple concepts
    if len(images) > 1:
        axs = axs.flatten()
        for index, ax in enumerate(axs):
            if index < len(images):
                ax.imshow(images[index])
                if self.topics:
                    title = f"Concept {concepts[index]}: \n{self.topics[concepts[index]]}"
                else:
                    title = f"Concept {concepts[index]}"
                ax.set_title(title)
            ax.axis('off')

    # visualize a single concept
    else:
        axs.imshow(images[0])
        if self.topics:
            title = f"Concept {concepts[0]}: \n{self.topics[concepts[0]]}"
        else:
            title = f"Concept {concepts[0]}"
        axs.set_title(title)
        axs.axis('off')
    return fig
```