The topics that you create can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another. This can help you select an appropriate nr_topics when reducing the number of topics that you have created. To visualize this hierarchy, call topic_model.visualize_hierarchy() on a fitted model.

Note that this is not the actual procedure of .reduce_topics() when nr_topics is set to "auto", since HDBSCAN is then used to automatically extract topics. The visualization above closely resembles the actual procedure of .reduce_topics() when nr_topics is set to a specific number.
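
To make the idea of clustering topics with scipy.cluster.hierarchy more concrete, here is a minimal sketch on toy data. The topic vectors are invented for illustration, and the choice of cosine distance with average linkage is an assumption made for this example, not necessarily what BERTopic does internally.

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import pdist

# Toy topic-term weights (rows = topics, columns = terms); values are made up.
topic_vectors = np.array([
    [5.0, 4.0, 0.0, 0.0, 0.0, 0.0],  # topic 0
    [4.0, 5.0, 1.0, 0.0, 0.0, 0.0],  # topic 1, similar to topic 0
    [0.0, 0.0, 5.0, 4.0, 0.0, 0.0],  # topic 2
    [0.0, 1.0, 4.0, 5.0, 0.0, 0.0],  # topic 3, similar to topic 2
    [0.0, 0.0, 0.0, 0.0, 5.0, 4.0],  # topic 4, unrelated to the rest
])

# Condensed pairwise cosine distances between the topics.
distances = pdist(topic_vectors, metric="cosine")

# Agglomerative clustering; each row of Z describes one merge step.
Z = sch.linkage(distances, method="average")

# Cut the tree so that at most nr_topics groups remain.
nr_topics = 3
labels = sch.fcluster(Z, t=nr_topics, criterion="maxclust")
print(labels)
```

Topics that end up with the same label are the ones that would be merged first when reducing to nr_topics groups in this toy setting.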

Hierarchical labels

Although visualizing this hierarchy gives us information about the structure, it would be helpful to see what happens to the topic representations when topics are merged. To do so, we first train a basic BERTopic model and then calculate the representations of the hierarchical topics:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups documents without headers, footers, and quotes
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]

# Train a basic BERTopic model
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Calculate the topic representations at each level of the hierarchy
hierarchical_topics = topic_model.hierarchical_topics(docs)

To visualize these results, we simply pass the resulting hierarchical_topics to our .visualize_hierarchy function, for example topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics).


If you hover over the black circles, you will see the topic representation at that level of the hierarchy. These representations help you understand the effect of merging certain topics. Some might be logical to merge whilst others might not. Moreover, we can now see which sub-topics can be found within certain larger themes.
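
As a rough illustration of why a merged topic gets a new representation, the sketch below combines the word counts of two child topics and re-ranks the words. This is a deliberate simplification: it uses raw counts invented for this example, whereas BERTopic derives its representations from c-TF-IDF, and the helper top_words is hypothetical.

```python
from collections import Counter


def top_words(counts: Counter, n: int = 3) -> list[str]:
    """Return the n most frequent words, breaking ties alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical word counts for two child topics (values are made up).
topic_11 = Counter({"bike": 10, "ride": 8, "riding": 6, "lane": 4, "car": 3})
topic_19 = Counter({"bike": 9, "bikes": 7, "miles": 5, "honda": 4, "motorcycle": 4})

# Merging the topics pools their counts, which changes the ranking.
merged = topic_11 + topic_19

print(top_words(topic_11))
print(top_words(topic_19))
print(top_words(merged))
```

Words shared by both children rise to the top of the merged representation, while words specific to one child can drop out of it.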

Text-based topic tree

Although this gives a nice overview of the potential hierarchy, hovering over every black circle can be tiresome. Instead, we can use topic_model.get_topic_tree to create a text-based representation of this hierarchy. The general structure is more difficult to view this way, but it is easier to see which topics could be logically merged:

>>> tree = topic_model.get_topic_tree(hierarchical_topics)
>>> print(tree)
         ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
         └─■──br_god_exist_genetic_existence ── Topic: 124
     └─■──moral_morality_objective_immoral_morals ── Topic: 29

The full tree:

                  ├─■──ra_satan_thou_god_lucifer ── Topic: 94
                  └─■──jehovah_lord_mormon_mcconkie_unto ── Topic: 78
                           ├─■──jesus_tomb_disciples_resurrection_john ── Topic: 69
                           └─■──hell_eternal_god_jesus_heaven ── Topic: 53
                       └─■──aaron_baptism_sin_law_god ── Topic: 89
                   └─■──mary_sin_maria_priest_conception ── Topic: 56
          └─■──marriage_married_marry_ceremony_marriages ── Topic: 110
                       ├─■──kinsey_sex_gay_men_sexual ── Topic: 44
                            ├─■──gay_homosexual_homosexuals_sexual_cramer ── Topic: 50
                            └─■──homosexuality_homosexual_sin_paul_sex ── Topic: 27
                            ├─■──jim_context_challenges_articles_quote ── Topic: 36
                                 ├─■──islam_quran_islamic_rushdie_muslims ── Topic: 31
                                 └─■──judas_scripture_bible_books_greek ── Topic: 33
                                 ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
                                 └─■──br_god_exist_genetic_existence ── Topic: 124
                             └─■──moral_morality_objective_immoral_morals ── Topic: 29
                            ├─■──rights_right_slavery_slaves_residence ── Topic: 106
                                 ├─■──government_libertarians_libertarian_regulation_party ── Topic: 58
                                 └─■──tax_taxes_income_billion_deficit ── Topic: 41
                                 ├─■──blacks_penalty_death_cruel_punishment ── Topic: 55
                                 └─■──gun_guns_militia_firearms_amendment ── Topic: 7
                                  ├─■──israel_israeli_jews_arab_jewish ── Topic: 4
                                  └─■──armenian_armenians_turkish_armenia_azerbaijan ── Topic: 15
                         ├─■──serbs_muslims_stephanopoulos_mr_bosnia ── Topic: 35
                         └─■──myers_stephanopoulos_president_ms_mr ── Topic: 87
                ├─■──reno_workers_janet_clinton_waco ── Topic: 77
                         ├─■──batf_warrant_raid_compound_fbi ── Topic: 42
                         └─■──koresh_batf_fbi_children_compound ── Topic: 61
                     └─■──fbi_gas_tear_bds_building ── Topic: 23
                                  ├─■──pds_nubus_lc_slot_card ── Topic: 119
                                  └─■──simms_simm_vram_meg_dram ── Topic: 32
                                           ├─■──fan_cpu_heat_sink_fans ── Topic: 92
                                           └─■──mhz_speed_cpu_fpu_clock ── Topic: 22
                                       └─■──monitor_turn_power_computer_electricity ── Topic: 91
                                        ├─■──duo_battery_apple_230_problem ── Topic: 121
                                        └─■──battery_batteries_concrete_discharge_temperature ── Topic: 75
                                       ├─■──leds_uv_blue_light_boards ── Topic: 66
                                       └─■──wire_wiring_ground_neutral_outlets ── Topic: 120
                                        ├─■──dial_number_phone_line_output ── Topic: 93
                                        └─■──scope_scopes_motorola_generator_oscilloscope ── Topic: 113
                                    ├─■──antenna_antennas_receiver_cable_transmitter ── Topic: 70
                                    └─■──celp_dsp_sampling_speech_voice ── Topic: 52
                                   ├─■──symbol_error_undefined_doug_parse ── Topic: 63
                                   └─■──rx_remote_server_xdm_xterm ── Topic: 45
                                        ├─■──gc_mydisplay_draw_gxxor_drawing ── Topic: 103
                                        └─■──window_widget_application_expose_event ── Topic: 25
                                         ├─■──den_polygon_points_algorithm_polygons ── Topic: 28
                                         └─■──xv_24bit_image_bit_images ── Topic: 57
                                    ├─■──scanner_logitech_grayscale_ocr_scanman ── Topic: 108
                                         ├─■──printer_print_deskjet_hp_ink ── Topic: 18
                                         └─■──fonts_font_truetype_tt_atm ── Topic: 49
                                         ├─■──ghostscript_postscript_pageview_ghostview_dsc ── Topic: 104
                                              ├─■──location_mar_file_host_rwrr ── Topic: 83
                                              └─■──midi_sound_driver_blaster_soundblaster ── Topic: 98
                                     └─■──mouse_driver_mice_ball_problem ── Topic: 68
                                   ├─■──miles_car_amfm_toyota_cassette ── Topic: 62
                                   └─■──amp_speakers_condition_stereo_audio ── Topic: 24
                                        ├─■──size_shipping_sale_condition_mattress ── Topic: 100
                                        └─■──pom_cds_cd_sale_picture ── Topic: 37
                                    └─■──games_game_snes_sega_genesis ── Topic: 40
                                        ├─■──tape_backup_tapes_drive_4mm ── Topic: 107
                                        └─■──lens_camera_lenses_zoom_pouch ── Topic: 114
                                         ├─■──1st_hulk_comics_art_appears ── Topic: 105
                                         └─■──books_book_cover_trek_chemistry ── Topic: 125
                                     ├─■──hotel_voucher_package_vacation_room ── Topic: 74
                                     └─■──tickets_ticket_june_airlines_july ── Topic: 84
                                ├─■──espn_pt_pts_game_la ── Topic: 17
                                └─■──team_25_game_hockey_550 ── Topic: 2
                            └─■──year_game_hit_baseball_players ── Topic: 0
                               ├─■──insurance_health_private_care_canada ── Topic: 99
                               └─■──insurance_car_accident_rates_sue ── Topic: 82
                                    ├─■──radar_detector_detectors_ka_alarm ── Topic: 39
                                         ├─■──clutch_shift_shifting_transmission_gear ── Topic: 88
                                         └─■──car_cars_mustang_ford_v8 ── Topic: 14
                                         ├─■──odometer_sensor_speedo_gauge_mileage ── Topic: 96
                                         └─■──oil_drain_car_leaks_taillights ── Topic: 102
                                     └─■──diesel_diesels_emissions_fuel_oil ── Topic: 79
                                ├─■──bike_ride_riding_lane_car ── Topic: 11
                                └─■──bike_bikes_miles_honda_motorcycle ── Topic: 19
                            └─■──countersteering_bike_motorcycle_rear_shaft ── Topic: 46
                                        ├─■──greek_greece_turkish_greeks_cyprus ── Topic: 71
                                        └─■──kuwait_iraq_iran_gulf_arabia ── Topic: 76
                                             ├─■──clinton_bush_quayle_reagan_panicking ── Topic: 101
                                                  ├─■──cooper_trial_weaver_spence_witnesses ── Topic: 90
                                                  └─■──dog_dogs_bike_trained_springer ── Topic: 67
                                              ├─■──msg_food_chinese_foods_taste ── Topic: 30
                                              └─■──drugs_drug_marijuana_cocaine_alcohol ── Topic: 72
                                         ├─■──rocketry_rockets_engines_nuclear_plutonium ── Topic: 115
                                              ├─■──water_dept_phd_environmental_atmospheric ── Topic: 97
                                              └─■──cooling_water_steam_towers_plants ── Topic: 109
                                          ├─■──theory_universe_larsons_larson_science ── Topic: 54
                                          └─■──oort_cloud_grbs_gamma_burst ── Topic: 80
                                             ├─■──joke_maddi_nickname_nicknames_frank ── Topic: 43
                                             └─■──deleted_stuff_bookstore_joke_motto ── Topic: 81
                                         └─■──kirlian_photography_leaf_pictures_aura ── Topic: 85
                                          ├─■──helmet_liner_foam_cb_helmets ── Topic: 112
                                          └─■──mask_goalies_77_santore_tl ── Topic: 123
                                      ├─■──lock_cable_locks_bike_600 ── Topic: 117
                                           ├─■──wax_paint_plastic_scratches_solvent ── Topic: 65
                                           └─■──ear_wax_skin_greasy_acne ── Topic: 116
                                 ├─■──m4_mp_14_mw_mo ── Topic: 111
                                 └─■──test_ensign_nameless_deane_deanebinahccbrandeisedu ── Topic: 118
                             └─■──ites_cheek_hello_hi_ken ── Topic: 3
                   ├─■──cancer_centers_center_medical_research ── Topic: 122
                            ├─■──candida_yeast_infection_gonorrhea_infections ── Topic: 48
                                 ├─■──hiv_medical_cancer_patients_doctor ── Topic: 34
                                 └─■──pain_drug_patients_disease_diet ── Topic: 26
                        └─■──health_newsgroup_tobacco_vote_votes ── Topic: 9
                        ├─■──sky_advertising_billboard_billboards_space ── Topic: 59
                        └─■──space_station_moon_redesign_nasa ── Topic: 16
                             ├─■──space_launch_nasa_propulsion_astronaut ── Topic: 47
                             └─■──orbit_km_jupiter_probe_earth ── Topic: 86
                         └─■──hst_mission_shuttle_orbit_arrays ── Topic: 60
                  ├─■──key_clipper_encryption_chip_keys ── Topic: 1
                  └─■──entry_file_ripem_entries_key ── Topic: 73
                           ├─■──openwindows_motif_xview_windows_mouse ── Topic: 20
                           └─■──graphics_widget_ray_3d_available ── Topic: 95
                       └─■──3d_machines_version_comments_contact ── Topic: 38
                        ├─■──gopher_ftp_files_stuffit_images ── Topic: 51
                        └─■──jpeg_image_gif_format_images ── Topic: 13
                  ├─■──copy_protection_program_software_disk ── Topic: 64
                  └─■──db_windows_dos_mov_os2 ── Topic: 8
                          ├─■──drive_scsi_drives_ide_disk ── Topic: 6
                          └─■──meg_sale_ram_drive_shipping ── Topic: 12
                          ├─■──card_monitor_video_drivers_vga ── Topic: 5
                          └─■──modem_port_serial_irq_com ── Topic: 10
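
To show how a text tree like this can be produced from a set of merges, here is a minimal sketch. The data structures (leaf_names, merges), the parent ids 200 and 201, the parent names, and the render function are all hypothetical and chosen for illustration; this is not the implementation of get_topic_tree.

```python
# Leaf topics, keyed by topic id (names taken from the tree above).
leaf_names = {
    21: "atheists_atheism_god_atheist_argument",
    124: "br_god_exist_genetic_existence",
    29: "moral_morality_objective_immoral_morals",
}

# Hypothetical merges: parent id -> (parent name, left child id, right child id).
merges = {
    200: ("atheists_atheism_god_atheist_argument", 21, 124),
    201: ("atheists_atheism_god_moral_atheist", 200, 29),
}


def render(node, prefix="", is_last=True, is_root=True):
    """Return the lines of a text tree rooted at node."""
    connector = "" if is_root else ("└─" if is_last else "├─")
    if node not in merges:
        # Leaf: mark it with a square and append its topic id.
        return [f"{prefix}{connector}■──{leaf_names[node]} ── Topic: {node}"]
    name, left, right = merges[node]
    lines = [f"{prefix}{connector}{name}"]
    # Children of a non-last node keep a vertical bar in their prefix.
    child_prefix = prefix if is_root else prefix + ("     " if is_last else "│    ")
    lines += render(left, child_prefix, is_last=False, is_root=False)
    lines += render(right, child_prefix, is_last=True, is_root=False)
    return lines


tree_lines = render(201)
print("\n".join(tree_lines))
```

The recursion prints each merged topic before its two children, so deeper indentation corresponds to lower levels of the hierarchy.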

Visualize Hierarchical Documents

We can extend the previous method by calculating the topic representations at different levels of the hierarchy and plotting them on a 2D plane. To do so, we first need to calculate the hierarchical topics:

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Train BERTopic and extract hierarchical topics
topic_model = BERTopic().fit(docs, embeddings)
hierarchical_topics = topic_model.hierarchical_topics(docs)

Then, we can visualize the hierarchical documents either by supplying our original embeddings or by reducing their dimensionality ourselves first:

# Run the visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings)

# Reduce the dimensionality of the embeddings ourselves; this step is optional but makes repeated visualization much faster
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)


The visualization above was generated with the additional parameter hide_document_hover=True, which disables the option to hover over individual points and see the content of the documents. This makes the resulting visualization smaller, so that it fits more easily into RAM. However, it can be interesting to set hide_document_hover=False in order to hover over the points and see the content of the documents.
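
The notion of viewing the same documents at different levels of the hierarchy can be sketched as follows. The document labels, the merge list, its ids, and the function labels_per_level are hypothetical and only illustrate how document assignments coarsen as topics are merged; they do not reflect how visualize_hierarchical_documents is implemented.

```python
# Leaf topic assigned to each document (hypothetical assignments).
doc_topics = [21, 124, 29, 21, 7, 7]

# Merges ordered from the bottom of the hierarchy upward:
# (new parent id, left child id, right child id); the parent ids are made up.
merges = [(200, 21, 124), (201, 200, 29)]


def labels_per_level(doc_topics, merges):
    """Return one list of document labels per hierarchy level, finest first."""
    current = list(doc_topics)
    levels = [current]
    for parent, left, right in merges:
        # Documents in either child topic now belong to the parent topic.
        current = [parent if t in (left, right) else t for t in current]
        levels.append(current)
    return levels


levels = labels_per_level(doc_topics, merges)
for depth, level_labels in enumerate(levels):
    print(depth, level_labels)
```

Each level could then be plotted with its own coloring of the same 2D points, which is the intuition behind stepping through hierarchy levels in the visualization.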