Visualization
Visualizing BERTopic and its derivatives is important in understanding the model, how it works, and more importantly, where it works. Since topic modeling can be quite a subjective field, it is difficult for users to validate their models. Looking at the topics and seeing if they make sense is an important factor in alleviating this issue.
Visualize Topics
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis.
We embed our c-TF-IDF representation of the topics in 2D using UMAP and then visualize the two dimensions using Plotly so that we get an interactive view.
First, we need to train our model:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
Then, we can call .visualize_topics to create a 2D representation of your topics. The resulting graph is an interactive Plotly graph which can be converted to HTML:
topic_model.visualize_topics()
You can use the slider to select the topic which then lights up red. If you hover over a topic, then general information is given about the topic, including the size of the topic and its corresponding words.
Visualize Documents
Using the previous method, we can visualize the topics and get insight into their relationships. However, you might want a more fine-grained approach where we can visualize the documents inside the topics to see if they were assigned correctly or whether they make sense. To do so, we can use the topic_model.visualize_documents() function. This function recalculates the document embeddings and reduces them to 2-dimensional space for easier visualization. This process can be quite expensive, so it is advised to adhere to the following pipeline:
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)
# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings)
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
Note
The visualization above was generated with the additional parameter hide_document_hover=True, which disables the option to hover over the individual points and see the content of the documents. This was done for demonstration purposes, as saving all those documents in the visualization can be quite expensive and result in large files. However, it might be interesting to set hide_document_hover=False in order to hover over the points and see the content of the documents.
Custom Hover
When you visualize the documents, you might not always want to see the complete document on hover. Many documents have shorter information that might be more interesting to visualize, such as a title. To create the hover based on a document's title instead of its content, you can simply pass a variable (titles) containing the title for each document:
topic_model.visualize_documents(titles, reduced_embeddings=reduced_embeddings)
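If your documents do not come with titles, one option is to derive short hover labels from the text itself. The sketch below is only an illustration: make_hover_labels is a hypothetical helper that is not part of BERTopic, and the 60-character width is an arbitrary assumption.

```python
import textwrap


def make_hover_labels(docs, width=60):
    """Build one short, single-line label per document (hypothetical helper)."""
    labels = []
    for doc in docs:
        # shorten() collapses runs of whitespace and truncates on word boundaries.
        label = textwrap.shorten(doc, width=width, placeholder="...")
        # Fall back to a fixed marker so that no hover label is empty.
        labels.append(label or "(empty document)")
    return labels


# Toy documents, used here only to show the behaviour of the helper.
sample_docs = [
    "Short post.",
    "A much longer post   that rambles on\nacross several lines and would "
    "clutter the hover box if shown in full.",
    "",
]
titles = make_hover_labels(sample_docs)
for title in titles:
    print(title)
```

The resulting list has one entry per document, in the same order as the input, so it could be passed in place of titles in the call above.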
Visualize Topic Hierarchy
The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another. This might help to select an appropriate nr_topics when reducing the number of topics that you have created. To visualize this hierarchy, run the following:
topic_model.visualize_hierarchy()
Note
Do note that this is not the actual procedure of .reduce_topics() when nr_topics is set to "auto", since HDBSCAN is then used to automatically extract topics. The visualization above closely resembles the actual procedure of .reduce_topics() when a specific number is selected for nr_topics.
Hierarchical labels
Although visualizing this hierarchy gives us information about the structure, it would be helpful to see what happens to the topic representations when merging topics. To do so, we first need to calculate the representations of the hierarchical topics. We start by training a basic BERTopic model and then call .hierarchical_topics on it:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)
hierarchical_topics = topic_model.hierarchical_topics(docs)
To visualize these results, we simply need to pass the resulting hierarchical_topics to our .visualize_hierarchy function:
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
If you hover over the black circles, you will see the topic representation at that level of the hierarchy. These representations help you understand the effect of merging certain topics. Some might be logical to merge whilst others might not. Moreover, we can now see which sub-topics can be found within certain larger themes.
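To make the idea of sub-topics within a larger theme concrete, the sketch below walks a small table of merges and lists the original topics that end up under a given parent. It is an illustration under assumptions: the plain dictionaries and their keys (parent_id, left_id, right_id) are hypothetical stand-ins for a table of parent and child IDs, and the parent IDs 200 and 201 are invented for the example.

```python
def leaves_under(merges, node_id):
    """Return the sorted original topic IDs that were merged into node_id."""
    # Map every merged (parent) node to its two children.
    children = {m["parent_id"]: (m["left_id"], m["right_id"]) for m in merges}
    stack, leaves = [node_id], []
    while stack:
        current = stack.pop()
        if current in children:
            # A merged node: keep descending into both children.
            stack.extend(children[current])
        else:
            # Not a parent of anything, so it is an original topic.
            leaves.append(current)
    return sorted(leaves)


# Hypothetical merge table: topics 21 and 124 merge into node 201,
# which then merges with topic 29 into node 200.
merges = [
    {"parent_id": 201, "left_id": 21, "right_id": 124},
    {"parent_id": 200, "left_id": 201, "right_id": 29},
]

print(leaves_under(merges, 200))  # [21, 29, 124]
print(leaves_under(merges, 201))  # [21, 124]
```

An iterative walk with an explicit stack is used so that deep hierarchies do not run into Python's recursion limit.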
Text-based topic tree
Although this gives a nice overview of the potential hierarchy, hovering over all black circles can be tiresome. Instead, we can use topic_model.get_topic_tree to create a text-based representation of this hierarchy. Although the general structure is more difficult to view, we can see better which topics could be logically merged:
>>> tree = topic_model.get_topic_tree(hierarchical_topics)
>>> print(tree)
.
└─atheists_atheism_god_moral_atheist
     ├─atheists_atheism_god_atheist_argument
     │    ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
     │    └─■──br_god_exist_genetic_existence ── Topic: 124
     └─■──moral_morality_objective_immoral_morals ── Topic: 29
The full tree is shown below.
.
ββpeople_armenian_said_god_armenians
β ββgod_jesus_jehovah_lord_christ
β β ββgod_jesus_jehovah_lord_christ
β β β ββjehovah_lord_mormon_mcconkie_god
β β β β βββ ββra_satan_thou_god_lucifer ββ Topic: 94
β β β β βββ ββjehovah_lord_mormon_mcconkie_unto ββ Topic: 78
β β β ββjesus_mary_god_hell_sin
β β β ββjesus_hell_god_eternal_heaven
β β β β ββhell_jesus_eternal_god_heaven
β β β β β βββ ββjesus_tomb_disciples_resurrection_john ββ Topic: 69
β β β β β βββ ββhell_eternal_god_jesus_heaven ββ Topic: 53
β β β β βββ ββaaron_baptism_sin_law_god ββ Topic: 89
β β β βββ ββmary_sin_maria_priest_conception ββ Topic: 56
β β βββ ββmarriage_married_marry_ceremony_marriages ββ Topic: 110
β ββpeople_armenian_armenians_said_mr
β ββpeople_armenian_armenians_said_israel
β β ββgod_homosexual_homosexuality_atheists_sex
β β β ββhomosexual_homosexuality_sex_gay_homosexuals
β β β β βββ ββkinsey_sex_gay_men_sexual ββ Topic: 44
β β β β ββhomosexuality_homosexual_sin_homosexuals_gay
β β β β βββ ββgay_homosexual_homosexuals_sexual_cramer ββ Topic: 50
β β β β βββ ββhomosexuality_homosexual_sin_paul_sex ββ Topic: 27
β β β ββgod_atheists_atheism_moral_atheist
β β β ββislam_quran_judas_islamic_book
β β β β βββ ββjim_context_challenges_articles_quote ββ Topic: 36
β β β β ββislam_quran_judas_islamic_book
β β β β βββ ββislam_quran_islamic_rushdie_muslims ββ Topic: 31
β β β β βββ ββjudas_scripture_bible_books_greek ββ Topic: 33
β β β ββatheists_atheism_god_moral_atheist
β β β ββatheists_atheism_god_atheist_argument
β β β β βββ ββatheists_atheism_god_atheist_argument ββ Topic: 21
β β β β βββ ββbr_god_exist_genetic_existence ββ Topic: 124
β β β βββ ββmoral_morality_objective_immoral_morals ββ Topic: 29
β β ββarmenian_armenians_people_israel_said
β β ββarmenian_armenians_israel_people_jews
β β β ββtax_rights_government_income_taxes
β β β β βββ ββrights_right_slavery_slaves_residence ββ Topic: 106
β β β β ββtax_government_taxes_income_libertarians
β β β β βββ ββgovernment_libertarians_libertarian_regulation_party ββ Topic: 58
β β β β βββ ββtax_taxes_income_billion_deficit ββ Topic: 41
β β β ββarmenian_armenians_israel_people_jews
β β β ββgun_guns_militia_firearms_amendment
β β β β βββ ββblacks_penalty_death_cruel_punishment ββ Topic: 55
β β β β βββ ββgun_guns_militia_firearms_amendment ββ Topic: 7
β β β ββarmenian_armenians_israel_jews_turkish
β β β βββ ββisrael_israeli_jews_arab_jewish ββ Topic: 4
β β β βββ ββarmenian_armenians_turkish_armenia_azerbaijan ββ Topic: 15
β β ββstephanopoulos_president_mr_myers_ms
β β βββ ββserbs_muslims_stephanopoulos_mr_bosnia ββ Topic: 35
β β βββ ββmyers_stephanopoulos_president_ms_mr ββ Topic: 87
β ββbatf_fbi_koresh_compound_gas
β βββ ββreno_workers_janet_clinton_waco ββ Topic: 77
β ββbatf_fbi_koresh_gas_compound
β ββbatf_koresh_fbi_warrant_compound
β β βββ ββbatf_warrant_raid_compound_fbi ββ Topic: 42
β β βββ ββkoresh_batf_fbi_children_compound ββ Topic: 61
β βββ ββfbi_gas_tear_bds_building ββ Topic: 23
ββuse_like_just_dont_new
ββgame_team_year_games_like
β ββgame_team_games_25_year
β β ββgame_team_games_25_season
β β β ββwindow_printer_use_problem_mhz
β β β β ββmhz_wire_simms_wiring_battery
β β β β β ββsimms_mhz_battery_cpu_heat
β β β β β β ββsimms_pds_simm_vram_lc
β β β β β β β βββ ββpds_nubus_lc_slot_card ββ Topic: 119
β β β β β β β βββ ββsimms_simm_vram_meg_dram ββ Topic: 32
β β β β β β ββmhz_battery_cpu_heat_speed
β β β β β β ββmhz_cpu_speed_heat_fan
β β β β β β β ββmhz_cpu_speed_heat_fan
β β β β β β β β βββ ββfan_cpu_heat_sink_fans ββ Topic: 92
β β β β β β β β βββ ββmhz_speed_cpu_fpu_clock ββ Topic: 22
β β β β β β β βββ ββmonitor_turn_power_computer_electricity ββ Topic: 91
β β β β β β ββbattery_batteries_concrete_duo_discharge
β β β β β β βββ ββduo_battery_apple_230_problem ββ Topic: 121
β β β β β β βββ ββbattery_batteries_concrete_discharge_temperature ββ Topic: 75
β β β β β ββwire_wiring_ground_neutral_outlets
β β β β β ββwire_wiring_ground_neutral_outlets
β β β β β β ββwire_wiring_ground_neutral_outlets
β β β β β β β βββ ββleds_uv_blue_light_boards ββ Topic: 66
β β β β β β β βββ ββwire_wiring_ground_neutral_outlets ββ Topic: 120
β β β β β β ββscope_scopes_phone_dial_number
β β β β β β βββ ββdial_number_phone_line_output ββ Topic: 93
β β β β β β βββ ββscope_scopes_motorola_generator_oscilloscope ββ Topic: 113
β β β β β ββcelp_dsp_sampling_antenna_digital
β β β β β βββ ββantenna_antennas_receiver_cable_transmitter ββ Topic: 70
β β β β β βββ ββcelp_dsp_sampling_speech_voice ββ Topic: 52
β β β β ββwindow_printer_xv_mouse_windows
β β β β ββwindow_xv_error_widget_problem
β β β β β ββerror_symbol_undefined_xterm_rx
β β β β β β βββ ββsymbol_error_undefined_doug_parse ββ Topic: 63
β β β β β β βββ ββrx_remote_server_xdm_xterm ββ Topic: 45
β β β β β ββwindow_xv_widget_application_expose
β β β β β ββwindow_widget_expose_application_event
β β β β β β βββ ββgc_mydisplay_draw_gxxor_drawing ββ Topic: 103
β β β β β β βββ ββwindow_widget_application_expose_event ββ Topic: 25
β β β β β ββxv_den_polygon_points_algorithm
β β β β β βββ ββden_polygon_points_algorithm_polygons ββ Topic: 28
β β β β β βββ ββxv_24bit_image_bit_images ββ Topic: 57
β β β β ββprinter_fonts_print_mouse_postscript
β β β β ββprinter_fonts_print_font_deskjet
β β β β β βββ ββscanner_logitech_grayscale_ocr_scanman ββ Topic: 108
β β β β β ββprinter_fonts_print_font_deskjet
β β β β β βββ ββprinter_print_deskjet_hp_ink ββ Topic: 18
β β β β β βββ ββfonts_font_truetype_tt_atm ββ Topic: 49
β β β β ββmouse_ghostscript_midi_driver_postscript
β β β β ββghostscript_midi_postscript_files_file
β β β β β βββ ββghostscript_postscript_pageview_ghostview_dsc ββ Topic: 104
β β β β β ββmidi_sound_file_windows_driver
β β β β β βββ ββlocation_mar_file_host_rwrr ββ Topic: 83
β β β β β βββ ββmidi_sound_driver_blaster_soundblaster ββ Topic: 98
β β β β βββ ββmouse_driver_mice_ball_problem ββ Topic: 68
β β β ββgame_team_games_25_season
β β β ββ1st_sale_condition_comics_hulk
β β β β ββsale_condition_offer_asking_cd
β β β β β ββcondition_stereo_amp_speakers_asking
β β β β β β βββ ββmiles_car_amfm_toyota_cassette ββ Topic: 62
β β β β β β βββ ββamp_speakers_condition_stereo_audio ββ Topic: 24
β β β β β ββgames_sale_pom_cds_shipping
β β β β β ββpom_cds_sale_shipping_cd
β β β β β β βββ ββsize_shipping_sale_condition_mattress ββ Topic: 100
β β β β β β βββ ββpom_cds_cd_sale_picture ββ Topic: 37
β β β β β βββ ββgames_game_snes_sega_genesis ββ Topic: 40
β β β β ββ1st_hulk_comics_art_appears
β β β β ββ1st_hulk_comics_art_appears
β β β β β ββlens_tape_camera_backup_lenses
β β β β β β βββ ββtape_backup_tapes_drive_4mm ββ Topic: 107
β β β β β β βββ ββlens_camera_lenses_zoom_pouch ββ Topic: 114
β β β β β ββ1st_hulk_comics_art_appears
β β β β β βββ ββ1st_hulk_comics_art_appears ββ Topic: 105
β β β β β βββ ββbooks_book_cover_trek_chemistry ββ Topic: 125
β β β β ββtickets_hotel_ticket_voucher_package
β β β β βββ ββhotel_voucher_package_vacation_room ββ Topic: 74
β β β β βββ ββtickets_ticket_june_airlines_july ββ Topic: 84
β β β ββgame_team_games_season_hockey
β β β ββgame_hockey_team_25_550
β β β β βββ ββespn_pt_pts_game_la ββ Topic: 17
β β β β βββ ββteam_25_game_hockey_550 ββ Topic: 2
β β β βββ ββyear_game_hit_baseball_players ββ Topic: 0
β β ββbike_car_greek_insurance_msg
β β ββcar_bike_insurance_cars_engine
β β β ββcar_insurance_cars_radar_engine
β β β β ββinsurance_health_private_care_canada
β β β β β βββ ββinsurance_health_private_care_canada ββ Topic: 99
β β β β β βββ ββinsurance_car_accident_rates_sue ββ Topic: 82
β β β β ββcar_cars_radar_engine_detector
β β β β ββcar_radar_cars_detector_engine
β β β β β βββ ββradar_detector_detectors_ka_alarm ββ Topic: 39
β β β β β ββcar_cars_mustang_ford_engine
β β β β β βββ ββclutch_shift_shifting_transmission_gear ββ Topic: 88
β β β β β βββ ββcar_cars_mustang_ford_v8 ββ Topic: 14
β β β β ββoil_diesel_odometer_diesels_car
β β β β ββodometer_oil_sensor_car_drain
β β β β β βββ ββodometer_sensor_speedo_gauge_mileage ββ Topic: 96
β β β β β βββ ββoil_drain_car_leaks_taillights ββ Topic: 102
β β β β βββ ββdiesel_diesels_emissions_fuel_oil ββ Topic: 79
β β β ββbike_riding_ride_bikes_motorcycle
β β β ββbike_ride_riding_bikes_lane
β β β β βββ ββbike_ride_riding_lane_car ββ Topic: 11
β β β β βββ ββbike_bikes_miles_honda_motorcycle ββ Topic: 19
β β β βββ ββcountersteering_bike_motorcycle_rear_shaft ββ Topic: 46
β β ββgreek_msg_kuwait_greece_water
β β ββgreek_msg_kuwait_greece_water
β β β ββgreek_msg_kuwait_greece_dog
β β β β ββgreek_msg_kuwait_greece_dog
β β β β β ββgreek_kuwait_greece_turkish_greeks
β β β β β β βββ ββgreek_greece_turkish_greeks_cyprus ββ Topic: 71
β β β β β β βββ ββkuwait_iraq_iran_gulf_arabia ββ Topic: 76
β β β β β ββmsg_dog_drugs_drug_food
β β β β β ββdog_dogs_cooper_trial_weaver
β β β β β β βββ ββclinton_bush_quayle_reagan_panicking ββ Topic: 101
β β β β β β ββdog_dogs_cooper_trial_weaver
β β β β β β βββ ββcooper_trial_weaver_spence_witnesses ββ Topic: 90
β β β β β β βββ ββdog_dogs_bike_trained_springer ββ Topic: 67
β β β β β ββmsg_drugs_drug_food_chinese
β β β β β βββ ββmsg_food_chinese_foods_taste ββ Topic: 30
β β β β β βββ ββdrugs_drug_marijuana_cocaine_alcohol ββ Topic: 72
β β β β ββwater_theory_universe_science_larsons
β β β β ββwater_nuclear_cooling_steam_dept
β β β β β βββ ββrocketry_rockets_engines_nuclear_plutonium ββ Topic: 115
β β β β β ββwater_cooling_steam_dept_plants
β β β β β βββ ββwater_dept_phd_environmental_atmospheric ββ Topic: 97
β β β β β βββ ββcooling_water_steam_towers_plants ββ Topic: 109
β β β β ββtheory_universe_larsons_larson_science
β β β β βββ ββtheory_universe_larsons_larson_science ββ Topic: 54
β β β β βββ ββoort_cloud_grbs_gamma_burst ββ Topic: 80
β β β ββhelmet_kirlian_photography_lock_wax
β β β ββhelmet_kirlian_photography_leaf_mask
β β β β ββkirlian_photography_leaf_pictures_deleted
β β β β β ββdeleted_joke_stuff_maddi_nickname
β β β β β β βββ ββjoke_maddi_nickname_nicknames_frank ββ Topic: 43
β β β β β β βββ ββdeleted_stuff_bookstore_joke_motto ββ Topic: 81
β β β β β βββ ββkirlian_photography_leaf_pictures_aura ββ Topic: 85
β β β β ββhelmet_mask_liner_foam_cb
β β β β βββ ββhelmet_liner_foam_cb_helmets ββ Topic: 112
β β β β βββ ββmask_goalies_77_santore_tl ββ Topic: 123
β β β ββlock_wax_paint_plastic_ear
β β β βββ ββlock_cable_locks_bike_600 ββ Topic: 117
β β β ββwax_paint_ear_plastic_skin
β β β βββ ββwax_paint_plastic_scratches_solvent ββ Topic: 65
β β β βββ ββear_wax_skin_greasy_acne ββ Topic: 116
β β ββm4_mp_14_mw_mo
β β ββm4_mp_14_mw_mo
β β β βββ ββm4_mp_14_mw_mo ββ Topic: 111
β β β βββ ββtest_ensign_nameless_deane_deanebinahccbrandeisedu ββ Topic: 118
β β βββ ββites_cheek_hello_hi_ken ββ Topic: 3
β ββspace_medical_health_disease_cancer
β ββmedical_health_disease_cancer_patients
β β βββ ββcancer_centers_center_medical_research ββ Topic: 122
β β ββhealth_medical_disease_patients_hiv
β β ββpatients_medical_disease_candida_health
β β β βββ ββcandida_yeast_infection_gonorrhea_infections ββ Topic: 48
β β β ββpatients_disease_cancer_medical_doctor
β β β βββ ββhiv_medical_cancer_patients_doctor ββ Topic: 34
β β β βββ ββpain_drug_patients_disease_diet ββ Topic: 26
β β βββ ββhealth_newsgroup_tobacco_vote_votes ββ Topic: 9
β ββspace_launch_nasa_shuttle_orbit
β ββspace_moon_station_nasa_launch
β β βββ ββsky_advertising_billboard_billboards_space ββ Topic: 59
β β βββ ββspace_station_moon_redesign_nasa ββ Topic: 16
β ββspace_mission_hst_launch_orbit
β ββspace_launch_nasa_orbit_propulsion
β β βββ ββspace_launch_nasa_propulsion_astronaut ββ Topic: 47
β β βββ ββorbit_km_jupiter_probe_earth ββ Topic: 86
β βββ ββhst_mission_shuttle_orbit_arrays ββ Topic: 60
ββdrive_file_key_windows_use
ββkey_file_jpeg_encryption_image
β ββkey_encryption_clipper_chip_keys
β β βββ ββkey_clipper_encryption_chip_keys ββ Topic: 1
β β βββ ββentry_file_ripem_entries_key ββ Topic: 73
β ββjpeg_image_file_gif_images
β ββmotif_graphics_ftp_available_3d
β β ββmotif_graphics_openwindows_ftp_available
β β β βββ ββopenwindows_motif_xview_windows_mouse ββ Topic: 20
β β β βββ ββgraphics_widget_ray_3d_available ββ Topic: 95
β β βββ ββ3d_machines_version_comments_contact ββ Topic: 38
β ββjpeg_image_gif_images_format
β βββ ββgopher_ftp_files_stuffit_images ββ Topic: 51
β βββ ββjpeg_image_gif_format_images ββ Topic: 13
ββdrive_db_card_scsi_windows
ββdb_windows_dos_mov_os2
β βββ ββcopy_protection_program_software_disk ββ Topic: 64
β βββ ββdb_windows_dos_mov_os2 ββ Topic: 8
ββdrive_card_scsi_drives_ide
ββdrive_scsi_drives_ide_disk
β βββ ββdrive_scsi_drives_ide_disk ββ Topic: 6
β βββ ββmeg_sale_ram_drive_shipping ββ Topic: 12
ββcard_modem_monitor_video_drivers
βββ ββcard_monitor_video_drivers_vga ββ Topic: 5
βββ ββmodem_port_serial_irq_com ββ Topic: 10
Visualize Hierarchical Documents
We can extend the previous method by calculating the topic representation at different levels of the hierarchy and plotting them on a 2D plane. To do so, we first need to calculate the hierarchical topics:
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
# Train BERTopic and extract hierarchical topics
topic_model = BERTopic().fit(docs, embeddings)
hierarchical_topics = topic_model.hierarchical_topics(docs)
# Run the visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings)
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)
Note
The visualization above was generated with the additional parameter hide_document_hover=True, which disables the option to hover over the individual points and see the content of the documents. This makes the resulting visualization smaller so that it fits into your RAM. However, it might be interesting to set hide_document_hover=False to hover over the points and see the content of the documents.
Visualize Terms
We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other. To visualize these bar charts, run the following:
topic_model.visualize_barchart()
Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by simply computing the cosine similarity between those topic embeddings. The result is a matrix indicating how similar certain topics are to each other. To visualize the heatmap, run the following:
topic_model.visualize_heatmap()
Note
You can set n_clusters in visualize_heatmap to order the topics by their similarity. This will result in blocks being formed in the heatmap indicating which clusters of topics are similar to each other. This step is very much recommended as it will make reading the heatmap easier.
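The similarity matrix described in this section can be illustrated in a few lines of plain Python. This is a minimal sketch rather than BERTopic's own implementation: cosine_similarity_matrix is a hypothetical helper, and the three toy vectors stand in for real topic embeddings.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarities between all vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    matrix = []
    for i, a in enumerate(vectors):
        row = []
        for j, b in enumerate(vectors):
            dot = sum(x * y for x, y in zip(a, b))
            denom = norms[i] * norms[j]
            # Guard against zero vectors, which have no defined direction.
            row.append(dot / denom if denom else 0.0)
        matrix.append(row)
    return matrix


# Toy topic embeddings (assumed values, for illustration only).
topic_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 2.0],
]
sim = cosine_similarity_matrix(topic_embeddings)
for row in sim:
    print([round(value, 2) for value in row])
```

Each topic has a similarity of 1 with itself, and the matrix is symmetric, which is why the heatmap mirrors along its diagonal.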
Visualize Term Score Decline
Topics are represented by a number of words, starting with the most representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and would not be beneficial for its representation.
To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the word with the highest c-TF-IDF score has a rank of 1, is put on the x-axis, whereas the y-axis shows the c-TF-IDF scores. The result is a visualization that shows the decline of the c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method, to select the best number of words in a topic.
To visualize the c-TF-IDF score decline, run the following:
topic_model.visualize_term_rank()
To enable the log scale on the y-axis for a better view of individual topics, run the following:
topic_model.visualize_term_rank(log_scale=True)
This visualization was heavily inspired by the "Term Probability Decline" visualization found in an analysis by the amazing tmtoolkit. Reference to that specific analysis can be found here.
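The elbow reading described in this section can also be approximated numerically. The sketch below uses one common heuristic, which is an assumption on our part and not something BERTopic provides: draw a straight line from the first to the last score and pick the rank whose score lies furthest below that line. Because the line is straight, the vertical gap and the perpendicular distance peak at the same rank. The function name and the scores are hypothetical.

```python
def elbow_rank(scores):
    """Return the 1-based term rank furthest below the first-to-last chord."""
    n = len(scores)
    if n < 3:
        # Too few points to form an elbow; keep every word.
        return n
    first, last = scores[0], scores[-1]
    best_rank, best_gap = 1, 0.0
    for i, score in enumerate(scores):
        # Height of the straight line between the first and last score at rank i.
        chord = first + (last - first) * i / (n - 1)
        gap = chord - score
        if gap > best_gap:
            best_rank, best_gap = i + 1, gap
    return best_rank


# Toy c-TF-IDF scores for one topic, sorted from highest to lowest.
scores = [0.10, 0.05, 0.03, 0.025, 0.022, 0.020]
print(elbow_rank(scores))  # 3
```

For these toy scores the decline flattens after the third word, so the heuristic suggests keeping three words for this topic.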
Visualize Topics over Time
After creating topics over time with Dynamic Topic Modeling, we can visualize these topics by leveraging the interactive abilities of Plotly. Plotly allows us to show the frequency of topics over time whilst giving the option of hovering over the points to show the time-specific topic representations. Simply call .visualize_topics_over_time with the newly created topics over time:
import re
import pandas as pd
from bertopic import BERTopic
# Prepare data
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
# Create topics over time
model = BERTopic(verbose=True)
topics, probs = model.fit_transform(tweets)
topics_over_time = model.topics_over_time(tweets, timestamps)
Then, we visualize some interesting topics:
model.visualize_topics_over_time(topics_over_time, topics=[9, 10, 72, 83, 87, 91])
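The frequencies shown in such a plot are, at their core, counts of documents per topic and time period. The sketch below illustrates that counting step only. It is not how BERTopic computes its output: the helper name, the yearly grouping, the timestamp format, and the choice to skip the outlier topic -1 are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def topic_frequency_per_year(topics, timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count how many documents each topic has per calendar year."""
    counts = Counter()
    for topic, stamp in zip(topics, timestamps):
        if topic == -1:
            # -1 is the outlier topic; skipped here for simplicity (assumption).
            continue
        year = datetime.strptime(stamp, fmt).year
        counts[(year, topic)] += 1
    return counts


# Toy topic assignments and timestamps, one pair per document.
toy_topics = [9, 9, 10, -1, 9]
toy_timestamps = [
    "2016-03-01 12:00:00",
    "2016-07-15 08:30:00",
    "2016-11-02 21:45:00",
    "2017-01-20 17:00:00",
    "2017-05-05 09:10:00",
]
frequency = topic_frequency_per_year(toy_topics, toy_timestamps)
for (year, topic), count in sorted(frequency.items()):
    print(year, topic, count)
```

Grouping by a coarser period such as a year keeps the number of points on the time axis small, which tends to make the resulting lines easier to read.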
Visualize Topics per Class
You might want to extract and visualize the topic representation per class. For example, if you have specific groups of users that might approach topics differently, then extracting them would help you understand how these users talk about certain topics. In other words, this is simply creating a topic representation for certain classes that you might have in your data.
First, we need to train our model:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Prepare data and classes
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = data["data"]
classes = [data["target_names"][i] for i in data["target"]]
# Create topic model and calculate topics per class
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
Then, we visualize the topic representation of major topics per class:
topic_model.visualize_topics_per_class(topics_per_class)
Visualize Probabilities or Distribution
We can generate the topic-document probability matrix by simply setting calculate_probabilities=True if an HDBSCAN model is used:
from bertopic import BERTopic
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)
The resulting probs variable contains the soft-clustering as done through HDBSCAN.
If a non-HDBSCAN model is used, we can estimate the topic distributions after training our model:
from bertopic import BERTopic
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)
topic_distr, _ = topic_model.approximate_distribution(docs, min_similarity=0)
Then, we either pass the probs or topic_distr variable to .visualize_distribution to visualize either the probability distributions or the topic distributions:
# To visualize the probabilities of topic assignment
topic_model.visualize_distribution(probs[0])
# To visualize the topic distributions in a document
topic_model.visualize_distribution(topic_distr[0])
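Before plotting, it can help to look at the strongest entries of a single document's distribution directly. The sketch below is a plain-Python illustration under assumptions: top_topics is a hypothetical helper, the threshold and cut-off values are arbitrary, and position i in the list is assumed to correspond to topic i.

```python
def top_topics(distribution, min_probability=0.015, top_n=5):
    """Return (topic_id, probability) pairs above a threshold, highest first."""
    candidates = [
        (topic_id, probability)
        for topic_id, probability in enumerate(distribution)
        if probability >= min_probability
    ]
    # Sort by probability in descending order and keep the strongest entries.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


# Toy distribution for one document over five topics (assumed values).
doc_distribution = [0.01, 0.62, 0.0, 0.30, 0.07]
print(top_topics(doc_distribution))  # [(1, 0.62), (3, 0.3), (4, 0.07)]
```

Filtering out near-zero entries keeps the focus on the handful of topics that actually matter for the document.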
Although a topic distribution is nice, we may want to see how each token contributes to a specific topic. To do so, we need to first calculate topic distributions on a token level and then visualize the results:
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)
# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(docs[1], topic_token_distr[1])
df
Note
To get the stylized dataframe for .visualize_approximate_distribution you will need to have Jinja installed. If you do not have this installed, an unstylized dataframe will be returned instead. You can install Jinja via pip install jinja2.
Note
The distribution of the probabilities does not give an indication of the distribution of the frequencies of topics across a document. It merely shows how confident BERTopic is that certain topics can be found in a document.