Visualization
Overview¤
Once you have generated a document-term matrix, it is easy to convert it to a pandas dataframe and then use any of the pandas.DataFrame.plot methods, or to export the data for use with other tools. However, the Lexos API has a number of built-in visualizations to make the process easier. These are accessed through the lexos.visualization module. By default, all Lexos visualizations are static plots produced with the Python matplotlib library. However, both static and interactive plots can be produced using third-party libraries like Plotly and Seaborn. These can be imported from their own folders within the visualization module. For instance, the Plotly dendrogram module can be imported with the statement from lexos.visualization.plotly.cluster import dendrogram.
Each of the available visualization types is described below.
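As a rough illustration of the pandas route mentioned in the overview, the sketch below plots total term counts with pandas alone. The table is a hand-made, hypothetical stand-in for a document-term table (a "terms" column followed by one column of counts per document); it is not produced by Lexos, and the term and document names are invented for the example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical stand-in for a document-term table:
# a "terms" column followed by one column of counts per document.
table = pd.DataFrame(
    {
        "terms": ["whale", "sea", "ship", "captain"],
        "doc1": [12, 7, 3, 5],
        "doc2": [2, 9, 8, 1],
    }
)

# Total count of each term across all documents, largest first.
totals = table.set_index("terms").sum(axis=1).sort_values(ascending=False)

# Any pandas plot method can be used from here; a bar chart is one option.
ax = totals.plot(kind="bar", title="Total term counts")
ax.set_ylabel("count")
```

The resulting matplotlib axes can be saved with ax.figure.savefig(), or the table can be exported with the usual pandas methods such as to_csv().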
Warning
Currently, there is some inconsistency with respect to the format of the input data for dendrograms. Wherever possible, the document-term matrix (the output of lexos.dtm.DTM) is used. Generally, the visualization function will then call DTM.get_table() to retrieve the data as a pandas dataframe. However, in some cases, the visualization function requires the dataframe table itself as the input format. Eventually, these functions will be made consistent.
Word Clouds (and Variants)¤
Single Word Clouds¤
The simplest way to generate a word cloud is to import the wordcloud function and pass it a document-term matrix.
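from lexos.dtm import DTM
from lexos.visualization.cloud.wordcloud import wordcloud

# segments and labels are assumed to be defined already
dtm = DTM(segments, labels)
wordcloud(dtm, opts, figure_opts)  # opts and figure_opts are described below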
Word clouds are generated by the Python Wordcloud library. Options can be passed to WordCloud using the opts parameter. Figures are generated using Python's matplotlib, and its options can be passed using figure_opts. Both take a dictionary of options, as in the example below:
opts = {
    "max_words": 2000,
    "background_color": "white",
    "contour_width": 0,
    "contour_color": "steelblue",
}
figure_opts = {"figsize": (15, 8)}
# Use a name other than "wordcloud" so the imported function is not overwritten
cloud = wordcloud(
    dtm,
    opts=opts,
    figure_opts=figure_opts,
    round=150,
    show=True,
    filename="wordcloud.png",
)
The round parameter (normally between 100 and 300) adds varying degrees of rounded corners to the word cloud.
If a filename is provided, the plot will be saved to the specified file. By default, show=True, and the word cloud will be plotted to the screen if the environment is appropriate. If show=False, the WordCloud object will be returned. In this instance, you can still save the word cloud by calling to_file(filename) on the returned object.
By default, wordcloud() creates a word cloud based on the total term counts for all documents. If you wish to use a single document or a subset of documents, use the docs parameter.
wordcloud(dtm, docs=["doc1", "doc2"])  # list as many documents as needed
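To make the difference between total counts and a subset concrete, here is a minimal pandas sketch of the aggregation described above. It is not the Lexos implementation: the table, the document labels, and the term_totals helper are all hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical table: one row per term, one count column per document.
table = pd.DataFrame(
    {
        "terms": ["whale", "sea", "ship"],
        "doc1": [12, 7, 3],
        "doc2": [2, 9, 8],
        "doc3": [0, 4, 6],
    }
).set_index("terms")


def term_totals(table, docs=None):
    """Sum term counts over all documents, or only over the columns named in docs."""
    selected = table if docs is None else table[list(docs)]
    return selected.sum(axis=1).to_dict()


all_counts = term_totals(table)  # counts across every document
subset_counts = term_totals(table, docs=["doc1", "doc2"])  # counts for two documents only
```

A dictionary of term frequencies like these is the kind of data a word cloud is drawn from, so restricting the documents changes which terms dominate the cloud.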
Note
wordcloud() takes a number of other input formats, including raw text, but a lexos.dtm.DTM is by far the easiest way to generate data from pre-tokenised texts.
Multiclouds¤
Multiclouds are grid-organized word clouds of individual documents, which allow you to compare the document clouds side by side. The method of generating multiclouds is similar to that for word clouds. The basic input is a lexos.dtm.DTM object, where one word cloud will be generated for each document. If a subset of documents is required, the docs parameter shown above should be used. Once the data is prepared, the multiclouds are generated as shown below:
from lexos.visualization.cloud.wordcloud import multicloud
labels = dtm.get_table().columns.tolist()[1:]
multicloud(dtm, title="My Multicloud", labels=labels, ncols=3)
Since multicloud() produces multiple subplots, there is a title parameter to give the entire figure a title and a labels parameter, which takes a list of labels to be assigned to each subplot. In the example above, we are just taking the labels from the DTM table, minus the first "terms" column. The ncols parameter sets the number of subplots per row.
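The effect of ncols on the layout can be sketched with plain arithmetic. The grid_positions helper below is hypothetical and only illustrates how a fixed number of subplots per row determines the number of rows and the position of each document; Lexos may compute its grid differently.

```python
import math


def grid_positions(n_plots, ncols):
    """Return (nrows, [(row, col), ...]) for n_plots laid out ncols per row."""
    if ncols < 1:
        raise ValueError("ncols must be at least 1")
    nrows = math.ceil(n_plots / ncols)
    # divmod gives the row and column of the i-th subplot, filling row by row.
    positions = [divmod(i, ncols) for i in range(n_plots)]
    return nrows, positions


labels = ["doc1", "doc2", "doc3", "doc4", "doc5"]
nrows, positions = grid_positions(len(labels), ncols=3)
for label, (row, col) in zip(labels, positions):
    print(f"{label}: row {row}, column {col}")
```

With five documents and ncols=3, the grid has two rows, and the last row is only partly filled.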
If a filename is provided, the entire plot will be saved. If show=False, multicloud() returns a list of word clouds. These can be saved individually by calling to_file() on them.
Note
As with word clouds, the multicloud() function takes a number of different input formats, but pandas dataframes are the easiest to work with.
Bubble Charts¤
Bubble charts (known as "bubbleviz" in Lexos) are produced as follows:
from lexos.visualization.bubbleviz import bubbleviz
bubbleviz(dtm)
See lexos.visualization.bubbleviz.BubbleChart for a description of the various options.
Warning
The algorithm to produce bubble charts in pure Python is experimental and not nearly as good as the JavaScript implementation used in the Lexos app.
Dendrograms¤
Static dendrograms based on hierarchical agglomerative clustering are produced using the Dendrogram class. It operates directly on the document-term matrix.
dendrogram = Dendrogram(dtm)
dendrogram.fig
or
dendrogram = Dendrogram(dtm, show=True)
The dendrogram plot is not shown by default, so you need to use one of the methods above to display it. The class is a wrapper around scipy.cluster.hierarchy.dendrogram, and you can change any of its options by setting them as attributes, e.g. Dendrogram.orientation = "bottom". If you change any of the options, you must then rebuild the dendrogram by calling Dendrogram.build().
The title, distance metric, and linkage method of the dendrogram can be set by passing title, metric, and method when instantiating the class, or by setting them afterwards in the manner shown above.
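Because the class wraps scipy.cluster.hierarchy.dendrogram, it may help to see roughly what the metric, method, and orientation settings correspond to in SciPy itself. The sketch below calls SciPy directly on a small, hypothetical matrix of term counts (one row per document); it is an illustration of the underlying technique, not a description of how Lexos calls SciPy internally.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical document-term counts: one row per document, one column per term.
labels = ["doc1", "doc2", "doc3", "doc4"]
counts = np.array(
    [
        [12, 7, 3, 0],
        [11, 8, 2, 1],
        [0, 1, 9, 14],
        [1, 0, 10, 12],
    ]
)

# The distance metric controls how far apart two documents are considered to be.
distances = pdist(counts, metric="euclidean")

# The linkage method controls how clusters are merged agglomeratively.
links = linkage(distances, method="average")

# no_plot=True computes the tree layout without drawing anything.
tree = dendrogram(links, labels=labels, orientation="bottom", no_plot=True)
leaf_order = tree["ivl"]  # document labels in the order the leaves would be drawn
```

Documents with similar counts, such as doc1 and doc2 here, end up as neighbouring leaves; changing the metric or method can change both the tree and the leaf order.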
There is also a savefig() method, which takes a filename or filepath to save the file. The image format is detected automatically from the file extension.
Plotly Dendrograms¤
To create a dendrogram in Plotly, do the following:
from lexos.visualization.plotly.cluster.dendrogram import PlotlyDendrogram
layout = dict(margin=dict(l=20))
dendrogram = PlotlyDendrogram(
    dtm,
    title="Plotly Dendrogram",
    x_tickangle=45,
    **layout
)
The plot() function accepts the labels, colorscale, hovertext, and color_threshold parameters of the plotly.figure_factory.create_dendrogram function. However, it requires strings, rather than functions, for the names of the distance metric and linkage method. Use the x_tickangle parameter to change the rotation of the leaf labels.
A dictionary of Plotly configuration options can be passed to the config parameter. Likewise, Plotly layout options can be passed using the layout parameter, as shown in the example above.
Once the dendrogram object has been instantiated, it can be displayed with the Dendrogram.showfig() method. It can also be converted to HTML with Dendrogram.to_html(). By default, this will return an HTML div element, but the output_type can also be set to "file" and a filename supplied to save the HTML string as a file.
Warning
If you create a dendrogram with something like dendrogram = Dendrogram(dtm, show=False) and then call dendrogram.fig, you will get a plot, but it will not have any configurations you have specified applied to it. This includes the default configurations such as removing the Plotly logo from the menubar. This is due to a flaw in Plotly's API. Accessing the dendrogram figure with dendrogram.showfig() avoids this problem.
Clustermaps¤
Using Seaborn¤
A clustermap is a dendrogram attached to a heatmap, showing the relative similarity of documents using a colour scale. Lexos can generate static clustermap images using the Python Seaborn library.
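The idea of a dendrogram attached to a heatmap can be sketched with NumPy and SciPy alone: cluster the documents, then reorder a document-by-document distance matrix so that similar documents sit next to each other. This is a simplified, hypothetical illustration of the concept and uses invented counts; Seaborn's clustermap has its own conventions for what is clustered and displayed.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical document-term counts: one row per document.
labels = ["doc1", "doc2", "doc3", "doc4"]
counts = np.array(
    [
        [12, 7, 3, 0],
        [0, 1, 9, 14],
        [11, 8, 2, 1],
        [1, 0, 10, 12],
    ]
)

# Pairwise distances in condensed form, then as a square matrix for the heatmap.
condensed = pdist(counts, metric="euclidean")
matrix = squareform(condensed)

# Leaf order of the dendrogram built from the same distances.
order = leaves_list(linkage(condensed, method="average"))

# Reorder rows and columns so the heatmap lines up with the dendrogram leaves.
heatmap = matrix[np.ix_(order, order)]
ordered_labels = [labels[i] for i in order]
```

After reordering, small distances cluster near the diagonal, which is what makes blocks of similar documents visible in the heatmap's colour scale.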
To generate a clustermap, use the following code:
from lexos.visualization.seaborn.cluster.clustermap import ClusterMap
cluster_map = ClusterMap(dtm, title="My Clustermap")
lexos.visualization.seaborn.cluster.clustermap.ClusterMap accepts any seaborn.clustermap parameter.
The title, distance metric, and linkage method can be set by passing title, metric, and method when instantiating the class, or by setting them afterwards and then calling ClusterMap.build().
The clustermap plot is not shown by default. To display the plot, generate it with show=True or reference it with ClusterMap.fig. If you change any of the options, you must then rebuild the clustermap by calling ClusterMap.build().
There is also a savefig() method, which takes a filename or filepath to save the file. The image format is detected automatically from the file extension.
Using Plotly¤
Plotly clustermaps are somewhat experimental and may not render plots that are as informative as Seaborn clustermaps. One advantage they have is that, instead of providing labels for each document at the bottom of the graph, they provide the document labels on the x and y axes, as well as the z (distance) score in the hovertext. This allows you to mouse over individual sections of the heatmap to see which documents are represented by that particular section.
Plotly clustermaps are constructed in the same manner as Plotly dendrograms:
from lexos.visualization.plotly.cluster.clustermap import PlotlyClustermap
cluster_map = PlotlyClustermap(dtm)
cluster_map.showfig()
All the options for Plotly dendrograms are available with the following differences:
- Figure size is determined by configuring the width and height parameters.
- colorscale is the name of a built-in Plotly colorscale. This is applied to the heatmap and converted internally to a list of colours to apply to the dendrograms.
- Two additional parameters, hide_upper and hide_side, allow you to hide the individual dendrograms.
Warning
Once the clustermap plot has been generated, it is inadvisable to use the modebar zoom and pan buttons because this tends to separate the heatmap from the dendrogram leaves. It may even be advisable to remove these buttons from the modebar by default.