Visualization

Overview

Once you have generated a document-term matrix, it is easy to convert it to a pandas dataframe and then use any of the pandas.DataFrame.plot methods, or to export the data for use with other tools. However, the Lexos API has a number of built-in visualizations to make the process easier. These are accessed through the lexos.visualization module. By default, all Lexos visualizations are static plots produced with the Python matplotlib library. However, both static and interactive plots can also be produced using third-party libraries like Plotly and Seaborn. These can be imported from their own folders within the visualization module. For instance, the Plotly dendrogram module can be imported with from lexos.visualization.plotly.cluster import dendrogram.
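As a rough illustration of exporting term data for use with other tools, the sketch below writes a small term table to CSV text using only the Python standard library. The table contents and the table_to_csv helper are hypothetical stand-ins for whatever your document-term matrix produces; they are not part of the Lexos API.

```python
import csv
import io

# Hypothetical term table: one row per term, one count column per document.
rows = [
    {"terms": "whale", "doc1": 12, "doc2": 3},
    {"terms": "ship", "doc1": 7, "doc2": 9},
]


def table_to_csv(rows):
    """Serialise a list of row dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = table_to_csv(rows)
print(csv_text)
```

The resulting text can be written to a file and opened in a spreadsheet or another plotting tool.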

Each of the available visualization types is described below.

Warning

Currently, there is some inconsistency with respect to the format of the input data for dendrograms. Wherever possible, the document-term matrix (the output of lexos.dtm.DTM) is used. Generally, the visualization function will then call DTM.get_table() to retrieve the data as a pandas dataframe. However, in some cases, the visualization function requires the dataframe table itself as its input. Eventually, these functions will be made consistent.

Word Clouds (and Variants)

Single Word Clouds

The simplest way to generate a word cloud is to import the wordcloud function and pass it a document-term matrix.

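from lexos.dtm import DTM
from lexos.visualization.cloud.wordcloud import wordcloud

# segments and labels are the tokenised documents and their labels
dtm = DTM(segments, labels)

# opts and figure_opts are dictionaries of options, described below
wordcloud(dtm, opts, figure_opts)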

Word clouds are generated by the Python wordcloud library. Options can be passed to its WordCloud class using the opts parameter. Figures are generated using Python's matplotlib, and its options can be passed using figure_opts. Both take a dictionary of options, as in the example below:

opts = {
    "max_words": 2000,
    "background_color": "white",
    "contour_width": 0,
    "contour_color": "steelblue"
}

figure_opts = {"figsize": (15, 8)}

# Use a separate name for the result so the wordcloud() function is not overwritten
cloud = wordcloud(
    dtm,
    opts=opts,
    figure_opts=figure_opts,
    round=150,
    show=True,
    filename="wordcloud.png"
)

The round parameter (normally between 100 and 300) will add various degrees of rounded corners to the word cloud.

If a filename is provided, the plot will be saved to the specified file. By default, show=True, and the word cloud will be plotted to the screen if the environment is appropriate. If show=False, the WordCloud object will be returned. In this instance, you can still save the word cloud by calling to_file(filename) on the returned object.

By default, wordcloud() creates a word cloud based on the total term counts for all documents. If you wish to use a single document or a subset of documents, use the docs parameter.

wordcloud(dtm, docs=["doc1", "doc2"])  # list as many documents as needed
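To make the idea of "total term counts" for all documents or for a subset more concrete, here is a minimal standard-library sketch. The per-document counts and the total_counts helper are hypothetical; how Lexos computes these totals internally is not shown here and may differ.

```python
from collections import Counter

# Hypothetical per-document term counts, keyed by document label.
doc_counts = {
    "doc1": {"whale": 12, "ship": 7},
    "doc2": {"whale": 3, "ship": 9, "sea": 4},
    "doc3": {"sea": 10},
}


def total_counts(doc_counts, docs=None):
    """Sum term counts over the selected documents (all documents if docs is None)."""
    selected = doc_counts.keys() if docs is None else docs
    totals = Counter()
    for label in selected:
        # Counter.update adds the counts rather than replacing them
        totals.update(doc_counts[label])
    return dict(totals)


all_totals = total_counts(doc_counts)
subset_totals = total_counts(doc_counts, docs=["doc1", "doc2"])
```

Here the word sizes in a cloud built from subset_totals would reflect only doc1 and doc2, so "sea" would carry far less weight than in the cloud for all three documents.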

Note

wordcloud() takes a number of other input formats, including raw text, but a lexos.dtm.DTM is by far the easiest method to generate data from pre-tokenised texts.

Multiclouds

Multiclouds are grid-organized word clouds of individual documents, which allow you to compare the document clouds side by side. The method of generating multiclouds is similar to that for word clouds. The basic input is a lexos.dtm.DTM object, and one word cloud will be generated for each document. If a subset of documents is required, the docs parameter shown above should be used. Once the data is prepared, the multiclouds are generated as shown below:

from lexos.visualization.cloud.wordcloud import multicloud

labels = dtm.get_table().columns.tolist()[1:]

multicloud(dtm, title="My Multicloud", labels=labels, ncols=3)

Since multicloud() produces multiple subplots, there is a title parameter to give the entire figure a title and a labels parameter, which takes a list of labels to be assigned to each subplot. In the example above, we are just taking the labels from the DTM, minus the first "terms" column. The ncols parameter sets the number of subplots per row.
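The relationship between the labels list and ncols can be sketched as follows. This is an illustration only: the grid_positions helper and the column names are hypothetical, and it assumes subplots are filled left to right, top to bottom, which the multicloud() documentation above does not state explicitly.

```python
import math


def grid_positions(labels, ncols):
    """Map each label to a (row, column) cell, filling the grid row by row."""
    nrows = math.ceil(len(labels) / ncols)
    positions = {label: divmod(index, ncols) for index, label in enumerate(labels)}
    return nrows, positions


# Hypothetical table columns: the first entry is the "terms" column.
columns = ["terms", "doc1", "doc2", "doc3", "doc4", "doc5"]
labels = columns[1:]

nrows, positions = grid_positions(labels, ncols=3)
print(nrows, positions)
```

With five documents and ncols=3, the figure needs two rows, and the last cell of the second row is left empty.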

If a filename is provided, the entire plot will be saved. If show=False, multicloud() returns a list of word clouds. These can be saved individually by calling to_file() on them.

Note

As with word clouds, the multicloud() function takes a number of different input formats, but pandas dataframes are the easiest to work with.

Bubble Charts

Bubble charts (known as "bubbleviz" in Lexos) are produced as follows:

from lexos.visualization.bubbleviz import bubbleviz

bubbleviz(dtm)

See lexos.visualization.bubbleviz.BubbleChart for a description of the various options.

Warning

The algorithm to produce bubble charts in pure Python is experimental and not nearly as good as the JavaScript implementation used in the Lexos app.

Dendrograms

Static dendrograms based on hierarchical agglomerative clustering are produced using the Dendrogram class. It operates directly on the document-term matrix.

dendrogram = Dendrogram(dtm)

dendrogram.fig

or

dendrogram = Dendrogram(dtm, show=True)

The dendrogram plot is not shown by default, so you need to use one of the methods above to display it. The class is a wrapper around scipy.cluster.hierarchy.dendrogram, and you can change any of its options by setting them as attributes, e.g. dendrogram.orientation = "bottom". If you change any of the options, you must then rebuild the dendrogram by calling dendrogram.build().

The title, distance metric, and linkage method of the dendrogram can be set by passing title, metric, and method when instantiating the class, or by setting them afterwards in the manner described above.
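For readers unfamiliar with the terms, the sketch below shows what a distance metric and a linkage method do in hierarchical agglomerative clustering. It is a deliberately simple pure-Python illustration using Euclidean distance and average linkage as example choices; the function names and data are hypothetical, and in practice the Dendrogram class relies on SciPy, which is far more efficient.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Distance metric: Euclidean distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, vectors):
    """Linkage method: mean pairwise distance between two clusters' members."""
    distances = [
        euclidean(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b
    ]
    return sum(distances) / len(distances)


def agglomerate(vectors):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        merges.append((a, b, average_linkage(a, b, vectors)))
        # Replace the two merged clusters with their union
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical documents as term-count vectors over the same three terms.
labels = ["doc1", "doc2", "doc3"]
vectors = [[1, 0, 0], [1, 1, 0], [5, 5, 5]]

merges = agglomerate(vectors)
for a, b, height in merges:
    print([labels[i] for i in a], "+", [labels[i] for i in b], f"at {height:.2f}")
```

Each merge corresponds to one join in the dendrogram, and the distance at which it happens is the height of that join. Changing the metric or the linkage method changes these distances and can change the order of the merges.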

There is also a savefig() method which takes a filename or filepath to save the file. The image format is detected automatically from the extension type.

Plotly Dendrograms

To create a dendrogram in Plotly, do the following:

from lexos.visualization.plotly.cluster.dendrogram import PlotlyDendrogram

layout = dict(margin=dict(l=20))

dendrogram = PlotlyDendrogram(
    dtm,
    title="Plotly Dendrogram",
    x_tickangle=45,
    **layout
)

The plot() function accepts the labels, colorscale, hovertext, and color_threshold parameters of the plotly.figure_factory.create_dendrogram function. However, it requires strings, rather than functions, for the names of the distance metric and linkage method. Use the x_tickangle parameter to change the rotation of the leaf labels.

A dictionary of Plotly configuration options can be passed to the config parameter. Likewise, Plotly layout options can be passed as keyword arguments, as in the example above, where the layout dictionary is unpacked with **layout.

Once the dendrogram object has been instantiated, it can be displayed with the showfig() method. It can also be converted to HTML with the to_html() method. By default, this will return an HTML div element, but the output_type can also be set to "file" and a filename supplied to save the HTML string as a file.

Warning

If you create a dendrogram with something like dendrogram = PlotlyDendrogram(dtm, show=False) and then call dendrogram.fig, you will get a plot, but it will not have any of the configurations you have specified applied to it. This includes the default configurations, such as removing the Plotly logo from the menubar. This is due to a flaw in Plotly's API. Accessing the dendrogram figure with dendrogram.showfig() avoids this problem.

Clustermaps

Using Seaborn

A clustermap is a dendrogram attached to a heatmap, showing the relative similarity of documents using a colour scale. Lexos can generate static clustermap images using the Python Seaborn library.

To generate a clustermap, use the following code:

from lexos.visualization.seaborn.cluster.clustermap import ClusterMap

cluster_map = ClusterMap(dtm, title="My Clustermap")

lexos.visualization.seaborn.cluster.clustermap.ClusterMap accepts any seaborn.clustermap parameter.

The title, distance metric, and linkage method of the dendrogram can be set by passing title, metric, and method when instantiating the class, or by setting them afterwards and then calling ClusterMap.build().

The clustermap plot is not shown by default. To display the plot, generate it with show=True or reference it with ClusterMap.fig. If you change any of the options, you must then rebuild the clustermap by calling ClusterMap.build().

There is also a savefig() method which takes a filename or filepath to save the file. The image format is detected automatically from the extension type.

Using Plotly

Plotly clustermaps are somewhat experimental and may not render plots that are as informative as Seaborn clustermaps. One advantage they have is that, instead of providing labels for each document at the bottom of the graph, they provide the document labels on the x and y axes, as well as the z (distance) score in the hovertext. This allows you to mouse over individual sections of the heatmap to see which documents are represented by that particular section.
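The kind of information carried by each heatmap cell can be sketched with the standard library alone. The example below builds a pairwise distance matrix for a few hypothetical document vectors and composes one hover string per cell. The Euclidean metric, the helper names, and the exact hover format are assumptions made for illustration, not the layout Lexos or Plotly actually produces.

```python
import math


def distance_matrix(vectors):
    """Pairwise Euclidean distances between document vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


def hover_text(labels, matrix):
    """One hover string per cell, naming both documents and their distance."""
    size = len(labels)
    return [
        [
            f"x: {labels[j]}<br>y: {labels[i]}<br>z: {matrix[i][j]:.2f}"
            for j in range(size)
        ]
        for i in range(size)
    ]


# Hypothetical documents as term-count vectors over the same two terms.
labels = ["doc1", "doc2", "doc3"]
vectors = [[0, 0], [3, 4], [6, 8]]

matrix = distance_matrix(vectors)
hover = hover_text(labels, matrix)
print(hover[0][1])
```

Each cell pairs an x-axis document with a y-axis document, which is what makes it possible to identify a section of the heatmap by mousing over it.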

Plotly clustermaps are constructed in the same manner as Plotly dendrograms:

from lexos.visualization.plotly.cluster.clustermap import PlotlyClustermap

cluster_map = PlotlyClustermap(dtm)

cluster_map.showfig()

All the options for Plotly dendrograms are available with the following differences:

Figure size is determined by configuring the width and height parameters.
colorscale is the name of a built-in Plotly colorscale. This is applied to the heatmap and converted internally to a list of colours to apply to the dendrograms.
Two additional parameters, hide_upper and hide_side, allow you to hide the individual dendrograms.

Warning

Once the clustermap plot has been generated, it is inadvisable to use the modebar zoom and pan buttons because this tends to separate the heatmap from the dendrogram leaves. It may even be advisable to remove these buttons from the modebar by default.