Topic Modelling

Workshop on Building and Strengthening Digital Humanities through a Regional Network at San Diego State University, October 23-24, 2015

Scott Kleinman, California State University, Northridge / scott.kleinman@csun.edu

What is Topic Modelling?

A form of unsupervised machine learning used to identify categories of meaning in collections of texts
Developed for information search and retrieval
Increasingly employed by scholars of history and literature to model meaning in their texts

A topic model is a model of a collection of texts that assumes texts are constructed from building blocks called "topics". Topic modelling algorithms use information in the texts themselves to generate the topics; they are not pre-assigned.

A topic model can produce amazing, magical insights about your texts...

What is a "Topic"?

A probability distribution over terms.

Say what?

A list of terms (usually words) from your text collection, each of which has a certain probability of occurring with the other terms in the list.
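As a minimal sketch of the idea, a topic can be written out as a table of terms and probabilities. The terms below echo the religion topic shown later, but the probabilities are invented for illustration and do not come from a real model.

```python
import random

# Hypothetical topic: each term is paired with the probability of drawing it.
# The numbers are made up for illustration, not taken from a trained model.
topic = {
    "religion": 0.30,
    "faith": 0.25,
    "philosophy": 0.20,
    "god": 0.15,
    "derrida": 0.10,
}

# A probability distribution sums to 1.
total = sum(topic.values())

# The keywords reported for a topic are simply its most probable terms.
top_terms = sorted(topic, key=topic.get, reverse=True)[:3]

# Drawing words from the topic favours the high-probability terms.
rng = random.Random(0)
sample = rng.choices(list(topic), weights=list(topic.values()), k=10)

print(top_terms)
print(sample)
```

In a real model the list covers every term in the collection, and most terms have probabilities close to zero in any given topic.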

Topic Model of US Humanities Patents (Courtesy of Alan Liu)

Topic Number	Prominence	Keywords

0	0.05337	atlanta buck sherman coffee lil city soldiers union donaldson war opera music men dance wrote vamp
1	0.03937	chinese singapore quantum scientific china percent han evidence xinjiang physics bjork language study ethnic test culture pretest memory miksic
2	0.06374	dr medical pain osteopathic medicine brain patients doctors creativity care health smithsonian touro patient benson cancer physician skorton physicians
3	0.12499	book poetry books literary writer writers american literature fiction writing poet author freud english novels culture published review true
4	0.14779	museum art mr ms arts museums artist center artists music hong kong contemporary works director china painting local institute
5	0.0484	love robinson mr godzilla slater gorman movie mother lila literature read sachs happy taught asked writing house child lived
6	0.03013	oct org street nov center museum art gallery sundays saturdays theater free road noon arts connecticut tuesdays avenue university
7	0.09138	johnson rights editor mr wilson poverty civil war vietnam kristof president writer jan ants hope human bill lyndon presidents
8	0.18166	technology computer business ms tech engineering jobs mr women people science percent ireland work skills companies fields number company
9	0.08085	israel women police violence black gender war church poland white northern officers country trial racism rights civil justice rice
10	0.94475	advertisement people time years work make life world year young part day made place back great good times things
11	0.08681	mr chief smith russian vista company russia financial times equity dealbook million reports street private berggruen york bank executive
12	0.1135	street york show free sunday children saturday theater city monday tour friday martin center members students manhattan village west
13	0.17297	times video photo community commencement article york lesson credit read tumblr online students blog digital college plan twitter news
14	0.4395	university american research mr studies international faculty state center work dr director arts universities academic bard advertisement education history
15	0.55649	people human world professor science humanities time knowledge life questions study learn social find ways change thinking problem don
16	0.10946	mr ms professor marriage york wrote degree newark mondale mother born received father school aboulela ajami price married home
17	0.3896	years government mr president report programs public american humanities state ms year million information board budget today left private
18	0.07622	religion religious buddhist faith philosophy traditions god derrida philosophers life beliefs hope buddhism jesus doctrine stone deconstruction theology lives
19	0.31793	students school education college schools student teachers teaching graduate year harvard percent colleges class high graduates job learning universities
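A table like the one above can be read into a script and sorted by prominence. The sketch below assumes tab-separated rows in the order topic number, prominence, keywords; the three rows are copied from the table with their keyword lists shortened, and the function name is hypothetical.

```python
# Three rows from the table above, keywords shortened for the example.
# Assumed layout: topic number <TAB> prominence <TAB> space-separated keywords.
rows = (
    "3\t0.12499\tbook poetry books literary writer writers\n"
    "10\t0.94475\tadvertisement people time years work make\n"
    "18\t0.07622\treligion religious buddhist faith philosophy traditions\n"
)


def parse_topic_keys(text):
    """Return a list of (topic number, prominence, keywords) tuples."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        number, prominence, keywords = line.split("\t", 2)
        topics.append((int(number), float(prominence), keywords.split()))
    return topics


topics = parse_topic_keys(rows)

# Sort so the most prominent topic comes first.
ranked = sorted(topics, key=lambda t: t[1], reverse=True)
for number, prominence, keywords in ranked:
    print(number, prominence, " ".join(keywords[:5]))
```

Sorting this way puts topic 10 first. Its generic vocabulary (people, time, years) suggests that the most prominent topic is not necessarily the most informative one.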

What do Topics Represent?

Subjects
Themes
Discourses
Meaningless junk

Good topics are normally judged by the "semantic coherence" of their terms, but there is no proven statistical heuristic for demonstrating this.

Typically, human intuition is used to label the topics (e.g. Religion and Deconstructionism: religion religious buddhist faith philosophy traditions god derrida ...).

Less semantically coherent topics can be the most interesting because they bring together terms human users might not relate.

Junk topics can be ignored, but a "good" topic model will have a relatively low percentage of junk topics.

What do Topic Models Tell Us?

The topics present in the collection
The prominence of individual topics in the collection
The prominence of individual terms in each topic
The prominence of each topic in each document in the collection
The most prominent documents associated with each topic
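The last two items can be illustrated with a toy table of topic proportions per document. Everything in this sketch is hypothetical: the file names, the three topics, and the numbers are invented to show the kind of lookup a topic model makes possible.

```python
# Hypothetical proportions of three topics (0, 1, 2) in three documents.
# Each row sums to 1: a document is treated as a mixture of topics.
doc_topics = {
    "doc_a.txt": [0.70, 0.20, 0.10],
    "doc_b.txt": [0.10, 0.60, 0.30],
    "doc_c.txt": [0.25, 0.25, 0.50],
}


def dominant_topic(proportions):
    """Return the index of the most prominent topic in one document."""
    return max(range(len(proportions)), key=proportions.__getitem__)


def top_documents(table, topic, n=2):
    """Return the n documents in which the given topic is most prominent."""
    return sorted(table, key=lambda doc: table[doc][topic], reverse=True)[:n]


for name, proportions in doc_topics.items():
    print(name, "is mostly topic", dominant_topic(proportions))

print("Top documents for topic 1:", top_documents(doc_topics, 1))
```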

Methodological Questions

How does the algorithm work?
How do we identify "good" and "bad" topics? "Good" and "bad" topic models?
What epistemological issues are raised by topic modelling?

Such methodological questions make very good discussion points for students.

Examples

How did they do that?

Flashy visualisation techniques that have nothing to do with topic modelling
Topic Modelling Tools like Mallet

Mallet

Pros	Cons
Considered the best implementation of Latent Dirichlet Allocation (LDA), the most widely used topic modelling algorithm	Written in Java, requires installation from the command line
The project is maintained and has a growing Digital Humanities user-base	Output is difficult to use

GUI Topic Modeling Tool

Pros	Cons
.JAR wrapper for Mallet requires no installation	Doesn't implement all Mallet options
Easy to use graphical user interface for Mallet	Mallet version is unclear
Produces handy HTML topic browser	Project is not maintained
	Output is difficult to use

Other Options

Python packages like gensim and similar packages in other languages
R (Statistical programming language)
Tools that wrap Mallet, e.g. Serendip http://vep.cs.wisc.edu/serendip/ (Python) and dfr-browser http://agoldst.github.io/dfr-browser/ (R)

Topic Modelling Resources

MALLET: http://mallet.cs.umass.edu/
The Programming Historian Tutorial: http://programminghistorian.org/lessons/topic-modeling-and-mallet
How to Create Topic Clouds with Lexos: http://scottkleinman.net/blog/2014/07/25/how-to-create-topic-clouds-with-lexos/
How to Create and Cluster Topic Files with Lexos: http://scottkleinman.net/blog/2015/09/08/how-to-create-and-cluster-topic-files-in-lexos/
Issue on Topic Modelling in Journal of Digital Humanities: http://journalofdigitalhumanities.org/2-1/
Matthew Jockers, The LDA Buffet: http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/
Ted Underwood, Topic Modeling Made Just Simple Enough: http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/

What do I need?

A collection of texts in a flat folder. This is the input folder.
An empty folder to contain the results of your topic model. This is the output folder.
It is good to start out with the GUI Topic Modeling Tool and the DARIAH-DE Austen-Brontë Dataset. (Note: You may need to remove the umlaut from "Brontë" in the folder and file names.)
To run the full version of Mallet, follow the instructions in the Programming Historian Tutorial.
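For orientation, a full Mallet run has two steps: import the input folder, then train the model. The sketch below only builds and prints the two commands; it does not run them. The folder names, the output file names, the helper function, and the `bin/mallet` path are assumptions for illustration, and the tutorial remains the authoritative guide to the options.

```python
import shlex


def mallet_commands(input_dir, output_dir, num_topics=20, mallet="bin/mallet"):
    """Build (but do not run) the import and training commands for Mallet.

    The file names inside output_dir are arbitrary choices for this sketch.
    """
    corpus = f"{output_dir}/corpus.mallet"
    import_cmd = [
        mallet, "import-dir",
        "--input", input_dir,
        "--output", corpus,
        "--keep-sequence",      # keep word order, needed for topic modelling
        "--remove-stopwords",   # drop common English function words
    ]
    train_cmd = [
        mallet, "train-topics",
        "--input", corpus,
        "--num-topics", str(num_topics),
        "--output-topic-keys", f"{output_dir}/topic_keys.txt",
        "--output-doc-topics", f"{output_dir}/doc_topics.txt",
        "--word-topic-counts-file", f"{output_dir}/word_topic_counts.txt",
    ]
    return import_cmd, train_cmd


# Print the commands so they can be checked before being run.
for cmd in mallet_commands("input", "output"):
    print(shlex.join(cmd))
```

To execute a command from Python, each list could be passed to `subprocess.run(cmd, check=True)` from the Mallet installation folder. On Windows the executable is typically `bin\mallet` rather than `bin/mallet`.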

How do I visualise my topic model?

With difficulty.
A good start is the Lexos Multicloud tool (if the GUI Topic Modeling Tool outputs clean data). See How to Create Topic Clouds with Lexos for instructions.
Other methods require coding and/or the use of outside visualisation tools like Serendip or dfr-browser.

Where do I go from here?

Try Mallet from the command line. Make sure to output the word-topic-counts file.
Feed the word-topic-counts file to the Lexos Multicloud Tool for better results.
Try the DARIAH-DE Python Visualizing Topic Models tutorial.
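If you want to inspect the word-topic-counts file yourself before visualising it, a short script can regroup it by topic. This sketch assumes each line has the form "index word topic:count topic:count ..."; check that against your own file, since the layout may differ between Mallet versions. The sample lines and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in the layout assumed here:
# <word index> <word> <topic>:<count> <topic>:<count> ...
sample = """0 religion 18:42 15:3
1 museum 4:57 6:12
2 students 19:80 13:9 15:4
"""


def words_by_topic(text):
    """Regroup word-topic counts as {topic: {word: count}}."""
    topics = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        word = fields[1]
        for pair in fields[2:]:
            topic, count = pair.split(":")
            topics[int(topic)][word] = int(count)
    return dict(topics)


counts = words_by_topic(sample)

# List each topic's words from most to least frequent.
for topic in sorted(counts):
    ranked = sorted(counts[topic].items(), key=lambda kv: kv[1], reverse=True)
    print(topic, ranked)
```

The per-topic counts are the kind of data a word cloud needs: one list of weighted words for each topic.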