Topic Modelling

Workshop on Building and Strengthening Digital Humanities through a Regional Network at San Diego State University, October 23-24, 2015

Scott Kleinman, California State University, Northridge / scott.kleinman@csun.edu

What is Topic Modelling?

  • A form of unsupervised machine learning used to identify categories of meaning in collections of texts
  • Developed for information search and retrieval
  • Increasingly employed by scholars of history and literature to model meaning in their texts

A topic model is a model of a collection of texts that assumes texts are constructed from building blocks called "topics". Topic modelling algorithms use information in the texts themselves to generate the topics; they are not pre-assigned.

A topic model can produce amazing, magical insights about your texts...

King Arthur and his knights outside Camelot
King Arthur: 'Camelot!'
Sir Lancelot: 'Camelot!'
Sir Galahad: 'Camelot!'
Squire: 'It's only a model.'

What is a "Topic"?

A probability distribution over terms.

Say what?

A list of terms (usually words) from your text collection, each of which has a certain probability of occurring with the other terms in the list.
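As a minimal sketch of this definition, a topic can be written as a mapping from terms to probabilities. The terms and numbers below are invented for illustration and do not come from any real model; a real topic assigns a probability to every term in the collection's vocabulary.

```python
# A toy topic: each term gets a probability.
# The terms and numbers are invented for illustration only.
topic = {
    "religion": 0.30,
    "faith": 0.25,
    "philosophy": 0.20,
    "god": 0.15,
    "museum": 0.10,
}

def top_terms(topic, n):
    """Return the n most probable terms, highest probability first."""
    ranked = sorted(topic.items(), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]

# A probability distribution must sum to 1 (allowing for rounding error).
total = sum(topic.values())
print(round(total, 6))
print(top_terms(topic, 3))
```

The keyword lists usually shown for a topic are just its most probable terms, which is what the hypothetical `top_terms` helper returns.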

Topic Model of US Humanities Patents (Courtesy of Alan Liu)

Topic Number  Prominence  Keywords
0             0.05337     atlanta buck sherman coffee lil city soldiers union donaldson war opera music men dance wrote vamp
1             0.03937     chinese singapore quantum scientific china percent han evidence xinjiang physics bjork language study ethnic test culture pretest memory miksic
2             0.06374     dr medical pain osteopathic medicine brain patients doctors creativity care health smithsonian touro patient benson cancer physician skorton physicians
3             0.12499     book poetry books literary writer writers american literature fiction writing poet author freud english novels culture published review true
4             0.14779     museum art mr ms arts museums artist center artists music hong kong contemporary works director china painting local institute
5             0.0484      love robinson mr godzilla slater gorman movie mother lila literature read sachs happy taught asked writing house child lived
6             0.03013     oct org street nov center museum art gallery sundays saturdays theater free road noon arts connecticut tuesdays avenue university
7             0.09138     johnson rights editor mr wilson poverty civil war vietnam kristof president writer jan ants hope human bill lyndon presidents
8             0.18166     technology computer business ms tech engineering jobs mr women people science percent ireland work skills companies fields number company
9             0.08085     israel women police violence black gender war church poland white northern officers country trial racism rights civil justice rice
10            0.94475     advertisement people time years work make life world year young part day made place back great good times things
11            0.08681     mr chief smith russian vista company russia financial times equity dealbook million reports street private berggruen york bank executive
12            0.1135      street york show free sunday children saturday theater city monday tour friday martin center members students manhattan village west
13            0.17297     times video photo community commencement article york lesson credit read tumblr online students blog digital college plan twitter news
14            0.4395      university american research mr studies international faculty state center work dr director arts universities academic bard advertisement education history
15            0.55649     people human world professor science humanities time knowledge life questions study learn social find ways change thinking problem don
16            0.10946     mr ms professor marriage york wrote degree newark mondale mother born received father school aboulela ajami price married home
17            0.3896      years government mr president report programs public american humanities state ms year million information board budget today left private
18            0.07622     religion religious buddhist faith philosophy traditions god derrida philosophers life beliefs hope buddhism jesus doctrine stone deconstruction theology lives
19            0.31793     students school education college schools student teachers teaching graduate year harvard percent colleges class high graduates job learning universities
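A table like this is easier to explore once it is read into a program. The sketch below takes three rows from the table (topics 3, 10 and 18, keywords shortened to the first five) and ranks them by prominence. It assumes the rows are stored as tab-separated text with three columns; that layout and the `parse_topics` helper are assumptions made for illustration.

```python
# Three rows from the table, keywords shortened to the first five.
# Assumed layout: topic number, prominence, keywords, separated by tabs.
rows = (
    "3\t0.12499\tbook poetry books literary writer\n"
    "10\t0.94475\tadvertisement people time years work\n"
    "18\t0.07622\treligion religious buddhist faith philosophy\n"
)

def parse_topics(text):
    """Turn tab-separated rows into (number, prominence, keywords) tuples."""
    topics = []
    for line in text.strip().splitlines():
        number, prominence, keywords = line.split("\t")
        topics.append((int(number), float(prominence), keywords.split()))
    return topics

# Sort so that the most prominent topic comes first.
ranked = sorted(parse_topics(rows), key=lambda t: t[1], reverse=True)
for number, prominence, keywords in ranked:
    print(number, prominence, " ".join(keywords[:3]))
```

Sorting this way puts topic 10 first. Its keywords are very common words, which is a reminder that the most prominent topic is not necessarily the most informative one.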

What do Topics Represent?

  • Subjects
  • Themes
  • Discourses
  • Meaningless junk

Good topics are normally judged by the "semantic coherence" of their terms, but there is no proven statistical heuristic for demonstrating this.

Typically, human intuition is used to label the topics (e.g. Religion and Deconstructionism: religion religious buddhist faith philosophy traditions god derrida ...).

Less semantically coherent topics can be the most interesting because they bring together terms human users might not relate.

Junk topics can be ignored, but a "good" topic model will have a relatively low percentage of junk topics.

What do Topic Models Tell Us?

  • The topics present in the collection
  • The prominence of individual topics in the collection
  • The prominence of individual terms in each topic
  • The prominence of each topic in each document in the collection
  • The most prominent documents associated with each topic
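Several of these outputs can be read off a single table of topic proportions per document. The sketch below uses invented proportions for three hypothetical documents and three topics. Averaging a topic's proportion across documents is only one simple way to summarise its prominence in the collection; a tool such as Mallet may report a different measure.

```python
# Invented topic proportions for three hypothetical documents.
# Each row sums to 1: a document is treated as a mixture of topics.
doc_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.3, 0.3, 0.4],
}

def topic_prominence(doc_topics):
    """Average each topic's proportion across all documents."""
    rows = list(doc_topics.values())
    return [sum(column) / len(rows) for column in zip(*rows)]

def top_document(doc_topics, topic):
    """Return the document in which the given topic is most prominent."""
    return max(doc_topics, key=lambda name: doc_topics[name][topic])

prominence = topic_prominence(doc_topics)
print([round(p, 3) for p in prominence])
print(top_document(doc_topics, 1))
```

Reading the table by row gives the prominence of each topic in a document; reading it by column gives the documents most strongly associated with a topic.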

Methodological Questions

  • How does the algorithm work?
  • How do we identify "good" and "bad" topics? "Good" and "bad" topic models?
  • What epistemological issues are raised by topic modelling?

Such methodological questions make very good discussion points for students.

Examples

How did they do that?

  • Flashy visualisation techniques that have nothing to do with topic modelling
  • Topic Modelling Tools like Mallet

Mallet

Pros

  • Considered the best implementation of Latent Dirichlet Allocation (LDA), the easiest topic modelling algorithm
  • The project is maintained and has a growing Digital Humanities user-base

Cons

  • Written in Java, requires installation from the command line
  • Output is difficult to use

GUI Topic Modeling Tool

Pros

  • .JAR wrapper for Mallet requires no installation
  • Easy to use graphical user interface for Mallet
  • Produces handy HTML topic browser

Cons

  • Doesn't implement all Mallet options
  • Mallet version is unclear
  • Project is not maintained
  • Output is difficult to use

Other Options

Topic Modelling Resources

What do I need?

How do I visualise my topic model?

Where do I go from here?