Scott Kleinman, California State University, Northridge / scott.kleinman@csun.edu
A topic model is a model of a collection of texts that assumes the texts are constructed from building blocks called "topics". Topic modelling algorithms use information in the texts themselves to generate the topics; they are not pre-assigned.
A topic model can produce amazing, magical insights about your texts...
A probability distribution over terms.
A list of terms (usually words) from your text collection, each of which has a certain probability of occurring with the other terms in the list.
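To make the "probability distribution over terms" definition concrete, here is a minimal Python sketch. The term weights are invented for illustration and do not come from any real model; `to_distribution` and `top_terms` are hypothetical helper names. It only shows that a topic's term probabilities sum to 1 and that the keyword list is simply the most probable terms.

```python
# Hypothetical raw weights for one topic; the numbers are invented for illustration.
raw_weights = {
    "religion": 40,
    "religious": 30,
    "buddhist": 15,
    "faith": 10,
    "philosophy": 5,
}


def to_distribution(weights):
    """Normalise raw weights so they sum to 1 (a probability distribution)."""
    total = sum(weights.values())
    return {term: weight / total for term, weight in weights.items()}


def top_terms(distribution, n=3):
    """Return the n most probable terms, highest probability first."""
    return sorted(distribution, key=distribution.get, reverse=True)[:n]


topic = to_distribution(raw_weights)
print(top_terms(topic))
```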
Topic Number | Prominence | Keywords |
---|---|---|
0 | 0.05337 | atlanta buck sherman coffee lil city soldiers union donaldson war opera music men dance wrote vamp |
1 | 0.03937 | chinese singapore quantum scientific china percent han evidence xinjiang physics bjork language study ethnic test culture pretest memory miksic |
2 | 0.06374 | dr medical pain osteopathic medicine brain patients doctors creativity care health smithsonian touro patient benson cancer physician skorton physicians |
3 | 0.12499 | book poetry books literary writer writers american literature fiction writing poet author freud english novels culture published review true |
4 | 0.14779 | museum art mr ms arts museums artist center artists music hong kong contemporary works director china painting local institute |
5 | 0.0484 | love robinson mr godzilla slater gorman movie mother lila literature read sachs happy taught asked writing house child lived |
6 | 0.03013 | oct org street nov center museum art gallery sundays saturdays theater free road noon arts connecticut tuesdays avenue university |
7 | 0.09138 | johnson rights editor mr wilson poverty civil war vietnam kristof president writer jan ants hope human bill lyndon presidents |
8 | 0.18166 | technology computer business ms tech engineering jobs mr women people science percent ireland work skills companies fields number company |
9 | 0.08085 | israel women police violence black gender war church poland white northern officers country trial racism rights civil justice rice |
10 | 0.94475 | advertisement people time years work make life world year young part day made place back great good times things |
11 | 0.08681 | mr chief smith russian vista company russia financial times equity dealbook million reports street private berggruen york bank executive |
12 | 0.1135 | street york show free sunday children saturday theater city monday tour friday martin center members students manhattan village west |
13 | 0.17297 | times video photo community commencement article york lesson credit read tumblr online students blog digital college plan twitter news |
14 | 0.4395 | university american research mr studies international faculty state center work dr director arts universities academic bard advertisement education history |
15 | 0.55649 | people human world professor science humanities time knowledge life questions study learn social find ways change thinking problem don |
16 | 0.10946 | mr ms professor marriage york wrote degree newark mondale mother born received father school aboulela ajami price married home |
17 | 0.3896 | years government mr president report programs public american humanities state ms year million information board budget today left private |
18 | 0.07622 | religion religious buddhist faith philosophy traditions god derrida philosophers life beliefs hope buddhism jesus doctrine stone deconstruction theology lives |
19 | 0.31793 | students school education college schools student teachers teaching graduate year harvard percent colleges class high graduates job learning universities |
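
One quick way to read the Prominence column is to rank the topics by it. The sketch below copies the twenty values from the table above; what exactly "Prominence" measures is not stated here, so the code treats it only as a relative weight, which is an assumption. Sorting shows that topic 10, made up of very general words, dominates the model.

```python
# Prominence values copied from the table above, indexed by topic number.
prominence = [
    0.05337, 0.03937, 0.06374, 0.12499, 0.14779,
    0.0484, 0.03013, 0.09138, 0.18166, 0.08085,
    0.94475, 0.08681, 0.1135, 0.17297, 0.4395,
    0.55649, 0.10946, 0.3896, 0.07622, 0.31793,
]

# Pair each value with its topic number and sort from most to least prominent.
ranked = sorted(enumerate(prominence), key=lambda pair: pair[1], reverse=True)

for topic_id, weight in ranked[:5]:
    print(f"topic {topic_id:>2}: {weight:.5f}")
```

The most prominent topics tend to be the most generic ones, which is worth keeping in mind when deciding which topics to treat as junk.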
Good topics are normally judged by the "semantic coherence" of their terms, but there is no proven statistical heuristic for demonstrating this.
Typically, human intuition is used to label the topics (e.g. Religion and Deconstructionism: religion religious buddhist faith philosophy traditions god derrida ...).
Less semantically coherent topics can be the most interesting because they bring together terms human users might not relate.
Junk topics can be ignored, but a "good" topic model will have a relatively low percentage of junk topics.
Such methodological questions make very good discussion points for students.
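For discussion purposes, the idea of "semantic coherence" can be approximated with a rough co-occurrence score. The sketch below is written in the style of the UMass coherence measure, which is not mentioned in this text and is brought in here only as an illustration; the `coherence` function and the four toy documents are hypothetical. It rewards topics whose top terms tend to appear in the same documents, and it does not replace human judgement.

```python
import math
from itertools import combinations


def coherence(top_terms, documents):
    """UMass-style score: for each ordered pair of top terms, add
    log((documents containing both + 1) / documents containing the higher-ranked term)."""
    doc_sets = [set(doc.lower().split()) for doc in documents]

    def doc_freq(*terms):
        # Number of documents that contain every one of the given terms.
        return sum(1 for words in doc_sets if all(t in words for t in terms))

    score = 0.0
    # combinations() keeps list order, so `higher` always outranks `lower`.
    for higher, lower in combinations(top_terms, 2):
        freq = doc_freq(higher)
        if freq == 0:
            continue  # skip terms that never occur in the collection
        score += math.log((doc_freq(higher, lower) + 1) / freq)
    return score


# Toy documents invented for illustration.
docs = [
    "religion faith god",
    "religion faith philosophy",
    "museum art gallery",
    "museum art religion",
]

print(coherence(["religion", "faith"], docs))
print(coherence(["religion", "gallery"], docs))
```

Scores closer to zero indicate terms that co-occur more often. A topic that scores poorly may still be one of the interesting ones described above, so the number is a prompt for discussion rather than a verdict.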
Pros | Cons |
---|---|
Considered the best implementation of Latent Dirichlet Allocation (LDA), the easiest topic modelling algorithm | Written in Java, requires installation from the command line |
The project is maintained and has a growing Digital Humanities user-base | Output is difficult to use |
Pros | Cons |
---|---|
.JAR wrapper for Mallet requires no installation | Doesn't implement all Mallet options |
Easy to use graphical user interface for Mallet | Mallet version is unclear |
Produces handy HTML topic browser | Project is not maintained |
| | Output is difficult to use |

Upload the .word-topic-counts file to the Lexos Multicloud Tool for better results.
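To see what a word-topic-counts file contains, the following Python sketch parses a few sample lines into per-topic term counts. It assumes each line has the form `index term topic:count topic:count ...`; check this against your own file before relying on it. The sample lines and the `parse_word_topic_counts` function are hypothetical.

```python
from collections import defaultdict


def parse_word_topic_counts(lines):
    """Return {topic_id: {term: count}} from lines such as '0 atlanta 0:12 4:3'.

    Assumed layout: term index, term, then one or more topic:count pairs.
    """
    topics = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        term = parts[1]
        for pair in parts[2:]:
            topic_id, count = pair.split(":")
            topics[int(topic_id)][term] = int(count)
    return dict(topics)


# Invented sample lines in the assumed layout.
sample = [
    "0 atlanta 0:12 4:3",
    "1 museum 4:40 6:9",
    "2 art 4:35",
]

topics = parse_word_topic_counts(sample)

# Terms in topic 4, most frequent first.
top = sorted(topics[4].items(), key=lambda item: item[1], reverse=True)
print(top)
```

To parse a real file, pass an open file handle instead of the sample list, for example `parse_word_topic_counts(open(path, encoding="utf-8"))`, where `path` is wherever you saved the file.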