Our project takes a relatively new computational modeling method, topic modeling, and situates it historically by investigating its paper-based antecedents. Our current focus is a comparison between the eighteenth-century subject index—a genre that reflects the period’s emphasis on producing systematic knowledge—and the twenty-first-century topic modeling method known as Latent Dirichlet Allocation. Prefacing a machine-learning term with a historical designation sounds odd (in part, because there is no such thing as an eighteenth- or nineteenth-century topic model), but we contend that this de-familiarizing move offers a useful way of locating the conceptual work being done by topic modeling. Our critical method enacts a kind of epistemological anachronism by producing a statistical correlation of eighteenth-century subject headings and topic distributions.
In most cases, quantitative methods like topic modeling are used to identify a pattern within a set of textual data, and that pattern is interpreted within an already established set of historical, cultural, or generic expectations. D. Sculley and Bradley Pasanek have usefully explored the challenge this approach presents because of its tendency to produce hermeneutic circularity (Sculley and Pasanek 410). Our approach does not solve the problem of circularity; we accept it as an obstacle inherent in any interpretive act. Instead we confront circularity by relating the computational model to another historical referent (the subject index), which shifts our interpretation from the model itself to the relationship between the model and the index, at specific sites of correlation and contradiction.
We first imagined our tool, the Networked Corpus, as an algorithmic method for marking passages with topical or discursive similarities. Inspired by the practice of cross-referencing passages in printed books, the tool generates hyperlinks between passages that share topics according to a topic model. In this original conception, the tool echoed the practice of Renaissance commonplacing, which involved collecting literary exemplars under thematic or logical headings. Our tool differed, however, in not giving fixed names to the topical units—instead of enabling navigation from heading to passage, the Networked Corpus encourages users to navigate from one passage to another, without assuming any prior knowledge of what the topics that connect them might be.
One of our reasons for choosing this design is that the topics of topic modeling are difficult to name in a way that accurately reflects what they represent. Although the topics can sometimes appear to correspond to headings such as one might find in an index, they almost always contain words that are difficult to account for under so simple an interpretation. Thinking about which navigation paradigms might work best with topic modeling led us to more general questions about the ways in which abstract models of texts can guide reading. We also realized that some of the same issues applied to the sort of print indexing that became popular in the late eighteenth century, which could be said to involve its own abstract model of textual content.
We decided to put these two different models into dialogue by comparing a print index from the eighteenth century to a topic model trained on the same text. We chose the index from the 1784 edition of Adam Smith’s Wealth of Nations because it is an exceptionally detailed index, and because Smith’s text exemplifies one of the theoretical issues that we’re dealing with—the relation between abstract models and concrete particulars. The text also frequently switches between different conceptual frameworks that have distinctive vocabularies, making it unusually well-suited to topic modeling, which generally does not work very well when trained on a single book. We downloaded a copy of the text and index from Project Gutenberg, and used a Python script to split the file up and parse the index into a data structure that could be easily manipulated.
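As an illustration, the parsing step might look something like the following sketch. The entry format it assumes—a heading before the first comma, followed by Arabic page numbers—is a simplification; the actual Gutenberg transcription would require extra handling for run-on lines, sub-entries, and roman numerals.

```python
import re
from collections import defaultdict

def parse_index(index_text):
    """Parse an index into a {heading: set_of_pages} mapping.

    Assumes entries of the form 'Labour, wages of, 82, 105.' with
    the heading before the first comma and Arabic page numbers;
    this is a sketch, not the original script.
    """
    headings = defaultdict(set)
    for line in index_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The heading is the leading run of letters before the first comma.
        match = re.match(r"([A-Za-z][A-Za-z' -]*)", line)
        if not match:
            continue
        heading = match.group(1).strip()
        # Every number on the line is treated as a page reference.
        pages = {int(n) for n in re.findall(r"\d+", line)}
        if pages:
            headings[heading] |= pages
    return dict(headings)
```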
Our first goal after parsing the index was to come up with a way of determining how similar it was to a topic model. To do this, we needed to find possible matches between index headings and topics. There is a conceptual difficulty here, because the index and the topic model do not have quite the same structure. While the topic model assumes that pages can draw on topics to varying degrees represented by numbers between zero and one, the index headings either refer to a page or do not. There are also a large number of index entries for very specific subjects that only refer to a few pages, so we would only expect a fraction of the index headings to correlate with any topics.
We decided that the best way of dealing with this was to use a rank correlation formula—specifically, Spearman’s rho. This method correlates topics with index headings entirely in terms of a rank ordering; in the case of the topic model, the pages are ranked by the coefficient for the topic, and in the case of the index, all of the pages indexed under the heading are ranked above all those that are not. Using this definition, a perfect correlation would mean that the pages indexed under that heading always have a higher topic coefficient than the ones that are not. Each of the cases where this is not true decreases the correlation coefficient.
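A minimal, pure-Python sketch of this calculation: index membership under a heading is encoded as a 0/1 vector over pages, and Spearman's rho is computed as the Pearson correlation of tie-averaged ranks. (Our actual code may well have used a library routine such as scipy.stats.spearmanr; the hand-rolled version and the four-page example below are purely illustrative.)

```python
def _ranks(values):
    """Tie-averaged ranks (1-based), as used by Spearman's rho."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0

# One topic's per-page coefficients vs. a 0/1 index-membership vector:
coeffs = [0.6, 0.1, 0.4, 0.05]   # hypothetical topic weights for 4 pages
indexed = [1, 0, 1, 0]           # pages 0 and 2 appear under the heading
```

Note that because the 0/1 index side is full of ties, even a perfect separation of indexed from unindexed pages—as in the hypothetical vectors above—yields a rho of about 0.89 rather than 1 under this formula.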
We first used this calculation to determine how many topics the model should include in order to match the index as well as possible according to our chosen coefficient. For each possible number of topics from 5 to 60, we generated 40 topic models of The Wealth of Nations using the topic modeling program MALLET (automated with a Python script). For each of these models, we then counted the index headings that correlate reasonably well (rho >= 0.25) with some topic in the model. We plotted the results using R:
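The model-generation loop can be sketched as follows. The file names, seed scheme, and the step size of the topic-count sweep are all assumptions (and the corpus file must first be produced with MALLET's `import-file` command), but the flags shown are standard `train-topics` options.

```python
import subprocess

def mallet_train_cmd(num_topics, seed, corpus="wealth_of_nations.mallet"):
    """Build one MALLET train-topics invocation.

    File names are placeholders, not the paths used in the project.
    """
    return ["mallet", "train-topics",
            "--input", corpus,
            "--num-topics", str(num_topics),
            "--random-seed", str(seed),
            "--output-doc-topics",
            "doc_topics_{}_{}.txt".format(num_topics, seed)]

def sweep(run=subprocess.run):
    """Generate 40 models for each topic count from 5 to 60.

    The step size of 5 is an assumption; the original sweep may have
    tried every integer value in the range.
    """
    for num_topics in range(5, 61, 5):
        for seed in range(40):
            run(mallet_train_cmd(num_topics, seed), check=True)
```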
Since the number of index headings matched does not increase much after the number of topics exceeds 40, we concluded that a topic model with 40 topics would be the best to compare to the index, and generated a model with this number of topics to use as our comparator.
Although we did find correlations that were strong enough to establish matches between the index and the topic model, the highest correlation coefficients are still fairly low (generally around 0.35), suggesting that the topic model does not do a very good job of predicting where a page will appear in the index. However, many of the matches do make conceptual sense, despite the large number of pages where the correlations break down. For example, the topic with the top words “wages labour common workmen employments year employment” correlates best with the index heading “Labour”. Our hunch was that looking at the particular passages where the index and the topic model fail to match up could be revealing about the different assumptions that underlie the two models. To enable this sort of reading, we created a special version of the Networked Corpus that shows the index and the topic model side-by-side, and enables the user to view a list of the passages where a particular heading and topic do and do not coincide.
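At bottom, this side-by-side view rests on a simple set comparison: the pages indexed under a heading, the pages the model assigns the topic strongly, their overlap, and the two one-sided differences. A sketch, in which the cutoff for counting a page as "about" a topic is an assumed parameter rather than a value from the original tool:

```python
def compare_heading_topic(heading_pages, topic_coeffs, cutoff=0.1):
    """Partition pages by whether the heading and the topic agree.

    heading_pages: set of pages indexed under the heading.
    topic_coeffs: {page: coefficient} from the topic model.
    cutoff: illustrative threshold, not the original tool's value.
    """
    topic_pages = {p for p, c in topic_coeffs.items() if c >= cutoff}
    return {
        "both": heading_pages & topic_pages,        # the two models agree
        "index_only": heading_pages - topic_pages,  # indexed, low topic weight
        "topic_only": topic_pages - heading_pages,  # topical, but not indexed
    }
```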
Interpreting the two models by means of this visualization has enabled us to gain a new perspective on a technology that is so familiar as to appear transparent—the index—and has also helped us to better understand some of the conceptual issues that can arise in the interpretation of topic models. Although the code that we developed for comparing topic models and indexes is of relatively limited applicability, we believe that our theoretical approach could be applied elsewhere. Many of the computational methods we use today have precedents in the pre-computer era; and many artifacts from the past can be understood as embodiments of abstract models that could be interpreted through comparison with computational analogues. Our purpose in writing software to facilitate these comparisons is not the construction of new tools that can be used repeatedly, but the interrogation of the tools that we are already using, be they old or new.
All of the code we used in this project is available online here and, for the Smith version, here.
Bender, John, and Michael Marrinan. The Culture of Diagram. Stanford: Stanford University Press, 2010.

McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” 2002. http://mallet.cs.umass.edu. Online.

Sculley, D., and B. M. Pasanek. “Meaning and Mining: The Impact of Implicit Assumptions in Data Mining for the Humanities.” Literary and Linguistic Computing 23.4 (2008): 409–424.