A week or two ago, we had an interesting discussion here on EMOB about the uses and limits of computers as aids to literary scholarship. Below are some of my own summary thoughts on the subject, along with links to related work that I’ve found useful or thought-provoking.
The first point to mention is that the things computers are good at are very different from the things humans are good at. The worthwhile work in digital humanities (“DH” for short, a synonym for computationally assisted humanities research) keeps this fact in mind. Computers are useful for doing quickly certain basic (that is, boring) tasks that humans do slowly. They’re really good at counting, for instance. But sometimes, happily, these kinds of quantitative improvements in speed produce qualitative changes in the kinds of questions we can pose about the objects that interest us. So we literary scholars don’t want to ask computers to do our close reading for us. We want them to help us work differently by expanding what we can read (or at least interpret) and how we can read it. And we want to keep in mind that reading itself is just one (extraordinarily useful) analytical technique when it comes to understanding literary or social-aesthetic objects.
There are two main classes of literary problems that might immediately benefit from computational help. In the first, you’re looking for fresh insights into texts you already know (presumably because you’ve read them closely). In the second, you’d like to be able to say something about a large collection of texts you haven’t read (and probably can’t read, even in principle, because there are too many of them; think of the set of all novels written in English). In both cases, it would almost certainly be useful to classify or group the texts together according to various criteria, a process that is in fact at the heart of much computationally assisted literary work.
In the first case, what you’re looking for are new ways to connect or distinguish known texts. Cluster analysis is one way to do this. You take a group of texts (Shakespeare’s plays, for instance), feed them through an algorithm that assesses their similarity or difference according to a set of known features or metrics (sentence length, character or lemma n-gram frequency, part of speech frequency, keyword frequency, etc.—the specific metrics need to be worked out by a combination of so-called “domain knowledge” and trial and error), and produce a set of clusters that rank the relative similarity of each work to the others. Typical output looks something like this figure from Matthew Jockers’ blog (click the image to see it full size in its original context):
Read this diagram from the top down; the lower the branch point between two items or groups, the more closely related they are.
This may or may not be interesting. Note in particular that the cluster labels are supplied by the user, outside the computational process. In other words, the algorithm doesn’t know what the clusters mean, nor what the clustered works have in common. Still, why does Othello cluster with the comedies rather than the tragedies (or the histories, to which the tragedies are more closely related than the comedies)? The clustering process doesn’t answer that question, but I might never have thought to ask it if I hadn’t seen these results. Maybe I won’t have anything insightful to say in answer to it, but then that’s true of any other question I might ask, and at least now I have a new potential line of inquiry (which is perhaps no mean thing when it comes to Shakespeare).
(As an aside, the extent to which I’m likely to explain the categorization of Othello as a simple error instead of as something that requires further thought and attention will depend on how well I think the clustering process works overall, which in turn will depend to at least some extent on how well it reproduces my existing expectations about generic groupings in Shakespeare. The most interesting case, probably, is the one in which almost all of my expectations are met and confirmed—thereby giving me faith in the accuracy of the overall clustering—but a small number of unexpected results remain, particularly if the anomalous results square in some way with my previously undeveloped intuitions.)
Even more compelling to me, however, is the application of these and related techniques to bodies of text that would otherwise go simply unread and unanalyzed. If you’re working on any kind of large-scale literary-historical problems, you come up very quickly against the limits of your own reading capacity; you just can’t read most of the books written in any given period, much less over the course of centuries. And the problem only gets worse as you move forward in time, both because there’s more history to master and because authors keep churning out new material at ever-increasing rates. But if you can’t read it all, and if (as I said above) you can’t expect a computer to read it for you, what can you possibly do with all this stuff that currently, for your research purposes, may as well not exist?
Well, you can try to extract data of some kind from it, then group and sort and classify it. This might do a few different things for you:
- It might allow you to test, support, or refine your large-scale claims about developments in literary and social history. If you think that allegory has changed in important and specific ways over the last three centuries, you might be able to test that hypothesis across a large portion of the period’s literary output. You’d do that by training an algorithm on a smallish set of known allegorical and non-allegorical works, then setting it loose on a large collection of novels. (This process is known as supervised classification or supervised learning, in contrast to the un- or semi-supervised clustering described briefly above. For more details, see the Jain article linked at the end of this post.). The algorithm will classify each work in the large collection according to its degree of “allegoricalness” based on the generally low-level differences gleaned from the training set. At that point, it’s up to you, the researcher, to make sense of the results. Are the fluctuations in allegorical occurrence important? How does the genre vary by date, national origin, gender, etc.? Why does it do so? In any case, what’s most exciting to me is the fact that you’re now in position to say something about these works, even if you won’t have particular insight into any one of them. Collectively, at least, you’ve retrieved them from irrelevance and opened up a new avenue for research.
- The same process might also draw your attention to a particular work or set of works that you’d otherwise not have known about or thought to study. If books by a single author or those written during a few years in the early nineteenth century score off the charts in allegoricalness, it might be worth your while to read them closely and to make them the objects of more conventional literary scholarship. Again, the idea is that this is something you’d have missed completely in the absence of computational methods.
- Finally, you might end up doing something like the Shakespearean clustering case above; maybe a book you do know and have always considered non-allegorical is ranked highly allegorical by the computer. Now, you’re probably right and the computer’s probably wrong about that specific book, but it might be interesting to try to figure out what it is about the book that produces the error, and to consider whether or not that fact is relevant to your interpretation of the text.
One note of particular interest to those who care deeply about bibliography. In an earlier post about Google Book Search (a service tellingly renamed from the original Google Print), there was some debate about whether GBS is a catalog or a finding aid, and whether or not full-text search takes the place of human-supplied metadata. I think it’s obvious that both search and metadata are immensely useful and that neither can replace the other. One thing that text mining and classification might help with, though, is supplying metadata where none currently exists. Computationally derived subject headings almost certainly wouldn’t be as good as human-supplied ones, but they might be better than nothing if you have a mess of older records or very lightly curated holdings (as is true of much of the Internet Archive and GBS alike, for instance).
Finally, some links to useful and/or interesting material:
- The MONK Project. A discussion of MONK, which is an attempt to bring corpus-oriented text analysis to the English-department mainstream, set all this in motion here on EMOB.
- A recent (2009) review essay on the state of computational clustering methods. A bit technical (it’s from a CS department), but still very readable. Jain, A. K. “Data Clustering: 50 Years Beyond K means” (PDF).
- John Unsworth (dean of library and information science at UIUC) et al.’s “How Not to Read a Million Books,” an essay on what to do with large text corpora. Plenty of good links, too.
- As an example, Tanya Clement has done nifty digital work on Gertrude Stein: Clement, T. E. “‘A thing not beginning and not ending’: using digital tools to distant-read Gertrude Stein’s The Making of Americans. Literary and Linguistic Computing 23.3 (2008): 361.
- Literary and Linguistic Computing, in which Clement’s article appeared, is the leading literary DH journal.
- There are two large essay collections from Blackwell, both of which are freely available online: The Companion to Digital Humanities and the Companion to Digital Literary Studies. They provide a good overview of the field, and their contributors lists are useful who’s whos.
- Martin Mueller, professor of English and classics at Northwestern, has an interesting blog on what he calls “literary informatics”: “DATA: Digitally Assisted Text Analysis.”
- Matthew Jockers, mentioned above, is prominent in the field and has a number of cool projects going. See his blog at Stanford.
- Lisa Spiro (Rice University Libraries) and Dan Cohen (history, GMU) are well worth following to keep up on new developments in digital humanities.
- My own blog, “Work Product,” which covers whatever I happen to be working on at the moment, often DH-related.