Reading with Machines

A week or two ago, we had an interesting discussion here on EMOB about the uses and limits of computers as aids to literary scholarship. Below are some of my own summary thoughts on the subject, along with links to related work that I’ve found useful or thought-provoking.

The first point to mention is that the things computers are good at are very different from the things humans are good at. The worthwhile work in digital humanities (“DH” for short, a synonym for computationally assisted humanities research) keeps this fact in mind. Computers are useful for doing quickly certain basic (that is, boring) tasks that humans do slowly. They’re really good at counting, for instance. But sometimes, happily, these kinds of quantitative improvements in speed produce qualitative changes in the kinds of questions we can pose about the objects that interest us. So we literary scholars don’t want to ask computers to do our close reading for us. We want them to help us work differently by expanding what we can read (or at least interpret) and how we can read it. And we want to keep in mind that reading itself is just one (extraordinarily useful) analytical technique when it comes to understanding literary or social-aesthetic objects.

There are two main classes of literary problems that might immediately benefit from computational help. In the first, you’re looking for fresh insights into texts you already know (presumably because you’ve read them closely). In the second, you’d like to be able to say something about a large collection of texts you haven’t read (and probably can’t read, even in principle, because there are too many of them; think of the set of all novels written in English). In both cases, it would almost certainly be useful to classify or group the texts together according to various criteria, a process that is in fact at the heart of much computationally assisted literary work.

In the first case, what you’re looking for are new ways to connect or distinguish known texts. Cluster analysis is one way to do this. You take a group of texts (Shakespeare’s plays, for instance), feed them through an algorithm that assesses their similarity or difference according to a set of known features or metrics (sentence length, character or lemma n-gram frequency, part-of-speech frequency, keyword frequency, etc.; the specific metrics need to be worked out by a combination of so-called “domain knowledge” and trial and error), and produce a set of clusters that rank the relative similarity of each work to the others. Typical output looks something like this figure from Matthew Jockers’ blog:

[Figure: cluster dendrogram of Shakespeare’s plays, from Matthew Jockers’ blog]

Read this diagram from the top down; the lower the branch point between two items or groups, the more closely related they are.

This may or may not be interesting. Note in particular that the cluster labels are supplied by the user, outside the computational process. In other words, the algorithm doesn’t know what the clusters mean, nor what the clustered works have in common. Still, why does Othello cluster with the comedies rather than the tragedies (or the histories, to which the tragedies are more closely related than the comedies)? The clustering process doesn’t answer that question, but I might never have thought to ask it if I hadn’t seen these results. Maybe I won’t have anything insightful to say in answer to it, but then that’s true of any other question I might ask, and at least now I have a new potential line of inquiry (which is perhaps no mean thing when it comes to Shakespeare).

(As an aside, the extent to which I’m likely to explain the categorization of Othello as a simple error instead of as something that requires further thought and attention will depend on how well I think the clustering process works overall, which in turn will depend to at least some extent on how well it reproduces my existing expectations about generic groupings in Shakespeare. The most interesting case, probably, is the one in which almost all of my expectations are met and confirmed—thereby giving me faith in the accuracy of the overall clustering—but a small number of unexpected results remain, particularly if the anomalous results square in some way with my previously undeveloped intuitions.)
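The clustering procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method behind Jockers’ figure: the four "texts" are invented word lists standing in for whole plays, the only feature is relative word frequency, and the similarity measure (cosine distance) and merge rule (average linkage) are one plausible choice among many.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical toy snippets standing in for full play texts (not real data).
TEXTS = {
    "comedy_a": "love jest marry laugh love wedding jest",
    "comedy_b": "marry love laugh jest wedding love",
    "tragedy_a": "death blood revenge grief death murder",
    "tragedy_b": "blood death grief murder revenge blood",
}

def features(text):
    """Relative word frequencies: one simple choice of feature."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine_distance(a, b):
    """0.0 for identical frequency profiles, 1.0 for no shared words."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def cluster(texts):
    """Average-linkage agglomerative clustering; returns the merge history."""
    vecs = {name: features(t) for name, t in texts.items()}
    clusters = [(name,) for name in vecs]  # every text starts in its own cluster
    merges = []

    def avg_dist(pair):
        c1, c2 = pair
        total = sum(cosine_distance(vecs[x], vecs[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two closest clusters and record the distance at which they join.
        c1, c2 = min(combinations(clusters, 2), key=avg_dist)
        merges.append((c1, c2, avg_dist((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

if __name__ == "__main__":
    for left, right, dist in cluster(TEXTS):
        print(f"{dist:.3f}  {left} + {right}")
```

The merge history is what a dendrogram draws: pairs that join at a small distance sit on low branch points. Note that the labels "comedy" and "tragedy" appear only in the names chosen by the user; the algorithm never sees them.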

Even more compelling to me, however, is the application of these and related techniques to bodies of text that would otherwise go simply unread and unanalyzed. If you’re working on any kind of large-scale literary-historical problems, you come up very quickly against the limits of your own reading capacity; you just can’t read most of the books written in any given period, much less over the course of centuries. And the problem only gets worse as you move forward in time, both because there’s more history to master and because authors keep churning out new material at ever-increasing rates. But if you can’t read it all, and if (as I said above) you can’t expect a computer to read it for you, what can you possibly do with all this stuff that currently, for your research purposes, may as well not exist?

Well, you can try to extract data of some kind from it, then group and sort and classify it. This might do a few different things for you:

It might allow you to test, support, or refine your large-scale claims about developments in literary and social history. If you think that allegory has changed in important and specific ways over the last three centuries, you might be able to test that hypothesis across a large portion of the period’s literary output. You’d do that by training an algorithm on a smallish set of known allegorical and non-allegorical works, then setting it loose on a large collection of novels. (This process is known as supervised classification or supervised learning, in contrast to the un- or semi-supervised clustering described briefly above. For more details, see the Jain article linked at the end of this post.) The algorithm will classify each work in the large collection according to its degree of “allegoricalness” based on the generally low-level differences gleaned from the training set. At that point, it’s up to you, the researcher, to make sense of the results. Are the fluctuations in allegorical occurrence important? How does the genre vary by date, national origin, gender, etc.? Why does it do so? In any case, what’s most exciting to me is the fact that you’re now in a position to say something about these works, even if you won’t have particular insight into any one of them. Collectively, at least, you’ve retrieved them from irrelevance and opened up a new avenue for research.
The same process might also draw your attention to a particular work or set of works that you’d otherwise not have known about or thought to study. If books by a single author or those written during a few years in the early nineteenth century score off the charts in allegoricalness, it might be worth your while to read them closely and to make them the objects of more conventional literary scholarship. Again, the idea is that this is something you’d have missed completely in the absence of computational methods.
Finally, you might end up doing something like the Shakespearean clustering case above; maybe a book you do know and have always considered non-allegorical is ranked highly allegorical by the computer. Now, you’re probably right and the computer’s probably wrong about that specific book, but it might be interesting to try to figure out what it is about the book that produces the error, and to consider whether or not that fact is relevant to your interpretation of the text.
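The supervised workflow described above (train on a small labeled set, then score a large unread collection) might look roughly like the following. It is a minimal sketch under loud assumptions: the training "works" and the unread "novels" are invented word lists, and the classifier is a hand-rolled naive Bayes model with add-one smoothing, chosen only because it is short; nothing here is claimed to be the method used in any actual study.

```python
from collections import Counter
import math

# Hypothetical hand-labeled training set (toy word lists, not real novels).
TRAINING = [
    ("pilgrim journey virtue vice temptation city", "allegorical"),
    ("virtue despair giant pilgrim burden journey", "allegorical"),
    ("drawing room marriage income letter carriage", "non-allegorical"),
    ("letter estate marriage dinner carriage cousin", "non-allegorical"),
]

# Hypothetical "unread" collection to be scored.
UNREAD = {
    "novel_x": "pilgrim virtue journey",
    "novel_y": "marriage letter carriage",
}

def train(examples):
    """Count words per label: the low-level differences the model learns."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        word_counts.setdefault(label, Counter()).update(text.split())
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def allegoricalness(text, model):
    """Probability of the 'allegorical' label under naive Bayes (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    log_scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for w in text.split():
            if w in vocab:  # words never seen in training carry no evidence
                score += math.log((counts[w] + 1) / (total + len(vocab)))
        log_scores[label] = score
    # Convert log scores to probabilities, subtracting the max for numerical stability.
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    return exp["allegorical"] / sum(exp.values())

if __name__ == "__main__":
    model = train(TRAINING)
    ranked = sorted(UNREAD, key=lambda t: allegoricalness(UNREAD[t], model), reverse=True)
    for title in ranked:
        print(f"{title}: {allegoricalness(UNREAD[title], model):.3f}")
```

Sorting by score surfaces both of the cases discussed above: unknown works that score unusually high and might repay close reading, and familiar works whose score contradicts your own judgment. Interpreting either result remains the researcher’s job.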

One note of particular interest to those who care deeply about bibliography. In an earlier post about Google Book Search (a service tellingly renamed from the original Google Print), there was some debate about whether GBS is a catalog or a finding aid, and whether or not full-text search takes the place of human-supplied metadata. I think it’s obvious that both search and metadata are immensely useful and that neither can replace the other. One thing that text mining and classification might help with, though, is supplying metadata where none currently exists. Computationally derived subject headings almost certainly wouldn’t be as good as human-supplied ones, but they might be better than nothing if you have a mess of older records or very lightly curated holdings (as is true of much of the Internet Archive and GBS alike, for instance).

Finally, some links to useful and/or interesting material:

The MONK Project. A discussion of MONK, which is an attempt to bring corpus-oriented text analysis to the English-department mainstream, set all this in motion here on EMOB.
A recent (2009) review essay on the state of computational clustering methods. A bit technical (it’s from a CS department), but still very readable. Jain, A. K. “Data Clustering: 50 Years Beyond K means” (PDF).
John Unsworth (dean of library and information science at UIUC) et al.’s “How Not to Read a Million Books,” an essay on what to do with large text corpora. Plenty of good links, too.
As an example, Tanya Clement has done nifty digital work on Gertrude Stein: Clement, T. E. “‘A thing not beginning and not ending’: using digital tools to distant-read Gertrude Stein’s The Making of Americans.” Literary and Linguistic Computing 23.3 (2008): 361.
Literary and Linguistic Computing, in which Clement’s article appeared, is the leading literary DH journal.
There are two large essay collections from Blackwell, both of which are freely available online: The Companion to Digital Humanities and the Companion to Digital Literary Studies. They provide a good overview of the field, and their contributors lists are useful who’s whos.
Martin Mueller, professor of English and classics at Northwestern, has an interesting blog on what he calls “literary informatics”: “DATA: Digitally Assisted Text Analysis.”
Matthew Jockers, mentioned above, is prominent in the field and has a number of cool projects going. See his blog at Stanford.
Lisa Spiro (Rice University Libraries) and Dan Cohen (history, GMU) are well worth following to keep up on new developments in digital humanities.
My own blog, “Work Product,” which covers whatever I happen to be working on at the moment, often DH-related.

This entry was posted on August 11, 2009 at 1:58 pm and is filed under Digital Literary Studies, Google Book Search, MONK.

23 Responses to “Reading with Machines”

Reading with Machines « Work Product Says:
August 11, 2009 at 2:02 pm
[…] to Comments I just put up a longish post over at Early Modern Online Bibliography called “Reading with Machines.” It’s a highly selective and impressionistic overview of literary DH work, plus a […]

Jennifer Says:
August 11, 2009 at 5:01 pm
Wow. There’s a lot to process in this post, and I haven’t had the chance to do it thoroughly yet (and won’t have the chance to work all the way through all the links and so forth in the immediate future), so I’m not going to be saying anything scintillating in this comment, but your analysis is getting me to look in a very helpful direction. Thanks!

Anna Battigelli Says:
August 11, 2009 at 5:17 pm
Thanks, Matthew. What a helpful overview! Jockers’ clustering and classification of Shakespeare’s plays still mystifies me: you mention the strangeness of listing Othello as a comedy, but why is A Midsummer Night’s Dream listed as a tragedy? One can perhaps guess why Romances like Cymbeline, The Winter’s Tale, and The Tempest are pushed into the tragedy cluster, though that, too, merits analysis. If the classification program works, then perhaps these aberrations will tell us something interesting about the particular use of language in these plays or, depending on what gets analyzed, something about the plays’ structure.

Thanks also for the nice bibliography. I will spend some time looking over the links you provide. Again, thanks!

Eleanor Shevlin Says:
August 11, 2009 at 11:08 pm
Thanks so much, Matthew, for this very rich post. I especially liked your reminder about what computers can do well that humans can’t and the need to play to the strengths of the machine when devising computational projects. Your point that reading, despite being a prime method for textual analysis, is not the only “analytical technique when it comes to understanding literary or social-aesthetic objects” dovetails well with the capability these tools afford for asking new questions and discovering new ways of working with texts.

As for your remarks about GBS, I wonder if you know of anyone who is working on devising computational subject headings for Google texts–Google or others (if that would be a possibility)? Such a project could be fascinating.

As you note, it is very telling that Google changed the name of its “library” from Google Print to Google Book Search. One of the reasons Google gave for the switch was that users came to “Google Print” expecting to be able to print everything they found. I would add that people were probably expecting full-text access and the like (some new to Google still approach it with this expectation)–much like one would find on Project Gutenberg or accessing articles (if one belonged to a subscribing institution) found in JStor or Project Muse. GBS enables you to search for books, of course, but its real strength in its present incarnation is, in my mind, its capability of searching across texts.

Also, many thanks for the list of selected links–they are welcome and well-chosen.
I have added three of the readings from the Blackwell Companion works

Warwick’s “Print Scholarship and Digital Resources” and Deegan and Tanner’s “Conversion of Primary Sources” (Companion to Digital Humanities) and Damian-Grint’s “Eighteenth-Century Literature in English and Other Languages: Image, Text, and Hypertext” (Companion to Digital Literary Studies)
http://www.digitalhumanitieshttp://www.digitalhumanities

to the bibliography (see Post 2) because they seemed quite relevant to our upcoming roundtable discussions (Anna had already included Steggle’s “Knowledge Will be Multiplied” from the Blackwell Companion to DLS).

- Matthew Wilkens Says:
  August 12, 2009 at 12:03 pm
  “As for your remarks about GBS, I wonder if you know of anyone who is working on devising computational subject headings for Google texts–Google or others (if that would be a possibility)? Such a project could be fascinating.”
  
  I’m not aware of any such projects beyond some vague hints about “collaboration” from people at Stanford, but I’d be surprised if Google didn’t have something like this in development, since it’s so closely related to their search business. They’re probably the only ones who could develop such a thing at the moment, since no one else has programmatic access to their holdings. If and when the settlement is approved and academic researchers get “non-consumptive” access to the data, I suspect this would be an attractive early project.
  
  - Eleanor Shevlin Says:
    August 12, 2009 at 9:43 pm
    Thanks, Matthew. I figured that at this stage they were the only ones who could develop such headings because of the kind of access needed to their holdings.
    
    Also, I had forgotten about the “non-consumptive” access to the data clauses in the settlement. Would any aspect of your project fall under such use?
    
  - Eric Says:
    August 25, 2009 at 11:28 pm
    I’d be careful about the Google Book Settlement.
    
    One of the problems is that the settlement prohibits creation of a service on the “research corpus” that competes with Google’s (see page 82 of the settlement).
    
    So if you do a better job at text mining / entity extraction etc. than Google (or other Rightsholders) and you’d like to make your results usable so other researchers can build upon them (“Scholarship as a Service”?) then you’d be in breach of the terms of use of the research corpus.
    
  - Eleanor Shevlin Says:
    August 26, 2009 at 2:40 am
    Oh, I am quite concerned about the way the settlement will affect Google Book Search.
    
  - Anna Battigelli Says:
    August 26, 2009 at 8:43 am
    I would be interested in hearing more about the settlement clause regarding GBS’s “non consumptive” access.
    
  - Matthew Wilkens Says:
    August 26, 2009 at 10:11 am
    No one really knows how the GBS research corpus will work as a practical matter. But I have to say it doesn’t surprise or concern me that you can’t use it to compete with Google. They’re giving you a huge, valuable resource and saying “have at it for your scholarly work.” That’s a pretty good deal, in my book. If you want to turn that access into a business, well, then you’re going to need to pay. Sure, I wish the whole thing were just being donated to the public domain, but that’s not what’s on offer.
    
    My overall take on the settlement: http://wp.me/pl9RM-8O
    
  - Anna Battigelli Says:
    August 26, 2009 at 10:53 am
    Matthew’s positive review of the GBS settlement is well worth reading: http://wp.me/p19RM-80.
    
  - Eric Says:
    August 26, 2009 at 11:52 am
    Matt, I read your take on the GBS and you make a good argument for supporting it. However, I can’t agree with it without some key changes. In my mind the dangers of entrenching Google as a monopoly in this space far outweigh the benefits offered by the settlement.
    
    There are other important objections with regard to the privacy issues and user data capture that will be required under the access and use restrictions. Remember this is a company that already monitors a tremendous amount of user data (some 88% of all web traffic! http://knowprivacy.org/), and is moving toward “behavioral advertising”.
    
    What’s bad about this for scholars? I think there can be a “chilling effect” with the privacy issues. Google does not have the same values found in your university library, and will exploit data about your use of their corpus. They can also remove works with no notice or recourse, again, not like a university library.
    
    With regard to the research use of the corpus, it’s true that nobody knows how it will play out. I think for researchers on the computational side, it’ll be a huge boon, since they’ll have a big data set to use to test new algorithms.
    
    However, humanities scholars are on the more “applied” side of this. They’re more likely to want to use text-mining techniques to better understand a collection. Where I see a problem is that they will not have clear permissions to share their understandings, especially as a new service (say one with enhanced, discipline-specific metadata over a portion of the corpus), because that service may “compete with Google” or other “Rightsholders”. I really think that restriction matters.
    
    The settlement also places restrictions on data extracted (through mining and other means) from copyrighted works. This is also a problem, because it weakens the public domain status of facts/ideas. If Google launches a Wolfram|Alpha-like service on this corpus, they will also likely act like Wolfram|Alpha and claim ownership of mined “facts”.
    
    None of this is good for researchers in the long term. Now, I’m not saying this has to be a totally “open” resource (it can’t because of the copyright status of many of the books). All I’m saying is that we should be REALLY concerned. We should push for some additional protections.
    
    On that note, here’s a nice idea:
    http://www.eff.org/deeplinks/2009/06/should-google-have-s
    
Anna Battigelli Says:
August 12, 2009 at 8:15 am
Thomas Rommel’s 2004 chapter called “Literary Studies” in the Blackwell Companion to Digital Humanities cites Jerome McGann’s explanation for why computational analysis has not yet taken hold in literary studies. According to McGann, the

general field of humanities education and scholarship will not take the use of digital technology seriously until one demonstrates how its tools improve the ways we explore and explain aesthetic works – until, that is, they expand our interpretational procedures. (McGann, Radiant Textuality, 2001: xiii).

This makes sense to me. We need a clearly outlined map of how interpretation is enriched or expanded through computational analysis. Perhaps this exists somewhere?

- Matthew Wilkens Says:
  August 12, 2009 at 9:41 am
  I took McGann to mean that digital work will be taken seriously when and only when we start reading articles and books with traditional literary concerns that happen to use digital techniques as part of their working method. In other words, it’s all well and good to talk about the potential of digital research, but what we really need are more examples of existing, good, interesting, and specific digitally assisted literary critical results.
  
  I agree with McGann completely. There’s not much out there yet that meets those criteria, but it’s coming, and soon (I hope). Of course, non-digital folks could be forgiven if “coming soon” sounded familiar to them.
  
  - Anna Battigelli Says:
    August 12, 2009 at 10:17 am
    Your project on allegory and parts of speech will be important in this respect (and in many other respects, too).
    
Eleanor Shevlin Says:
August 12, 2009 at 11:48 am
Anna’s and Matthew’s comments made me think of another variation of how computational tools/programs can assist in handling large textual corpora in ways not otherwise possible: Peter Robinson’s (et al.) Canterbury Project. Among other initiatives, this project created software (Collate–available for a long time only for use on Macs) to build on the magisterial, pre-digital-age work of John Manly and Edith Rickert’s The Text of the Canterbury Tales Studied on the Basis of All Known Manuscripts (8 vols) and afford better, more effective ways of presenting the information about variants found across the 84 manuscripts of Chaucer’s CT that had appeared by 1500. One interested in learning more about the fruits of this project might consult Robinson’s “The History, Discoveries, and Aims of the Canterbury Tales Project,” Chaucer Review 38.2 (2003): 126-139. Although some might worry that the ability to manipulate the choice of “base” text and witnesses by users of electronic editions decenters the authority of the scholarly edition, I see this capability as a plus and one that could lead to greater discoveries. (My exposure to this project when I attended the Center for Electronic Texts in the Humanities [CETH] in 1996 was what first stimulated my interest in dh–my focus there was scholarly electronic editions; Robinson was the convener for these workshops.)

Robinson also considers why electronic scholarly editions have yet to catch on in his 2005 piece, “Current issues in making digital editions of medieval texts; or, do electronic scholarly editions have a future?”, Digital Medievalist, 1.1 (Spring 2005).

See also “Where We Are with Electronic Scholarly Editions, and Where We Want to Be” Jahrbuch für Computerphilologie Online 1.1 (2005). [In print in Jahrbuch für Computerphilologie 2004, 123-143.]

Matthew Wilkens Says:
August 12, 2009 at 1:48 pm
One more link, for those not already swamped 🙂

Working Together or Apart: Promoting the Next Generation of Digital Scholarship. Washington, D.C.: Council on Library and Information Resources, 2009. (PDF)

It’s a hefty report from CLIR on the state of various DH projects and issues. Of particular interest re: this post is Douglas Oard’s essay “A Whirlwind Tour of Automated Language Processing for the Humanities and Social Sciences,” which begins on page 34. It’s whirlwind indeed and covers more ground (necessarily in less depth) than I did, but it’s worth a look if you’d like to know with whom to make useful friends on campus. And it mostly agrees with me on a few lightly polemical points, so I approve heartily 🙂

- Anna Battigelli Says:
  August 12, 2009 at 8:46 pm
  I liked Oard’s “Whirlwind Tour,” though it successfully calls attention to the complexity of machine reading–and to the difficulty of getting machine reading to work (or of knowing what it is we want it to do).
  
  Thanks!
  
Eleanor Shevlin Says:
August 12, 2009 at 5:53 pm
Again, many thanks, Matthew! Please don’t worry about overwhelming us. This information is very good to have…

Eighteenth-Century Book Tracker « Early Modern Online Bibliography Says:
August 12, 2009 at 8:09 pm
[…] using these cost-free sites as the foundation for the kinds of digital humanities approaches that Matthew Wilkens describes so well (what John Unsworth calls “not reading” and Tanya Clement calls […]

Dave Mazella Says:
August 12, 2009 at 10:49 pm
This is a really great, really informative post. I feel like this is a great starting-point for those exploring this topic (and I agree with those who think that what machine-reading really needs is a really good book and a new critical voice that demonstrate the virtues of the approach particularly well). Thanks, Matthew. DM

Anna Battigelli Says:
August 13, 2009 at 8:16 am
Dave is right. This morning I looked at Lisa Spiro’s marvelous site and was again impressed by something we have not yet mentioned here: the way these computerized tools facilitate collaborative authorship. She cites John Unsworth on a theme Dave and others have discussed frequently and usefully in the Long Eighteenth:

In the cooperative model, the individual produces scholarship that refers to and draws on the work of other individuals. In the collaborative model, one works in conjunction with others, jointly producing scholarship that cannot be attributed to a single author. This will happen, and is already happening, because of computers and computer networks. Many of us already cooperate, on networked discussion groups and in private email, in the research of others: we answer questions, provide references for citations, engage in discussion. From here, it’s a small step to collaboration, using those same channels as a way to overcome geographical dispersion, the difference in time zones, and the limitations of our own knowledge.

Collaborative work facilitated by scholars working together to analyze machine-read statistics, or sharing bibliographies on Zotero, or simply mapping information as Matthew and others have on various blogs, seems so helpful, and yet I think the humanities have not, until recently, caught on to the merits of collaborative work. The sciences are far ahead of us in this respect.

This is a bit of a digression from machine-reading, but these thoughts were provoked by Matthew’s generosity in putting together a very useful overview and by the scholarly riches shared on the sites he lists.

Anna Battigelli Says:
August 16, 2009 at 5:15 pm
John Unsworth’s “How Not to Read a Million Books” provides the most efficient overview of MONK I have seen. It includes examples of several projects now being undertaken with the help of text-mining tools such as MONK.
