Monk Project

Among the text databases included in the MONK (Metadata Offer New Knowledge) Project are ECCO and EEBO, both of which are part of the Text Creation Partnership (TCP). While not addressing bibliographic errors, this initiative does hold relevance for our discussions on improving these tools. In particular, this project’s efforts are apparently aimed at providing scholars with the means to work more effectively and simultaneously with texts created and housed in different databases.

A recent PowerPoint presentation about the Monk Project, “Tools for Textual Data” (May 20, 2009), by John Unsworth sketches such issues as treating text as data, the Monk Project’s efforts to facilitate means to “mix and match” texts that reside in different databases, the development of features that will enable the searches users may wish to conduct (for example, what adjectives does a given author favor most?), and the acceptable level of curatorial/user intervention. The tools being developed to allow both the posing of such questions and the mining of the data to answer them seem highly promising.

Under “Questions for Discussion” (slide 22), I was interested in the two-part query, “Should users be allowed to change, correct, or improve data? If so, under what constraints or conditions?”. This question set seems directly pertinent to our discussion of how to address bibliographic issues in these databases, but it rightly also asks what sorts of constraints should (or need) to be in place–the answer to which would speak to issues of quality control. Another question, “Should those who provide collections also collect the results of work done on their collections? Why or why not?,” was surprising to me. While I could see how gathering information about the ways that the collections were being used and the results obtained could help developers improve these databases’ functionality and accuracy, the collection of this information–especially by the owners of databases that are commercial enterprises–seemed far more worrisome to me.

50 Responses to “Monk Project”

  1. Anna Battigelli Says:

    The Monk Project interests me, but the tutorial wasn’t enough to help me use it in a serious way. I think I’d need to watch someone use it first.

    I’m also having difficulty seeing whether these digital environments are more than computerized concordances. Unlike single-author printed concordances, they allow word/phrase searches across a computerized field that includes the works of many authors or of a particular time period. But if–as Steve’s post, Grafton’s note, and other material suggest–the searching is not 100% accurate, it still seems necessary to read the material in order to produce a result worthy of scholarly discussion.

    I understand Eleanor’s concern regarding the databases’ interest in collecting use information, but I also see why the owners would want some data regarding the instrument’s use in order to strengthen its functionality. It’s a tricky situation.

  2. Eleanor Shevlin Says:

    I may have missed something, but I didn’t see a tutorial. The PowerPoint is taken from Unsworth’s presentation at the Summer Institute on Humanities Data Curation (SIHDC) sponsored by the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana-Champaign (UIUC)–one of the collaborating institutions for 18thConnect (and where Markley is based in English).

    The Monk Project is creating a centralized platform for textual analysis that coordinates different databases containing TEI-encoded texts whose mark-up is very similar but contains variations that, without something like MONK, prevent them from being cross-searchable. (Hope I am explaining this clearly). In other words, while ProQuest’s EEBO texts are marked up in TEI coding and Cengage-Gale’s ECCO texts are marked up in TEI, certain elements in the respective encoding vary, and thus intervention is needed to make them compatible with one another.
    This project would thus seem to have a bearing on 18thConnect. “18thConnect will bring together in one forum separate digital collections and texts as well as allow interdisciplinary collaborations by publishers, libraries and scholars” (The Next Big Thing in Humanities, Arts and Social Science Computing: 18thConnect).

    Whenever possible and depending on one’s project, consulting the physical documents seems crucial–as Anna notes.

  3. Anna Battigelli Says:

    The tutorials appear on the MONK workbench site. I found
    them helpful, but I’ll have to spend more time using MONK
    before I get a competent sense of it.

    We have a number of platforms–NINES, 18th Connect,
    and Monk–that are doing similar things. I’m curious about
    what distinguishes Monk from the other two platforms, which
    define themselves chronologically.

    Additionally, as I’m sure readers already know, one can use ECCO II to search both ECCO’s texts and those texts in EEBO that have been encoded through TCP. Searching EEBO through ECCO only works if a library subscribes to both ECCO II and EEBO. Because the number of texts in EEBO that have been coded is small, I have not found this to be as promising as it surely will be when more texts are coded–and when the search mechanism works more accurately.

  4. Eleanor Shevlin Says:

    Thanks, Anna–I did see the tutorials when I first explored this project, but I just hadn’t remembered them when responding to your post (probably because I didn’t try them).

    Also, thanks for the good reminder about the compatibility of ECCO II and EEBO when one’s library owns both. The aim of Monk, however, seems geared to facilitating (among other things) far more sophisticated textual analysis through more complicated queries, as well as the effective presentation of search results in ways that would make sense to the average humanities scholar. As I understand it (and I may well be wrong), while NINES (and the plans for 18thConnect) offers a searchable repository for peer-reviewed digital scholarship, projects, exhibits, primary resources, images, and so forth, and 18thConnect is addressing ways to improve OCR software, the Monk Project seems primarily to be a tool, an application for textual analysis. Monk (and its predecessors), moreover, aims to harness technology for manipulating texts–by viewing texts as data, the project seeks to broaden how we can mine texts in ways that were not conceivable when dealing with physical books.

  5. Eleanor Shevlin Says:

    More on collecting/tracking results:

    My initial concern with collecting results was perhaps a moment of unjustified paranoia. ECCO tracks searches, and users can see the most popular searches for a given time if they so choose. This information at the local level, in fact, has given the library evidence to demonstrate the value of the purchase and argue for purchasing other, similar products. So I am really not sure why the issue would have struck me as possibly problematic initially.

  6. Anna Battigelli Says:

    I have spent a bit more time exploring MONK, and I appreciate the premise behind its design that

    the scholarly use of digital texts must progress beyond treating them as book surrogates and move towards the exploration of the potential that emerges when you put many texts in a single environment that allows a variety of analytical routines to be executed across some or all of them.

    This seems to me correct. Additionally, the kinds of linguistic analyses MONK allows are of interest. For example, its predecessor, WordHoard, shows us that the most frequent word in Homer’s works is “man,” while the most frequent word in Shakespeare’s works is “lord,” an interesting contrast. Similarly, it shows that women use the noun “love” about twice as often as men.
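
    (For the curious, the kind of count WordHoard reports can be sketched in a few lines of Python. This toy version just lowercases, drops a few function words, and tallies; WordHoard itself works over lemmatized, linguistically tagged corpora, and the file name below is hypothetical.)

    from collections import Counter
    import re

    STOPWORDS = {"the", "and", "of", "to", "a", "in", "i", "that", "it",
                 "is", "my", "you", "not", "with", "his", "be", "me"}

    def top_words(text, n=5):
        # Tally word frequencies, ignoring case and common function words.
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(w for w in words if w not in STOPWORDS).most_common(n)

    # print(top_words(open("shakespeare_complete.txt").read()))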

    But the latter example immediately presents us with the question of the database being surveyed. Which women and men? Which chronological period? The “surveys” made possible by Wordhoard or MONK cover a patchwork of American and English text-bases, which seriously complicates the statistical significance of its results. For studies of Shakespeare, or of certain segments of early American literature, MONK would seem to work quite well. Most scholars outside of Big 10 universities will not, however, be able to use MONK to search EEBO or ECCO. This means that the period between the Renaissance and the eighteenth century (except for Shakespeare) is left out of chronological searches. As a result, accurate chronological surveys of shifting word use are still a long way away.

    The various digital environments currently being explored–NINES, 18th Connect, MONK–are very promising. But at least two sets of problems remain constant: 1) the comprehensiveness of the databases being surveyed; and 2) searching processes that remain limited by tagging issues, OCR accuracy, and the compatibility of differently conceived databases. The tight restrictions on the use of EEBO and ECCO also impede early modern works from being fluidly represented in these text-mining products.

    I hope Laura and Bob will tell us more about the Gale/18th Connect agreement, which seems increasingly important.

  7. Eleanor Shevlin Says:

    Good for experimenting with Monk, Anna! And the issues you raise are significant ones.

    The points that Monk and its predecessors make about seeing text as data and digital editions as not merely surrogates of physical books were perhaps what interested me most. These points are certainly not new within digital humanities, but I have witnessed a tendency within humanities communities at large to consider texts and digital editions more as surrogates for the physical book/document. Such considerations hark back to my earlier comments about the differences between databases/repositories that act more as delivery systems (JStor, Project Muse) and other collections such as Google Books that serve as finding aids or those that enable other types of manipulation and exploration of data.

    I also was interested in the queries about whether users should be able to change the data. And those queries reminded me of discussions in which I was involved briefly a while back. For a year or so I prepared collations (using Peter Robinson’s “Collate” program–at the time capable of operating only on Macs) for the first volume of Neil Fraistat and Don Reiman’s The Complete Poetry of Percy Bysshe Shelley. As work progressed on the material for the print version of this work, discussions arose about whether electronic editions of the future should be more interactive. For example, should such editions allow users to perform their own collations using a different copytext of their choosing (of course, the results of such work, if permitted, would not change the “authorized” electronic edition)?

    Finally, the issue of access comes up again and again, and in this case, as Anna notes, using Monk with ECCO and EEBO will be limited initially to the Big Ten institutions. I had assumed that this limitation was temporary, part of the development and testing of the capabilities of Monk, and that once fine-tuned, Monk would be available to all institutions that had ECCO and EEBO–though my assumption may well be foolish. Yet even if this assumption is correct, one still needs to have access to ECCO and EEBO, and the fact remains that so many institutions have been unable to afford the purchase of these collections.

  8. Dave Mazella Says:

    When I went back to the WordPress posts tagged “Digital Humanities,” I found this post by a Mellon postdoc at Rice:

    http://workproduct.wordpress.com/2009/07/09/pos-frequencies-in-the-monk-corpus-with-additional-musings/

    It’s part of a project involving the frequency of certain parts of speech across the MONK collection. It’s a little beyond me in terms of how it approaches allegory and complexity through statistical analysis and visualized data, but it does give some idea of what someone might try to do with a tool like this.

    DM

  9. Anna Battigelli Says:

    Thanks, Dave. One ongoing thread on this blog has been the need to reintroduce the study of bibliography as a consequence of using digital text-bases like EEBO and ECCO and catalogues like the ESTC. Additionally, this post, and the link you provide in your comment, both suggest that linguistics is also experiencing a revival as a direct consequence of these electronic tools.

    It would never have occurred to me that allegory could be identified by a text’s use of parts of speech. I would be very interested in knowing how parts of speech signal allegory. The assumption that allegory would be

    more prominent and widespread in the short intervals between literary-historical periods than during the periods themselves

    also puzzles me. I had always assumed that allegory might peak during periods of cultural stress but that it is always part of the literary imagination.

  10. Eleanor Shevlin Says:

    Yes, thanks, Dave… Like you and Anna, I was also a bit puzzled by the premises and hypothesis about allegory and literary periods found in this post about the Monk Project. Also, while there is general agreement about literary periods and their boundaries, on one level periods are themselves constructions.

    There is a long history of using computing for textual analysis, stylometrics, attribution studies, and the like. And there has been some cross-over in this area with bibliography. See, for example, Joseph Rudman’s “Unediting, De-Editing, and Editing in Nontraditional Authorship Attribution Studies: With an Emphasis on the Canon of Daniel Defoe” PBSA 99.1 (March 2005).

    For a definition of non-traditional authorship studies (and whether they are ready for mainstream bibliography), see Rudman’s “Non-Traditional Authorship Attribution Studies in Eighteenth Century Literature: Stylistics Statistics and the Computer”.

  11. Matthew Wilkens Says:

    Howdy all,

    I’m the author of the blog post in question about allegory and parts of speech in the MONK corpus. Glad it’s sparked some discussion here—my thanks to Anna for the pointer.

    Maybe I can give a few quick answers about the premises of my project and some of the implementation details. The connection between allegory and periodization is something I’ve developed at length elsewhere; links in a moment, but the briefest summary is that moments of cultural/political/scientific upheaval are ones in which an old vocabulary or set of metaphors is being replaced by new ones. But how does that happen? A new worldview doesn’t simply spring forth fully formed. So I think one of the things that happens is you use the old vocabulary under a kind of rhetorical tension, stretching its meaning to describe a new set of conditions and assumptions. If you do that comprehensively, it’s an allegorical way of describing the new situation. [For a slightly longer overview, see http://workproduct.wordpress.com/2008/10/08/the-allegory-project/ which contains links to a few articles as well.]

    As for parts of speech as markers of allegory, the reasoning is that allegory needs in a sense to be narratively and rhetorically simpler at the sentence and paragraph level than does non-allegorical writing, since allegory is already immensely complex at the full-text level (because it maps at least two stories onto one another). The more complicated and detailed the vehicle, the harder it is to maintain a clean mapping between it and the tenor. Even if that reasoning is right, though, it’s only a (reasonable) hunch that such rhetorical simplicity should take the form of, say, fewer adjectives and adverbs and more nouns and verbs. I’m in the process now of looking for other bibliometric correlates to allegory, but I don’t have any even preliminary results to report yet … sorry 😦
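
    (To make that measurement concrete, here is a rough Python sketch of the ratio in question, using NLTK’s stock tagger–a sketch only, not the project’s actual pipeline, and the file name is hypothetical.)

    import nltk  # assumes NLTK plus its tokenizer and tagger data are installed

    def modifier_ratio(text):
        # Adjectives (JJ*) and adverbs (RB*) vs. nouns (NN*) and verbs (VB*),
        # per the Penn Treebank tags NLTK's default tagger emits.
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
        modifiers = sum(1 for t in tags if t.startswith(("JJ", "RB")))
        content = sum(1 for t in tags if t.startswith(("NN", "VB")))
        return modifiers / content if content else 0.0

    # The hunch: allegorical texts should score lower than non-allegorical ones.
    # print(modifier_ratio(open("pilgrims_progress.txt").read()))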

    On implementation, it’s maybe worth noting that although I’m using MONK’s texts, the data I’ve gathered has all come from tools outside the MONK workbench proper. In other words, you can’t produce analogous data or figures using the built-in tools in MONK. As some here have noted, MONK’s goal of providing a set of easy tools for exploratory text analysis may not quite have come off. The idea, though, was that you’d be able to log in, select some texts that you want to analyze, then run your selections through whatever defined toolchains MONK chose to expose. So there are supposed to be three steps: (1) select texts, (2) select tools, (3) choose an output or visualization format, as sketched below. Maybe just knowing that much will help newcomers get a handle on the process—I know it confused the heck out of me the first time I tried to use the workbench (which is what the whole user-facing MONK interface is called).
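
    (In skeletal Python, the intended flow is something like this; every name here is invented for illustration, not MONK’s actual API.)

    def workbench_session(corpus, predicate, tool, render):
        texts = [t for t in corpus if predicate(t)]  # (1) select texts
        results = [tool(t) for t in texts]           # (2) run a chosen toolchain
        return render(results)                       # (3) output/visualization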

    Forgive the long comment! I’d be happy to answer anything else I can, whether about MONK or about my own work. Thanks for your interest!

  12. Anna Battigelli Says:

    I find this very interesting–thanks, Matthew! Your approach to allegory is language-oriented, whereas I see now that my approach has always been (perhaps unreflectively) idea- or concept-oriented. I would never have guessed that allegorical language would have to be simpler than non-allegorical language, though I understand your logic. It will take me some time to think through this issue. I will spend some time looking at your links to get a better sense of your project.

    It’s helpful to have your methodical approach for using MONK. I may return to MONK with your progression in mind. It’s very useful to have your voice on this blog. Please contribute again!

    • Matthew Wilkens Says:

      Glad it’s of some use. I should say that my work on allegory didn’t start out as either linguistic or data-driven. But I faced the problem that a lot of us whose interests touch on literary history come up against at some point: How do you extend your argument beyond the small corpus of texts you can read closely? If I want to claim that allegory shifted in prominence over the last four centuries, can I plausibly claim to have dealt with the full output of that period in any rigorous way? Alas no, if to do so means to read all of it. So I went looking for proxy measures that don’t involve close reading.

      Moretti’s a touchstone on the reasons for doing this sort of thing, though not (for me) in his specific investigations or methods.

  13. Eleanor Shevlin Says:

    Many thanks for the elaboration, Matthew–quite helpful. I do wonder about the correlation between adjectives/adverbs and complexity, for I could see modifiers working to sharpen and clarify. But I would need to give this more thought. (My first thought would be to look at sentence structures as well as to identify the use/prevalence of specific verbs that signal arguments by analogy and perhaps definition. Also, it might be good to consider the emergence of new words. Take the word “manuscript,” for instance. Before the invention of printing, the word was not needed–all script was by hand. But printing changed that and created a need to distinguish a hand-written document from a printed one–and here the use of “manu” (hand / “la main”) is describing a particular kind of script.)
    These musings are just that–and not meant to question what you are doing in the least–instead, please consider your work as being quite thought-provoking.

    Also, many thanks for the three-step process for using Monk. I actually envisioned that this was the way this tool was intended to be used–but I never stated so directly in my poor attempts to clarify how Monk might work.

  14. Anna Battigelli Says:

    Like many others, I feel wary about using social science methods for interpreting texts, but I see why you are drawn to something like that.

    Can you say more about your objection to Moretti’s methods?
    AB

    • Matthew Wilkens Says:

      Thanks for the thoughts — a couple of quick things in reply:

      1. Yes, agreed, methods from the social sciences aren’t especially well suited to answering literary-interpretive questions (for which close reading remains the gold standard and will for the long—or infinite—future). But the thing is that we literary critics are often interested in social questions, too (for instance: how do novels of the Victorian period reflect changing gender roles in nineteenth-century Britain?), and if methods from the social sciences help us answer those questions (as well they should), then it seems to me we shouldn’t hesitate to borrow them, limited only by our modesty as people rarely extensively trained in their use.

      This is mostly to say that close reading works well so long as the object of your study is a work (or body of works) that you can, in fact, read closely. But that’s a tiny fraction of total literary-cultural output, especially after some point in the eighteenth century (to pick an arbitrary measure: when publication of novels passed about one a week); I’d like to be able to say something about all those other books, even if it’s not nearly as complete as what I could say if I could read them (which I can’t).

      I might add that this has much to do with the now dated-sounding but still relevant canon debates: We have canons not because we don’t want to read everything, but because we can’t. We’re not getting any less finite with time, but more texts and artifacts are always being produced. If we want to go meaningfully beyond the canon rather than just rearranging its contents, we’re going to have to do something in addition to reading closely a very small number of books.

      2. I should be clear that I don’t really object to Moretti’s methods. I just don’t think they go far enough—he has a great insight about distant reading, but then often goes on to read pretty closely. The “Graphs” chapter of “Graphs, Maps, Trees” is promising, though limited to a relatively small number of texts (dozens or hundreds). The “Maps” chapter is devoted to individual texts and is based entirely on close reading. That chapter’s interesting insofar as it thinks about spatial relationships in serious terms, but that’s really part of literary criticism proper. And I don’t know what to make of the “Trees” chapter, because I don’t share many of Moretti’s intuitions about the relationships between particular texts. But there, too, you really need to have read everything Moretti’s read to engage with his claims, and that’s not distant reading.

      That said, I understand that the book is a kind of proof of concept, and at that level I like it quite a lot. For more on what Moretti’s up to these days in larger-scale data-based projects, you might want to have a look at his collaborator Matthew Jockers’ blog, especially this post on genre classification: https://www.stanford.edu/~mjockers/cgi-bin/drupal/node/27

  15. Anna Battigelli Says:

    Very interesting–thanks! All of this will give me material to mull over. Of course we are all interested in social questions, and I do follow your logic. Still, I’ll have to think this over and look at Jockers’ blog before I can answer in any minimally satisfactory way. I look forward to hearing more about your work, especially regarding POS and allegory. Again, thanks.
    AB

  16. Anna Battigelli Says:

    I’ve looked at Matthew Jockers’ blog and see that he used computerized scanning, based on the usage of high-frequency word and punctuation tokens, to correctly classify a selection of novels and Shakespeare’s plays by genre. Such scanning instruments clearly should be explored further. I’m all for electronic tools, and for this kind of quantifiable exercise. I also think that the brain’s calculations in taste and judgment (far less empirically verifiable, if they are verifiable at all) need to remain at the center of our engagement with literature. But that does not in the least take away from the value of these online tools. Again, thanks for the information.

  17. Dave Mazella Says:

    The reason why “close reading” works so well as an interpretive practice is that it usually focuses on a pregiven unit to read not just closely, but comprehensively. That’s why the New Critics famously did better with lyric poems, whose boundaries are clear, as long as we ignore bibliography, as opposed to novels, which have distressing problems of reference to a referential universe outside the text. This doesn’t even begin to deal with the question of genres like letters, autobiography, memoir, etc.

    I am interested in the possibilities of machine reading that many of these projects (MONK but also 18th Connect) promise, but I think all of these projects must contend with the problem of categorization and retrieval, and with the extent to which genre-attributions can be made by machine-reading surrogates. I’m still at a loss as to how Jockers’ Shakespearean examples worked, for instance. This is not in any way a judgment of the viability of these projects, but a query to Matthew or anyone else who can point to a published project like this that goes beyond Moretti’s speculative model, which Matthew indicates really reverts back to “close reading” rather than machine-reading. DM

  18. Eleanor Shevlin Says:

    I would like to hear more, too. I can envision, however, on one level how machine-reading could deliver genre-attributions, perhaps because I typically view genres as linguistic constructs foremost. If one identifies a set of traits tied to definitions of specific genres and then runs a program keyed with this information to read a corpus of texts, then the results could be satisfactory. That said, it would seem that the intellectual decisions about categorization had already been made in advance of the machine-reading. So I am uncertain what would be gained (like Dave, I am not questioning the worth of such projects—I may well be misunderstanding their function and goals, etc.).

    On the other hand, it could be interesting to see where a text such as Swift’s Gulliver’s Travels would fall in the results–and whether its place would alter if one used its original title or its now-accepted one. Or what the computational reader would make of Behn’s Love Letters between a Nobleman and His Sister–or the number of texts (arguably not all belonging to the same genre) that bear either “Atlantis” or “Atalantis” in their titles. Yet titles may well not matter to the machine-reading process—though for human readers titles generate important expectations about the genre of the texts they label (admittedly—and sometimes intentionally—titles can mislead).

    David Fishelov has examined how literary genre theories are often rooted in one of the following analogies (typically associated with a given scientific paradigm): 1) biological (focuses on evolution of form), 2) family resemblance (shared traits), 3) institutional, and 4) speech-act (Metaphors of Genre, Penn State, 1993, pp. 1-2). Several of these conceptions would appear well-suited to formulating the rules governing the machine-reading to identify the generic identity of texts.

    For me, I prefer at times to think about genre in terms of property–real estate/neighborhoods. As new texts move in, the neighborhood can change, rising and falling in value… and texts can form entirely new neighborhoods.

  19. Anna Battigelli Says:

    I agree with Dave in wanting to hear more about what data Jockers used to produce a computerized classification of plays and novels by genre. Some of Jockers’ other work, publicized last year in the Chronicle and picked up by the Boston Globe, is easier to understand. For example, he charts Irish-American fiction published between 1800 and 2000 by setting, to demonstrate that Chicago and San Francisco appear more frequently as settings in Irish-American fiction than Boston does and that studies of such fiction have been too East-Coast-centric. Though I find these results interesting, I would need much more data (such as the birth cities of the author and of characters, the popularity of the novels and of the authors, the novels’ quality [less easy to quantify]) to arrive at such a conclusion. If the most significant Irish-American novels are published in one place, even if there are fewer of them, isn’t that important? Or if a region defines itself by its Irish-American heritage and thus prizes such novels more than a region that produces more of these novels, doesn’t that need to be taken into account? Doesn’t the relation between a micro-culture and its fiction need to be part of the equation?

    This diachronic distant reading interests me, but it is still in its early, if interesting, stages. Even with strong statistical methods, until we get OCR software that is more reliable, machine-readable studies for eighteenth-century texts are still approximations. For now, it seems to me that what we really need are more comprehensive, reliable, and refined searches within text-bases like EEBO and ECCO.

    All of this returns us to MONK’s claim, cited in Eleanor’s post on Google Books, that

    the scholarly use of digital texts must progress beyond treating them as book surrogates and move towards the exploration of the potential that emerges when you put many texts in a single environment that allows a variety of analytical routines to be executed across some or all of them

    In part I agree with this. But at least equally important, when it comes to EEBO and ECCO, is the ease with which they provide access to the content of books for close reading. I look forward to the day when searching within these databases (or among databases, possibly through 18thConnect) becomes a more methodical and a more reliable enterprise.

  20. Matthew Wilkens Says:

    Very interesting stuff. Quick thoughts:

    On Jockers’ clustering methods: These are unsupervised clustering techniques. In other words, they don’t involve training the underlying algorithm on any known-good data (i.e., you don’t give the algorithm a set of texts and say “Here, these are comedies. Go find others like them in this second set of unknown texts.”). You just feed the algorithm a bunch of texts plus some features to examine (frequent words, groups of characters, sentence lengths, whatever—you may well not know in advance what will work best) and ask it which texts most closely resemble one another based on those features. You end up with clusters that are more or less closely related; it’s then up to you to decide what those clusters represent (as, for example, with comedies, tragedies, and histories in Shakespeare).
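
    (For anyone curious about the shape of such an experiment, here is a minimal sketch using Python’s scikit-learn. The file names are placeholders, and Jockers’ actual feature sets and algorithms differ.)

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import AgglomerativeClustering

    # Features: relative frequencies of the 100 most common words overall.
    files = ["hamlet.txt", "othello.txt", "twelfth_night.txt", "as_you_like_it.txt"]
    texts = [open(f).read() for f in files]
    X = CountVectorizer(max_features=100).fit_transform(texts).toarray().astype(float)
    X = X / X.sum(axis=1, keepdims=True)  # normalize away differing text lengths

    # No labels are supplied: texts are grouped purely by feature similarity.
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
    print(dict(zip(files, labels)))  # interpreting the clusters is up to you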

    John F. Burrows (now retired) is someone to read on this stuff and has probably done more with cluster-based literary analysis than anyone (Jane Austen has been one of his principal objects). You can see his CV at http://www.allc.org/content/busa/jburrows.html and an article that will give you a taste of things at http://www.digitalhumanities.org/companion/view?docId=blackwell/9781405103213/9781405103213.xml&chunk.id=ss1-4-4&toc.depth=1&toc.id=ss1-4-4&brand=9781405103213_brand.

    All this is distinguished from supervised or training-based methods, which can be more laborious, but are aimed more squarely at a specific problem in which you want to build in some knowledge about relevant groupings.
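
    (By contrast, a supervised sketch under the same assumptions: the toy training data below is invented, and a real study would train on full texts with known genres.)

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = ["a merry jest, a wedding, and much mirth",
                   "blood and death and bitter revenge"]
    train_labels = ["comedy", "tragedy"]  # known-good groupings built in

    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
    print(clf.predict(vec.transform(["a wedding most merry"])))  # -> ['comedy']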

    OK, but why would anyone want to do this, when it’s so obviously inferior to close reading for the cases where close reading works well? I can think of two (very different) reasons one might want to machine-classify texts by genre (or by other features, like author, national origin, author gender, etc.). In the first case, you already know the right answers (because you’ve already read the books, say), but you’re looking for interesting outliers or misclassifications, which you will then try to explain using traditional methods. A semi-hypothetical example: You cluster Shakespeare’s plays and find that “Othello” consistently clusters with the comedies. Why? Who knows, but it might be an interesting question to address with traditional critical methods, and it might lead you to new insights about the play and its relationship to the rest of Shakespeare’s works.

    The second case (more interesting to me, but perhaps not to others) is one in which you have a large body of text that you just can’t read (every novel published between 1650 and 1923, for example). You’d like to know how these works vary by genre, author gender, place and date of composition, etc., all in service of literary-historical questions (How did the gothic novel develop in England and the U.S.? How did women’s writing evolve during the nineteenth century?). If you can machine classify them, you have at least something to go on, where otherwise you’d be limited to a much smaller set of works. If you want to argue that Austen marked an important turning point in British writing by women, wouldn’t it be potentially useful to know how many novels by women were published in the years and decades around her career?

    Unfortunately, there hasn’t yet been much published along these lines, mostly because both the large corpora and the computational resources it requires have only been widely available in recent years. A (slightly different) example would be Tanya Clement’s recent work on Stein’s “Making of Americans” (a strange case itself, in that it’s a single book of such size and experimental style that it rewards treatment as a corpus): http://llc.oxfordjournals.org/cgi/content/abstract/23/3/361

    None of this says anything about quality, influence, readership, etc., all of which are obviously important in such arguments. I’m not suggesting that these sorts of methods can account for everything (though I can imagine ways that they could be used to begin to address those concerns—sales figures are no great mystery, perhaps, though they might be hard to track down). But they can produce important supporting information, information that makes our critical arguments stronger than they would be otherwise. We should welcome that. And after all, there’s no critical method that does account for everything, and no substitute for making an argument with compelling reasoning and evidence, whatever its source.

    Which leads to one last brief point: Absolutely none of this work is inherently reductive or positivist, or at least no more so than other types of literary criticism. It tends to produce numbers and figures, which look different from the kinds of evidence we’re used to dealing with. But they’re just as much in need of interpretation and context and argument as are passages of text for explication. I don’t know anyone in the field who thinks otherwise, but this seems to be a persistent fear of those coming to it from the outside.

    • Anna Battigelli Says:

      Many thanks for this patient and lucid explication.
      John Burrows’ site is indeed helpful. Some of it
      simply baffles me, however. I’d like to know why
      Samuel Pordage’s

      unusual preponderance of verbs couched in the present tense . . . [reflects] the rhetorical stance of an ostensibly philosophic observer of affairs.

      The reference is to The Medal Reversed, hardly a text I would describe as “philosophic” in the sense of distanced reflection. I would love to hear from Burrows how he arrives at this link between present-tense verbs and a philosophical stance.

      Readers of this blog are very interested in this work, but absorbing and understanding its conceptual framework requires time. Your time trying to explicate it is very helpful.
      AB

    • Michael Witmore Says:

      Very interesting post. As follow up, I actually *have* clustered Shakespeare’s plays and found that Othello clusters with the comedies. You can read the post and look at the results here: http://winedarksea.com/?p=228.

      MW

      • Anna Battigelli Says:

        What an interesting post! One of my questions is whether the text-tagging activity of Docuscope shows us language patterns that could be used in both tragedy and comedy. That is, the evidence Docuscope provides is not that Othello is actually a comedy but that Shakespeare appropriates and inverts comedic language patterns to create a tragic scene. You make this point or something like it in your post. If I understand this correctly, the conclusion to be drawn says more about Shakespeare’s flexible use of language patterns than about genre alone.

        Is this an accurate assessment of your article?
        AB

      • Eleanor Shevlin Says:

        Anna’s remarks seem sound to me… Are we indeed understanding your argument? Also, while the point does seem to convey more about Shakespeare’s use of language, it also raises interesting reminders about the flexibility as well as the “constructedness” of genres. Your discussion of botanical illustrators offers an example of how generic categories are more pliable than one may think.

        These databases, as well as the digital tools that enable linguistic analysis, may shed light not only on particular texts but also on the way literary history has often been constructed around genres.

      • Michael Witmore Says:

        Thank you both for your comments. Yes, I am trying to advocate a more fluid conception of genre, in part because I believe we know from experience that genres are relational (they take shape through contrasts), but also because I think we can — on the linguistic level — show certain migratory borrowings of linguistic patterns that generally belong to one genre but that can be adapted to another. So while I think it is true that certain linguistic patterns occur across generic lines in the Shakespearean corpus, it is important that they show a statistically significant distribution into quadrants using PCA. If there weren’t some initial clustering — a division that has a certain historical authority because it overlaps with the genre classifications of Shakespeare’s Folio editors — then both the pattern and the outliers would be less meaningful.

        But the main point is the one you both highlight: that genre penetrates to the linguistic level, and that these patterns can themselves be adapted by authors to anti-generic ends. I think this is as much a point about genre itself as it is about Shakespeare’s theatrical practices.

        Jonathan Hope and I have laid out a more comprehensive argument about how genre works with Docuscope-tagged texts in a collection on Tragicomedy edited by Raphael Lyne and Subha Mukherji. There we use the Wittgensteinian idea of “family resemblance” to talk about what a factor is (in factor analysis) and why a factor might capture something as fluid as genre.

        Again, thanks for your supportive comments. I’m really enjoying reading the posts on this blog.

        MW

      • Eleanor Shevlin Says:

        Thank you, Michael, for the reference to the Lyne and Mukherji collection. I will certainly check it out.

        I am very interested in genre theory and the various analogies used to construct models of genre. (I’ve been developing a model of genre using property and real estate analogies, but I am quite interested in the range of models out there–whether evolutionary, family resemblance, speech-act, and so forth.) I look forward to reading your and Hope’s piece on Docuscope-tagged texts.

        ES

      • Anna Battigelli Says:

        Those following this thread will be interested in the article Michael Witmore co-authored with Jonathan Hope: “Shakespeare by the Numbers: On the Linguistic Texture of the Late Plays,” in Early Modern Tragicomedy, ed. Raphael Lyne and Subha Mukherji (Cambridge: D.S. Brewer, 2007). It’s well worth reading, not just for what it reveals about generic structures but also for what it suggests about Shakespeare’s creative command of those structures.

  21. Matthew Wilkens Says:

    Just posted a longish comment that doesn’t seem to have gone through yet. In the meantime, one thought on OCR quality. Better OCR is better than worse OCR, but it might not be as critical for machine-based methods as one would imagine. Many of those methods don’t turn on single occurrences of rare words, but instead on distributions of very common words (or groups of words, or groups of characters, etc.). Even a decent number of randomly-distributed errors won’t necessarily trouble them.

    But of course OCR errors tend not to be random, which could skew the results. But also, the fact that they aren’t random means you can take some steps to correct for them. Maybe. In the ideal case. 🙂
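
    (A quick way to convince yourself of the random-noise point: a pure simulation, making no claims about any real OCR engine, and the file name is hypothetical.)

    import random
    from collections import Counter

    def corrupt(text, rate=0.05):
        # Replace a random ~5% of characters, mimicking uniformly random noise.
        return "".join(random.choice("abcdefghijklmnopqrstuvwxyz")
                       if random.random() < rate else c for c in text)

    # The top common words usually keep similar relative ranks after
    # modest random corruption, since they occur so often.
    text = open("some_long_text.txt").read().lower()
    print(Counter(text.split()).most_common(10))
    print(Counter(corrupt(text).split()).most_common(10))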

  22. Matthew Wilkens Says:

    Ah yes, that long comment is probably sitting in the spam queue because it contained several outside links. If so, it’ll need to be hand-cleared.

    • Anna Battigelli Says:

      Hi Matthew:

      I’ve checked for the long comment but cannot find any comment in need of approval…? Hope we can retrieve it somehow. I’ll continue to check, but if it doesn’t show up, could you resend it?

      8 mins. later: I finally thought to look under Spam. Must be that the number of links triggered Akismet to label it incorrectly! Now it’s available.
      AB

  23. Matthew Wilkens Says:

    Eep, I’ll stop, I swear … but one thing that I ought to have added, especially on a blog devoted to bibliography, is that good metadata about the texts in a corpus is really valuable. A lot of machine-classification work is moot if you already reliably know things like author, national origin, date and place of publication, subject headings, etc. It’ll never be perfect or certain in every case, but it’s incredibly useful where it exists.

  24. Comments as Blogging « Work Product Says:

    […] machine-aided work that I should probably pull together here at some point. In the meantime, see this post and thread. But check out the whole blog, too—plenty of smart to go […]

  25. Eleanor Shevlin Says:

    Matthew, let me join Anna in thanking you for your extremely helpful explanations/elaborations. I had misunderstood what was being done with the Shakespeare texts, and your additional remarks clarify the approach for me. The mining of texts using common linguistic qualities (not pre-identified as traits of entity X or Y) to discover the relationships that emerge is akin to my interest in mapping.

    And I was very pleased to see that Anna had noted Jockers’ mapping of Irish-American authors. I just returned from being away for two days and thus off-line, but one of the last things I had reviewed before I left was Jockers’ project–and I almost posted on this work because I think it offers a suggestive model, but time was too short. In many ways Anna’s comments about the sorts of questions this data raises seem quite on target. At the same time, the data one collects and the results that emerge could well cause us to re-think the questions we typically ask or give rise to new or different kinds of projects. Moreover, by combining multiple databases and mapping the results as a totality, a project could provide evidence that addresses such questions as the ones Anna posed about Jockers’ Irish-American study.

    To that end, some might be interested in Fiona A. Black, Bertram H. MacDonald, and J. Malcolm W. Black’s “Geographic Information Systems: A New Research Method for Book History,” Book History 1 (1998): 11-31. [Book History is available on Project Muse]. As they note,

    Emphasizing one type of data in contrast to the other, however, leads to an impoverished understanding of a phenomenon as complex as print culture. If quantitative data is privileged, the danger exists that facts rather than understanding will dominate. Yet quantitative data can serve as significant reference points from which qualitative data can be understood. One should not, as John Pickles has argued, “see the new informational databases as merely more efficient counting machines . . . [but as] new technologies and tools . . . [for] accounting, recording, archiving, overlaying, cross-referencing, and mapping information.” (13)

    For more recent work on uses of GIS in historical research, see the special issue of Historical Geography: An Annual Journal of Research, Commentary, and Reviews, Emerging Trends in Historical GIS, ed., Anne Kelly Knowles, vol. 33 (2005).

    For a literary example, see Mapping the Lakes: A Literary GIS.

    In terms of Matthew’s second example of why one might want to engage in machine-classifications of work, his mention of a range of texts from 1650-1923 and accompanying questions made sense to me. Yet the question he posed about Austen triggered me to think of the wealth of information available in Peter Garside, James Raven and Rainer Schöwerling, gen. eds., The English Novel 1770-1829: A Bibliographical Survey of Prose Fiction Published in the British Isles, 2 vols. (Oxford: OUP, 2000). I can be too much of an associative thinker, and I mention this trigger not as an objection to Matthew’s example, but for two other reasons. For one, running queries using the information found here combined with actual texts could yield some unexpected relationships. And, two, it reminded me that I had meant to mention Joseph Rudman and his work on non-traditional authorship attribution studies. Available online, his piece “Non-Traditional Authorship Attribution Studies in Eighteenth-Century Literature: Stylistics Statistics and the Computer” offers a good overview of this work, particularly as it pertains to the long eighteenth century (including Burrows–who undertook studies with the late Harold Love, as many might know), but Rudman also warns,

    Now let me give a major caveat – before any non-traditional study is undertaken, a rigorous and complete traditional study must be done – non-traditional methods are tools to be employed by the traditional scholars – and surely not the most important tools.

    Rudman’s work has appeared in Script & Print, the journal of the Bibliographical Society of Australia and New Zealand (Burrows’ work has appeared here, too), and he has given several presentations on the Bibliography, Book History, and Textual Studies panels that I’ve organized for many years now for EC/ASECS (having taken up the tradition begun by Jim May). I mention these details to suggest that bibliography and digital humanities have closer, more long-standing ties than one might think.

  26. if we read literary histories, why should we be afraid of machine-reading? « The Long Eighteenth Says:

    […] while I was mulling over this, I realized that I’d just read an entire series of exchanges on EMOB and on this blog about the allure of “machine reading” books for literary criticism.   […]

  27. Anna Battigelli Says:

    Dave’s related thread on the Long Eighteenth raises a question I have about trusting these searches. If the books are inadequately scanned, with missing pages, flawed OCR, and deficient selection criteria, in no sense can a keyword search be considered comprehensive. Yet bloggers repeatedly insist that somehow the size of the field searched makes up for the deficient quality of the scanning.

    I can see how keyword searching can lead to serendipitous discoveries and how it taps material “trapped” in books about whose existence one might not know. This is truly great. But I cannot yet see how keyword searching can be “trusted” to yield definitive or even significant results. As a hit-or-miss discovery tool, it’s fantastic; as a research tool, it leaves me puzzled. Perhaps I am missing something?

    • Matthew Wilkens Says:

      Hi Anna,

      I think you’re right that keyword searching can’t be trusted as definitive when the underlying scans and OCR are known to be imperfect. There are some things that can be done to help with this, like fuzzy searches and corrections for known common OCR errors, but they’ll never catch all (and only) errors. I guess the question is what one is trying to do with a keyword search; if it’s truly important for your purposes to catch every (and only) instance of a particular word, then you have a problem. But I think there might be surprisingly many use cases that don’t depend on that level of precision. If you’re looking for books about a person or place, for instance, they’re likely to contain multiple references to those entities, plus they’re likely to have a cluster of related terms that you could also search for (this amounts to a smart, human-mediated version of named-entity extraction).
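
      (One concrete illustration of the kind of correction I mean, as a sketch only: OCR of early modern type routinely reads the long s as an “f,” and a search can simply fold the two letters together. The sample page line below is invented.)

      import re

      def long_s_pattern(word):
          # Treat 's' and 'f' as interchangeable, since OCR of early modern
          # type often mistakes the long s for an 'f'.
          return re.compile("".join("[sf]" if c in "sf" else re.escape(c)
                                    for c in word.lower()), re.IGNORECASE)

      page = "The Cafe of the Diftreffed Author, fet forth in fundry verfes."
      print(long_s_pattern("distressed").findall(page))  # -> ['Diftreffed']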

      Still, you’re absolutely right that you might miss a (small, I hope) fraction of relevant titles (or include irrelevant ones) using that method. So if, say, you want to catalog books, and it’s therefore important to you to catch every instance, then you’re out of luck. I’d just point out that in that case you’re back to the ur-issue: if you can read a book, you’ll almost certainly give better answers about it than a computer will. The interesting case for computers is the hulking mass of material you can’t read yourself.

      Incidentally, are your thoughts related to the recent exchange that Dave posted (via Peter Brantley) between (grossly oversimplifying here) searchers and catalogers re: Google Books? If so, I’ll go on record as saying that it’s absurd that Google hasn’t been much, much better about using existing metadata. Why the heck would you throw away known-good (or known-pretty-good) human-generated data when you already have it? It’s there! It’s good! It’s free! It’s better than not having it! What’s not to love?

  28. Matthew Wilkens Says:

    Gah, the previous post dropped the relevant link to Dave’s post:

    A Debate over the Usefulness of Google Book Search

  29. Eleanor Shevlin Says:

    As Anna remarks, keyword searching on Google can’t be trusted as definitive. But as Matthew notes, the importance of this definitiveness depends on the reasons one is searching. I use Google Book Search almost daily–but I would never use it for searches intended to be definitive. Yet I am using it for a variety of reasons: to find a passage/reference in books I own–and I then consult the print; to gain a better sense of the relevance of a work to a particular project–and if it seems relevant, I then obtain the book; to search in tandem with many other searches (for example those on A2A–Britain’s National Archives database); to discover connections that I would not have found otherwise. I offer examples in my July 15, 2009 at 5:04 pm post on The Long Eighteenth Century.

    Google Book seems to be quite fruitful in the area of 19th-century titles–but it also has eighteenth-century works (its random assortment of Harrison & Co. reprints of 18th-century novels (1780-1788) often offers far better images than those found on ECCO–because they are made from the actual work and not a microfilm copy of it).

    As for the metadata issue, I agree with Matthew that it seems inexplicable why Google did not incorporate it from the start–but as the exchange Peter Brantley posted shows, there is serious interest in improving these tools, as Don Waters notes:

    There is much serious work to be done in organizing, ordering, relating, annotating, tagging and otherwise managing the huge corpus of printed and other materials that is migrating to the digital environment in part with Google’s leadership and massive investment. There is also a growing interest in scholarly and academic circles in this work, and an emerging willingness to invest in the necessary activities given the digitized corpus so that it serves research interests and helps generate new knowledge.

  30. Dave Mazella Says:

    I think that Eleanor’s notion of these technologies as potentially providing “finding aids” has helped me better understand their proper use. In other words, they don’t provide a self-sufficient “object” of study, but they should lead you to those objects. They are not the “library”; they are the “card catalog,” if you will. Now, having said that, the power of the keyword search is obviously no substitute for the structures and classifications of an organized catalog. The problem is who decides which “objects” to categorize, and how we might preserve but also refine these categories of objects? I think with repositories on the scale of Google’s, everyone underestimated the value of the existing metadata of the libraries that were shoved in there willy-nilly.

    But one of the reasons why I posted my follow-up to this on the Long 18th is that I share a lot of Anna’s nervousness about the value of such digital finding aids, without ever realizing that for one kind of reader (e.g., me trying to learn more about potential authors) even printed literary histories like the inestimable J.M.W. Tompkins constitute just another finding aid.

    DM

  31. Eleanor Shevlin Says:

    Dave articulates quite well the way I have tended to view and use Google Book. I should also note that my view of Google Book is based on its present form–but with changes (better metadata, bibliographic info, etc.) to GB, I could see it serving other functions.

    Nor is GB any substitute for traditional finding aids, catalogs, etc. But my searches have consistently yielded sources that I would almost certainly never have found through traditional procedures–case in point, the listing for the James Harrison (whose identity I’ve been untangling from that of the James Harrison who is my actual interest) in The Admission registers of St. Paul’s school, from 1748 to 1876.

  32. Anna Battigelli Says:

    The recent comments by Matthew, Eleanor, and Dave regarding Google Books interest me and help flesh out the power of GB. The Harrison example makes clear the value of a search on GB. I just wish GB were an even more powerful finding aid–with better OCR, no missed pages, fuller selection, more complete metadata, and more accurate bibliographical material. These aren’t small problems, even if the existing GB search yields frequent serendipity.

    To return to MONK, these questions make clear the blend of comprehensiveness and accuracy that scholars still hope to see in a scholarly tool like MONK.

  33. Dave Mazella Says:

    I’d also like to stress how helpful Matthew’s comments have been, because what I’m gathering from his account of “machine reading” is that it represents more of a “browsing” (indirect, serendipitous, clustering) purpose than a “searching” one (direct, precise, isolating) for the objects it studies, and it certainly cannot interpret the texts for its users. Through the clustering, it has the potential to generate insights that would otherwise be unavailable to us. It’s an “insight generator” at one remove from the material, in the same way that a set of results from a Google or a JSTOR search would present us with potential insights into our query. And then you have to go and read everything.

    Frankly, the trained or supervised clustering techniques described by Matthew seem more promising for criticism, because I just think that the extension of existing understandings is a more intuitive process than trying to sort out a machine-generated set of results whose commonality we need to guess at. But I appreciate the fact that numbers, frequencies, distributions, etc. are just as important for this kind of history, and require the same kind of standards of evidence, argumentation, etc. as our conventional procedures of close reading (which look pretty iffy to my colleagues in the sciences and even the social sciences). DM

  34. Anna Battigelli Says:

    Dave’s distinction between “browsing,” which is indirect, serendipitous, and clustering, and “searching,” which is direct, precise, and isolating, is very helpful. Matthew, would you be willing to write a brief post on “machine reading”? It could be very basic and short, with the goal of helping readers begin to see the kind of serendipitously deep skimming these digital tools allow.

    Christine Ruolo touches on this kind of reading in her MLA talk for Cynthia Wall’s session, “Got ECCO?”

    • Matthew Wilkens Says:

      Thanks to all for the kind words — I’m glad that my comments have been helpful. I’ll be glad to put together a post on machine reading and computationally assisted literary work. Will include some links there, too, to other projects in or about the field.

      Should be up later today if all goes well (famous last words, of course).

  35. Eleanor Shevlin Says:

    Yes, let me join Anna and Dave in thanking Matthew too–and echo Dave’s request.

    I also want to share this recent article that seems relevant to our discussion:

    Robin Valenza, “How Literature Becomes Knowledge: A Case Study,” ELH 76.1 (spring 2009): 215-245. (Available through Project Muse)

    Abstract:

    The first section of the essay inquires, “Is literature a special kind of knowledge?”; the second, “Is literary criticism a special kind of knowledge?” Through an analysis of Samuel Richardson’s Clarissa and Samuel Johnson’s Dictionary, the essay shows how a literary work’s status as an object of knowledge can be determined by its use. The essay proposes that the eighteenth-century notion of “index-learning”–reading a text by way of its index–and its more recent incarnation–”search engine learning”–combined with techniques of close reading can yield a new kind of literary-critical knowledge that might be called “slow reading.”

    Valenza notes some affinity between her concept of “slow reading” and Moretti’s “distant reading”, and also discusses electronic indexing in databases such as ECCO.

  36. Anna Battigelli Says:

    Thanks, Eleanor, for suggesting Valenza’s very interesting article. She discusses the tensions between index-reading and New Critical close readings in fresh ways. Johnson’s suggestion that Richardson add an index to Clarissa suggests his interest in quick access to the novel’s moral sentiments. That Richardson complied with that request, providing not just a table of moral sentiments but an index to that table, reflects an awareness of one kind of knowledge the novel contains.

    This example will be helpful in considering the merits of machine reading.

  37. Reading with Machines « Early Modern Online Bibliography Says:

    […] By Matthew Wilkens A week or two ago, we had an interesting discussion here on EMOB about the uses and limits of computers as aids to literary scholarship. Below are some of my own summary thoughts on the subject, along with links to related work that […]

  38. Technology and the “Republic of Letters” « Early Modern Online Bibliography Says:

    […] Jonathan Rose, who drew my attention to this project through his posting of it on SHARP-L, noted the potential of GIS technology for book-history projects. Canadian book historians Bertrum MacDonald and Fiona Black have already begun to realize this potential. Their article “Geographic Information Systems: A New Research Method for Book History” (Book History 1 (1998): 11-31) can be found through Project Muse, and they have also “proposed a long-term, international, collaborative project using GIS for comparative analyses of defined elements of print culture in several countries. An Advisory Board is being established, which currently includes scholars in the United States and the United Kingdom. The project has three primary goals: to explore the methodology through a variety of applications concerning various aspects of book history; to aid comparative studies; and to provide the foundation for an electronic atlas of book history” (GIS for Book History International Collaborative Project, description from Fiona Black’s website). Such technology of course has rich potential for other projects, and we have had various mentions of such projects in past emob posts including comments on the Monk Project. […]
