Archive for the ‘MONK’ Category

Technology and the “Republic of Letters”

December 28, 2009

The “sell” for a recent article on Mapping the Republic of Letters, a Stanford University digital humanities project led by Dan Edelstein and Paula Findlen, highlights the ways in which technology is altering our understanding of the past and shaping the kinds of questions we can ask:

Researchers map thousands of letters exchanged in the 18th century’s “Republic of Letters” – and learn at a glance what it once took a lifetime of study to comprehend

In this case researchers have applied GIS (geographic information system) mapping technology to explore the wealth of letters exchanged by Enlightenment figures. As the article details, the computer mapping of correspondence from the Enlightenment (the dates focus on 1759 to 1780, but the project also contains letters from the Renaissance) has enabled the relationships among vast amounts of material to be organized and presented in flexible ways. This YouTube video, Tracking 18th-century “social network” through letters, shows snapshots of the trajectories of Locke’s and Voltaire’s correspondence:
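At its core, the GIS step behind such maps is an aggregation of letter metadata into origin-destination routes that can then be drawn as arcs on a map. The following is a minimal sketch of that aggregation step only; the records and city names below are invented for illustration, and real projects work from far richer metadata:

```python
from collections import Counter

# Hypothetical letter metadata: (sender city, recipient city, year).
letters = [
    ("Paris", "Geneva", 1760),
    ("Paris", "Geneva", 1762),
    ("Paris", "London", 1765),
    ("Ferney", "St. Petersburg", 1770),
    ("Ferney", "Paris", 1771),
    ("Ferney", "Paris", 1773),
]

def route_counts(records, start=None, end=None):
    """Count letters per origin-destination route, optionally
    filtered to a date range (the basis for map arcs)."""
    routes = Counter()
    for sender, recipient, year in records:
        if start is not None and year < start:
            continue
        if end is not None and year > end:
            continue
        routes[(sender, recipient)] += 1
    return routes

counts = route_counts(letters, start=1759, end=1780)
for (src, dst), n in counts.most_common():
    print(f"{src} -> {dst}: {n}")
```

Each route count would then be handed to mapping software to set the weight of the corresponding arc; filtering by date range is what makes the animated “snapshots” of a correspondence network possible.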

The “big pictures” that this project facilitates are altering perceptions of Enlightenment networks and their influence. As the video demonstrates, despite French views of England as an exceptional site of religious freedom and tolerance, Voltaire actually corresponded very little with those in England.

What is especially interesting (but not surprising) is the importance of metadata and collaboration to this project’s success. That Oxford “supplied the metadata for 50,000 letters,” Dan Edelstein explains, “allow[ed] the project to go beyond any of our expectations.” Mapping the Republic of Letters has also acquired the data for all of Benjamin Franklin’s correspondence, and talks are underway to obtain data from other European sources.

Projects such as TCP and 18thConnect, which are establishing rich, reliable metadata for digital texts, are expanding the possibilities for scholarly exploration of past textual worlds, for both individual and collaborative scholarship.

Jonathan Rose, whose post on SHARP-L drew my attention to this project, noted the potential of GIS technology for literary and intellectual history. Canadian book historians Bertrum MacDonald and Fiona Black have already begun to realize this potential. Their article “Geographic Information Systems: A New Research Method for Book History” (Book History 1 (1998): 11-31) can be found through Project Muse, and they have also

proposed a long-term, international, collaborative project using GIS for comparative analyses of defined elements of print culture in several countries. An Advisory Board is being established, which currently includes scholars in the United States and the United Kingdom. The project has three primary goals: to explore the methodology through a variety of applications concerning various aspects of book history; to aid comparative studies; and to provide the foundation for an electronic atlas of book history (GIS for Book History International Collaborative Project, description from Fiona Black’s website).

Such technology of course has rich potential for other projects, and we have had various mentions of such projects in past emob posts including comments on the Monk Project.

For more recent work on uses of GIS in historical research, see the special issue of Historical Geography: An Annual Journal of Research, Commentary, and Reviews, Emerging Trends in Historical GIS, ed. Anne Kelly Knowles, vol. 33 (2005).

Reading with Machines

August 11, 2009

A week or two ago, we had an interesting discussion here on EMOB about the uses and limits of computers as aids to literary scholarship. Below are some of my own summary thoughts on the subject, along with links to related work that I’ve found useful or thought-provoking.

The first point to mention is that the things computers are good at are very different from the things humans are good at. The worthwhile work in digital humanities (“DH” for short, a synonym for computationally assisted humanities research) keeps this fact in mind. Computers are useful for doing quickly certain basic (that is, boring) tasks that humans do slowly. They’re really good at counting, for instance. But sometimes, happily, these kinds of quantitative improvements in speed produce qualitative changes in the kinds of questions we can pose about the objects that interest us. So we literary scholars don’t want to ask computers to do our close reading for us. We want them to help us work differently by expanding what we can read (or at least interpret) and how we can read it. And we want to keep in mind that reading itself is just one (extraordinarily useful) analytical technique when it comes to understanding literary or social-aesthetic objects.

There are two main classes of literary problems that might immediately benefit from computational help. In the first, you’re looking for fresh insights into texts you already know (presumably because you’ve read them closely). In the second, you’d like to be able to say something about a large collection of texts you haven’t read (and probably can’t read, even in principle, because there are too many of them; think of the set of all novels written in English). In both cases, it would almost certainly be useful to classify or group the texts together according to various criteria, a process that is in fact at the heart of much computationally assisted literary work.

In the first case, what you’re looking for are new ways to connect or distinguish known texts. Cluster analysis is one way to do this. You take a group of texts (Shakespeare’s plays, for instance), feed them through an algorithm that assesses their similarity or difference according to a set of known features or metrics (sentence length, character or lemma n-gram frequency, part of speech frequency, keyword frequency, etc.—the specific metrics need to be worked out by a combination of so-called “domain knowledge” and trial and error), and produce a set of clusters that rank the relative similarity of each work to the others. Typical output looks something like this figure from Matthew Jockers’ blog (click the image to see it full size in its original context):

Three dramatic genres in Shakespeare

Read this diagram from the top down; the lower the branch point between two items or groups, the more closely related they are.
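As a minimal sketch of how such a dendrogram is produced (the feature vectors below are invented two-dimensional stand-ins for real metrics like sentence length or part-of-speech frequencies, and the play names are illustrative only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy per-play feature vectors; in practice each play would be
# represented by dozens or hundreds of measured features.
plays = ["Hamlet", "Macbeth", "Twelfth Night", "As You Like It"]
features = np.array([
    [0.80, 0.10],   # Hamlet
    [0.75, 0.15],   # Macbeth
    [0.20, 0.70],   # Twelfth Night
    [0.25, 0.65],   # As You Like It
])

# Agglomerative clustering: repeatedly merge the two closest
# items/groups; the merge heights become the branch points you
# read from the top down in a dendrogram.
Z = linkage(features, method="average", metric="euclidean")

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
for play, label in zip(plays, labels):
    print(play, "-> cluster", label)
```

Note that the algorithm only returns group memberships and merge distances; deciding that one cluster “means” tragedy and another comedy is entirely the researcher’s interpretive act, as the discussion below emphasizes.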

This may or may not be interesting. Note in particular that the cluster labels are supplied by the user, outside the computational process. In other words, the algorithm doesn’t know what the clusters mean, nor what the clustered works have in common. Still, why does Othello cluster with the comedies rather than the tragedies (or the histories, to which the tragedies are more closely related than the comedies)? The clustering process doesn’t answer that question, but I might never have thought to ask it if I hadn’t seen these results. Maybe I won’t have anything insightful to say in answer to it, but then that’s true of any other question I might ask, and at least now I have a new potential line of inquiry (which is perhaps no mean thing when it comes to Shakespeare).

(As an aside, the extent to which I’m likely to explain the categorization of Othello as a simple error instead of as something that requires further thought and attention will depend on how well I think the clustering process works overall, which in turn will depend to at least some extent on how well it reproduces my existing expectations about generic groupings in Shakespeare. The most interesting case, probably, is the one in which almost all of my expectations are met and confirmed—thereby giving me faith in the accuracy of the overall clustering—but a small number of unexpected results remain, particularly if the anomalous results square in some way with my previously undeveloped intuitions.)

Even more compelling to me, however, is the application of these and related techniques to bodies of text that would otherwise go simply unread and unanalyzed. If you’re working on any kind of large-scale literary-historical problems, you come up very quickly against the limits of your own reading capacity; you just can’t read most of the books written in any given period, much less over the course of centuries. And the problem only gets worse as you move forward in time, both because there’s more history to master and because authors keep churning out new material at ever-increasing rates. But if you can’t read it all, and if (as I said above) you can’t expect a computer to read it for you, what can you possibly do with all this stuff that currently, for your research purposes, may as well not exist?

Well, you can try to extract data of some kind from it, then group and sort and classify it. This might do a few different things for you:

  1. It might allow you to test, support, or refine your large-scale claims about developments in literary and social history. If you think that allegory has changed in important and specific ways over the last three centuries, you might be able to test that hypothesis across a large portion of the period’s literary output. You’d do that by training an algorithm on a smallish set of known allegorical and non-allegorical works, then setting it loose on a large collection of novels. (This process is known as supervised classification or supervised learning, in contrast to the un- or semi-supervised clustering described briefly above. For more details, see the Jain article linked at the end of this post.) The algorithm will classify each work in the large collection according to its degree of “allegoricalness” based on the generally low-level differences gleaned from the training set. At that point, it’s up to you, the researcher, to make sense of the results. Are the fluctuations in allegorical occurrence important? How does the genre vary by date, national origin, gender, etc.? Why does it do so? In any case, what’s most exciting to me is the fact that you’re now in a position to say something about these works, even if you won’t have particular insight into any one of them. Collectively, at least, you’ve retrieved them from irrelevance and opened up a new avenue for research.
  2. The same process might also draw your attention to a particular work or set of works that you’d otherwise not have known about or thought to study. If books by a single author or those written during a few years in the early nineteenth century score off the charts in allegoricalness, it might be worth your while to read them closely and to make them the objects of more conventional literary scholarship. Again, the idea is that this is something you’d have missed completely in the absence of computational methods.
  3. Finally, you might end up doing something like the Shakespearean clustering case above; maybe a book you do know and have always considered non-allegorical is ranked highly allegorical by the computer. Now, you’re probably right and the computer’s probably wrong about that specific book, but it might be interesting to try to figure out what it is about the book that produces the error, and to consider whether or not that fact is relevant to your interpretation of the text.
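The supervised workflow sketched in item 1 (train on a small labelled set, then score the unread collection) can be illustrated with standard tools. Everything below is invented for illustration: the snippets are toy stand-ins for whole works, and word counts are a crude stand-in for the richer feature sets discussed above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set: a few "allegorical" and
# "non-allegorical" snippets stand in for the labelled corpus.
train_texts = [
    "the pilgrim crossed the valley of despair toward the shining city",
    "virtue and vice wrestled for the soul of everyman",
    "the shepherd of hope led his flock past the slough of doubt",
    "she poured the tea and asked about his journey from london",
    "the regiment marched into town on a wet tuesday morning",
    "he signed the contract and returned to his office by noon",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = allegorical, 0 = not

# Low-level features: simple word counts per text.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X, train_labels)

# "Set it loose" on unread works: score each for allegoricalness.
unread = [
    "despair and hope fought for the pilgrim at the valley",
    "the office clerk signed papers all tuesday morning",
]
preds = clf.predict(vectorizer.transform(unread))
scores = clf.predict_proba(vectorizer.transform(unread))[:, 1]
```

The interesting scholarly work starts where the code stops: deciding what the distribution of scores across dates, authors, or national origins might mean, and which outliers deserve a close reading.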

One note of particular interest to those who care deeply about bibliography. In an earlier post about Google Book Search (a service tellingly renamed from the original Google Print), there was some debate about whether GBS is a catalog or a finding aid, and whether or not full-text search takes the place of human-supplied metadata. I think it’s obvious that both search and metadata are immensely useful and that neither can replace the other. One thing that text mining and classification might help with, though, is supplying metadata where none currently exists. Computationally derived subject headings almost certainly wouldn’t be as good as human-supplied ones, but they might be better than nothing if you have a mess of older records or very lightly curated holdings (as is true of much of the Internet Archive and GBS alike, for instance).

Finally, some links to useful and/or interesting material:

Monk Project

July 24, 2009

Among the text databases included in the MONK (Metadata Offer New Knowledge) Project are ECCO and EEBO, both of which are part of the Text Creation Partnership (TCP). While not addressing bibliographic errors, this initiative is relevant to our discussions of improving these tools. In particular, the project’s efforts are apparently aimed at providing scholars with the means to work more effectively and simultaneously with texts created and housed in different databases.

A recent PowerPoint presentation about the Monk Project, Tools for Textual Data (May 20, 2009), by John Unsworth sketches such issues as treating text as data, the Monk Project’s efforts to facilitate means to “mix and match” texts that reside in different databases, the development of features that will enable searches that users may wish to conduct (for example, what adjectives does a given author favor the most?), and the acceptable level of curatorial/user intervention. The tools being developed to allow both the posing of questions that users may wish to ask and the mining of the data to yield responses to these queries seem highly promising.
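The kind of query mentioned above (“what adjectives does a given author favor the most?”) reduces to tagging words by part of speech and counting. This is a toy sketch of that idea only: a real system like MONK would use a proper part-of-speech tagger over a full corpus, whereas here a tiny hand-made adjective lexicon and sample sentence (both invented) keep the example self-contained:

```python
from collections import Counter

# Toy stand-in for a part-of-speech tagger's adjective tags.
ADJECTIVES = {"good", "fair", "sweet", "noble", "honest", "brave"}

def favourite_adjectives(text, k=3):
    """Count occurrences of known adjectives in a text and
    return the k most frequent, most-used first."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w in ADJECTIVES)
    return counts.most_common(k)

sample = ("Good night, sweet prince. A noble heart, a noble mind, "
          "and honest counsel from a noble friend.")
print(favourite_adjectives(sample))
```

Even this toy version shows why such features depend on curated text data: spelling variation, tagging errors, and inconsistent markup across databases would all distort the counts, which is precisely the “mix and match” problem the slides describe.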

Under “Questions for Discussion” (slide 22), I was interested in the two-part query, “Should users be allowed to change, correct, or improve data? If so, under what constraints or conditions?” This question set seems directly pertinent to our discussion of how to improve bibliographic issues in these databases, but it also rightly asks what sorts of constraints should (or need) to be in place, the answer to which speaks to issues of quality control. Another question, “Should those who provide collections also collect the results of work done on their collections? Why or why not?,” was surprising to me. While I could see how gathering information about the ways that the collections were being used and the results obtained could help developers improve these databases’ functionality and accuracy, the collection of this information, especially by the owners of databases that are commercial enterprises, seemed far more worrisome to me.