Archive for October, 2012

T-PEN: A New Tool for Transcription of Digitized Manuscripts

October 22, 2012

One of the exciting turns of events for scholars has been the growing number of unpublished, hand-written documents now available on the world wide web. Textual scholars no longer have to travel to distant countries to view the essential manuscript(s) for their research. Instead, they can sit down in front of their laptops and display each successive page. This has moved many sources that were once difficult to access into the “completely accessible” category.

But does that make them usable?  Despite the desire to make many manuscript collections freely accessible, many digital repositories use “tile-based” viewers to protect against unauthorized copying of the collection. This is completely understandable, but those viewers sometimes place limits on how a digital surrogate can be viewed. They can even make it difficult for scholars to extract what they often want most: a transcription of the manuscript’s content. Moreover, the current practice of transcribing from digitized pages easily permits mistakes. Transcribers currently move between the image and a word-processing application in another display window (either on the same screen or on a different monitor). That process can easily reproduce the same kinds of mistakes that the original scribes made: haplography (omission of content between similar or identical words; “saut du même au même”), dittography (repetition of letters or syllables), duplication or omission (of letters, words, or lines), often caused by homoeoarcton and homoeoteleuton (similar beginnings and endings of words), and transpositions. Could it then be possible to make these digital manuscripts both accessible and highly usable?

T-PEN (Transcription for Paleographical and Editorial Notation) seeks to address both the accessibility and the usability of digital repositories. Developed by the Center for Digital Theology of Saint Louis University, in collaboration with the Carolingian Canon Law Project of the University of Kentucky, this new digital tool is a sophisticated web-based application that assists scholars in transcribing manuscripts. To reduce the likelihood of transcription errors, we took advantage of digital technology to display the transcription and the exemplar together, minimizing the visual movement between the two. We accomplished this with a simple but novel visualization of the lines of script in the exemplar, which we integrated with interactive transcription spaces. To build the tool, we developed an algorithm for “parsing” the lines of script in an image, and a data model that connects the image delivery of manuscript repositories with the actions of transcribers.
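
The details of our algorithm are beyond the scope of this post, but a minimal sketch can convey the family of techniques involved. The Python below is an illustration only, not T-PEN’s actual code: it guesses line locations by measuring how much “ink” each row of pixels contains (a simple projection profile).

```python
# Illustrative sketch only -- not T-PEN's actual algorithm.
# Rows of pixels with lots of ink are inside a line of script;
# rows with little ink are the gaps between lines.
import numpy as np
from PIL import Image

def find_line_bands(image_path, threshold=0.02):
    img = np.asarray(Image.open(image_path).convert("L"))  # grayscale page
    ink = 1.0 - img / 255.0            # dark pixels ~ 1.0, background ~ 0.0
    profile = ink.mean(axis=1)         # average inkiness of each pixel row
    bands, in_line, start = [], False, 0
    for y, value in enumerate(profile):
        if value > threshold and not in_line:
            in_line, start = True, y   # a line of script begins here
        elif value <= threshold and in_line:
            in_line = False
            bands.append((start, y))   # (top, bottom) coordinates of a line
    if in_line:
        bands.append((start, len(profile)))
    return bands
```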

But we wanted T-PEN to offer more than just a means to ensure good transcription. We had, in fact, three goals in mind:

  1. To build a tool useful for any kind of scholar, from the digital Luddite to those obsessed with text encoding;
  2. To provide as many tools as possible to enhance the transcription process;
  3. To help scholars make their transcriptions interoperable so that those transcriptions would never be locked into the world of T-PEN alone.

After two years of design, development, and intensive testing, this tool is now available to the wider public. It was built in the first instance for those working with pre-modern manuscripts, but there is nothing in its design that would prevent early modern scholars from exploiting T-PEN for their purposes. T-PEN is a complex application, and explaining every function would take several posts. Instead, I want to provide a brief overview of how someone can set up a transcription project, how they can use T-PEN to produce high-quality work, and finally how to get transcriptions out of T-PEN and into other applications or contexts.

Choosing your Manuscript

T-PEN is meant to act as a nexus between digital repositories and the scholar. To date, we have negotiated access to over 3,000 European manuscripts, and we are working on further agreements to expand that list. Our aim is to have a minimum of 10,000 pre-modern European manuscripts available for transcription. Even with that number, we will never be able to satisfy all potential users. We therefore enabled private uploads to extend T-PEN’s usability. Many scholars have obtained digital images of a manuscript with permission to use them for research purposes. Private uploads to T-PEN are an extension of that “fair use.”  Users zip the JPG images into a single file and then upload them to T-PEN. These types of projects can add only five additional collaborators (see project management, below), and they can never become public projects. Currently T-PEN can support around 300 private projects, and we are expanding our storage capacity for more.
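
Preparing such an upload is simple. As a hedged sketch (the folder and file names below are examples, not T-PEN requirements), here is how one might bundle a folder of page images into a single zip file:

```python
# A minimal sketch of preparing a private upload: bundle a folder of
# JPG page images into one zip file. Names here are examples only.
import zipfile
from pathlib import Path

def zip_page_images(image_dir, out_path="my_manuscript.zip"):
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for jpg in sorted(Path(image_dir).glob("*.jpg")):
            zf.write(jpg, arcname=jpg.name)  # sorting keeps pages in order
    return out_path

zip_page_images("scans/ms_lat_1234")  # hypothetical folder of page images
```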

T-PEN's Catalog of Available Manuscripts

Transcribing your Manuscript

Once you select your manuscript you can immediately begin your transcription work. T-PEN does not store any permanent copies of the page images, so each time you request to see a page T-PEN loads the image from the originating repository. If you have never transcribed the page before, T-PEN takes you to the line parsing interface. This adds a little time to the image loading as T-PEN parses the image in real time. When it finishes, you will see a page that looks like this:

T-PEN's Line Parsing Interface

T-PEN attempts to identify the location of each line on the page and then uses alternating colors to display those coordinates. As you can see, we make no claim of absolute perfection. We worked on this algorithm for almost two and a half years, and after extensive testing we can promise, on average, an 85% success rate. A number of factors prevent complete accuracy, so we offer ways for the transcriber to introduce corrections herself. You can add, delete, or re-size columns; insert or merge lines; and adjust the width of individual lines if they vary in length. You can even combine a number of lines if you want them grouped together for your transcription. Sometimes manuscripts don’t fit well into our modern, rectilinear world: many handwritten texts were written at an angle or were so tightly bound that the page could not be photographed flat. T-PEN ultimately doesn’t care: what really matters for connecting a transcription to a set of coordinates on a digital image is that the left side of the line box aligns with the written text. That’s the anchor.
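
To make the idea of the anchor concrete, here is a hedged sketch of the kind of record such a data model might keep. The field names (and the URL) are illustrative, not T-PEN’s actual schema:

```python
# Hedged sketch of a line record -- field names are illustrative,
# not T-PEN's actual data model.
from dataclasses import dataclass

@dataclass
class LineRegion:
    page_image: str  # image URL at the originating repository
    x: int           # left edge: the anchor that must align with the script
    y: int           # top edge
    width: int       # adjustable per line, since line lengths vary
    height: int
    text: str = ""   # the transcription attached to these coordinates

line = LineRegion("https://repository.example/ms1234/f12r.jpg",
                  x=210, y=388, width=940, height=42,
                  text="In principio erat verbum")
```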

When you are satisfied with the line parsing, you can start transcribing. The transcription interface looks like this:

T-PEN Transcription User Interface

This interface allows you to transcribe line by line, with the current line surrounded by a red box. There are some basic features to note. First, as you transcribe, the previous line is displayed above, because sentence units are so often split across lines. Transcription input is stored in Unicode, and T-PEN will take whatever language input the user’s computer has been set up to type. If there are special characters in the manuscript, the transcriber can insert them by clicking on the special character buttons (the first ten are hot-keyed to CTRL+1 through 0).

Second, users can encode their transcription as they go. On this point, T-PEN is both innovative and provocative. Many scholarly projects that include text encoding adopt a three-step process: the scholar transcribes the text and then hands it to support staff to complete the encoding, which is finally vetted by the scholar. However, there are many cases in which semantic encoding of transcriptions has to take account of how the text is presented on the page. T-PEN innovatively allows scholars to integrate transcription (with the manuscript wholly in view) and encoding into one step. Often the best encoder is the transcriber herself. That innovation comes with a provocative concept, however. In a digital humanities where TEI is the reigning orthodoxy, T-PEN is at least heterodox if not openly heretical. T-PEN’s data model does not expect, nor require, a transcription to be encoded, much less to use TEI as the basis of structured text. Instead, T-PEN treats all XML elements as simply part of the character stream. T-PEN can support transcribers who don’t want to encode at all as well as those who are wholly committed to the world of TEI. For those who want to encode, a schema can be linked to a project to produce a set of XML buttons for use in the transcription interface.
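
A small sketch may make the point concrete: in this model a transcription with inline TEI is just a string, and a tag-free view of it is nothing more than a filter. (The example line and the regular expression are mine, not T-PEN internals.)

```python
import re

# To T-PEN's data model, XML tags are simply characters in the stream.
# The same transcription can be stored with or without inline encoding,
# and a plain-text view is just a filter over the string.
tei_line = '<seg type="incipit">Incipit liber primus</seg> de anima'

plain_view = re.sub(r"<[^>]+>", "", tei_line)  # drop anything tag-shaped
print(plain_view)  # -> Incipit liber primus de anima
```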

Project Management

For those who simply want to start transcribing, project management will not be that important. For those who envisage a more sustained project (and perhaps a collaborative one at that), it will be vital. There are a number of components in managing a T-PEN project, but here I want to highlight two of them.

Collaboration. Like most digital tools, T-PEN allows you to invite collaborators to join your project. All members of a project have to be registered on T-PEN (but registration is free and requires only a full name and an email address). Collaboration management has three features, and only a few projects will use all three. The first is adding and deleting project members. Any member of a project can see who else is a member, but only the project leader can add or delete members. A project leader can even have T-PEN send an invitation to someone not yet registered with T-PEN (and once they register, they automatically become part of that project).

Collaboration in Project Management

Second, there is a project log to inspect. This log records any activity that changes the content or parameters of the project. This can be particularly helpful when tracking down how a transcription has changed in a shared project (and a user can display the history of each line in the transcription UI). Finally, projects can make use of T-PEN’s switchboard feature. This is for transcription projects that may be part of a larger project, and where the transcriptions will be aggregated in another digital environment. Switchboard does two things for a project: (1) it allows different projects to share the same XML schema so that all transcriptions will conform to the larger project’s standards; and (2) it exposes the transcription through a web service to permit easy export to the larger project.
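
As a purely hypothetical sketch of what consuming that web service might look like (the endpoint URL below is invented for illustration; consult t-pen.org for the actual interface):

```python
# Hypothetical sketch only: the endpoint below is invented for
# illustration. See t-pen.org for the actual web service interface.
import requests

def fetch_project_transcription(project_id):
    url = f"https://t-pen.org/TPEN/transcription/{project_id}"  # assumed URL
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text  # ready to aggregate in the larger project's environment
```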

Project Options. The two most important options are button management and setting the transcription tools. As seen in the screenshot of the transcription interface, users can use buttons to insert both XML elements and special characters. Those buttons are created and modified as part of the project options. If there is an XML schema for the project, a project leader can link it to the project. Then, in button management, the elements in that schema populate the XML button list. The button populator does not discern between metadata elements and elements found in the body of an encoding schema, so users have to modify the button list to cull the elements that won’t be used during transcription. There is an additional advantage to editing that list: each button can gain a more readable title. This can be helpful if the encoding schema exploits the varying uses of the <seg> or <div> elements in TEI. When the intended deployment of a tag might be unclear to those with less experience of TEI, a more straightforward title can be a better guide to its use.
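
For the curious, here is a hedged sketch of what populating a button list from a schema might involve. It assumes a W3C XSD file with invented file and element names; T-PEN’s own logic may well differ:

```python
# Hedged sketch: collect element names declared in a W3C XSD schema,
# then cull and relabel them by hand. File and element names are invented.
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

def schema_elements(xsd_path):
    tree = ET.parse(xsd_path)
    return sorted({el.get("name")
                   for el in tree.iter(f"{XSD_NS}element")
                   if el.get("name")})

buttons = {name: name for name in schema_elements("project_schema.xsd")}
buttons.pop("teiHeader", None)    # cull elements not used while transcribing
buttons["seg"] = "Marginal note"  # give a cryptic tag a more readable title
```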

Special characters allow the user to identify Unicode characters that may not be available on a standard keyboard. These can be created by entering the correct Unicode value for the character. The first ten characters are mapped to hotkeys CTRL+1 through 0.
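
The lookup behind such a button is simple, as this small sketch shows (the three code points are characters a medievalist might actually want: thorn, d-stroke, and the Tironian et):

```python
# Each special character is just a Unicode code point. Here: thorn,
# d-stroke, and the Tironian et -- staples of medieval transcription.
special = [0x00FE, 0x0111, 0x204A]
for codepoint in special:
    print(f"U+{codepoint:04X} -> {chr(codepoint)}")
# U+00FE -> þ   U+0111 -> đ   U+204A -> ⁊
```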

Finally, the set of tools available in the transcription interface is configured in project options. T-PEN has thirteen built-in tools, most of them included to assist transcribers of pre-modern manuscripts, though some will be helpful to editors of modern texts. If those tools are unhelpful, the user can expand the list: all that is needed is the name of a tool and its URL. Once attached to the project, the tool will be accessible in the transcription interface.

Getting your Transcription out of T-PEN

Digital tools often fall into one of two categories. “Thinking” tools allow users to manipulate and process datasets in order to test an idea or to visualize an abstract concept. They can also allow the user to annotate a resource as a way of working out the scholar’s conception of the object’s meaning or the hermeneutical framework it may require. These tools are invaluable, but they do not easily produce results that can be integrated into a print or digital publication. The second type is what I call the production tool. With these applications, the final objective is to produce something that can be integrated in other contexts. T-PEN falls firmly into this second category—although it has its own annotation tool with which a user can record observations about each manuscript page (and it is compliant with the Open Annotation Collaboration specification). Scholars normally transcribe for one of three reasons: to create a scholarly edition; to place transcriptions in the footnotes or appendices of a monograph; or to integrate an encoded text into a larger resource.

T-PEN supports four basic export formats: XML/plaintext, where the user can filter out one or more XML tags; PDF; RTF, which is compatible with most word processors; and finally, basic HTML. For the first, if the user has attached a header to the project, that header can be included in the export. There is an important caveat here: T-PEN was not designed to be an XML editor. We do offer a basic well-formedness check (which stops at the first error), but T-PEN does not offer full validation services. Most scholars who encode with T-PEN export their transcriptions to an XML editor for full validation of the file. The last three export formats include some simple transformations for text decoration (italics, bold, etc.). Users can also export the whole transcription or specify a range based on the pagination (or foliation) of the manuscript.
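
Python’s standard XML parser behaves the same way as a well-formedness check of this kind, halting at the first error it meets, so a small sketch can show what such a check amounts to:

```python
# Well-formedness only, not validation: the parser stops at the
# first error, just as T-PEN's basic check does.
import xml.etree.ElementTree as ET

def first_error(xml_string):
    try:
        ET.fromstring(xml_string)
        return None               # well-formed
    except ET.ParseError as err:  # raised at the first problem, then stops
        return str(err)           # e.g. "mismatched tag: line 1, column 25"

print(first_error("<p>unclosed <hi>emphasis</p>"))
```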

T-PEN's Export Options

This post covers only the basics of T-PEN; there are more features available to the user. There is a demonstration video on YouTube where you can walk through the start of a transcription project with one of T-PEN’s research fellows. T-PEN is freely available, thanks to a major investment from the Andrew W. Mellon Foundation and a Level 2 Start-up grant from the National Endowment for the Humanities. So go to t-pen.org and register for an account.

From Boston to Peru: Reading Books at the Boston Athenaeum and the Peru Free Library

October 9, 2012

How are we to bring order into this multitudinous chaos and so get the deepest and widest pleasure from what we read? 

V. Woolf, “How to Read a Book”

Photo Credit: Megan Manton/Boston Athenaeum

“To enter the building is to feel an overwhelming impulse to read.”  So wrote Sarah Schweitzer about the Boston Athenaeum in a 2009 Boston.com article.  Indeed, pushing back the building’s red, leather-bound doors, one plunges into the world of reading like a sea-creature slipping into the ocean’s depths.

How is it that a building can transform us from scatter-brained urban land creatures subject to Boston’s many disparate calls into more focused beings equipped to swim through the world of learning?  It may be that the library’s high ceilings and twelve floors expand our sense of possibility, inviting the mind to unbend.  Certainly, the Athenaeum’s quiet aura of uninterrupted work offers a refuge from the jostling noise of the city’s streets.  Fellow readers lost in concentration call us to our task.  Art, sculpture, newspapers, journals, 750,000 books, maps–all await, encouraging inquiry.  The interior’s opulence telegraphs the value of spending time with books, transporting us to a lost age when leisure allowed one to linger over fictions and treatises, sermons and histories, maps and art, with nothing more pressing awaiting than afternoon tea.

But the Athenaeum’s true luxury is something even more precious and more rare than comfort and splendor alone: it offers the order necessary for sustained reading.

We see this order in the carefully designed reading spaces enticing one to that concentrated state of mind so beneficial for reading.  Solid walnut tables provide space for research materials.  Desks tucked between bookshelves beckon. Upholstered chairs placed next to side tables allow readers to sit next to stacks of books and begin the task of browsing.  The reference room displays recent journals side-by-side on long tables (shown below) carefully ordering the chaotic possibilities before us.

Photo Credit: Megan Manton/Boston Athenaeum

In short, the library has been designed for readers by readers to encourage us to leave the tyranny of the present by plunging into the otherworldly and timeless worlds contained in books.  Seated at the Athenaeum, we can take down volumes and, in Woolf’s words, “make them light up the many windows of the past; we can watch the famous dead in their familiar habits and fancy sometimes that we are very close and can surprise their secrets, and sometimes we may pull out a play or a poem that they have written and see whether it reads differently in the presence of the author.”

Photo credit: Megan Manton/Boston Athenaeum

The Boston Athenaeum is a subscription library.  To borrow books and use the upper floors requires a membership fee beyond the reach of many.  But the first floor is open to the public six days a week, and the Athenaeum’s programs, including concerts, are open to the public free of charge.  Its value as a public space is at least threefold: it is a research and membership library; an art museum and public gallery; and a public forum for lectures, readings, concerts, and other events.

Perhaps most of all, the Boston Athenaeum is a valuable icon reminding us of the civic value placed by a community on reading.

Less palatial, but no less essential, are the public spaces created by our public libraries.  Situated by the apple orchards of upstate New York is the Peru Free Public Library (shown below), a lovely 1927 structure that blends the old and the new.  It maintains its early twentieth-century elegance, even as it runs on solar energy.

Photo credit: Theresa Sanderson

Smaller in scale than the Boston Athenaeum (it holds about 14,000 items), it, too, beckons readers with its carefully arranged reading spaces.  A fireside (below) often warms readers working at the reference room’s long tables during the shortening fall days and throughout the winter.

Photo credit: Theresa Sanderson

Carefully arranged reading spaces offer an opportunity to clear one’s head:

Photo credit: Theresa Sanderson

A children’s reading room is designed to invite young minds to the world of books:

Photo credit: Theresa Sanderson

The Peru Free Library’s many activities bind the community through art shows, pottery shows, book sales, children’s activities, public lectures, and other events.  Like the Boston Athenaeum, the Peru Free Library is carefully and creatively managed.

Public reading spaces like the Boston Athenaeum and the Peru Free Library contribute immeasurably to their communities and to their readers, allowing them to expand their sense of who they are.   By orchestrating spaces designed to slow us down long enough to stop skimming and sink into deep reading, they encourage a more studied approach to thought than is possible away from books.  If we feel as Woolf did, that heaven is “one continuous unexhausted reading,” the Boston Athenaeum and the public libraries that share its commitment to encouraging reading make it a little easier to experience heaven on earth.

Early English Books Online Text Creation Partnership: User Survey

October 8, 2012

Posted on behalf of the EEBO-TCP project

Please help the Early English Books Online Text Creation Partnership plan for the future by filling in our user survey, and be entered into a prize draw to win one of ten £50 Amazon vouchers!

http://bit.ly/EEBO-TCPSurvey

The survey is part of a JISC-funded project, SECT: Sustaining the EEBO-TCP Corpus in Transition, which is investigating the impact and sustainability of the EEBO-TCP collection. For more details on the project, go to http://www.bodleian.ox.ac.uk/eebotcp/SECT

New Digital Projects II: Vernacular Aristotelianism and Digitized Archives at the Wellcome Library

October 4, 2012

The following guest post, the second of two parts, is by Andie Silva, Wayne State University

In a previous post, I discussed the Vernacular Aristotelianism database featured during the first week of a two-week workshop at the University of Warwick this past summer.  During that workshop, Chris Hilton, senior archivist at the Wellcome Library, presented the library’s massive restructuring of its archives and its plans to digitize the entire collection.

As if that were not already an impressive undertaking, the Wellcome promises that all its material will be available, not only for open access (with a library membership card—free with an in-person visit) but also for sharing: users will be free to copy, link, and even embed any digital materials from the Wellcome for any non-commercial purpose.

However, as Hilton demonstrated, digitization is not enough: without the proper coding and re-cataloguing of the material, most users won’t know where to look, or what to look for. What he calls the “white box” syndrome is a constant challenge to digital archivists: how does one translate the intricate, detailed knowledge of the archivist into a blank search box? In a way, that is largely impossible; having heard many tales of “found treasures” from scholars who took the time to get to know a librarian and talk to them about their research, I am not one to underestimate the value of physically visiting an archive.

Of course, not everyone is able to do so, and that is where the digital archive comes in. While digitization cannot fully supplant true archival research, it allows for new kinds of research. Take, for instance, the already fully digitized Wellcome Arabic Manuscripts. Thanks to a very generous grant from JISC’s Islamic Studies Program (and, no doubt, some very hard-working grad students), this digital archive is an online researcher’s dream. Each manuscript has been photographed in full, including the covers, binding, and original coloring and detailing. Because the whole book, and not just a close-up of the pages, has been photographed, the researcher is better able to grasp the size, page layout, and general condition of the manuscript. From this broader view, the reader can then zoom in to a specific page and actually read its contents. Granted, this process is a little slow—however, given the quality and viewing options, I can’t see that as a major flaw.

Another fantastic improvement is that the thumbnails of each page appear in a separate frame, allowing the viewer to browse the entire work while inspecting specific pages. This kind of “horizontal browsing” (although in this case the frame is vertical) is something Hilton hopes will be applied to the rest of the Wellcome’s digital materials. According to Hilton, this extra frame will also contain information about related materials, cross-searching, and external links. I imagine that, due to monetary and time constraints, the rest of the materials will not be as detailed as the Arabic Manuscripts. Nonetheless, this collection demonstrates the incredible amount of information and detail that is possible for those implementing digitization projects. Thanks to those who catalogued and annotated the Arabic collection, researchers can investigate even material details like the binding and physical condition of a manuscript without ever paying more than the price of an internet connection.

While digital projects may not (and perhaps should not) replace material archives, they offer new possibilities for research. Scholars interested in statistics, for example, are now better able to quantify and analyze data at the speed of a search engine. One of the workshop participants, for instance, questioned the use of “Publics” in the title “Reading Publics,” arguing that it was not a word contemporary to Renaissance audiences and was therefore inaccurate as a description of their acts of reading, purchasing, and engaging with books. His claim was backed by a database search for the word “public” in sixteenth- and seventeenth-century texts, which revealed only a few works using the word anywhere in the text. He quickly realized, however, that his initial search had failed to retrieve books that used variant spellings or synonyms. What’s more, his search was limited to English texts, and (more importantly) to texts that had already been transcribed by the Text Creation Partnership for EEBO (another exciting project that is not yet entirely available to the public).
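
The corrective he needed is easy to illustrate. As a hedged sketch, one regular expression covering several early modern spellings of “public” does far better than a single literal string (the variant list and the sample title here are mine, and certainly not exhaustive):

```python
# Illustrative only: the variant list is mine and not exhaustive.
import re

variants = re.compile(r"\bpubli(?:c|ck|cke|que|k)e?s?\b", re.IGNORECASE)

sample = "A Discourse touching the Publique Good, printed 1641"  # invented title
print([m.group(0) for m in variants.finditer(sample)])  # -> ['Publique']
```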

This example makes clear some of the limiting aspects of digital research: we are always, sometimes unawares, conditioned by the parameters of the search box—and, more specifically, by whoever coded the keywords into the database. This example also highlights a new kind of conversation that is made possible by virtue of digital projects.  New endeavors like Vernacular Aristotelianism and the Wellcome Arabic Manuscripts show us that digital archives have the opportunity (perhaps even, I dare say, the responsibility) to rethink literary categories, to open up new angles for research, and to foreground aspects of book production and reception beyond the figure of the author.

Early Modern OCR Project (eMOP) Receives Mellon Grant

October 2, 2012

ANNOUNCEMENT

English Professor Laura Mandell, Director of the Initiative for Digital Humanities, Media, and Culture (IDHMC), along with two co-PIs, Professor Ricardo Gutierrez-Osuna and Professor Richard Furuta, is very pleased to announce that Texas A&M has received a 2-year, $734,000 development grant from the Andrew W. Mellon Foundation for the Early Modern OCR Project (eMOP, http://emop.tamu.edu). The two other project leaders, Anton DuPlessis and Todd Samuelson, are book historians from Cushing Rare Books Library.

Over the next two years, eMOP will work to improve scholarly access to an extensive early modern text corpus. The overarching goal of eMOP is to develop new methods and tools to improve the digitization, transcription, and preservation of early modern texts.

The peculiarities of early printing technology make it difficult for Optical Character Recognition (OCR) software to discern discrete characters and, thus, to render readable digital output. By creating a database of early modern fonts, training OCR software to read those typefaces, and creating crowd-sourced correction tools, eMOP promises to improve the quality of digital surrogates for early modern texts. This grant makes it possible to improve the machine transcription of digital page images with cutting-edge crowd-sourcing and OCR technologies, both guided by book history. Our goal is to further the digital preservation processes currently taking place in institutions, libraries, and museums globally.
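
The announcement does not commit eMOP to a particular OCR engine, but purely as an illustrative sketch, here is how one might run the open-source Tesseract engine against a page image with a custom-trained typeface model (“emop_example” is a hypothetical name for such trained data):

```python
# Purely illustrative: pytesseract drives the open-source Tesseract
# engine, and "emop_example" stands in for a hypothetical model trained
# on early modern typefaces. (Requires the tesseract binary installed.)
import pytesseract
from PIL import Image

page = Image.open("page_image.tif")  # a digital page image
text = pytesseract.image_to_string(page, lang="emop_example")
print(text)  # raw OCR output, a candidate for crowd-sourced correction
```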

The IDHMC, along with our participating institutions and individuals, will aggregate and re-tool many of the recent innovations in OCR in order to provide a stable community and an expanded canon for future scholarly pursuits. Thanks to the efforts of the Advanced Research Consortium (ARC) and its digital hubs (NINES, 18thConnect, ModNets, REKn, and MESA), eMOP has received permission to work with over 300,000 documents from Early English Books Online (EEBO) and Eighteenth-Century Collections Online (ECCO), totaling 45 million page images of documents published before 1800.

The IDHMC is committed to the improvement and growth of digital projects and resources, and the Mellon Foundation’s grant to Texas A&M for the support of eMOP will enable us to fulfill our promise to the scholarly community to educate, preserve, and develop the future of humanities scholarship.

For further information, including webcasts describing the problem and the grant application as submitted, please see the eMOP website: http://emop.tamu.edu

For more information on our project partners, please see the following links.
ECCO at Gale-Cengage Learning
EEBO at ProQuest
Performant Software
SEASR
Professor Raghavan Manmatha at the University of Massachusetts Amherst
The IMPACT project at the Koninklijke Bibliotheek – National Library of the Netherlands
PRImA at the University of Salford Manchester
Department of Computer Science and Engineering, Texas A&M University
The Initiative for Digital Humanities, Media, and Culture, Texas A&M University
Cushing Memorial Library and Archives
The OCR Summit Meeting Participants

