Archive for the ‘Digital Tools’ Category

Virtual Paul’s Cross Project website is now available for exploration!

May 8, 2013

About a year ago, EMOB devoted a post to several NEH-funded digital projects. John N. Wall, Project Director and Professor of English Literature at NC State University, has let us know that the Virtual Paul’s Cross Project website is now available for exploration at http://vpcp.chass.ncsu.edu. We provide below the press release announcing its availability and invite EMOB readers to explore and comment.

The Virtual Paul’s Cross Project uses visual and acoustic modeling technology to recreate the experience of John Donne’s Paul’s Cross sermon for November 5th, 1622. The goal of this project is to integrate what we know, or can surmise, about the look and sound of this space, destroyed by the Great Fire of London in 1666, and about the course of activities as they unfolded on the occasion of a Paul’s Cross sermon, so that we may experience a major public event of early modern London as it unfolded in real time and in the context of its original surroundings.

The Virtual Paul’s Cross Project has been supported by a Digital Start-Up Grant from the National Endowment for the Humanities.

The Virtual Paul’s Cross Project has sought the highest degree of accuracy in this recreation. To do so, it combines visual imagery from the 16th and 17th centuries with measurements of these buildings made during archaeological surveys of their foundations, still in the ground in today’s London. The visual presentation also integrates into the appearance of the visual model the look of a November day in London, with overcast skies and an atmosphere thick with smoke. The acoustic simulation recreates the acoustic properties of Paul’s Churchyard, incorporating information about the dispersive, absorptive or reflective qualities of the buildings and the spaces between them.

This website allows us to explore the northeast corner of Paul’s Churchyard, outside St Paul’s Cathedral, in London, on November 5th, 1622, and to hear John Donne’s sermon for Gunpowder Day, all two hours of it, in the space of its original delivery and in the context of church bells and the random ambient noises of dogs, birds, horses, and crowds of up to 5,000 people.
There is a Concise Guide to the whole site here.

In keeping with the desire for authenticity, the text of Donne’s sermon was taken from a manuscript prepared within days of the sermon’s original delivery that contains corrections in Donne’s own handwriting. It was recorded by a professional actor using an original pronunciation script and interpreting contemporary accounts of Donne’s preaching style.

For John Donne’s Paul’s Cross sermon for November 5th, 1622 (in 15-minute segments), as heard from 2 different positions in the Churchyard, go here.

On the website, the user can learn how the visual and acoustic models were created and explore the political and social background of Donne’s sermon. In addition to the complete recordings of Donne’s Gunpowder Day sermon, one can also explore the question of audibility of the unamplified human voice in Paul’s Churchyard by sampling excerpts from the sermon as heard from eight different locations across the Churchyard and in the presence of four different sizes of crowd.

For excerpts of the sermon from eight different locations and in the presence of different sizes of crowd go here.

The website also houses an archive of materials that contributed to the recreation, including visual records of the buildings, high resolution files of the manuscript and first printed versions of Donne’s sermon for Gunpowder Day 1622, and contemporary accounts of Donne’s preaching style. In addition, the website includes an acoustic analysis of the Churchyard, discussion of the challenges of interpreting historic depictions of the Cathedral and its environs, and a review of the liturgical context of outdoor preaching in the early modern age.

To see the visual model in detail on a fly around video go here. This is especially dramatic if viewed in HD video and at Full Screen display.
This Project is the work of an international team of scholars, engineers, actors, and linguists. In addition to the Project Director, they include David Hill, Associate Professor of Architecture at NC State University; Joshua Stephens, Jordan Grey, Chelsea Sacks, and Craig Johnson, graduate students in architecture at NC State University; John Schofield, Archaeologist at St Paul’s Cathedral and author of St Paul’s Cathedral Before Wren (2011); David Crystal, linguist; Ben Crystal, actor; Ben Markham and Matthew Azevedo, acoustic engineers with Acentech, Inc; and members of the faculty in linguistics and their graduate students at NC State University, especially professors Walt Wolfram, Erik Thomas, Robin Dodsworth, and Jeff Mielke.

Wall’s team is now planning a second stage of this Project, with the goal of completing the visual model of Paul’s Churchyard, including a complete model of St Paul’s Cathedral as it looked in the early 1620’s, during John Donne’s tenure as Dean of the cathedral. This visual model will be the basis for an acoustic model of the cathedral’s interior, especially the Choir, which will be the site for restaging a full day of worship services, including Bible readings, prayers, liturgies from the Book of Common Prayer, sermons, and music composed by the professional musicians on the cathedral’s staff for performance by the cathedral’s organist and its choir of men and boys. They will be competing for our attention, as they did in the 1620’s, with the noise of crowds who gathered in the cathedral’s nave, known as Paul’s Walk, to see and be seen and to exchange the latest gossip of the day.

Preserving Digital Archives

April 28, 2013

Most attendees at the Beinecke Library’s recent conference on digital archiving–“Beyond the Text: Literary Archives in the 21st Century”–arrived equipped with the idea that there is no preservation without loss.

What may have given some attendees pause, particularly those who work primarily on the first two centuries following the Reformation, is how much 21st-century digital stuff is being preserved–and how idiosyncratic the process of selection can be.

Faced with the data deluge of a contemporary literary figure’s electronic correspondence, for example, how do archivists determine what gets archived and what gets tossed?  Now that archiving can begin during a writer’s or publisher’s lifetime, without a family member’s interference (think Cassandra Austen), who shapes the archive?  And if digital archivists shape the archive, what principles of retention do they use?  Where do their loyalties lie? With the author?  Or with the data-hungry and feverishly scandal-mongering scholars of posterity?

The two-day conference raised unresolved and provocative questions, many of which focused on the problem of selection.  Fran Baker, the Assistant Archivist for John Rylands Library at the University of Manchester, discussed the complexity of archiving the Carcanet editorial papers, including email.  Hearing about the decision-making process determining what stays and what gets tossed may not seem new to librarians familiar with the problem of sorting and discarding, but in the context of shaping an archive, that decision-making process and its likelihood of error takes on urgency.

There were stories of forensic success, the most notable of which is Matthew Kirschenbaum’s narrative of the extensive and collective effort tracking down William Gibson’s electronic poem, “Agrippa,” which was designed to encrypt itself after a single reading.  That a text programmed to go away can be recovered suggests both the value of collaborating on large digital projects like The Agrippa Files and the perils of assuming that an author has control over her or his electronic archives.  Similarly, Beth Luey’s account of the rich storehouse of data contained in publishers’ records–sales data, copies printed, copies sold, print runs, design decisions, contracts, marketing files, legal disputes, reviews, book jacket design, subsidiary rights, and so forth–both encouraged work on publishers’ records and raised ethical and legal issues.  In the discussion that followed, for example, it became clear that though some publishers did not retain rejected manuscripts, others did, including pertinent correspondence and readers’ reports.

The Keynote talk by David Sutton noted that literary manuscripts are like no other manuscripts in that they offer insights into the act of creation.  He showcased ongoing projects that promote an awareness of digital literary archives:

Hazel Carby’s eloquent, harrowing, and culturally resonant account of tracing her family genealogy back to a slave owner’s carefully archived records, reminded everyone that archives preserve both the beautiful and the monstrous.

Diane Ducharme drew on her experience at the Beinecke to warn that however much we may desire an unmediated past and a pristine archival order free from editing and explicating, all archives arrive shaped and selected.  Her discussion underscored the importance of searching for the traces of a previous archivist’s work.

Micki McGee described her experience with the Yaddo Archive Project, which aims at providing visualizations of the social network of writers who worked at Yaddo.  She described the process of seeking a relational database with social network mapping and a visualization widget.  Though the project, Yaddo Circles, requires authentication and is not yet available for public view, this vimeo provides an overview.  Clicking here reveals the kind of relational visualization this project might produce.

McGee also recommended looking at the following projects:

These projects have potential for helping us recover the intensely sociable and highly competitive literary worlds of the long eighteenth century.   Like the many other provocative and interesting papers and introductions to sessions, they point a way forward even as they raise methodological, logistical, and even ethical questions.

This conference made clear the value of a longer conference, with sessions focusing on specific problems posed by digital archives of material both old and new.  I welcome contributions by others who attended the conference to help complete this cursory overview.

English Broadside Ballad Archive (EBBA at UCSB)

February 25, 2013

This is the second of a two-part series on free digital archives featuring English ballads.  It follows Eleanor’s discussion of the JISC-funded Broadside Ballad Initiative at Oxford.

The University of California at Santa Barbara has created a free digital ballad collection called The English Broadside Ballad Archive (EBBA), which provides access to more than 8,000 seventeenth-century ballads.  The collection includes ballads from the Pepys Collection, the Roxburghe Collection, the Euing Collection, and the Huntington Library.  EBBA is directed by Patricia Fumerton at UCSB.  This project was supported by the N.E.H.

Individual entries provide links to sheet facsimiles, facsimile transcriptions, and often recordings.  These features facilitate introducing students both to ballads’ visual details–ornaments, woodcuts, columned verse–and to their tunes.

Cataloging is full and includes the following:

EBBA ID: An internal identifier. Each individual ballad in the archive has a unique EBBA ID.

Title: A diplomatic transcription of the ballad title as it appears on the ballad sheet. The title consists of all ballad text before the first lines of the ballad, including verse headers but excluding text recorded elsewhere under other catalogue headings (such as the license or author, date, publisher and printer imprints).

Date Published: The year—or, in most cases, range of years—during which EBBA believes the ballad to have been published. See Dates.

Author: The recognized author of the ballad in cases where an indication of authorship has been printed on the ballad or, in the case of Pepys ballads, when Weinstein has identified an author from external sources (e.g., Wing, Rollins).

Standard Tune: The standardized name for the melody (according to Claude M. Simpson or other reliable sources). Clicking the standard tune name will return all ballads with the same melody, including alternate tune titles.

Imprint: A diplomatic transcription of the printing, publishing, and/or location information as it appears on the ballad sheet.

License: A diplomatic transcription of the licensing or permission information as printed on the ballad.

Collection: The name of the collection to which the ballad belongs. In cases where the ballad is not part of a named collection, the name of the holding library plus “miscellaneous” will appear. For example, Huntington Library ballads that are not part of a collection are grouped as “HEH Miscellaneous.”

Sheet/Page: For ballads that are collected as independent sheets, the citation page displays the word “Sheet” and lists the sheet number given to it by its holding institution (usually part of its shelfmark). For ballads bound in a book, the citation page displays the word “Page” and lists the page number within the bound volume.

Location: The name of the holding institution.

Shelfmark: The shelfmark assigned by the holding institution.

ESTC ID: The Citation Number for the English Short Title Catalogue (ESTC). Use this number to find the full ESTC citation for any given ballad at http://estc.bl.uk/.

Keyword Categories: The keywords from EBBA’s standardized keyword list that relate to the ballad’s theme and content.

Notes: Clarify potential areas of confusion for users, such as ballads that have print on both sides of a sheet.

MARC Record: A link to our MARC-XML records

Additional Information: Information specific to each part of the ballad.

Title: Separate titles for multi-part ballads.

Tune Imprint: Tune title(s) as printed.

First Lines: A diplomatic transcription of the first two lines of the ballad text proper, below any heading information included in the title or elsewhere under other catalogue headings.

Refrain: Repeated lines at the end of or within ballad stanzas.

Condition: Description of ballad sheet damage and the current state of the sheet. (This information is from Weinstein and is currently for the Pepys collection only.)

Ornament: A list of decorations made of cast metal that appear on the ballad. Frequently used to fill empty spaces in the forme and/or to delimit parts of the ballad text, these ornaments include vertical rules, horizontal rules, and cast fleurons. (This information is from Weinstein and is currently for the Pepys collection only.)

Ballad scholars working with EEBO or ECCO will be familiar with the difficulty of finding ballads there, which is what makes the English Broadside Ballad Archive and the Bodleian Library’s Broadside Ballads so necessary.

Together with new printed resources, such as Patricia Fumerton and Anita Guerrini’s Ballads and Broadsides in Britain, 1500-1800 (Ashgate 2010) and Angela McShane’s Political Broadside Ballads of Seventeenth-Century England: A Critical Bibliography (Pickering & Chatto 2011), these digital resources provide a robust and growing archive  for the systematic study of a format whose transiency may have discouraged such studies in the past.

Text Encoding Initiative Seminar at Brown

February 6, 2013
Readers may be interested in the following announcement from Julia Flanders about a special NEH-funded “Taking TEI Further” institute.  Additional information is available at the WWP’s webpage for their Seminars on Scholarly Text Encoding.

The deadline is approaching for applications to the introductory TEI customization workshop in the NEH-funded “Taking TEI Further” institutes. Please note that the dates for the “Publishing and Transforming TEI Data” seminar have been changed.

Taking TEI Further: TEI Customization
Brown University, May 8-10, 2013
Guest instructor: Trevor Muñoz, University of Maryland
Application deadline: February 15, 2013

Taking TEI Further: Teaching with TEI
Brown University, August 21-23, 2013
Guest instructor: Jacqueline Wernimont, Scripps College
Application deadline: June 1, 2013

Taking TEI Further: Publishing and Transforming TEI Data
Brown University, November 20-22, 2013 [note the date change!!]
Guest instructor: David Birnbaum, University of Pittsburgh
Application deadline: August 15, 2013

**Travel funding is available of up to $500 per participant, up to $1000 for graduate student participants.**

These seminars assume a basic familiarity with TEI, and provide an opportunity to explore specific topics in more detail, in a collaborative workshop setting.

These seminars are part of a series funded by the NEH and conducted by the Brown University Women Writers Project. They are aimed at people who are already involved in a text encoding project or are in the process of planning one, and are intended to provide a more in-depth look at specific challenges in using TEI data effectively. Each event will include a mix of presentations, discussion, case studies using participants’ projects, hands-on practice, and individual consultation.

The seminars will be strongly project-based: participants will share information about their projects with the group, discuss specific challenges and solutions, develop encoding specifications and documentation, and create sample materials (such as syllabi, documentation, etc., as appropriate to the event). A basic knowledge of the TEI Guidelines and some prior experience with text encoding will be assumed.

For more detailed information and to apply, please visit http://www.wwp.brown.edu/encoding/seminars/

Digital Tools: Image Matching within Printed Materials

January 27, 2013

Book historians, bibliographers, and early modern scholars working on word and image relationships are no doubt excited by new digital tools that allow one to search and match images. Oxford University’s JISC-funded Integrated Broadside Ballad Archives project has developed such software, and the capabilities of the resulting tool demonstrate the promise of image-matching software. (Developing the image-matching software tool represents only part of the project. As its title implies, the project’s main goal is “to integrate existing resources for the study of the English folk song and printed ballad tradition.” To that end, the project serves to supply a central resource hub for the Bodleian Libraries Ballad collections, University of California Santa Barbara’s online English Broadside Ballad Archive (EBBA), and the Vaughan Williams Memorial Library’s Roud Broadside and Folk Tune Indexes.)

In a recent video, Oxford faculty Giles Bergel (English), Andrew Zisserman (Computer Science), and Relja Arandjelovic (Engineering Science) from the Broadside Ballad Connections offer a fascinating account and demonstration of the new image-matching software and how it allows us to track images across early forms of printed literature. Not only is the software enabling and advancing existing scholarship, but, as Bergel notes near the video’s close, this software is generating new research questions such as “When do images become new images?”

This diagram offers a quick view of how this software operates and focuses on the following image:

From Image-Matching Explained http://ballads.bodleian.ox.ac.uk/wp-uploads/2012/06/Image-Matching_3.pdf

One can also experience firsthand how this tool works by trying the demo.

We would be very interested in hearing about experiences using the image-matching functions of the Broadside Ballad Connections as well as about other projects using image-matching software or similar tools that enable us to explore visual texts.

(Anna has plans to post on EBBA in the near future.)

Finding English Verse, 1650-1800

January 23, 2013

The following announcement comes from James Woolley, English, Lafayette College:

A revised and enlarged checklist of first-line indexes and fully searchable texts is available here.

Since last report (August 2010), quite a bit has happened. The Union First Line Index of English Verse has expanded significantly; it now includes, in addition to manuscript verse, printed verse 1603-1710, with more additions promised. We have a new index of Gentleman’s Magazine verse. Other big projects are announced in this update of the checklist as well. For a clue about what’s new, see the update history, p. 22.

Prof. Woolley would like to be apprised of news that ought to be mentioned in the checklist, or errors that ought to be corrected.

Folger Digital Texts: Shakespeare’s Plays, Cutting-Edge Code: A Powerful Research Tool for Scholars

December 6, 2012

The Folger is delighted to announce the launch of Folger Digital Texts. These are reliable, expertly edited, and free digital Shakespeare texts for use by researchers. Starting from the Folger Editions of Shakespeare’s works edited by Barbara Mowat and Paul Werstine, Folger Digital Texts uses XML to create a highly articulate indexing system. Researchers can read the plays online, download PDFs for offline reading, search a play or the whole corpus, navigate by act, scene, line, or the new Folger Throughline Numbers. In short, every word, space, and piece of punctuation has its own place online. Twelve plays are currently available, and the remainder of the works and poems will be released throughout 2013.

The XML-coded files are offered as a free download for noncommercial use by scholars and can be used as the groundwork for digital Shakespeare research projects, app development, and other projects.

The Folger Shakespeare Library editions, published by Simon and Schuster, remain available in print and as ebooks and include essays, glosses, notes, and illustrations from the materials in the Folger collections.

The Folger Digital Texts team includes Rebecca Niles, editor and interface architect, and Michael Poston, editor and encoding architect. They welcome your feedback at folgertexts (at) folger.edu.

If you click here, you will be taken directly to Folger Digital Texts.

T-PEN: A New Tool for Transcription of Digitized Manuscripts

October 22, 2012

One of the exciting turns of events for scholars has been the growing number of unpublished, hand-written documents now available on the world wide web. Textual scholars no longer have to travel to distant countries to view the essential manuscript(s) for their research. Instead, they can now sit themselves down in front of their laptop and display each successive page. This has moved many sources that were once difficult to access into the “completely accessible” category.

But does that make them usable?  Despite the desire to make many manuscript collections freely accessible, many digital repositories use “tile-based” viewers in order to prevent unauthorized copying of the collection. This is completely understandable, but those viewers sometimes place limits on how a digital surrogate can be viewed. They can even make it difficult for scholars to extract what they often want most: a transcription of the manuscript’s content. Moreover, the current practice of transcribing from digitized pages can easily permit mistakes to occur. Transcribers currently move from the image to a word processing application in another display window (either on the same screen or on a different monitor). That process can easily reproduce the same mistakes that the original scribe could make: haplography (omission of content between similar or identical words; “saut du même au même”), dittography (repetition of letters or syllables), duplication or omission (of letters, words, or lines), often caused by homoeoarcton and homoeoteleuton (similar beginnings and endings of words), and transpositions. Could it then be possible to make these digital manuscripts both accessible and highly usable?

T-PEN (Transcription for Paleographical and Editorial Notation) seeks to address both the accessibility and usability of digital repositories. Developed by the Center for Digital Theology of Saint Louis University, in collaboration with the Carolingian Canon Law Project of the University of Kentucky, this new digital tool is a sophisticated web-based application that assists scholars in transcribing these manuscripts. To reduce the likelihood of transcription errors, we took advantage of digital technology to place both the transcription and the exemplar in a manner that minimized the visual movement between the two as much as possible. We accomplished this with a simple but novel visualization of the lines of script in the exemplar, which we integrated with interactive transcription spaces. To build the tool, we developed an algorithm for “parsing” the lines of script in an image, and a data model that connected the image delivery of manuscript repositories with the actions of transcribers.

But we wanted T-PEN to offer more than just a means to ensure good transcription. We had, in fact,  three goals in mind:

  1. To build a tool useful for any kind of scholar, from the digital Luddite to those obsessed with text encoding;
  2. To provide as many tools as possible to enhance the transcription process;
  3. To help scholars make their transcriptions interoperable so that those transcriptions would never be locked into the world of T-PEN alone.

After two years of design, development, and intensive testing this tool is now available to the wider public. It was built in the first instance for those working with pre-modern manuscripts, but there is nothing in its design that would prevent early modern scholars from exploiting T-PEN for their purposes. T-PEN is a complex application and to explain every function would take several posts. Instead, I want to provide a brief overview of how someone can set up a transcription project, how they can use T-PEN to produce high-quality work and finally how to get transcriptions out of T-PEN and into other applications or contexts.

Choosing your Manuscript

T-PEN is meant to act as a nexus between digital repositories and the scholar. To date, we have negotiated access to over 3,000 European manuscripts and we are working on further agreements to expand that list. Our aim is to have a minimum of 10,000 pre-modern European manuscripts available for transcription. Even with that number, we will never be able to satisfy all potential users. We therefore enabled private uploads to extend T-PEN’s usability. Many scholars have obtained digital images of a manuscript and they have permission to make use of them for research purposes. Private uploads to T-PEN are an extension of that “fair use.”  Users zip the JPG images into a single file and then upload them to T-PEN. These types of projects can only add five additional collaborators (see project management, below), and they can never become public projects. Currently T-PEN can support around 300 private projects, and we are expanding our storage capacity for more.
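For readers preparing a private upload, the zipping step can be scripted. The following is only a minimal sketch using Python's standard library; the function name `zip_page_images` and the flat archive layout are my own assumptions for illustration, not a format that T-PEN documents here.

```python
import zipfile
from pathlib import Path


def zip_page_images(image_dir, archive_path):
    """Bundle every .jpg/.jpeg file in image_dir into one zip archive.

    Hypothetical helper: the flat layout (no sub-folders) is an assumption.
    Returns the image file names in the order they were added.
    """
    image_dir = Path(image_dir)
    # Sort by file name so pages are added in a predictable order.
    images = sorted(
        p for p in image_dir.iterdir()
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}
    )
    # JPEGs are already compressed, so simply store them.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for image in images:
            # arcname drops the directory part, keeping the archive flat.
            archive.write(image, arcname=image.name)
    return [p.name for p in images]
```

Calling `zip_page_images("scans", "manuscript.zip")` would collect the page images from a folder named `scans` (a placeholder) and skip any non-image files stored alongside them.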

T-PEN's Catalog of Available Manuscripts

Transcribing your Manuscript

Once you select your manuscript you can immediately begin your transcription work. T-PEN does not store any permanent copies of the page images, so each time you request to see a page T-PEN loads the image from the originating repository. If you have never transcribed the page before, T-PEN takes you to the line parsing interface. This adds a little time to the image loading as T-PEN parses the image in real time. When it finishes, you will see a page that looks like this:

T-PEN's Line Parsing Interface

T-PEN attempts to identify the location of each line on the page and then uses alternating colors to display those coordinates. As you can see, we make no claim of absolute perfection. We worked on this algorithm for almost two and a half years and, after extensive testing, we’ve been able to promise, on average, an 85% success rate. There are a number of factors that prevent complete accuracy, and so we offer a way for the transcriber to introduce corrections herself. You can add, delete or re-size columns, and insert or merge lines as well. You can adjust the width of individual lines if they vary in length. You can even combine a number of lines if you want to have them grouped together for your transcription. Sometimes manuscripts don’t fit well in our modern, rectilinear world: many handwritten texts were written at an angle or were so tightly bound that the page could not be photographed as flat. T-PEN ultimately doesn’t care: what really matters for connecting a transcription to a set of coordinates on a digital image is that the left side of the line box aligns with the written text. That’s the anchor.
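To give a feel for what "parsing" lines of script can involve, here is a toy sketch of one textbook technique, a horizontal projection profile: rows that contain ink belong to a line, and blank rows separate lines. This is not T-PEN's algorithm, which is not described in this post; the functions `find_text_lines` and `merge_lines` and the 0/1 page representation are assumptions made purely for illustration.

```python
def find_text_lines(page, min_ink=1):
    """Return (top, bottom) row ranges, one per detected line of script.

    page is a list of pixel rows; each row is a list of 0/1 values where
    1 marks a dark (ink) pixel. bottom is exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                    # first inked row: a line begins
        elif not has_ink and start is not None:
            lines.append((start, y))     # blank row: the line ends
            start = None
    if start is not None:                # page ended in the middle of a line
        lines.append((start, len(page)))
    return lines


def merge_lines(lines, first, last):
    """Combine lines[first] through lines[last] (inclusive) into one box,
    as a transcriber might when grouping lines together."""
    merged = (lines[first][0], lines[last][1])
    return lines[:first] + [merged] + lines[last + 1:]
```

A sketch this simple fails on exactly the cases mentioned above (slanted script, curved pages), which is one reason manual correction matters.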

When you are satisfied with the line parsing, you can start transcribing. The transcription interface looks like this:

T-PEN Transcription User Interface

This interface allows you to transcribe line by line, with the current line surrounded by a red box. There are some basic features to note. First, as you transcribe, the previous line is shown above, because sentence units are so often split across lines. Transcription input is stored in Unicode, and T-PEN will accept whatever language set the user has enabled on the computer. If there are special characters in the manuscript, the transcriber can insert them either by clicking on the special character button or by using a hotkey (the first ten are hot-keyed to CTRL+1 through 0).

Second, users can encode their transcription as they go. On this aspect, T-PEN is both innovative and provocative. Many scholarly projects that include text encoding often adopt a three-step process: the scholar transcribes the text and then hands it to support staff to complete the encoding, which is finally vetted by the scholar. However, there are many times in which semantic encoding of transcriptions has to include how the text is presented on the page. T-PEN innovatively allows scholars to integrate transcription (with the manuscript wholly in view) and encoding into one step. Often the best encoder is the transcriber herself. That innovation comes with a provocative concept, however. In digital humanities where TEI is the reigning orthodoxy, T-PEN is at least heterodox if not openly heretical. T-PEN’s data model neither expects nor requires a transcription to be encoded, much less to use TEI as the basis of structured text. Instead, T-PEN treats all XML elements as simply part of the character stream. T-PEN can support transcribers who don’t want to encode at all as well as those who are wholly committed to the world of TEI. For those who want to encode, a schema can be linked to a project to produce a set of XML buttons that can be used in the transcription interface.
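One consequence of treating XML elements as part of the character stream is that a tag may open on one manuscript line and close on the next, so a single line need not be well-formed on its own. The sketch below illustrates that idea with Python's standard XML parser. The sample lines, the `<transcription>` wrapper element, and both function names are invented for illustration and do not reflect T-PEN's internal storage.

```python
import xml.etree.ElementTree as ET

# Invented sample: each string stands for one transcribed manuscript line.
sample_lines = [
    "In principio <name>Deus",            # element opens on this line
    "omnipotens</name> creavit caelum",   # ...and closes on the next
    "et terram.",
]


def join_transcription(lines):
    """Join per-line strings into one character stream."""
    return "\n".join(lines)


def is_well_formed(text):
    """Check whether text parses as XML once wrapped in a root element."""
    try:
        ET.fromstring(f"<transcription>{text}</transcription>")
        return True
    except ET.ParseError:
        return False
```

Here `is_well_formed(sample_lines[0])` is false because the `<name>` element is still open, while the joined text parses cleanly; a transcription with no markup at all passes as well.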

Project Management

For those who simply want to start transcribing, project management will not be that important. For those who envisage a more sustained project (and perhaps a collaborative one at that), it will be vital. There are a number of components in managing a T-PEN project, but here I want to highlight two of them.

Collaboration. Like most digital tools, T-PEN allows you to invite collaborators to join your project. All members of a project have to be registered on T-PEN (but that’s free and requires only providing your full name and an email address). Managing collaboration has three features, of which only a few projects will use all three. There is first adding and deleting project members. Any member of a project can see who is also a member, but only the project leader can add or delete members. A project leader can even have T-PEN send an invitation to a non-T-PEN person and invite them to join (and once they do, they automatically become part of that project).

Collaboration in Project Management

Second, there is a project log to inspect. This log records any activity that changes the content or parameters of the project. This can be particularly helpful when tracking down how a transcription has changed in a shared project (and a user can display the history of each line in the Transcription UI). Finally, projects can make use of T-PEN’s switchboard feature. This is for transcription projects that may be part of a larger project, and where the transcriptions will be aggregated in another digital environment. Switchboard does two things for a project: (1) it allows different projects to share the same XML schema so that all transcriptions will conform to the larger project’s standards; and (2) it will expose the transcription through a web service to permit easy export to the larger project.

Project Options. The two more important options are button management and setting the transcription tools. As seen in the screen shot of the transcription interface, users can use buttons to insert both XML elements and special characters. Those buttons are created and modified as part of the project options. If there is an XML schema for the project, a project leader can link it to the project. Then in button management, the elements in that schema populate the XML button list. The button populator does not discern between metadata elements and elements found in the body of an encoding schema. Users then have to modify the button list to cull the elements that won’t be used during transcription. There’s an additional advantage to editing that list: each button can gain a more readable title. This can be helpful if the encoding schema exploits the varying use of the <seg>  or the <div> elements in TEI. When the possible deployment of the tag might be unclear to those with less experience with TEI, a more straightforward title can become a better guide to its use.
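
The button workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not T-PEN's own code: the `Button` class and the `populate_buttons`, `cull`, and `retitle` functions are hypothetical names, and the element list stands in for whatever a linked schema would supply.

```python
from dataclasses import dataclass


@dataclass
class Button:
    element: str  # XML element name taken from the schema
    title: str    # label shown to the transcriber

    def insert_text(self) -> str:
        # Text the button would drop into the transcription line.
        return f"<{self.element}></{self.element}>"


def populate_buttons(schema_elements):
    # One button per element; metadata and body elements are not distinguished.
    return [Button(name, name) for name in schema_elements]


def cull(buttons, unused):
    # Drop the elements that will not be needed during transcription.
    return [b for b in buttons if b.element not in unused]


def retitle(buttons, titles):
    # Give selected buttons a more readable label; others keep the element name.
    return [Button(b.element, titles.get(b.element, b.element)) for b in buttons]


# Illustrative element names (all are TEI elements; the selection is an assumption).
schema_elements = ["teiHeader", "titleStmt", "div", "seg", "abbr"]
buttons = populate_buttons(schema_elements)
buttons = cull(buttons, unused={"teiHeader", "titleStmt"})
buttons = retitle(buttons, {"seg": "Marginal gloss (seg)", "div": "Chapter (div)"})
```

After culling and retitling, the list holds three buttons, and the one for `<seg>` carries a label that tells a less experienced encoder how the project intends the element to be used.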

Special characters allow the user to insert Unicode characters that may not be represented on a standard keyboard. These can be created by entering the correct Unicode value for the character. The first 10 characters are mapped to the hotkeys CTRL+1 through CTRL+9, followed by CTRL+0.
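
As a rough sketch of how such a palette might be built, the Python below turns a list of Unicode code points into characters and assigns hotkeys to the first ten. The function name `build_palette` and the particular code points are assumptions chosen for illustration; they are not T-PEN's internals or defaults.

```python
def build_palette(code_points):
    palette = []
    for index, code_point in enumerate(code_points):
        # Only the first ten entries get a hotkey: CTRL+1 ... CTRL+9, then CTRL+0.
        hotkey = f"CTRL+{(index + 1) % 10}" if index < 10 else None
        palette.append({
            "char": chr(code_point),
            "code": f"U+{code_point:04X}",
            "hotkey": hotkey,
        })
    return palette


# Illustrative code points for characters common in pre-modern texts.
code_points = [
    0x00FE,  # thorn
    0x00F0,  # eth
    0x00E6,  # ae ligature
    0x017F,  # long s
    0x204A,  # Tironian et
    0x00B6,  # pilcrow
    0x0153,  # oe ligature
    0x01BF,  # wynn
    0x021D,  # yogh
    0x00E7,  # c with cedilla
    0xA751,  # p with stroke through descender
]
palette = build_palette(code_points)
```

With eleven entries, the eleventh character is still available as a button but has no hotkey.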

Finally, the set of tools available on the transcription interface is set in project options. T-PEN has thirteen built-in tools, and most of them were included to assist transcribers of pre-modern manuscripts. Some will be helpful to editors of modern texts. If those tools are not enough, the user can expand the list: all that is needed is the name of the tool and its URL. Once a tool is attached to the project, the user will be able to access it in the transcription interface.

Getting your Transcription out of T-PEN

Digital tools often fall into one of two categories. “Thinking” tools are ones that allow users to manipulate and process datasets in order to test a certain idea or to visualize an abstract concept. They can also allow the user to annotate a resource as a way of processing the scholar’s conception of the object’s meaning or the hermeneutical framework it may require. These tools are invaluable, but they do not easily produce results that can be integrated into a print or digital publication. The second type is what I call the production tool. With these applications, the final objective is to produce something that can be integrated in other contexts. T-PEN falls firmly into this second category, although it has its own annotation tool with which a user can record observations about each manuscript page (and it is compliant with the W3C standard, the Open Annotation Collaboration). Scholars normally transcribe for one of three reasons: to create a scholarly edition; to place those transcriptions in footnotes or in the appendices of a monograph; or to integrate an encoded text into a larger resource.

T-PEN supports four basic export formats: XML/plaintext, where the user can filter out one or more XML tags; PDF; RTF, which is compatible with most word processors; and finally, basic HTML. For the first one, if the user has attached a header to the project, that header can be included in the export. There is an important caveat here: T-PEN was not designed to be an XML editor. We do offer a basic well-formedness check (which stops at the first error), but T-PEN does not offer full validation services. Most scholars who encode with T-PEN export their transcriptions to an XML editor for full validation of the file. The last three export formats include some simple transformation for text decoration (italics, bold, etc.). Users can also export the whole transcription or specify a range based on the pagination (or foliation) of the manuscript.

T-PEN's Export Options
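
To make two of these ideas concrete, here is a minimal Python sketch of tag filtering on export and of a well-formedness check that reports only the first error. It is not T-PEN's implementation: the function names, the wrapper `<transcription>` root element, and the sample lines are all invented for illustration. The sketch also shows what it means for tags to be part of the character stream: each line is just a string, and nothing forces it to be well-formed.

```python
import re
import xml.etree.ElementTree as ET


def strip_tags(text, tags):
    # Remove start, end, and empty-element tags for the named elements,
    # keeping whatever text they enclosed.
    for tag in tags:
        text = re.sub(rf"</?{re.escape(tag)}(\s[^<>]*)?/?>", "", text)
    return text


def first_wellformedness_error(lines):
    # Wrap the lines in a single (assumed) root element and try to parse them.
    # The parser stops at the first problem, so at most one error is reported.
    document = "<transcription>" + "\n".join(lines) + "</transcription>"
    try:
        ET.fromstring(document)
    except ET.ParseError as error:
        return str(error)
    return None


# Invented sample lines; the second leaves <hi> unclosed.
lines = [
    "In principio <abbr>dns</abbr> creavit",
    "celum <hi>et terram",
]

plain = strip_tags("\n".join(lines), ["abbr", "hi"])
problem = first_wellformedness_error(lines)
```

Here `plain` contains the text with both tags removed, while `problem` holds a parser message about the unclosed element. A check like this says nothing about whether the encoding conforms to a schema, which is why full validation is left to an XML editor.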

This post only covers the basics of T-PEN. There are more features available to the user. There is a demonstration video on YouTube where you can walk with one of T-PEN’s research fellows as she begins a transcription project. T-PEN is freely available, thanks to a major investment from the Andrew W. Mellon Foundation and a Level 2 Start-up grant from the National Endowment for the Humanities. So go to t-pen.org and register for an account.

Early Modern OCR Project (eMOP) Receives Mellon Grant

October 2, 2012

ANNOUNCEMENT

English Professor Laura Mandell, Director of the Initiative for Digital Humanities, Media, and Culture (IDHMC), along with two co-PIs Professor Ricardo Gutierrez-Osuna and Professor Richard Furuta, are very pleased to announce that Texas A&M has received a 2-year, $734,000 development grant from the Andrew W. Mellon Foundation for the Early Modern OCR Project (eMOP, http://emop.tamu.edu ). The two other project leaders, Anton DuPlessis and Todd Samuelson, are book historians from Cushing Rare Books Library.

Over the next two years, eMOP will work to improve scholarly access to an extensive early modern text corpus. The overarching goal of eMOP is to develop new methods and tools to improve the digitization, transcription, and preservation of early modern texts.

The peculiarities of early printing technology make it difficult for Optical Character Recognition (OCR) software to discern discrete characters and, thus, to render readable digital output. By creating a database of early modern fonts, training the software that mechanically types page images (OCR) to read those typefaces, and creating crowd-sourced correction tools, eMOP promises to improve the quality of digital surrogates for early modern texts. Receiving this grant makes possible improving the machine-translation of digital page images with cutting-edge crowd-sourcing and OCR technologies, both guided by book history. Our goal is to further the digital preservation processes currently taking place in institutions, libraries, and museums globally.

The IDHMC, along with our participating institutions and individuals, will aggregate and re-tool many of the recent innovations in OCR in order to provide a stable community and expanded canon for future scholarly pursuits. Thanks to the efforts of the Advanced Research Consortium (ARC) and its digital hubs, NINES, 18thConnect, ModNets, REKn and MESA, eMOP has received permissions to work with over 300,000 documents from Early English Books Online (EEBO) and Eighteenth-Century Collections Online (ECCO), totaling 45 million page images of documents published before 1800.

The IDHMC is committed to the improvement and growth of digital projects and resources, and the Mellon Foundation’s grant to Texas A&M for the support of eMOP will enable us to fulfill our promise to the scholarly community to educate, preserve, and develop the future of humanities scholarship.

For further information, including webcasts describing the problem and the grant application as submitted, please see the eMOP website: http://emop.tamu.edu

For more information on our project partners, please see the following links.
ECCO at Gale-Cengage Learning
EEBO at ProQuest
Performant Software
SEASR
Professor Raghavan Manmatha at the University of Massachusetts Amherst
The IMPACT project at the Koninklijke Bibliotheek – National Library of the Netherlands
PRImA at the University of Salford Manchester
Department of Computer Science and Engineering, Texas A&M University
The Initiative for Digital Humanities, Media, and Culture, Texas A&M University
Cushing Memorial Library and Archives
The OCR Summit Meeting Participants

New Digital Projects I: Vernacular Aristotelianism and Digitized Archives at the Wellcome Library

September 28, 2012

The following guest post, the first of two parts, is from Andie Silva, Wayne State University

The University of Warwick, in association with the Newberry Library, has been conducting a long-term research project on “Reading Publics.” This project, led by Professor Simon Gilson, Dr. David Lines, and Dr. Maude Vanhaelen, encourages conversations about communities of readers, evidence of readership and reception, and the social and cultural involvement of individuals and networks of readers in the print marketplace. This research is possible in great part due to the growth of digitization projects and increasing availability of data and archival materials. As the project’s webpage outlines, however, “the availability of these resources not only varies greatly depending on language, author, country, and period, but also calls for careful methodological reflection.”

This summer, the program leaders organized three activities designed to foster conversation and scholarship on the topic of “Reading Publics” and digitization. I, along with nineteen other scholars from the United States, England, and Italy, was selected to participate in their final activity, a two-week workshop at the University of Warwick. During this workshop, we attended presentations on two new, exciting database and digitization projects: Vernacular Aristotelianism in Renaissance Italy, c. 1400-c. 1650 (University of Warwick); and the on-going project to digitize the entire catalogue at the Wellcome Library, one of the world’s largest collections of history of medicine materials. The following, the first of a two-part post, will focus on the Vernacular Aristotelianism database.

The Vernacular Aristotelianism database was launched in May 2012. So far, the catalogue accounts for over 400 titles, half printed books and half manuscripts. The goal of its developers is to catalogue all vernacular works that reference Aristotle or interpret Aristotelian works (including falsely-attributed texts), a helpful addition for those researching the reception and production of Aristotelian texts in early modern Europe. One of the greatest features of this database is the flexibility of its search engine. A sidebar menu allows searches to be conducted solely on “manuscripts,” “printed editions,” “authors,” “dedicatees,” and “printers.” Thus, a scholar interested in how many times Cosimo de Medici was the chosen dedicatee for Aristotelian-related works would quickly and relatively easily discover at least five works on her first attempt. The catalogued texts still appear in varying degrees of detail. All works, I believe, already have a basic listing, including date and location of first publication, author, printer, and a short description of the work.

A smaller, yet still impressive, number of records contain further detail: if the database’s current webmaster, Eugenio Refini, has physically visited the copy, he has shared his notes on the size, condition, and title-page details of particular editions. Since a lot of his notes pertain to specific copies, he will also note which edition he has seen, and where. Even better, Refini has put considerable effort into cataloguing paratextual information, including what kinds of paratexts are available in the work (epistles, indexes, notes) and whether or not the book includes any visual elements (though no specifics are given as to what kinds of visuals). A few texts also contain “internal descriptions,” where sections of the work are either fully transcribed or generally outlined.

This kind of deep-level information is still lacking from most North American databases and catalogue searches. Although it would be recognizably difficult to restructure a large website like EEBO so that it contains more non-authorial details (and do so consistently across records), many projects like Brown’s fantastic Women Writers Online or the University of Michigan’s Renaissance Liturgical Imprints could benefit from more comprehensive and transparent search options. Of course, that is not to mention many potentially exciting projects like British Literary Manuscripts Online and Arkyves, which are largely available by subscription only. This reliance on existing catalogues and older cataloguing methods, especially ones originally designed for material holdings, holds back many digital projects from their full potential as new search tools.

When the database was first presented at the workshop, we were impressed with the range of detail and information Dr. David Lines and Dr. Eugenio Refini have been able to gather. However, most of us were skeptical about their ability to offer the same level of detail for all their records. One pertinent suggestion from the group was the possibility of “crowd sourcing.” Although it could take a single scholar (or even a small group of scholars) a long time to add bibliographical details to all 400 works (their goal, I believe, is to expand the database in the future), if users could submit their own notations, that work could happen quickly and effectively. This would no doubt enrich the database beyond its already incredible achievements and make a number of new kinds of research possible.

There are, of course, a few limitations to the database. In order to make so many search terms immediately within reach, the page is visually overwhelming. The search button at the top is easily missed amongst all the information on the center of the page, and the preloaded first record that opens with the database might at first be confusing. Once the search is successfully performed, the user will need to find the browsing buttons at the top left of the page to sort through each result. For those uneasy with technology, these immediate challenges might be intimidating, and the researcher would unfortunately be missing out on a valuable and incredibly detailed resource.

Even for those of us not performing research on Aristotle, this database raises some important issues. First, the range of non-canonical texts yet to be properly catalogued and annotated, let alone studied, remains overwhelming. Smaller, single-focused websites like Vernacular Aristotelianism highlight how crucial the Digital Humanities have been to providing new and productive avenues for scholarship. We need more projects like this (and perhaps more government funding to make them possible).

Secondly, the organizers have taken into consideration an important shift (by no means wholly “new” anymore, but still time-consuming due to limited search methods) in bibliographical studies, having to do with the analysis of paratextual material and surface-level concerns as integral aspects of textual production and reception. Although scholars like Helen Smith and Michael Saenger have greatly contributed to the study of paratextual and material elements, most of these materials remain uncatalogued. What’s left to the scholar of paratexts is a manual archival search, browsing through texts one by one either digitally or at national archives. Vernacular Aristotelianism provides a helpful starting point of information that, although it does not replace visiting the physical copy, broadens the scope of research and expands the specificity of academic projects.

