Archive for March, 2010

ASECS Summary of “Some Noisy Feedback” Roundtable, Albuquerque 3/18/10

March 27, 2010

ECCO, EEBO, and the Burney Collection: Some “Noisy Feedback” Roundtable

Chair: Anna Battigelli (SUNY Plattsburgh). Panelists: Sayre Greenfield (University of Pittsburgh, Greensburg), Stephen Karian (Marquette University), James E. May (Penn State University, DuBois), Eleanor Shevlin (West Chester University), Michael Suarez (Rare Book School, University of Virginia). Respondents: Jo-Anne Hogan (ProQuest), Brian Geiger (ESTC, University of California, Riverside), and Scott Dawson (Gale/Cengage).

The following offers a summary of the roundtable that took place on Thursday, March 18, 2010, at the ASECS 2010 conference in Albuquerque, N.M. This session was the second part of a two-part series, the first part having been a roundtable discussion chaired by Eleanor Shevlin at the EC/ASECS meeting in Bethlehem, Pa., in October 2009. Copies of Eleanor’s summary of the EC/ASECS session (published in the Eighteenth-Century Intelligencer and also on this blog) were distributed at the outset of this session. Many thanks to the members of the audience who so cheerfully presented themselves at an early hour on the conference’s first day.

Sayre Greenfield opened discussion with detailed working solutions to problems caused by ECCO’s OCR (optical character recognition) software. He recommended that Gale provide an ECCO OCR troubleshooting page on its web site and noted that blogs like this one would be sure to start that process (see below). Aided by Deidre Stuffer, he found ways to correct for errors stemming from the letter combinations that OCR typically misreads: s, ss, and ct. Using the word fishmonger, he substituted every other letter for the s, then substituted numbers, and finally the wildcard question mark. Advice drawn from his search results, including how best to use the question mark as a wildcard, can be found on the ECCO OCR Troubleshooting Page in the “Pages” section of this blog. He warned that using the question mark for any medial or initial s is problematic if one is using variables elsewhere, adding that ECCO does not allow wildcards for the first letter of a word. Additionally, the letters surrounding the s seem to affect how the OCR reads it. The double ss, for example, frequently morphs into fl, transforming passion into paflion. Word searching within a text also proved problematic. Though he found 32 instances of passion or passions when he read John Tottie’s A View of Reason and Passion, his electronic search using passion* yielded only half of these. Turning to ct, he found that OCR often reads ct as t, so that objection becomes objetion. These results suggest that ECCO would help users by strengthening its web site, which currently recommends fuzzy searches to address OCR problems. Fuzzy searches create too many false positives, so a more robust help page on this issue is necessary. (For now, see Sayre’s ECCO OCR Troubleshooting Page on this blog.)
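For readers who want to try these substitutions systematically, here is a minimal sketch in Python. It is not Sayre's procedure and not a Gale tool. The function name ocr_variants and the substitution rules are assumptions for illustration, drawn only from the misreadings reported above (ss read as fl, ct read as t, and a question-mark wildcard for a medial s). The sketch leaves an initial s alone because ECCO does not allow a wildcard as the first letter.

```python
from itertools import product


def ocr_variants(word):
    """Return search strings covering the OCR misreadings noted above.

    Hypothetical helper: the rules are illustrative, not exhaustive.
    """
    word = word.lower()
    options = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair == "ss":
            # The double ss is frequently read as "fl" (passion -> paflion).
            options.append(["ss", "fl"])
            i += 2
        elif pair == "ct":
            # ct is often read as a bare t (objection -> objetion).
            options.append(["ct", "t"])
            i += 2
        elif word[i] == "s" and 0 < i < len(word) - 1:
            # A medial s can be replaced by the "?" wildcard. The first
            # letter is skipped because ECCO rejects a leading wildcard.
            options.append(["s", "?"])
            i += 1
        else:
            options.append([word[i]])
            i += 1
    # Combine every alternative and drop duplicates.
    return sorted({"".join(combo) for combo in product(*options)})


if __name__ == "__main__":
    for term in ("fishmonger", "passion", "objection"):
        print(term, "->", ocr_variants(term))
```

Each variant would still have to be entered as a separate search, and, as noted above, wildcards may behave differently when other variables are used in the same query.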

Steve Karian began by acknowledging the indispensability of ESTC for bibliometrics, but he also identified four problems that need to be addressed if the ESTC is to become the powerful tool it can be for the twenty-first century.  The first is the ESTC’s unit of measurement: the ESTC record.  Users often equate an ESTC record with an imprint, title, edition, or an issue.  Because of variations in the correlation of record to item, one cannot simply assume that two parallel sets of search “hits” can be compared reliably.  As he puts it, “one is constantly comparing apples to oranges.”  Additionally, field records vary, limiting or complicating the kinds of searches that can be done.  These need to be standardized if searching is to become reliable.  The two ESTCs—one at UC-Riverside, the other at the British Library—use the same data but different interfaces.  Dates are complicated because they appear in two MARC (Machine-Readable Cataloguing) fields.  Steve recommended deleting the MARC record entirely and replacing it with a new database structure, one designed to expand and grow.  He called for a new stage of innovation, allowing the ESTC to transform itself from a bibliographical catalogue into a bibliographical database.  Only through such a transformation will the ESTC become the powerful tool it promises to be.

Jim May discussed the Burney Collection, which he argued should be called the Burney Collection of Newspapers, Periodicals, and Other Printed Matter. Its material was first collected by Charles Burney, subsequently increased by the British Library, and eventually microfilmed before being turned over to Gale/Cengage. It includes material dating from the 1620s to beyond 1800 and material printed in Barbados, India, Ireland, and North America. Citing James Tierney’s comments at the Bethlehem meeting, Jim noted that the collection includes 237 newspapers and 161 periodicals, 60 of which are partially available in Adam Matthew’s Eighteenth-Century Journals series or ProQuest’s British Periodicals. Burney allows one to read an entire issue or study issues by year or month, and it offers searching, though this is problematic. According to Jim’s results, searching sometimes yields only 10% of the relevant items. Searching for “Tatler” between 1708 and 1712 yields 80 hits. Though he has found hundreds of advertisements for Smollett’s Continuation of the Complete History of England, only a few of these can be found through an electronic search. Similarly, only a third or fewer of the issues of The London Evening Post published in 1760-61 turn up when you search for “London Evening”. Robert Hume and Ashley Marshall have an essay forthcoming in Papers of the Bibliographical Society of America discussing Burney and noting, among other problems, how definite and indefinite articles interfere with searches. Jim also cited Simon Tanner’s article in D-Lib Magazine (July/August 2009), which found the following accuracy rates for Burney: character 75%, word 65%, significant word 48.4%, capitalized word 47.4%, and number 59.3%. The magnification feature enlarges pages by 100% and would be more useful if it magnified by 33%. Spread dates are misrepresented, due to the lack of editorial apparatus explaining when newspapers were actually issued.
Burney’s lack of editorial apparatus, cross references, comments, and so forth is a deficit. Having a scholarly editor, perhaps through a graduate-student or postdoctoral internship, would improve its utility. Also needed is a review of the entire database. A page dedicated to errors encountered by users would help, something EEBO is now working on in its “EEBO Interactions, A Social Network.”

Eleanor Shevlin identified three pressing needs: 1) fostering greater awareness of the context of texts; 2) encouraging collaboration among users; and 3) cultivating greater access to these electronic resources. She pointed to the need for bibliographical training in order to use these resources accurately and called for an examination of the cognitive effects these tools have on research processes. Specifically, she wondered how EEBO’s TCP transcriptions or ECCO’s searching mechanism affects research methodology. Noting that these tools provide opportunities to correct bibliographical inaccuracies, she urged a more standardized process through which corrections could be forwarded to the ESTC or to commercial databases. She also cited examples of productive collaboration among members of the bibliographic community, including her own experience correcting an error in Kansas’s Spencer Research Library, a correction made possible by sending ECCO’s image of the British Library’s copy of a text to Kansas. Finally, she noted that access continues to be a problem. Scholars in the U.S. work at a notable disadvantage compared to scholars in the U.K., who typically have access to ECCO and ECCO II through the Joint Information Systems Committee (JISC). ASECS President Peter Reill’s recent calls for feedback regarding access suggest that the issue is at least on the radar of those who can help, either through negotiations for large-scale access or through individual subscriptions.

Michael Suarez warned against the illusion of comprehensiveness in database searches.  Users are frequently unaware of what is missing in these databases, and the databases’ selectivity impoverishes word searches as tools for analysis.  Turning to the task of text-mining, he expressed skepticism regarding the mentalities of mining.  Where sustained engagement with individual texts allows for work linking texts to their culture and to other texts, textual extraction can produce radically decontextualized results.  Because these database tools are easy to use, we are, he warned, insufficiently uneasy with what they actually accomplish.  Suarez insisted that textual analysis demands an effort to fuse horizons between text and reader, a fusion that involves a reader’s deep engagement with a text’s historical context and with a text’s relationship to other texts.  Such contextualization, as James Boyd White would agree, is essential to a functional and robust literary hermeneutics.  Additionally, text-mining tools encourage scholars to work in even greater isolation, away from libraries and other scholars.  Precisely because the digital future will change the way we think, Suarez called for a greater bibliographical literacy in order to make these promising tools work properly.

Panelists’ Responses:

Jo-Anne Hogan (ProQuest) agreed with Michael’s concern regarding the impact of these digitization projects. She added that EEBO routinely receives emails pointing out errors, asking for missing items, and making recommendations, and that it works to incorporate these suggestions. But she also noted a growing digital divide: concerns voiced at conferences like ASECS differed from those at conferences on the digital humanities. At the latter, attendees ask EEBO to produce more tools for text-mining. It is sometimes difficult to reconcile the competing requests received. Money matters in these issues, and will always be a factor. She agreed that more could be done to align the bibliographic data in EEBO with that in the ESTC and pointed out that efforts are under way to make that happen. She also introduced the prospect of a social networking site for EEBO intended to facilitate communication between scholars and users so corrections can be reported and more contextual information can be made available. We hope to hear more from her about this on this blog in the near future. She concluded that access continues to be a concern and agreed with Eleanor that it is unfortunate not to have a model for broad access in the U.S. Personal subscriptions seem unlikely because such subscriptions cannot cover costs, at least not at subscription rates individuals are willing to pay. She hoped there might be a point in the future when ProQuest can provide broader access, but she could not guarantee such a thing. More promising is the prospect that about half of the books in EEBO will soon be available for purchase at reasonable rates via Print on Demand.

Scott Dawson (Gale) agreed that Sayre’s suggestion of a Help screen dedicated to OCR problems is an idea to consider seriously. He added that Gale would look into post-OCR checks that might correct results. 18thConnect will help by testing new OCR software on ECCO page images, and that might solve problems. Turning to Steve’s comments about ESTC, Scott noted that ECCO depends on ESTC for metadata, and that Gale is working with ESTC to add a link within the ECCO Full Citation to report problems with a given record. He agreed with Jim May that Burney presents additional obstacles to getting accurate OCR results. Gale has been working with the British Library to resolve the issue of spread dates and hopes to have an update in the next few months. On the issue of access raised by Eleanor, Scott mentioned that ECCO is concerned about the issue, but that by providing access to more than 500 institutions globally, it has helped make early modern printed material more accessible than is possible through hard copy or microfilm. Tiered pricing and consortia-designed contracts help non-ARL institutions find ways to subscribe to ECCO. He agreed with Michael Suarez that ECCO is incomplete, even with the 50,000 titles added through ECCO II. Gale is not planning an ECCO III, but the possibility of linking missing titles to ECCO is being considered.

Brian Geiger (ESTC) outlined two main areas of work at the Center for Bibliographical Studies and Research (CBSR), which manages the North American branch of the ESTC. First, they continue to upgrade and add records to the ESTC. They are processing OPAC extracts from libraries, and recently began on an extract from Oxford University that resulted in some 200,000 records that will be matched against the file. These OPAC extracts provide shelf marks (or call numbers) for existing items, and have turned up tens of thousands of new copies and hundreds of entirely new items. They are adding URLs from online collections. EEBO, ECCO and TCP are matched, though not yet displayed by the public version at the British Library. Brian has requested URLs from Google and will do the same from Internet Archive. They are digitizing title pages from paper reports submitted over the last two decades and will attach those images to the appropriate records, allowing users to compare a title page to its MARC record. They hope to have many of the title pages in the ESTC by 2011. And they have enhanced some 180,000 MARC records from title pages in ECCO. Second, the ESTC has started to assess how to transform the project from an online catalog to a flexible and interactive database-driven research tool. Brian corroborated Steve Karian’s assessment that this new resource should be built on relational databases, and noted with appreciation the value of the kind of collaborative thinking Steve offered about the project’s future. Brian emphasized that a number of partner projects and institutions should be involved in the redesign, to ensure that the new project meets a variety of user needs and to try to plan for the sharing of information across platforms. He mentioned some of the features that he thought should be included, among them user editing of bibliographic data and metadata and tools to send information to users about updates or changes to records.
He ended by pointing out that development of the database will require resources and the next stage of the ESTC’s evolution will be contingent on funding.  The ESTC is currently engaged in grant development.  It will be in a better position to discuss specific solutions once funding is secured.

ASECS Session: “ECCO, EEBO, and the Burney Collection: Some ‘Noisy Feedback’” (Roundtable)

March 13, 2010

Thursday, March 18,  9:45 – 11:15 a.m.

“ECCO, EEBO, and the Burney Collection: Some ‘Noisy Feedback’” (Roundtable)    Alvarado E

Chair:    Anna BATTIGELLI, State University of New York, Plattsburgh

1.    Sayre GREENFIELD, University of Pittsburgh, Greensburg

2.    Stephen KARIAN, Marquette University

3.    James E. MAY, Pennsylvania State University, DuBois

4.    Eleanor F. SHEVLIN, West Chester University

5.    Michael F. SUAREZ, S.J., Rare Book School, University of Virginia

RESPONDENTS: Scott DAWSON, Gale/Cengage; Brian GEIGER, ESTC; Jo-Anne HOGAN, ProQuest

Collaborative Reading: Elizabeth Scott-Baumann and Ben Burton’s “Encoding form: A proposed database of poetic form”

March 8, 2010

Elizabeth Scott-Baumann and Ben Burton’s recent paper, “Encoding form: A proposed database of poetic form,” for the E-Conference (February-March, 2010) of APPOSITIONS: Studies in Renaissance / Early Modern Literature and Culture, is suggestive of how new digital resources can be developed to augment the capabilities of existing tools such as EEBO and ECCO. Responding many years later to Heather Dubrow’s 1979 call for “new methodology in early modern studies,” Scott-Baumann and Burton are constructing a database devoted to poetic form. Their project will afford a means of studying poetic form historically and formally by enabling queries about poetic form and generic transformations that resemble those we can now pose about words, thanks to electronic databases such as EEBO and ECCO:

  • What is the origin (or origins) of a given form?
  • How does its structure, use, and meaning change over time?
  • Are there variations in use and meaning in different regions, or among different groups?
  • How does a given form relate to others, and how does this relationship change over time?

    Concentrating on sixteenth- and seventeenth-century poetry, Scott-Baumann and Burton will use existing EEBO-TCP texts and enhance them with additional mark-up that builds upon Text Encoding Initiative (TEI) tags. As those familiar with TEI documentation will recall, its tags include ones designed for encoding verse: “stanza divisions, caesurae, enjambment, rhyme scheme, and metrical information, as well as a special purpose rhyme element to support the simple analysis of rhyming words.” Because encoding capabilities extend beyond merely marking general formal conventions and can also entail encoding that represents interpretive judgments, Scott-Baumann and Burton will experiment with both possibilities. The inevitably time-consuming nature of their task will probably result in building the databases in stages.
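To make the idea of verse mark-up concrete, here is a small sketch in Python that reads a TEI-style stanza and collects its rhyme words. It does not reproduce Scott-Baumann and Burton's encoding, which the paper does not specify at this level. The sample uses TEI's lg and l elements, the rhyme attribute on lg, and the rhyme element with its label attribute. The TEI namespace is omitted for brevity, the function name rhyme_words is hypothetical, and the abab labels on the opening quatrain of Shakespeare's Sonnet 18 are supplied here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative TEI-style fragment (namespace omitted): one line group
# whose rhyme words are tagged and labelled a or b.
SAMPLE = """
<lg type="quatrain" rhyme="abab">
  <l>Shall I compare thee to a summer's <rhyme label="a">day</rhyme>?</l>
  <l>Thou art more lovely and more <rhyme label="b">temperate</rhyme>:</l>
  <l>Rough winds do shake the darling buds of <rhyme label="a">May</rhyme>,</l>
  <l>And summer's lease hath all too short a <rhyme label="b">date</rhyme>:</l>
</lg>
"""


def rhyme_words(xml_text):
    """Return the declared rhyme scheme and the rhyme words grouped by label."""
    root = ET.fromstring(xml_text)
    groups = {}
    for element in root.iter("rhyme"):
        # Collect each tagged word under its rhyme label.
        groups.setdefault(element.get("label"), []).append(element.text)
    return root.get("rhyme"), groups


if __name__ == "__main__":
    scheme, groups = rhyme_words(SAMPLE)
    print(scheme, groups)
```

A database of the kind proposed would presumably run queries of this sort across many texts at once, for instance to list every stanza declaring a given rhyme scheme.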

    As for publication plans for the database, its creators “aim to negotiate with EEBO and Chadwyck-Healey to find a form of publication which both respects intellectual property and commercial interests, while also making this rich new material accessible to the widest possible audience.” Scott-Baumann and Burton have clearly thought hard about issues of access and how to maximize this database’s availability for users. They present four different possible options, formulated with an eye to those lacking access to EEBO. As they note though, much will depend on what arrangements they are able to make with EEBO/Chadwyck-Healey.

    Noting that their database, once built, could be expanded beyond its present focus on the 1500s and 1600s to cover all periods of poetry, they then devote a section of their paper to its potential scholarly and pedagogical uses. Most obvious perhaps is the usefulness this planned tool could have in advancing work in historical formalism, an emerging approach that revisits “poetic form as historically specific, historically determined, and historically efficacious.” The ability to conduct specific searches across a significant number of poetic texts enables the quick capture of evidence to support or disprove what are currently only hypothetical propositions based on a small textual sample. Rightly claiming that this database “would change the way in which scholarship on poetic form is conducted,” Scott-Baumann and Burton detail a wealth of possible questions and issues it could serve. This section also offers a range of pedagogical uses for this tool and addresses a range of audiences from the undergraduate to the secondary student.

    Before a brief conclusion, the paper then turns to discussing the two-stage pilot project for the database:

    1. A small database containing information on the metrical structures and rhyme schemes of all verse in the first edition of 10 texts published between 1590 and 1599.
    2. A larger database containing information on the metrical structures and rhyme schemes of all verse in first editions of texts published during this period.

    Scott-Baumann and Burton’s database plans present another way of thinking about EEBO and how to augment its value. That they have proposed to build their database using EEBO-TCP seems essentially a wise plan, notwithstanding unsettled questions about access.* For one, linking one’s project to an already well-established resource should ensure its visibility. Too often very worthy projects are launched but remain unknown to many who would benefit from them. In addition, such a tie-in helps ensure continuity among resources. This augmentation of EEBO’s capabilities and the efforts to provide continuity are similar to what NINES and 18thConnect are offering later periods.

    *One of the access options does offer “[o]pen access to database and texts but not with mark up. …if we are not able to make the XML-encoded texts freely available, we would display the texts in their entirety [as users request them], but with the encoding invisible. … and display the verse with, for example, its stresses marked with accents, or its rhyme scheme colour-coded, rather than with visible tags.”

