ECCO, EEBO, and the Burney Collection: Some “Noisy Feedback” Roundtable
Chair: Anna Battigelli (SUNY Plattsburgh). Panelists: Sayre Greenfield (University of Pittsburgh, Greensburg), Stephen Karian (Marquette University), James E. May (Penn State University—DuBois), Eleanor Shevlin (West Chester University), Michael Suarez (Rare Book School, University of Virginia). Respondents: Jo-Anne Hogan (ProQuest), Brian Geiger (ESTC, University of California, Riverside), and Scott Dawson (Gale/Cengage).
The following offers a summary of the roundtable that took place on Thursday, March 18, 2010, at the ASECS 2010 conference in Albuquerque, N.M. This session was the second part of a two-part series, the first part having been a roundtable discussion chaired by Eleanor Shevlin at the EC/ASECS meeting in Bethlehem, Pa., in October 2009. Copies of Eleanor’s summary of the EC/ASECS session (published in the Eighteenth-Century Intelligencer and also on this blog) were distributed at the outset of this session. Many thanks to the members of the audience who so cheerfully presented themselves at an early hour on the conference’s first day.
Sayre Greenfield opened discussion with detailed working solutions to problems caused by ECCO’s OCR (optical character recognition) software. He recommended that Gale provide an ECCO OCR troubleshooting page on its web site and noted that blogs like this one would be sure to start that process (see below). Aided by Deidre Stuffer, he found ways to correct for errors stemming from the following letter combinations that OCR typically misreads: s, ss, and ct. Using the word fishmonger, he substituted every other letter for the s, then numbers, and finally the question-mark wildcard. Advice from his search results, including how best to use the question mark as a wildcard, can be found on the ECCO OCR Troubleshooting Page in the “Pages” section of this blog. He warned that using the question mark for any medial or initial s is problematic if one is using variables elsewhere, adding that ECCO does not allow wildcards for the first letter of a word. Additionally, letters surrounding the s seem to affect how the OCR reads the s. The double ss, for example, frequently morphs into fl, transforming passion into paflion. Word searching within a text also proved problematic. Though he found 32 instances of passion or passions when he read John Tottie’s A View of Reason and Passion, his electronic search using passion* yielded only half of these. Turning to ct, he found that OCR often reads ct as t, so that objection becomes objetion. These results suggest that ECCO would help users by strengthening its web site, which currently recommends fuzzy searches to address OCR problems. Fuzzy searches produce too many false positives, so a more robust help page on this issue is necessary. (For now, see Sayre’s ECCO OCR Troubleshooting Page on this blog.)
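The misreadings Sayre described can be turned into a list of search terms to try alongside the correct spelling. The Python sketch below is only an illustration, not a tool offered at the session: the function name ocr_variants and the rule table are hypothetical, and the rules cover just the three patterns reported above (ss read as fl, ct read as t, and a question-mark wildcard standing in for a medial s). It leaves an initial s alone because ECCO does not accept a wildcard as the first letter of a word.

```python
from itertools import product

# Two-letter combinations and the misreading reported for each in the summary.
PAIR_MISREADINGS = {"ss": "fl", "ct": "t"}


def ocr_variants(word):
    """Return the word plus candidate spellings an OCR-damaged text might contain.

    Hypothetical helper: it applies only the patterns described in the
    roundtable summary and makes no claim to cover every OCR error.
    """
    word = word.lower()
    slots = []  # each slot lists the alternatives for one stretch of the word
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in PAIR_MISREADINGS:
            # Keep the correct pair and add the reported misreading.
            slots.append([pair, PAIR_MISREADINGS[pair]])
            i += 2
        elif word[i] == "s" and 0 < i < len(word) - 1:
            # Medial s: offer the single-character wildcard as an alternative.
            slots.append(["s", "?"])
            i += 1
        else:
            # Any other letter, including an initial or final s, stays as it is.
            slots.append([word[i]])
            i += 1
    # Combine the alternatives; the correct spelling always comes first.
    return ["".join(combo) for combo in product(*slots)]


if __name__ == "__main__":
    for term in ("fishmonger", "passion", "objection"):
        print(term, "->", ocr_variants(term))
```

Each variant would still have to be searched separately, and longer words with several affected letters produce many combinations, so a list like this complements rather than replaces the advice on the troubleshooting page.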
Steve Karian began by acknowledging the indispensability of ESTC for bibliometrics, but he also identified four problems that need to be addressed if the ESTC is to become the powerful tool it can be for the twenty-first century. The first is the ESTC’s unit of measurement: the ESTC record. Users often equate an ESTC record with an imprint, title, edition, or an issue. Because of variations in the correlation of record to item, one cannot simply assume that two parallel sets of search “hits” can be compared reliably. As he puts it, “one is constantly comparing apples to oranges.” Additionally, field records vary, limiting or complicating the kinds of searches that can be done. These need to be standardized if searching is to become reliable. The two ESTCs—one at UC-Riverside, the other at the British Library—use the same data but different interfaces. Dates are complicated because they appear in two MARC (Machine-Readable Cataloguing) fields. Steve recommended deleting the MARC record entirely and replacing it with a new database structure, one designed to expand and grow. He called for a new stage of innovation, allowing the ESTC to transform itself from a bibliographical catalogue into a bibliographical database. Only through such a transformation will the ESTC become the powerful tool it promises to be.
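Steve’s point about the unit of measurement can be made concrete with a toy relational model. The sketch below is purely illustrative and is not the ESTC’s schema or any structure proposed at the session: the table names, column names, record identifiers, and sample rows are all hypothetical, and it uses Python’s built-in sqlite3 module only to show how labeling what each record describes lets one count records and editions separately.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE edition (
            edition_id INTEGER PRIMARY KEY,
            title      TEXT NOT NULL,
            year       INTEGER
        );
        -- Each catalogue record says which kind of unit it describes.
        CREATE TABLE record (
            record_id  TEXT PRIMARY KEY,
            edition_id INTEGER NOT NULL REFERENCES edition(edition_id),
            unit       TEXT NOT NULL CHECK (unit IN ('edition', 'issue'))
        );
        """
    )
    conn.executemany(
        "INSERT INTO edition VALUES (?, ?, ?)",
        [(1, "Hypothetical Title A", 1750), (2, "Hypothetical Title B", 1750)],
    )
    # Title A has one edition-level record and two issue-level records;
    # Title B has a single edition-level record.
    conn.executemany(
        "INSERT INTO record VALUES (?, ?, ?)",
        [("R1", 1, "edition"), ("R2", 1, "issue"),
         ("R3", 1, "issue"), ("R4", 2, "edition")],
    )
    return conn


def counts_for_year(conn, year):
    """Return (number of records, number of distinct editions) for a year."""
    hits = conn.execute(
        "SELECT COUNT(*) FROM record JOIN edition USING (edition_id) "
        "WHERE year = ?",
        (year,),
    ).fetchone()[0]
    editions = conn.execute(
        "SELECT COUNT(DISTINCT edition_id) FROM record "
        "JOIN edition USING (edition_id) WHERE year = ?",
        (year,),
    ).fetchone()[0]
    return hits, editions


if __name__ == "__main__":
    print(counts_for_year(build_demo(), 1750))
```

In this invented data the same year yields four records but only two editions, which is the apples-to-oranges problem in miniature: a raw hit count means little until one knows which unit each record represents.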
Jim May discussed the Burney Collection, which he argued should be called the Burney Collection of Newspapers, Periodicals, and Other Printed Matter. Its material was first collected by Charles Burney, subsequently increased by the British Library, and eventually microfilmed before being turned over to Gale/Cengage. It includes material dating back to the 1620s and beyond 1800 and material printed in Barbados, India, Ireland, and North America. Citing James Tierney’s comments at the Bethlehem meeting, Jim noted that the collection includes 237 newspapers and 161 periodicals, 60 of which are partially available in Adam Matthew’s Eighteenth-Century Journals series or ProQuest’s British Periodicals. Burney allows one to read an entire issue or study issues by year or month, and it offers searching, though this is problematic. According to Jim’s results, searching sometimes yields only 10% of the relevant items. Searching for “Tatler” between 1708 and 1712 yields 80 hits. Though he has found hundreds of advertisements of Smollett’s Continuation of the Complete History of England, only a few of these can be found through an electronic search. Similarly, only a third or fewer of the issues of The London Evening Post published in 1760-61 turn up when you search for “London Evening”. Robert Hume and Ashley Marshall have an essay forthcoming in Papers of the Bibliographical Society of America discussing Burney and noting, among other problems, how definite and indefinite articles interfere with searches. Jim also cited Simon Tanner’s article in D-Lib Magazine (July/August 2009), which found the following accuracy rates for Burney: character 75%, word 65%, significant word 48.4%, capitalized word 47.4%, and number 59.3%. The magnification feature enlarges pages by 100% and would be more useful if it magnified by 33%. Spread dates are misrepresented, due to the lack of editorial apparatus explaining when newspapers were actually issued.
Burney’s lack of editorial apparatus, cross references, comments, and so forth is a deficit. Having a scholarly editor (perhaps a graduate student or a postdoctoral intern) would improve its utility. Also needed is a review of the entire database. A page dedicated to errors encountered by users would help, something EEBO is now working on in its “EEBO Interactions, A Social Network.”
Eleanor Shevlin identified three pressing needs: 1) fostering greater awareness of the context of texts; 2) encouraging collaboration among users; and 3) cultivating greater access to these electronic resources. She pointed to the need for bibliographical training in order to use these resources accurately and called for an examination of the cognitive effects these tools have on research processes. Specifically, she wondered how EEBO’s TCP transcriptions or ECCO’s searching mechanism affects research methodology. Noting that these tools provide opportunities to correct bibliographical inaccuracies, she stressed the need for a more standardized process through which corrections could be forwarded to the ESTC or to commercial databases. She also cited examples of productive collaboration among members of the bibliographic community, including her own experience correcting an error in Kansas’s Spencer Research Library, a correction made possible by sending ECCO’s image of the British Library’s copy of a text to Kansas. Finally, she noted that access continues to be a problem. Scholars in the U.S. work at a notable disadvantage compared to scholars in the U.K., who typically have access to ECCO and ECCO II through the Joint Information Systems Committee (JISC). ASECS President Peter Reill’s recent calls for feedback regarding access suggest that the issue is at least on the radar of those who can help, either through negotiations for large-scale access or individual subscriptions.
Michael Suarez warned against the illusion of comprehensiveness in database searches. Users are frequently unaware of what is missing in these databases, and the databases’ selectivity impoverishes word searches as tools for analysis. Turning to the task of text-mining, he expressed skepticism regarding the mentalities of mining. Where sustained engagement with individual texts allows for work linking texts to their culture and to other texts, textual extraction can produce radically decontextualized results. Because these database tools are easy to use, we are, he warned, insufficiently uneasy with what they actually accomplish. Suarez insisted that textual analysis demands an effort to fuse horizons between text and reader, a fusion that involves a reader’s deep engagement with a text’s historical context and with a text’s relationship to other texts. Such contextualization, as James Boyd White would agree, is essential to a functional and robust literary hermeneutics. Additionally, text-mining tools encourage scholars to work in even greater isolation, away from libraries and other scholars. Precisely because the digital future will change the way we think, Suarez called for a greater bibliographical literacy in order to make these promising tools work properly.
Jo-Anne Hogan (ProQuest) agreed with Michael’s concern regarding the impact of these digitization projects. She added that EEBO routinely receives emails pointing out errors, asking for missing items, and making recommendations, and that it works to incorporate these suggestions. But she also noted a growing digital divide: concerns voiced at conferences like ASECS differed from those at conferences on the digital humanities. At the latter, attendees ask EEBO to produce more tools for text-mining. It is sometimes difficult to reconcile the competing requests received. Money matters in these issues, and will always be a factor. She agreed that more could be done to align the bibliographic data in EEBO with that in the ESTC and pointed out that efforts are under way to make that happen. She also introduced the prospect of a social networking site for EEBO intended to facilitate communication between scholars and users so corrections can be reported and more contextual information can be made available. We hope to hear more from her about this on this blog in the near future. Access, she concluded, continues to be a concern, and she agreed with Eleanor that it is unfortunate not to have a model for broad access in the U.S. Personal subscriptions seem unlikely because such subscriptions cannot cover costs, at least not at subscription rates individuals are willing to pay. She hoped there might be a point in the future when ProQuest can provide broader access, but she could not guarantee such a thing. More promising is the prospect that about half of the books in EEBO will soon be available for purchase at reasonable rates via Print on Demand.
Scott Dawson (Gale) agreed with Sayre’s suggestion that a Help screen dedicated to OCR problems is an idea to consider seriously. He added that Gale would look into post-OCR checks that might correct results. 18thConnect will help by testing new OCR software on ECCO page images, and that might solve problems. Turning to Steve’s comments about ESTC, Scott noted that ECCO depends on ESTC for metadata, and that Gale is working with ESTC to add a link within the ECCO Full Citation to report problems with a given record. He agreed with Jim May that Burney presents additional obstacles to getting accurate OCR results. Gale has been working with the British Library to resolve the issue of spread dates and hopes to have an update in the next few months. On the issue of access raised by Eleanor, Scott mentioned that ECCO is concerned about the issue, but that by providing access to more than 500 institutions globally, it has helped make early modern printed material more accessible than is possible through hard copy or microfilm. Tiered pricing and consortia-designed contracts help non-ARL institutions find ways to subscribe to ECCO. He agreed with Michael Suarez that ECCO is incomplete, even with the 50,000 titles added through ECCO II. Gale is not planning an ECCO III, but the possibility of linking missing titles to ECCO is being considered.
Brian Geiger (ESTC) outlined two main areas of work at the Center for Bibliographical Studies and Research (CBSR), which manages the North American branch of the ESTC. First, they continue to upgrade and add records to the ESTC. They are processing OPAC extracts from libraries, and recently began on an extract from Oxford University that resulted in some 200,000 records that will be matched against the file. These OPAC extracts provide shelf marks (or call numbers) for existing items, and have turned up tens of thousands of new copies and hundreds of entirely new items. They are adding URLs from online collections. EEBO, ECCO, and TCP are matched, though not yet displayed by the public version at the British Library. Brian has requested URLs from Google and will do the same from Internet Archive. They are digitizing title pages from paper reports submitted over the last two decades and will attach those images to the appropriate records, allowing users to compare a title page to its MARC record. They hope to have many of the title pages in the ESTC by 2011. They have also enhanced some 180,000 MARC records from title pages in ECCO. Second, the ESTC has started to assess how to transform the project from an online catalog to a flexible and interactive database-driven research tool. Brian corroborated Steve Karian’s assessment that this new resource should be built on relational databases, and noted with appreciation the value of the kind of collaborative thinking Steve offered about the project’s future. Brian emphasized that a number of partner projects and institutions should be involved in the redesign, to ensure that the new project meets a variety of user needs and to try to plan for the sharing of information across platforms. He mentioned some of the features that he thought should be included, among them user editing of bibliographic data and metadata and tools to send information to users about updates or changes to records.
He ended by pointing out that development of the database will require resources and the next stage of the ESTC’s evolution will be contingent on funding. The ESTC is currently engaged in grant development. It will be in a better position to discuss specific solutions once funding is secured.