Posts Tagged ‘Gale/Cengage’

Collaboration, Costs, and Digital Resources

January 30, 2010

On February 19 and 20 Yale will host a graduate student symposium, The Past’s Digital Presence Conference: Database, Archive and Knowledge Work in the Humanities. A quick survey of the conference program and available abstracts indicates several topics that dovetail with issues or subjects that have engaged emob. Jessica Weare’s paper, “The Dark Tide: Digital Preservation, Interpretive Loss, and the Google Books Project”, for instance, examines the discarding of material evidence in the process of digitizing Vera Brittain’s The Dark Tide. Similarly, Scott Spillman and Julia Mansfield’s presentation, “Mapping Eighteenth-Century Intellectual Networks”, discusses their work on Benjamin Franklin’s letters and their relationship within the Republic of Letters. The conference’s purpose also addresses many of the questions we have been posing on this blog:

■ How is digital technology changing methods of scholarly research with pre-digital sources in the humanities?
■ If the “medium is the message,” then how does the message change when primary sources are translated into digital media?
■ What kinds of new research opportunities do databases unlock and what do they make obsolete?
■ What is the future of the rare book and manuscript library and its use?
■ What biases are inherent in the widespread use of digitized material? How can we correct for them?
■ Amidst numerous benefits in accessibility, cost, and convenience, what concerns have been overlooked?

Peter Stallybrass is offering the keynote, and Jacqueline Goldsby will be the colloquium speaker, while Willard McCartney, Rolena Adorno, and others will appear on the closing roundtable. Such a lineup points to the range of perspectives represented. The conference is free to all affiliated with a university.

Among the places this conference has been announced is the JISC Digitisation News section of the UK Digitisation Programme website, and its announcement emphasizes the participation of students “from around the globe.”

Collaboration as it occurs across boundaries is the implicit topic of this posting, and I wish to use reports from the JISC website both as a springboard and as a contrast in discussing the topic.

A JISC report, Enriching Digital Resources 2008-2009, on the Enriching Digital Content program (a strand of the JISC Online Content Program) features a podcast with Ben Showers. Because JISC is national in scope, the program described offers a unified, coherent approach to advancing digital resources for the UK’s institutions of higher education; it represents a collaborative agenda. In this podcast Showers explains the purpose of the program: rather than fund the creation of new resources, the program invested £1.8 million to enhance and enrich existing digital content while also developing a system for universities and colleges to vet and recognize this work. He then turns to explaining the following four key benefits of this program:
• “unlocking the hidden”: “making things that are hard to access easy” to obtain and preserve. To illustrate, he cites the CORRAL (UK Colonial Registers and Royal Navy Logbooks) project, which opens up primary data both to make it much more widely available and to preserve it.
• enhancing experiences of students. Here Showers cites the Enlightening Science project at Sussex, which offers students opportunities to watch video re-enactments of Newton’s experiments and read original texts by Newton and others.
• speeding up research—once a document has been digitized, there is no need to repeat the process. The document will now be available for all other researchers to use.
• widening participation—engaging broader audiences including not only faculty and students within Britain’s educational community but also participants globally.

Turning to the new goals for the 2009-2011 program cycle, Showers notes an emphasis on the “clustering” of content, that is, bringing various projects together and establishing, when appropriate, links among them. Another focus is further building skills and strategies within institutions to deliver digital content effectively. Finally, he mentions the strengthening of transatlantic partnerships, and here the US National Endowment for the Humanities (NEH) is given as an example. Of course, there is a long history of scholarly collaboration between the NEH and British institutions, perhaps most notably on the English Short Title Catalogue (ESTC).

Indeed, through collaborative digital grants offered by JISC and NEH several transatlantic projects are underway or near completion, including the Shakespeare Quartos Archive, a collaborative effort involving Oxford University and the Folger Library, and the St Kitts-Nevis Digital Archaeology Initiative, undertaken by Southampton University and the Thomas Jefferson Foundation, Charlottesville, VA, to advance scholarship on slavery. There are several others as well.

Both the goals and benefits detailed by Showers are ones that would attract the support of diverse parties, and they do parallel many arguments being made on this side of the Atlantic for such work, including ones advanced by the NEH. Moreover, this and other JISC reports suggest that JISC has also helped broker mutually beneficial relationships between British universities and commercial vendors such as Gale/Cengage and ProQuest. Yet another JISC report, Value for Money, offers arguments that we need to be making and also points to the obstacles and divides affecting various types of collaboration in the United States.

After offering the following figures on the return of money invested in the JISC,

• For each £1 spent by JISC on the provision of e-resources, the return to the community in value of time saved in information gathering is at least £18.

• For every £1 of the JISC services budget, the education and research community receives £9 of demonstrable value.

• For every £1 JISC spent on securing national agreements for e-resources, the saving to the community was more than £26.

the report summary offers the following remarks:

These are the figures revealed by a recently-published Value for Money report on JISC services. Although many countries have centrally provided research and education networks, and some have provided supplementary services, no other country has a comparable single body providing an integrated range of network services, content services, advice, support and development programmes.

The cost-effectiveness of JISC is again highlighted in two sidebars:

These figures suggest that for every £1 JISC spent on securing national agreements for e-resources, the saving to the community was more than £26
and
The added value, equivalent to more than £156m per year, suggests the community is gaining 1.4 million person/days, by using e-resources rather than paper-based information.

The end of the summary further reinforces why investments in JISC benefit the UK as a whole:

The value of JISC activities extends beyond the benefits identified here. Education and research are high-value commodities that play an important role in the UK economy and underpin the UK’s global economic position.

The JISC’s “Value for Money” report contains the kinds of arguments and data that we in the US need to be putting forward. While our system of higher education does not operate under the centralized system that characterizes that of the UK, the push for more transparent reporting on and assessment of what our various universities and colleges are delivering perhaps provides an opportunity for new forms of collaboration. Through national scholarly societies, the NEH, Mellon Foundation, ALA, and more, we need to supply some “noisy feedback” from a dollars-and-cents/sense perspective about what investing in digital resources means not just for our institutions of higher learning but also for our society.

Summary of EC/ASECS Roundtable: Bibliography, the ESTC, and 18th-Century Electronic Databases

October 24, 2009

Bibliography, the ESTC, and 18th-Century Electronic Databases:  A Roundtable

Chair: Eleanor F. Shevlin (West Chester University)   Participants: James E. May (Penn State University—DuBois), James Tierney (University of Missouri—St. Louis), David Vander Meulen (University of Virginia), Benjamin Pauley (Eastern Connecticut State University), Brian Geiger (ESTC, University of California, Riverside), and Scott Dawson (Gale/Cengage).

The following offers a summary of the roundtable that took place, Saturday, October 10, 2009, at the EC/ASECS 2009 conference hosted by Lehigh University and held at Bethlehem, Pennsylvania, October 8-11, 2009.

 Jim May opened the roundtable, and his remarks highlighted and extended the discussion he offered in his essay, “Some Problems in ECCO (and ESTC),” in The Eighteenth-Century Intelligencer, 23.1 (Jan. 2009), the article that inspired this session and Anna Battigelli’s forthcoming roundtable at ASECS (March 18th, 9:45 am—11:15 am). Key issues Jim raised included the need to correct missing images, to address the “disappearance” of letters originally printed in red ink on title pages, and to bring the ESTC up to date. In addition, he noted that ECCO’s electronic index is not always representative of what is actually there digitally. Work is also needed on providing or revising information about subscription lists, textual history, and attributions in ESTC. While noting that he had already addressed problems with Burney in his The Eighteenth-Century Intelligencer article, 23.2 (May 2009) and that Jim Tierney would be discussing this tool next, Jim commented on the usefulness of Burney, particularly to those working on the history of a publication.

Turning to the Burney collection, Jim Tierney drew attention to the potentially confusing name of this electronic collection, which is by no means restricted to newspapers. Instead, it includes a good number of periodicals as well. Specifically, the collection consists of 237 newspapers and 161 periodicals, and, furthermore, some of the titles included are neither newspapers nor periodicals. That the Burney digitized collection follows the Anglo-American cataloguing procedure of creating a new entry every time a newspaper undergoes a title change results in the illusion of more titles than actually exist as well as confusion about the history of a given newspaper. Jim also provided a detailed handout (posted here as a page) listing the digitized periodicals (note: not newspapers) in Burney. The handout includes notes about missing issues, other locations where titles in Burney can be found, and a tentative list of Burney titles duplicated by other digitization projects. The two overarching points Jim made were the failure to have scholars involved in the planning of Burney and other digitization projects and the need for far greater collaboration among the creators/purveyors of these databases, librarians, and scholars. That given titles in Burney often include only a few issues when other issues were available elsewhere and, if digitized, would have approached a more complete run, exemplifies the need for far better coordination and collaboration.
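The inflation of titles that Jim describes can be pictured with a small sketch. It assumes a hypothetical list of catalogue entries in which each entry may name the earlier entry it continues; the titles, the `Entry` class, and the `continues` field are invented for illustration and are not drawn from Burney's actual records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    title: str
    # Title of the earlier entry this one continues (hypothetical field).
    continues: Optional[str] = None


def count_publications(entries):
    """Count distinct publications by following each entry back to its first title."""
    by_title = {e.title: e for e in entries}

    def root(entry):
        seen = set()  # guards against accidental loops in the links
        while entry.continues in by_title and entry.title not in seen:
            seen.add(entry.title)
            entry = by_title[entry.continues]
        return entry.title

    return len({root(e) for e in entries})


# Invented example: one paper renamed twice, plus one unrelated paper.
entries = [
    Entry("Paper A"),
    Entry("Paper A (first new title)", continues="Paper A"),
    Entry("Paper A (second new title)", continues="Paper A (first new title)"),
    Entry("Paper B"),
]

print(len(entries))                 # catalogue entries: 4
print(count_publications(entries))  # distinct publications: 2
```

Under these assumptions, four catalogue entries stand for only two publications, which is the gap between apparent and actual titles that the handout tries to untangle.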

While David Vander Meulen serves on the ESTC board, his remarks for the roundtable were offered in his role as a researcher and user of these tools. He began by noting that ESTC is an evolving tool, a work in progress, and that ECCO follows ESTC. Moreover, even as it progresses, the ESTC is still “functional and valuable” even though it is incomplete. Nonetheless, “any addition to ESTC will change the context.” An important development occurred in 2006 when the British Library initiated free access to this tool. As for problems, the ESTC had made the decision to truncate titles and places. Yet ECCO generally offers the full titles, while expanded locations can occasionally be found by going to public library catalogues. To improve these resources, David explained, we need to have an easier way to convey corrections to the British Library or the University of California, Riverside (the North American home of the ESTC) and, equally important, an ongoing staff to process editorial changes and comments. In discussing this need for a means of processing updates, David also raised the question of whether the uncontrolled notes field should be visible. Unfortunately, agencies that have funded the ESTC, as he explained in his closing remarks, have decided the project is complete. Obviously, given ESTC’s status as a work in progress, such a decision poses additional obstacles to continued updating and correcting.

Ben Pauley spoke next about a project he has initiated. He began by noting the lack of access that many institutions (and thus their scholars and students) have to paid databases such as EEBO and ECCO. Both the Internet Archive and Google Books, however, have a number of eighteenth-century books in their freely accessible databases. Yet it is typically very hard to identify properly what text one has accessed. Viewing these freely available texts as an opportunity, Ben established The Eighteenth-Century Book Tracker, a project in which he is supplying the bibliographic data so sorely lacking in eighteenth-century texts found in Google Books. Doing so has compelled him to become a textual scholar or an “accidental bibliographer.” Thus far, he has recorded about 150 copies not appearing in ESTC. At present, the project features 480 texts and 4 periodicals. Ben has been asked to write an article on the Eighteenth-Century Book Tracker for The Eighteenth-Century Intelligencer that will detail much more about his undertaking.

Speaking as the Associate Director and Resident Manager of the Center (University of California, Riverside), the North American home of the ESTC, Brian Geiger explained that the British Library’s ESTC role has focused on cataloguing its own collection and that the University of California, Riverside has handled everything else. In addition to reiterating points about the problem with truncated titles, he also discussed the lack of subject headings as a shortcoming. Turning to the digital surrogates of early modern imprints, he explained that the ECCO and Adam Matthew collections are based on ESTC, but EEBO is not. Next Brian addressed the need to foster better communication between ESTC and scholars. While the channels of communication between ESTC and librarians have remained strong, that has not been the case with scholars. Like Ben, Brian will also be writing an article on the ESTC for The Eighteenth-Century Intelligencer.

 Scott Dawson from Gale-Cengage concluded the presentations by roundtable panelists. He first supplied an historical overview of ECCO and Burney. In 1982 Research Publications began to microfilm the “Eighteenth Century” microform collection. By 2002 twenty-six million pages of eighteenth-century titles had been filmed. This microfilm collection is the basis for ECCO, but using the ESTC in conjunction with the microfilm has been overall a real plus for the project.  ECCO II, released at the start of this year, features 50,000 additional titles. By mid 2010 ECCO II, representing holdings from fifteen libraries, will be completed (titles from the Harry Ransom Center are still being prepared). ECCO and ECCO II, combined, will have made 185,000 eighteenth-century titles available to subscribers. As for the digitization of Burney, that project was handled by the British Library and not Gale-Cengage. Scott also addressed some of the problems that can and cannot be corrected. When pages are blurred, for instance, the microfilm plays a key role in what can be done. If the microfilm is clear, then the page is re-filmed. Yet if the problem occurred because the page is blurred in the microfilm, then, from the perspective of Gale, nothing can be done. When duplications of a title are discovered, however, the duplications can be deleted. 

After all six panelists had offered opening statements, the discussion was opened to the audience’s questions and comments. The point perhaps most stressed in the discussion with the audience was a need for far greater involvement by scholars in the creation and improvement of digital resources. In terms of updating or correcting resources, questions arose about how this might be done and what types of controls are needed. In subsequent discussions, the creation of advisory boards and (or) the involvement of a committee representing ASECS arose as possible avenues for communicating and addressing the scholar’s perspective more effectively. The establishment of an advisory board and/or ties with ASECS could play a vital role in future projects, and members of a board or ASECS committee could also devise potential solutions to some of the shortcomings with existing tools.  The resurrection of Factotum, the now defunct ESTC news publication of the British Library (ceased with issue no. 40 in 1995), or the initiation of a similar publication would be a way of establishing regular, ongoing communication with a broader base of scholars. (For those interested in the content of previous issues, see the index for Factotum.) Of course, an obstacle here is staffing and funding. Questions also arose about plans to make Burney more complete by digitizing issues not included for a particular newspaper or periodical title but available elsewhere. Yet that this digitization project had been undertaken by the British Library (see final report) and not Gale complicates the issue. Also, when asked about any plans for an ECCO III, Scott explained that the creation of ECCO II caused surprise among many libraries that had purchased ECCO because they believed that ECCO was complete at the time. When ECCO II was introduced for purchase, libraries were promised that there would not be any additional forms of ECCO.  
(Depending on the discovery of additional eighteenth-century titles, however, I see no reason that another collection could not be pursued; if enough material for another collection becomes available, then scholars need to insert and assert themselves in conversations with vendors and librarians and make the need and value of a third collection known.)

Another very real, pressing concern was the large number of scholars who do not have access to these databases and whose institutions are not likely to be able to afford these resources even in the future. The point was raised that all universities in the U.K. have access to ECCO and ECCO II for an annual hosting fee through the auspices of the Joint Information Systems Committee (JISC), “established by the UK further and higher education funding councils in 2006 to negotiate with publishers and owners of digital content.” Because the situation differs greatly in the U.S. (we have no higher education government council overseeing all our universities), we do not have such a prospect here. While Ben Pauley’s Eighteenth-Century Book Tracker promises to bring some order to the current anarchy that characterizes freely available eighteenth-century texts, his valuable project can’t and won’t solve the inequity of access in the United States.

Trial Access for Burney Collection and Search Methods

August 12, 2009

Gale/Cengage has generously agreed to offer a free trial of the Burney Collection for readers of this blog at http://access.gale.com/emob.  This provides us with an opportunity for an open discussion of the Burney Collection’s merits, both as a scholarly resource and as a pedagogical tool. 

In preparation for the two sessions on digital text-bases, it would be interesting to hear more about how users search Burney.  Search results can be overwhelming and show the need for the Library of Congress cataloguing and classification system to help categorize and make sense of the wealth of data that emerges from any given search.  Thomas Mann, a Reference Librarian at the Library of Congress, has a still useful 2005 discussion on the limits of computerized searching for research at http://www.guild2910.org/searching.htm.  Mann’s site might be particularly helpful in discussing computerized searching with students.  His example is that the 11,000,000 results for the word “Afghanistan” are unclassified, whereas under the LC system, they are neatly parsed into “Antiquities,” “Bibliography,” “Biography,” “Boundaries,” “Civilization,” and so forth.  So the argument in favor of LC classification and cataloguing is clear.

On the other hand, it would be foolish to overlook the value of non-classified search results.  Matthew’s post on machine reading makes clear the value of understanding more about what computers can do.  But how best to search Burney isn’t necessarily clear from the outset.  It would be very interesting to hear more about how individuals use search methods within ECCO, EEBO, and particularly Burney.  We are grateful to Gale/Cengage for making this collective review possible.

Digital Textbases and Optical Character Recognition (OCR)

July 16, 2009

Experienced users of ECCO know about the limits of its full-text capability. The long s in eighteenth-century fonts is one of many peculiarities that can wreck an automated effort at optical character recognition (OCR). Though I’m grateful that I can search ECCO and other databases using full text, I often wonder how complete my search is. I usually get a sense of how many false hits I find, but how many true hits am I missing? How accurate are the full-text capabilities of these resources?

A recent article presents a method for assessing the accuracy of OCR using the British Library’s 19th Century Newspaper Project as a case study:

Simon Tanner, Trevor Muñoz, and Pich Hemy Ros, “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive,” D-Lib Magazine 15.7/8 (2009).

This is available at:

http://www.dlib.org/dlib/july09/munoz/07munoz.html

The article briefly mentions Gale’s Burney newspapers project. One of the good points in this article concerns how we should measure accuracy:

Given a newspaper page of 1,000 words with 5,000 characters if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy) or a minimum of 500 correct words (50% word accuracy), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably more at the higher extent than the lower. The fact is: character accuracy of itself does not tell us word accuracy nor does it tell us the usefulness of the text output. Depending on the number of “significant words” rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.

The term “significant words” refers to words that users are likely to search for, in contrast to function words (pronouns, prepositions, etc.). A textbase’s accuracy in terms of “significant words” is an appropriate yardstick for how useful its full-text search is.
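The arithmetic in the quoted passage, and the idea of scoring only “significant words,” can be worked through in a short sketch. This is an illustration under stated assumptions, not the authors' actual procedure: the function names, the tiny function-word list, and the assumption that the true text and the OCR output are already aligned word for word are all mine.

```python
import math

# A deliberately tiny, hypothetical list of function words.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "he", "it", "by", "on"}


def word_accuracy_bounds(n_words, avg_word_len, char_accuracy):
    """Return (lowest, highest) possible word accuracy for a given character accuracy."""
    bad_chars = round(n_words * avg_word_len * (1 - char_accuracy))
    # Best case: the bad characters are packed into as few words as possible.
    fewest_bad_words = math.ceil(bad_chars / avg_word_len)
    # Worst case: every bad character spoils a different word.
    most_bad_words = min(n_words, bad_chars)
    return ((n_words - most_bad_words) / n_words,
            (n_words - fewest_bad_words) / n_words)


def significant_word_accuracy(truth, ocr):
    """Share of non-function words that the OCR reproduced exactly.

    Assumes truth and ocr are word lists aligned position by position.
    """
    pairs = [(t, o) for t, o in zip(truth, ocr)
             if t.lower() not in FUNCTION_WORDS]
    if not pairs:
        return None
    correct = sum(1 for t, o in pairs if t.lower() == o.lower())
    return correct / len(pairs)


# The example from the article: 1,000 words, 5 characters each, 90% character accuracy.
print(word_accuracy_bounds(1000, 5, 0.90))  # (0.5, 0.9)

# An invented sentence in which the long s is misread as f.
truth = "the house of commons passed a bill".split()
ocr = "the houfe of commons paffed a bill".split()
print(significant_word_accuracy(truth, ocr))  # 0.5
```

In the invented sentence, five of seven words are correct overall, yet only two of the four words a reader would plausibly search for survive, which is why the significant-word figure is the more telling one for full-text search.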

The full article merits reading. The authors found that for significant word accuracy, the 19th Century Newspaper Project was 68.4% accurate and the Burney Newspapers was 48.4% accurate. Eighteenth-century newspapers can be astonishingly difficult to read even in the originals, so this low percentage is not that surprising. I suspect that ECCO is somewhere in between these two percentages.

Collaborative Readings #1: Ian Gadd’s “The Use and Misuse of Early English Books Online”

July 7, 2009
We are launching a series of “Collaborative Readings,” borrowing the model popularized so successfully by David Mazella and Carrie Shanafelt on The Long Eighteenth, to discuss some of the items on our bibliography.  “Collaborative Readings” can run concurrently with other postings.

To begin this series, I’ll summarize Ian Gadd’s lucid “The Use and Misuse of Early English Books Online,” which argues that using EEBO properly requires an understanding of its evolution and of the evolution of the catalogues on which it relies.  Particularly crucial, Gadd argues, is an understanding of EEBO’s historical reliance on ESTC.

Gadd’s article falls into three parts.  Part 1 describes the three catalogues on which EEBO and ECCO are based: 

  • STC: Pollard and Redgrave’s Short-title Catalogue of Books Printed in England, Scotland, & Ireland, and of English Books Printed Abroad, 1475-1640
  • WING: Donald Wing’s Short-title Catalogue of Books Printed in England, Scotland, Ireland, Wales, and British America, and of English Books printed in other Countries, 1641-1700
  • ESTC: English Short Title Catalogue, which began its history as The Eighteenth Century Short Title Catalogue, but eventually incorporated material from the previous two catalogues to become The English Short Title Catalogue, retaining its acronym.

Each of these catalogues uses different cataloguing principles and different criteria of inclusion.  The former two differ in what they include, but both catalogue books that have been located (as opposed to copies known to have existed).  The ESTC, on the other hand, began as a computerized and comprehensive union catalogue, merging “together the existing catalogue records of other libraries.”  Because the ESTC includes items in the previous two catalogues, it is, as Gadd puts it,

a hybrid database consisting of three sets of catalogue records, each constructed on different principles.  Searching across these record sets, therefore, poses problems: the unsuspecting student, for example, interested in Stationers’ Company registrations of works might assume that registrations all but dried up after 1640 when in fact this is simply a consequence of information that STC recorded but Wing and ESTC routinely did not.
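Gadd's warning amounts to distinguishing “no registrations” from “registrations not recorded.” The sketch below illustrates that distinction with invented records; the field names and the `RECORDS_ENTRANCES` table are hypothetical and simply encode Gadd's statement that STC recorded this information while Wing and ESTC routinely did not.

```python
# Whether each source catalogue routinely recorded Stationers' Register
# entrances, following Gadd's description (the table itself is an assumption).
RECORDS_ENTRANCES = {"STC": True, "Wing": False, "ESTC": False}


def registrations_by_source(records):
    """Count registered works per source, using None where the source kept no such data."""
    totals = {}
    for rec in records:
        source = rec["source"]
        if not RECORDS_ENTRANCES.get(source, False):
            totals[source] = None  # unknown, which is not the same as zero
        else:
            totals[source] = (totals.get(source) or 0) + (1 if rec.get("registered") else 0)
    return totals


# Invented sample records for illustration only.
records = [
    {"source": "STC", "year": 1630, "registered": True},
    {"source": "STC", "year": 1635, "registered": False},
    {"source": "STC", "year": 1639, "registered": True},
    {"source": "Wing", "year": 1645, "registered": None},
    {"source": "Wing", "year": 1650, "registered": None},
]

print(registrations_by_source(records))  # {'STC': 2, 'Wing': None}
```

Reporting `None` rather than `0` for the Wing records keeps a naive tally from suggesting that registrations dried up after 1640.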

Part 2 details the evolution of microfilm collections based on these catalogues and their eventual digitization.  Two companies oversaw this process, eventually producing first EEBO then ECCO.

  • UMI: University Microfilms used STC and Wing to produce two series of microfilm collections known as “Early English Books, 1475-1640” and “Early English Books, 1641-1700.”  In 1998, UMI (now ProQuest) digitized copies from these collections to produce EEBO.
  • Research Publications produced a rival microfilm set based on the ESTC.  In 2003, Thomson Gale (now Gale/Cengage) digitized copies from this collection to produce ECCO.

EEBO was permitted to use the bibliographical records of the ESTC, but

it did so for its own purposes: certain categories of data were removed (e.g. collations, Stationers’ Register entrances), some information was amended (e.g. subject headings), and some was added (e.g. microfilm-specific details).

Additionally, there was no formal mechanism for synchronizing the data between the two resources.  Consequently, two divergent holding records exist in EEBO’s and ESTC’s respective catalogues. 

Gadd’s cautionary note pertains to the divergence between these two catalogues:

As both resources continue to amend and expand their bibliographical data for their own purposes, there is an increasing likelihood of significant discrepancy between the two resources. . . . there is no absolute one-to-one correspondence between the pre-1701 entries in ESTC and the materials on EEBO; there are—and will always be—items on ESTC not available on EEBO.

Because different copies in the same edition can vary, there is, Gadd explains,

a vital difference between any single bibliographical record on EEBO and the corresponding ‘image set’: the former describes the particular edition  (or issue), the latter is taken from one copy from that particular edition. Moreover, unlike scholarly facsimile editions, the selection process for microfilming was often arbitrary.  Copies were selected primarily by reference to the copies listed in STC and WING, with particular preference for certain major collections; they were not selected because they were considered representative of a particular edition.

Gadd suggests that EEBO refer to itself as “a library of copies, rather than a catalogue of titles.”

Gadd commends ProQuest for its receptivity toward the scholarly community.  Part 3 briefly reviews ECCO, noting its “underlying text-transcription,” which allows for searches but is flawed by the inaccuracy of the OCR software it uses.