I don’t have a whole lot to say about this, but I thought it was interesting to see a number of “New Media” scholars debating the decisions Google made while putting together Google Book Search. When we talk about the proprietary role that a company like Google has on such an important chunk of our information environment, it seems important for scholars and librarians to be able to give some direction and feedback. So far, this does not seem to have been the case. DM
Archive for July, 2009
Anyone working even briefly with archives learns immediately that cataloguing is an art, not a science, and that the successful use of archives demands familiarity with its cataloguers’ idiosyncrasies. In the first part of his meticulous “Some Problems in ECCO (and ESTC),” James May provides a hard look at cataloguing problems in both ECCO and ESTC. May’s article has inspired two forthcoming roundtable discussions at EC/ASECS and ASECS and this blog, so it makes sense to provide at least a cursory review of his arguments here. For May, bibliographical problems limit ECCO “as a set of digital facsimiles” (20). Below is a list of the topics he covers.
Holdings: Identifying the copy digitized is problematic because ECCO lists only the library holding the source copy, not the shelfmark.
- If the library holds more than one copy of the digitized text, “readers can’t know what is digitized unless its identifiable from MS annotations on the copy digitized” (21).
- If the source library’s cataloguing is ambiguous, duplicate entries for the same copy can result. Among the examples cited is the case of the National Library of Scotland, which provides different shelfmarks for each volume of Edward Young’s 2-volume Poetical Works. ESTC and ECCO thus list two copies of Young’s Poetical Works in that library, though there is only one (2-vol.) copy.
- There are “editions in the ESTC (some reproduced on ECCO) that are not separate editions but only reissues of earlier editions” (23).
- Examples include the false attribution of An Account of the Two Brothers, Perseus and Demetrius, . . . Collected from the Grecian History, written by the author of Busiris, . . . the Universal Passion, Satires & c. to Edward Young. Additionally, both ESTC and ECCO “fail to list Young as the author of A Sea-Piece (Dodsley, 1755).”
False claims regarding publishers, places of publication, and dates:
- Faulty information taken from title pages is often absorbed uncritically into the ESTC and ECCO. For May, “one problem with the ESTC and ECCO is that nobody surveys and edits its results” (25). He suggests that “there ought to be a way for scholars to post notes tagged to ESTC and ECCO entries for other scholars to read—a suggestion Rob Hume made . . . two years ago” in The Eighteenth-Century Intelligencer (n.s. 21.1.16).
Format errors: The format of smaller books in particular is often incorrectly listed. ECCO’s entry for The Works of Dr. Jonathan Swift, vols. 15-17 (1765), incorrectly calls those volumes 12mos rather than 18mos.
Incomplete or misrepresented works: “ECCO offers 16 of the 17 volumes of The Works of Jonathan Swift, D.D: D.S.P.D. With Notes . . . By J. Hawksworth (Dublin: Williams, 1767-1768)” (25). Sometimes frontispieces are missing.
Legibility of ECCO’s digital images:
- The images lack crispness
- They “fail to reproduce red-lettering on title-pages” (26).
- Some pages are unreadable or incomplete. See Vol. 3 of Smollett’s Continuation of the Complete History of England, 1762 (pages 167-68, 258, 321, or 328).
- Footnotes and marginalia are sometimes obscured.
- Stains and gutter loss are problems, with the latter obscuring “the third on all versos between pp. 14 and 24 of Swift’s A Tale of a Tub 1711: ESTC N136369, T49839, and N13640.
- “Images are sometimes distorted relative to their proportions in height and width” (26).
- Later editions are less frequently digitized, though they are often rarer and thus arguably in greater need of digitized preservation than earlier editions. Many of these subsequent editions were published in Ireland or Scotland, and thus ECCO is correspondingly weaker for a study of Scottish and Irish books and booktrade.
- First and revised editions are sometimes neglected. As May argues, “roughly half of the pre-1775 editions of Young are to be found in ECCO” (29). Only five of the nine editions of The Force of Religion appear in ECCO.
- As May points out, “one must know to exclude ‘not’ from a title search” (27)
- Sometimes searches for titles and dates fail if the work has not been tagged with the proper date
- ECCO’s searches miss “a certain percentage of words” (27).
Selectivity: Like Adam Matthew, which digitized periodicals, ECCO did not use a team of scholars in the process of selecting what should and should not be digitized. May argues that “more scholarly rigor was no doubt needed when the filming by Gale and its predecessors was done to decide which copy should be filmed” (28).
May concludes that “scholars need to provide a little noisy feedback to corporate ventures like ECCO if future projects are to benefit from their expertise” (29).
The following assertion from the Monk Project’s description (and quoted by Anna in her comments about this tool):
the scholarly use of digital texts must progress beyond treating them as book surrogates and move towards the exploration of the potential that emerges when you put many texts in a single environment that allows a variety of analytical routines to be executed across some or all of them
identifies an issue that has interested me for a while now and is behind my embryonic formulations of the differences between digital database collections that act as delivery systems (JSTOR, Project Muse, etc.) and those that proffer other functions such as serving as finding aids. The tendency to see digitized works such as those found in Google Books (in its present incarnation) as surrogates for physical books has frequently resulted in users’ frustrations and disappointment in using these resources. This tendency led me to title a paper I gave at last year’s MLA “When is a Book Not a Book?: Using Google Book Search.” Thus, when collecting additional material in preparation for the EC/ASECS and ASECS sessions, I was understandably drawn to an article by Hugh Amory entitled, “Pseudodoxia Bibliographica, or When is a Book Not a Book? When It’s a Record” (The Scholar & the Database: Papers Presented on 4 November 1999 at the CERL Conference Hosted by the Royal Library, Brussels, 2 : 1-14).
Amory’s article is concerned with the distortions and misconceptions that can result when historians treat an imprint catalogue’s entries as books or titles. Amory uses the term “imprint catalogues” to refer to ESTC (incorporating Pollard and Redgrave and Wing) and the machine-readable form of Evans reshaped for the North American Imprints Program (NAIP) (2) and distinguishes these research tools from the original Evans and from European bibliographies. While those interested should read the article in its entirety, I offer the following extracts that I found especially noteworthy or interesting:
“[O]ur bibliographies do not form a coherent series, employing different measures and various categories of the book” (1).
“Indeed, the term ‘imprint’ itself is peculiarly English in its ambiguous complexity. It comprehends both a publisher’s imprint or marque d’éditeur and a printer’s imprint or achevé d’imprimer, as well as the editions in which these imprints occur—i.e. an imprimé—or even fail to occur. Only in English, I believe, is it possible for an imprint to have no imprint” (2).
“…any systematic, comprehensive access to places of publication is neglected…Unlike current national bibliographies, too, imprint bibliographies regularly include false and fictitious imprints” (3).
“Peculiar too to Anglophone bibliography is the inclusion of colonial and postcolonial printing in the retrospective national bibliographies of the mother country” (3).
These imprint catalogues
were never designed to answer the general questions posed by book history — to calibrate the relative size of metropolitan and colonial printing, for example, of religious and secular production, or the rise of the novel. The scope of imprint bibliographies is retroactive, imposing territorial and cultural inclusions and exclusions that were alien to their periods. Indeed, even the cataloguing of a database is retroactive, defined by the nature of the question. The numbers that pour forth in such profusion represent a certain number of ‘hits’, not entries, and the fuller the cataloguing, the higher the number of ‘hits’. (4)
“…it may be unfair to demand that imprint catalogues ‘represent’ anything, even imprints, for whose history they provide no more raw material. Nor are they really designed for the production of statistics on literary or intellectual history, where, especially in the form of union catalogues, they serve rather as inventories” (7).
“[ESTC] is neither English, Short-Title, nor a Catalogue, since the ‘cataloguing’ is only a response shaped by the system at the user’s request. One of its most useful features, keyword searching, is precisely an index, whose accuracy and exhaustiveness depend on the illogical whims of language” (8).
“The very accessibility of these catalogues distorts their numbers, and the exclusion or cataloguing of serials makes them even less representative of ‘the amount of printing performed’ than Evans” (10).
“Nor is there any agreement on where a book ends and a pamphlet begins; as the Oxford English Dictionary remarks, ‘No absolute definition of a ‘book’ in this sense can be given’. … Escarpit, who abandons material concerns altogether, and proposes that the nature of a book is defined by how it is read–only opens up another abyss” (10).
“To provide a more meaningful series of data, a number of minor technical devices might be proposed… At present, one may record alternative places of publication in what is technically known in the MARC format as the 752 field, but we need a third, distinctive field for false or fictitious places, and the 752 field is all-too-rarely-used. One would like to link editions with issues, and issues with states that affect the imprint such as misprinted or variant dates in a unitary record” (12).
“Again, the history of the book in the English-speaking realm needs a variety of new catalogues: an on-line catalogue of early periodicals that, at a minimum, would provide a count of the true number of issues, including those that have probably been lost; a catalogue of lost editions of monographs, or some standard for incorporating this information in imprint catalogues like ESTC; and finally, a census of books described in early libraries” (12).
Laura Mandell has placed an update on 18thConnect that indicates that an agreement has been reached for 18thConnect to work with Gale. There’s a link to a recording of her ALA talk that is not opening for me, as well as the following news about a grant the project has received from ICHASS:
18thConnect: From PDF Images to Clean Data Sets, led by the University of Illinois’ Robert Markley, will use supercomputer time to run a parallelized optical character recognition (OCR) program on pages of images of 18th century printed texts, made available through its collaboration with Gale Group. The resulting archive of machine-readable 18th-century texts in history, literature, art, the sciences, and the emerging social sciences will be accessible to scholars for faceted searching, automated semantic tagging, hand encoding of digital scholarly editions, and data mining. By converting a vast archive of images into machine-readable texts, this project will provide a model for adapting OCR programs to field-specific problems that must be solved in order to preserve the full range of our cultural heritage.
I am hoping that Laura and Bob may be able to tell us more.
Among the text databases included in the Monk (metadata offer new knowledge) Project are ECCO and EEBO (both of which are part of the Text Creation Partnership, or TCP). While not addressing bibliographic errors, this initiative is relevant to our discussions on improving these tools. In particular, this project’s efforts are apparently aimed at providing scholars with the means to work more effectively and simultaneously with texts created and housed in different databases.
A recent PowerPoint presentation about the Monk Project, Tools for Textual Data (May 20, 2009), by John Unsworth sketches such issues as treating text as data, the Monk Project’s efforts to facilitate means to “mix and match” texts that reside in different databases, the development of features that will enable searches that users may wish to conduct (for example, what adjectives does a given author favor the most?), and the acceptable level of curatorial/user intervention. The tools being developed to allow both the posing of questions that users may wish to ask and the mining of the data to yield responses to these queries seem highly promising.
Under “Questions for Discussion” (slide 22), I was interested in the two-part query, “Should users be allowed to change, correct, or improve data? If so, under what constraints or conditions?” This question set seems directly pertinent to our discussion of how to improve bibliographic issues in these databases, but it rightly also asks about what sorts of constraints should (or need to) be in place–the answer to which would speak to issues of quality control. Another question, “Should those who provide collections also collect the results of work done on their collections? Why or why not?,” was surprising to me. While I could see how gathering information about the ways that the collections were being used and the results obtained could help developers improve these databases’ functionality and accuracy, the collection of this information–especially by the owners of databases that are commercial enterprises–seemed far more worrisome to me.
Experienced users of ECCO know about the limits of its full-text capability. The long s in eighteenth-century fonts is one of many peculiarities that can wreck an automated effort at optical character recognition (OCR). Though I’m grateful that I can search ECCO and other databases using full text, I often wonder how complete my search is. I usually get a sense of how many false hits I find, but how many true hits am I missing? How accurate are the full-text capabilities of these resources?
A recent article presents a method for assessing the accuracy of OCR using the British Library’s 19th Century Newspaper Project as a case study:
Simon Tanner, Trevor Muñoz, and Pich Hemy Ros, “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive,” D-Lib Magazine 15.7/8 (2009).
The article briefly mentions Gale’s Burney newspapers project. One of the good points in this article concerns how we should measure accuracy:
Given a newspaper page of 1,000 words with 5,000 characters if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy) or a minimum of 500 correct words (50% word accuracy), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably more at the higher extent than the lower. The fact is: character accuracy of itself does not tell us word accuracy nor does it tell us the usefulness of the text output. Depending on the number of “significant words” rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.
The term “significant words” refers to words that users are likely to search for, in contrast to function words (pronouns, prepositions, etc.). A textbase’s accuracy in terms of “significant words” is an appropriate yardstick for how useful its full-text search is.
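The arithmetic in the quoted passage is easy to check for oneself. What follows is only a rough sketch in Python, not the authors' own method: the function names, the very short list of function words, and the assumption that the original and OCR word lists are already aligned word for word are all mine, for illustration.

```python
import math

# Hypothetical, deliberately tiny list of function words (illustration only).
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "he", "she", "it"}


def word_accuracy_bounds(n_words, chars_per_word, char_accuracy):
    """Return (min_correct_words, max_correct_words) for one page."""
    total_chars = n_words * chars_per_word
    wrong_chars = round(total_chars * (1 - char_accuracy))
    # Best case: errors cluster, using up a whole word before touching the next.
    fewest_wrong_words = math.ceil(wrong_chars / chars_per_word)
    # Worst case: every wrong character lands in a different word.
    most_wrong_words = min(wrong_chars, n_words)
    return n_words - most_wrong_words, n_words - fewest_wrong_words


def significant_word_accuracy(truth_words, ocr_words):
    """Share of non-function words that the OCR output reproduces exactly.

    Assumes the two lists are already aligned word for word.
    """
    pairs = [(t, o) for t, o in zip(truth_words, ocr_words)
             if t.lower() not in FUNCTION_WORDS]
    if not pairs:
        return 0.0
    correct = sum(1 for t, o in pairs if t.lower() == o.lower())
    return correct / len(pairs)


# The page from the quotation: 1,000 words, 5 characters each, 90% accuracy.
low, high = word_accuracy_bounds(1000, 5, 0.90)
print(low, high)  # 500 900

# A long-s style misreading: "ship" and "sailed" come out as "fhip", "failed".
truth = ["the", "ship", "sailed", "to", "Lisbon"]
ocr = ["the", "fhip", "failed", "to", "Lisbon"]
print(significant_word_accuracy(truth, ocr))
```

In the second example, three of the five words are right, yet only one of the three words a reader would plausibly search for survives, which is the point the authors are making about character accuracy as a misleading yardstick.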
The full article merits reading. The authors found that for significant word accuracy, the 19th Century Newspaper Project was 68.4% accurate and the Burney Newspapers was 48.4% accurate. Eighteenth-century newspapers can be astonishingly difficult to read even in the originals, so this low percentage is not that surprising. I suspect that ECCO is somewhere in between these two percentages.
I received an offer of a trial subscription to the Eighteenth Century Journals Portal last year, but I did not have time to explore it. It is published by Adam Matthew Digital in the UK. I had been planning on mentioning it, but I also wanted to keep our focus on ECCO and EEBO for the time being because these two databases are the focus of our conference sessions.
Yet I just received a new offer of a trial, so I thought I would mention it (I suspect that the firm would be more than willing to offer trials to others). Have others used this resource? I will be trying it over the next few days.
Here is a description of what it contains:
Eighteenth Century Journals Portal consists of the following resources:
- Eighteenth Century Journals I
Newspapers and Periodicals, 1693-1793, from the Bodleian Library, Oxford
- Eighteenth Century Journals II
Newspapers and Periodicals, 1699-1812, from the Harry Ransom Humanities Research Center, University of Texas at Austin
- Eighteenth Century Journals III
Newspapers and Periodicals, 1680-1816, from British Library Newspapers, Colindale and Cambridge University Library
The Eighteenth Century Journals Portal offers seamless integration between all collections and enables streamlined browsing and searching via a single user interface.
Eighteenth Century Journals I is drawn from the Hope Collection at the Bodleian Library, Oxford. It brings together 76 rare journals printed between 1714 and 1799. The collection combines well-known publications with more minor works, offering users a wide-ranging view of eighteenth century publishing culture.
Authors represented include Joseph Addison, Henry Fielding, Horace Walpole, Richard Steele, Samuel Taylor Coleridge and Thomas Chatterton. Topics covered include law and policing; British colonial possessions; the South Sea Bubble, religion, female dress and the American and French revolutions; politics, marriage, and morality. A particular strength is eighteenth century drama, with over 19 titles relating to the theatre.
The collection offers students and scholars of the eighteenth century an opportunity to examine a variety of topical issues, and to compare a range of perspectives on the debates of the day. Accessible and easy to use, it will prove an invaluable addition to any library supporting studies of the eighteenth century.
Eighteenth Century Journals II from the Harry Ransom Humanities Research Center, University of Texas at Austin is a critical resource for the study of all aspects of the eighteenth century. It offers a wide variety of newspapers and periodicals which are not available elsewhere. Many are extremely rare and exist only in short runs. All are now offered in extremely clear greyscale images, supported by fully searchable text that has been double-keyed and is captured in the original format.
This resource can be used for the study of:
- British and European Literature
- Theatre and popular entertainment
- Politics and Religion
- The American and French Revolutions
- Popular morality and social life
- Picturesque and Landscape
- The origins and rise of Romanticism
- Exoticism and imperialism
Eighteenth Century Journals III is drawn from British Library Newspapers at Colindale and from Cambridge University Library. At the request of scholars and librarians, it focuses on journals published outside of London.
Scholars will be excited by the inclusion of Canadian, Caribbean and Indian journals and can explore the ways in which major world events such as the revolution in Haiti are reported in different areas.
They will also be delighted to see the large number of Irish journals, together with a good number of titles published in Edinburgh, Canterbury and Cambridge.
As always, the material has been carefully checked against Burney, ECCO, Early British Periodicals and other relevant sources to avoid needless duplication. All of the material has been double-keyed and is fully searchable, achieving 99.5% accuracy (much higher than using OCR).
To begin this series, I’ll summarize Ian Gadd’s lucid “The Use and Misuse of Early English Books Online,” which argues that using EEBO properly requires an understanding of its evolution and of the evolution of the catalogues on which it relies. Particularly crucial, Gadd argues, is an understanding of EEBO’s historical reliance on ESTC.
Gadd’s article falls into three parts. Part 1 describes the three catalogues on which EEBO and ECCO are based:
- STC: Pollard and Redgrave’s Short-title Catalogue of Books Printed in England, Scotland, & Ireland, and of English Books Printed Abroad, 1475-1640
- WING: Donald Wing’s Short-title Catalogue of Books Printed in England, Scotland, Ireland, Wales, and British America, and of English Books printed in other Countries, 1641-1700
- ESTC: English Short Title Catalogue, which began its history as The Eighteenth Century Short Title Catalogue, but eventually incorporated material from the previous two catalogues to become The English Short Title Catalogue, retaining its acronym.
Each of these catalogues uses different cataloguing principles and different criteria of inclusion. The former two differ in what they include, but both catalogue books that have been located (as opposed to copies known to have existed). The ESTC, on the other hand, began as a computerized and comprehensive union catalogue, merging “together the existing catalogue records of other libraries.” Because the ESTC includes items in the previous two catalogues, it is, as Gadd puts it,
a hybrid database consisting of three sets of catalogue records, each constructed on different principles. Searching across these record sets, therefore, poses problems: the unsuspecting student, for example, interested in Stationers’ Company registrations of works might assume that registrations all but dried up after 1640 when in fact this is simply a consequence of information that STC recorded but Wing and ESTC routinely did not.
Part 2 details the evolution of microfilm collections based on these catalogues and their eventual digitization. Two companies oversaw this process, eventually producing first EEBO then ECCO.
- UMI: University Microfilms used STC and Wing to produce two series of microfilm collections known as “Early English Books, 1475-1640” and “Early English Books, 1641-1700.” In 1998, UMI (now ProQuest) digitized copies from these collections to produce EEBO.
- Research Publications produced a rival microfilm set based on the ESTC. In 2003, Thomson Gale (now Gale/Cengage) digitized copies from this collection to produce ECCO.
EEBO was permitted to use the bibliographical records of the ESTC, but
it did so for its own purposes: certain categories of data were removed (e.g. collations, Stationers’ Register entrances), some information was amended (e.g. subject headings), and some was added (e.g. microfilm-specific details).
Additionally, there was no formal mechanism for synchronizing the data between the two resources. Consequently, two divergent holding records exist in EEBO’s and ESTC’s respective catalogues.
Gadd’s cautionary note pertains to the divergence between these two catalogues:
As both resources continue to amend and expand their bibliographical data for their own purposes, there is an increasing likelihood of significant discrepancy between the two resources. . . . there is no absolute one-to-one correspondence between the pre-1701 entries in ESTC and the materials on EEBO; there are—and will always be—items on ESTC not available on EEBO.
Because different copies in the same edition can vary, there is, Gadd explains,
a vital difference between any single bibliographical record on EEBO and the corresponding ‘image set’: the former describes the particular edition (or issue), the latter is taken from one copy from that particular edition. Moreover, unlike scholarly facsimile editions, the selection process for microfilming was often arbitrary. Copies were selected primarily by reference to the copies listed in STC and WING, with particular preference for certain major collections; they were not selected because they were considered representative of a particular edition.
Gadd suggests that EEBO refer to itself as “a library of copies, rather than a catalogue of titles.”
Gadd commends ProQuest for its receptivity toward the scholarly community. Part 3 briefly reviews ECCO, noting its “underlying text-transcription,” which allows for searches but is flawed by the inaccuracy of the OCR software it uses.
This notice just appeared on the SHARP-L list and seems somewhat relevant to our discussion: the naming of physical attributes, agents, etc. tied to rare books. In teaching an undergraduate seminar last fall that used ECCO, I discovered the need to create an illustrated glossary for my students to help them grapple with proper terminology for features they were encountering as we left our modern paperback editions behind and turned to ECCO’s digitized texts and 18th-century books from my own collection. We had reviewed many features in advance of our first paper, but I nonetheless received essays that indicated decided confusion–the bookplate mistaken as a frontispiece and subsequently analyzed in terms of the verbal text, issues about imprints, and much more.
From: SHARP-L Society for the History of Authorship, Reading & Publishing on behalf of Nina Schneider
Sent: Mon 7/6/2009 1:31 PM
Subject: RBMS Controlled Vocabulary Terms under consideration at ALA Annual
This message is being cross-posted.
Below is the link to the list of terms that are under consideration for the RBMS Controlled Vocabularies at the ALA Annual Conference in Chicago.
Everyone is welcome to make comments or suggestions. Please note, if you do want to comment and don’t yet have access to the wiki, please send me an email. I will incorporate all comments received by *Thursday, July 9th.*
Link to terms under discussion:
(If the link does not work, copy and paste into a new browser; the terms under current discussion are under the folder labeled “Term Records for Annual 2009.”)
Please let me know if you have any questions.
William Andrews Clark Memorial Library
2520 Cimarron Street
Los Angeles, CA 90018