Digital Textbases and Optical Character Recognition (OCR)


Experienced users of ECCO know about the limits of its full-text capability. The long s in eighteenth-century fonts is one of many peculiarities that can wreck an automated effort at optical character recognition (OCR). Though I’m grateful that I can search ECCO and other databases using full text, I often wonder how complete my search is. I usually get a sense of how many false hits I find, but how many true hits am I missing? How accurate are the full-text capabilities of these resources?

A recent article presents a method for assessing the accuracy of OCR using the British Library’s 19th Century Newspaper Project as a case study:

Simon Tanner, Trevor Muñoz, and Pich Hemy Ros, “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive,” D-Lib Magazine 15.7/8 (2009).

This is available at:

http://www.dlib.org/dlib/july09/munoz/07munoz.html

The article briefly mentions Gale’s Burney newspapers project. One of the good points in this article concerns how we should measure accuracy:

Given a newspaper page of 1,000 words with 5,000 characters if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy) or a minimum of 500 correct words (50% word accuracy), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably more at the higher extent than the lower. The fact is: character accuracy of itself does not tell us word accuracy nor does it tell us the usefulness of the text output. Depending on the number of “significant words” rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.
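The bounds in that passage come from simple arithmetic, which a few lines of Python can reproduce. This is only a sketch under stated assumptions: every word is exactly 5 characters long, and a word counts as wrong if it contains at least one wrong character. The function name is my own illustration and does not come from the article.

```python
import math

def word_accuracy_bounds(total_words, chars_per_word, char_accuracy):
    """Return (worst_case, best_case) word accuracy as fractions.

    Assumes uniform word length; a word is wrong if any character is wrong.
    """
    total_chars = total_words * chars_per_word
    wrong_chars = round(total_chars * (1 - char_accuracy))
    # Best case: the errors are packed into as few words as possible.
    fewest_wrong_words = math.ceil(wrong_chars / chars_per_word)
    # Worst case: every error lands in a different word.
    most_wrong_words = min(wrong_chars, total_words)
    best = (total_words - fewest_wrong_words) / total_words
    worst = (total_words - most_wrong_words) / total_words
    return worst, best

worst, best = word_accuracy_bounds(1000, 5, 0.90)
print(f"{worst:.0%} to {best:.0%}")  # prints "50% to 90%"
```

The same 500 wrong characters give either figure depending only on how they are distributed across words, which is why character accuracy alone says so little about searchability.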

The term “significant words” refers to words that users are likely to search for, in contrast to function words (pronouns, prepositions, etc.). A textbase’s accuracy in terms of “significant words” is an appropriate yardstick for how useful its full-text search is.
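To make the idea concrete, here is a rough sketch of how one might score significant-word accuracy for a passage when a hand-corrected transcription is available. It is not the procedure used by Tanner, Muñoz, and Ros. The stop list, the function name, and the sample sentence are all my own assumptions, and the alignment relies on Python's standard difflib module.

```python
from difflib import SequenceMatcher

# Tiny illustrative stop list (an assumption); a real study would use a fuller one.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "to", "and",
                  "he", "she", "it", "by", "at"}

def significant_word_accuracy(truth, ocr):
    """Share of non-function words in `truth` that the OCR text reproduces."""
    truth_words = truth.lower().split()
    ocr_words = ocr.lower().split()
    # Align the two word sequences so insertions and deletions do not
    # throw off every later comparison.
    matcher = SequenceMatcher(a=truth_words, b=ocr_words, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    total = correct = 0
    for i, word in enumerate(truth_words):
        if word in FUNCTION_WORDS:
            continue  # skip words nobody is likely to search for
        total += 1
        if i in matched:
            correct += 1
    return correct / total if total else 0.0

truth = "the ship arrived at Bristol in the morning"
ocr = "the fhip arrived at Briftol in the morning"  # long s misread as f
print(significant_word_accuracy(truth, ocr))  # prints 0.5
```

In this made-up sentence six of the eight words survive, but only two of the four significant words do, so a search for "ship" or "Bristol" would miss the passage. The sketch splits on whitespace only and ignores punctuation, which a serious measurement would need to handle.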

The full article merits reading. The authors found that for significant word accuracy, the 19th Century Newspaper Project was 68.4% accurate and the Burney Newspapers was 48.4% accurate. Eighteenth-century newspapers can be astonishingly difficult to read even in the originals, so this low percentage is not that surprising. I suspect that ECCO is somewhere in between these two percentages.


9 Responses to “Digital Textbases and Optical Character Recognition (OCR)”

  1. Eleanor Shevlin Says:

    Thanks, Steve. Such discussion and articles should help advance our discussion here–and help us better prepare for our upcoming roundtables.

    Jim May’s recent article in the Intelligencer (23.2 [May 2009]) on searching Burney matches the stats you outline here (I will read the full Tanner, etc. article today).


  2. Ian Gadd Says:

    I’ve written a short note on ECCO’s accuracy rates, looking specifically at its s/f differentiation, and have found error (not accuracy) rates of 56%-80%.


  3. Anna Battigelli Says:

    Ian:
    Can you give us the reference to your note on s/f differentiation?


  4. Eleanor Shevlin Says:

    I had thought Ian was referring to the Coda to his piece on the “Use and Misuse” of EEBO, but please do let us know if there’s another source, Ian.

    The Tanner et al. article is quite interesting, and I was struck by its reminder that “proper nouns, names and place names are harder for OCR engines to cope with” (9). In terms of using the Burney collection online for research, I would imagine that this trio of significant words would represent a large majority of words searched. I have not used Burney extensively, but I did find that if I used common nouns associated with the proper nouns for names and places that were the focus of my search, I received far better results. I wonder if others have had the same experience?


  5. Anna Battigelli Says:

    One of the many interesting points in the Tanner article is the attempt to arrive at a more statistically precise assessment of accuracy in automated recognition technologies.

    Eleanor’s question sounds like a good one: perhaps certain methodologies, such as searching for associated terms, yield a slightly more complete search. I would be interested in hearing more about how scholars use the Burney Collection.

    Thanks, Steve, for bringing this article to our attention.


  6. Eleanor Shevlin Says:

    What I appreciated about the accuracy assessment/measurement is its identification of specific problems that can then be addressed. Plus, this information also offers data/evidence to support/explain anecdotal experiences.

    I use Burney to search for titles of works, how works are being advertised, business notices, bookseller/publisher activity, geographic information about commercial entities, partnerships among commercial entities, product placement, cultural announcements, and more.

    I often start with proper names (of people, addresses, titles), and then I use the results of those searches to create new search strings. When I browse nearby dates within a title’s hits, I not infrequently find results that did not come up as hits in the actual search. This is a rather reductive description of my search strategies, but it offers the general idea of how I approach this tool.


  7. Anna Battigelli Says:

    That’s helpful, Eleanor. Thanks. It’s useful to see how scholars use these text-bases and to hear about their methods. I would also love to hear a library cataloguer talk about searching these text-bases.


  8. Anna Battigelli Says:

    Steve–

    I’ve added the Tanner piece to the bibliography. Again, thanks for the excellent post and the reference.

    AB


  9. Eighteenth-Century Book Tracker « Early Modern Online Bibliography Says:

    […] the problems with OCR of eighteenth-century print that Steve Karian recently reminded us of, I don’t think we can be too confident in using these cost-free sites as the foundation for […]

