Digital Textbases and Optical Character Recognition (OCR)

Experienced users of ECCO know about the limits of its full-text capability. The long s in eighteenth-century fonts is one of many peculiarities that can wreck an automated effort at optical character recognition (OCR). Though I’m grateful that I can search ECCO and other databases using full text, I often wonder how complete my search is. I usually get a sense of how many false hits I find, but how many true hits am I missing? How accurate are the full-text capabilities of these resources?

A recent article presents a method for assessing the accuracy of OCR using the British Library’s 19th Century Newspaper Project as a case study:

Simon Tanner, Trevor Muñoz, and Pich Hemy Ros, “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive,” D-Lib Magazine 15.7/8 (2009).

This is available at:

http://www.dlib.org/dlib/july09/munoz/07munoz.html

The article briefly mentions Gale’s Burney newspapers project. One of the good points in this article concerns how we should measure accuracy:

Given a newspaper page of 1,000 words with 5,000 characters if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy) or a minimum of 500 correct words (50% word accuracy), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably more at the higher extent than the lower. The fact is: character accuracy of itself does not tell us word accuracy nor does it tell us the usefulness of the text output. Depending on the number of “significant words” rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.
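The arithmetic behind those bounds can be checked with a short Python sketch. The function name and its parameters are my own illustration, not anything from the article, and it assumes (as the quoted example does) that every word has the same length.

```python
import math

def word_accuracy_bounds(total_words, avg_word_len, char_accuracy):
    """Return (lowest, highest) possible word accuracy for a page.

    Assumes every word has exactly avg_word_len characters, as in the
    example quoted above. Illustrative only.
    """
    total_chars = total_words * avg_word_len
    bad_chars = round(total_chars * (1 - char_accuracy))

    # Best case: errors cluster together, so each damaged word
    # absorbs as many bad characters as it can hold.
    fewest_bad_words = math.ceil(bad_chars / avg_word_len)

    # Worst case: every bad character falls in a different word.
    most_bad_words = min(bad_chars, total_words)

    lowest = (total_words - most_bad_words) / total_words
    highest = (total_words - fewest_bad_words) / total_words
    return lowest, highest

# The page from the quotation: 1,000 words, 5 characters each, 90% character accuracy.
low, high = word_accuracy_bounds(1000, 5, 0.90)
print(low, high)
```

For the page in the quotation this gives a floor of 50% and a ceiling of 90% word accuracy, which is the range the authors describe.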

The term “significant words” refers to words that users are likely to search for, in contrast to function words (pronouns, prepositions, etc.). A textbase’s accuracy in terms of “significant words” is an appropriate yardstick for how useful its full-text search is.
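To make the idea concrete, here is a rough Python sketch of how one might score significant-word accuracy for a single passage. This is not the method used in the article: the tiny list of function words, the bag-of-words matching, and all of the names are assumptions I have made for illustration.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical list of function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "to", "from", "and",
                  "he", "she", "it", "on", "at", "by", "for", "with"}

def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def significant_word_accuracy(truth, ocr):
    """Share of significant words in the correct text that also appear,
    spelled correctly, in the OCR output (word order is ignored)."""
    truth_counts = Counter(t for t in tokens(truth) if t not in FUNCTION_WORDS)
    ocr_counts = Counter(tokens(ocr))
    matched = sum(min(n, ocr_counts[w]) for w, n in truth_counts.items())
    total = sum(truth_counts.values())
    return matched / total if total else 0.0

# An invented sentence, with the long s misread as f in three words.
truth = "The ship sailed from Bristol to Jamaica"
ocr = "The fhip failed from Briftol to Jamaica"
print(significant_word_accuracy(truth, ocr))
```

In this invented sentence only three of thirty-three letters are wrong, yet only one of the four significant words ("Jamaica") survives, so a search for "ship" or "Bristol" would miss the passage.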

The full article merits reading. The authors found that for significant word accuracy, the 19th Century Newspaper Project was 68.4% accurate and the Burney Newspapers was 48.4% accurate. Eighteenth-century newspapers can be astonishingly difficult to read even in the originals, so this low percentage is not that surprising. I suspect that ECCO is somewhere in between these two percentages.

Tags: Burney Collection, Digital Humanities, ECCO, Gale/Cengage

This entry was posted on July 16, 2009 at 10:03 am and is filed under Burney Collection, Digital Literary Studies, ECCO.

9 Responses to “Digital Textbases and Optical Character Recognition (OCR)”

Eleanor Shevlin Says:
July 16, 2009 at 11:17 am
Thanks, Steve. Such discussion and articles should help advance our discussion here–and help us better prepare for our upcoming roundtables.

Jim May’s recent article in the Intelligencer (23.2 [May 2009]) on searching Burney matches the stats you outline here (I will read the full Tanner, etc. article today).

Ian Gadd Says:
July 19, 2009 at 10:35 am
I’ve written a short note on ECCO’s accuracy rates, looking specifically at its s/f differentiation, and have found error (not accuracy) rates of 56%-80%.

Anna Battigelli Says:
July 23, 2009 at 8:16 am
Ian:
Can you give us the reference to your note on s/f differentiation?

Eleanor Shevlin Says:
July 23, 2009 at 10:17 am
I had thought Ian was referring to the Coda to his “Use and Misuse of EEBO” piece, but please do let us know if there’s another source, Ian.

The Tanner et al. article is quite interesting, and I was struck by its reminder that “proper nouns, names and place names are harder for OCR engines to cope with” (9). In terms of using the Burney collection online for research, I would imagine that this trio of significant words would represent a large majority of words searched. I have not used Burney extensively, but I did find that if I used common nouns associated with the proper nouns for names and places that were the focus of my search, I received far better results. I wonder if others have had the same experience?

Anna Battigelli Says:
July 23, 2009 at 1:15 pm
One of the many interesting points in the Tanner article is the attempt to arrive at a more statistically precise assessment of accuracy in automated recognition technologies.

Eleanor’s question sounds like a good one: perhaps certain methodologies, such as searching for associated terms, yield a slightly more complete search. I would be interested in hearing more about how scholars use the Burney Collection.

Thanks, Steve, for bringing this article to our attention.

Eleanor Shevlin Says:
July 24, 2009 at 12:37 pm
What I appreciated about the accuracy assessment/measurement is its identification of specific problems that can then be addressed. Plus, this information also offers data/evidence to support/explain anecdotal experiences.

I use Burney to search for titles of works, how works are being advertised, business notices, bookseller/publisher activity, geographic information about commercial entities, partnerships among commercial entities, product placement, cultural announcements, and more.

I often start with proper names (of people, addresses, titles), and then I use the results of those searches to create new search strings. When I browse nearby dates within a title’s hits, I not infrequently find results that did not come up as hits in the actual search. This is a rather reductive description of my search strategies, but it offers the general idea of how I approach this tool.

Anna Battigelli Says:
July 25, 2009 at 6:58 am
That’s helpful, Eleanor. Thanks. It’s useful to see how scholars use these text-bases and to hear about their methods. I would also love to hear a library cataloguer talk about searching these text-bases.

Anna Battigelli Says:
July 29, 2009 at 9:38 am
Steve–

I’ve added the Tanner piece to the bibliography. Again, thanks for the excellent post and the reference.

AB

Eighteenth-Century Book Tracker « Early Modern Online Bibliography Says:
August 12, 2009 at 8:05 pm
[…] the problems with OCR of eighteenth-century print that Steve Karian recently reminded us of, I don’t think we can be too confident in using these cost-free sites as the foundation for […]
