Experienced users of ECCO know about the limits of its full-text capability. The long s in eighteenth-century fonts is one of many peculiarities that can wreck an automated effort at optical character recognition (OCR). Though I’m grateful that I can search ECCO and other databases using full text, I often wonder how complete my search is. I usually get a sense of how many false hits I find, but how many true hits am I missing? How accurate are the full-text capabilities of these resources?
A recent article presents a method for assessing the accuracy of OCR using the British Library’s 19th Century Newspaper Project as a case study:
Simon Tanner, Trevor Muñoz, and Pich Hemy Ros, “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive,” D-Lib Magazine 15.7/8 (2009).
This is available at:
The article briefly mentions Gale’s Burney newspapers project. One of the good points in this article concerns how we should measure accuracy:
Given a newspaper page of 1,000 words with 5,000 characters if the OCR engine yields a result of 90% character accuracy, this equals 500 incorrect characters. However, looked at in word terms this might convert to a maximum of 900 correct words (90% word accuracy) or a minimum of 500 correct words (50% word accuracy), assuming for this example an average word length of 5 characters. The reality is somewhere in between and probably more at the higher extent than the lower. The fact is: character accuracy of itself does not tell us word accuracy nor does it tell us the usefulness of the text output. Depending on the number of “significant words” rendered correctly, the search results could still be almost 100% or near zero with 90% character accuracy.
The term “significant words” refers to words that users are likely to search for, in contrast to function words (pronouns, prepositions, etc.). A textbase’s accuracy in terms of “significant words” is an appropriate yardstick for how useful its full-text search is.
The full article merits reading. The authors found that for significant word accuracy, the 19th Century Newspaper Project was 68.4% accurate and the Burney Newspapers was 48.4% accurate. Eighteenth-century newspapers can be astonishingly difficult to read even in the originals, so this low percentage is not that surprising. I suspect that ECCO is somewhere in between these two percentages.