ECCO OCR Troubleshooting by Sayre Greenfield

Substituting the letters noted below will produce a significant number of extra hits per word for full-text searches of the database and for searches within individual texts, especially in the early eighteenth century.

S:     If a search term contains lower-case S at the beginning or in the middle of a word, the most common optical character recognition misreading is L, followed to lesser degrees by F, J, and I.

SS:  If a search term contains lower-case SS in the middle of a word, by far the most common optical character recognition misreading is FL, especially if an I follows the SS.

CT:  If a search term contains lower-case CT, especially in the middle of a word, the most common optical character recognition misreading is as a single letter, usually T, followed to lesser degrees by C, D, A, and F.
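
The substitutions above can be generated mechanically. What follows is only a minimal sketch in Python, not a tool provided by ECCO: the function name ocr_variants and the choice to apply one substitution at a time are assumptions made for illustration. It lists, for a given search term, the likely misread spellings in roughly the order of frequency noted above.

```python
import re

# Misreadings taken from the notes above, most common first.
S_MISREADINGS = ["l", "f", "j", "i"]        # initial or medial s
CT_MISREADINGS = ["t", "c", "d", "a", "f"]  # ct read as a single letter


def ocr_variants(term):
    """Return spellings of `term` with one likely OCR misreading each.

    Hypothetical helper: only one substitution is applied per variant.
    """
    term = term.lower()
    variants = []

    def add(candidate):
        # Keep first-seen order and skip duplicates.
        if candidate != term and candidate not in variants:
            variants.append(candidate)

    # "ss" in the middle of a word is most often read as "fl".
    for m in re.finditer(r"(?<=\w)ss(?=\w)", term):
        add(term[:m.start()] + "fl" + term[m.end():])

    # A single "s" that is not the last letter of its word.
    for m in re.finditer(r"s(?=\w)", term):
        for repl in S_MISREADINGS:
            add(term[:m.start()] + repl + term[m.end():])

    # "ct" collapsed into a single letter.
    for m in re.finditer(r"ct", term):
        for repl in CT_MISREADINGS:
            add(term[:m.start()] + repl + term[m.end():])

    return variants


if __name__ == "__main__":
    for word in ("passion", "such", "perfect"):
        print(word, ocr_variants(word))
```

Each variant can then be tried as a separate search alongside the original term. A word-final S produces no variants here, since the notes above concern only S at the beginning or in the middle of a word.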

2 Responses to “ECCO OCR Troubleshooting by Sayre Greenfield”

  1. Improving ECCO « digitalhumanistbeginner Says:

    […] As my colleague Ian Gadd pointed out (in an as yet unpublished note) searching ECCO is a hit and miss affair, especially since the original OCR-generated text often failed to read the ‘long s’ correctly. The weakness of ECCO’s OCR text was also, inadvertently, revealed by JISC’s latest project ‘Historic Books’. This is a single interface for both EEBO and ECCO (and soon some 19thC collections) and can give users direct access to the OCR text of ECCO – and it’s easy to see that this is often in poor shape. Sayre Greenfield has helpfully noted some shortcuts to help get around this in a piece for Early Modern Online Bibliographies here. […]

  2. What lurks beneath the ECCO page | Jane Austen in the Machine Says:

    […] 16 pretty good. That’s fairly representative. I was anticipating trouble with the long-s (see ECCO OCR Troubleshooting by Sayre Greenfield) but also encountered some surprises. The lower-case “e” was sometimes read as […]
