More Information

When I ask people how they think digitized texts are made full-text searchable, rather than just being pictures, the responses vary. Some assume that letterforms and words are captured as part of the imaging process. While I'd like to respond with a sarcastic quip about a magical camera that can read, this isn't entirely unfounded, since many modern scanners can at least make an attempt at recognizing the text on a scanned page. We, however, use actual cameras instead of scanners; books, of course, are not cut apart into single pages, and antique paper (and parchment) documents are not forced through a tray feed.

The other response I typically get goes in the exact opposite direction. People often assume that if it is not an automatic process, there must be a human sitting down and transcribing every word on the page. That would make the digitization process wildly expensive. I'm also certain we'd run through these digital scribes fairly quickly, as they would lose their minds in the face of an endless sea of monotony and run screaming from the building.

The answer lies between these two extremes, in a process called optical character recognition (OCR). Specialized software allows the user to upload a series of images and define language parameters; the program then goes through each page, determining where the text blocks are and using the contrast between ink and paper to decipher which letters appear on the page. The resulting text can be reviewed, corrected and exported into a number of formats depending on the needs of the user. For projects that are uploaded to the HathiTrust (a digital repository), each page is saved as a text document using the same numbering as the page images so that the two can be easily collated.
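For readers curious what that page-by-page output looks like in practice, here is a minimal sketch of the "one text file per page image" pattern, using the open-source Tesseract engine (via the pytesseract library) as a stand-in for the commercial OCR software we actually use; the folder names and language codes are illustrative assumptions.

```python
# A minimal sketch, not our production workflow: OCR each page image and save
# the recognized text under the same number so image and text collate easily.
from pathlib import Path

from PIL import Image
import pytesseract

IMAGE_DIR = Path("book_scans")   # assumed: page images named 00000001.tif, 00000002.tif, ...
TEXT_DIR = Path("book_text")
TEXT_DIR.mkdir(exist_ok=True)

for image_path in sorted(IMAGE_DIR.glob("*.tif")):
    # Run recognition with an explicit language parameter (here Hebrew plus English).
    text = pytesseract.image_to_string(Image.open(image_path), lang="heb+eng")
    # Save the plain text using the page image's own number.
    (TEXT_DIR / f"{image_path.stem}.txt").write_text(text, encoding="utf-8")
```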

In digitizing the Cummings Collection, OCR is done in one of two ways, determined during the metadata creation and quality control stage when books are screened for their OCR friendliness. Works that the OCR software will easily recognize are run automatically, resulting in text documents that are not reviewed. Those that will certainly pose problems are also run automatically, but instead of individual text documents, the initial run saves an OCR project file. This file can be opened in the OCR software so that each page can be reviewed, edited and re-recognized. Once the entire book has been reviewed, each page can then be saved as an individual text file.
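In code terms, the screening step amounts to a simple fork in the pipeline. The sketch below assumes the metadata stage produces a per-book "OCR friendly" flag in a spreadsheet; the file names and columns are hypothetical, and the actual batch runs and project files live inside the OCR application rather than a script like this.

```python
# A hedged sketch of the two-path screening: friendly books go straight to an
# unattended OCR run, everything else is queued for manual review.
import csv
from pathlib import Path

def run_automatic_ocr(book_dir: Path) -> None:
    # Placeholder for the unattended batch run (see the page-loop sketch above).
    print(f"auto-OCR: {book_dir}")

review_queue = []
with open("book_metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):              # assumed columns: book_id, ocr_friendly
        book_dir = Path("scans") / row["book_id"]
        if row["ocr_friendly"].strip().lower() == "yes":
            run_automatic_ocr(book_dir)
        else:
            review_queue.append(row["book_id"])

# Books on this list get opened page by page in the OCR software for review.
Path("needs_review.txt").write_text("\n".join(review_queue), encoding="utf-8")
```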

Unfortunately, there are quite a few common characteristics that could confuse the OCR software. The most common I've encountered are described below.

Unfamiliar Characters/Scripts/Nekudot

It is possible to ‘train’ ABBYY FineReader (an OCR application) to read scripts it is not programmed to recognize, but my attempts have met with varying degrees of success. With regard to Hebrew vocalization, this raises a particularly interesting point: nekudot (pronunciation markers) make the text easier for a human to read but harder for a machine (Image 1).

Image 1. OCR will often read subscript vowels as a string of punctuation marks.

Antiquated or decorative scripts, such as the Fraktur seen in the example below (Image 2), pose a similar difficulty for OCR, though they can be equally tough for humans unfamiliar with the typeface. With training, OCR can do a semi-decent job of recognizing most of the text. However, while a human would certainly recognize that the round, decorative ‘C’, ‘E’, ‘G’, and ‘S’ are not @, $, ®, or ©, a computer doesn't have that same sensibility. Fortunately, there is an option to remove certain symbols from the list of available characters, forcing the program to recognize a letter instead (Image 3). While still wildly imperfect, the software makes a better go of it than before, and with further training accuracy might improve.

Image 2. Text recognition of Fraktur before training and removing unwanted characters.
Image 3. Text recognition of Fraktur after training and removing unwanted characters.
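The character-restriction idea is not unique to FineReader. As a rough analogue only, and not the workflow described above, the open-source Tesseract engine exposes the same concept as a character blacklist; the filename, the "frk" Fraktur language model, and the choice of blacklisted symbols here are assumptions for illustration.

```python
# A sketch of restricting the available character set so the engine must fall
# back on an actual letter instead of a look-alike symbol.
from PIL import Image
import pytesseract

fraktur_page = Image.open("fraktur_page.tif")    # illustrative filename

text = pytesseract.image_to_string(
    fraktur_page,
    lang="frk",                                  # Tesseract's Fraktur model
    config="-c tessedit_char_blacklist=@$®©",    # forbid commonly confused symbols
)
print(text)
```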

Mise-en-page

The physical layout of type on the page and the layering of graphical elements is by far the issue I have to correct most frequently. Part of this is because many of the texts in the Cummings Collection are religious texts which, like their counterparts in other religions, contain not just the source text but also explanations and expansions of its meaning by multiple religious authorities. Humans can distinguish fairly easily between the different chunks of text because the human brain is able to recognize and interpret the many indicators the compositor (the person who physically laid out the type on the page) employed. In codicology, the purposeful layout of a page in order to make the different sections of a text clear is called the ordinatio, and scribes a millennium ago were using the same ideas that modern compositors employed and contemporary scholars use today: larger type helps to indicate a heading, different typefaces and fonts show that two chunks of text are distinct entities, as does spacing between groups of text.

So part of why we can tell how the islands of text in the image below are separated is that the heading above primes us to expect two separate entities instead of one large one, which in turn makes it easier to recognize that the sliver of space between the two paragraphs is more meaningful than the ordinary space between words. Computers and software don't have the same ability to recognize and add up all of these indications in order to properly ‘read’ the layout of a page. They certainly try, but the results vary, and are sometimes comically off. Such is the case in the image below (Image 4), where OCR correctly identified a few of the text blocks (green) in the top third of the page but was horribly off in the bottom two-thirds, inserting awkwardly shaped images (red) and text blocks that have seemingly little to do with the spatial layout of the page.

Image 4. Commentaries often have complicated layouts that OCR has difficulty recognizing.
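One practical way to catch this kind of failure without reading every page is to dump the block boundaries the engine has found and eyeball them. The sketch below does this with the open-source Tesseract engine rather than the software described in this post, and the filename and language code are illustrative assumptions.

```python
# Print the bounding box of every text block the engine detects on a page;
# implausible carve-ups of the layout tend to jump out from this listing.
from PIL import Image
import pytesseract
from pytesseract import Output

page = Image.open("commentary_page.tif")        # illustrative filename
data = pytesseract.image_to_data(page, lang="heb", output_type=Output.DICT)

blocks = {}
for i, level in enumerate(data["level"]):
    if level == 2:                              # level 2 = a text block in Tesseract's hierarchy
        blocks[data["block_num"][i]] = (
            data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        )

for num, (left, top, width, height) in sorted(blocks.items()):
    print(f"block {num}: x={left} y={top} w={width} h={height}")
```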

Creativity also has a habit of tripping up OCR. This children's book from 1923 printed some of its anthropomorphized food characters, such as Papa Baked Potato, in red behind the text (Image 5). While it can sometimes be tricky, the human eye can generally distinguish the red from the black and separate word from image. OCR cannot, and the text output is a jumbled mess.

Image 5. Papa Baked Potato makes this simple text harder to read.
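Because the interference here is a color rather than a shape, one common remedy is to suppress the red layer before recognition. The following is a rough sketch of that idea, not the treatment actually applied to this book; the filename and the threshold values are assumptions.

```python
# Keep only pixels that are dark in every channel (black ink) and push the
# reddish under-printing to white, then OCR the cleaned image.
import numpy as np
from PIL import Image
import pytesseract

page = np.array(Image.open("papa_baked_potato.tif").convert("RGB")).astype(int)
r, g, b = page[..., 0], page[..., 1], page[..., 2]

# Black ink is dark in all three channels; the red figure stays bright in red.
is_black_ink = (r < 100) & (g < 100) & (b < 100)

cleaned = np.where(is_black_ink, 0, 255).astype(np.uint8)   # binarized page
text = pytesseract.image_to_string(Image.fromarray(cleaned), lang="eng")
print(text)
```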

Material Degradation and Printing Errors

Material degradation, such as tears or wormholes, creates shadows across a text block that the OCR software interprets as ink on the page and tries to read. Printing errors such as the one shown below (Image 6) also cause problems because the software is unable to jump across the chasm and stitch the sections back together. Instead, it interprets each chunk as an isolated text block, resulting in fragmented sentences that start and end in the middle of a phrase.

Image 6. This page was folded when it was printed, so the text became fragmented when the page was unfolded.

Limitations with Image Capture

With some books, it is simply impossible to capture photos without also capturing part of the neighboring page (Image 7). Small bits of stray text the software can usually ignore, but larger chunks understandably confuse it, and it recognizes the other page's text as a continuation of the current page. With this issue in mind, deciding how to sort books during that initial screening stage is tough, because you are essentially trying to gauge how annoying the problem would be for someone using the digital book, and whether that level of annoyance is high enough to warrant the hours you will put into fixing it. Unlike complicated page layouts, this is a quick fix: simply delete the extra text areas. However, doing that 100, 200, 300+ times adds up, as it does with all of the issues noted above. Is the time saved worth having parts of a neighboring page repeated throughout the searchable text? Additionally, these extra text areas sometimes throw the software off badly enough to wreck the page recognition, requiring extra effort on my part to set things right, and it is not at all obvious in advance whether that will happen. Often one book will have some pages that are recognized perfectly, some that pick up an extra strip or two of text as in the image below, and some that are all over the place.

Image 7. The format of this book resulted in a big chunk of the neighboring page being captured as well.
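Deleting the stray text areas by hand is the approach described above; for completeness, here is a hedged sketch of how the same cleanup could be approximated automatically by discarding detected words that sit in the outer strip of the image. It again uses Tesseract as a stand-in, the gutter is assumed to be on the right, and the 90% cutoff is an arbitrary illustration.

```python
# Run recognition, then drop any word whose bounding box starts in the outer
# strip of the image, on the assumption that it bled in from the neighboring page.
from PIL import Image
import pytesseract
from pytesseract import Output

image = Image.open("page_with_gutter_bleed.tif")   # illustrative filename
data = pytesseract.image_to_data(image, lang="heb", output_type=Output.DICT)

cutoff = int(image.width * 0.90)    # gutter side and cutoff are assumptions
kept_words = [
    word for word, left in zip(data["text"], data["left"])
    if word.strip() and left < cutoff
]
print(" ".join(kept_words))
```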

As noted above, while these issues are easy enough to fix, they are complicated by scale. Every edit is time spent, and multiplied by hundreds of pages per book in a collection of hundreds, thousands or even tens of thousands of books, that means a significant amount of time, and thus wages, is spent drawing colorful rectangles on a digital page. In screening which books to run automatically, I limit the number of books for which I spend time drawing these rectangles, but in the process I am effectively turning a blind eye to the inevitable mistakes that make their way into those digital transcriptions. Nor am I eliminating mistakes by fixing the colorful rectangles on the pages I do look at individually. Rather, I am redefining the recognition area in order to give OCR the best chance possible at correctly transcribing a page. Sometimes that means 99% accuracy, and sometimes it means 60%. Every institution must decide for itself whether it is then worth going in and correcting the text to get it closer to 100% accuracy. For a small collection in the institution's native language, it might be. For a collection of 30,000 volumes in a foreign script, it may not.

Digitized books from the Cummings Collection of Hebrew Manuscripts are available on the HathiTrust: https://babel.hathitrust.org/c...

Header image: Scanned Hebrew text being converted into searchable text in ABBYY FineReader.
