March 2019 Front Page (About Project Runeberg)

Redoing OCR

In the year 2000 and again in 2010 we found that OCR of fraktur (blackletter, Gothic) was too difficult and could wait. For normal print (antikva, Latin) we have used the commercial software ABBYY Finereader with great success. Since 2007 we have also increasingly imported books that were scanned by others, often copying both the scanned images and the OCR text.

Around 2013 or 2014, the OCR quality for books printed in fraktur and scanned by Nasjonalbiblioteket of Norway suddenly improved radically. It seems they have used a special edition of Finereader developed by some German/Austrian project, but this was outside of our reach. Later, books in fraktur digitized by Det Kongelige Bibliotek of Denmark have also become better.

As we return to this problem in 2019, the free software Tesseract (Wikipedia, GitHub, wiki) is now at version 4.0 and a standard part of the Ubuntu Linux distribution, with support for Swedish and Danish fraktur added around 2015. The output is far from excellent and not as good as that of the Norwegian books, but it is much better than some other OCR output and quite useful as a starting point for manual proofreading.

Using Tesseract, we are now starting to redo OCR for some books in fraktur. The first attempt is Søren Kierkegaards Samlede Værker (15 volumes, 1920-1926), which were digitized in 2009 at the University of Toronto by the Internet Archive. From their OCR text, of terrible quality, it is apparent that they used ABBYY Finereader for Latin letters. We copied volumes 1-8 in 2014, but decided in 2015 to do our own OCR by manually training Finereader to interpret the fraktur text. This was time-consuming and painful, and the result was not very good. Now we have copied the remaining volumes and redone OCR for all of them with Tesseract, with much better results.

In the meantime, a new edition of Søren Kierkegaards Skrifter (55 printed volumes, 2007-2013) has been published and has come online at SKS.dk. There you will find all of the texts, without needing to proofread anything. However, this is not true for all the other books that we provide.

A problem is that we have no algorithm for determining which OCR text is better. The right way to determine this is to manually proofread the page and then see which OCR candidate required fewer edits to reach the desired result. But of course, when we have two OCR texts for the same page, we want to find out which is better without needing to proofread the page. And we can't just use a spell checker, because then any sequence of correctly spelled words would win, regardless of its similarity to the scanned page. So far, we only redo OCR on pages where the naked eye can immediately see that there are too many errors typical of bad fraktur OCR, for example pages containing words such as "reban" (redan) or "ogfaa" (ogsaa).
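The two ideas in the paragraph above can be illustrated with a minimal Python sketch. It is not the procedure Project Runeberg actually uses. It assumes that a proofread version of the page already exists (which is exactly what we would like to avoid needing), and it counts edits as Levenshtein distance, one possible measure among several. The function names, the SUSPECT_WORDS list (seeded only with the two examples above) and the threshold parameter are hypothetical names chosen for illustration.

```python
import re
from typing import Dict

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: each insertion, deletion or substitution costs 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def better_candidate(proofread: str, candidates: Dict[str, str]) -> str:
    """Return the name of the OCR candidate that needs the fewest edits
    to reach the proofread text (assumes the proofread text exists)."""
    return min(candidates,
               key=lambda name: edit_distance(candidates[name], proofread))

# Hypothetical marker list: words typical of bad fraktur OCR.
SUSPECT_WORDS = {"reban", "ogfaa"}

def looks_like_bad_fraktur_ocr(page_text: str, threshold: int = 1) -> bool:
    """Flag a page whose text contains at least `threshold` suspect words."""
    words = re.findall(r"\w+", page_text.lower())
    hits = sum(1 for word in words if word in SUSPECT_WORDS)
    return hits >= threshold
```

The first function only helps after proofreading, so it is useful for evaluating OCR engines on sample pages rather than for choosing a text in advance. The second mimics the naked-eye check, and a real marker list would have to be collected from actual bad pages.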

Project Runeberg, Mon Feb 21 15:32:41 2022 (aronsson) https://runeberg.org/admin/201903-front.html
