Project Runeberg's front page section for May 2012:
Yet another monumental volunteer effort was completed on April 28, when the last page of Nordisk illustreret Havebrugsleksikon (Scandinavian illustrated encyclopedia of gardening, 3rd edition, 2 volumes, 1920-1921) was proofread by Pultz. We scanned this book in April 2004.
Danish reference works that have previously reached the status of fully proofread and indexed:
On May 31, the total number of pages with indexed contents reached 500,000, which is slightly more than half of our 845,000 pages (25 out of 42 linear metres of shelving, or 59 %).
Old and new document scanners.
A small community grant of 18,000 SEK (€ 2000) from the Swedish chapter of the Wikimedia Foundation has made it possible for us to purchase a faster scanner and a more powerful personal computer for OCR processing. The immediate results are more than a hundred scanned volumes in April. Read more below.
A grant of 18,000 SEK (€ 2000) from the association Wikimedia Sverige has made it possible for us to buy a faster scanner and a more powerful computer for OCR processing. Our application was one of a handful that were awarded grants within the community projects programme for 2012. Wikimedia Sverige is a non-profit association that supports free knowledge and is the Swedish chapter of the Wikimedia Foundation, which runs Wikipedia. The direct result of the purchase is more than a hundred scanned books during April. Many of them will surely be useful within Wikipedia, which often links to Project Runeberg's scanned books as source references.
During April, we added 77,764 scanned pages (4 linear metres of shelving), either scanned by our volunteers or copied from the Internet Archive or other image sources. This has been our most productive month ever. The largest additions are Pedagogisk tidskrift (17,565 pages), Ord och Bild (7,574), Trap-Danmark (5,156), and "HGSL" (3,998).
April 2012 showed the strongest growth in Project Runeberg of any month ever. We added 159 volumes containing 77,764 scanned pages, nearly 4 linear metres of shelving (as we count 20,000 pages to be one metre of shelving). In this single month, Project Runeberg's collections increased by 10.7 percent, from 729,012 to 806,776 pages. This is more than our typical annual growth of 40 to 70 thousand pages.
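The figures in this paragraph can be checked with a few lines of arithmetic. The following is only a minimal sketch; the function names are chosen here for illustration, and the only inputs are the page counts and the 20,000-pages-per-metre convention quoted above.

```python
# Convention stated in the text: 20,000 pages count as one metre of shelving.
PAGES_PER_SHELF_METRE = 20_000


def shelf_metres(pages: int) -> float:
    """Convert a page count to linear metres of shelving."""
    return pages / PAGES_PER_SHELF_METRE


def growth_percent(before: int, after: int) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100


before, after = 729_012, 806_776   # collection size before and after April 2012
added = after - before

print(added)                                     # 77764
print(round(shelf_metres(added), 2))             # 3.89
print(round(growth_percent(before, after), 1))   # 10.7
```

The result of 3.89 metres matches the "nearly 4 linear metres" stated above.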
How was this huge speed increase suddenly achieved? Did it cost a lot? Did a large number of new volunteers suddenly sign up? Or did we only copy images that someone else had scanned?
Of the 159 added volumes, 121 were scanned by our volunteers (111 volumes scanned by Lars, 4 by Ralph, 3 by Peter and 3 by Bert) and 38 were copied from other sources (31 from the Internet Archive, 3 from Nasjonalbiblioteket, 2 from Stockholmskällan, and 2 from Google). Welcome to Bert, a new contributor of scanned images, who scanned three annual volumes of the workers' calendar Lucifer, ljusbringaren (1893-1895).
The big difference is that Lars has a new, faster scanner. It is a sheet-feeding scanner for books where the spine has been cut off. While this method can't be used on valuable books, we have plenty of books where it is applicable, such as old journals donated by libraries. Examples scanned this month were Pedagogisk tidskrift (1898-1940) and Ord och Bild (1926-1939).
Affordable desktop sheet-feeding scanners, in the range below €/$500, have been available for ten years. Two common models are the Canon DR-2050C and the Fujitsu Scansnap S1500. Most customers use these to move piles of personal paperwork to their computer disk. Some of our volunteers use them with great success. Their speed, however, is limited to 20 pages per minute in low resolution and even slower at the higher resolutions we prefer. Faster scanners have been priced for the office market, typically in the range above €/$5000, with nothing in between.
When we started to shop for a new scanner earlier this year, we had our minds set on one of those expensive models. But to our happy surprise, there is a new model, the Canon DR-160M, which scans 60 leaves of paper per minute on both sides at the same time (duplex), giving 120 images per minute in full resolution, at a price of only €/$1200.
Compared to some more expensive models, this scanner has certain limitations. One is that it only takes paper up to 22 cm (8.7 inches) wide, which is too narrow for posters or fold-out maps. Paper feeding works well most of the time, but not always. When you have to resort to feeding individual pages manually, the overall speed suffers. The feeder only holds some 60 pages, which are consumed in a minute, so you have to attend to the scanner often. Office models can typically handle larger batches. Still, its performance is a huge improvement over our previous Canon DR-2050C.
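To give a feeling for what these speeds mean in practice, here is a rough back-of-the-envelope sketch. It assumes that each counted page is one scanned image, that the feeder holds 60 leaves, and that the nominal speeds are sustained; reloading, paper jams and manual feeding are ignored, so real scanning sessions take longer. The function names are made up for this illustration.

```python
import math


def scan_minutes(images: int, images_per_minute: int) -> float:
    """Pure scanning time at a sustained nominal speed (no reloading, no jams)."""
    return images / images_per_minute


def feeder_loads(leaves: int, feeder_capacity: int = 60) -> int:
    """Number of times the feeder has to be filled (capacity is an assumption)."""
    return math.ceil(leaves / feeder_capacity)


pages = 17_565                  # Pedagogisk tidskrift, as added in April
leaves = math.ceil(pages / 2)   # two pages (sides) per leaf

new_scanner = scan_minutes(pages, 120)  # 120 images per minute, duplex
old_scanner = scan_minutes(pages, 20)   # 20 pages per minute at best

print(f"new scanner: {new_scanner:.0f} min, old scanner: {old_scanner:.0f} min")
print(f"feeder loads: {feeder_loads(leaves)}")
```

Under these assumptions the nominal scanning time drops by a factor of six, but the feeder still has to be refilled well over a hundred times for a journal of this size.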
OCR (optical character recognition) is the second bottleneck in our digitization process. The way our platform is designed, any volunteer can download scanned images, run any OCR process they prefer, and upload the resulting text, which can then be proofread by other volunteers. This collaborative model works fine. (Free software evangelists often propose that free OCR software should be tried and used, but examples of volunteers actually using free OCR software to process our scanned books are extremely rare.) For the last few years, the market for affordable personal computer OCR software has been dominated by ABBYY Finereader. We have used Finereader Professional from version 6 until the current version 11, with steady improvements. If a volunteer wants to help us with OCR, we can reimburse the small price (€129) for purchasing this software.
The Internet Archive uses the server/engine configuration of Finereader version 8. Their text quality is fine for English, but worse for other languages. For Scandinavian languages, we prefer to run our own OCR on their scanned images, rather than use their OCR text for proofreading. Perhaps their text quality would benefit from upgrading to version 11 and adding dictionaries for old spelling. But more than this, we have found it necessary to manually check the segmentation of text columns. While recent versions of Finereader do a very good job at recognizing letters and words, columns are often mixed up, so that image captions are read as part of the adjacent text column.
For the time being, the best OCR results are achieved by letting Finereader Professional process the entire book, and then manually checking the segmentation of every page. For most pages, the checking takes less than a second. The automatic recognition can still consume more time than that, especially on a slow computer. Fortunately, Finereader runs in parallel on a multi-core CPU. And fortunately, powerful CPUs are affordable in computers marketed for gaming. Lars now uses a personal computer with a four-core Intel Core i7 CPU running at 3.4 GHz. Finereader does not use a lot of RAM; the 8 GB of this computer are not fully used. It is possible to scan images of one book, OCR process a second book, and check the segmentation of a third book (in a separate instance of Finereader) at the same time.