- Project Runeberg -  Welcome to Project Runeberg
Lysator Linköping University
Project Runeberg (runeberg.org) is a volunteer effort to create free electronic editions of classic Nordic (Scandinavian) literature and make them openly available over the Internet.

Project Runeberg, March 2019


March 2019

Redoing OCR

In 2000 and again in 2010 we found that OCR of fraktur (blackletter, Gothic) was too difficult and could wait. For normal print (antikva, Latin) we have used the commercial software ABBYY Finereader with great success. Since 2007 we have also increasingly imported books that were scanned by others, often copying both the scanned images and the OCR text.

Around 2013 or 2014, the OCR quality for books printed in fraktur and scanned by Nasjonalbiblioteket of Norway suddenly improved radically. It seems they have used a special edition of Finereader developed by some German/Austrian project, but this was outside of our reach. Later, books in fraktur digitized by Det Kongelige Bibliotek of Denmark have also become better.

As we return to this problem in 2019, the free software Tesseract is now at version 4.0 and a standard part of the Ubuntu Linux distribution, with support for Swedish and Danish fraktur added around 2015. The output is far from excellent and not as good as the Norwegian books, but it is much better than some other OCR text and quite useful as a starting point for manual proofreading.

Using Tesseract, we are now starting to redo OCR for some books in fraktur. The first attempt is Søren Kierkegaards Samlede Værker (15 volumes, 1920-1926), which were digitized in 2009 at the University of Toronto by the Internet Archive. From their OCR text, of terrible quality, it is apparent that they used ABBYY Finereader for Latin letters. We copied volumes 1-8 in 2014, but decided in 2015 to do our own OCR by manually training Finereader to interpret the fraktur text. This was time-consuming and painful, and the result was not very good. Now we have copied the remaining volumes and redone OCR for all of them with Tesseract, with much better results.
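As a rough illustration of what redoing OCR for one page with Tesseract could look like, here is a minimal sketch in Python. It relies only on the basic command form "tesseract IMAGE OUTPUTBASE -l LANG". The function names, the file names and the language code "dan_frak" are assumptions made for this example; which codes are available depends on the language data installed, and this is not necessarily how our own pages were processed.

```python
import subprocess
from pathlib import Path
from typing import List


def tesseract_command(image: Path, output_base: Path, lang: str) -> List[str]:
    """Build the command line for one page image.

    Tesseract writes its plain-text result to OUTPUTBASE.txt.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def redo_ocr(image: Path, output_base: Path, lang: str) -> Path:
    """Run Tesseract on one page and return the path of the text file.

    Requires the tesseract program and the language data for `lang`
    to be installed; raises CalledProcessError if the run fails.
    """
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt")


if __name__ == "__main__":
    # Hypothetical file name and language code, for illustration only.
    print(tesseract_command(Path("0001.tif"), Path("0001"), "dan_frak"))
```

Keeping the command builder separate from the function that runs it makes it possible to inspect the command before any OCR is actually started.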

In the meantime, a new edition of Søren Kierkegaards Skrifter (55 printed volumes, 2007-2013) has been published and has come online at SKS.dk. There you will find all of the texts, without needing to proofread anything. However, this is not true for all the other books that we provide.

A problem is that we have no algorithm for determining which OCR text is better. The right way to determine this is to manually proofread the page and then see which OCR candidate required the smaller number of edits to reach the desired result. But of course, when we have two OCR texts for the same page, we want to find out which is better without needing to proofread the page. And we can't just use a spell checker, because then any sequence of correctly spelled words would win, regardless of its similarity to the scanned page. So far, we only redo OCR on pages where the naked eye can immediately see that there are too many errors typical of bad fraktur OCR, for example words such as "reban" (redan) or "ogfaa" (ogsaa).
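The two ideas in the paragraph above can be sketched in a few lines of Python. The first function counts the edits (Levenshtein distance) between an OCR candidate and a proofread page, which only helps when a proofread page exists. The second is a crude stand-in for the naked-eye check: it measures how many words on a page are known fraktur misreadings. All names here are hypothetical, and the marker list contains only the two examples mentioned above; a usable list would have to be much longer.

```python
import re
from typing import Dict, Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def best_candidate(proofread: str, candidates: Dict[str, str]) -> str:
    """Return the name of the OCR text needing the fewest edits."""
    return min(candidates,
               key=lambda name: edit_distance(candidates[name], proofread))


# Assumption: just the two example misreadings from the text above.
FRAKTUR_MARKERS = frozenset({"reban", "ogfaa"})


def marker_rate(text: str, markers: Iterable[str] = FRAKTUR_MARKERS) -> float:
    """Share of words on a page that are known fraktur misreadings."""
    marker_set = set(markers)
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in marker_set) / len(words)


if __name__ == "__main__":
    page = "han var redan der"
    ocr = {"old": "han var reban der", "new": "han var redan der"}
    print(best_candidate(page, ocr))
    print(marker_rate(ocr["old"]))
```

A page could then be queued for new OCR when its marker rate exceeds some threshold, which would have to be chosen by experiment.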


February 2019

2018/19 Fundraiser

This was our first attempt ever at an annual fundraiser. Starting on Sunday, February 10 and ending on Friday, March 8, a small banner (the one above) was shown on some of our web pages, promoting donations toward our goal of raising 25,000 SEK for the fiscal year 2018/19. The stated idea was that the banner would be removed as soon as the goal had been reached, to reappear next year. The goal was reached within a month. We have long had a "Donate" link in the header of all our web pages. Read more on our donation page.


March 2018

8 hours of labour, 8 hours of leisure – but for what?

by Lars Aronsson

Project Runeberg was founded in December 1992, now more than 25 years ago. The anniversary passed without any celebration. A handful of volunteers are scanning and uploading out-of-copyright books and magazines and a few dozen help with online proofreading.

The other day, I found that one volunteer had uploaded a Swedish monthly magazine on cinema and film, Biografen, from 1913. In it was a short essay by Ellen Key, a famous Swedish writer of the time, where she pointed out that the labour movement's demand for a regulated 8-hour working day would provide more leisure time and there would be a need to spend it well on educational and "cultural entertainment" (nöjeskultur). She made a reference to a Danish writer, Ludvig Feilberg.

Feilberg (1849-1912) was already known to us, even though we have not digitized any of his books. Our list of Nordic Authors, which we created to keep track of who is out-of-copyright, contained his name, nationality and years of birth and death. It also contained a link to a presentation of his works, provided by a Danish website.

We only add the best links we can find to this list, and as always we keep a history of all the edits we make. This link was added on January 7, 1999. That was two years before Wikipedia was conceived (in 2001), four years before we digitized the Danish biographic dictionary (in 2003), where he is mentioned, five years before we digitized Salmonsen's Danish encyclopedia (in 2004), ten years before Danish Wikipedia's article about him was created (in 2009), eleven years before Swedish Wikipedia's article (in 2010), and 17 years before one of his books was digitized and proofread in Wikisource (in 2016). Indeed, 19 years have passed since that first link for Ludvig Feilberg was added to our list, so now I added the others.

In her essay, published in October 1913, Ellen Key mentioned not only Feilberg's name, but also an article about him in a recent issue of another magazine, Ord och Bild. As it happens, we have digitized that magazine (in 2006), so I could immediately jump to that article and learn more about him. It was published in May 1913 after Feilberg had died in September 1912.

Apparently Ellen Key had access to both magazines, perhaps at Stockholm's public library. That privilege was only available to a well-educated urban elite at the time, when the majority of Sweden's population lived in the countryside and had only 7 years of schooling (and worked longer days than eight hours).

Today, regardless of schooling, background or place of residence, almost everybody who can read Swedish can also access the Internet and now has access to both magazines and to Wikipedia within seconds. We who live in 2018 are the privileged ones, not only compared to 1913, but also compared to 2013 or 2003, when the Internet contained so much less information than today.

However, stories such as the one I have told here, of the recent discovery of older knowledge, are quite rare. Millions spend their days on so-called social media (Facebook, Twitter, Instagram, ...) sharing photos of their cats or rumours about recent political events. It seems that fewer take the chance to discover history, now that they can. Maybe they don't realize how privileged they are to live in this time? This puts in jeopardy the foundation of maturing digitization projects such as ours. For whom are we digitizing? Will anybody read what we scan and proofread and learn from it? Perhaps our audience went away while we were busy? Perhaps they are less interested in educational and cultural entertainment of the kind that Ellen Key wanted to promote? We should have a lot to learn from her.


May 2016

All volumes of Ord och Bild

The monthly magazine Ord och Bild began publication in 1892, and in December 2005 we started scanning some of its annual volumes. Since then the collection has slowly filled out. Now, in May 2016, we have finally added the 1914 volume and can thus present all volumes that are more than 70 years old, from 1892 to 1945.
http://runeberg.org/ordochbild/

Konst och konstnärer

A hundred years ago, a popular magazine was published that aimed to reflect the art scene. Its title was Konst och konstnärer. All five annual volumes that appeared are now available to read here.
http://runeberg.org/kok/


Project Runeberg, 2019-07-15 23:57 (runeberg)
http://runeberg.org/
