Project Runeberg (runeberg.org) is a volunteer effort to create free electronic editions of classic Nordic (Scandinavian) literature and make them openly available over the Internet.
8 hours of labour, 8 hours of leisure – but for what?
by Lars Aronsson
Project Runeberg was founded in December 1992, now more than 25 years ago. The anniversary passed without any celebration. A handful of volunteers are scanning and uploading out-of-copyright books and magazines and a few dozen help with online proofreading.
The other day, I found that one volunteer had uploaded a Swedish monthly magazine on cinema and film, Biografen, from 1913. In it was a short essay by Ellen Key, a famous Swedish writer of the time, where she pointed out that the labour movement's demand for a regulated 8-hour working day would provide more leisure time and there would be a need to spend it well on educational and "cultural entertainment" (nöjeskultur). She made a reference to a Danish writer, Ludvig Feilberg.
Feilberg (1849-1912) was already known to us, even though we have not digitized any of his books. Our list of Nordic Authors, which we created to keep track of who is out-of-copyright, contained his name, nationality and years of birth and death. It also contained a link to a presentation of his works, provided by a Danish website.
We only add the best links we can find to this list, and as always we keep a history of all the edits we make. This link was added on January 7, 1999. That was two years before Wikipedia was conceived (in 2001), four years before we digitized the Danish biographic dictionary (in 2003), where he is mentioned, five years before we digitized Salmonsen's Danish encyclopedia (in 2004), ten years before Danish Wikipedia's article about him was created (in 2009), eleven years before Swedish Wikipedia's article (in 2010), and 17 years before one of his books was digitized and proofread in Wikisource (in 2016). Indeed, 19 years have passed since that first link for Ludvig Feilberg was added to our list, so now I have added the others.
In her essay, published in October 1913, Ellen Key mentioned not only Feilberg's name, but also an article about him in a recent issue of another magazine, Ord och Bild. As it happens, we have digitized that magazine (in 2006), so I could immediately jump to that article and learn more about him. It was published in May 1913 after Feilberg had died in September 1912.
Apparently Ellen Key had access to both magazines, perhaps at Stockholm's public library. That privilege was only available to a well-educated urban elite at the time, when the majority of Sweden's population lived in the countryside and had only 7 years of schooling (and worked longer days than eight hours).
Today, regardless of schooling, background or place of residence, almost everybody who can read Swedish can also access the Internet and now has access to both magazines and to Wikipedia within seconds. We who live in 2018 are the privileged ones, not only compared to 1913, but also compared to 2013 or 2003, when the Internet contained so much less information than today.
However, stories such as the one I have told here, of the recent discovery of older knowledge, are quite rare. Millions spend their days on so-called social media (Facebook, Twitter, Instagram, ...) sharing photos of their cats or rumours about recent political events. It seems that fewer take the chance to discover history, now that they can. Maybe they don't realize how privileged they are to live in this time? This puts in jeopardy the foundation of maturing digitization projects such as ours. For whom are we digitizing? Will anybody read what we scan and proofread and learn from it? Perhaps our audience went away while we were busy? Perhaps they are less interested in educational and cultural entertainment of the kind that Ellen Key wanted to promote? We should have a lot to learn from her.
All volumes of Ord och Bild
In 1892 the monthly magazine Ord och Bild began publication, and in December 2005 we started scanning some of the volumes. Since then the collection has slowly filled in. Now, in May 2016, we have finally added the 1914 volume and can thereby present all volumes older than 70 years, from 1892 to 1945.
Konst och konstnärer
A hundred years ago, a popular magazine was published that aimed to reflect the art scene. Its title was Konst och konstnärer (Art and Artists). All five volumes that were published can now be read here.
Soon a hundred dictionaries
Dictionaries have become a major genre in Project Runeberg. As early as December 1998 we digitized "Biblisk ordbok" (1896) by Erik Nyström. We had just begun to introduce facsimile editions (with scanned images of the book pages, not only text), and wanted to try how that worked for dictionaries. However, five years passed before the next attempt, when the sixth edition of Svenska Akademiens ordlista (1889) was digitized in November 2003. At the turn of the year, 70 years had passed since the death of Elof Hellquist, and in March 2004 we digitized his "Svensk etymologisk ordbok" (1922), which is still one of our most visited titles. In 2004-2005 another five dictionaries were digitized. In 2007 came the important "Svenskt dialektlexikon" (1862-1867) by Johan Ernst Rietz. But it was not until 2009 that things took off. Since then we have digitized several dictionaries every year. In the summer of 2011 we also began digitizing more recent dictionaries, on the assumption that they are not really literary works in the sense of copyright law, but rather catalogs (lists, registers) with a much shorter term of protection. So far nobody has contacted us with objections to this interpretation, which we have now applied for four years. Our latest addition, an Estonian-German dictionary from 1970, is an example of that application. In the summer of 2011 we also set up a theme page that collects the dictionaries, so far 98 titles.
We typed in the Bible
Exactly 20 years ago we sat typing in the Bible. The Swedish translation from 1917 was free from copyright. As early as 1991, parts of the text had begun to circulate as computer files on the net, but only the passages people most like to read, such as the Sermon on the Mount, the Christmas gospel and the creation story. To make the text complete, the genealogies and everything else had to be entered as well. It was a big job, and such a job is easier when more people help. But at that time there were no digital cameras or broadband connections, mostly slow modems. Enough people, however, had a Bible, a keyboard and the ability to send e-mail. One of us made a list of the Bible's 66 books and 1189 chapters. When a volunteer signed up, she or he was assigned a chapter to type in and submit by e-mail. Once a chapter had been submitted, the next volunteer was given the task of proofreading it. Twenty volunteers carried out these 2378 tasks (one typing pass and one proofreading pass for each chapter) within the space of two years. In March 1996 the entire work was finished, and it has been freely and openly available to everyone on our site ever since.
It is not only the classical language of the 1917 translation that is reverently archaic. The format of the text files, with their "typewriter font", also carries the scent of the Internet's childhood. A piece of Swedish cultural heritage. Here is an old instruction,
Open data at #HACK4NO
by Lars Aronsson
Institutions are eager to reach out to new users and to find new ways to explore and combine their databases and digitized collections (such as scanned photos or books). Some develop their own website, others use common sharing platforms such as Flickr or YouTube. Some cooperate with Wikipedia in various ways. An increasing trend is to organize a hackathon, a small festival where individuals or small companies are invited to compete with the best new ideas and software for reusing the data from the organizing institutions.
In Oslo, the national arts council organized such an event, #HACK4NO, a hackathon for Norway, in the beginning of February 2014. Contributing institutions included the national library, national archives, art museums, the encyclopedia Store Norske Leksikon, an environmental agency with a biodiversity database, and the national land survey (which produces maps).
Curious about everything that is happening in Norway, and how it differs from my native Sweden, I went there to participate in three days of free food and interesting talks. It was seven years since I last visited the Norwegian capital and a whole new Manhattan skyline has grown up around the central train station.
As you might know, Norway's national library, Nasjonalbiblioteket, has a very impressive digitization program, intending to digitize all of Norway's literature, both out of and still in copyright. This is based on an agreement with Kopinor, a central organization for Norwegian copyright holders. The government pays a fee, to compensate for the fact that all Norwegian citizens can read 20th-century literature online for free. But access is limited to IP addresses based in the country. Foreigners who visit bokhylla.no only get access to the out-of-copyright literature, which means older literature and government reports.
Project Runeberg has already copied many of these older, freely available books, sometimes adding our own OCR text, and made it possible for volunteers to proofread that text. We copy books from many sources, but the Norwegian ones are increasing quickly. One recent example is Den Norske husflidsforenings håndbok i vevning, on the craft of weaving.
Ahead of the hackathon, Nasjonalbiblioteket had released some new data and documentation on their digitization project aimed at developers, including lists of all the digitized works. It turns out 160,000 works are available to Norwegians, but only 20,000 (or 13 %) are considered to be free from copyright and available to the world. As my contribution to the competition, I decided to study the difference between these lists, the 140,000 non-free works. I was hoping to find some errors, some works that really should be free, but that had not yet been made freely available.
The lists contain the names of the authors, apparently derived from the BIBSYS library catalog. Unfortunately, BIBSYS only rarely specifies the years of birth and death for authors, so it's hard to identify which works are written by authors who died more than 70 years ago. Instead, I took an easier approach and looked at the year of publication, assuming that the oldest works are more likely to have fallen out of copyright. The result is this list of the oldest non-free works. At the start it listed 392 books published before 1890, which turn out to be good candidates for a copyright investigation. Many of them should be freed.
My list was one of 14 entries in the competition. As many as 11 entries were based on geographic data, combining maps with coordinates of monuments of national heritage, and the like. Norwegians like hiking and exploring the landscape, more than studying books, apparently. One entry, which compared the text of articles from Store Norske Leksikon to those in Wikipedia, was contributed by a Danish developer. My contribution was not "an app" and doesn't use any "API". It is a minimalistic script of fewer than 100 lines of code, downloading and comparing two lists, and producing a single web page in HTML as its output.
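The approach described above (take the full list, subtract the free list, keep the oldest publications, write one HTML page) can be sketched in a few lines of Python. This is only an illustration, not the script that was entered in the competition. The record layout (`urn`, `author`, `title`, `year`), the function names and the sample records are all assumptions made for the example, and the download step is left out because the real list format is not described here.

```python
import html

def oldest_non_free(all_works, free_urns, cutoff_year=1890):
    """Return works that are not in the free list and were published
    before cutoff_year, oldest first."""
    candidates = [
        w for w in all_works
        if w["urn"] not in free_urns
        and w["year"] is not None
        and w["year"] < cutoff_year
    ]
    return sorted(candidates, key=lambda w: (w["year"], w["title"]))

def render_html(works):
    """Produce a single, minimal HTML page listing the candidates."""
    items = "\n".join(
        "<li>{}: {}, <i>{}</i> ({})</li>".format(
            w["year"],
            html.escape(w["author"]),
            html.escape(w["title"]),
            html.escape(w["urn"]),
        )
        for w in works
    )
    return ("<html><body><h1>Oldest non-free works</h1>\n<ul>\n"
            + items + "\n</ul></body></html>")

# Hypothetical sample records; the identifiers are placeholders.
all_works = [
    {"urn": "urn:example:1", "author": "A", "title": "Old and locked", "year": 1882},
    {"urn": "urn:example:2", "author": "B", "title": "Old and free", "year": 1885},
    {"urn": "urn:example:3", "author": "C", "title": "Recent & locked", "year": 1950},
]
free_urns = {"urn:example:2"}

result = oldest_non_free(all_works, free_urns)
page = render_html(result)
print(page)
```

Filtering on the year of publication rather than the author's year of death is a deliberate shortcut here, mirroring the reasoning in the text: the catalog rarely records death years, so age of publication serves as a rough proxy.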
My entry didn't win any prize in the competition. All I got was a t-shirt. But that was not the point. My aim was not to bring the catalog data to a wider audience, but to provide feedback to Nasjonalbiblioteket, prompting them to release more of the non-free books. Recently, Nasjonalbiblioteket had added reader comments based on Disqus.com to their website. Comments can be added to any entry for a digitized book. My idea was to use this for requests to make the book freely available. But after two weeks, it turns out that Nasjonalbiblioteket most often doesn't read these comments. It doesn't work as a feedback channel. Of the initial 392 books, two of the oldest were made freely available immediately after the hackathon. A third book was made free (and soon copied to Project Runeberg for proofreading: Gjennem Lorgnetten. 1) after I contacted Nasjonalbiblioteket on Twitter. They then suggested I should use e-mail as a feedback channel, which is my current approach. Each book in my list now has an e-mail (mailto) link which fills in the URN address of the book in the subject line of the e-mail. It remains to be seen whether this works as a feedback channel. The response to my efforts can best be described as "slow".
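A mailto link that pre-fills the subject line with a book's URN could be generated roughly as follows. This is a sketch under assumptions: the recipient address, the wording of the subject and the URN shown are placeholders, and the function name `mailto_link` is invented for the example. The only essential point is that the subject must be percent-encoded before it is placed in the link.

```python
import html
from urllib.parse import quote

def mailto_link(address, urn, label="Request free access"):
    """Build an HTML anchor whose mailto URL carries the URN in the subject."""
    # Percent-encode everything in the subject, including ':' and spaces.
    subject = quote("Please make this book freely available: " + urn, safe="")
    href = "mailto:{}?subject={}".format(address, subject)
    return '<a href="{}">{}</a>'.format(html.escape(href), html.escape(label))

# Placeholder address and identifier, for illustration only.
link = mailto_link("feedback@example.org", "URN:EXAMPLE:12345")
print(link)
```

Keeping the identifier in the subject line means the recipient can see which book is meant without opening the message, and replies stay grouped per book.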
There is currently a lot of hype around "open data" with hackathons and data releases being organized everywhere. In reality, the data that are released are often filled with errors. Opening the data to a wider audience will expose these errors and make it possible to correct old mistakes. But this also requires that the institutions are open to feedback, and interested in improving their data.
To clarify the phrase "filled with errors": Out of the 160,000 digitized books, I estimate that more than 100, possibly several hundred, are erroneously categorized as non-free. My little program generates a list of the most likely candidates. (There might also be errors in the opposite direction, but those are not my priority.) With several hundred errors, more than 99% of the cases are still correct (in telecommunications terminology known as "two nines"), but fewer than 99.9% ("three nines"), since 0.1% of 160,000 is 160 books. OCR software correctly recognizes between 99% and 99.9% of the characters, which is why we need manual proofreading, hoping to reach "four nines" (99.99% accuracy or 1 error in 10,000 characters) or more.
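The arithmetic behind the "nines" can be checked with a small calculation. The sketch below is an illustration only, with helper names invented for the purpose. It uses the figures from the text (160,000 books, and 1 error in 10,000 characters) together with a hypothetical count of 300 errors standing in for "several hundred". It shows that the boundary between two and three nines lies at 160 misclassified books: 300 errors give two nines, while exactly 100 would still count as three.

```python
def accuracy_percent(total, errors):
    """Share of correct items, in percent."""
    return 100.0 * (total - errors) / total

def count_nines(total, errors):
    """Count the leading nines of the accuracy: 99.5% -> 2, 99.95% -> 3."""
    if errors <= 0:
        raise ValueError("a finite number of nines needs at least one error")
    n = 0
    # accuracy >= 1 - 10**-(n+1) is the same as errors * 10**(n+1) <= total;
    # integer arithmetic avoids floating-point rounding at the boundaries.
    while errors * 10 ** (n + 1) <= total:
        n += 1
    return n

TOTAL_BOOKS = 160_000
for errors in (100, 160, 300):  # 300 is a hypothetical "several hundred"
    print(errors, accuracy_percent(TOTAL_BOOKS, errors), count_nines(TOTAL_BOOKS, errors))

# One wrong character in 10,000, the proofreading target named in the text.
print(count_nines(10_000, 1))
```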
Update: On March 1, the following works had become freely available and were copied to Project Runeberg. However, the lists of Nasjonalbiblioteket's digitized books are no longer properly updated, but only show a smaller fraction of all books, and they don't seem to be in any hurry to fix their errors.
- Gustav Adolph Borgen, Advarsel mod Totalafhold (1882)
- Jakob Norby, Historiske tids-tabeller til brug i høiere skoler (1882)
- Ingvald Undset, Om den nordiske stenalders tvedeling (1889)
- Per Wieselgren, Totalafholdssagen i Guds Ords Lys (1882)
March 9, the following 12 works were made free:
- Norges Grundlov : Text og Forklaring (115 pages)
- Berg, Lauritz, Den lille Astronom (21 pages)
- Brandes, Edvard, Et Brud : Skuespil i tre Akter (117 pages)
- Brandes, Edvard, Overmagt : Skuespil i fire Akter (239 pages)
- Dilling, L., Gjennem Lorgnetten : Skitser. 3 (245 pages)
- Flood, Jørgen W., Kristiania Svaneapotheks Historie (35 pages)
- Kjerulf, Theodor, Stenriget og Fjeldlæren (297 pages)
- Madsen, H.Th., Om status og det enkle bogholderi : lærebog og haandbog (125 pages)
- Madsen, H.Th., Faciter til lærebog i handelsregning (45 pages)
- Rosing, Marie, 46 tegninger til brug ved undervisning i kvindeligt haandarbeide (51 pages)
- Rothschild, Lazarus von, Rothschilds lommebog i handelskundskab : indeholdende over 300 spørgsmaal og svar henhørende under handels- og kontorvidenskab (129 pages)
- Smitt, J., Norges Landbrug i dette Aarhundrede : et Tidsbillede (321 pages), which we already have from Google's scan
March 13, more books were added to the long list of available books, but none to the short list of publicly available books. One book was removed (a clear error) from the long list, even though it should have been made free and added to the short list.
March 15, the following 5 works were made free:
- Pontoppidan, Erik, Udtog af Dr. Erich Pontoppidans Forklaring (129 pages)
- Krydserens Viser i skjønsomt Udvalg (45 pages)
- Læsebog for kristelig skole og hjem. 1 (137 pages)
- Martyrerne i Roms Katakomber (217 pages)
- Samtaler mellem nogle frøkener om sand kristendoms mulighed hos folk i høiere livsstillinger (101 pages)
March 16, no changes were made to my list. The total number of digitized books increased from 162,791 to 162,998 (+207) but the number of freely available books fell from 21,696 to 21,671 (-25). Fluctuations of this kind occur every day, which I wasn't prepared for. Did 25 books actually change from being freely available to not being so? Or are the lists simply not reflecting the reality? I haven't checked the day-to-day differences in detail, so I don't know. Maybe I should modify my software to check this.
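One way to check the day-to-day differences would be to keep each day's list and compare the sets of identifiers, since the totals only reveal the net change (a fall of 25 could be 25 removals, or 30 removals and 5 additions). The following is a minimal sketch of that idea, not an existing part of my program. The function name and the identifiers are hypothetical.

```python
def compare_snapshots(previous, current):
    """Return (added, removed, net_change) between two lists of identifiers."""
    previous, current = set(previous), set(current)
    added = current - previous      # present today, absent yesterday
    removed = previous - current    # present yesterday, absent today
    return added, removed, len(added) - len(removed)

# Placeholder identifiers standing in for two days of the free list.
free_yesterday = ["id-001", "id-002", "id-003", "id-004"]
free_today = ["id-001", "id-004", "id-005"]

added, removed, net = compare_snapshots(free_yesterday, free_today)
print("added:", sorted(added))
print("removed:", sorted(removed))
print("net change:", net)
```

Listing the removed identifiers by name would make it possible to tell whether books really lost their free status or whether the published lists are simply incomplete on a given day.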
March 18, P. Ulleland (translator), Volsungernes saga (1887) was removed from our list. Not because it was made free (it isn't) but because it disappeared from the list of digitized books! It is still digitized, however, but no longer listed. Only the 1902 reprint is listed as digitized (and not free). The author is apparently Peder Jakobsen Ulleland (1859-1892), who has been dead for more than 70 years, so both printings of the work should be made free.
March 19, P. Ulleland (translator), Volsungernes saga (1887) is back on our list.