Project Runeberg (runeberg.org) is a volunteer effort to create free electronic editions of classic Nordic (Scandinavian) literature and make them openly available over the Internet.
Open data at #HACK4NO
by Lars Aronsson
Institutions are eager to reach out to new users and to find new ways to explore and combine their databases and digitized collections (such as scanned photos or books). Some develop their own websites, others use common sharing platforms such as Flickr or YouTube. Some cooperate with Wikipedia in various ways. An increasing trend is to organize a hackathon, a small festival where individuals or small companies are invited to compete with the best new ideas and software for reusing the data from the organizing institutions.
In Oslo, the national arts council organized such an event, #HACK4NO, a hackathon for Norway, at the beginning of February 2014. Contributing institutions included the national library, national archives, art museums, the encyclopedia Store Norske Leksikon, an environmental agency with a biodiversity database, and the national land survey (which produces maps).
Curious about everything that is happening in Norway, and how it differs from my native Sweden, I went there to participate in three days of free food and interesting talks. It had been seven years since I last visited the Norwegian capital, and a whole new Manhattan skyline has grown up around the central train station.
As you might know, Norway's national library, Nasjonalbiblioteket, has a very impressive digitization program, intending to digitize all of Norway's literature, both out of and still in copyright. This is based on an agreement with Kopinor, a central organization for Norwegian copyright holders. The government pays a fee to compensate for the fact that all Norwegian citizens can read 20th-century literature online for free. But access is limited to IP addresses based in the country. Foreigners who visit bokhylla.no only get access to the out-of-copyright literature, which means older literature and government reports.
Project Runeberg has already copied many of these older, freely available books, sometimes adding our own OCR text, and made it possible for volunteers to proofread that text. We copy books from many sources, but the Norwegian ones are increasing quickly. One recent example is Den Norske husflidsforenings håndbok i vevning, on the craft of weaving.
Ahead of the hackathon, Nasjonalbiblioteket had released some new data and documentation on their digitization project, aimed at developers, including lists of all the digitized works. It turns out that 160,000 works are available to Norwegians, but only 20,000 (or 13%) are considered to be free from copyright and available to the world. As my contribution to the competition, I decided to study the difference between these lists: the 140,000 non-free works. I was hoping to find some errors, some works that really should be free, but that had not yet been made freely available.
The lists contain the names of the authors, apparently derived from the BIBSYS library catalog. Unfortunately, BIBSYS only rarely specifies the years of birth and death for authors, so it's hard to identify which works were written by authors who died more than 70 years ago. Instead, I took an easier approach and looked at the year of publication, assuming that the oldest works are more likely to have fallen out of copyright. The result is this list of the oldest non-free works. At the start it listed 392 books published before 1890, which turn out to be good candidates for a copyright investigation. Many of them should be freed.
My list was one of 14 entries in the competition. As many as 11 entries were based on geographic data, combining maps with coordinates of national heritage monuments and the like. Norwegians apparently like hiking and exploring the landscape more than studying books. One entry, which compared the text of articles from Store Norske Leksikon to those in Wikipedia, was contributed by a Danish developer. My contribution was not "an app" and doesn't use any "API". It is a minimalistic script of fewer than 100 lines of code that downloads and compares two lists and produces a single web page in HTML as its output.
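The general idea of such a script can be illustrated with a minimal sketch. This is not the actual script; the record fields (`urn`, `author`, `title`, `year`), the function names, and the sample records are all hypothetical placeholders, and the download step is left out so that the example runs on its own.

```python
from html import escape


def oldest_non_free(all_works, free_works, before_year):
    """Return works that are in all_works but not in free_works and were
    published before before_year, sorted with the oldest first."""
    free_ids = {w["urn"] for w in free_works}
    candidates = [
        w for w in all_works
        if w["urn"] not in free_ids and w["year"] < before_year
    ]
    return sorted(candidates, key=lambda w: (w["year"], w["title"]))


def render_html(works):
    """Produce a single, very plain HTML page listing the works."""
    items = "\n".join(
        f"<li>{w['year']}: {escape(w['author'])}, <i>{escape(w['title'])}</i></li>"
        for w in works
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


# Made-up placeholder records, not real catalog data.
all_works = [
    {"urn": "urn:example:1", "author": "Author A", "title": "Old book", "year": 1882},
    {"urn": "urn:example:2", "author": "Author B", "title": "Free book", "year": 1850},
    {"urn": "urn:example:3", "author": "Author C", "title": "Recent book", "year": 1975},
]
free_works = [all_works[1]]

if __name__ == "__main__":
    print(render_html(oldest_non_free(all_works, free_works, 1890)))
```

Only the first placeholder record survives the filter: the second is already free and the third is too recent. Using the year of publication instead of the author's year of death is the shortcut described above, so the output is a list of candidates, not a list of confirmed errors.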
My entry didn't win any prize in the competition. All I got was a t-shirt. But that was not the point. My aim was not to bring the catalog data to a wider audience, but to provide feedback to Nasjonalbiblioteket, making them release more of the non-free books. Recently, Nasjonalbiblioteket had added reader comments based on Disqus.com to their website. Comments can be added to any entry for a digitized book. My idea was to use this for requests to make the book freely available. But after two weeks, it turned out that Nasjonalbiblioteket mostly doesn't read these comments. It doesn't work as a feedback channel. Of the initial 392 books, two of the oldest were made freely available immediately after the hackathon. A third book was made free (and soon copied to Project Runeberg for proofreading: Gjennem Lorgnetten. 1) after I contacted Nasjonalbiblioteket on Twitter. They then suggested I should use e-mail as a feedback channel, which is my current approach. Each book in my list now has an e-mail (mailto) link which fills in the URN address of the book in the subject line of the e-mail. It remains to be seen whether this works as a feedback channel. The response to my efforts can best be described as "slow".
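A mailto link with a pre-filled subject line can be built roughly like this. It is a sketch under assumptions: the function name, the e-mail address, and the URN are made-up placeholders, not the real ones used in the list.

```python
from urllib.parse import quote


def mailto_link(address, urn):
    """Build a mailto: URL whose subject line is the book's URN.

    The subject is percent-encoded so that characters such as ':'
    cannot be misread as part of the URL syntax.
    """
    return f"mailto:{address}?subject={quote(urn, safe='')}"


if __name__ == "__main__":
    # Placeholder address and URN, for illustration only.
    print(mailto_link("library@example.org", "URN:NBN:example-123"))
```

With the placeholder values, the colons in the URN become `%3A`, so the mail program receives the URN intact as the subject.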
There is currently a lot of hype around "open data" with hackathons and data releases being organized everywhere. In reality, the data that are released are often filled with errors. Opening the data to a wider audience will expose these errors and make it possible to correct old mistakes. But this also requires that the institutions are open to feedback, and interested in improving their data.
To clarify the phrase "filled with errors": Out of the 160,000 digitized books, I estimate that more than 100, possibly several hundred, are erroneously categorized as non-free. My little program generates a list of the most likely candidates. (There might also be errors in the opposite direction, but those are not my priority.) Even if more than 99% of cases are correct (in telecommunications terminology known as "two nines"), the share is less than 99.9% ("three nines"). OCR software correctly recognizes between 99% and 99.9% of the characters, which is why we need manual proofreading, hoping to reach "four nines" (99.99% accuracy or 1 error in 10,000 characters) or more.
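The "nines" arithmetic can be checked with a few lines of code. This is only a sketch: the function name is my own, and the figure of 300 errors is an assumed stand-in for "several hundred", not a measured count.

```python
def nines(errors, total):
    """Count the nines of accuracy for `errors` mistakes among `total` items.

    Returns the largest k such that the error rate is at most 10**-k,
    so 1 error in 100 gives 2 and 1 error in 10,000 gives 4.
    Integer arithmetic avoids floating-point rounding at the boundaries.
    """
    if errors <= 0 or total <= 0:
        raise ValueError("errors and total must be positive")
    k = 0
    while errors * 10 ** (k + 1) <= total:
        k += 1
    return k


if __name__ == "__main__":
    print(nines(300, 160_000))  # assumed 300 miscategorized books
    print(nines(1, 10_000))     # the proofreading target
```

With the assumed 300 miscategorized books out of 160,000, the catalog lands at two nines (about 99.8% correct). With only 100 it would be about 99.94%, so staying below three nines takes more than 160 errors, which fits the "several hundred" end of the estimate.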
Update: By March 1, the following works had become freely available and were copied to Project Runeberg. However, Nasjonalbiblioteket's lists of digitized books are no longer properly updated and show only a fraction of all books, and the library doesn't seem to be in any hurry to fix its errors.
- Gustav Adolph Borgen, Advarsel mod Totalafhold (1882)
- Jakob Norby, Historiske tids-tabeller til brug i høiere skoler (1882)
- Ingvald Undset, Om den nordiske stenalders tvedeling (1889)
- Per Wieselgren, Totalafholdssagen i Guds Ords Lys (1882)
On March 9, the following 12 works were made free:
- Norges Grundlov : Text og Forklaring (115 pages)
- Berg, Lauritz, Den lille Astronom (21 pages)
- Brandes, Edvard, Et Brud : Skuespil i tre Akter (117 pages)
- Brandes, Edvard, Overmagt : Skuespil i fire Akter (239 pages)
- Dilling, L., Gjennem Lorgnetten : Skitser. 3 (245 pages)
- Flood, Jørgen W., Kristiania Svaneapotheks Historie (35 pages)
- Kjerulf, Theodor, Stenriget og Fjeldlæren (297 pages)
- Madsen, H.Th., Om status og det enkle bogholderi : lærebog og haandbog (125 pages)
- Madsen, H.Th., Faciter til lærebog i handelsregning (45 pages)
- Rosing, Marie, 46 tegninger til brug ved undervisning i kvindeligt haandarbeide (51 pages)
- Rothschild, Lazarus von, Rothschilds lommebog i handelskundskab : indeholdende over 300 spørgsmaal og svar henhørende under handels- og kontorvidenskab (129 pages)
- Smitt, J., Norges Landbrug i dette Aarhundrede : et Tidsbillede (321 pages), which we already have in Google's scan
In 2014, the Scandinavian/Nordic countries commemorate three important anniversaries. Below are some links to literature that we provide, related to these events. This year marks:
- 200 years since the Norwegian constitution was signed at Eidsvoll on May 17, 1814. After the Napoleonic wars, Sweden (with Prussia) was among the winners and Denmark (with France) was among the losers. Norway, having been an integral part of the Kingdom of Denmark for 400 years, was ceded to Sweden by the Treaty of Kiel, but during the swap managed to declare its independence before entering into a union with Sweden. Norway got its own constitution, laws, and parliament. Only the king and foreign policy were to be common with Sweden.
  - Rigsforsamlingen paa Eidsvold, Illustreret norsk konversationsleksikon (1907-1913)
  - Bernt Moe, Biographiske Efterretninger om Eidsvolds-Repræsentanter og Storthingsmænd i Tidsrummet 1814-1845 (1845)
  - Fredrik Lagerroth, Kielertraktatens tolkning och tillämpning, Scandia (1940)
  - Nordahl Rolfsen, Februarmøtet paa Eidsvold, Læsebok for folkeskolen (1912)
  - Carl Th. Sørensen, Bernadotte i Norden eller Norges adskillelse fra Danmark og forening med Sverig (1903-1904)
  - Henrik Wergeland, Norges konstitutions historie, Samlade Skrifter (1841-1843)
- 150 years since the Second Schleswig War in 1864, from February 1 to October 30. When Prussia and Austria attacked Denmark, many in Norway and Sweden (in the union forged 50 years earlier) wanted to join the war in support of Denmark, but the government refused. This shattered all dreams of a political Scandinavian union. The idealistic student movement known as "Scandinavism" took a new direction, aiming for cultural exchanges instead of political ones. When Italy united in 1866 and Germany in 1871, Scandinavia remained divided.
  - Dansk-tyske Krige, Salmonsens konversationsleksikon (1916)
  - Andra slesvigska kriget, Nordisk familjebok (1917)
  - Carl Grimberg, När Sverige blev neutralt och upphörde att vara kolonialmakt, Svenska folkets underbara öden (1913-1939)
  - Carl Grimberg, När de svenska frivilliga i dansk-tyska kriget hotades med arkebusering, Svenska folkets underbara öden (1913-1939)
  - Adolf Helander, När Stockholmarne gjorde uppror 1864, Stockholmstyper (1901)
  - A.D. Jørgensen, Det slesvigholstenske Spørgsmål, Historiske Afhandlinger (1897)
  - Bengt Lidforss, Skandinavismens tidigare skeden, Utrikespolitiska vyer (1924)
  - Bernhard Elis Malmström, Prolog vid studentkonserterna för Danmark under kriget 1864, Dikter (1880)
  - Gustav Sundbärg, Skandinavismen, Det svenska folklynnet (1911)
- 100 years since the outbreak of World War I on July 28, 1914, in which all of Scandinavia remained neutral. Even though Finland was a Grand Duchy within the Russian empire from 1809 until 1917, emperor Nicholas II did not draft the Finns. The kings of the independent monarchies Sweden, Denmark, and Norway met in Malmö on December 18-19, 1914 to manifest their neutrality.
  - Bondetåget, Hvar 8 dag (February 8, 1914)
  - Bondetåget, Hvar 8 dag (February 15, 1914)
  - Världskriget, Nordisk familjebok (1922)
  - Ellen Key, Kriget, freden och framtiden (November 1914)
  - Rudolf Kjellén, Världskrigets politiska problem (1915)
  - Martin Koch, Februaridagarna 1914. Ögonblicksbilder från kristiden (1914)
  - Carl G. Laurin, Alla ha rätt samt andra uppsatser med anledning av världskriget (1917)
  - Fredrik Lindholm, Nationalhymner och soldatsånger under världskriget (1916)
  - Erik Lindorm, Världen i brand. En bokfilm över det stora kriget 1914-1918 (1935)
  - Anton Nyström, Före, under och efter 1914. Världskriget. Orsaker och ansvar (1915)
  - Carl Rosenblad, Krigsberedskap och folkanda (1917)
  - Karl Staaff, Angående ... Bondetåget, Politiska tal (February 11, 1914)
  - Otto Witt, Krigets tekniska sagor för stora och små (1915)
  - Världskulturen och kriget (1915)
A Summary of 2013
Project Runeberg turned 21 years old on December 13, 2013, the same age that Project Gutenberg (founded in 1971) was when we started in 1992. They are no longer twice our age.
We currently hold 1.49 million scanned book pages, which is a 45% increase over last year. Since the end of 2011, two years ago, our collections have doubled. 1.34 million pages (90 percent) are OCRed, but only 0.27 million (18 percent) are proofread. Proofreading progressed by only 20,000 pages this year.
Among important additions this year are the Swedish illustrated biographic dictionary Svenskt porträttgalleri (26 volumes, 1895-1913) and the Finnish encyclopedia Tietosanakirja (11 volumes, 1909-1922). The latter was scanned by the Internet Archive at the University of Toronto, from where we copied the scanned images and added our own improved OCR text.
For journals and periodicals, we only digitize volumes that were published more than 70 years ago. This year, however, the editors of the Swedish engineering journal Ny Teknik encouraged us to advance beyond this limit regarding their predecessor Teknisk Tidskrift. We did so in July by digitizing the years 1960 and 1962. We have received only thanks and no complaints, and will thus continue on this path.
Both in October and November, our website produced over 1.3 million pageviews, which is 14 percent less (!) than a year ago. Alexa currently ranks us as the world's 80,000th most visited website, which is similar to the websites of Norway's and Sweden's national libraries and, as far as we know, better than all museums and all other library websites in Scandinavia. Norway's national archive (arkivverket.no) is an exception in this group, having a website ranked as the 24,000th most visited.
Public Domain Day - January 1
Copyright lasts for an author's lifetime + 70 years. Every year on January 1, works by a new class of authors enter the public domain. Celebrate the Public Domain Day with us and other enthusiasts around the world on January 1, 2014, when works by authors who died in 1943 are released. For a list of such authors, translators, and illustrators, search our database of Nordic Authors or consult Wikipedia's Category:1943 deaths (available in many languages).
Among Scandinavian celebrities, we immediately spot Danish writer Henrik Pontoppidan, Finnish writer Maria Jotuni, Norwegian poet Nordahl Grieg, Norwegian sculptor Gustav Vigeland, Swedish composer Alice Tegnér, and Swedish painter Nils von Dardel.
Speed and predictions
Are we doing fine, or worse than we could have? Our growth graph, as shown here in December, has been overlaid with two dashed lines starting from the point in early 2006 when we had scanned 400,000 pages. If we had continued as we started, to add 120,000 pages/year, we would be at 1.2 million pages now, instead of 1.0 million. We could have reached 1.0 million pages two years ago. On the other hand, if we had continued to add 40,000 pages/year as we did during 2006-2010, we would still be at 680,000 pages and reach 1.0 million in 2021.
Both scenarios were possible, as is our present reality in between. One fascinating aspect is how the rate of proofreading seems to be entirely insensitive to the rate of scanning. It continues at about 30,000 pages/year, before 2006, after 2010, and in between.
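The two dashed lines are simple linear projections, and the figures above can be reproduced with a short sketch. The function names are my own, and the elapsed time of 7 years since early 2006 is an assumption chosen to match the quoted numbers.

```python
def projected_pages(start_pages, rate_per_year, years):
    """Pages after `years` of constant growth from `start_pages`."""
    return start_pages + rate_per_year * years


def years_to_reach(start_pages, rate_per_year, target_pages):
    """Years of constant growth needed to go from start to target."""
    return (target_pages - start_pages) / rate_per_year


START = 400_000      # pages scanned in early 2006
YEARS_ELAPSED = 7    # assumption: roughly early 2006 to the time of writing

if __name__ == "__main__":
    print(projected_pages(START, 120_000, YEARS_ELAPSED))  # fast scenario
    print(projected_pages(START, 40_000, YEARS_ELAPSED))   # slow scenario
    print(2006 + years_to_reach(START, 40_000, 1_000_000))   # slow line reaches 1.0 million
    print(2006 + years_to_reach(START, 120_000, 1_000_000))  # fast line reaches 1.0 million
```

Under that assumption the fast line gives 1.24 million pages (about 1.2 million) and the slow line 680,000. The slow line reaches 1.0 million after 15 years, in 2021, and the fast line after 5 years, in 2011.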
Update: As of September 7, 2013, we have 1.36 million pages and are no longer "in between", but well above the +120,000 pages/year dashed line.