Welcome to Project Runeberg
Project Runeberg (runeberg.org) is a volunteer effort to create free electronic editions of classic Nordic (Scandinavian) literature and make them openly available over the Internet.

Project Runeberg, February 2014


February 2014

Open data at #HACK4NO

by Lars Aronsson

Institutions are eager to reach out to new users and to find new ways to explore and combine their databases and digitized collections (such as scanned photos or books). Some develop their own websites, others use common sharing platforms such as Flickr or YouTube. Some cooperate with Wikipedia in various ways. An increasingly common approach is to organize a hackathon, a small festival where individuals or small companies are invited to compete with the best new ideas and software for reusing the data from the organizing institutions.

In Oslo, the national arts council organized such an event, #HACK4NO, a hackathon for Norway, at the beginning of February 2014. Contributing institutions included the national library, national archives, art museums, the encyclopedia Store Norske Leksikon, an environmental agency with a biodiversity database, and the national land survey (which produces maps).

Curious about everything that is happening in Norway, and how it differs from my native Sweden, I went there to participate in three days of free food and interesting talks. It had been seven years since I last visited the Norwegian capital, and a whole new Manhattan skyline has grown up around the central train station.

As you might know, Norway's national library, Nasjonalbiblioteket, has a very impressive digitization program, intending to digitize all of Norway's literature, both out of and still in copyright. This is based on an agreement with Kopinor, a central organization for Norwegian copyright holders. The government pays a fee to compensate for the fact that all Norwegian citizens can read 20th-century literature online for free. But access is limited to IP addresses based in the country. Foreigners who visit bokhylla.no only get access to the out-of-copyright literature, which means older literature and government reports.

Project Runeberg has already copied many of these older, freely available books, sometimes adding our own OCR text, and made it possible for volunteers to proofread that text. We copy books from many sources, but the number of Norwegian ones is growing quickly. One recent example is Den Norske husflidsforenings håndbok i vevning, on the craft of weaving.

Ahead of the hackathon, Nasjonalbiblioteket had released some new data and documentation on their digitization project, aimed at developers, including lists of all the digitized works. It turns out that 160,000 works are available to Norwegians, but only 20,000 (or 13%) are considered to be free from copyright and available to the world. As my contribution to the competition, I decided to study the difference between these lists: the 140,000 non-free works. I was hoping to find some errors, some works that really should be free but had not yet been made freely available.

The lists contain the names of the authors, apparently derived from the BIBSYS library catalog. Unfortunately, BIBSYS only rarely specifies the years of birth and death for authors, so it's hard to identify which works were written by authors who died more than 70 years ago. Instead, I took an easier approach and looked at the year of publication, assuming that the oldest works are more likely to have fallen out of copyright. The result is this list of the oldest non-free works. At the start it listed 392 books published before 1890, which turn out to be good candidates for a copyright investigation. Many of them should be freed.
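The comparison described above can be sketched in a few lines of Python. This is only a minimal sketch, not the actual script: the record layout (`urn`, `author`, `title`, `year`), the function name `oldest_non_free` and the sample data are assumptions made for illustration, and the lists published by Nasjonalbiblioteket may be structured differently.

```python
from typing import NamedTuple


class Work(NamedTuple):
    urn: str      # hypothetical identifier field
    author: str
    title: str
    year: int     # year of publication


def oldest_non_free(all_works, free_works, before_year=1890):
    """Return digitized works that are not free and were published before the cutoff, oldest first."""
    free_urns = {w.urn for w in free_works}
    candidates = [
        w for w in all_works
        if w.urn not in free_urns and w.year < before_year
    ]
    return sorted(candidates, key=lambda w: (w.year, w.title))


# Invented sample data, for illustration only.
all_works = [
    Work("urn:example:1", "Author A", "Old book", 1850),
    Work("urn:example:2", "Author B", "Free old book", 1860),
    Work("urn:example:3", "Author C", "Modern book", 1975),
    Work("urn:example:4", "Author D", "Another old book", 1887),
]
free_works = [all_works[1]]

for work in oldest_non_free(all_works, free_works):
    print(work.year, work.author, work.title)
```

Using the year of publication as the only filter is a deliberately crude stand-in for the real test (the author's year of death), so every work on the resulting list still needs a manual check.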

My list was one of 14 entries in the competition. As many as 11 entries were based on geographic data, combining maps with coordinates of monuments of national heritage, and the like. Norwegians apparently like hiking and exploring the landscape more than studying books. One entry, which compared the text of articles from Store Norske Leksikon to those in Wikipedia, was contributed by a Danish developer. My contribution was not "an app" and doesn't use any "API". It is a minimal script of fewer than 100 lines of code, downloading and comparing two lists, and producing a single web page in HTML as its output.

My entry didn't win any prize in the competition. All I got was a t-shirt. But that was not the point. My aim was not to bring the catalog data to a wider audience, but to provide feedback to Nasjonalbiblioteket, making them release more of the non-free books. Recently, Nasjonalbiblioteket had added reader comments based on Disqus.com to their website. Comments can be added to any entry for a digitized book. My idea was to use this for requests to make the book freely available. But after two weeks, it turns out that Nasjonalbiblioteket usually doesn't read these comments. It doesn't work as a feedback channel. Of the initial 392 books, two of the oldest were made freely available immediately after the hackathon. A third book was made free (and soon copied to Project Runeberg for proofreading: Gjennem Lorgnetten. 1) after I contacted Nasjonalbiblioteket on Twitter. They then suggested I should use e-mail as a feedback channel, which is my current approach. Each book in my list now has an e-mail (mailto) link which fills in the URN address of the book in the subject line of the e-mail. It remains to be seen if this works as a feedback channel. The response to my efforts can best be described as "slow".
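A mailto link with a pre-filled subject line, as described above, could be generated roughly as follows. This is a sketch under stated assumptions: the function name `mailto_link`, the e-mail address, the wording of the subject and the URN are all invented placeholders, not the ones used on the actual list.

```python
from html import escape
from urllib.parse import quote


def mailto_link(address, urn, label="Request free access"):
    """Build an HTML mailto link whose subject line is pre-filled with the book's URN."""
    # Percent-encode the subject so spaces and colons survive in the URL.
    subject = quote(f"Please make this book freely available: {urn}")
    href = f"mailto:{address}?subject={subject}"
    # Escape for safe use inside an HTML attribute and element body.
    return f'<a href="{escape(href)}">{escape(label)}</a>'


# Placeholder address and URN, for illustration only.
print(mailto_link("feedback@example.org", "URN:EXAMPLE:0000"))
```

Putting the URN in the subject line means the recipient can identify the book without the sender having to type anything.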

There is currently a lot of hype around "open data" with hackathons and data releases being organized everywhere. In reality, the data that are released are often filled with errors. Opening the data to a wider audience will expose these errors and make it possible to correct old mistakes. But this also requires that the institutions are open to feedback, and interested in improving their data.

To clarify the phrase "filled with errors": Out of the 160,000 digitized books, I estimate that more than 100, possibly several hundred, are erroneously categorized as non-free. My little program generates a list of the most likely candidates. (There might also be errors in the opposite direction, but those are not my priority.) Even though more than 99% of cases are correct (known in telecommunications terminology as "two nines"), a few hundred errors put the accuracy below 99.9% ("three nines"). OCR software correctly recognizes between 99% and 99.9% of the characters, which is why we need manual proofreading, hoping to reach "four nines" (99.99% accuracy or 1 error in 10,000 characters) or more.
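The "nines" arithmetic can be checked with a small sketch. The function name `nines` is made up for this example, and the error counts passed to it are illustrative values within the estimate above, not measured figures. Integer arithmetic is used to avoid floating-point rounding at the exact boundaries.

```python
def nines(errors, total):
    """Count the 'nines' of accuracy: the largest k such that errors/total <= 10**-k."""
    if errors <= 0:
        raise ValueError("with no errors the accuracy is perfect")
    k = 0
    while errors * 10 ** (k + 1) <= total:
        k += 1
    return k


TOTAL_BOOKS = 160_000  # figure from the text

for errors in (100, 160, 300):  # illustrative error counts
    accuracy = 100 * (1 - errors / TOTAL_BOOKS)
    print(f"{errors} errors: {accuracy:.4f}% correct, {nines(errors, TOTAL_BOOKS)} nines")

# One wrong character in 10,000 corresponds to "four nines".
print(nines(1, 10_000))
```

With 160,000 books, exactly 160 errors correspond to 99.9%, so the accuracy falls below "three nines" only when the number of miscategorized books exceeds 160.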

Update: On March 1, the following works had become freely available and were copied to Project Runeberg. However, the lists of Nasjonalbiblioteket's digitized books are no longer properly updated and show only a fraction of all books, and Nasjonalbiblioteket doesn't seem to be in any hurry to fix the errors.

March 9, the following 12 works were made free:

March 13, more books were added to the long list of available books, but none to the short list of publicly available books. One book was removed (a clear error) from the long list, even though it should have been made free and added to the short list.

March 15, the following 5 works were made free:

March 16, no changes were made to my list. The total number of digitized books increased from 162,791 to 162,998 (+207) but the number of freely available books fell from 21,696 to 21,671 (-25). Fluctuations of this kind occur every day, which I wasn't prepared for. Did 25 books actually change from being freely available to not being so? Or do the lists simply not reflect reality? I haven't checked the day-to-day differences in detail, so I don't know. Maybe I should modify my software to check this.
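Such a day-to-day check could look something like the sketch below, which compares two snapshots of the same list. The function name `diff_snapshots` and the identifiers are invented for illustration; how the snapshots would be stored between runs is left open.

```python
def diff_snapshots(previous, current):
    """Return (added, removed): identifiers that appeared or disappeared between two snapshots."""
    previous, current = set(previous), set(current)
    return current - previous, previous - current


# Invented identifiers, for illustration only.
free_yesterday = {"book-1", "book-2", "book-3", "book-4"}
free_today = {"book-1", "book-4", "book-5"}

added, removed = diff_snapshots(free_yesterday, free_today)
print("net change:", len(free_today) - len(free_yesterday))
print("added:", sorted(added))
print("removed:", sorted(removed))
```

The point of listing both sets is that a net change such as -25 can hide a larger number of additions and removals that partly cancel out.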

March 18, P. Ulleland (translator), Volsungernes saga (1887) was removed from our list. Not because it was made free (it isn't) but because it disappeared from the list of digitized books! It is still digitized, however, but no longer listed. Only the 1902 reprint is listed as digitized (and not free). The author is apparently Peder Jakobsen Ulleland (1859-1892), who has been dead for more than 70 years, so both printings of the work should be made free.

March 19, P. Ulleland (translator), Volsungernes saga (1887) is back on our list.


January 2014

Three Anniversaries

In 2014, the Scandinavian/Nordic countries commemorate three important anniversaries. Below are some links to literature that we provide, related to these events. The anniversaries are:


December 2013

A Summary of 2013

Project Runeberg turned 21 years old on December 13, 2013, the same age that Project Gutenberg (founded in 1971) was when we started in 1992. They are no longer twice our age.

We currently hold 1.49 million scanned book pages, which is a 45% increase over last year. Since the end of 2011, two years ago, our collections have doubled. 1.34 million pages (90 percent) are OCRed, but only 0.27 million (18 percent) are proofread. Proofreading advanced by only 20,000 pages this year.

Among important additions this year are the Swedish illustrated biographic dictionary Svenskt porträttgalleri (26 volumes, 1895-1913) and the Finnish encyclopedia Tietosanakirja (11 volumes, 1909-1922). The latter was scanned by the Internet Archive at the University of Toronto, from where we copied the scanned images and added our own improved OCR text.

For journals and periodicals, we only digitize volumes that were published more than 70 years ago. This year, however, the editors of the Swedish engineering journal Ny Teknik encouraged us to advance beyond this limit regarding their predecessor Teknisk Tidskrift. We did so in July by digitizing the years 1960 and 1962. We have received only thanks and no complaints, and will thus continue on this path.

In both October and November, our website produced over 1.3 million pageviews, which is 14 percent less (!) than a year ago. Alexa currently ranks us as the world's 80,000th most visited website, which is similar to the websites of Norway's and Sweden's national libraries and, as far as we know, better than those of all museums and all other libraries in Scandinavia. Norway's national archive (arkivverket.no) is an exception in this group, having a website ranked as the 24,000th most visited.

Public Domain Day - January 1

Copyright lasts for an author's lifetime + 70 years. Every year on January 1, works by a new class of authors enter the public domain. Celebrate Public Domain Day with us and other enthusiasts around the world on January 1, 2014, when works by authors who died in 1943 enter the public domain. For a list of such authors, translators, and illustrators, search our database of Nordic Authors or consult Wikipedia's Category:1943 deaths (available in many languages).

Among Scandinavian celebrities, we immediately spot Danish writer Henrik Pontoppidan, Finnish writer Maria Jotuni, Norwegian poet Nordahl Grieg, Norwegian sculptor Gustav Vigeland, Swedish composer Alice Tegnér, and Swedish painter Nils von Dardel.
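The rule above (lifetime + 70 years, with works released on the following January 1) reduces to simple arithmetic. The sketch below uses made-up function names and only covers this basic rule; it ignores special cases such as anonymous works or works with several authors.

```python
COPYRIGHT_TERM_YEARS = 70  # author's lifetime + 70 years, as stated in the text


def public_domain_year(death_year):
    """Year in which, on January 1, the author's works enter the public domain."""
    return death_year + COPYRIGHT_TERM_YEARS + 1


def is_public_domain(death_year, current_year):
    """True if the author's works are in the public domain during current_year."""
    return current_year >= public_domain_year(death_year)


print(public_domain_year(1943))      # authors who died in 1943
print(is_public_domain(1943, 2013))
print(is_public_domain(1943, 2014))
```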


January 2013

Project Runeberg growth 2003-2013 with dashed lines for two predictions

Speed and predictions

Are we doing fine, or worse than we could have done? Our growth graph, as shown here in December, has been overlaid with two dashed lines starting from the point in early 2006 when we had scanned 400,000 pages. If we had continued as we started, to add 120,000 pages/year, we would be at 1.2 million pages now, instead of 1.0 million. We could have reached 1.0 million pages two years ago. On the other hand, if we had continued to add 40,000 pages/year as we did during 2006-2010, we would still be at 680,000 pages and reach 1.0 million in 2021.

Both predictions were possible, just as our present reality in between is. One fascinating aspect is how the rate of proofreading seems to be entirely insensitive to the rate of scanning. It continues at about 30,000 pages/year, before 2006, after 2010, and in between.
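The two dashed lines are plain linear extrapolations, and the figures quoted above can be reproduced with a short sketch. It assumes that "early 2006" is treated as the start of 2006 and "now" as January 2013; the function names are made up for this example.

```python
START_YEAR = 2006        # "early 2006", treated here as the start of the year
START_PAGES = 400_000    # pages scanned at that point, from the text


def projected_pages(rate_per_year, year):
    """Pages we would hold in the given year at a constant scanning rate."""
    return START_PAGES + rate_per_year * (year - START_YEAR)


def year_reaching(target_pages, rate_per_year):
    """Year in which a constant scanning rate reaches the target page count."""
    return START_YEAR + (target_pages - START_PAGES) / rate_per_year


NOW = 2013  # the entry is dated January 2013

for rate in (120_000, 40_000):
    print(f"{rate} pages/year: {projected_pages(rate, NOW)} pages in {NOW}, "
          f"1.0 million reached in {year_reaching(1_000_000, rate):.0f}")
```

Under these assumptions the faster line gives 1.24 million pages in 2013 (the text rounds this to 1.2 million) and reaches 1.0 million in 2011, while the slower line gives 680,000 pages and reaches 1.0 million in 2021.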

Update: As of September 7, 2013, we have 1.36 million pages and are no longer "in between", but well above the +120,000 pages/year dashed line.


Project Runeberg, 2014-07-31 00:20 (runeberg)
http://runeberg.org/

All our files are DRM-free.