Project Runeberg's front page section for February 2014:
by Lars Aronsson
Institutions are eager to reach out to new users and to find new ways to explore and combine their databases and digitized collections (such as scanned photos or books). Some develop their own websites, others use common sharing platforms such as Flickr or YouTube. Some cooperate with Wikipedia in various ways. An increasing trend is to organize a hackathon, a small festival where individuals or small companies are invited to compete to produce the best new ideas and software for reusing the organizing institutions' data.
In Oslo, the national arts council organized such an event, #HACK4NO, a hackathon for Norway, in early February 2014. Contributing institutions included the national library, the national archives, art museums, the encyclopedia Store Norske Leksikon, an environmental agency with a biodiversity database, and the national land survey (which produces maps).
Curious about everything that is happening in Norway, and how it differs from my native Sweden, I went there for three days of free food and interesting talks. It was seven years since I had last visited the Norwegian capital, and a whole new Manhattan-like skyline had grown up around the central train station.
As you might know, Norway's national library, Nasjonalbiblioteket, has a very impressive digitization program, which intends to digitize all of Norway's literature, both out of copyright and still in copyright. This is based on an agreement with Kopinor, a central organization for Norwegian copyright holders. The government pays a fee to compensate for the fact that all Norwegian citizens can read 20th-century literature online for free. But access is limited to IP addresses within the country. Foreigners who visit bokhylla.no only get access to the out-of-copyright literature, which means older books and government reports.
Project Runeberg has already copied many of these older, freely available books, sometimes adding our own OCR text, and made it possible for volunteers to proofread that text. We copy books from many sources, but the Norwegian ones are increasing quickly. One recent example is Den Norske husflidsforenings håndbok i vevning, on the craft of weaving.
Ahead of the hackathon, Nasjonalbiblioteket had released some new data and documentation on its digitization project, aimed at developers, including lists of all the digitized works. It turns out 160,000 works are available to Norwegians, but only 20,000 (or 13%) are considered free from copyright and available to the world. As my contribution to the competition, I decided to study the difference between these lists: the 140,000 non-free works. I was hoping to find some errors, works that really should be free but had not yet been made freely available.
The lists contain the names of the authors, apparently derived from the BIBSYS library catalog. Unfortunately, BIBSYS only rarely specifies authors' years of birth and death, so it's hard to identify which works were written by authors who died more than 70 years ago. Instead, I took an easier approach and looked at the year of publication, assuming that the oldest works are the most likely to have fallen out of copyright. The result is this list of the oldest non-free works. Initially it listed 392 books published before 1890, which turned out to be good candidates for a copyright investigation. Many of them should be freed.
My list was one of 14 entries in the competition. As many as 11 entries were based on geographic data, combining maps with coordinates of national heritage monuments and the like. Apparently, Norwegians prefer hiking and exploring the landscape to studying books. One entry, contributed by a Danish developer, compared the text of articles in Store Norske Leksikon to those in Wikipedia. My contribution was not "an app" and doesn't use any "API". It is a minimalistic script of less than 100 lines of code that downloads and compares two lists and produces a single HTML web page as its output.
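The core of such a comparison can be sketched in a few lines. The URLs, list format (plain text, one URN per line), and function names below are illustrative assumptions, not the actual script:

```python
import urllib.request

def fetch_urns(url):
    """Download a plain-text list, assumed to hold one URN per line."""
    with urllib.request.urlopen(url) as f:
        text = f.read().decode("utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}

def non_free(all_urns, free_urns):
    """The works that are digitized but not freely available."""
    return sorted(set(all_urns) - set(free_urns))

def html_page(urns):
    """Render the result as a single HTML page of links."""
    items = "\n".join('<li><a href="https://urn.nb.no/%s">%s</a></li>' % (u, u)
                      for u in urns)
    return "<html><body><h1>Non-free works</h1><ul>\n%s\n</ul></body></html>" % items

# Usage (hypothetical URLs standing in for the real hackathon lists):
#   all_urns = fetch_urns("https://example.org/all-digitized.txt")
#   free_urns = fetch_urns("https://example.org/freely-available.txt")
#   print(html_page(non_free(all_urns, free_urns)))
```

The set difference is the whole trick: everything on the long list that is missing from the short list is, by definition, a non-free work.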
My entry didn't win any prize in the competition. All I got was a t-shirt. But that was not the point. My aim was not to bring the catalog data to a wider audience, but to give Nasjonalbiblioteket feedback that would make them release more of the non-free books.

Recently, Nasjonalbiblioteket had added reader comments, based on Disqus.com, to its website. Comments can be added to the entry for any digitized book, and my idea was to use them for requests to make books freely available. But after two weeks it turned out that Nasjonalbiblioteket mostly doesn't read these comments, so they don't work as a feedback channel. Of the initial 392 books, two of the oldest were made freely available immediately after the hackathon. A third book was made free (and soon copied to Project Runeberg for proofreading: Gjennem Lorgnetten. 1) after I contacted Nasjonalbiblioteket on Twitter. They then suggested that I use e-mail as a feedback channel, which is my current approach. Each book in my list now has an e-mail (mailto) link which fills in the book's URN address in the subject line. It remains to be seen whether this works as a feedback channel. The response to my efforts can best be described as "slow".
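Such mailto links can be generated mechanically. A sketch, with a placeholder recipient address and an assumed subject-line wording (neither is the real one):

```python
from urllib.parse import quote

def mailto_link(urn, recipient="feedback@example.org"):
    """Build an HTML mailto link whose subject line carries the book's URN.

    The recipient address and subject wording are placeholders for
    illustration only.
    """
    subject = quote("Please make this book freely available: " + urn)
    return '<a href="mailto:%s?subject=%s">request</a>' % (recipient, subject)
```

Percent-encoding the subject matters: URNs contain colons, which are not safe in a mailto query string.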
There is currently a lot of hype around "open data" with hackathons and data releases being organized everywhere. In reality, the data that are released are often filled with errors. Opening the data to a wider audience will expose these errors and make it possible to correct old mistakes. But this also requires that the institutions are open to feedback, and interested in improving their data.
To clarify the phrase "filled with errors": of the 160,000 digitized books, I estimate that more than 100, possibly several hundred, are erroneously categorized as non-free. My little program generates a list of the most likely candidates. (There might also be errors in the opposite direction, but those are not my priority.) Even if more than 99% of the cases are correct ("two nines" in telecommunications terminology), that is still less than 99.9% ("three nines"). OCR software correctly recognizes between 99% and 99.9% of the characters, which is why we need manual proofreading, hoping to reach "four nines" (99.99% accuracy, or 1 error in 10,000 characters) or better.
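The "nines" arithmetic scales simply; the counts below are expectations under an assumed uniform error rate, not measurements:

```python
def expected_errors(total_records, accuracy):
    """Expected number of wrong records at a given accuracy level."""
    return round(total_records * (1 - accuracy))

# For 160,000 catalog records:
#   two nines   (99%)    -> 1,600 expected errors
#   three nines (99.9%)  ->   160
#   four nines  (99.99%) ->    16
```

On this scale, "more than 100 wrong records" in a 160,000-record catalog corresponds to roughly three nines of accuracy.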
Update: On March 1, the following works had become freely available and were copied to Project Runeberg. However, Nasjonalbiblioteket's lists of digitized books are no longer properly updated, but only show a smaller fraction of all books, and the library doesn't seem to be in any hurry to fix these errors.
March 9, the following 12 works were made free:
March 13, more books were added to the long list of digitized books, but none to the short list of publicly available books. One book was removed from the long list (a clear error), even though it should instead have been made free and added to the short list.
March 15, the following 5 works were made free:
March 16, no changes were made to my list. The total number of digitized books increased from 162,791 to 162,998 (+207), but the number of freely available books fell from 21,696 to 21,671 (-25). Fluctuations of this kind occur every day, which I wasn't prepared for. Did 25 books actually change from being freely available to not being so? Or do the lists simply not reflect reality? I haven't checked the day-to-day differences in detail, so I don't know. Maybe I should modify my software to check this.
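Such a check would be a small extension: keep each day's list and compare it with the next day's. A sketch, assuming each snapshot is an iterable of URN strings:

```python
def snapshot_diff(yesterday, today):
    """Compare two daily snapshots of a book list.

    Returns (added, removed): the URNs that appeared in today's list
    and the URNs that disappeared from yesterday's.
    """
    added = sorted(set(today) - set(yesterday))
    removed = sorted(set(yesterday) - set(today))
    return added, removed
```

Any nonempty `removed` list on the freely-available snapshot would flag exactly the kind of unexplained disappearance described above.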
March 18, P. Ulleland (translator), Volsungernes saga (1887) was removed from our list, not because it was made free (it wasn't), but because it disappeared from the list of digitized books! The book is still digitized, but no longer listed; only the 1902 reprint is listed as digitized (and not free). The author is apparently Peder Jakobsen Ulleland (1859-1892), who has been dead for more than 70 years, so both printings of the work should be made free.
March 19, P. Ulleland (translator), Volsungernes saga (1887) is back on our list.