Speed and predictions
Are we doing fine, or worse than we could have? Our growth graph, as shown here in December, has been overlaid with two dashed lines starting from the point in early 2006 when we had scanned 400,000 pages. If we had continued as we started, to add 120,000 pages/year, we would be at 1.2 million pages now, instead of 1.0 million. We could have reached 1.0 million pages two years ago. On the other hand, if we had continued to add 40,000 pages/year as we did during 2006-2010, we would still be at 680,000 pages and reach 1.0 million in 2021.
Both outcomes were possible, just like our present reality in between. One fascinating aspect is how the rate of proofreading seems to be entirely insensitive to the rate of scanning. It has continued at about 30,000 pages/year before 2006, after 2010, and in between.
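The two dashed lines are plain linear extrapolations, and the figures above can be checked with a few lines of Python. This is only a sketch: it assumes that "early 2006" means the start of that year and that "now" is the end of December 2012, and the function names are made up for illustration.

```python
# Linear extrapolation from the point where we had 400,000 scanned pages.
START_YEAR = 2006.0      # assumption: "early 2006" taken as the start of the year
START_PAGES = 400_000
NOW = 2013.0             # assumption: end of December 2012

def pages_at(year, pages_per_year):
    """Pages we would have at `year` with a constant scanning rate."""
    return START_PAGES + pages_per_year * (year - START_YEAR)

def year_reaching(target_pages, pages_per_year):
    """Year in which a constant rate would reach `target_pages`."""
    return START_YEAR + (target_pages - START_PAGES) / pages_per_year

fast_now = pages_at(NOW, 120_000)             # roughly 1.2 million
slow_now = pages_at(NOW, 40_000)              # 680,000
fast_million = year_reaching(1_000_000, 120_000)
slow_million = year_reaching(1_000_000, 40_000)

print(fast_now, slow_now, fast_million, slow_million)
```

Under these assumptions the fast line reaches 1.0 million pages in 2011 and the slow line in 2021, which matches the statements above.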
Twenty years December 1992-December 2012
Project Runeberg started on the evening of December 13, 1992, which means we are now entering our 21st year. During 2012 we reached several milestones:
- Our 3,000th scanned volume was uploaded on November 27.
- Our 1,000,000th scanned page was uploaded on November 22.
- Our 250,000th page was proofread on November 20.
- Our 500,000th page was indexed on May 31.
- During December we converted all books scanned before 2005 from ISO 8859-1 (Latin-1) to the UTF-8 (Unicode) character set. Books scanned after 2005 already use the new standard, which allows a mix of Greek, Cyrillic, Chinese and other characters.
- During the year, we recruited new volunteers in scanning and bought a faster scanner, which allowed our collection to grow much faster. One third of our 1,000,000 pages were uploaded during 2012.
- Our Facebook page remained popular and active, and attracted new fans, currently 1800 in all.
- Our website remained popular with between 1 and 2 million page views per month. As a comparison, this is roughly 1 percent of the web traffic to the Swedish language Wikipedia.
There are also many things we didn't accomplish:
- The speed of proofreading didn't increase, but remained at circa 30,000 pages per year. That's a good speed, but doesn't correspond to our increased scanning speed.
- We failed to secure funding for expanding the project. It is still hard to explain why digitizing books is beneficial to society, or why digitized books or newspapers should also be made easily and openly available.
Unicode conversion
An 8-minute video explains in Swedish how and why we're converting to Unicode. If you have trouble viewing the video, perhaps the version on Facebook works better.
One fifth of our collection, or some 200,000 scanned pages, was uploaded before 2005, and its text was encoded in ISO 8859-1 (Latin-1), an international 8-bit standard defined in the 1980s for the letters of the western European languages. It can represent Danish æøå, French àçë, German äöüß, Swedish åäö, and Icelandic þð, but not Greek, Cyrillic, Hebrew, Arabic, and Chinese letters or other special characters.
More recently, all texts have been uploaded in UTF-8 (Unicode, ISO 10646), an international standard capable of representing many thousands of characters from virtually all languages of the earth.
During December 2012, we have been converting texts from the old standard to the new one. The video (above) illustrates this. Another example is seen in our preface to Fänrik Ståls sägner, one of our first e-texts from 1993.
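For readers who want to do a similar conversion on their own files, the core step can be sketched in Python as below. This is not the script we actually ran; the function names are hypothetical and the sample works on bytes in memory rather than on files. Since any byte sequence is valid Latin-1, the sketch first checks whether the data already decodes as UTF-8, to avoid converting the same text twice.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes already decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO 8859-1 bytes as UTF-8, leaving UTF-8 input untouched."""
    if looks_like_utf8(data):
        # Pure ASCII or already converted: nothing to do.
        return data
    return data.decode("iso-8859-1").encode("utf-8")

# A title containing three non-ASCII letters, as stored in the old encoding.
old_bytes = "Fänrik Ståls sägner".encode("iso-8859-1")
new_bytes = latin1_to_utf8(old_bytes)

print(len(old_bytes), len(new_bytes))
print(new_bytes.decode("utf-8"))
```

Each of the letters ä and å takes one byte in Latin-1 and two bytes in UTF-8, so the converted text is slightly longer than the original.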
Visitor statistics for periodicals
Recently, we have been digitizing old volumes of periodicals, including journals, yearbooks, biographic dictionaries, calendars, and directories. Since each volume (each year) is a book of 200 to 1200 pages, their popularity can be compared against each other, to find which titles and which years attract more viewers. This could be useful as a guide for what to digitize next. What are our readers looking for?
The following top list is based on our web server access log over 14 days, filtering out accesses from search engine robots/crawlers. This is not an entirely scientific survey, but a rough indication.
| Rank | Page views | Volume URL |
|---|---|---|
| 1 | 5995 | adelskal/1923 |
| 2 | 4014 | svindkal/1947 |
| 3 | 3460 | statskal/1963 |
| 4 | 3287 | vemardet/1993 |
| 5 | 2955 | statskal/1984 |
| 6 | 2681 | statskal/1955 |
| 7 | 2640 | statskal/1925 |
| 8 | 2521 | karlebo/1936 |
| 9 | 2373 | vemardet/1969 |
| 10 | 2105 | hvemerhvem/1930 |
| 11 | 2099 | hvemerhvem/1948 |
| 12 | 2043 | hvemerhvem/1973 |
| 13 | 1863 | rikskal/1908 |
| 14 | 1758 | vemardet/1985 |
| 15 | 1569 | vemardet/1957 |
| 16 | 1384 | vemardet/1925 |
| 17 | 1203 | statskal/1881 |
| 18 | 1161 | vemardet/1977 |
| 19 | 1061 | vemardet/1943 |
| 20 | 926 | hvemerhvem/1912 |
| 21 | 853 | vemardet/1933 |
| 22 | 836 | blaabog/1910 |
| 23 | 820 | sundtidn/1888 |
| 24 | 647 | svlartid/1933 |
| 25 | 646 | vemardet/1963 |
| 26 | 635 | urf/1891 |
| 27 | 562 | pht/1903 |
| 28 | 555 | jernkont/1910 |
| 29 | 497 | pht/1899 |
| 30 | 461 | nfm/1938 |
| 31 | 445 | tiden/1911 |
| 32 | 427 | tiden/1912 |
| 33 | 383 | biblblad/1940 |
| 34 | 376 | svlartid/1893 |
| 35 | 365 | svlartid/1924 |
| 36 | 364 | svlartid/1930 |

Legend:
- adelskal = Directory of Swedish nobility
- biblblad = Biblioteksbladet, Swedish library journal
- blaabog = Danish biographic dictionary
- hvemerhvem = Norwegian biographic dictionary
- jernkont = Jernkontorets annaler, journal of Swedish iron and steel industry
- karlebo = A mechanical engineer's handbook
- nfm = Nordisk familjeboks månadskrönika, monthly supplements to a Swedish encyclopedia
- pht = Personhistorisk tidskrift, Swedish journal of genealogy
- rikskal = Directory of Swedish government officials
- statskal = Directory of Swedish government officials
- sundtidn = Sundsvalls Tidning, a Swedish daily newspaper
- svindkal = Directory of Swedish industry
- svlartid = Swedish teachers' journal
- tiden = Tiden, Swedish social-democratic journal
- urf = Under röd flagg, a Swedish socialistic journal
- vemardet = Swedish biographic dictionary
Conclusion: For those curious about whether the teachers' journal is more or less popular than the library journal, it may come as a revelation that directories are by far the most popular. The first journal volume in this list ranks as number 24. If the Internet is a library, then the reference shelf (including directories, calendars, dictionaries, and encyclopedias, such as Wikipedia) will be the most useful part of it.
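A count like the one behind this top list can be sketched in Python as follows. It is an illustration, not the script we used, and it rests on several assumptions: that the access log is in the common "combined" format, that robots can be recognized by crude hints in the user-agent string, and that page URLs look like `/adelskal/1923/0012.html`. The sample log lines and page file names are made up.

```python
import re
from collections import Counter

# Assumed "combined" log format: host, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r' "[^"]*" "(?P<agent>[^"]*)"'
)

# Crude heuristic (an assumption): substrings that suggest a robot/crawler.
ROBOT_HINTS = ("bot", "crawl", "spider", "slurp")

def volume_of(path):
    """Map '/adelskal/1923/0012.html' to 'adelskal/1923', else None."""
    parts = path.strip("/").split("/")
    if len(parts) >= 3 and parts[1].isdigit():
        return parts[0] + "/" + parts[1]
    return None

def count_views(lines):
    """Count successful, non-robot page views per volume."""
    views = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m["status"] != "200":
            continue
        agent = m["agent"].lower()
        if any(hint in agent for hint in ROBOT_HINTS):
            continue
        volume = volume_of(m["path"])
        if volume:
            views[volume] += 1
    return views

# Made-up sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [01/Jun/2012:10:00:00 +0200] "GET /adelskal/1923/0012.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [01/Jun/2012:10:00:05 +0200] "GET /adelskal/1923/0013.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.3 - - [01/Jun/2012:10:00:09 +0200] "GET /statskal/1963/0001.html HTTP/1.1" 200 4096 "-" "ExampleBot/1.0 (crawler)"',
]

for volume, n in count_views(sample).most_common():
    print(n, volume)
```

Filtering on the user-agent string misses robots that pretend to be browsers, which is one reason the list should be read as a rough indication only.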
All our files are DRM-free
Digital Rights Management (DRM) is a class of access control technologies that are used by hardware manufacturers, publishers, copyright holders and individuals with the intent to limit the use of digital content and devices after sale. (Wikipedia)
Project Runeberg provides free electronic editions of Nordic literature, to be used freely, without restrictions and limitations. That's why we don't use DRM in any of our files. In the case that any remaining copyright would restrict you, we hope that you will respect that, but we are not trying to stop you. We are joining the Free Software Foundation's campaign to inform our readers about the benefits of DRM-free file formats.
This little logotype at the bottom of our web pages is there to keep you reminded. Click it to learn more.
Full speed in July
The very rainy summer in Sweden has enabled Project Runeberg to continue at almost full speed, growing by 20,000 scanned pages during July. We now have more than 900,000 pages in all, corresponding to 45 linear metres of shelving.
Indexing articles
Suddenly, on page 46, issue 2, 1940, of Biblioteksbladet, the Swedish library journal, there is an article announcing that printed library catalog cards can be ordered for articles found in Swedish journals: "Tryckta katalogkort å tidskriftsuppsatser". The Swedish public library association (Sveriges allmänna biblioteksförening) organized this effort to index and cross-reference articles in the 32 most important Swedish journals, including their own Biblioteksbladet. At least in theory, there must be a printed catalog card for this article that announces that these cards are now available.
Most library catalogs contain one database record, corresponding to one paper card, for each book. While this works fine for novels and monographs, it's a poor solution for collections of essays, short stories, and poems, and even more so for anthologies, readers, and journals, where the individual articles (or chapters or poems) are the important items that a reader would search for. In journals, each article might also have different authors and topic classification. The explanation is of course that a database of all articles can be ten to a hundred times larger than a database of all books. To compile such a register is a much more ambitious endeavor. When it was done, it was much more spotty than the full coverage normally expected of library catalogs or national bibliographies. That only 32 journals were indexed, and only from 1940 onwards, are signs of this spottiness.
But this was the 1940s, and today everything is much better organized, right? Not quite, unfortunately.
After article indexing, the next logical step, as computers and networks develop, would be full text searching. But it was only in 2012 that Project Runeberg, this volunteer initiative, scanned year 1940 of Biblioteksbladet, making full text searching possible. Nobody had done this before: neither the national institutions, nor the owner, the Swedish library association. The fact is that a search for "tryckta katalogkort", the title of this article, in the Swedish national library catalog Libris yields only two hits: for two other articles in Biblioteksbladet, one by Knut Tynell in 1917 and one by Jonas Samzelius in 1922, but not the article in 1940. Even though a printed catalog card was offered in 1940, no corresponding database record exists today.
As part of our digitization, we compile a simple table of contents, or list of articles. This is not a full library catalog record, but not far from it. Our help page on indexing (in Swedish) and the 1999 essay on Project Runeberg's Electronic Facsimile Editions of Nordic Literature describe the format of the Articles.lst file that we maintain for each scanned volume. If we could download existing library records for each article of a journal, this would be a great help. Unfortunately, even though the search above returned two articles from 1917 and 1922, we can't trust Libris to contain records for all articles from these years.
Even though early examples of article indexing across journals can be found in the 19th century, it was only in the first half of the 20th century that lasting projects started in the Scandinavian countries.
The development in Sweden has been documented in some books by Jan-Eric Malmquist: Tidskriftens innehåll : om artikelregistrering och artikelsökning (1987), Tidskriftiana : en skrift om tidskrifter (1990), and "Förstudien" : om behovet av en ny svensk periodicabibliografi (1991). Such rare documentation is written for a purpose. Jan-Eric is the founder of Artikelsök, a Swedish commercial online article database, operated by BTJ since 1984, and was investigating its possible extension into a national bibliography with government funding. This never came about, and Artikelsök is still a commercial offering from BTJ, but the attempt left this documentation.
In 1915, Dansk tidsskrift index was founded, indexing articles in 80 Danish journals. The coverage increased to 300 journals in 1929 and 500 journals in 1978. In 1940, Avis kronik index was added, indexing articles in 25 Danish newspapers. The two merged in 1979 to form Dansk artikelindeks. Book reviews were not included, but a separate Dansk anmeldelsesindeks was added in 1979. All of these are part of the Danish national bibliography. In 1994, the printed indexes were discontinued and databases introduced. In the following decade, all databases have been made available in the common portal bibliotek.dk, where all articles back to 1915 can be searched.
In 1918, librarian Wilhelm P. Sommerfeldt founded Norsk tidsskriftindex, indexing articles in Norwegian journals, an effort that lasted until 1965. There is a gap in indexing for 1966-1979, before Norske tidsskriftsartikler (Nota) was started in 1980. In 1988, its name changed to Norart. It covers articles in more than 400 Norwegian journals. Even though Norart is part of the Norwegian national bibliography, operated by the national library, it used to be a subscription database costing NOK 1000 per year for individual customers. This changed in January 2006, when the fee was removed and Norart web search was provided free to all. The older index for 1918-1965 has not been digitized.
In its early decades, the Swedish library journal contained many articles discussing indexing, but it was only in 1940 (as mentioned above) that an effort was started. It was a collaboration, involving librarians at several public libraries, initially six of them, coordinated by the sales office of the library association (Bibliotekens försäljningscentral; later Bibliotekstjänst) in Örebro. The resulting catalog cards were printed on reseda green cardboard. The fee for an annual subscription was SEK 30 for an estimated 400 cards. Initially, 51 subscribers signed up, of which 42 were public libraries, the other 9 being various schools.
The effort lasted until 1951, when the cards were abandoned. The cards for the first five years have been compiled into a bound volume Svenska tidskriftsartiklar 1940-1945 (1985) by librarian Annika Lange.
In 1952, Svensk tidskriftsindex was introduced, being a printed volume per year, rather than separate catalog cards. While somewhat less convenient for the readers, it saved the librarians all the work of sorting the cards into the existing catalog drawers.
In 1984, Bibliotekstjänst started to offer an online database, Artikelsök, covering articles since 1979, to subscribing libraries. This is the state still today. The articles before 1979 are only indexed in the printed volumes.
Clearly, the three Scandinavian countries have chosen different paths. Most citizens will never notice, since they only look for articles in their own language, but there is still a lesson to learn. Denmark and Norway are able to provide the index free of charge, while Sweden relies on a commercial offering, only available through subscribing libraries. Sweden and Norway only provide online indexes for articles from the 1980s or more recent, while Denmark goes back to 1915 for journals and 1945 for newspapers.
For Project Runeberg, this opens a field of activity in digitizing the pre-1980 printed indexes for Sweden and Norway. Indexes are catalogs, not covered by full copyright, but only 15 years of catalog protection. Even though the indexes for 1940-1978 can be digitized, the articles and journals for these years are still under copyright. For most of the Swedish pre-1940 journals that Project Runeberg has digitized, no article indexes have ever been compiled, except for a few individual articles that might have been cataloged in Libris.
No copyright, only catalog protection, for Who's Who
Two and a half years ago, in October and November 2009, we started an experiment in copyright law by digitizing some works in the Who's Who genre. These are periodically published directories of contemporary people, full of biographic details about their birth, education and career.
By scanning volumes that were less than 70 years old, we wanted to test the hypothesis that such works are not covered by literary copyright, but only by the lesser "catalog protection" established in the copyright laws of the Scandinavian countries in the 1960s. Catalog protection, according to Danish copyright law section § 71, Norwegian copyright law 1961-05-12 nr 02 section § 43, and Swedish copyright law 1960:729 section § 49, is granted for a period of 15 years after the first publication of a "catalog, a table, a database or similar".
It's not clear exactly what kinds of works are covered by the full literary copyright for the author's lifetime + 70 years. We think that this covers traditional encyclopedias and biographic dictionaries such as Dansk biografisk Lexikon and Svenskt biografiskt handlexikon, which contain written stories with complete sentences. But a Who's Who contains only chronological lists of abbreviations, based on information submitted by the described individuals themselves.
In the 30 months that have passed since we started to digitize such works, not a single complaint or take-down notice has been received. While this is not a final answer, we take this opportunity to celebrate that our hypothesis has prevailed for two and a half years. As more time goes by and more works are being digitized, our interpretation manifests itself as the way it was always done.
- Biografiskt album för Svenska Missionsförbundet (Swedish)
- Hvem er Hvem? (Norwegian)
- Kraks Blå Bog (Danish)
- Merkantilt biografisk leksikon (Norwegian)
- Vem var det (Swedish)
- Vem är det (Swedish)
- Vem är hon (Swedish)
- Vem är vem (Swedish)
- Vem är vem inom handel och industri (Swedish)
What is the Swedish state doing?
Here are a few selected grains, slowly ground by the mill that is driven by Swedish taxpayers' money.
If any of the items below makes you happy or frustrated, write a letter to the responsible minister. For Kungliga biblioteket (the national library) it is Minister for Education Jan Björklund; for Riksarkivet (the national archives) and Riksantikvarieämbetet (the national heritage board), Minister for Culture Lena Adelsohn Liljeroth; for SCB (Statistics Sweden), Minister for Finance Anders Borg; for the digital agenda, IT Minister Anna Karin Hatt; for Kulturarvslyftet, Minister for Employment Hillevi Engström. You can also donate money to Project Runeberg, which receives none of the government funds.
1. Old words take on new meanings (newspeak, dear Orwell) when IT Minister Anna Karin Hatt sets out to implement her digital agenda for Sweden and, on June 7, 2012, appointed a digitization commission (digitaliseringskommission), led by professor Jan "Gulan" Gulliksen. The word digitalisering is used here for what used to be called computerization (datorisering): the introduction of computers in various parts of public administration. It seems to have nothing to do with digitization in the old sense of conversion from analog to digital format, for example scanning of books.
2. Kulturarvslyftet was launched in September 2011 as a labour market measure, worth SEK 800 million, to put unemployed academics to work on inventorying, caring for, and digitizing the cultural heritage. That sounds very commendable, provided we get value for the money. It is all coordinated by Riksantikvarieämbetet. But on May 29, 2012, Göteborgs-Posten reported that this billion-kronor effort had so far yielded only 1 job in the Gothenburg region and 25 in the whole country, against a planned 4400.
3. Riksarkivet announced on March 29, 2012 that Svenskt biografiskt lexikon (SBL) is now available free on the web. At last! This unique biographic reference work was started within the Bonnier group in 1917, but never became profitable. Publication was taken over by the state in 1962 and has been part of Riksarkivet since 2009. In 2004 a CD-ROM edition appeared, which has now been sold out for several years. All's well that ends well? Well... not quite. Publication has still only progressed from Abelin (1917) to Ström (volume 33, 2011), and at least five volumes remain. The current web version consists of text only, no facsimile images, and the text contains OCR errors. Half the job is left undone here. Do it over! Nor is any open Creative Commons license stated.
4. Riksarkivet's large SVAR database is still not openly available, but costs SEK 995 per year in subscription fees. How many years will we have to wait for its release? This has been paid for with tax money, so set it free. Now, immediately!
5. Since September 2011, Riksarkivet has been running Digisam, a secretariat for coordinating digitization (in the older sense of the word) of the cultural heritage at 24 Swedish state memory institutions, among them Kungliga biblioteket, Nationalmuseum, Nordiska museet, Riksantikvarieämbetet, Riksarkivet, Svenska filminstitutet and Tekniska museet.
- Does Digisam have any money to speed up digitization?
- No.
- Can Digisam tell what has already been digitized, so that duplicate work can be avoided?
- No.
- Can Digisam tell which state institution is going to digitize, for example, older volumes of the journal Ord och Bild?
- The strategy for digitization of the cultural heritage, adopted by the government, states: "All state institutions that collect, preserve and make available cultural heritage material and cultural heritage information shall have a plan for digitization and accessibility."
- So it is each agency/institution that must answer the necessary questions of what, whether, and in what order its own collections are to be digitized.
- Digisam has offered to drive and hold together the work on the plans, so that they become comparable and so that we can jointly identify and handle shared, strategically crucial issues. In that work, which is planned for 2013, the question of responsibility for printed matter that may exist in several different collections will most certainly come up.
6. Riksarkivet cooperates with Kungliga biblioteket on digitizing daily newspapers within the EU-funded project Digidaily. Very commendable! KB has 122 million newspaper pages on its shelves, and so far 1.6 million pages have been digitized. But (hold on tight) the project does not include making them available online. They scan and store, but show nothing! What were they thinking when they drew up that plan? Should the responsible minister resign? Or is this perfectly okay?
7. Kungliga biblioteket has bought a book scanning robot, which turns the pages by itself, for SEK 850,000. How many books has it scanned so far? It is said to be working on Statens offentliga utredningar (SOU, the official government reports), from the first year 1922 through 1988, which would be excellent material since it is free from copyright from the outset. But when will they become readable online? On April 20, 2012, Christian Zeising wrote that they had reached the 1950s and that the reports would be published online "within the next few weeks". Two months later, nothing more has happened.
Update: On September 7, 2012, Kungliga biblioteket announced that 930 SOU (out of 5600 in total) are now available. See our index page for more information.
8. Statistiska centralbyrån (SCB) has digitized the book series Bidrag till Sveriges officiella statistik (BiSOS) from the 19th century and other historical statistics, in total 4400 volumes with 650,000 pages (roughly the same size as Project Runeberg). Like the current statistics, everything is open and easily accessible on the web. Perhaps this is where the government should recruit the next national archivist? A report can be found in Biblioteksbladet, no. 3, 2012, page 17 (PDF).
What exactly were we doing in May?
During May 2012, Project Runeberg's collections continued to grow fast, both by scanning and by importing scanned works from external sources. It seems that a total of 39,138 scanned pages were added. While this is just a bit more than half of what we did in April, it's almost as much as we did in the first three months of this year and more than we did in all of 2006 or 2010.
Some authors whose works were added are John Bunyan, Frederik Dreier, Erik Jakob Ekman, Albert Engström, Carl Grimberg, Joseph Guinchard, Mauritz Hallberg, Johan Ludvig Heiberg, Bror Emil Hildebrand, Victor E. Lennstrand, Carl Gustaf af Leopold, Jack London, and Christian Winther. There might be others as well, and there are some works without a named author.
But which were the texts that we added? And what other progress did we make during the month? Previously, we provided a list as part of this monthly update. But the list was compiled semi-manually in an improvised fashion. We need a better system for this and thought volunteers like you may help out. Already, our catalog can be sorted by the date each work was added, to get a list of recently published titles, but this is only half the truth. The catalog only lists the top level titles, such as the title of a journal or encyclopedia. But it doesn't indicate which volumes of that title are available or when the parts were added. For example, the catalog won't tell you that during May 2012 we added years 1894-1897 to the already existing 1898-1940 of the journal Pedagogisk tidskrift.
What would it take to make a better catalog, that immediately could provide the complete and up-to-date information of what we have? Ultimately, it needs to list not only titles and volumes, but also the contents of volumes. For a journal, it's the individual articles that are interesting. Could we present this as an API for developers, or as open bibliographic data for download? In what format and by what parameters?
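As a starting point for discussion, here is one possible shape for such data, sketched in Python. Nothing like this exists yet; the class names, field names, and the use of JSON are all assumptions made for illustration. The example data is taken from this page: the Pedagogisk tidskrift volumes added in May 2012, and the 1940 article in Biblioteksbladet mentioned above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class Article:
    title: str
    page: int
    issue: Optional[int] = None

@dataclass
class Volume:
    year: int
    added: Optional[str] = None            # e.g. "2012-05"; None if unknown
    articles: List[Article] = field(default_factory=list)

@dataclass
class Title:
    name: str
    key: Optional[str] = None              # short URL name, if known
    volumes: List[Volume] = field(default_factory=list)

    def volumes_added_in(self, month: str) -> List[int]:
        """Years of the volumes that were added during the given month."""
        return [v.year for v in self.volumes
                if v.added is not None and v.added.startswith(month)]

# Pedagogisk tidskrift: years 1894-1897 were added in May 2012.
pedagogisk = Title(
    name="Pedagogisk tidskrift",
    volumes=[Volume(year=y, added="2012-05") for y in range(1894, 1898)]
            + [Volume(year=1898)],         # added earlier, date not recorded here
)

# Biblioteksbladet 1940, with one indexed article.
biblblad = Title(
    name="Biblioteksbladet",
    key="biblblad",
    volumes=[Volume(year=1940, articles=[
        Article(title="Tryckta katalogkort å tidskriftsuppsatser", page=46, issue=2),
    ])],
)

print(pedagogisk.volumes_added_in("2012-05"))
as_json = json.dumps(asdict(biblblad), ensure_ascii=False, indent=2)
print(as_json)
```

A nested structure like this answers both questions raised above: which volumes of a title exist and when they were added, and which articles each volume contains. Whether it should be served through an API or offered as a bulk download is left open.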
We could use some help with OCR
As more people are helping out with scanning books, we also need more volunteers to run OCR. You are welcome to run any OCR software you prefer. If you ask for our recommendation, we'll say FineReader Professional, which unfortunately only runs under Microsoft Windows. If you have a good computer that runs Windows and some time to devote, we'll guide you through the process. Send us an e-mail.
Calling out to Norway and Finland
The vast majority of our titles are in Swedish or Danish. Literature in Norwegian and Finnish is still uncharted territory (not to mention Icelandic or Sami). Which titles would be most useful to bring online? Please give us a hint. Perhaps you can even help us to scan them?