Design Decisions of Project Runeberg

by Lars Aronsson

There are many ways to do what Project Runeberg does. In this section, we try to explain the decisions that we have made, together with the reasons why. Had we made other decisions, Project Runeberg might have been quite different. We hope that this section helps you understand Project Runeberg, while also serving as a checklist for your own publishing project.

If there are decisions that we seem to have made but that are not documented here, please write to us. The reason might be that we forgot to document these decisions, or that we simply did not know of any alternative to the one we chose.

The decisions are listed in no particular order, but where possible they have been ordered by date. New decisions are added at the end of this list. If you wish to quote from or reference this text, use the URL and date found at the very bottom of this HTML page.

Why did Project Runeberg start? (1992): In 1992, we wanted to demonstrate the potential of the Internet not only for technical purposes, but also in the humanities and other activities such as government and business. We had seen applications such as MUD and IRC develop, but information publishing systems such as Gopher and the World Wide Web had a hard time bootstrapping, because there was too little to read to attract readers and too few readers to attract publishers. At this time, we started to publish Nordic literature, assuming that it would eventually help attract readers that could relate to this material, and it did.
Why such an abstract name? (1992): "Everyone will be world-famous for fifteen minutes." -- Andy Warhol (1928-1987). In this kind of future, it is important for any service provider to use a clear name and trademark, providing it as a small hook in the minimal attention span of the average reader, upon which she can accumulate positive impressions over time. The field of computers and the Internet is ever changing, and any name that makes too specific a claim as to the nature of our activities (electronic text archive, digital library, ...) might force us to change the name often or live with inconsistency. The name Project Runeberg tells you that this is a project, but does not tell you anything about what it does. While the name might lead you to think about the 19th century poet Johan Ludvig Runeberg, the Nordic countries (mostly Finland and maybe Sweden), and perhaps Project Gutenberg, there is nothing of a promise in the name. We considered this to be important at the outset of our project.
Why do this for free? (1992): Traditional publishing has costs per copy, for paper and distribution, that cannot be reduced below a certain point. This is not true for online electronic publishing, where the initial setup costs can be divided among an infinite number of users. We are convinced that with time, computers and communications will be cheaper, faster, and provide more capacity. Nobody knows how large an audience an online publishing project might get, and we believe that charging for services per user, per chapter, or per hour would be the strongest limit on the use of Project Runeberg. Charging also requires the development of some technology that is not yet there, and the academic environment in which we exist is not built for activities generating their own revenue. For example, the Swedish University Network has rules against commercial use.
Having ruled out charging for our services, we still have the option of getting funding from various sources. Applying for funds is a time-consuming activity, however, and as we have better things to do, we have chosen to base Project Runeberg on voluntary spare-time work. This principle actually scales very well with the Internet, as perhaps most clearly shown by the Linux operating system development project. Project Runeberg has received a limited amount of funding from Linköping University, and LYSATOR has repeatedly received donations from industry and from departments of the university.
Why literature? (1992): Our main inspiration at the time was Project Gutenberg, which already published free electronic editions of literature on the Internet, mostly English-language literature. Literature is easy to publish online, because it only requires text files, which you can manage with a keyboard and text terminal. Images and other data require special hardware, higher network bandwidth, and data formats that might not yet be standardized. We were also familiar with free software distribution projects, such as Linux and the Free Software Foundation's GNU project. But our goal was to demonstrate the usefulness of the Internet in non-technical areas.
Why Nordic literature? (1992--1994): We are based in Sweden, so Swedish is our own language, whereas the information available on the Internet was and is dominated by the English language. To be able to demonstrate the usefulness of the Internet to people close to us, we wanted to present materials in our own language. At the same time, Sweden is a rather small country, with only 9 million inhabitants, and the languages of the other Scandinavian countries (Danish, Norwegian) are easily understandable. At little extra effort, we also included Icelandic, and thereby included in our scope the old Icelandic sagas from the Viking Age. It was only later (1993 or 1994), however, that we added the Finnish language to the definition of Project Runeberg. Together with it came the closely related languages of Sami and Estonian. These languages are not similar to the others, and are not spoken by any of our editors. By including literature in these languages, the scope of Project Runeberg was broadened from Scandinavian to Nordic. We have been able to resist the temptation to include German and Russian literature, even though this has been suggested. These cultural areas have later developed their own successful publishing projects.
Why mostly old literature? (1992): As our goal is to provide electronic editions free of charge, we will not be able to pay royalties to authors or copyright owners. This forces us to use material that we are able to publish without such restrictions. Copyright protection is granted by law during the entire lifetime of the author and then for 70 more years (life+70), but after that, works of literature and art enter the public domain, which means Project Runeberg and others can freely publish them. This is the main reason why most of what Project Runeberg outputs is quite old literature.
Why no English translations? (1993--): Project Runeberg has defined Nordic literature to be any literature that is available in the Nordic languages, either originally written here, or translated from other languages. Perhaps the most significant example of the latter is the Bible, originally written in Hebrew, Aramaic, and Greek. We have often been asked to publish English (and perhaps also German and French) translations of literature written by Nordic authors. So far, we have resisted doing this, because such translations can be published by English language servers and we want to prioritize material in our own languages, but the idea is still tempting. We might open up to English translations of Nordic literature in the future.
Why is some information in English? (1992): The working language of Project Runeberg is English. This means information about the project is provided in English first, and you can always write to us in English. Even though most people in Norway, Denmark, and Finland understand Swedish, and we in Sweden understand Norwegian and Danish, there are subtleties in the differences between these languages that may cause dangerous misunderstandings, and we also don't want to exclude anybody from using Project Runeberg just because they don't speak our language. We believe that at this time, English serves the role of the single international language better than Latin and Esperanto. In some parts of Project Runeberg, information is provided in Swedish only or Finnish only, which is unfortunate. We hope to improve this in the future, so that all information is available in English as well as at least one Nordic language.
Why Latin-1? (1992): From the start, Project Runeberg has used the standardized character set ISO 8859-1, also known as Latin-1, containing the characters most widely used in northern and western Europe. This is the default character set in most UNIX dialects, in the X Window System, in Microsoft Windows (3.0, 95, and NT), and on the Apple Macintosh when sold in Iceland. This character set allows us to represent all characters used in the languages of the literature that we publish, except for Sami (i.e. Lappish, which requires Latin-6) and rune inscriptions (for which we know of no standardized character set). Users of Apple Macintosh not sold in Iceland, users of early versions of Hewlett-Packard UNIX (HP-UX), and users of Microsoft DOS without Windows should set their web browsers to display Latin-1 (Western Europe) as the default encoding. The alternative would be for Project Runeberg to present parallel editions in each character set and text encoding that users might prefer, but this always risks leaving some readers dissatisfied because we forgot to support their system. Instead, we want to endorse a single standard where possible. In the Hypertext Markup Language (HTML), Latin-1 is also the default encoding specified by the approved Internet Standard, so there is absolutely no need for us to publish web pages containing ugly codes like &auml; and &thorn; when we have keyboards that allow us to type ä and þ directly. In the long run, we believe that the World Wide Web will convert to Unicode (ISO 10646), but for the time being, Latin-1 does the job for us.
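As a minimal sketch of what the Latin-1 choice means in practice, the following Python snippet checks whether a text can be represented in ISO 8859-1. The function name and the sample strings are illustrative assumptions and are not part of Project Runeberg's own tooling.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1 (Latin-1)."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Swedish and Icelandic letters are all part of Latin-1.
print(fits_latin1("Fänrik Ståls sägner"))  # True
print(fits_latin1("þ ð æ ø"))              # True

# The Sami letter eng is not part of Latin-1.
print(fits_latin1("ŋ"))                    # False
```

A check of this kind could be run over a text before publishing, to find out early whether it needs characters outside the chosen character set.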
Why HTML and not just plain text? (1994): Project Runeberg started out using Gopher technology for Internet publishing, allowing only plain text format files, in combination with FTP. LYSATOR was very early to adopt World Wide Web (WWW) technology in 1993, including the Hypertext Transfer Protocol (HTTP) and the Hypertext Markup Language (HTML), supposedly being the 14th web site worldwide to register with CERN, and Project Runeberg decided in 1994 to use this as its primary or only medium. HTTP has turned out to be the dominating transport mechanism for accessing remotely published material, far more common than FTP or Gopher. While HTTP does not rule out the use of plain text format, HTML provides some nice text formatting options (among them bold face and italics) that are very useful in the electronic republishing of literature that has previously been published on paper. Unfortunately, HTML does not provide any way to mark up words with i n t e r l e a v e d s p a c i n g (spärrad stil), a typographic feature which has earlier been commonly used instead of bold face in Swedish imprints. When our copy uses interleaved spacing, we indicate this in our source files, and convert to bold face representation in the HTML version that you see.
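The conversion from interleaved spacing to bold face can be sketched as follows. This is only an illustration under stated assumptions: the text above does not say how interleaved spacing is marked in Project Runeberg's source files, so this sketch assumes that the spaced-out letters appear literally in the input, and the names SPACED_RUN and spaced_to_bold are hypothetical.

```python
import re

# Assumption: a spaced-out word is a run of three or more single letters,
# each separated by exactly one space. [^\W\d_] matches any letter
# (a word character that is neither a digit nor an underscore).
SPACED_RUN = re.compile(r"(?<!\S)(?:[^\W\d_] ){2,}[^\W\d_](?!\S)")


def spaced_to_bold(text: str) -> str:
    """Replace each spaced-out run of letters with the joined word in <b> tags."""
    return SPACED_RUN.sub(
        lambda m: "<b>" + m.group(0).replace(" ", "") + "</b>", text
    )


print(spaced_to_bold("Detta är s p ä r r a d stil."))
# prints: Detta är <b>spärrad</b> stil.
```

A pattern like this cannot tell where one spaced-out word ends and the next begins, and it would absorb an adjacent one-letter word, so an explicit marker in the source files is likely the more robust choice.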
Why not SGML and the TEI DTD? (1993): Already in 1993 we were confronted with the possibility of making our texts available in the Standard Generalized Markup Language (SGML) format using the document type definition (DTD) of the Text Encoding Initiative (TEI). This is supposedly the politically correct format for scholarly electronic editions. However, such requests have only come from people wanting us to do the work, and not from anybody wanting to volunteer. The vast majority of Project Runeberg's readers are happy to be able to access our editions in the HTML subset that we have used since 1994, and the Gopher plain text format before that.
Why the open URL policy? (1994): Project Runeberg wants to encourage external WWW content providers to create hypertext links to any part of our material. That is to say we want Project Runeberg to be a truly integrated part of the World Wide Web, which we view as one single, global hypertext. To facilitate this, we have adopted and openly published a policy for the Uniform Resource Locators (URLs) or addresses of our web pages. All material that we publish has a URL that should be stable over a long time (whatever that means), easy to write down and quote, and contain no irrelevant data. The main URL for the project (http://www.lysator.liu.se/runeberg/) is the base of the URLs to all our HTML pages. The base URL clearly shows our organizational affiliation, which is an indication of our belief that both Linköping University and LYSATOR will continue to exist for a long time. Added to this base URL, each electronic edition that we publish uses a code name which is also the name of a subdirectory in the file system on our server. These code names consist of at most eight lower case letters from the English alphabet (a-z) or decimal digits (0-9). Each directory then contains a flat level of files having file name suffixes that indicate the type of the file. In particular, there must not be any strange characters that are hard to reproduce (such as tilde, dashes, and whitespace) in Project Runeberg URLs.
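The code name rule of the open URL policy can be sketched as a small validation routine. The base URL and the character rule are taken from the policy text; the function names, the trailing slash on the edition URL, the minimum length of one character, and the code name "example1" are assumptions made for illustration only.

```python
import re

# Base URL as stated in the policy.
BASE_URL = "http://www.lysator.liu.se/runeberg/"

# At most eight characters, each a lower case English letter or a decimal digit.
# A minimum length of one character is an assumption.
CODE_NAME = re.compile(r"[a-z0-9]{1,8}")


def is_valid_code_name(name: str) -> bool:
    """Return True if name follows the code name rule."""
    return CODE_NAME.fullmatch(name) is not None


def edition_url(name: str) -> str:
    """Build the URL of an edition directory from its code name."""
    if not is_valid_code_name(name):
        raise ValueError(f"not a valid code name: {name!r}")
    return BASE_URL + name + "/"


print(edition_url("example1"))
# prints: http://www.lysator.liu.se/runeberg/example1/

for bad in ("my-book", "Saga", "toolongname", "~user"):
    print(bad, is_valid_code_name(bad))  # each prints False
```

Rejecting upper case letters, dashes, tildes, and whitespace at this point keeps every URL easy to write down and quote, which is the stated aim of the policy.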
Why no PURL or DOI?: With our open URL policy, we will have a hard time altering any of our URLs, and even more so the base URL of our project. Taking into account that we devised this policy with longevity in mind, it is highly unlikely that we should need any other means to extend the life of references to our project and the material we publish. Thus, for our part, we don't see any use for systems such as Persistent Uniform Resource Locators (PURL) or Digital Object Identifiers (DOI). We also believe that organizations that are careless enough to alter the URLs of published documents are unlikely to care to register PURLs for their documents, so the very point of PURLs remains unclear to us.
Why simple files and not some database?: In the beginning, Project Runeberg contained only a few editions, and there was no problem maintaining these as simple text files. Over time, as the collections have grown, this choice has become a natural candidate for reconsideration. However, any database management system must be introduced with care. Access through Common Gateway Interface (CGI) scripts, for instance, might violate the open URL policy, and should be avoided as long as possible. Currently (1998), Project Runeberg has published a little more than 200 electronic editions, and there is still no problem handling them using ordinary files.
Why not HTML frames? (1995): While HTML frames provide a convenient user interface, allowing some information to be always present at the edge of the page while letting the main text scroll up and down, the way these capabilities are integrated with HTML does not allow external hypertext linkage to a complete frameset with specified content. Therefore, using frames would violate Project Runeberg's open URL policy. Also important, Project Runeberg had used HTML since before frames were introduced. Instead of using frames, each of Project Runeberg's HTML pages contains project and edition metadata in the visible page header. We have developed our own set of automatic scripts and programs that add this information in a way that is consistent, safe, and saves us from manually editing the page header of each HTML file.
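The idea of adding a visible metadata header to every page by script, instead of using frames, can be sketched as below. This is not Project Runeberg's actual script: the function name render_page, the header layout, and the sample titles are all hypothetical and serve only to show the principle.

```python
from html import escape


def render_page(edition: str, page: str, body_html: str) -> str:
    """Wrap already formatted body HTML in a page with a visible metadata header."""
    # Hypothetical header layout: project name, edition title, page title.
    header = (
        f"<p><b>Project Runeberg</b> / {escape(edition)} / {escape(page)}</p>\n"
        "<hr>\n"
    )
    return (
        "<html>\n"
        f"<head><title>{escape(page)} - {escape(edition)}</title></head>\n"
        "<body>\n"
        f"{header}{body_html}\n"
        "</body>\n"
        "</html>\n"
    )


html_page = render_page("Example Edition", "Chapter 1", "<p>Text of the chapter.</p>")
print(html_page)
```

Because the header is part of each page, every page remains a complete document behind its own URL, and the header stays consistent without anyone editing it by hand.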
Why publish scan images of text? (1996): When republishing a text in electronic text format--whether in plain text, HTML or any other encoding--that has previously been published on paper, it is necessary to use a keyboard or a scanner and Optical Character Recognition (OCR). In either case, there is a great risk that errors will be introduced in this conversion process. No matter how much effort is put into locating and correcting these errors, this risk cannot be fully eliminated. The only way to guarantee the correct transmission into electronic format of the printed text is to provide scan images without OCR. Dirt and damage on the printed page or on the glass of a flatbed scanner will of course distort the image, but printed letters that can be interpreted by the human eye on paper should be possible to interpret by the human eye from the scan image. The first time that we learned that scan images were used as a replacement for text recognition was in an industrial project where scan images of postal addresses were copied from lottery tickets to the envelopes sent out to the winners. We also know that a similar process is used by the Danish National Archives when digitizing old church records. The first time we saw this being used for Internet publishing of scanned literature was in the Making of America project presented by the University of Michigan and Cornell University.
Why not use scan images only? (1996): While publishing scan images of text pages provides superior text transmission reliability and saves the cost of manual labor to proofread the text version, the parallel publishing of a text version is still motivated by the ability to search the text, the limited bandwidth needed to transfer the text, the ability of the reader to adjust the font size of the displayed text, and the ability of blind people to read the text using special electronic devices. The aforementioned Making of America project in 1997 used raw OCR in combination with fuzzy word search to make the scanned text pages searchable without spending money on proofreading. While this approach is sufficient for presenting a reliable human eye readable image as well as for making the digital library searchable, it does hide the contents from the blind community and also from public search engines such as Altavista and Hotbot.
Why not always use the best print copy? (1995--1997): In 1995 or 1996 for the first time, opinions were voiced that the electronic editions published by Project Runeberg were of insufficient quality for scholarly use because low cost non-first edition print copies had been used. Criticism of this kind was first published, we believe, in Johan Svedjedal, Almqvist på Internet. Om publicering av en textkritisk edition som digital hypertext, Tidskrift för litteraturvetenskap, 26(1997):2, pp. 60-74, ISSN 1104-0556, which serves as a foreword to the electronic edition of Svenska Vitterhetssamfundet's text-critical edition of the collected works of Carl Jonas Love Almqvist. While this is true in many cases, and we are well aware of this problem, it is not true for all of Project Runeberg's editions. The reason is that each edition is created by volunteers who may be more or less careful with these details. Our focus has not been primarily to serve the scholarly community, but to provide examples of Nordic literature that most common readers can relate to. This includes many school teachers in primary and secondary schools, even some on the college and undergraduate level. As we learn more about online republishing of print material, the requirements on the quality of our output will inevitably increase, but the risk of sometimes failing must never stop us from trying. It is unlikely, however, that the editors of Project Runeberg would ever create text-critical editions that meet scholarly editing standards. In order for such editions to appear within our project, someone else would have to volunteer the scholarly editing skills, in which case our editors would be glad to provide their experience in online publishing for the best possible combined effort.
Why not provide access for Z39.50 gateways? (1997): to be considered...
Why not use Dublin Core metadata? (1997): to be considered...

Project Runeberg, Thu Dec 20 03:34:57 2012 (aronsson) https://runeberg.org/admin/decisions.html