Design Decisions of Project Runeberg
by Lars Aronsson
There are many ways to do what Project Runeberg does. In this
section, we try to explain the decisions that we have made, together
with the reasons for them. Had we decided otherwise, Project
Runeberg might have been quite different. We hope that this
section will help you understand Project Runeberg, while also serving
as a checklist for your own publishing project.
If there are decisions that we seem to have taken, but that are not
documented here, we want you to write to us. The reason
might be that we forgot to document these decisions, or that we just
did not know of any alternative to the one we chose.
The decisions are listed in no strict order, but where possible
they have been ordered by date. New decisions are added at the end of
this list. If you wish to quote from or reference this text, use the
URL and date found at the very bottom of this HTML page.
- Why did Project Runeberg start? (1992)
- In 1992, we wanted to demonstrate the potential of the Internet
not only for technical purposes, but also in the humanities and other
activities such as government and business. Applications such as MUD
and IRC had already developed, but information publishing
systems such as Gopher and the World Wide Web had a hard time
bootstrapping because there was too little to read to attract readers
and too few readers to attract publishers. At this time, we started to
publish Nordic literature, assuming that it would eventually help
attract readers who could relate to this material, and it did.
- Why such an abstract name? (1992)
- "Everyone will be world-famous for fifteen minutes." -- Andy
Warhol (1928-1987). In this kind of future, it is important for any
service provider to use a clear name and trademark, giving the
audience a small hook, within its minimal attention span, on
which positive impressions can accumulate over time. The field of
computers and the Internet is ever changing, and any name
that makes too specific a claim as to the nature of our activities
(electronic text archive, digital library, ...) might force us to
change the name often or live with inconsistency. The name Project
Runeberg tells you that this is a project, but does not tell you
anything about what it does. While the name might lead you to think
about the 19th century poet Johan Ludvig Runeberg, the Nordic
countries (mostly Finland and maybe Sweden), and perhaps Project
Gutenberg, there is nothing of a promise in the name. We considered
this to be important at the outset of our project.
- Why do this for free? (1992)
- Traditional publishing has per-copy costs for paper and
distribution that cannot be reduced below a certain point. This is not
true for online electronic publishing, where the initial setup costs
can be divided among a practically unlimited number of users. We are convinced
that with time, computers and communications will be cheaper, faster,
and provide more capacity. Nobody knows how large an audience an
online publishing project might get, and we believe that charging for
services per user, per chapter, or per hour would be the strongest
limit on the use of Project Runeberg. Charging also requires the
development of some technology that is not yet there, and the academic
environment in which we exist is not built for activities generating
their own revenue. For example, the Swedish University Network has
rules against commercial use.
Having ruled out charging for our services, we still have the
option of getting funding from various sources. Applying for funds is
a time-consuming activity, however, and as we have better things to
do, we have chosen to base Project Runeberg on voluntary spare-time
work. This principle actually scales very well with the Internet, as
perhaps most clearly shown by the Linux operating system development
project. Project Runeberg has received a limited amount of funding
from Linköping University, and LYSATOR has repeatedly received donations from industry and from
departments of the university.
- Why literature? (1992)
- Our main inspiration at the time was Project Gutenberg, which
already published free electronic editions of literature on the
Internet, mostly English language literature. Literature is easy to
publish online, because it only requires text files, which you can
manage with a keyboard and text terminal. Images and other data
require special hardware, higher network bandwidth, and data formats
that might not yet be standardized. We were also familiar with free
software distribution projects, such as Linux and the Free Software
Foundation's GNU project. But our goal was to demonstrate the
usefulness of the Internet in non-technical areas.
- Why Nordic literature? (1992--1994)
- We are based in Sweden, so Swedish is our own language, whereas
the information available on the Internet was and is dominated by the
English language. To be able to demonstrate the usefulness of the
Internet to people close to us, we wanted to present materials in our
own language. At the same time, Sweden is a rather small country
with only 9 million inhabitants, and the languages of the other
Scandinavian countries (Danish, Norwegian) are easily understandable.
At little extra effort, we also included Icelandic, and thereby
included in our scope the old Icelandic sagas from the Viking Age. It
was only later (1993 or 1994), however, that we added the Finnish
language to the definition of Project Runeberg. Together with it came
the closely related languages of Sami and Estonian. These languages
are not similar to the others, and are not spoken by any of our
editors. By including literature in these languages, the scope of
Project Runeberg was broadened from Scandinavian to Nordic. We have
been able to resist the temptation to include German and Russian
literature, even though this has been suggested. These cultural areas
have later developed their own successful publishing projects.
- Why mostly old literature? (1992)
- As our goal is to provide electronic editions free of charge, we
will not be able to pay royalties to authors or copyright owners.
This forces us to use material that we are able to publish without
such restrictions. Copyright protection
is granted by law during the entire lifetime of the author and then
for 70 more years (life+70), but after that, works of literature and
art enter the public domain, which means Project Runeberg and others
can freely publish them. This is the main reason why most of what
Project Runeberg publishes is quite old literature.
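The life+70 rule above is simple arithmetic, and the following minimal
sketch shows one way to express it. It is an illustration, not legal
advice. It assumes (this is not stated above) that protection runs to
the end of the calendar year in which the 70th anniversary of the
author's death falls, and the function names are hypothetical.

```python
from datetime import date

TERM_YEARS = 70  # "life+70", as described in the text


def first_public_domain_year(death_year: int) -> int:
    """Return the first calendar year in which the works are free to publish.

    Assumption: protection lasts through 31 December of death_year + 70,
    so the works become free on 1 January of the following year.
    """
    return death_year + TERM_YEARS + 1


def is_public_domain(death_year: int, today: date) -> bool:
    """True if an author who died in death_year is out of copyright on `today`."""
    return today.year >= first_public_domain_year(death_year)
```

Under these assumptions, the works of an author who died in 1940 would
become free to publish on 1 January 2011.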
- Why no English translations? (1993--)
- Project Runeberg has defined Nordic literature to be any
literature that is available in the Nordic languages, either
originally written here, or translated from other languages. Perhaps
the most significant example of the latter is the Bible, originally
written in Hebrew, Aramaic, and Greek. We have often been asked to
publish English (and perhaps also German and French) translations of
literature written by Nordic authors. So far, we have resisted doing
this, because such translations can be published by English language
servers and we want to prioritize material in our own languages, but
the idea is still tempting. We might open up to English translations
of Nordic literature in the future.
- Why is some information in English? (1992)
- The working language of Project Runeberg is English. This means
information about the project is provided in English first, and you
can always write to us in English. Even though most people in Norway,
Denmark, and Finland understand Swedish, and we in Sweden understand
Norwegian and Danish, there are subtleties in the differences between
these languages that may cause dangerous misunderstandings, and we
also don't want to exclude anybody from using Project Runeberg just
because they don't speak our language. We believe that at this time,
English serves the role of the single international language better
than Latin and Esperanto. In some parts of Project Runeberg,
information is provided in Swedish only or Finnish only, which is
unfortunate. We hope to improve this in the future, so that all
information is available in English as well as at least one Nordic
language.
- Why Latin-1? (1992)
- From the start, Project Runeberg has used the standardized
character set ISO 8859-1, also known as Latin-1, containing the
characters most widely used in northern and western Europe. This is
the default character set in most UNIX dialects, in the X Window
System, in Microsoft Windows (3.0, 95, and NT), and on the Apple
Macintosh when sold in Iceland. This character set allows us to
represent all characters used in the languages of the literature that
we publish, except for Sami (i.e. Lappish, which requires Latin-6) and
rune inscriptions (for which we know of no standardized character
set). Users of Apple Macintosh not sold in Iceland, users of early
versions of Hewlett-Packard UNIX (HP-UX), and users of Microsoft DOS
without Windows should set their web browsers to display Latin-1
(Western Europe) as the default encoding. The alternative would be
for Project Runeberg to present parallel editions in each character
set and text encoding that users might prefer, but this always leaves
some readers dissatisfied because we forgot to support their
system. Instead, we want to endorse a single standard where possible.
In the Hypertext Markup Language (HTML), Latin-1 is also the default
encoding specified by the approved Internet Standard, so there is
absolutely no need for us to publish web pages containing ugly codes
like &auml; and &thorn; when we have keyboards that allow us
to type ä and þ directly. In the long run, we believe that the World
Wide Web will convert to Unicode (ISO 10646), but for the time being,
Latin-1 does the job for us.
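The limits of Latin-1 described above can be checked mechanically. The
sketch below uses only Python's standard codecs and entity tables. The
helper names are hypothetical, and the letter ŋ is chosen here as one
example of a Sami character that Latin-1 lacks.

```python
from html.entities import codepoint2name


def fits_latin1(text: str) -> bool:
    """True if every character of `text` exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def entity_for(char: str) -> str:
    """Return the named HTML entity for one character, e.g. 'ä' -> '&auml;'."""
    return "&" + codepoint2name[ord(char)] + ";"


print(fits_latin1("spärrad stil"))   # Swedish text fits in Latin-1
print(fits_latin1("ŋ"))              # this Sami letter does not
print("ŋ".encode("iso8859_10"))      # Latin-6 stores it in a single byte
print(entity_for("þ"))               # the "ugly code" that Latin-1 avoids
```

Typing ä directly in a Latin-1 file costs one byte, while the
equivalent entity &auml; costs six characters and is harder to read in
the source.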
- Why HTML and not just plain text? (1994)
- Project Runeberg started out using Gopher technology for Internet
publishing, allowing only plain text format files, in combination with
FTP. LYSATOR was very early to adopt World Wide Web (WWW) technology
in 1993, including the Hypertext Transfer Protocol (HTTP) and the
Hypertext Markup Language (HTML), supposedly being the 14th web site
worldwide to register with CERN, and Project Runeberg decided in 1994
to use this as its primary or only medium. HTTP has turned out to be
the dominant transport mechanism for accessing remotely published
material, far more common than FTP or Gopher. While HTTP does not
rule out the use of plain text format, HTML provides some nice text
formatting options (among them bold face and italics)
that are very useful when electronically republishing literature that
was previously published on paper. Unfortunately, HTML does not
provide any way to mark up words with
i n t e r l e a v e d s p a c i n g (spärrad stil), a typographic
feature that was formerly in common use instead of bold face in
Swedish imprints. When our copy uses interleaved spacing, we indicate
this in our source files, and convert to bold face representation in
the HTML version that you see.
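The conversion from interleaved spacing to bold face could look
roughly like the sketch below. How our source files actually mark such
words is not described here, so the sketch makes a stand-in
assumption: a spaced-out word appears literally in the input as three
or more single characters separated by single spaces. The function
name is hypothetical.

```python
import re

# Assumption: a spaced-out word is three or more single characters
# separated by single spaces, not directly adjoining other word characters.
SPACED_WORD = re.compile(r"(?<!\w)(?:\w ){2,}\w(?!\w)")


def spaced_to_bold(line: str) -> str:
    """Replace each spaced-out word in `line` with a <b>...</b> element."""
    def collapse(match: re.Match) -> str:
        # Drop the interleaved spaces and wrap the word in bold tags.
        return "<b>" + match.group(0).replace(" ", "") + "</b>"
    return SPACED_WORD.sub(collapse, line)


print(spaced_to_bold("ett s p ä r r a t ord"))
```

This heuristic cannot tell where one spaced-out word ends and the next
begins when two of them are adjacent, which is one reason to prefer an
explicit marker in the source files.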
- Why not SGML and the TEI DTD? (1993)
- Already in 1993 we were confronted with the possibility of making our
texts available in the Standard Generalized Markup Language (SGML)
format using the document type definition (DTD) of the Text Encoding
Initiative (TEI). This is supposedly the politically correct format
for scholarly electronic editions. However, such requests have only
come from people wanting us to do the work, and not from anybody
wanting to volunteer. The vast majority of Project Runeberg's readers
are happy to be able to access our editions in the HTML subset that we
have used since 1994, and the Gopher plain text format before that.
- Why the open URL policy? (1994)
- Project Runeberg wants to encourage external WWW content providers
to create hypertext links to any part of our material. That is to say
we want Project Runeberg to be a truly integrated part of the World
Wide Web, which we view as one single, global hypertext. To
facilitate this, we have adopted and openly published a policy for the
Uniform Resource Locators (URLs) or addresses of our web pages. All
material that we publish has a URL that should be stable over a long
time (whatever that means), easy to write down and quote, and contain
no irrelevant data. The main URL for the project (http://www.lysator.liu.se/runeberg/) is the base of the URLs
to all our HTML pages. The base URL clearly shows our organizational
affiliation, which is an indication of our belief that both Linköping
University and LYSATOR will continue to exist for a long time. Added
to this base URL, each electronic edition that we publish uses a code
name which is also the name of a subdirectory in the file system on
our server. These code names consist of at most eight lower case
letters from the English alphabet (a-z) or decimal digits (0-9). Each
directory then contains a flat level of files having file name
suffixes that indicate the type of the file. In particular, there
must not be any strange characters that are hard to reproduce (such as
tilde, dashes, and whitespace) in Project Runeberg URLs.
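The code name rule above is strict enough to check mechanically. The
following is a minimal sketch, not our actual tooling: the function
names, the trailing slash on directory URLs, and the sample code name
`example1` are assumptions made for illustration.

```python
import re

BASE_URL = "http://www.lysator.liu.se/runeberg/"  # base URL from the text

# At most eight characters, each a lower-case a-z or a digit 0-9.
CODE_NAME = re.compile(r"[a-z0-9]{1,8}")


def is_valid_code_name(name: str) -> bool:
    """Check a code name against the rule stated in the policy."""
    return CODE_NAME.fullmatch(name) is not None


def edition_url(code_name: str, file_name: str = "") -> str:
    """Build the URL of an edition directory, or of one file inside it.

    The file name is passed through unchecked; the policy above only
    says that its suffix indicates the file type.
    """
    if not is_valid_code_name(code_name):
        raise ValueError(f"not a valid code name: {code_name!r}")
    return f"{BASE_URL}{code_name}/{file_name}"


print(edition_url("example1"))        # hypothetical code name
print(is_valid_code_name("my-book"))  # dashes are not allowed
```

Rejecting bad names when an edition is created is cheaper than
discovering later that a published URL is hard to quote.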
- Why no PURL or DOI?
- With our open URL policy, we will have a hard time altering any of
our URLs, let alone the base URL of our project. Taking into
account that we devised this policy with longevity in mind, it is
highly unlikely that we should need any other means to extend the life
of references to our project and the material we publish. Thus, for
our part, we don't see any use for systems such as Persistent Uniform
Resource Locators (PURL) or Digital Object Identifiers (DOI). We
also believe that organizations that are careless enough to alter the
URLs of published documents are unlikely to care to register PURLs
for their documents, so the very idea of PURLs is still unclear to us.
- Why simple files and not some database?
- In the beginning, Project Runeberg contained only a few editions,
and there was no problem maintaining these as simple text files. Over
time, as the collections have grown, this choice is a natural
candidate for reconsideration. However, the application of some
database management system must be made with care. Access through
Common Gateway Interface (CGI) scripts, for instance, might violate
the open URL policy, and should be avoided as long as possible.
Currently (1998), Project Runeberg has published a little more than
200 electronic editions, and there is still no problem handling them
using ordinary files.
- Why not HTML frames? (1995)
- While HTML frames provide a convenient user interface, allowing
some information to be always present at the edge of the page while
letting the main text scroll up and down, the way these capabilities
are integrated with HTML does not allow external hypertext linkage to
a complete frameset with specified content. Therefore, using frames
would violate Project Runeberg's open URL policy. Just as important,
Project Runeberg was using HTML before frames were introduced. Instead
of using frames, each of Project Runeberg's HTML pages contains
project and edition metadata in the visible page header. We have
developed our own set of automatic scripts and programs that add this
information in a way that is consistent, safe, and saves us from
manually editing the page header of each HTML file.
- Why publish scan images of text? (1996)
- When republishing a text in electronic text format--whether in
plain text, HTML or any other encoding--that has previously been
published on paper, it is necessary to use a keyboard or a scanner and
Optical Character Recognition (OCR). In either case, there is a great
risk that errors will be introduced in this conversion process. No
matter how much effort is put into locating and correcting these
errors, this risk cannot be fully eliminated. The only way to
guarantee the correct transmission into electronic format of the
printed text is to provide scan images without OCR. Dirt and damage
on the printed page or on the glass of a flatbed scanner will of
course distort the image, but printed letters that can be interpreted
by the human eye on paper should be possible to interpret by the human
eye from the scan image. The first time that we learned that scan
images were used as a replacement for text recognition was in an
industrial project where scan images of postal addresses were copied
from lottery tickets to the envelopes sent out to the winners. We
also know that a similar process is used by the Danish National
Archives when digitizing old church records. The first time we saw
this being used for Internet publishing of scanned literature was in
the Making of America project presented by the University of
Michigan and Cornell University.
- Why not use scan images only? (1996)
- Publishing scan images of text pages provides superior
text transmission reliability and saves the cost of manual labor
to proofread the text version. The parallel publishing of a text
version is still motivated by the ability to search the text, the
limited bandwidth needed to transfer the text, the ability of the
reader to adjust the font size of the displayed text, and the
ability of blind people to read the text using special electronic
devices. The aforementioned Making of America project in 1997 used
raw OCR in combination with fuzzy word search to make the scanned text
pages searchable without spending money on proofreading. While this
approach is sufficient for presenting a reliable image readable by
the human eye as well as for making the digital library searchable, it
does hide the contents from the blind community and also from public
search engines such as Altavista and Hotbot.
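To illustrate what fuzzy word search over raw OCR means, here is a
minimal sketch. It does not reproduce the Making of America
implementation, which we have not seen; it substitutes the similarity
matching in Python's standard difflib module. The function name, the
cutoff value, and the sample OCR output are assumptions.

```python
from difflib import get_close_matches


def fuzzy_find(query: str, ocr_words: list[str], cutoff: float = 0.8) -> list[str]:
    """Return OCR words that resemble `query`, best match first."""
    lowered = [word.lower() for word in ocr_words]
    return get_close_matches(query.lower(), lowered, n=5, cutoff=cutoff)


# Hypothetical raw OCR output in which an "e" was misread as "c".
page = ["Johan", "Ludvig", "Runcberg", "poet"]
print(fuzzy_find("Runeberg", page))
```

Here the search for "Runeberg" still finds the misread word
"runcberg", so the page is retrievable even though nobody has
proofread it. A reader using a screen reader would still be given the
uncorrected text, which is the drawback noted above.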
- Why not always use the best print copy? (1995--1997)
- In 1995 or 1996, the opinion was first raised that the
electronic editions published by Project Runeberg were of insufficient
quality for scholarly use because low-cost print copies that were not
first editions had been used. Criticism of this kind was first published, we
believe, in Johan Svedjedal, Almqvist på
Internet. Om publicering av en textkritisk edition som digital
hypertext, Tidskrift för litteraturvetenskap, 26(1997):2,
pp. 60-74, ISSN 1104-0556, which serves as a foreword to the
electronic edition of Svenska Vitterhetssamfundet's text-critical
edition of the collected works of Carl Jonas Love Almqvist. While
this is true in many cases, and we are well aware of this problem, it
is not true for all of Project Runeberg's editions. The reason is
that each edition is created by volunteers who may be more or less
careful with these details. Our focus has not been primarily to serve
the scholarly community, but to provide examples of Nordic literature
that most common readers can relate to. This includes many school
teachers in primary and secondary schools, even some on the college
and undergraduate level. As we learn more about online republishing
of print material, the requirements on the quality of our output will
inevitably increase, but the risk of sometimes failing must never stop
us from trying. It is unlikely, however, that the editors of Project
Runeberg would ever create text-critical editions that meet scholarly
editing standards. In order for such editions to appear within our
project, someone else would have to volunteer the scholarly editing
skills, in which case our editors would be glad to provide their
experience in online publishing for the best possible combined effort.
- Why not provide access for Z39.50 gateways? (1997)
- to be considered...
- Why not use Dublin Core metadata? (1997)
- to be considered...
Project Runeberg, Thu Dec 20 03:34:57 2012
(aronsson)
https://runeberg.org/admin/decisions.html