by Lars Aronsson, 5 Dec 1998
updated 11 Dec 1998
During the fall of 1998, Project Runeberg produced its first series of electronic facsimile editions as an experiment, to see whether this is a feasible way of publishing old literature, what its benefits and drawbacks are, what the production cost depends on, and how large that cost is. This text presents some preliminary results of these experiments. A total of four man-months were invested in these experiments, but no goal was set for the output volume.
Since the start in 1992, Project Runeberg has published some 200 electronic editions of Nordic literature. The first were hand-keyed. Later, scanners, OCR and proofreading were used. In either case, only the text (with illustrations) has been published. The electronic facsimile editions, on the other hand, primarily present scanned images of text pages. This new approach is driven by two factors: proofreading is time-consuming and yet inexact, and the cost of handling large volumes of data is falling. Facsimile technology makes proofreading unnecessary, thus saving time and money while achieving near-perfect accuracy. These advantages come at the cost of larger data volumes. A facsimile image file of a text page can be 20 to 50 times larger than the plain text file. Technical development, both in storage (disk space) and transfer (network and modem bandwidth), is rapidly making the handling of these larger volumes affordable.
As of today, 42 of Project Runeberg's electronic editions use facsimile technology. Some are editions that already existed in text, and where the added text page images add accuracy to the existing product. Most are completely new editions, however. Various kinds of books have been digitized, in order to find the "corners of the envelope" and develop a technology that can handle all sorts of books that we want to digitize in the future. The presented editions are documented in the Project Runeberg Timeline.
The production of these electronic facsimile editions consists of three activities: research, digitization, and indexing.
Research is the least structured of these activities. It often starts with acquisition, and then maybe a few years of waiting for the last author to have been dead for 70 years, in order to avoid copyright issues altogether. Much of the necessary research behind Project Runeberg's publishing is presented in our presentations of Nordic Authors. In some cases, the research is trivial or already available from external sources. The cost of research has not been estimated here. However, it would be preferable to digitize books where the cost of research is lowest. Once a useful copy is found, old books typically cost 1 SEK (0.12 USD) per page, or less.
Digitization has costs that scale linearly with the number of pages. Sheet-fed scanners can process 5-30 pages per minute, and a single operator can handle several scanners and OCR processors running in parallel. This is a service that should be outsourced when possible. There seems to be a consensus among archivists and librarians to use 600 dpi bitonal TIFF with fax group 4 compression for most printed matter. Despite the high resolution, the image files do not become very large. For presentation on the web, these bitonal images can easily be scaled down by 1:4, 1:5, or 1:6 to grayscale GIF images. As has been demonstrated by the Making of America project at the University of Michigan, such conversion can successfully be made on demand. Color illustrations must be scanned separately, and the same is true for illustration plates that are larger than the normal page format. The cost of these illustrations has not been estimated here.
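The scale-down from bitonal to grayscale can be sketched in a few lines of Python. This is only an illustration of the idea, not the conversion used by Project Runeberg or by Making of America: it assumes simple block averaging, and the function names `downscale_bitonal` and `effective_dpi` are made up for this example.

```python
def downscale_bitonal(bitmap, factor):
    """Average factor x factor blocks of a bitonal image into gray levels.

    bitmap: list of rows, each a list of 0 (white) or 1 (black) pixels.
    Returns rows of gray values from 0 (black) to 255 (white).
    Partial blocks at the right and bottom edges are dropped.
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor if bitmap else 0
    block_area = factor * factor
    gray = []
    for by in range(height):
        row = []
        for bx in range(width):
            # Count the black pixels inside this block.
            black = sum(
                bitmap[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            # More black pixels give a darker gray value.
            row.append(round(255 * (block_area - black) / block_area))
        gray.append(row)
    return gray


def effective_dpi(scan_dpi, factor):
    """Resolution of the reduced image for a 1:factor scale-down."""
    return scan_dpi / factor
```

With block averaging, a 1:4 reduction of a 600 dpi scan gives a 150 dpi image in which each pixel can take one of 17 gray levels (0 to 16 black pixels per block), which is why a bitonal master can still yield a readable grayscale page on screen.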
Indexing has costs that scale linearly with the number of lines in the produced table of contents, that is, the number of articles. This count varies widely between different kinds of books.
Figure 1 is a diagram where each point is an electronic facsimile edition, among the first 42 produced by Project Runeberg. These 42 editions together contained 19,200 scanned pages and 9,500 articles. The horizontal axis represents the number of scanned pages in each edition, while the vertical axis represents its article count. Three main categories of books have been identified, based on the ratio of articles per 100 pages. Novels and textbooks have articles (chapters) that are more than 5 pages each, i.e. fewer than 20 articles per 100 pages. Encyclopedic works, and works where a detailed alphabetic index was included in the table of contents, have more articles than pages. A line was drawn at 200 articles per 100 pages, to set off the three most extreme of the 42 published editions.
Figure 1. Number of articles vs pages
Some of the editions have been labeled, as explained in Table 1.
The most extreme of these editions, Hofberg's Swedish biographic dictionary (labeled "sbh"), is a small encyclopedia in 2 volumes, having 4400 entries on 1400 pages. There is no printed table of contents in this work, but the article headings were keyed in manually. Further, all 4000 biographies have been linked to Project Runeberg's presentations of Nordic Authors, where death years and additional information have been added to many of the entries. The amount of work required by this level of ambition by far exceeds the average production cost of electronic facsimile editions. At the other extreme, the items labeled "doodle", "gudasaga", and "rydvaria" have very long chapters, so the cost of indexing is very small compared to the cost of digitization.
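The grouping used in Figure 1 can be expressed as a small Python sketch. The thresholds (20, 100 and 200 articles per 100 pages) come from the text; the category names and the function names are assumptions introduced only for this example, and the text does not name the middle group.

```python
def articles_per_100_pages(articles, pages):
    """Ratio used on Figure 1: articles per 100 scanned pages."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return 100 * articles / pages


def classify_edition(articles, pages):
    """Place an edition in one of the groups described for Figure 1.

    The labels returned here are hypothetical names, not taken from the text.
    """
    ratio = articles_per_100_pages(articles, pages)
    if ratio < 20:
        return "long chapters"             # novels and textbooks
    if ratio <= 100:
        return "intermediate"              # group not named in the text
    if ratio <= 200:
        return "more articles than pages"  # encyclopedic or fully indexed
    return "beyond the 200 line"           # the most extreme editions


# "sbh": 4400 entries on 1400 pages, roughly 314 articles per 100 pages.
sbh_group = classify_edition(4400, 1400)
```

For "sbh" the ratio is about 314 articles per 100 pages, which places it well beyond the line drawn at 200.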
Three works by Wilhelm Tham are labeled in Figure 1. These are uniform descriptions of various Swedish counties, and have similar page counts, but very different article counts. This is because the author's style developed over the years. The one labeled "thamaros", written in 1849, contains only a rough table of contents, but the two descriptions written in 1850 ("thamupps", "thamstoc") also contain detailed alphabetic indices, which were included in the electronic facsimile edition as well.
Whether the extra work to digitize detailed indices is worthwhile must be determined separately for each work. Here, "sbh" is a valuable addition to Project Runeberg's presentation of Nordic Authors, which will help in researching authorship and copyright status of future electronic editions, and "thamstoc" is a presentation of the Swedish capital, which is sure to attract a large audience. For other works, it might make more sense to defer the production of more detailed indices to unpaid volunteers, as Project Runeberg has already done with proofreading and production of plain text editions.
Table 1. Labeled editions

| Label | Name and description |
|---|---|
| cfd | Carl Fredrik Dahlgren, Samlade skrifter, 1400 pages, 1875 |
| doodle | Emil A. G. Kleen, Ströftåg och irrfärder hos min vän Yankee Doodle (samt annorstädes) |
| faltskar | Zacharias Topelius, Fältskärns berättelser, 6 vol., 2500 pages |
| gudasaga | Viktor Rydberg, Fädernas gudasaga, 594 pages |
| palmtrip | August Palm, Ögonblicksbilder från en tripp till Amerika, 1901 |
| ringleka | Ringlekar på Skansen, sheet music |
| rsjkmmus | Johan Helmich Roman, Sjukmans Musiquen, previously unpublished 18th century sheet music |
| rydvaria | Viktor Rydberg, Varia, 2 vol., 952 pages |
| sbh | Hofberg, Svenskt biografiskt handlexikon, 2 vol., 1400 pages, 1906 |
| thamaros | Wilhelm Tham, Beskrifning öfver Westerås län, 1849 |
| thamstoc | Wilhelm Tham, Beskrifning öfver Stockholms län, 1850 |
| thamupps | Wilhelm Tham, Beskrifning öfver Upsala län, 1850 |
| topesang | Zacharias Topelius, Sånger |
Electronic facsimile editions do represent a feasible and economic way of presenting old literature on the Internet. During the experiments in the fall of 1998, Project Runeberg made 32 such editions available on the Internet, containing 16,000 pages, indexed by 9,200 lines of table of contents. The production costs were not separated from the development costs. A total of four man-months (640 hours) were invested, giving an overall production rate of 25 pages per hour, but we believe this could be increased to 100 pages per skilled work hour for a continued production without changing the methods or scale of operation. A weekly output between 1000 and 4000 pages is expected.
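The arithmetic behind these figures can be checked with a short Python sketch. The input numbers are taken from the paragraph above; the function names are invented for this example, and reading the expected weekly output as being produced at the hoped-for rate of 100 pages per hour is an assumption, not something the text states.

```python
def pages_per_hour(pages, hours):
    """Overall production rate."""
    return pages / hours


def hours_per_man_month(total_hours, man_months):
    """Length of a man-month implied by the reported totals."""
    return total_hours / man_months


def weekly_hours_needed(weekly_pages, rate):
    """Skilled hours per week needed for a given weekly output."""
    return weekly_pages / rate


observed_rate = pages_per_hour(16000, 640)   # 16,000 pages in 640 hours
man_month = hours_per_man_month(640, 4)      # four man-months
# Assumption: the expected weekly output is produced at 100 pages per hour.
hours_low = weekly_hours_needed(1000, 100)
hours_high = weekly_hours_needed(4000, 100)
```

Under that assumption, 1000 to 4000 pages per week corresponds to between a quarter of and one full 40-hour working week of skilled labor.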
The production costs, however, do not scale by the number of digitized pages only, but also by the number of articles to be indexed, and whether a printed table of contents can be proofread, or a new one has to be made from scratch. These costs have not been separated in the experiments described here, but it is clear that the cost of indexing can be significant.
Digitization is the work item that requires the lowest skill level, and should be the first candidate for outsourcing. That would also be the best way to separate its cost. Some sources (???quoted by Lesk???) indicate that the cost of digitization can be made as low as 0.10 USD (0.80 SEK) per page.
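The cost structure described in this text, with one part that scales by pages and another that scales by articles, can be written as a rough Python sketch. The per-page defaults are the figures quoted in this text (0.10 USD for outsourced digitization, 0.12 USD for acquiring an old book). The indexing cost per article was not measured in these experiments, so it is a required argument, and any value passed for it is purely hypothetical. The function name `edition_cost_usd` is invented for this example.

```python
def edition_cost_usd(pages, articles, indexing_per_article,
                     digitization_per_page=0.10,
                     acquisition_per_page=0.12):
    """Linear cost sketch: page-driven costs plus article-driven costs.

    indexing_per_article is NOT given in the text; callers must supply
    their own estimate. Research cost is left out, as in the text.
    """
    if pages < 0 or articles < 0:
        raise ValueError("pages and articles must be non-negative")
    return {
        "digitization": pages * digitization_per_page,
        "acquisition": pages * acquisition_per_page,
        "indexing": articles * indexing_per_article,
    }


# Illustration only: 0.05 USD per article is a made-up placeholder.
example = edition_cost_usd(pages=1400, articles=4400,
                           indexing_per_article=0.05)
example_total = sum(example.values())
```

Keeping the three components separate makes it easy to see when indexing, rather than scanning, dominates the cost of an edition with many short articles.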