by Lars Aronsson, 4 November 1998
There are many ways to measure a large web site, such as Project Runeberg. There are static measures, such as the number of files or the size of the files that are available, and there are dynamic measures that relate to the use of the web site. The best way to measure the use is perhaps to ask the users. We welcome e-mail from users of Project Runeberg, but we have no resources to conduct interviews with a significant number of users.
Another way to collect dynamic statistics (is that an oxymoron?) is to look at the log file that is produced by the web server. Every significant event is recorded in this log file, such as when the server delivered a file to a remote client. Now, files can be delivered for very different reasons, and you want to be able to tell them apart. This is difficult, resulting perhaps in poor statistics.
One reason a web server delivers files is that a web spider is crawling the entire web to collect data for a search engine such as HotBot or AltaVista, or for the Swedish Kulturarw³ project. Another is that image files are delivered as part of a web page where they appear inline. Both are unimportant side effects that provide no information about real people actually wanting to use the contents of the web site. If you can filter them out, the remaining events represent people following hypertext links to new HTML pages. We have chosen to refer to these events as "clicks" because people tend to use graphic browsers and a mouse to click on the hypertext links.
A series of clicks coming from the same user within some time frame would represent a session. Sessions can be long, indicating an interested reader, or short, indicating someone who stumbled on the web site by mistake and soon left again. We currently do not identify sessions, and as a consequence we are not able to count users, only their clicks.
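A minimal sketch of how such sessions could be identified, although this is not something Project Runeberg currently does. It assumes that each click carries a remote host name and a timestamp, that a host stands in for a user, and that a pause of more than 30 minutes starts a new session. The 30-minute limit and all names below are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Assumed time frame: a longer pause between two clicks from the same
# host is taken to start a new session.
SESSION_GAP = timedelta(minutes=30)


def count_sessions(clicks):
    """Count sessions in an iterable of (host, timestamp) pairs."""
    last_seen = {}  # host -> timestamp of that host's most recent click
    sessions = 0
    for host, timestamp in sorted(clicks, key=lambda click: click[1]):
        previous = last_seen.get(host)
        if previous is None or timestamp - previous > SESSION_GAP:
            sessions += 1
        last_seen[host] = timestamp
    return sessions


if __name__ == "__main__":
    example = [
        ("a.example", datetime(1998, 10, 21, 10, 0)),
        ("b.example", datetime(1998, 10, 21, 10, 5)),
        ("a.example", datetime(1998, 10, 21, 10, 10)),
        ("a.example", datetime(1998, 10, 21, 11, 0)),
    ]
    print(count_sessions(example))
```

Using the host as a stand-in for the user is crude: several people behind one proxy would be merged into a single session.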
We define clicks by looking at two fields in the web server log file. When the client type is "Mozilla" and the retrieved file type is HTML, we count a click. Mozilla is the client type claimed by the dominant graphic web browsers, such as Netscape Navigator and Microsoft Internet Explorer. Events generated by Lynx users do not count as clicks under our definition. We have found that Lynx users represent less than 3 % of Project Runeberg's users, which is less than the expected error margin anyway. Further, by our definition, we risk counting clicks when a web crawler falsely claims to be Mozilla. We have identified some cases of this kind, and our web server log file analyzing software has been set to ignore them.
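The rule above can be sketched in a few lines of Python. This is not the project's actual log analysis software. It assumes that each log event has already been parsed into a client type string and a requested path, that HTML files can be recognized by their file name, and that impostor crawlers can be recognized by a substring of their client type. The marker "ExampleCrawler" is made up for the illustration.

```python
from dataclasses import dataclass

# Hypothetical markers for crawlers that falsely claim to be Mozilla.
FAKE_MOZILLA_MARKERS = ("ExampleCrawler",)


@dataclass
class Event:
    user_agent: str  # client type as reported by the client
    path: str        # path of the file that was delivered


def is_html(path):
    """Assume HTML pages end in .html or .htm, or are directory indexes."""
    path = path.split("?", 1)[0].lower()
    return path.endswith((".html", ".htm", "/"))


def is_click(event):
    """Apply the two-field rule: Mozilla client type and HTML file type."""
    if not event.user_agent.startswith("Mozilla"):
        return False  # Lynx users and honest crawlers end up here
    if any(marker in event.user_agent for marker in FAKE_MOZILLA_MARKERS):
        return False  # known crawlers that only pretend to be Mozilla
    return is_html(event.path)
```

Inline images fail the file type test even when the client is a graphic browser, so they are excluded without any special handling.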
We know that our methods for analyzing the web server log files are less than perfect, and that they might improve in the future. Therefore, we save the log files, so we can run new analyses on them later. We have saved all our log files since we started out with Gopher technology in 1992, with a few exceptions.
We have noticed a strong weekly variation in our web server statistics. In fact, the number of events on Saturday and Sunday added together is comparable to that of any single weekday, Monday through Friday. There is also some variation during summer holidays, but it sometimes disappears in the long-term trends as the contents and the use of the web site evolve. The launching of new contents sometimes outweighs the effects of a summer vacation. Because of the weekly fluctuations, however, we have chosen to group data by week, using weeks as 50-some data points when observing the trends over a year. We have not done any studies on the variation in use over different hours of the day.
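Grouping by week could look like the sketch below. It assumes that the date of each click is already known and that weeks are numbered according to ISO 8601 (Monday as the first day). Under that numbering, 21 October 1998 is the Wednesday of week 43. The function name is illustrative.

```python
from collections import Counter
from datetime import date


def clicks_per_week(click_dates):
    """Count clicks per (ISO year, ISO week number)."""
    counts = Counter()
    for day in click_dates:
        iso_year, iso_week, _weekday = day.isocalendar()
        counts[(iso_year, iso_week)] += 1
    return counts


if __name__ == "__main__":
    dates = [date(1998, 10, 19), date(1998, 10, 21), date(1998, 10, 26)]
    for (year, week), clicks in sorted(clicks_per_week(dates).items()):
        print(year, week, clicks)
```

Keeping the year in the key prevents the first and last weeks of neighboring years from being merged.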
Currently, in 1998, Project Runeberg's web server typically experiences some 120,000 events each week. Of these, 40,000 are web crawlers or the like (including Lynx users). Then there are 45,000 inline images that are provided as parts of web pages. The remaining 35,000 events are interactive "clicks" that we count. The trend is slowly rising from 28,000 clicks at the beginning of the year towards 35,000 in October. This is a very slow trend in comparison with the fast growth of the web, so Project Runeberg has actually been lagging behind the trends for some time. This is because the project has not been very active during 1997 and the first three quarters of 1998.
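The figures in this paragraph can be checked with a little arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Typical weekly events in 1998, as quoted in the text.
weekly_events = {
    "crawlers and the like": 40_000,
    "inline images": 45_000,
    "clicks": 35_000,
}

total = sum(weekly_events.values())
shares = {kind: count / total for kind, count in weekly_events.items()}

# Relative growth in weekly clicks from the beginning of 1998 to October.
growth = (35_000 - 28_000) / 28_000

if __name__ == "__main__":
    print(total)
    for kind, share in shares.items():
        print(f"{kind}: {share:.1%}")
    print(f"growth: {growth:.0%}")
```

The three categories add up to the 120,000 weekly events. Clicks make up about 29 percent of them, and the number of weekly clicks grew by 25 percent over the ten months.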
As Project Runeberg is built as a collection of electronic editions (or titles) of classic Nordic literature, it is natural to compare different titles against each other, to see which are more popular. Such a title can be a novel, a poetry collection, or a dictionary. The titles published by Project Runeberg can be seen in our alphabetic catalog. A few of these titles are extremely popular, while the majority are hardly used at all.
Figure 1. The most popular editions
The number of weekly clicks for our most popular editions is shown in figure 1, above. The first thing you will notice is that data are missing for weeks 31-36. This is one of the gaps in our sequence of saved log files. During parts of July and August 1998, the server was down, and when it was restarted after the holidays, the log files were not archived properly.
The second thing you will notice in this diagram is that the vertical axis has a logarithmic scale, using the base 2 logarithm. The distance between each line represents a doubling of the number of clicks. This is the best representation for entities that vary as much as click counts do, and that are sometimes subject to exponential growth. However, fully understanding such a diagram does require some knowledge of logarithms.
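For readers less familiar with logarithms, the following sketch shows what the base 2 scale does to a click count. It does not reproduce the diagram itself, and the function names are illustrative.

```python
import math


def log2_position(clicks):
    """Vertical position of a click count on a base 2 logarithmic axis."""
    return math.log2(clicks)


def gridlines(lowest, highest):
    """Powers of two between lowest and highest, one per horizontal line."""
    first = math.ceil(math.log2(lowest))
    last = math.floor(math.log2(highest))
    return [2 ** k for k in range(first, last + 1)]


if __name__ == "__main__":
    # Doubling the clicks always moves a point up by one unit.
    print(log2_position(4000) - log2_position(2000))
    print(gridlines(200, 5000))
```

On such an axis, an edition that grows from 200 to 400 weekly clicks rises exactly as far as one that grows from 2,000 to 4,000, which is why small and large editions can share one diagram.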
The editions that are shown in the diagram are only the nine most popular of the more than 200 editions published by Project Runeberg. In addition to them, the total number of clicks is also presented, varying between 20,000 and 35,000 each week. Only the short names are printed for each edition. This is the same short name that appears in the URL. For the editors of Project Runeberg, these short names are familiar, and perhaps some are to you too?
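Because the short name appears in the URL, clicks can be attributed to editions by looking at the requested path. The sketch below assumes that the short name is the first component of the path, as in "/nordflor/1.html". The file names and function names are illustrative.

```python
from collections import Counter


def edition_of(path):
    """Return the first path component, or None for top-level files."""
    parts = path.lstrip("/").split("/", 1)
    # A path without a subdirectory, such as "/index.html", has no edition.
    if len(parts) == 2 and parts[0]:
        return parts[0]
    return None


def most_popular(click_paths, n=9):
    """Return the n editions with the most clicks, most popular first."""
    counts = Counter(edition_of(path) for path in click_paths)
    counts.pop(None, None)  # drop clicks that belong to no edition
    return counts.most_common(n)


if __name__ == "__main__":
    paths = ["/nordflor/1.html", "/nordflor/2.html", "/sbh/a.html", "/index.html"]
    print(most_popular(paths, n=2))
```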
The most popular edition that Project Runeberg has published is "nordflor", which is short for C.A.M. Lindman, Bilder ur Nordens Flora, 1901-1905. This edition alone receives between 2,000 and 5,000 clicks each week, averaging more than 10 percent of all clicks on Project Runeberg. This popularity is explained by its contents: detailed descriptions, beautifully illustrated in color, of more than 500 plants of the Nordic flora.
The second most popular item, "authors", is not a piece of classic literature, but our own database and collection of presentations of Nordic Authors. It is closely followed by another internal product, "tema", the section we call Tema; Thematic entries to Project Runeberg. These two items are presented as literature editions because they have separate subdirectories on our server, just like the classic literature editions have. A lot of work has been invested in these two items by the editors of Project Runeberg, and it is satisfying to see that it was not completely wasted.
Two translations of the Bible are present among the most popular editions: "bibeln" is the Swedish Bible translation of 1917 (Bibeln) and "dkbibel" is the Danish Bible translation (Bibelen). Both translations show great variation in use from week to week. The Swedish translation has a two-week variation, and the Danish shows a pattern that repeats over three weeks. We have no explanation for this strange behavior.
The item named "sfs" is our snapshot copy of the law text section (SFS-T) of the Swedish parliament's database (Rixlex), somewhat improperly referred to as Svensk författningssamling. We made this snapshot available in HTML some years back when the parliament itself only provided telnet access. Despite the appearance on the web of more up-to-date collections of Swedish laws, this edition remains popular. This can partly be explained by its size (more than 6,000 HTML files) and its open exposure to full text search engines such as HotBot and AltaVista.
The item "admin" is once again an internal production, our section About Project Runeberg. Its popularity is partly explained by the fact that it has a link in the upper right corner of every HTML page presented by Project Runeberg.
"svlihist" is an old Swedish schoolbook on literature history, Karl Warburg, Svensk litteraturhistoria i sammandrag, 1904. Its popularity is explained by its closeness to the subject matter of Project Runeberg and the fact that each author mentioned therein is symmetrically linked to the entry on the same author in the aforementioned Nordic Authors.
Also relevant to the study of Swedish history is "sbh", short for Herman Hofberg, Svenskt biografiskt handlexikon (SBH), 2nd edition, 1906, a biographic dictionary covering more than 4,000 famous Swedish people from four centuries. These biographies are also symmetrically linked to Nordic Authors. While work on digitizing SBH was begun in January 1997, systematic efforts were invested only in September 1998, and the complete edition was announced on October 21st (Wednesday of week 43). This explains the rise in its use in the latter part of the year.
It should be added that the diagram presented here is up to date as of November 2nd, 1998. It should benefit from being completed with the remaining data at the end of the year, but we are experienced enough never to give any promises of that kind.
Figure 2. Some popular novels
The same statistics for some popular novels are shown in figure 2.
The reader should keep in mind that Project Runeberg provides a
separate HTML file for each chapter of these novels, so a reader who
chooses to read the entire work generates many clicks.
The most popular novels, typically getting between 200 and 800 clicks
per week, are:
"folkeven" - Asbjørnsen, Moe, Norske folkeeventyr
"kram" - Hans-Eric Hellberg, Kram
"nilsholg" - Selma Lagerlöf, Nils Holgerssons underbara resa genom Sverige
"rodarum" - August Strindberg, Röda rummet.
There is a gap between the above four and the rest, which only
seldom get more than 200 clicks per week:
"kalocain" - Karin Boye, Kallocain
"hertha" - Fredrika Bremer, Hertha
"jerusalm" - Selma Lagerlöf, Jerusalem
"portugal" - Selma Lagerlöf, Kejsaren av Portugallien.