Project Runeberg Metadata

Project Runeberg maintains various metadata about authors and titles relevant to Project Runeberg. This page gives an overview of how the metadata system works.

Architecture

The metadata architecture within Project Runeberg has been guided by principles of flexibility and simplicity; the data formats used are simple text formats that are relatively easy to parse from any programming language. These data formats are the primary means of communication between the various programs that are used to process data.

We have avoided the more complex XML formats as far as internal use goes, since they tend to become complicated and error-prone when manually edited. Instead, we generate some XML formats (such as RSS) for publishing from our internal formats.

The key internal formats are the following:

The most fundamental of the *.lst files are a.lst (the main author database table), t.lst (the main title database table), and tema.lst (the thematic keyword database table). Each of these three files has a primary key, called the 'author key', the 'title/work key', and the 'tema/theme key' respectively, which other files use to refer to its entries. Additionally, a_tema.lst and t_tema.lst connect the author and title tables with the thematic keyword table. The fields of these files are documented in comments at the beginning of each file.

The t.lst table used to be the primary source for information about titles. Over time, however, we wanted to keep track of an increasing number of metadata attributes, many of them relevant only for certain kinds of titles. We are therefore migrating towards making the 'Metadata' format files the primary source for title metadata, and building the t.lst table by gathering the most widely used attributes from these files.

Since we introduced facsimile editions of titles (first publishing scanned images with crude OCR texts, and then allowing collaborative proofreading to improve the text accuracy), many titles now also have an Articles.lst and a Pages.lst file. These files provide mappings between page contents and page units, and between scanned page filenames and page contents.

Metadata

Titles (or 'works', or 'editions', the choice of word depending a bit on context) are typically the largest data objects within Project Runeberg, each title representing a book or some other work of literature, or art, or some other kind of publication. Being the largest objects, they also tend to be the ones that have the most metadata connected with them. The 'Metadata' files contain most of this metadata (excepting only some purely internal metadata). To distinguish this particular set of metadata from the general concept of metadata, we generally refer to these files, and to the metadata contained within them, as 'Metadata' spelled with a capital M.

Each 'Metadata' file contains one or more lines with attribute-value pairs, plus optional empty lines and optional comment lines. Comment lines begin with a number sign (#).

An attribute-value line contains an attribute name (by convention written in all uppercase), followed by a colon (or an equals sign), followed by the value for that attribute. There may optionally be spaces before and after the colon/equals sign.
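The line rules above can be illustrated with a minimal Python sketch. This is not Project Runeberg's own parser; the function name parse_metadata and the attribute names in the sample (TITLE, LANGUAGE, SOURCE) are hypothetical and chosen only for illustration. The sketch further assumes that the first colon or equals sign on a line is the separator, that repeated attributes are kept in file order, and that lines fitting neither pattern are skipped.

```python
import re

# Separator: the first ':' or '=' on the line, with optional spaces around it.
_SEPARATOR = re.compile(r"\s*[:=]\s*")


def parse_metadata(text):
    """Return a list of (attribute, value) pairs in file order.

    Assumptions (not confirmed by the format description): the first
    ':' or '=' splits name from value, repeated attributes are kept,
    and malformed lines are silently skipped.
    """
    pairs = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip empty lines and comment lines starting with '#'.
        if not stripped or stripped.startswith("#"):
            continue
        parts = _SEPARATOR.split(stripped, maxsplit=1)
        if len(parts) != 2 or not parts[0]:
            continue  # no separator or empty attribute name
        pairs.append((parts[0], parts[1]))
    return pairs


# Hypothetical sample; these attribute names are illustrative only.
SAMPLE = """\
# A comment line
TITLE: An Example Book

LANGUAGE = sv
SOURCE:http://example.org/a=b
"""

if __name__ == "__main__":
    for name, value in parse_metadata(SAMPLE):
        print(name, "->", value)
```

Returning a list of pairs rather than a dictionary avoids deciding what a repeated attribute means; a caller that knows an attribute is single-valued can build a dictionary from the result.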

The attribute namespace is open-ended, and new attributes are added now and then. The complete list of attributes may therefore change over time, and it includes many unusual attributes of rather specialized use. The following is a list of the more general attributes; unless otherwise noted, all attributes are optional:



Project Runeberg, Thu Jan 21 15:10:17 2010
http://runeberg.org/admin/metadata.html