Project Runeberg Metadata

Project Runeberg maintains various metadata about authors and titles relevant to Project Runeberg. This page gives an overview of how the metadata system works.

Architecture

The metadata architecture within Project Runeberg has been guided by principles of flexibility and simplicity; the data formats used are simple text formats that are relatively easy to parse from any programming language. These data formats are the primary means of communication between the various programs that are used to process data.

We have avoided the more complex XML formats as far as internal use goes, since they tend to become complicated and error-prone when manually edited. Instead, we generate some XML formats (such as RSS) for publishing from our internal formats.

The key internal formats are the following:

*.lst -- These files represent a database table, with one line per entry and with fields separated by the vertical bar (|). Lines beginning with the number sign (#) are ignored and are used for comments. Empty lines are also ignored.
Metadata -- Each title published within Project Runeberg has one (or more) files with this name. They contain a number of attribute-value pairs, providing the metadata for the title in question. Each pair is represented by a line like "ATTRIBUTE: value". The attribute name is, by convention, always in uppercase. Space before and after the colon is optional. Empty lines and lines beginning with '#' are ignored.
*.src -- These files contain one or more named sections of data. A section is introduced by a line beginning with three dashes (---) followed by the section name. Any lines before the first explicitly marked section implicitly belong to the section METADATA, which contains lines of the same format as used in the Metadata format. A section without a name is assumed to be named TEXT.
*.txt -- Text files enriched with a small subset of HTML, including the tags 'h1', 'h2', 'h3', 'p', 'ul', 'li', 'i', 'b', 'blockquote' and 'a'.
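The *.lst and *.src descriptions above can be sketched as two small readers. This is only an illustration in Python, not the code Project Runeberg actually uses; the function names parse_lst and parse_src are hypothetical, and the handling of whitespace around a section name is an assumption.

```python
def parse_lst(text):
    """Return a list of rows, each row a list of fields split on '|'."""
    rows = []
    for line in text.splitlines():
        # Empty lines and comment lines (starting with '#') are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(line.split("|"))
    return rows


def parse_src(text):
    """Return a dict mapping each section name to its list of lines."""
    sections = {}
    current = "METADATA"  # lines before the first marker belong to METADATA
    for line in text.splitlines():
        if line.startswith("---"):
            # The name follows the three dashes; an unnamed section is TEXT.
            # Assumption: surrounding whitespace is not part of the name.
            current = line[3:].strip() or "TEXT"
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return sections
```

The lines collected under METADATA can then be handed to whatever reads the Metadata format, since both use the same attribute-value syntax.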

The most fundamental of the *.lst files are a.lst (the main author database table), t.lst (the main title database table), and tema.lst (the thematic keyword database table). Each of these three files has a primary key, called the 'author key', the 'title/work key', and the 'tema/theme key' respectively, which other files refer to. Additionally, a_tema.lst and t_tema.lst connect the author and title tables with the thematic keyword table. The fields of these files are documented in comments at the beginning of each file.
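To illustrate how the key relationships might be followed, here is a minimal Python sketch. The column positions are assumptions made for the example (primary key in the first field of each table, and a link table whose first two fields are the title key and the theme key); the real layout is documented in the comments at the top of each file. The function names are hypothetical.

```python
def index_by_key(rows, key_column=0):
    """Map each row's key field to the row (assumes the key is in key_column)."""
    return {row[key_column]: row for row in rows}


def themes_for_title(title_key, link_rows, theme_index):
    """Return the theme rows linked to title_key.

    Assumes each link row starts with (title key, theme key), which is a
    guess at the t_tema.lst layout, not a documented fact.
    """
    result = []
    for row in link_rows:
        linked_title, theme_key = row[0], row[1]
        if linked_title == title_key and theme_key in theme_index:
            result.append(theme_index[theme_key])
    return result
```

With rows already split into fields, index_by_key builds a lookup for tema.lst and themes_for_title walks the link table once per query.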

The t.lst table used to be the primary source for information about titles. Over time, however, we wanted to keep track of an increasing number of metadata attributes, many of them relevant only to certain kinds of titles. We are therefore migrating towards having the 'Metadata' format files as the primary source for title metadata, and building the t.lst table by gathering the most widely used attributes from these files.

Since we introduced facsimile editions of titles (first publishing scanned images with crude OCR texts, and then allowing for collaborative proof-reading to improve the text accuracy), many titles now also have an Articles.lst and a Pages.lst file. These files provide mappings between page contents and page units, and between scanned page filenames and page contents.

Metadata

Titles (or 'works', or 'editions', the choice of word depending a bit on context) are typically the largest data objects within Project Runeberg, each title representing a book or some other work of literature, or art, or some other kind of publication. Being the largest objects, they also tend to be the ones that have the most metadata connected with them. The 'Metadata' files contain most of this metadata (excepting only some purely internal metadata). To distinguish this particular set of metadata from the general concept of metadata, we generally refer to these files, and to the metadata contained within them, as 'Metadata', spelled with a capital M.

Each 'Metadata' file contains one or more lines with attribute-value pairs, plus optional empty lines and optional comment lines. Comment lines begin with a number sign (#).

An attribute-value line contains an attribute name (by convention written in all uppercase), followed by a colon (or an equals sign), followed by the value for that attribute. There may optionally be spaces before and after the colon/equals sign.
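The line format described above can be read with a few lines of Python. This is a sketch under stated assumptions, not Project Runeberg's own parser: the name parse_metadata is hypothetical, lines that contain neither separator are skipped silently, and the result is a list of pairs because the text does not say whether an attribute may occur more than once.

```python
import re

# Attribute name, then ':' or '=', then the value.
# Spaces around the separator are optional.
_PAIR = re.compile(r"^([^:=]+?)\s*[:=]\s*(.*)$")


def parse_metadata(text):
    """Return a list of (attribute, value) pairs from a Metadata file."""
    pairs = []
    for line in text.splitlines():
        # Empty lines and comment lines (starting with '#') are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        match = _PAIR.match(line)
        if match is None:
            continue  # assumption: skip lines without a ':' or '='
        name, value = match.groups()
        pairs.append((name.strip(), value.strip()))
    return pairs
```

Only the first colon or equals sign is treated as the separator, so values that themselves contain a colon are kept whole. When each attribute is known to occur once, dict(parse_metadata(text)) gives a plain lookup table.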

The attribute namespace is open-ended, and new attributes are added now and then. The complete list of attributes may therefore change over time, and it includes many unusual attributes of rather specialized use. The following is a list of the more general attributes; unless otherwise noted, all attributes are optional:

TITLE -- the title string for this edition. REQUIRED.
TITLEKEY -- the title key for this edition. REQUIRED. This corresponds to the title's line in the t.lst table, and
AUTHORKEY -- author key(s) (referring to the a.lst table) of the author(s) of this title.
COAUTHORKEY -- co-author key (referring to the a.lst table) of the coauthor(s) of this title. This should not include any author already listed in the AUTHORKEY attribute.
TRANSLATORKEY -- translator author key (referring to the a.lst table) of the translator(s) of this edition, if it is a translation.
IMAGE_SOURCE -- for volumes where scanned images were copied from an external source, such as the Internet Archive or Google Book Search, this field specifies the foreign key, e.g. umich:ACV2565 for poncelet. Before the colon is the source archive, one of utoronto (Canadian Libraries in the Internet Archive), umich (University of Michigan) or google (Google Book Search). After the colon is the internal identifier used in the link to this volume in that archive.
MARC -- if library catalog records are found for this book, they can be specified here. For example, ibsen specifies bibliotekdk:55041563 bibsys:940178680 libris:872159 for links to records in the catalogs of bibliotek.dk, bibsys.no, and libris.kb.se.
OUR_PUBLISHING_DATE -- a date of the format YYYY-MM-DD indicating the day when the title was first published within Project Runeberg. This information may be missing or approximate for works published before the Metadata system was fully implemented.
LANGUAGE -- a language code indicating the language of the current title.
ORIGINAL_LANGUAGE -- a language code indicating the original language of the current title. If omitted, assumed to be the same as given by the LANGUAGE attribute.

THEME -- one or more thematic keywords (referring to the tema.lst table) for this title.
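Several of the attribute values above have some internal structure. The following Python sketch shows one way they might be interpreted once the attributes are in a dict. The function names are hypothetical, and splitting MARC entries on whitespace is an assumption based on the ibsen example.

```python
from datetime import datetime


def split_image_source(value):
    """Split 'archive:identifier' at the first colon."""
    archive, _, identifier = value.partition(":")
    return archive, identifier


def split_marc(value):
    """Map each catalog name to its record id.

    Assumes entries are separated by whitespace, as in the example.
    """
    return dict(item.split(":", 1) for item in value.split())


def original_language(meta):
    """ORIGINAL_LANGUAGE falls back to LANGUAGE when omitted."""
    return meta.get("ORIGINAL_LANGUAGE") or meta.get("LANGUAGE")


def publishing_date(meta):
    """Parse OUR_PUBLISHING_DATE (YYYY-MM-DD); None if the attribute is absent."""
    value = meta.get("OUR_PUBLISHING_DATE")
    if value is None:
        return None
    return datetime.strptime(value, "%Y-%m-%d").date()
```

For example, split_image_source("umich:ACV2565") separates the source archive umich from the identifier ACV2565. Because OUR_PUBLISHING_DATE may be approximate for older titles, a caller would probably want to catch the ValueError that strptime raises on values in any other format.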

Project Runeberg, Thu Jan 21 15:10:17 2010 (aronsson) https://runeberg.org/admin/metadata.html
