<h1>Project Runeberg Metadata</h1>
<p>Project Runeberg maintains various metadata about authors and
titles relevant to Project Runeberg. This page gives an overview
of how the metadata system works.</p>
<h2>Architecture</h2>
<p>The metadata architecture within Project Runeberg has been guided
by principles of flexibility and simplicity; the data formats used
are simple text formats that are relatively easy to parse from any
programming language. These data formats are the primary means of
communication between the various programs that are used to process
data.</p>
<p>We have avoided the more complex XML formats as far as internal
use goes, since they tend to become complicated and error-prone when
manually edited. Instead, we generate some XML formats (such as
RSS) for publishing from our internal formats.</p>
<p>The key internal formats are the following:
<ul>
<li>*.lst -- Each of these files represents a database table,
with one line per entry and fields separated by the pipe character (|).
Lines beginning with the number sign (#) are ignored and are used
for comments. Empty lines are also ignored.</li>
<li>Metadata -- Each title published within Project Runeberg has one
(or more) files with this name. They contain a number of attribute-value
pairs, providing the metadata for the title in question. Each pair
is represented by a line like "ATTRIBUTE: value". The attribute name
is, by convention, always in uppercase. Spaces before and after the
colon are optional. Empty lines and lines beginning with '#' are
ignored.</li>
<li>*.src -- These files contain one or more named sections of data.
A section is introduced by a line beginning with three dashes (---)
followed by the section name. Any lines before the first explicitly
marked section implicitly belong to the section METADATA, which contains
lines of the same format as used in the Metadata format. A section
without a name is assumed to be named TEXT.</li>
<li>*.txt -- Text files enriched with a small subset of HTML, including
the tags 'h1', 'h2', 'h3', 'p', 'ul', 'li', 'i', 'b', 'blockquote' and 'a'.
</li>
</ul>
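<p>The *.lst and *.src rules above can be sketched in a few lines of Python.
This is a minimal illustration, not the code Project Runeberg actually runs:
the function names <tt>parse_lst</tt> and <tt>parse_src</tt> are made up for
this example, and two details are assumptions that the description leaves
open, namely that whitespace around fields is trimmed and that a section
marker may be longer than three dashes.</p>

```python
def parse_lst(text):
    """Parse *.lst text into a list of rows, each a list of field strings."""
    rows = []
    for line in text.splitlines():
        # Empty lines and lines beginning with '#' are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: whitespace around each field is not significant.
        rows.append([field.strip() for field in line.split("|")])
    return rows


def parse_src(text):
    """Split *.src text into a dict mapping section name to its lines."""
    sections = {}
    current = "METADATA"  # lines before the first marker belong here
    for line in text.splitlines():
        if line.startswith("---"):
            # Assumption: extra leading dashes are part of the marker.
            # A marker without a name introduces the TEXT section.
            current = line.lstrip("-").strip() or "TEXT"
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return sections
```

<p>The lines collected under METADATA can then be handed to whatever code
reads the Metadata format, since both use the same attribute-value lines.</p>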
<p>The most fundamental of the *.lst files are a.lst (the main author
database table), t.lst (the main title database table), and tema.lst
(the thematic keyword database table). Each of these three files has
a primary key, called the 'author key', the 'title/work key',
and the 'tema/theme key' respectively, which is referred to from other files.
Additionally, a_tema.lst and t_tema.lst connect the author and
title tables with the thematic keyword table. The fields of these
files are documented in comments at the beginning of each file.</p>
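<p>As an illustration of how a linking table such as a_tema.lst might be
used, the sketch below groups theme keys by author key. The real field layout
is documented only in the comments of each file, so the two-column layout
assumed here (author key, then theme key) is hypothetical, as are the function
names and the sample keys in the usage note.</p>

```python
def read_rows(text):
    """Yield the fields of each data line in *.lst text."""
    for line in text.splitlines():
        # Skip empty lines and '#' comment lines.
        if line.strip() and not line.startswith("#"):
            yield [field.strip() for field in line.split("|")]


def themes_by_author(a_tema_text):
    """Map each author key to the list of theme keys linked to it.

    Assumes (hypothetically) that the first field is the author key
    and the second field is the theme key.
    """
    themes = {}
    for row in read_rows(a_tema_text):
        if len(row) < 2:
            continue  # malformed line; skipped in this sketch
        author_key, theme_key = row[0], row[1]
        themes.setdefault(author_key, []).append(theme_key)
    return themes
```

<p>With the made-up input lines <tt>author1|drama</tt> and
<tt>author1|poetry</tt>, the result maps <tt>author1</tt> to both theme keys.
The same approach would apply to t_tema.lst, with title keys in place of
author keys.</p>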
<p>The t.lst table used to be the primary source of information about
titles. Over time, however, we wanted to keep track of an increasing
number of metadata attributes, many of them relevant only for certain
kinds of titles. We are therefore migrating towards making the 'Metadata'
files the primary source for title metadata, and building the t.lst
table by gathering the most widely used attributes from these files.</p>
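<p>The gathering step could look roughly like the following sketch, which
turns already-parsed Metadata records (plain dictionaries) into t.lst-style
text. The column selection and order in <tt>COLUMNS</tt> are assumptions made
for this example; the actual t.lst layout is the one documented in that
file's own comments.</p>

```python
# Assumed column selection and order; not the real t.lst layout.
COLUMNS = ["TITLEKEY", "TITLE", "AUTHORKEY", "LANGUAGE"]


def build_tlst(metadata_records, columns=COLUMNS):
    """Build t.lst-style text from a list of Metadata dictionaries."""
    # A leading comment line documents the fields, as the *.lst files do.
    lines = ["# " + "|".join(columns)]
    # Sort by title key so the output is stable between runs.
    for meta in sorted(metadata_records, key=lambda m: m.get("TITLEKEY", "")):
        # Attributes missing from a record become empty fields.
        lines.append("|".join(meta.get(column, "") for column in columns))
    return "\n".join(lines) + "\n"
```

<p>This sketch does not handle values that themselves contain a pipe
character; a real implementation would have to reject or escape them.</p>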
<p>Since we introduced facsimile editions of titles (first publishing
scanned images with crude OCR text, and then allowing collaborative
proofreading to improve the accuracy of the text), many titles now also have
an Articles.lst and a Pages.lst file. These files provide mappings between
page contents and page units, and between scanned page filenames and page
contents.</p>
<h2>Metadata</h2>
<p>Titles (or 'works', or 'editions', the choice of word depending a
bit on context) are typically the largest data objects within
Project Runeberg, each title representing a book or some other work
of literature, or art, or some other kind of publication. Being the
largest objects, they also tend to be the ones that have the most metadata
connected with them. The 'Metadata' files contain most of this
metadata (excepting only some purely internal metadata). To distinguish
this particular set of metadata from the general concept of metadata,
we refer to these files and the metadata contained within them
as 'Metadata', spelled with a capital M.</p>
<p>Each 'Metadata' file contains one or more lines with
attribute-value pairs, plus optional empty lines and optional
comment lines. Comment lines begin with a number sign (#).</p>
<p>An attribute-value line contains an attribute name (by convention
in all uppercase), followed by a colon (or an equals
sign), followed by the value for that attribute. There may optionally
be spaces before and after the colon/equals sign.</p>
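<p>The line format just described can be read with a short routine such as
the one below. It is a sketch under stated assumptions rather than the parser
Project Runeberg uses: the name <tt>parse_metadata</tt> is invented here,
lines that do not look like attribute-value pairs are silently skipped, and
if an attribute is repeated the last value wins.</p>

```python
import re

# Attribute name, optional spaces, ':' or '=', optional spaces, value.
# The name stops at the first separator, so values may contain colons.
LINE_RE = re.compile(r"^([^:=\s]+)\s*[:=]\s*(.*)$")


def parse_metadata(text):
    """Parse 'Metadata' file text into a dict of attribute -> value."""
    attrs = {}
    for line in text.splitlines():
        # Empty lines and comment lines (starting with '#') are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # not an attribute-value line; skipped in this sketch
        name, value = match.groups()
        # Assumption: a repeated attribute overwrites the earlier value.
        attrs[name] = value.strip()
    return attrs
```

<p>Splitting at the first separator only matters for values such as
<tt>umich:ACV2565</tt>, which contain a colon themselves.</p>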
<p>The attribute namespace is open-ended: new attributes are
added now and then, so the complete list of attributes changes
over time and includes many unusual attributes of rather specialized
use. The following is a list of the more general attributes; unless
otherwise noted, all attributes are optional:
<ul>
<li><tt>TITLE</tt> -- the title string for this edition. REQUIRED.</li>
<li><tt>TITLEKEY</tt> -- the title key for this edition. REQUIRED. This
corresponds to the title's line in the <tt>t.lst</tt> table.</li>
<li><tt>AUTHORKEY</tt> -- author key(s)
(referring to the <tt>a.lst</tt> table)
of the author(s) of this title.</li>
<li><tt>COAUTHORKEY</tt> -- author key(s)
(referring to the <tt>a.lst</tt> table)
of the co-author(s) of this title. This should not include any author
already listed in the AUTHORKEY attribute.</li>
<li><tt>TRANSLATORKEY</tt> -- translator author key
(referring to the <tt>a.lst</tt> table)
of the translator(s) of this edition, if it is a translation.</li>
<li><tt>IMAGE_SOURCE</tt> -- for volumes where scanned images
were copied from an external source, such as the Internet Archive
or Google Book Search, this field specifies the foreign key, e.g.
<tt>umich:ACV2565</tt> for <a href="/poncelet/" >poncelet</a>.
The part before the colon names the source archive, one of utoronto (Canadian
Libraries in the Internet Archive), umich (University of Michigan)
or google (Google Book Search). The part after the colon is the internal
identifier used in the link to this volume in that archive.</li>
<li><tt>MARC</tt> -- if library catalog records are found for this
book, they can be specified here. For example,
<a href="/ibsen/" >ibsen</a> specifies
<tt>bibliotekdk:55041563 bibsys:940178680 libris:872159</tt>
for links to records in the catalogs of
<a href="http://bibliotek.dk/" >bibliotek.dk</a>,
<a href="http://bibsys.no/" >bibsys.no</a>, and
<a href="http://www.libris.kb.se/" >libris.kb.se</a>.</li>
<li><tt>OUR_PUBLISHING_DATE</tt> -- a date of the format YYYY-MM-DD
indicating the day when the title was first published within Project
Runeberg. This information may be missing or approximate for works
published before the Metadata system was fully implemented.</li>
<li><tt>LANGUAGE</tt> -- a language code indicating the language of
the current title.</li>
<li><tt>ORIGINAL_LANGUAGE</tt> -- a language code indicating the
original language of the current title. If omitted, it is assumed to be the
same as given by the LANGUAGE attribute.</li>
<li><tt>THEME</tt> -- one or more thematic keywords (referring to
the <tt>tema.lst</tt> table) for this title.</li>
</ul>
</p>
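<p>The structured values of <tt>IMAGE_SOURCE</tt> and <tt>MARC</tt> can be
taken apart as in the sketch below. The function names are hypothetical, and
the assumption that <tt>MARC</tt> entries are separated by whitespace is
inferred only from the single example given above.</p>

```python
# Source archives named in the IMAGE_SOURCE description.
KNOWN_ARCHIVES = {"utoronto", "umich", "google"}


def parse_image_source(value):
    """Split an IMAGE_SOURCE value into (archive, identifier)."""
    archive, _, identifier = value.strip().partition(":")
    if archive not in KNOWN_ARCHIVES or not identifier:
        raise ValueError("unrecognized IMAGE_SOURCE value: %r" % value)
    return archive, identifier


def parse_marc(value):
    """Split a MARC value into a dict of catalog name -> record id.

    Assumption: entries are whitespace-separated 'catalog:id' tokens.
    """
    records = {}
    for token in value.split():
        catalog, _, record_id = token.partition(":")
        records[catalog] = record_id
    return records
```

<p>For instance, <tt>parse_image_source("umich:ACV2565")</tt> yields the
archive <tt>umich</tt> and the identifier <tt>ACV2565</tt>, which a caller
could then use to build a link into that archive.</p>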