Datasets for Natural Language Processing and Text Mining

“Datasets are the water and air of our field.” —Christopher Potts, professor of linguistics, Stanford University.

Available datasets:

Filename	Date	Size	Format	Description
salmonsen-2.lst	Jan. 2023	153 MB 157,926 lines	.lst 4 fields	Danish encyclopedia Salmonsens konversationsleksikon (2nd ed., 26 volumes, 1915—1930) Fields: 1. filename, 2. article heading, 3. full text with mark-up, 4. author signature.
a.lst	Aug. 1995	2 MB ~35,000 lines	.lst 7 fields in ISO 8859-1	List of Nordic Authors (and other relevant people) with years of birth and death, nationality and "Runeberg author ID".
t.lst	~1995	0.5 MB ~6,000 lines	.lst 9 fields in ISO 8859-1	List of titles (books) presented by Project Runeberg.
Articles.lst	1998		.lst 3 fields	Available for each of Project Runeberg's volumes (books), as described in Project Runeberg's Electronic Facsimile Editions of Nordic Literature (May 1999) Download through link in the page footer for each book.
Pages.lst	1998		.lst 2 fields

What is it?: In January 2023, we launched this page for digitized books that we have converted into "datasets", which can be downloaded and analyzed for fun and research. For an introduction to the terms, see Wikipedia on Natural language processing and Text mining.
.lst file format: .lst files (short for "list") are Project Runeberg's name for plain text files with one record per line and fields separated by a vertical bar (|). They are normally in UTF-8 encoding; the older a.lst and t.lst are in ISO 8859-1. This is essentially a variant of the CSV (comma-separated values) or TSV (tab-separated values) file format. In some files, a hash (#) marks lines that are comments.
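As an illustration of the format, here is a minimal Python sketch of a .lst reader. It assumes that a comment is a line that starts with a hash, which is our reading of the description above; the function name parse_lst and the filename "0/0000" in the sample are made up for the example.

```python
from typing import Iterable, Iterator, List


def parse_lst(lines: Iterable[str], skip_comments: bool = True) -> Iterator[List[str]]:
    """Yield one list of fields per record line of a .lst file."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        if skip_comments and line.startswith("#"):
            continue  # assumption: comments start at the first column
        yield line.split("|")


# Sample input: the comment line and the filename "0/0000" are placeholders.
sample = [
    "# placeholder comment line\n",
    "0/0000|Nizza|<b>Nizza</b>, se <sp>Nice</sp>.|\n",
]
records = list(parse_lst(sample))
print(records)
```

For a real file, pass an open file object, e.g. open("salmonsen-2.lst", encoding="utf-8"), instead of the sample list.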

Tutorial

Here's how the salmonsen-2 dataset can be used on a Linux or UNIX system:

$ wget http://runeberg.org/salmonsen/2/salmonsen-2.lst
$ ls -sh salmonsen-2.lst 
  153M salmonsen-2.lst
$ wc salmonsen-2.lst
  157926  22910911 159458481 salmonsen-2.lst
$ sum salmonsen-2.lst 
  18292 155722

Filter out all the mark-up:

$ sed 's/<[^<>]*>//g' salmonsen-2.lst | wc
  157926 22827081 152873194

The full text (field 3) contains the mark-up added during proofreading, which is mostly a subset of HTML, including <b>boldface</b>, <i>italics</i>, <sp>spaced-out text</sp>, <img>images</img>, <poem>poetry</poem>, and <table>tables</table>.
Spaced-out text is used for given names of biographies and also redirections to other articles, e.g.:
<b>Nizza</b>, se <sp>Nice</sp>.
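The same mark-up can be handled in Python. The sketch below reuses the pattern from the sed command to strip tags, and recognizes a redirection only when it has exactly the shape of the Nizza example; real redirections may well vary, so the REDIRECT pattern is an assumption, and both function names are made up for the example.

```python
import re
from typing import Optional

# Same pattern as in the sed command above: anything between < and >.
TAG = re.compile(r"<[^<>]*>")

# Assumption: a redirection looks exactly like the Nizza example.
REDIRECT = re.compile(r"^<b>[^<>]*</b>, se <sp>([^<>]*)</sp>\.$")


def strip_markup(text: str) -> str:
    """Remove all tags, keeping the text between them."""
    return TAG.sub("", text)


def redirect_target(text: str) -> Optional[str]:
    """Return the article referred to, or None if this is not a redirection."""
    match = REDIRECT.match(text.strip())
    return match.group(1) if match else None


example = "<b>Nizza</b>, se <sp>Nice</sp>."
print(strip_markup(example))
print(redirect_target(example))
```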

How is Denmark's capital spelled in a century-old encyclopedia?

$ grep København  salmonsen-2.lst | wc -l
    174
$ grep Kjøbenhavn salmonsen-2.lst | wc -l
   1512
$ grep Köpenhamn  salmonsen-2.lst | wc -l
      3
$ grep Copenhagen salmonsen-2.lst | wc -l
      9

To find the longest articles, use awk with vertical bar as the field separator:

$ awk '-F|' '{print length($3), $1, $2}' salmonsen-2.lst | sort -nr | head -2
  279759 24/0888 Verdenskrigen (Økonomi og Erhvervspolitik.)
  272009 8/0779 Frankrig (Historie)

To read the longest article, put its field 1 value ("24/0888") into the URL:
http://runeberg.org/salmonsen/2/24/0888.html
During proofreading, some very long articles, e.g. "Verdenskrigen" (World War) and "Frankrig" (France), were divided into their sections. Even so, the history section of France exceeds a quarter of a million characters.
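The ranking and the URL construction can also be sketched in Python. The function names are made up for the example, and the rows below carry placeholder texts of arbitrary length, not the real articles; only the filenames and headings are taken from the output above.

```python
import heapq

BASE_URL = "http://runeberg.org/salmonsen/2/"


def article_url(filename):
    """Build the page URL from field 1 of a record."""
    return BASE_URL + filename + ".html"


def longest_articles(rows, n=2):
    """Return the n rows with the longest text (field 3)."""
    return heapq.nlargest(n, rows, key=lambda row: len(row[2]))


# Placeholder rows: [filename, heading, text, signature]
rows = [
    ["0/0000", "Nizza", "x" * 10, ""],
    ["24/0888", "Verdenskrigen (Økonomi og Erhvervspolitik.)", "x" * 30, ""],
    ["8/0779", "Frankrig (Historie)", "x" * 20, ""],
]
for row in longest_articles(rows):
    print(len(row[2]), row[1], article_url(row[0]))
```

Note that len() in Python counts Unicode characters, while awk may count bytes or characters depending on the implementation and locale.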

Who were the most prolific author signatures?

$ awk '-F|' '{print $4}' salmonsen-2.lst | sort | uniq -c | sort -nr | head -5
  59983 
   4341 G. Ht.
   3787 A. Hk.
   3168 M. V.
   2173 H. H. R.

The winners are G. Ht. (geographer Gudmund Hatt), A. Hk. (statistician Axel Holck), M. V. (geographer Martin Vahl) and H. H. R. (librarian Hans H. Ræder). This should be no surprise, since many of the articles cover geographic places.

Some 59,983 articles lack an author signature. But the majority of these are shorter than 100 characters, mere redirections such as "Nizza, see Nice":

$ awk '-F|' '$4=="" && length($3) < 100' salmonsen-2.lst | wc -l
  53415
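For readers who prefer Python to awk, the two counts above could be sketched with the standard library alone. The function name and the sample rows are made up for the example; the texts are placeholders.

```python
from collections import Counter


def signature_stats(rows, short=100):
    """Count articles per signature (field 4) and the unsigned articles
    whose text (field 3) is shorter than `short` characters.

    `rows` must be a list, since it is traversed twice.
    """
    counts = Counter(row[3] for row in rows)
    short_unsigned = sum(
        1 for row in rows if row[3] == "" and len(row[2]) < short
    )
    return counts, short_unsigned


# Placeholder rows: [filename, heading, text, signature]
rows = [
    ["0/0001", "A", "x" * 500, "G. Ht."],
    ["0/0002", "Nizza", "<b>Nizza</b>, se <sp>Nice</sp>.", ""],
    ["0/0003", "C", "y" * 500, ""],
    ["0/0004", "D", "z" * 300, "G. Ht."],
]
counts, short_unsigned = signature_stats(rows)
print(counts.most_common(5))
print(short_unsigned)
```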

For a more advanced analysis, the dataset can be loaded into a pandas.DataFrame in Python:

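import csv
import pandas as pd

with open('salmonsen-2.lst', encoding='utf-8') as source:
    dataset = pd.read_csv(source, sep="|", header=None,
                          names = ["filename", "title", "text", "sign"],
                          # treat double quotes in the text as ordinary characters
                          quoting = csv.QUOTE_NONE,
                          # allow empty strings, avoid NaN floats
                          keep_default_na = False)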

Work in progress...

At the time of writing, in January 2023, Project Runeberg has just celebrated its 30th anniversary. It started out in December 1992 (timeline) as an e-text archive on a Gopher server, added electronic facsimiles (scanned page images) in 1998 and online proofreading around 2003, and has grown to 3.2 million book pages. In most cases, the result is plain text with minimalistic mark-up. Linked or structured data is provided as a thin framework around this.

Existing structured data

Our website is maintained as and built from a source file tree with plain text (and image) files. In short, every book has a file directory and every chapter or page has a file, in a very shallow and wide file tree (more of an underbrush than a tree). The files for running text sometimes have filenames with suffix .txt, sometimes .html. The filename suffix .lst stands for "list" and represents files of lines and fields (or comma-separated values, CSV) with vertical bar (|) as the field separator and hash (#) for comments.

Our electronic editions (books, volumes) have a URL structure and some metadata. This was true before 1998, but with the addition in that year of electronic facsimile editions, two files Pages.lst and Articles.lst were introduced for each volume, creating the structure of scanned pages and book chapters (or articles).

The file Articles.lst has a line for each chapter or article, with fields for chapter filename, heading, and list of scanned page filenames. But the chapter heading is just one field, and there is no convention for how to express the article author or article category within it. For some periodicals, this is done in the style of "Ättestupan. Af Fredrik Nycander", in others as "Kautsky, Karl: Demokrati eller diktatur". These author names are not linked data, so there is no easy way to find all articles by each author, or all articles on a particular topic or of a particular genre (poem, review, commentary, biography).
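To show what recovering authors from such headings might involve, here is a heuristic Python sketch that handles only the two styles quoted above. The patterns and the function name split_heading are assumptions made for the example; other periodicals may use other styles, and headings that merely contain a comma and a colon would be misread.

```python
import re
from typing import Optional, Tuple

# Style 1: "Title. Af Author"
AF_STYLE = re.compile(r"^(?P<title>.+?)\. Af (?P<author>.+)$")
# Style 2: "Surname, Given: Title"
COLON_STYLE = re.compile(r"^(?P<surname>[^,:]+), (?P<given>[^,:]+): (?P<title>.+)$")


def split_heading(heading: str) -> Tuple[Optional[str], str]:
    """Return (author, title); author is None if no known style matches."""
    match = COLON_STYLE.match(heading)
    if match:
        author = match.group("given") + " " + match.group("surname")
        return author, match.group("title")
    match = AF_STYLE.match(heading)
    if match:
        return match.group("author"), match.group("title")
    return None, heading


print(split_heading("Ättestupan. Af Fredrik Nycander"))
print(split_heading("Kautsky, Karl: Demokrati eller diktatur"))
```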

Linked data for authors is only available on the work (book) level. There is a separate set of presentations of Nordic Authors, whose main purpose is to keep track of their years of death, for copyright reasons. Each author has a presentation page, e.g. http://runeberg.org/authors/shakewil.html for William Shakespeare, and these filenames (e.g. shakewil) are known in Wikipedia and Wikidata as their "Runeberg author ID" or Property:P3154. On our end, the authors' years of birth and death, names, nationalities, professions, and IDs are listed in the "a.lst" file of the Nordic Authors section. There is also a "t.lst" file that lists all the works (titles, books), derived from each work's metadata.

Multi-volume works, series, and periodicals are not always handled uniformly. We try to avoid more than two levels of depth, but exceptions exist:

For one encyclopedia, the URL http://runeberg.org/tieto/6/0333.html represents the work (/tieto/), volume 6, scanned page 0333.
For another, the URL http://runeberg.org/salmonsen/2/3/0098.html represents the work (/salmonsen/), 2nd edition, volume 3, scanned page 0098.
For a third, the URL http://runeberg.org/nfca/0612.html represents the work, 2nd edition, volume 21 (/nfca/), scanned page 0612.
For a periodical with parallel sections or subeditions, the URL http://runeberg.org/tektid/1929m/0103.html represents the periodical (/tektid/), its mechanical engineering section of year 1929 (/1929m/), scanned page 0103. Combining the year and section in this way has turned out to be a smart way to save one level of hierarchy and create a flatter, more easily manageable structure.
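Because the depth varies, code that takes these URLs apart cannot assume a fixed number of path components. A minimal Python sketch follows; the function name and the generic key "levels" are made up for the example, since the meaning of the middle components (edition, volume, year and section) differs from work to work.

```python
from urllib.parse import urlparse


def split_page_url(url):
    """Split a scanned-page URL into work, intermediate levels, and page."""
    path = urlparse(url).path             # e.g. "/tieto/6/0333.html"
    parts = path.strip("/").split("/")
    page = parts[-1].rsplit(".", 1)[0]    # drop the ".html" suffix
    return {"work": parts[0], "levels": parts[1:-1], "page": page}


for url in [
    "http://runeberg.org/tieto/6/0333.html",
    "http://runeberg.org/salmonsen/2/3/0098.html",
    "http://runeberg.org/nfca/0612.html",
    "http://runeberg.org/tektid/1929m/0103.html",
]:
    print(split_page_url(url))
```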

Introducing datasets

From the structured data described above, a static website is built, suitable for browsing and reading the scanned books, and in particular to link directly to scanned pages from Wikipedia and other websites. Together with full-text web search (primarily using Google), this covers most use cases. It is also possible to download the entire text and/or images of a scanned work or volume, for further processing.

In January 2023, the full text of one encyclopedia (one that has been entirely proofread) was converted into a single .lst (i.e. CSV-like) file with one line per article. This makes it easy to experiment in new ways with pattern matching, statistics, information retrieval, and text mining. For this particular encyclopedia, most articles carry an author signature, and these were stored in a separate field.

Even though this Danish encyclopedia, Salmonsens konversationsleksikon (2nd ed., 26 volumes, 1915-1930), had been digitized in 2004-2008 and fully proofread in 2011, some 989 extra edits were needed to fix syntax and spelling errors for (and discovered during) the conversion. The resulting file is not guaranteed to be entirely correct. If new errors are found and corrected, should a new dataset be generated and released? Immediately? Or at certain intervals? For now, only a static dataset from January 2023 is released, with the size and checksum shown in the tutorial above.

The dataset is a 153 megabyte text file in UTF-8 (Unicode). On a Linux or UNIX system, it can be handled like any normal text file with command line tools such as grep, sed, and awk. It should also be possible to import it as a CSV file into any spreadsheet program, but LibreOffice (version 7.4) complains that it exceeds the maximum number of characters per cell, apparently limited to 32767 characters.

(See tutorial above...)

This encyclopedia was published between 1915 and 1930, after the Danish spelling reform of 1892 that changed kj (Kjøbenhavn) to k (København). Even so, the older spelling appears on nearly nine times as many lines (1,512 versus 174), perhaps because the articles mention old book titles. The English name of Denmark's capital appears on 9 lines, the Swedish name on only 3. Another spelling reform in 1948 introduced the letter Å (for AA) and dropped capitalization of nouns, but this encyclopedia writes "paa denne Maade" and not "på denne måde".
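The comparison of spellings can be reproduced in Python. The sketch below counts matching lines the way grep does, and then computes the ratio from the two counts reported in the tutorial; the function name and the three sample lines are made up for the example.

```python
def count_lines_containing(lines, word):
    """Count the lines that contain `word`, like `grep word | wc -l`."""
    return sum(1 for line in lines if word in line)


# Made-up sample lines, only to show the function at work.
sample = ["Kjøbenhavn 1890", "København", "Nice"]
print(count_lines_containing(sample, "København"))

# Line counts reported by grep in the tutorial.
kjobenhavn_lines = 1512
kobenhavn_lines = 174
ratio = kjobenhavn_lines / kobenhavn_lines
print(round(ratio, 1))
```

For the real dataset, pass an open file object such as open("salmonsen-2.lst", encoding="utf-8") as the first argument.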

The dataset has four fields:
1. a part of the URL where the article starts,
2. the article heading as specified in the <chapter> tag during proofreading,
3. the full text of the article except for the author signature, and
4. the author signature.

To be continued...

Project Runeberg, Sun Feb 5 21:05:17 2023 (aronsson) https://runeberg.org/admin/datasets.html