Project Runeberg - Nordic Words / Frekvens 20070122


This README file belongs to the file archive available at http://runeberg.org/words/frekvens-20070122.tgz

The files in this archive document word frequencies by year and language, based on raw or proofread text from Project Runeberg's electronic facsimile editions, as of January 22, 2007.

Project Runeberg is an archive of freely available electronic editions of classic out-of-copyright Scandinavian literature, http://runeberg.org/

Most of its titles consist of scanned images (electronic facsimiles) and raw text from optical character recognition (OCR), proofread to varying degrees. Volunteers are welcome to help proofread the scanned text.

Since the scanned images depict a particular printed edition, the resulting text is tied to a publishing year and to a particular orthography (details in spelling), which is not the case for electronic texts that are not backed by scanned images.

Although Ibsen's drama Peer Gynt was written in 1867 and first performed in 1876, its reprint in the author's collected works in 1898 reflects the state of the Norwegian language in that later year. This is the kind of Norwegian spelling that people were reading in 1898. It might be the author's original spelling from 1867 or a version modernized in 1898, but it cannot have been modernized beyond the publishing year.

The files herein are plain text, encoded in UTF-8. The file no-1880.top contains word frequencies in Norwegian books printed in the year 1880. Each line holds a count followed by a word, so the first line of the following list means that the word "og" occurred 8161 times.

   8161 og
   5569 i
   3896 at
   3616 af
   3359 den
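
As an illustration of how such a list could be read back, here is a minimal Python sketch. It assumes every line has the form shown above (optional leading spaces, a count, whitespace, then the word). The function name parse_top_lines is hypothetical and not part of the archive.

```python
def parse_top_lines(lines):
    """Parse lines such as '   8161 og' into (word, count) pairs."""
    entries = []
    for line in lines:
        parts = line.split(None, 1)  # the count first, then the word
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        count, word = parts
        entries.append((word.strip(), int(count)))
    return entries


# The five lines quoted above from no-1880.top.
sample = [
    "   8161 og",
    "   5569 i",
    "   3896 at",
    "   3616 af",
    "   3359 den",
]

entries = parse_top_lines(sample)
total = sum(count for _, count in entries)
for word, count in entries:
    # Share of each word within these five lines only, not the whole file.
    print(f"{word}\t{count}\t{count / total:.1%}")
```

With a real file, the lines would come from opening it with encoding="utf-8", since the archive states that all files are UTF-8.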

The words were extracted with hunspell 1.1.4, using the following affix and dictionary files:

   ---- blank.aff ----
   SET UTF-8
   WORDCHARS .:-'0123456789

   ---- blank.dic ----
   1
   xyzzy

and the Unix/Linux command line:

   sed 's/<[^>]*>//g' *.txt |
     hunspell -d blank -l |
     sort | uniq -c | sort -nrf
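
The pipeline above appears to rely on hunspell's -l option, which lists the words it does not find in the dictionary; with a dictionary containing only "xyzzy", practically every word is listed. The Python sketch below approximates the same steps (strip tags, split into words, count, sort by descending count). It is not hunspell's tokenizer: the regular expression is an assumption about how the WORDCHARS setting behaves, and ties are ordered alphabetically rather than by sort -nrf. The sample sentence is invented.

```python
import re
from collections import Counter

TAG = re.compile(r"<[^>]*>")   # same pattern as the sed step
EXTRA = ".:-'0123456789"       # WORDCHARS from blank.aff

# Assumption: a word is a run of letters, digits and the extra characters.
TOKEN = re.compile(r"[\w" + re.escape(EXTRA) + r"]+")


def word_frequencies(text):
    """Return (word, count) pairs, highest count first."""
    text = TAG.sub("", text)
    # Drop runs made only of punctuation such as '-' or '...'.
    tokens = [t for t in TOKEN.findall(text) if any(c.isalnum() for c in t)]
    counts = Counter(tokens)
    # Ties are broken alphabetically to give a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


sample = "<p>Han kom i 1700-talet, og hun kom etc. Han gik.</p>"
result = word_frequencies(sample)
for word, count in result:
    print(f"{count:7d} {word}")
```

The output keeps "etc." and "gik." with their periods, which matches the behaviour the README describes for WORDCHARS.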

Having hyphen, period, colon, apostrophe and digits in WORDCHARS means the output list will contain words such as "etc.", "Dyre-", "General-Vejmester", "3-årig" (3-year-old), "1700-talet" (18th century), "n:o" (numero), "1:20000" (map scale), "12:50" and "23:-" (prices). However, it also means that some words appear with a sentence-final period attached.
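
A user who wants to reduce the effect of sentence-final periods could merge "word." into "word" after the fact. The sketch below is one possible heuristic, not something done in the archive: it folds a form with trailing periods into its bare form only when the bare form also occurs on its own, so an abbreviation like "etc." is kept unless "etc" is attested too. The function name and the counts in the example are invented for illustration.

```python
from collections import Counter


def fold_trailing_periods(entries):
    """Merge 'word.' into 'word' when the bare form also occurs."""
    counts = Counter()
    for word, count in entries:
        counts[word] += count

    folded = Counter()
    for word, count in counts.items():
        bare = word.rstrip(".")
        # Fold only when the bare form is attested on its own.
        if bare and bare != word and bare in counts:
            folded[bare] += count
        else:
            folded[word] += count
    return folded


# Invented counts, for illustration only.
example = [("og", 100), ("og.", 3), ("etc.", 7), ("Dyre-", 2)]
folded = fold_trailing_periods(example)
for word, count in folded.most_common():
    print(f"{count:7d} {word}")
```

The heuristic still merges an abbreviation whenever its bare form happens to exist as a separate word, so the result should be treated as approximate.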

OCR errors from non-proofread text also appear in the lists, e.g. "wwTQft" and "forunderJigere". This can only be improved by further proofreading. Using only the fully proofread pages would have reduced the amount of text too much.

The following printed and scanned volumes were used for each file. Prefix each volume name with http://runeberg.org/ to get its address.

file         volumes
no-1880.top  norge80
no-1883.top  tekuke/1883
no-1884.top  tekuke/1884 tekuke/1884pat
no-1888.top  tekuke/1888
no-1889.top  tekuke/1889
no-1890.top  tekuke/1890
no-1891.top  tekuke/1891
no-1892.top  tekuke/1892 tekuke/1892pat
no-1893.top  tekuke/1893
no-1894.top  tekuke/1894
no-1896.top  ilnolihi/1 ilnolihi/2 ilnolihi/3 ilnolihi/4
no-1900.top  ibsen/1 ibsen/2 ibsen/3 ibsen/4 ibsen/5 ibsen/6 ibsen/7 ibsen/8 ibsen/9 ibsen/10
no-1903.top  brand
no-1905.top  ilnolih2
no-1907.top  bjorfort
no-1910.top  bjornson/1 bjornson/2 bjornson/3 bjornson/4 bjornson/5
no-1916.top  urmakeri
no-1934.top  bokogbib/1934
no-1935.top  bokogbib/1935
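
To show how one row of this list maps to a language, a year and a set of addresses, here is a small Python sketch. It assumes that file names follow the pattern language-year.top, as in no-1880.top, and that the volume names are separated from the file name by whitespace. The function name parse_volume_line is hypothetical.

```python
import re

BASE_URL = "http://runeberg.org/"

# Assumed naming pattern: language code, hyphen, four-digit year, ".top".
NAME = re.compile(r"^([a-z]+)-(\d{4})\.top$")


def parse_volume_line(line):
    """Split one row into (language, year, list of volume URLs)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    match = NAME.match(fields[0])
    if match is None:
        raise ValueError(f"unexpected file name: {fields[0]!r}")
    language, year = match.group(1), int(match.group(2))
    # Prefix each volume name with the base address, as the README says.
    urls = [BASE_URL + volume for volume in fields[1:]]
    return language, year, urls


language, year, urls = parse_volume_line("no-1884.top tekuke/1884 tekuke/1884pat")
print(language, year)
for url in urls:
    print(url)
```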


Project Runeberg, Thu Dec 20 02:32:47 2012 (aronsson)
http://runeberg.org/words/frekvens-20070122.html
