
Comparison of OCR output for blackletter/fraktur

by Lars Aronsson on 29 June 2021.

Below are some frequency statistics, the output of a program that sorts items (characters or words) into classes, where each class holds items that occur about half as often as those in the class before it (inspired by Zipf's law). In the first example, whitespace, shown as [space], is the most common item, occurring 112582 times. Putting this at the center of the first 2-logarithmic interval, the first line of output presents items that occur at least 112582 / 1.4 = 80416 times, which includes [space] and the letter e. The following line of output presents items that occur at least half as often, that is 80416 / 2 = 40208 times or more.
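The program itself is not shown here, so the following is only a minimal sketch of the classing scheme as described above. The function names, the rounding of the first threshold, and the integer halving of the later ones are assumptions, chosen because they reproduce the thresholds quoted in this article (80416, 40208, 20104, ...).

```python
from collections import Counter

def class_thresholds(top, ratio=1.4, n_classes=9):
    """Lower bounds of the classes: top/ratio, then repeated integer halving."""
    thresholds = []
    t = round(top / ratio)          # assumption: first bound is rounded
    while t >= 1 and len(thresholds) < n_classes:
        thresholds.append(t)
        t //= 2                     # assumption: later bounds are halved, rounding down
    return thresholds

def zipf_classes(items, ratio=1.4, n_classes=9):
    """Group items into classes of halving frequency, most frequent first."""
    ordered = Counter(items).most_common()   # sorted by count, descending
    if not ordered:
        return []
    classes = []
    pos = 0
    for t in class_thresholds(ordered[0][1], ratio, n_classes):
        members = []
        # Take every item not yet placed whose count reaches this bound.
        while pos < len(ordered) and ordered[pos][1] >= t:
            members.append(ordered[pos][0])
            pos += 1
        classes.append((t, members))
    return classes
```

For character statistics the items would be the characters of the text, e.g. `zipf_classes(text)`, and for word statistics something like `zipf_classes(text.split())`. Splitting on whitespace is a guess, based on tokens such as "Glenarvan." and "|" appearing as words in the tables below.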

2011 Norwegian digitization

The Norwegian translation Kaptein Grants børn (1901, printed in blackletter/fraktur) of Jules Verne's novel "Les Enfants du capitaine Grant" (In Search of the Castaways) was digitized in 2011 by the Norwegian National Library as 745 page images in 400 dpi JPEG.

At that time (2011), the library had a rather poor OCR process for fraktur. The text is 938933 bytes, 931665 characters long. The most commonly occurring characters are:

   80416 or more times: [space] e 
   40208+: a n t r d 
   20104+: i l o s g 
   10052+: \n m f v h , 
    5026+: k u b . p 
    2513+: y j " D 
    1256+: „ M c H G
     628+: « P A ! ? ; T S J 
     314+: N - I E 1 O ' K V å B 2 x 6 R F 

We can see here that æ and ø are not among the commonly occurring letters, which is strange for a Norwegian text. In fact, this OCR was made for German text (ä, ö), not Norwegian (æ, ø), but with a Norwegian dictionary. The long s is not present, because the OCR has rendered it as a modern short s, which is the way Scandinavians tend to prefer to transcribe fraktur.
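A quick sanity check of this kind can be sketched in a few lines. The function name and the default set of letters are illustrative assumptions, not part of the program used for the tables.

```python
from collections import Counter

def letter_report(text, expected="æøåſ"):
    """Count how often each expected letter occurs in an OCR text."""
    counts = Counter(text)
    # Counter returns 0 for letters that never occur.
    return {ch: counts[ch] for ch in expected}
```

A count of zero, or close to zero, for æ and ø in a long Norwegian text would hint that the OCR was set up for a different alphabet.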

Commonly occurring words (of 130503 in all, 23340 unique):

    2432 or more times: og at 
    1216+: i en af til den paa det var de 
     608+: for med sig er han som et der havde De itte om 
     304+: Glenarvan jeg man Paganel Det ved har forn vi men saa Men
           ikke fra denne John sin blev faa vilde da hans sagde ham
           dem over vil
     152+: Han kunde vere efter nu dette to op sine eller I Den os
           alle mig mod kan ud Og Grant hvis hvor Mangles svarede min
           Robert mere have lady nogle maatte Ayrton dog Man meget
           fagde ind disse end hvad deres fig Der Glenarvan. Helena
           endnu majoren raabte maa

Here we can see the words "itte", "forn", "fagde" and "fig", which are OCR errors for "ikke", "som", "sagde" and "sig" (the last three are printed with an initial long s). The word "faa" might be correct or an OCR error for "saa".
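Errors of the long-s kind can be flagged mechanically if a list of known words is at hand. The sketch below is one possible approach, not the method used for this article: it turns one or more "f" into "s" and checks the result against the list. Both function names and the idea of a `known_words` list are assumptions.

```python
from itertools import product

def s_variants(word):
    """All spellings obtained by turning one or more 'f' into 's'."""
    slots = [("f", "s") if ch == "f" else (ch,) for ch in word]
    variants = {"".join(combo) for combo in product(*slots)}
    variants.discard(word)          # drop the unchanged spelling
    return variants

def suspect_long_s(ocr_words, known_words):
    """Map unknown OCR words to known words reachable by f -> s."""
    known = set(known_words)
    suspects = {}
    for word in ocr_words:
        if word in known or "f" not in word:
            continue                # known words are never flagged
        hits = sorted(s_variants(word) & known)
        if hits:
            suspects[word] = hits
    return suspects
```

Since "faa" is itself a valid word, a check like this leaves it alone. This is exactly the ambiguity noted above: only the context can tell whether a given "faa" should have been "saa".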

dan_frak

In June 2021, I made a better OCR text with Tesseract 4.1 using the third-party language dan_frak (Danish fraktur). Historically, Danish and Norwegian have been the same language in printed literature, with Norwegian (bokmål) slowly diverging in spelling from 1850 to 1920.
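For readers who want to try something similar, the sketch below builds a standard Tesseract command line (`tesseract imagename outputbase -l lang`) for one page image. The file names and function names are hypothetical. It assumes that the `tesseract` program is installed and that a `dan_frak` traineddata file is available to it, and it is not necessarily how the text discussed here was produced.

```python
import subprocess

def tesseract_command(image_path, output_base, lang="dan_frak"):
    """Build the argument list for one page; text goes to output_base + '.txt'."""
    return ["tesseract", image_path, output_base, "-l", lang]

def run_ocr(image_path, output_base, lang="dan_frak"):
    """Run Tesseract on one page image; raises CalledProcessError on failure."""
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
```

For example, `run_ocr("page_0001.jpg", "page_0001")` would write the recognized text to `page_0001.txt`.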

My OCR text is 1007787 bytes, 980804 characters long. Commonly occurring characters are:

  102080 or more times: [space] e 
   51040+: a n r t 
   25520+: d s i l \n g o 
   12760+: m k v , f h 
    6380+: u b . p 
    3190+: « - y æ j ø D » 
    1595+: · M — c G H J 
     797+: P A ; ? T S 
     398+: 1 N ^L E ’ x K O V F L B : R " z 

Here æ and ø do occur (in the 3190+ range, 6th line), but long s is missing, since it has been rendered as a modern short s.

Commonly occurring words (of 159989 in all, 27141 unique):

    2944 or more times: og at 
    1472+: i en til af den paa det var de som ikke sig 
     736+: med for er han et der havde - De Glenarvan om 
     368+: saa man jeg Det Paganel har ved sagde fra vi Men sin men
           denne kunde blev John da vilde hans nu ham vil
     184+: dem Han svarede over være efter to dette sine skulde disse
           eller kan s kun Den os alle spurgte mod Jeg op Grant ud
           Mangles mig J raabte Og hvis endnu Helena Robert have mere
           min l hvor ogsaa svarte faa maatte uden nogle hvad meget
           Man dog sit lady Ja end majoren selv deres Der as maa

As can be seen, "itte" and "forn" are no longer present. The word "faa" now occurs among words with 184-368 occurrences, which is fewer than in the previous OCR text. Note that "spurgte" (a correct word) appears in the 184+ range (184-368 occurrences). This is good OCR output, which a human can start to proofread without feeling that it would be better to just transcribe the text from scratch.
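Comparisons like this can also be done programmatically. The following is a small sketch under the assumption that both OCR texts are available as lists of words. The function names and the two thresholds are illustrative.

```python
from collections import Counter

def frequent_words(words, min_count):
    """Set of words that occur at least min_count times."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= min_count}

def dropped_out(old_words, new_words, old_min, new_min):
    """Words frequent in the old OCR text but not in the new one."""
    old = frequent_words(old_words, old_min)
    new = frequent_words(new_words, new_min)
    return sorted(old - new)
```

With thresholds of 152 and 184, matching the lowest classes listed for the two texts, the result would include words such as "itte" and "forn". A word dropping out is only a hint that an error was fixed. It still has to be judged by a human.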

Internet Archive in June 2021

Also in June 2021, I uploaded the scanned page images to the Internet Archive where they were OCR processed with Tesseract 5.0 (alpha) and the language Norwegian (bokmål).

The resulting text is 1044229 bytes, 1008377 characters long. Commonly occurring characters are:

  119808 or more times: [space] e 
   59904+: a 
   29952+: n r t f d \n i g l o 
   14976+: m v 
    7488+: , h u b . j p s 
    3744+: k æ y ø ſ D 
    1872+: — - M G „ E ” H 
     936+: A N P J c S ; T “ " 
     468+: ? B I R K 8 å L ! O 3 F 1 | * V 2 0 

Here the æ and ø are present, and so is the long s "ſ", all three in the range of 3744+ occurrences (6th line).

Commonly occurring words (of 164988 in all, 29657 unique):

    2880 or more times: og at 
    1440+: i af en til paa var den det de fom fig med 
     720+: for er han et havde De der faa iffe om Det 
     360+: ikke man jeg Glenarvan har ved | fra Paganel vi fin men Men
           blev vilde Den denne John hans vil da I Han å være
     180+: fagde efter ham over funde to dem fine nu em eller op mod
           alle fan ud dette Å Grant fvarede mig fun Der Jeg raabte Og
           endnu mere fit have fpurgte hvis Robert min Da Helena uden
           maatte Duncan hvor meget ord Man nogle 2 Ayrton Mangles Dem
           under hvad end jagde maa majoren - Glenarvan. Ja fvarte

Some obvious OCR errors appear: fom (som), fig (sig), iffe (ikke), fin (sin), fagde (sagde), funde (kunde), fine (sine), fan (kan), fvarede (svarede), fun (kun), fit (sit), fpurgte (spurgte), jagde (sagde), fvarte (svarte). The word "faa" also occurs far too often, in the range of 720-1440 times (many of which are OCR errors for "saa").

This is not good OCR output.

This looks like blind OCR without the support of a dictionary. A proper dictionary-guided OCR should not output fpurgte when spurgte is in the dictionary. Here "fpurgte" appears in the 180+ range, "ſpurgte" (with long s) only in the 22+ range, and "jpurgte" in the 5+ range.
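Counts for such competing spellings can be collected with a short helper. This is a sketch only: the function name is an assumption, and it handles just the first letter of a word that begins with s.

```python
from collections import Counter

def variant_counts(words, target="spurgte", confusions=("s", "ſ", "f", "j")):
    """Count spellings of target whose first letter is swapped for a look-alike."""
    counts = Counter(words)
    rest = target[1:]               # everything after the initial letter
    return {c + rest: counts[c + rest] for c in confusions}
```

When the misread forms clearly outnumber the correct one, as with "fpurgte" here, that suggests the recognizer was not consulting a word list.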

Example

Text excerpt

2011 Norwegian digitization:

For en sterk nordost ploiede en pregtig lystyacht
Nordkanalens bolger. Fra agtermasten vaiede det engel
ste stag, medens man paa formastens blåa vimpel kunde
lefe bogstaverne N. <3. med en i guld broderet hertug
krone over.

luchten hed „Duncan" og eiedes af lord Glenar
uan, en af de fexten stotste pairs, forn har fete i over
huset, og et fremtredende medlem af den i de forenede
kongeriger vel bekjendte

Lord Edward Glenarvan, hans frue, lady Helena,
faml en af det unge pars slegtninge, major Mac
Nabbs, befandt fig ombord.

dan_frak:

For en stærk nordost ploiede en prægtig lyftyacht
Nordkanalens bølger· Fra agtermaften vaiede det engel-
ske slag, medens man paa formastens blaa vimpel kunde
læse bogstaverne B. G. med en·i guld broderet hertug-
krone over.

Yachten hed »Duncan« og eiedes af lord Glenar-
van, en af de sexten stotske pairs, som har fæde i over-
huset, og et fremtrædende medlem af den i «de forenede
·kongeriger vel bekjendte »Royal—Thames—yacht—olub«.

Lord Edward Glenarvan, hans frue, lady Helena,
samt en af det unge pars slegtninge, major Mac
« Nabos, befandt sig ombord.

2021 Internet Archive:

For en ſterk nordoft pløiede em prægtig lyſtyacht
Nordkanalens bølger. Fra agtermaftern vatede det engel
ffe flag, medens man paa formaftens blaa vimpel funde
fæfje bogftaverne E. G. med em t guld broderet hertug-
frone over.

Yachten hed ,Duncan” og eiedes af ford Glfenar-
van, en af De ferten ffotffe pairs, fom har fæde i over
huſet, og et fremtrædende medlem af den i De forenede
fongeriger vel befjendte ,Royal-Thames-yacht-club*.

Lord. Edward Glenarvan, hans frue, lady Helena,
famt em af det unge pars flegtninge, major Mac
Nabbs, befandt fig ombord.

Manually proofread:

For en stærk nordost pløiede en prægtig lystyacht
Nordkanalens bølger. Fra agtermasten vaiede det
engelske flag, medens man paa formastens blaa vimpel kunde
læse bogstaverne E. G. med en i guld broderet
hertugkrone over.

Yachten hed „Duncan“ og eiedes af lord
Glenarvan, en af de sexten skotske pairs, som har sæde i
overhuset, og et fremtrædende medlem af den i de forenede
kongeriger vel bekjendte „Royal-Thames-yacht-club“.

Lord Edward Glenarvan, hans frue, lady Helena,
samt en af det unge pars slegtninge, major Mac
Nabbs, befandt sig ombord.


Project Runeberg, Tue Jun 29 13:27:19 2021 (aronsson)
http://runeberg.org/admin/20210629-ocr.html
