- Project Runeberg -  About Project Runeberg /
Unicode character set

Table of Contents / InnehÄll | << Previous | Next >>
  Project Runeberg | Like | Catalog | Recent Changes | Donate | Comments? |   

Update: In December 2012, all of Project Runeberg's scanned works were converted from ISO 8859-1 (Latin-1) to UTF-8 (Unicode), as described in "Projekt Runebergs konvertering till Unicode" (in Swedish).

Unicode Character Set

by Lars Aronsson, March 2006.

Are you seeing strange, broken or missing letters on Project Runeberg's website? Here is an explanation why that can happen, and how we can help each other by reporting errors and fixing problems. If you have any questions, write to the editors@runeberg.org

Historic Background

Computers are numeric machines. Internally, they can only handle numbers. When we use them for text, every letter and other printable symbol is represented by a number. This representation must follow some rule, known as a character set.

In the oldest time, all computer storage was expensive and very limited character sets were used. Already in the 1960s, the American Standard Code for Information Interchange, or ASCII, was designed to allow for 128 different characters, including the 26 letters of the English alphabet in both UPPER and lower case, the 10 digits, comma, period, colon, parentheses, brackets, apostrophe, and a handful of other useful symbols. What ASCII didn't contain was accented characters (àéñ) and Scandinavian letters (åäöæøðþ).

When Project Runeberg was new in 1992, ASCII was slowly being replaced by extended character sets of twice the size, or 256 characters. We settled for one called ISO 8859-1. As can be seen from the name, this is an international standard from the United Nations' International Organization for Standardization (ISO). The standard ISO 8859 defines a whole family of character sets, where the version 8859-1 contains the characters used in Northern and Western Europe, including Scandinavia. It is also known as Latin-1. This was a natural choice because it was the default character set in the Linux operating system and the X Window System. Microsoft also adopted it as its default, starting with Microsoft Windows version 3.1.

The World Wide Web was also new in the early 1990s and since ASCII wasn't quite dead yet, the HTML file format can use ASCII to write any non-ASCII character, in a way known as an HTML entity. Such entities start with the ampersand (&), followed by the character's name written in ASCII and a semicolon (;). For example, the letter ä is known as &auml; which is short for a-with-umlaut. In addition to this, numeric HTML entities use the hash sign (#) followed by a decimal number instead of the character's name. Since the letter ä happens to be number 228, it can be written as &#228; as well.

The numbers used in numeric HTML entities must go way beyond 256 to represent all Greek, Arabic, Hebrew, Russian, Chinese, and Japanese characters. The full list of several hundred thousand letters is called the Universal Character Set (UCS), also known as Unicode. This too is an international standard, ISO 10646, first published in 1990. Version 3 was published in the year 2000.

While HTML entities have the double advantage of representing all strange letters in plain old ASCII, and being a de facto standard on the web, they also have drawbacks. When possible, it is always preferable to use a "native" encoding rather than ASCII. Project Runeberg has used ISO 8859-1 from the start in 1992, and always avoided HTML entities. As the years have gone by, we have felt an increasing need for changing to a full Unicode representation. At the same time, the increasing support for Unicode in commonly used operating systems and web browsers have made it possible for us to do this transition. However, this is not an easy decision. We believe that Microsoft Windows 95, almost as old as Project Runeberg, is still being used, and doesn't fully support Unicode. We don't want to leave our users behind.

Project Runeberg's Transition to Unicode

In February 2005, Pieni Tietosanakirja, a small encyclopedia in Finnish, was the first work we published in Unicode. This is because it contains the letters š and ž sometimes used in older Finnish texts, that are not available in Latin-1. For our Unicode files, we have settled for an encoding known as UTF-8, short for the 8-bit Unicode Transformation Format.

All new works that we have digitized after the summer of 2005 have been published in UTF-8.

Our plan is also to convert all previously published works to UTF-8, but we're moving slowly, one work at a time. Some works that need Greek letters for mathematical equations have been converted early. When we digitize a new volume of a journal, we try to convert all previously digitized volumes, so the entire work will use the same character set. With time, more and more works will be converted.

Problems

The conversion can cause problems, resulting in letters that are missing or broken. This can be caused by software problems on our side, or by lacking Unicode support in your web browser or operating system. Either way, Project Runeberg's editors are interested to learn what the problems can be and want to work with you to find solutions.

If you see a page with broken or missing characters, of a kind that isn't already depicted here, we would appreciate if you could send us a screenshot to show how it looked on your computer. Write to editors@runeberg.org

Problem Description Example Screenshot
A web page in UTF-8 where the title contains Latin-1 characters that haven't been properly converted. The broken characters are shown as black diamonds by the browser Mozilla 1.7 on Linux.

The cause is a software problem at Project Runeberg. This should be reported to the editors, who can fix this.

screenshot
The title contains UTF-8 letters, but has been converted to UTF-8 once more, as if it contained Latin-1 letters. When this happens, every original letter (å) is converted to several letters.

The cause is a software problem at Project Runeberg. This should be reported to the editors, who can fix this.

screenshot
In the Recent Changes page, one line contains black diamonds. In the Mozilla 1.7 browser on Linux, this indicates some Latin-1 characters in a UTF-8 page.

The cause is a software problem at Project Runeberg. This should be reported to the editors, who can fix this.

screenshot
In the OCR text of a facsimile page, an apostrophe was represented according to the Windows-1252 character set (position 146), rather than any character from Latin-1. The OCR text was converted to UTF-8 under the assumption that it was Latin-1, and the apostrophe was turned into Unicode position 146, the control code PU2. The Mozilla 1.7 browser on Linux shows this as superscript PU2, right after the word musiker on the middle line.

Apostrophes, double quotes, and long dashes are the most likely victims of this kind of conversion error.

The cause is a software problem at Project Runeberg. This should be reported to the editors, who can fix this.

screenshot
The OCR text of a book page is in UTF-8, but the surrounding web page is in Latin-1. When this happens, every original letter appears as several "garbage" letters. A typical garbage letter is upper case A-tilde (Ã).

The cause is a software problem at Project Runeberg. This should be reported to the editors, who can fix this. Probably, the whole book should be moved to UTF-8, since the OCR text is already in this character set.

screenshot
The OCR text of a book page is in Latin-1, but the surrounding web page is in UTF-8. The Mozilla 1.7 browser on Linux shows the broken letters as black diamonds.

The cause is a software problem at Project Runeberg. This should be reported to the editors, who can fix this.

screenshot
One title in the catalog (Lärobok i mineralogi) contains broken letters because it has wrongly been passed twice through the UTF-8 conversion. Note how the other lines are fine, as is the author's name on the same line. These names and titles are pulled together from various sources, and only one is affected.

The cause is a software problem at Project Runeberg. This should be reported to the editors, who can fix this.

screenshot
The book title (...för Svenskarne...) having prematurely been converted to UTF-8, is presented as part of a web page in Latin-1. Instead of the usual broken letters or black diamonds, the Mozilla 1.7 browser on Linux prints a single question mark and drops several of the following letters, recovering only at the second letter of the next word.

The cause is a software problem at Project Runeberg. This should be reported to the editors, who can fix this.

screenshot


Project Runeberg, Sat Jan 19 02:34:01 2013 (aronsson) (diff) (history) (download) << Previous Next >>
http://runeberg.org/admin/unicode.html

Valid HTML 4.0! All our files are DRM-free