Appendix A

Character Sets, Character Encodings, and Document Character Sets

Related Content
Table A.1 (Latin 1 Characters) | Table A.2 (Character Set Names) | References

Entity Test Documents
Latin-1 | Miscellaneous (HTML 4) | Symbols (HTML 4)

SGML Entity Definition Files
ISO Latin 1 | Symbols | Miscellaneous

The communication of text-based information between computers is far more complicated than most people suspect. This is due in part to the anarchistic development of computer standards, as well as to the historical lack of understanding, by software designers, of the important technical, cultural, and political issues associated with languages and character sets. Fortunately, this is an era when these issues are finally being resolved, and the future will soon bring a day when true, multilingual content flows freely on the Web.

Understanding these character set issues, and how they affect the creation of Web content, requires understanding three technical areas: computer character sets, document character encodings, and document character sets. These areas are strongly related--and somewhat confusing in their relationships! The next three sections outline the basic details, and will hopefully clarify the most perplexing points.

Computer Character Sets

A computer character set is simply an agreed-upon relationship between binary codes and a set of letters or graphical characters. Since most computers use bytes (8 bits) as the basic storage unit, many (but not all!) character sets use individual bytes to store single characters. With such character sets, the value of a given byte corresponds to a specific character, as defined by the character set being used. Having 8 bits, a byte can represent any one of up to 256 different characters (256 = 2^8), while any defined relationship between these 256 codes and a particular set of graphical characters is called an 8-bit character set. ISO Latin-1, the "traditional" character set of the World Wide Web, is one example.
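As a small illustration of this byte-to-character relationship, the following Python sketch decodes byte values using Latin-1. It assumes Python's "latin-1" codec as a stand-in for ISO 8859-1; the variable names are only illustrative.

```python
# In an 8-bit character set, each byte value (0-255) selects one character.
byte_value = 233
character = bytes([byte_value]).decode("latin-1")
print(byte_value, "->", character)

# All 256 possible byte values map to 256 distinct characters.
all_characters = bytes(range(256)).decode("latin-1")
distinct_count = len(set(all_characters))
print(distinct_count)
```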

There are many character sets, and in general each is optimized for a different language or writing system (e.g., Cyrillic, Arabic, Japanese, Chinese, Korean, etc.). However, Latin-1 is by far the most common character set in current use. Latin-1, more formally named ISO 8859-1, is described in more detail later in this appendix.

Recently, the Web community has agreed to standardize on a new, 16-bit character set--known as the Universal Character Set (UCS) portion of ISO 10646--as the default for Web applications. Unlike Latin-1, this character set uses more than one byte to store a character, and defines tens of thousands of characters, including most of the symbols from the majority of the world's languages. The use of UCS within Web applications is described in more detail a bit later.

Character Encodings

When a document is created, it is created using a specific character set. This is referred to as the character encoding of the document. For example, a document created using the Latin-1 character set is said to be encoded using ISO Latin-1. To put the distinction more formally, a character set is an abstract relationship between characters and bytes, whereas a character encoding is the specific instance of one such relationship as applied to a particular document.

This distinction is important because, when documents are sent from one machine to another, they are separated from the character sets used to create them. The recipient receives only the bytes that encoded the characters in the document--and these bytes are meaningless without an understanding of the encoding used to create them. Thus, the recipient must be told of the encoding used for those data before it can convert the data back into the correct characters. Mechanisms for indicating the encoding, when data are passed from machine to machine, are discussed later.
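To see why the bytes alone are not enough, consider this minimal Python sketch, which decodes the same received byte under two different 8-bit encodings. The codec names are Python's labels for ISO 8859-1 and ISO 8859-5 (Cyrillic), and the byte value is chosen only for illustration.

```python
# The recipient sees only this byte; its meaning depends on the encoding.
received = b"\xe9"

as_latin1 = received.decode("iso-8859-1")    # Western European reading
as_cyrillic = received.decode("iso-8859-5")  # Cyrillic reading

print(repr(as_latin1), repr(as_cyrillic))
```

The two results are different characters, even though the underlying byte is identical.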

Default Encodings

On the Web, most documents are currently distributed without any encoding information. In this case, the software receiving the document must assume an encoding. At present, most browsers assume that HTML (or other text) documents are encoded using the Latin-1 character set unless configured otherwise. On many browsers (e.g., Internet Explorer 3/4, or Netscape Navigator 3/4), the user can change the assumed default encoding using a drop-down menu.

URLs, on the other hand, are always encoded using Latin-1--the URL specification defines ISO Latin-1 as the sole encoding for URLs. Thus, all software must translate the bytes in a URL into the characters defined by the Latin-1 character set. Similarly, HTTP headers must also be encoded using Latin-1.

Universal Character Sets and the Document Character Set

Most character sets present limitations that are unacceptable for a truly "World" Wide Web. The basic problem is that most sets restrict an author to a limited set of characters--for example, to 256 characters if using an 8-bit character set. Although there are several 8-bit character sets, optimized for different languages, an author cannot, using a single 8-bit character set, encode characters from different sets within the same document (for example, Japanese characters within Cyrillic text). Thus the pages are really not "universal," in the sense of allowing truly multilingual content.
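The limitation can be demonstrated with a short Python sketch. It assumes Python's "iso-8859-5" codec as an example of an 8-bit Cyrillic character set; the sample strings are arbitrary.

```python
cyrillic = "Привет"
japanese = "日本語"

# Cyrillic text fits in the 8-bit Cyrillic set: one byte per character.
cyrillic_bytes = cyrillic.encode("iso-8859-5")

# Mixed Cyrillic and Japanese text cannot be encoded in that same set.
try:
    (cyrillic + japanese).encode("iso-8859-5")
    mixed_ok = True
except UnicodeEncodeError:
    mixed_ok = False

print(len(cyrillic_bytes), mixed_ok)
```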

Character and Entity References

In part to get around these limitations, HTML supports mechanisms for representing any "defined" character using special sequences of ASCII characters. These mechanisms are called character references, which reference characters using numbers, and entity references, which reference them using symbolic names. For example, the character reference for the character é is &#233; (the semicolon is necessary and terminates the special reference), while the entity reference for this same character is &eacute;. Of course, for entity references to be meaningful, there must be a way of relating the entity names to a particular character. These definitions are also part of the HTML specification. Indeed, the HTML specification defines every entity reference in terms of a specific character reference; for example, it states that the entity &eacute; is equivalent to the symbol referenced by the character reference &#233;. This, of course, still leaves the problem of relating the character reference to the desired character. This is the job of the document character set.

Character References and the Document Character Set

For character references to be useful, there must be a universal list that relates references to characters--for example, the reference &#233; to the character é, independent of the encoding used to create a document. This list, known as the document character set and also specified in the HTML specification, defines a universal relationship between numeric references and actual characters. Thus, the reference &#233; defines the character é, even if the reference is typed using a character encoding that does not support the referenced character.

For HTML, the document character set is the 16-bit Universal Character Set (UCS) portion of ISO 10646 (this is formally equivalent to Unicode 2.0). This set defines many thousands of characters or symbols (2^16 = 65,536; but not all the positions in this set are actually assigned characters), encompassing the symbols of most of the world's languages. In an HTML document, character references refer to the position of the character in the UCS character set. Thus, the reference &#233; refers to the character at position 233 in UCS (the character é), while the reference &#948; refers to the character at position 948 (the Greek lowercase letter delta, δ). Importantly, the first 256 characters in UCS are equivalent to the first 256 characters of ISO Latin-1.
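The relationship between references and UCS positions can be checked with a brief Python sketch. It relies on Python's standard html module, and on the assumption that Python's Unicode code points coincide with the UCS positions discussed here.

```python
import html
from html.entities import name2codepoint

# A character reference names a position; an entity reference names a symbol.
from_number = html.unescape("&#233;")
from_name = html.unescape("&eacute;")
eacute_position = name2codepoint["eacute"]
delta = html.unescape("&#948;")
print(from_number, from_name, eacute_position, delta)

# The first 256 positions agree with ISO Latin-1.
latin1_matches_ucs = all(
    bytes([n]).decode("latin-1") == chr(n) for n in range(256)
)
print(latin1_matches_ucs)
```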

Table A.1 lists the ISO Latin-1 characters, alongside the defined entity reference names and the numerical positions of these characters in the UCS character set. These entity references are supported by all current browsers.

HTML 4 has tentatively defined many additional entity references, encompassing common symbols from mathematics (Greek letters and mathematical symbols), typography (spaces, bars, and punctuation), and extended Latin letters (e.g., ligatures). Links to HTML documents that describe and test these references are given at the top of this document.

Note that these character and entity references are not understood by Netscape Navigator 4. Even when they are understood (for example, by Internet Explorer 4), they may not be displayed--the computer must also be equipped with a font capable of displaying the desired character. Thus the computer may "know" that the code &#948; corresponds to the Greek lowercase character "delta," but may not have a font capable of displaying that symbol.

The ISO Latin-1 Character Set

Currently, with most World Wide Web applications, the default set of printable characters is the 8-bit ISO Latin-1 (also known as ISO 8859-1) character set, shown in Table A.1. This character set is defined by the International Organization for Standardization (ISO), an organization responsible for a number of international character set specifications. A browser or other Web application will assume that text files are encoded using ISO Latin-1, unless some other encoding is specified.

The first 128 positions in ISO Latin-1 are equivalent to the 128 characters of the US-ASCII--also known as ISO 646--character set. (US-ASCII is known as a 7-bit character set, since it defines only 128 characters, and can be represented using just seven bits: 128 = 2^7.) Of these 128 characters, 32 are known as control characters, and are used to control printing devices and serial communications lines or devices (such as modems or terminals) [1]. Control characters are not printable, and are indicated in Table A.1 by the two- or three-letter character sequences that mnemonically designate their function. For example, NUL is a null character, BEL is the bell character (rings a bell), CR is carriage return, BS is the backspace character, and so on. In addition, Table A.1 includes the space character (decimal 32) with the symbol SP, which would otherwise be invisible.

[1] Formally these control characters are not ISO Latin-1 characters, but are part of another ISO specification, which defines octal codes for special data line control characters.

Some important control characters, and their meanings, are:

Character	Meaning	Decimal Code Position
NUL	Null character	00
BS	Backspace	08
HT	Tab	09
LF	Line Feed/New Line (also NL)	10
CR	Carriage return	13
SP	Space character	32
DEL	Delete	127

ISO Latin-1 has an additional 128 characters, corresponding to decimal values from 128 to 255. The first 32 are unprintable control characters, marked in Table A.1 by a double dash "--". The remaining characters are printable characters, consisting of many of the accented and other special characters common in western European languages.
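The layout described above can be inspected with a short Python sketch. It assumes Python's "latin-1" codec, which (unlike the formal standard) also maps the control positions, and uses the Unicode general category "Cc" to identify control characters.

```python
import unicodedata

ascii_text = bytes(range(128)).decode("ascii")
latin1_text = bytes(range(256)).decode("latin-1")

# Positions 0-127 of Latin-1 are the same as US-ASCII.
same_lower_half = latin1_text[:128] == ascii_text

# Count the control characters among positions 128-255.
upper_controls = [
    c for c in latin1_text[128:] if unicodedata.category(c) == "Cc"
]
print(same_lower_half, len(upper_controls))

# Positions 160-255 hold the accented and special characters.
print(latin1_text[160:])
```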

ISO Latin 1 Character Table

A table of these characters is found in the attached HTML document, Table A.1.

Character Encodings in URLs

As discussed in Chapter 8, URLs can contain any ISO Latin-1 character (ISO Latin-1 is the defined character set for URLs), but must be written using a small subset of the printable ASCII characters. Any 8-bit ISO Latin-1 character can be entered in a URL by an indirect reference. These encodings take the form:

%xx

where xx is the hexadecimal or hex code corresponding to the character--this is simply the position of the character in the character set, written as a hexadecimal (base 16) number. Table A.1 shows the hexadecimal codes for all the ISO Latin-1 and control characters. As an example, the URL encoding for the string %toads is:

%25toads

since the percent character is character 37 (hexadecimal 25) in the character set.
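The same encoding can be reproduced with Python's standard urllib.parse module. This is only a sketch: the encoding argument is set to "latin-1" to match the character set described here, and safe="" is an illustrative choice that encodes every reserved character.

```python
from urllib.parse import quote, unquote

# The percent sign is character 37 decimal, hexadecimal 25.
print(hex(ord("%")))

encoded = quote("%toads", safe="", encoding="latin-1")
e_acute = quote("\u00e9", safe="", encoding="latin-1")  # e with acute accent
decoded = unquote("%25toads", encoding="latin-1")

print(encoded, e_acute, decoded)
```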

Character and Entity References Revisited

As mentioned previously, any character can be represented by either a character or entity reference. A character reference represents each character by the numeric position of the character in the UCS character set. Thus, the character reference for a capital U with an umlaut (Ü) is &#220;, since this is the character at position 220 (decimal) in UCS.

As of HTML 4, character references can also be given as hexadecimal numbers. For example, the capital U with an umlaut (Ü) can be referenced as either of:

&#220;	Decimal character reference
&#xDC;	Hexadecimal character reference

where the letter "x" just after the hash character indicates a hexadecimal character reference. Current browsers, however, do not understand hex character references, so this form should be avoided in HTML documents.
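As an illustration, the following Python sketch builds both forms of reference from a character's position and confirms that they name the same character. The helper char_refs is hypothetical, introduced only for this example.

```python
import html

def char_refs(character):
    """Return (decimal, hexadecimal) character references for one character."""
    position = ord(character)
    return f"&#{position};", f"&#x{position:X};"

# Capital U with an umlaut sits at position 220 (hexadecimal DC).
decimal_ref, hex_ref = char_refs("\u00dc")
print(decimal_ref, hex_ref)
print(html.unescape(decimal_ref) == html.unescape(hex_ref))
```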

In HTML, the four ASCII characters (>), (<), ("), and (&) are interpreted in special ways (e.g., the < marks the start of a markup tag). To display them as ordinary characters, you should use entity references in their place--the references for these characters are listed in Table A.1.
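When generating HTML from a program, this replacement is usually automated. A minimal sketch using Python's standard html.escape function follows; the sample string is arbitrary.

```python
import html

raw = 'if a < b && b > c then say "ok"'

# Replace the four special characters with their entity references.
escaped = html.escape(raw, quote=True)
print(escaped)

# Decoding the references recovers the original text.
round_trip = html.unescape(escaped)
print(round_trip == raw)
```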

Character and Entity Reference Test Documents

The book's supporting Web site contains documents that illustrate the defined HTML entity references--these are also useful for testing a browser's support for entities. These documents can be found at:

en_test.html ISO Latin 1 entities
en_misc.html Miscellaneous entities
en_symbol.html Symbol entities

Entity Reference Caveats

Although an entity or character reference should always end with a semicolon, the terminating semicolon can formally be left off if this does not confuse the parsing of the reference. For example, "&Uuml is an..." contains an acceptable entity reference for the character Ü, while "&Uumlis an..." does not.

Also, an ampersand character indicates the start of a reference only when it is followed by an ASCII letter character (e.g., &a to start an entity reference), or by the hash character (e.g., &#2 to start a character reference). If an ampersand character does not appear in either of these contexts, then it is treated as a regular character. To be safe, however, it is best to use the ampersand character's entity reference &amp; to represent the character itself.

Indicating Character Encodings via MIME Content-Types

The MIME protocol supports a charset parameter to indicate the character encoding used within a text component. The mechanism uses a content-type header of the form

Content-type: text/subtype; charset=character_set

where subtype gives the subtype of the text document (html, plain, etc.) and character_set indicates the character set used to encode the data. The World Wide Web assumes the type ISO-8859-1 in the absence of any specified charset. If a server sends out a document encoded in a different character set, it should then return an HTTP content-type header that indicates both the text type and the charset value.
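A receiving program might extract the type and charset from such a header as sketched below. This uses Python's standard email.message module; the function name parse_content_type is hypothetical, and the ISO-8859-1 fallback mirrors the default described above.

```python
from email.message import Message

def parse_content_type(header_value, default_charset="iso-8859-1"):
    """Return (media type, charset) for a content-type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # The charset is returned in lowercase; the default applies if it is absent.
    return msg.get_content_type(), msg.get_content_charset(default_charset)

print(parse_content_type("text/html; charset=ISO-8859-5"))
print(parse_content_type("text/plain"))
```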

Unfortunately, many servers do not send charset information, while many older browsers do not understand content-type headers containing charset specifications, and will not properly identify the MIME type if a charset parameter is present. For this reason, most browsers support HTML META tags for indicating the character set. The form is:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=character_set">

which places an entire content-type header in the META tag. This works because "most" character sets use ASCII characters in positions 0 to 127, so that a browser can assume just about any encoding, and still read the HTML HEAD. In general, a browser will first look to the real HTTP header to find the charset; if the header does not indicate the charset, the browser will then look for an appropriate META element in the document HEAD. Failing that, the browser may "test" the document--many character set encodings can be detected by looking for patterns in the first few bytes. Failing that, most browsers assume a default character set, usually ISO Latin-1.
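The lookup order just described can be sketched in Python as follows. This is a simplified illustration, not how any particular browser works: the function guess_encoding, the regular expression, the 1024-byte search window, and the byte-order-mark test are all assumptions made for the example.

```python
import codecs
import re
from email.message import Message

# Loose pattern for a charset value inside a META element (an assumption).
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)

def guess_encoding(content_type_header, body, default="iso-8859-1"):
    """Guess the encoding of body (bytes) using the order described above."""
    # 1. The real HTTP content-type header, if it names a charset.
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()
        if charset:
            return charset
    # 2. A META element near the start of the document.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. A "test" of the first few bytes (here, only byte-order marks).
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # 4. The assumed default.
    return default

page = b'<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-5"></HEAD>'
print(guess_encoding("text/html", page))
```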

Common Character Set Names for Character Encodings

There are dozens of character sets in common use. The ISO, for example, specifies the family of 8-bit character sets ISO 8859-1 through ISO 8859-10, which includes ISO 8859-1 (Latin-1) as well as sets for other writing systems. The ISO, however, is not the only organization to define character sets--many common sets were defined by national standards bodies, independently of the ISO. The attached HTML document, Table A.2, lists some of the more common ones. Note that many of the text labels used there are not standardized names (as indicated by the leading x-). Where available, the table lists the Internet RFCs that document the character set.

Encoding via the UCS Character Set

As described earlier, HTML 4 now specifies the two-byte (16-bit) character set known as UCS (this is a portion of the much larger ISO 10646 character set) as the document character set for all HTML documents. UCS/Unicode is compatible with ISO 8859-1, in that the first 256 characters of ISO 10646 are equivalent to ISO 8859-1. For example, the hexadecimal codes for the letter "k" in the two character sets are:

ISO 10646	ISO 8859-1
00 6b	6b

where the first "plane" of characters in Unicode is referenced by the null leading byte (note how the order of the bytes is now important). Documents written using this two-byte format are said to be encoded via the UCS-2 encoding (see bottom of Table A.2).
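These byte sequences can be reproduced in Python. Python has no codec named UCS-2, so this sketch assumes the "utf-16-be" and "utf-16-le" codecs as stand-ins; they produce the same two bytes for characters in the 16-bit range discussed here.

```python
# Two-byte encoding, most significant byte first: null leading byte, then 6b.
k_two_byte_be = "k".encode("utf-16-be")

# The single-byte Latin-1 encoding of the same character.
k_latin1 = "k".encode("latin-1")

# With the opposite byte order the same character is stored as 6b 00.
k_two_byte_le = "k".encode("utf-16-le")

print(k_two_byte_be.hex(), k_latin1.hex(), k_two_byte_le.hex())
```

The last line shows why byte order matters: a reader that assumes the wrong order would interpret the two bytes as a different position in the character set.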

Unicode characters can also be encoded using a special 8-bit encoding, known as UTF-8. In this "compact" encoding, the first 128 Unicode characters are coded identically with their ASCII character equivalents.
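A brief Python sketch shows this property of UTF-8: ASCII characters keep their single-byte codes, while characters beyond position 127 take more than one byte. The sample characters are chosen only for illustration.

```python
ascii_chars = bytes(range(128)).decode("ascii")

# The first 128 characters encode identically in ASCII and UTF-8.
same_as_ascii = ascii_chars.encode("utf-8") == ascii_chars.encode("ascii")

# Characters above position 127 need multiple bytes.
e_acute_utf8 = "\u00e9".encode("utf-8")   # position 233
delta_utf8 = "\u03b4".encode("utf-8")     # position 948

print(same_as_ascii, e_acute_utf8.hex(), delta_utf8.hex())
```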

The issues surrounding characters and character sets are both complicated and confusing. The references at the end of this appendix are useful, should you wish to delve deeper into these issues.