Last Updated: 30 August 2000

 

APPENDIX A

The XHTML 1.0 Web Development Sourcebook

Character Sets, Character Encoding, and Document Character Sets

 

One of the main subjects of this book is the technology behind Web application design. For Web page or XML markup design, an important background issue is the digital representation of the text itself: the sequence of characters that make up the content and markup. This low level of individual characters, character sets, and character set encodings is in some ways more complex than either markup or formatting. In general, authors do not need to know the details of character sets, as most complications are handled automatically by editing or rendering software, once the software knows the way in which a digital document was created.


However, Web designers need to understand some issues related to character sets, as information about character sets and encodings must be provided when digital documents are distributed over the Web. If this is not done, a recipient of a document will not know how the document was created, and will not know how to convert the data back into characters.


This Appendix provides a very brief introduction to character set issues, designed to help Web designers understand the main issues. Additional information about character sets is available from the references at the end of this Appendix.

 


 

A.1 Character Sets and Character Set Encodings

Text, at its most basic, is a sequence of characters plus semantic rules for how those characters flow (e.g., drawn from left to right, right to left, top to bottom, etc.) and connect (i.e., how punctuation or accents join with adjacent letters). For example, the document you are reading right now consists of so-called Latin letters and numbers (a–z, A–Z, and 0–9), plus some punctuation and other common symbols, which flow naturally from left to right and which, together, spell out meaningful words and sentences (at least, I hope that is the case!).

When those sequences of characters are stored on a computer, they must be represented in a digital format. This process requires two things:

  • A clear specification of the actual characters being considered (the character set), plus a definition of the position or index of each character in the character set (for example, the first character is the letter a, the second b, and so on).
  • A definition of how the character positions (i.e., their positions in the character set) are digitally encoded when the characters are represented (and probably stored) in a digital form.

The first of these is more formally called a coded character set: a set of characters under consideration, each character uniquely identified by its coded position in the set.

NOTE

The term coded character set is used instead of just character set, as the second term is poorly (and often conflictingly) defined in a variety of standards. Modern character set standards, such as ISO 10646 and Unicode (and also the XML specification), chose a different formalism for describing character sets, and use the term coded character set to refer to this more precise definition. Please see www.w3.org/MarkUp/html-spec/charset-harmful.html for a more detailed discussion of this issue.

More formally, one can think of a coded character set as being a function whose domain is a subset of nonnegative integers, and whose range is a set of characters. For example, a set might define the Latin capital letter I to lie at position 73. The second issue is often called the character set encoding and refers to the manner in which the coded position of a character is stored in a binary format.

Whenever a digital representation of text is created, both of these issues are involved. In practice, the procedure works as follows: software takes a defined character (defined, for example, by the user typing it!), finds the index corresponding to the character (e.g., the character I lies at position 73 in the character set), and then digitally encodes this position using the defined character encoding. The encoded representation can then be stored in memory (as would be done when a document is being processed), or it can be stored on disk for future use.

Once this process is used to convert text into a digital format and store it in a file, software can easily reverse the process, for example, to read in a file and then display the text, by

  • Undoing the encoding process, turning the digital data into a sequence of code positions (essentially integers) that reference a sequence of characters.
  • Determining, from knowledge of the character set being used, to which character each index corresponds.

Of course, this is only possible if the software knows the character set and encoding used to create the data. This means that when documents are distributed over the Web, the identity of the character set and encoding used to create the document must be sent as well.
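The round trip described above can be sketched in a few lines. This is only an illustration: the choice of Python and of ISO 8859-1 as the encoding are assumptions made for the example, not something this appendix prescribes.

```python
# Round trip: character -> code position -> encoded bytes, and back again.
text = "I"

# Step 1: find the coded position of the character in the character set.
position = ord(text)

# Step 2: encode that position using a specific encoding (here ISO 8859-1).
encoded = text.encode("iso-8859-1")

# Reversing the process: encoded byte -> code position -> character.
decoded_position = encoded[0]
decoded_text = chr(decoded_position)

print(position, encoded, decoded_text)
```

The reverse steps only work because the same character set and encoding are used in both directions.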

The definition of a coded character set is a bit more complicated than this, because the definitions must also define the nature of the characters. For example, character set specifications define the directionality of characters (whether they are natively drawn from left to right or right to left), when characters are combining characters (such as an accent that should be combined with the previous character), and so on. Indeed, much of a formal character set specification is spent defining characteristics such as these.


 

A.1.1 Character Set Specifications

Each digital character set specification generally defines two things:

  • The set of characters and their positions (the coded character set)
  • One or more encodings by which the character indices can be stored in a binary format

For example, the ISO 8859-1 specification defines a character set (often called Latin-1, consisting of 191 characters common to Western European languages) and a single encoding for that character set (each character is stored in a single byte, encoded according to the position of the character in the set). Thus, ISO 8859-1 says that the character capital letter Q lies at position 81 in the character set, and that this is digitally encoded as the binary string 01010001 (the binary representation of 81). Because the encoding places every character inside a single byte, at most 256 characters can be defined in ISO 8859-1. However, ISO 8859-1 actually defines only 191 characters and their positions (the other positions contain nonprintable control characters, defined in other standards we will not discuss here).
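As a small worked check of the capital Q example, the following sketch (Python is used here purely for illustration) encodes the letter with ISO 8859-1 and prints the resulting byte as eight binary digits.

```python
# Encode capital Q using ISO 8859-1 and inspect the single resulting byte.
encoded = "Q".encode("iso-8859-1")
position = encoded[0]            # the byte value equals the code position
bits = format(position, "08b")   # eight binary digits, zero padded

print(position, bits)
```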

There are many other character sets that encode characters in a single byte. For example, ISO 8859-4 defines a Latin character set covering the Baltic languages (Estonian, Latvian, and Lithuanian), as opposed to the Western European coverage of ISO 8859-1. Thus, if writing an e-mail message in German, an author might choose to use 8859-1, whereas when writing a letter in Estonian, they might use 8859-4. Furthermore, there are many wide character sets that define many thousands of characters (e.g., for the Chinese, Korean, or Japanese writing systems) and that use more complex encodings when storing the characters in digital form (generally requiring multiple bytes per character, and sometimes with more than one encoding supported for a given character set).

Consequently, if a document is created using a specific character set and encoding, the identity of this character set and encoding must be sent with the document when the data is distributed via the Web. For example, if a document is created using the ISO 8859-4 character set, the document and an appropriate identifier (in fact, the string iso-8859-4) must be distributed together. The recipient can use the iso-8859-4 identifier to determine the character set and encoding, and can decode the data to properly display the text. These identifiers are often called charset values, as they are often specified, in MIME content-type headers, using an expression such as:

content-type: text/html; charset=iso-8859-4

which indicates that a message contains HTML-format data, and that the text in the message was composed using the ISO 8859-4 character set and encoding.
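A recipient might extract the charset value from such a header along the following lines. This is a minimal sketch using the Message class from Python's standard email package; the helper name charset_from_header and the sample word are hypothetical and chosen only for illustration.

```python
from email.message import Message

def charset_from_header(header_value):
    """Return the charset parameter of a content-type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

charset = charset_from_header("text/html; charset=iso-8859-4")

# Simulate the bytes a sender would produce, then decode them as a recipient would.
wire_bytes = "Rīga".encode(charset)
text = wire_bytes.decode(charset)

print(charset, text)
```

When the header carries no charset parameter, the helper returns None and the recipient has to find the information some other way.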

The languages of the world support many tens of thousands of different characters. Unfortunately, many traditional character sets, such as ISO 8859-1, define and encode only a small number of those characters. Thus, although French, Estonian, and Chinese documents can be written using the ISO 8859-1, ISO 8859-4, and Big5 character set standards, respectively, the text of those documents cannot be mixed—none of these standards defines the characters used by all three writing systems. This is a big problem for universal document interchange, because text cannot be easily mixed together, nor distributed in a universal format.
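The mixing problem is easy to demonstrate. The sketch below (Python, with a made-up sample string) tries to encode text containing French, Latvian, and Chinese characters with several charsets; the helper can_encode is a hypothetical name, and UTF-8, a universal encoding introduced in the next section, is included for contrast.

```python
# Text that mixes characters from three different writing traditions.
mixed = "café Rīga 中文"

def can_encode(text, charset):
    """Report whether every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

results = {
    name: can_encode(mixed, name)
    for name in ("iso-8859-1", "iso-8859-4", "big5", "utf-8")
}
print(results)
```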


 

A.2 The Universal Character Set

This problem was recognized many years ago, and over the past decade much work took place designing universal sets of characters. The results were specifications for two universal sets, formally known as ISO/IEC 10646:1993 [a specification developed by the International Organization for Standardization (ISO)] and Unicode 2.1 (a specification developed by the Unicode Consortium). Fortunately, the two organizations realized that it was neither sensible nor practical to have two different universal character sets. Consequently, the two schemes were merged such that the most recent versions of Unicode and ISO 10646 define the same sets of characters, at the same locations in a common character set. They are thus identical, for all practical purposes. Indeed, we now refer to a single character set, called the Universal Character Set (UCS), to indicate this single universal standard.

The UCS standard defines a character set that can contain over 1 million possible characters. This includes the characters from the Latin, Cyrillic, Arabic, Hebrew, and other alphabets; Japanese, Chinese, and Korean characters; plus many other characters, punctuation marks, and other symbols. However, many positions in the character set are not yet assigned characters, leaving room for characters and symbols that have not yet been added (such as the symbols used to encode Inuit languages in Northern Canada), and for possible future uses.


 

A.2.1 Document Encoding and the Document Character Set

Formally, UCS is the document character set of all XML (and thus XHTML) and HTML documents. This means that such documents can contain only characters defined in UCS. It also means that numeric character references in an XML document always reference characters by their positions in UCS. Thus, the character reference &#233; refers to the character at position 233 in the UCS character set, which is the character é (e with an acute accent).
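The mapping from a numeric character reference to a UCS position can be checked with a short sketch. It relies on the unescape function from Python's standard html module; the variable names are illustrative only.

```python
import html

# Decimal and hexadecimal forms of the same numeric character reference.
decimal_reference = "&#233;"
hex_reference = "&#xE9;"

character = html.unescape(decimal_reference)
position = ord(character)        # position of the character in UCS
same_character = html.unescape(hex_reference)

print(character, position)
```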

Historically, most Web documents have been written and encoded using the ISO 8859-1 (Latin-1) character set. This is not a problem: documents can be encoded using any character set, provided they only contain characters defined somewhere in UCS, and provided any character references refer to the position of the character in the UCS character set. Fortunately, the characters defined in ISO 8859-1 are defined at exactly the same positions in UCS [e.g., the character at position 233 in ISO 8859-1 is also é (e with an acute accent)], so that all character references in ISO Latin-1 documents are still valid. Unfortunately, this is not the case for many documents written using other character sets. In these cases, the character references often refer to the position of characters in the character set used to create the document, which is usually not the position of the character in UCS. To be valid HTML or XHTML, such documents must have their character references updated to reference the correct UCS code positions.
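To make the difference concrete, the following sketch compares the position of a character in a legacy single-byte character set with its position in UCS. KOI8-R and the Cyrillic letter Ж are assumptions chosen for illustration, and legacy_position is a hypothetical helper.

```python
def legacy_position(char, charset):
    """Position (byte value) of a character in a single-byte character set."""
    return char.encode(charset)[0]

# ISO 8859-1 positions coincide with UCS positions.
latin1_position = legacy_position("é", "iso-8859-1")
latin1_ucs_position = ord("é")

# KOI8-R positions generally do not.
koi8_position = legacy_position("Ж", "koi8-r")
ucs_position = ord("Ж")

# A reference written against the legacy position must be rewritten to use UCS.
correct_reference = "&#%d;" % ucs_position

print(latin1_position, koi8_position, ucs_position, correct_reference)
```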


 

A.2.2 Allowed Tokens in XML Names

As noted in Chapter 2, XML does not define specific tag or attribute names, but provides a framework by which largely arbitrary markup languages can be constructed, with the names (of elements and attributes) being chosen to match the types of data that the language will represent. However, the mechanisms for creating such names are not entirely arbitrary, because software must be able to easily recognize the tag boundaries, and because it must also be easy to process the text making up the element and attribute names.

For this reason, the XML specification carefully classifies the different UCS characters and defines those characters that can be used in element and attribute names. We do not go into the details here; they are found, of course, in the official XML specification document, listed at the end of this appendix. They are also described, in general detail, in Appendix B of the XML Specification Guide.

Note that this step is not required for HTML or XHTML, because these languages predefine the names of all the allowed elements and attributes, so that this flexibility is not available.


 

A.2.3 Binary Encodings of UCS

The UCS character set supports several different encodings. The main encoding, known as UTF-16 (UTF stands for Universal Character Set Transformation Format), stores each character in two bytes, although there is a mechanism for encoding some characters using consecutive two-byte sequences. This is the easiest encoding for software to handle and is often used when UCS text is stored in memory (e.g., by tools such as editors or browsers).

UCS also supports two encodings that use single bytes as the basic encoding unit. The first of these, known as UTF-8, represents each UCS character as a stream of one or more bytes—this encoding uses all the bits in the byte for encoding purposes. A second encoding, known as UTF-7, represents each character as one or more bytes, but uses only the seven least significant bits for encoding purposes.

The Unicode specification calls the seven- and eight-bit encodings transformation formats, because they correspond to a format suitable for storage or transmission. In this context, UTF-8 has the advantage of compactness for text made up mostly of ASCII characters, as the file is then smaller than a corresponding UTF-7- or UTF-16-encoded one. On the other hand, UTF-7 is best when a file is to be transported via older communications technologies, such as old e-mail systems, which may not properly transport information encoded in the most significant bit.
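The trade-offs among the three encodings can be explored with a short sketch. The sample characters are arbitrary choices for illustration; the last one lies outside the first 65,536 positions, so UTF-16 needs two consecutive two-byte units for it. The big-endian codec name "utf-16-be" is used so that no byte-order mark is counted.

```python
# Sample characters: ASCII, Latin-1, a Chinese character, and one beyond position 65,535.
samples = ["A", "é", "中", "\U00010348"]

def encoded_sizes(char):
    """Number of bytes needed for one character in each UCS encoding."""
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-be")),  # big-endian, no byte-order mark
        "utf-7": len(char.encode("utf-7")),
    }

sizes = {char: encoded_sizes(char) for char in samples}

# UTF-7 output never sets the most significant bit of any byte.
seven_bit_only = all(
    byte < 128 for char in samples for byte in char.encode("utf-7")
)

for char, size in sizes.items():
    print(repr(char), size)
```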

The existence of different encodings can create problems when files are stored on disk or sent over the Internet, because character set information must now tag along with the data and be available to subsequent software. If this information is not available, then the next program to see the text will not know how to decode the data and convert it back into the correct characters. Mechanisms for indicating the encoding when data are passed from machine to machine are discussed next.

 

A.2.4 ASCII, ISO 8859-1, and UCS

ASCII, ISO 8859-1, and UCS are the three most common character sets. ASCII is perhaps the oldest: it is a seven-bit character set that defines 95 printable characters in a code space covering the range 0 to 127, and an encoding that stores every ASCII character in a single byte, without using the most significant (eighth) bit. The various ASCII characters and their coded positions are shown in Table A.1. The grayed-out entries correspond to control characters that are not formally defined by ISO 646 (the international counterpart of ASCII); these characters are defined by a separate standard, known as ISO 6429.

ISO 8859-1 defines 191 characters over a range from 0 to 255, and an encoding such that every character is encoded in a single byte. The characters and code positions for those characters defined in ISO 8859-1 are also shown in Table A.1. This is because ISO 8859-1 was defined to extend the ASCII character set, such that ISO 8859-1 defines all the ASCII characters at the same positions (0 through 127) as in the ASCII character set, and extra characters were added in the range from 128 through 255.

UCS was designed similarly: the first 256 positions of UCS (0 through 255) encode exactly the same characters as ISO 8859-1.
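This nesting of ASCII inside ISO 8859-1, and of ISO 8859-1 inside UCS, can be verified mechanically. The sketch below uses Python's standard codecs, whose strings are sequences of UCS code positions; it is a check for illustration rather than part of any specification.

```python
# Decode every possible byte value as ISO 8859-1 and compare with UCS positions.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("iso-8859-1")
latin1_matches_ucs = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# Decode the seven-bit range as ASCII and compare with the ISO 8859-1 result.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches_latin1 = ascii_text == latin1_text[:128]

print(latin1_matches_ucs, ascii_matches_latin1)
```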

Table A.1
ISO 8859-1 (Latin-1) characters and common control characters (ISO 6429), showing the positions in both decimal and hexadecimal notation. Note that these are exactly equivalent to the 256 characters defined in UCS (Unicode/ISO 10646) at positions 0 through 255. Similarly, the characters defined at positions 0 through 127 are exactly the same as those defined by ASCII. Control characters' short names are shown in italics, and printable control characters that are allowed in XML and HTML documents are shown in boldface. The control characters that are forbidden in XML or HTML documents are shown against a gray background. Note that the printable space (32-decimal) and nonbreaking space (160-decimal) characters are denoted by the strings SP and NBSP, as they would otherwise be invisible.
CHARACTER DEC HEX CHARACTER DEC HEX CHARACTER DEC HEX CHARACTER DEC HEX
NUL 0 00 SOH 1 01 STX 2 02 ETX 3 03
EOT 4 04 ENQ 5 05 ACK 6 06 BEL 7 07
BS 8 08 TAB 9 09 LF 10 0a VT 11 0b
NP 12 0c CR 13 0d SO 14 0e SI 15 0f
DLE 16 10 DC1 17 11 DC2 18 12 DC3 19 13
DC4 20 14 NAK 21 15 SYN 22 16 ETB 23 17
CAN 24 18 EM 25 19 SUB 26 1a ESC 27 1b
FS 28 1c GS 29 1d RS 30 1e US 31 1f
SP 32 20 ! 33 21 " 34 22 # 35 23
$ 36 24 % 37 25 & 38 26 ' 39 27
( 40 28 ) 41 29 * 42 2a + 43 2b
, 44 2c - 45 2d . 46 2e / 47 2f
0 48 30 1 49 31 2 50 32 3 51 33
4 52 34 5 53 35 6 54 36 7 55 37
8 56 38 9 57 39 : 58 3a ; 59 3b
< 60 3c = 61 3d > 62 3e ? 63 3f
@ 64 40 A 65 41 B 66 42 C 67 43
D 68 44 E 69 45 F 70 46 G 71 47
H 72 48 I 73 49 J 74 4a K 75 4b
L 76 4c M 77 4d N 78 4e O 79 4f
P 80 50 Q 81 51 R 82 52 S 83 53
T 84 54 U 85 55 V 86 56 W 87 57
X 88 58 Y 89 59 Z 90 5a [ 91 5b
\ 92 5c ] 93 5d ^ 94 5e _ 95 5f
` 96 60 a 97 61 b 98 62 c 99 63
d 100 64 e 101 65 f 102 66 g 103 67
h 104 68 i 105 69 j 106 6a k 107 6b
l 108 6c m 109 6d n 110 6e o 111 6f
p 112 70 q 113 71 r 114 72 s 115 73
t 116 74 u 117 75 v 118 76 w 119 77
x 120 78 y 121 79 z 122 7a { 123 7b
| 124 7c } 125 7d ~ 126 7e DEL 127 7f
128 80 129 81 130 82 131 83
132 84 133 85 134 86 135 87
136 88 137 89 138 8a 139 8b
140 8c 141 8d 142 8e 143 8f
144 90 145 91 146 92 147 93
148 94 149 95 150 96 151 97
152 98 153 99 154 9a 155 9b
156 9c 157 9d 158 9e 159 9f
NBSP 160 a0 ¡ 161 a1 ¢ 162 a2 £ 163 a3
¤ 164 a4 ¥ 165 a5 ¦ 166 a6 § 167 a7
¨ 168 a8 © 169 a9 ª 170 aa « 171 ab
¬ 172 ac - 173 ad ® 174 ae ¯ 175 af
° 176 b0 ± 177 b1 ² 178 b2 ³ 179 b3
´ 180 b4 µ 181 b5 ¶ 182 b6 · 183 b7
¸ 184 b8 ¹ 185 b9 º 186 ba » 187 bb
¼ 188 bc ½ 189 bd ¾ 190 be ¿ 191 bf
À 192 c0 Á 193 c1 Â 194 c2 Ã 195 c3
Ä 196 c4 Å 197 c5 Æ 198 c6 Ç 199 c7
È 200 c8 É 201 c9 Ê 202 ca Ë 203 cb
Ì 204 cc Í 205 cd Î 206 ce Ï 207 cf
Ð 208 d0 Ñ 209 d1 Ò 210 d2 Ó 211 d3
Ô 212 d4 Õ 213 d5 Ö 214 d6 × 215 d7
Ø 216 d8 Ù 217 d9 Ú 218 da Û 219 db
Ü 220 dc Ý 221 dd Þ 222 de ß 223 df
à 224 e0 á 225 e1 â 226 e2 ã 227 e3
ä 228 e4 å 229 e5 æ 230 e6 ç 231 e7
è 232 e8 é 233 e9 ê 234 ea ë 235 eb
ì 236 ec í 237 ed î 238 ee ï 239 ef
ð 240 f0 ñ 241 f1 ò 242 f2 ó 243 f3
ô 244 f4 õ 245 f5 ö 246 f6 ÷ 247 f7
ø 248 f8 ù 249 f9 ú 250 fa û 251 fb
ü 252 fc ý 253 fd þ 254 fe ÿ 255 ff


A.3 Decoding Encoded Documents

As mentioned previously, although the characters in an XML document must be defined in the UCS character set, the document itself need not be encoded using UCS (although it is obviously much easier to process if it is). Indeed, a document can be encoded using any well-understood encoding scheme (ISO 8859-1, EUC-KR, Shift-JIS, etc.), provided the document contains only characters defined in UCS, and provided character references refer to characters by their positions in UCS.

An application reading such data must then know the encoding used to create the data, and it must be able to decode the data to create valid Unicode characters. The two steps required are

  1. Determine the character set and encoding used for the specified data.
  2. Decode the data stream and write it to memory, mapping each encoded item in the input data into the appropriate UCS character and storing this character in memory.

The key is to determine the character set and encoding: Once this is known, the rest is (relatively!) simple.
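Once the charset is known, the second step reduces to a single decoding call in most programming environments. The sketch below uses Python, whose strings hold UCS characters; decode_document is a hypothetical helper, and the sample word and the deliberately wrong charset (KOI8-R) are assumptions chosen to show what happens when step 1 goes wrong.

```python
def decode_document(raw, charset):
    """Step 2: map each encoded item in the input to a UCS character."""
    return raw.decode(charset)

# Bytes as they might be stored on disk by an ISO 8859-1 editor.
raw = "Größe".encode("iso-8859-1")

# Step 1 done correctly: the charset is known to be ISO 8859-1.
text = decode_document(raw, "iso-8859-1")

# Step 1 done incorrectly: the same bytes decode without error, but to other characters.
garbled = decode_document(raw, "koi8-r")

print(text, garbled)
```

Note that the incorrect decoding raises no error here, which is why the charset must be communicated reliably rather than guessed.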


A.3.1 Indicating Character Encodings

When text data is sent to a destination, the delivery process must indicate the character set and encoding used to create the data. Fortunately, there is an easy way of identifying this information, as there is an accepted naming scheme for identifying character set/encoding pairs. These scheme identifiers (simple ASCII text strings such as ISO-8859-1, UTF-8, and Big5) are often called charsets. Table A.2 lists some common charset names, and the languages/writing systems with which they are associated.

There are two ways in which charset information can be included when text data is sent to someone:

  • As part of the message that contains the data (for example, as part of the header that precedes the data being sent)
  • Embedded directly within the data in an easy-to-recognize string

The former is preferred, as it is the most direct and is pretty well guaranteed to work. Indeed, the e-mail MIME mechanism and the Web HTTP communications protocol include mechanisms for specifying the charset of any block of data included in the message. These mechanisms use the MIME content-type headers to specify the type of the data and the charset used to create it. This approach is discussed in the next section.

The latter mechanism is a useful fallback, particularly because files are not always sent by mechanisms that provide charset information; for example, when accessing a file directly from the file system, or retrieving a file via FTP. Note, however, that without foreknowledge of the charset, the software may not be able to read the data to find the string that identifies the charset (this is very much a chicken-and-egg problem).


Specifying Character Encoding in the Message

Both the e-mail message syntax (MIME) and the HTTP protocol use MIME content-type headers for indicating the type of data (e.g., HTML, plain text, XML) being sent. This header is sent ahead of the actual data, and is encoded using characters (ASCII) and a character set (ISO 8859-1) understood by all Internet-aware software. This header supports a charset parameter to indicate the character encoding used within the following (or attached) text component. The form is

Content-type: text/subtype; charset=char-encoding

where subtype is the subtype of the text document (html, plain, xml, etc.), and char-encoding gives the character set and encoding used to create the data. Such headers are included with each part of a MIME-encoded mail message, and every HTTP request or response header can include a content-type header to indicate the type of the data being sent.

Unfortunately, some HTTP servers do not send charset information, while some older browsers and applications cannot handle content-type headers containing charset specifications and they misidentify the MIME type if a charset parameter is present. Thus, in some situations, it is necessary to omit the charset from the content-type header and to hope that the application receiving the data can infer the charset from the content of the document.


Specifying Character Encoding in the Document

For the reasons just mentioned, it is important that documents include markup indicating the encoding used to create them. With XML (and hence XHTML), this information must be placed in the XML declaration. HTML supports a special meta element that does the same thing. The forms in these two cases are

<?xml version="1.0" encoding="char-encoding" ?>

<meta http-equiv="Content-Type"
      content="text/subtype; charset=char-encoding" />

where subtype gives the subtype of the text document (html, xml, etc.), and where char-encoding is a well-known name for the character set and encoding used to create the document (see Table A.2). An XHTML document should include both these specifications. If the document is served as XML, then the value specified in the XML declaration is used. If the document is served as HTML, then the meta-element value is used. An example, assuming an HTML document encoded using UTF-8, is

<meta http-equiv="Content-Type"
      content="text/html; charset=utf-8" />

This approach is practical because most character sets place the standard ASCII characters in positions 0 to 127, and most encodings encode these positions in similar ways (as a single byte corresponding to their position in the character set). Consequently, a browser can assume just about any character encoding, guess the size of the smallest encoding unit (one or two bytes—this can usually be guessed by looking for patterns in the first few bytes), and then read the initial (ASCII-character) text until a markup string is encountered that gives the actual encoding.
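The scanning step can be sketched as follows. This is a simplified illustration, not how any particular browser works: it assumes a one-byte, ASCII-compatible encoding unit (two-byte encodings such as UTF-16 are not handled), and the function name sniff_charset, the regular expressions, and the 1024-byte probe size are all hypothetical choices.

```python
import re

# Patterns for the XML declaration and the HTML meta element, applied to raw bytes.
XML_DECL = re.compile(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)

def sniff_charset(raw, probe_size=1024):
    """Look for a declared charset in the first bytes of a document, else None."""
    head = raw[:probe_size]
    for pattern in (XML_DECL, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None

xml_doc = b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n<html></html>'
html_doc = (b'<html><head><meta http-equiv="Content-Type"\n'
            b'      content="text/html; charset=utf-8" /></head></html>')

print(sniff_charset(xml_doc), sniff_charset(html_doc))
```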

The charset specified inside a document is ignored if the document is received as part of a message, and if the message itself uses a content-type header to indicate the charset.
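This precedence rule amounts to a simple fallback chain, sketched below. The function name effective_charset and the final default of ISO 8859-1 are assumptions made for the example; real software may choose a different default or try to guess from the data.

```python
def effective_charset(header_charset, document_charset, default="iso-8859-1"):
    """Pick the charset to use: message header first, then in-document markup."""
    return header_charset or document_charset or default

# The header wins when both sources are present.
from_header = effective_charset("utf-8", "iso-8859-4")

# The in-document value is used only when the header gives no charset.
from_document = effective_charset(None, "iso-8859-4")

# With neither available, the (assumed) default applies.
from_default = effective_charset(None, None)

print(from_header, from_document, from_default)
```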


A.4 Character Encoding of URL Strings

The issues associated with URL encoding are somewhat different. Here, instead of encoding characters in a binary format, a URL represents an encoding of some underlying text as printed characters (actually, a limited set of ASCII characters). This encoding-as-characters step is required because of the intended use of URLs: they are designed to be easily written down on paper, or to be digitally encoded and sent via old-style e-mail systems that cannot handle complex character encodings.
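As a brief illustration, the sketch below uses the quote and unquote functions from Python's standard urllib.parse module. The sample string is made up, and the use of UTF-8 for the underlying bytes is that library's default, an assumption of this example rather than something stated in this appendix.

```python
from urllib.parse import quote, unquote

original = "résumé file.html"

# Characters outside the allowed ASCII subset become %XX escapes.
encoded = quote(original)      # underlying bytes are taken from UTF-8 by default
restored = unquote(encoded)

only_ascii = all(ord(ch) < 128 for ch in encoded)

print(encoded, restored)
```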

This issue is discussed in somewhat more detail in Chapter 8, Section 8.1.4.


A.5 Names for Common Character Sets and Encodings

There are dozens of character sets and encodings in common use. The ISO, for example, specifies several eight-bit character sets, in addition to ISO 10646 (UCS), and their appropriate encodings. The ISO, however, is not the only organization that defines character sets and encodings—national standards bodies, independent of the ISO, defined many sets. Table A.2 lists some of the more common ones. Many of the text labels used here (left-hand column) are not standardized names (note the leading x-). Where available, Table A.2 lists the Internet RFCs that document the encoding and associated coded character set. RFCs are available at:

http://www.rfc-editor.org/rfc.html

www.ietf.org/

www.rfc-editor.org/rfc/

It is important to note that most of these character sets/encodings are not widely supported—you may be able to encode documents using these character sets, but most users will not be able to view them. For portable documents, you should produce and send text encoded using UTF-8 or UTF-16, although ISO 8859-1 or US-ASCII—with character references for non-Latin characters—are useful options compatible with most current Web software.
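The character-reference approach can be automated. The sketch below relies on the "xmlcharrefreplace" error handler of Python's standard codecs, which substitutes a numeric character reference for any character the target charset cannot represent; the sample text is an arbitrary choice.

```python
import html

text = "naïve Ж 中"

# Characters missing from the target charset become numeric character references.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")
as_latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# A recipient that resolves the references recovers the original text.
recovered = html.unescape(as_ascii.decode("ascii"))

print(as_ascii, as_latin1)
```

The references use UCS positions, so the result is valid regardless of which of the two charsets carries it.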

An official list of well-defined charset names is maintained by the Internet Assigned Numbers Authority (IANA). This list is available at www.isi.edu/in-notes/iana/assignments/character-sets. Note that having a name in this list does not guarantee that software understands the name or knows how to process data so encoded!
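Software can usually be asked whether it recognizes a given charset name. The sketch below does this for Python's own codec registry using codecs.lookup; the helper is_supported and the nonexistent name x-no-such-charset are hypothetical, and the result says nothing about what other software supports.

```python
import codecs

def is_supported(charset_name):
    """Report whether this Python installation knows the given charset name."""
    try:
        codecs.lookup(charset_name)
        return True
    except LookupError:
        return False

# Lookups ignore case, so both spellings resolve to the same codec.
same_codec = codecs.lookup("ISO-8859-1").name == codecs.lookup("iso-8859-1").name

for name in ("UTF-8", "Big5", "x-no-such-charset"):
    print(name, is_supported(name))
```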

 

Table A.2
Some common (and not always official) names for character set encodings, with descriptions. Note that the charset names are case-insensitive. For portable documents, you should avoid distributing text using most of these encodings. The names of encodings that are widely supported (or that must be supported by XML and HTML applications) are in boldface, on a gray background. In the descriptions, the parenthetical term (Windows) means that the charset is Microsoft Windows-specific, and the term (Macintosh) means that the charset is Macintosh-specific.
CHARSET LABEL DESCRIPTION
US-ASCII US ASCII
ISO-8859-1 ISO Latin-1
UTF-8 UCS [ISO 10646/Unicode], one-byte (8-bit) encoding; universal transformation format
UTF-16 UCS [ISO 10646/Unicode], two-byte encoding, including surrogate extension mechanism
Unicode-1-1 Unicode, Version 1.1, two-byte encoding (defined before Unicode and ISO 10646 were "merged")
Unicode-2.0 Unicode, Version 2.0, two-byte encoding (equivalent to UTF-16)
UTF-7 Unicode, one-byte (7-bit) encoding; universal transformation format
ISO-10646-UCS-4, or UCS-4 ISO 10646, four-byte encoding
ISO-10646-UCS-2, or UCS-2 ISO 10646/Unicode, two-byte encoding (same as UTF-16, but only encodes the first 65,536 characters)
x-mac-roman, mac-roman, or MacRoman Like ISO 8859-1, with extra characters in positions 128 through 255 (Macintosh only)
windows-1250, win-1250, or CP-1250 Central European (Windows)
x-mac-ce, mac-CE, or MacCE Central/East European (Macintosh)
ISO-8859-2 Central/East European (Slavic: Czech, Croat, German, Hungarian, Polish, Romanian, Slovak, and Slovenian)
windows-1251, win-1251, or CP-1251 Russian and Central/Eastern European (Windows)
ISO-8859-3 Southern European (Esperanto, Galician, Maltese, and Turkish)
KOI8-R Cyrillic (RFC 1489)
ISO-8859-4 Baltic (Estonian, Latvian, Lithuanian)
ISO-8859-5 Cyrillic (Bulgarian, Byelorussian, Macedonian, Serbian, and Ukrainian)
windows-1257, win-1257, or CP-1257 Baltic (Estonian, Latvian, Lithuanian) (Windows)
x-mac-cyrillic, mac-cyrillic, or MacCyrillic Cyrillic (Macintosh)
ISO-IR-111, or ECMA-Cyrillic Cyrillic
CP-866 Cyrillic
KOI8-U Ukrainian
x-mac-ukrainian, mac-ukrainian, or MacUkrainian Ukrainian (Macintosh)
ISO-8859-6 Arabic
windows-1256, win-1256, or CP-1256 Arabic (Windows)
ISO-8859-7 Greek
x-mac-greek, mac-greek, or MacGreek Greek (Macintosh)
windows-1253, win-1253, or CP-1253 Greek (Windows)
ISO-8859-8 Hebrew
windows-1255, win-1255, or CP-1255 Hebrew (Windows)
ISO-8859-14 Celtic
ISO-8859-15 Western
ARMISCII-8 Armenian
TIS-620 or Windows-874 Thai
ISO-8859-9 Turkish
x-mac-turkish, mac-turkish, or MacTurkish Turkish (Macintosh)
windows-1254, win-1254, or CP-1254 Turkish (Windows)
VISCII Vietnamese
Windows-1258, win-1258, or CP-1258 Vietnamese (Windows)
VIET-VPS Vietnamese
VIET-TCVN5712 Vietnamese
x-mac-croatian, mac-croatian, or MacCroatian Croatian (Macintosh)
x-mac-icelandic, mac-icelandic, or MacIcelandic Icelandic (Macintosh)
x-mac-romanian, mac-romanian, or MacRomanian Romanian (Macintosh)
ISO-8859-10 Greenlandic/Icelandic/Lapp
ISO-2022-JP Japanese (RFC 1468; note that this encoding can use more than one coded character set; the encoding itself indicates the set being used)
Shift_JIS or x-sjis Japanese Shift-JIS (Microsoft)
euc-jp or x-euc-jp Japanese; Extended UNIX Code
ISO-2022-KR Korean (RFC 1557)
euc-kr or x-euc-kr Korean; Extended UNIX Code (RFC 1557)
gb_2312-80 or gb-2312 Chinese, Simplified, People's Republic (RFC 1345)
x-euc-tw or euc-tw Chinese-Taiwan; Extended UNIX Code
Big5 Chinese, Traditional, Taiwan; multibyte set


A.6 References

www.w3.org/TR/REC-xml
This is the official specification for the XML language. Appendix B defines various classes of UCS characters, and Sections 2.2 and 2.3 define how various language components (attribute and element names, for example) can be written using characters from these different classes.
XML Specification Guide, by Ian Graham and Liam Quin, John Wiley & Sons (1999)
Appendix B of this book provides a more detailed overview of character sets and encodings.
CJKV Information Processing, by Ken Lunde, O'Reilly & Associates, Inc. (1999)
A definitive guide to character encoding issues, with particular emphasis on the problems of Chinese, Japanese, Korean, and Vietnamese text.
The Unicode Standard, Worldwide Character Encoding, Version 2.0 (1996)
Complete description of the Unicode character set, with a CD-ROM illustrating all the defined characters. The book is not exactly fireside reading, but it is very useful for software developers. The standard can be purchased directly from the Unicode Consortium (www.unicode.org). Ordering information and access to online updates to the standard (Technical Report No. 8 brings the standard up to Version 2.1) are provided at the Unicode standards page, at www.unicode.org/unicode/uni2book/u2.html.
ISO/IEC 10646-1:1993, Information Technology—Universal Multiple-Octet Coded Character Set (UCS)—Part 1: Architecture and Basic Multilingual Plane.
There are several amendments to the original specification, which you will need to bring the standard up to date. Information about these amendments can be found (if you're lucky -- the ISO keeps moving documents around) at www.iso.ch/cate/dl8741.html. Information about the ISO and instructions for purchasing ISO standards documents—including the amendment text—can be accessed via the Web at www.iso.ch.

 

INTERNET RESOURCES
www.isi.edu/in-notes/iana/assignments/character-sets IANA-assigned charset names
www.ietf.org/rfc/rfc1489.txt KOI8-R charset
www.ietf.org/rfc/rfc1468.txt ISO 2022-jp charset
www.ietf.org/rfc/rfc1557.txt ISO 2022-kr, euc-kr, charset
www.ietf.org/rfc/rfc1345.txt gb_2312-80 or gb-2312 charset
www.ietf.org/rfc/rfc2279.txt UTF-8 charset
www.ietf.org/rfc/rfc1738.txt URL specification
www.ietf.org/rfc/rfc2396.txt URL syntax update



The XHTML Web Development Sourcebook © 1995-2000 by Ian S. Graham