Last Updated: 30 August 2000

APPENDIX A

The XHTML 1.0 Web development Sourcebook

Character Sets, Character Encoding, and Document Character Sets

A.1 Character Sets and Character Set Encodings

Character Set Specifications

A.2 The Universal Character Set

Document Encoding and the Document Character Set
Allowed Tokens in XML Names
Binary Encodings of UCS
ASCII, ISO 8859-1, and UCS

A.3 Decoding Encoded Documents

Indicating Character Encodings

A.4 Character Encoding of URL Strings

A.5 Names for Common Character Sets and Encodings

A.6 References

One of the main subjects of this book is the technology behind Web application design. For Web page or XML markup design, an important background issue is the digital representation of the text itself: the sequence of characters that make up the content and markup. This low level of individual characters, character sets, and character set encodings is in some ways more complex than either markup or formatting. In general, authors do not need to know the details of character sets, as most complications are handled automatically by editing or rendering software, once the software knows the way in which a digital document was created.

However, Web designers need to understand some issues related to character sets, as information about character sets and encodings must be provided when digital documents are distributed over the Web. If this is not done, a recipient of a document will not know how the document was created, and will not know how to convert the data back into characters.

This Appendix provides a very brief introduction to character set issues, designed to help Web designers understand the main issues. Additional information about character sets is available from the references at the end of this Appendix.

A.1 Character Sets and Character Set Encodings

Text, at its most basic, is a sequence of characters plus semantic rules for how those characters flow (e.g., drawn from left to right, right to left, top to bottom, etc.) and connect (i.e., how punctuation or accents join with adjacent letters). For example, the document you are reading right now consists of so-called Latin letters and numbers (a–z, A–Z, and 0–9), plus some punctuation and other common symbols, which flow naturally from left to right and which, together, spell out meaningful words and sentences (at least, I hope that is the case!).

When those sequences of characters are stored on a computer, they must be represented in a digital format. This process requires two things:

A clear specification of the actual characters being considered (the character set), plus a definition of the position or index of each character in the character set (for example, the first character is the letter a, the second b, and so on).
A definition of how the character positions (i.e., their positions in the character set) are digitally encoded when the characters are represented (and probably stored) in a digital form.

The first of these is more formally called a coded character set: a set of characters under consideration, each character uniquely identified by its coded position in the set.

NOTE

The term coded character set is used instead of just character set, as the second term is poorly (and often conflictingly) defined in a variety of standards. Modern character set standards, such as ISO 10646 and Unicode (and also the XML specification), chose a different formalism for describing character sets, and use the term coded character set to refer to this more precise definition. Please see www.w3.org/MarkUp/html-spec/charset-harmful.html for a more detailed discussion of this issue.

More formally, one can think of a coded character set as being a function whose domain is a subset of nonnegative integers, and whose range is a set of characters. For example, a set might define the Latin capital letter I to lie at position 73. The second issue is often called the character set encoding and refers to the manner in which the coded position of a character is stored in a binary format.

Whenever a digital representation of text is created, both of these issues are involved. In practice, the procedure works as follows: software takes a defined character (defined, for example, by the user typing it!), finds the index corresponding to the character (e.g., the character I lies at position 73 in the character set), and then digitally encodes this position using the defined character encoding. The encoded representation can then be stored in memory (as would be done when a document is being processed), or it can be stored on disk for future use.

Once this process is used to convert text into a digital format and store it in a file, software can easily reverse the process, for example, to read in a file and then display the text, by

Undoing the encoding process, turning the digital data into a sequence of code positions (essentially integers) that reference a sequence of characters.
Determining, from knowledge of the character set being used, to which character each index corresponds.

Of course, this is only possible if the software knows the character set and encoding used to create the data. This means that when documents are distributed over the Web, the identity of the character set and encoding used to create the document must be sent as well.
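The round trip described above can be sketched in a few lines of Python. This is only an illustration: the variable names are invented for the example, and ISO 8859-1 is chosen because its encoding stores each code position directly as one byte.

```python
# Step 1: find the coded position of a character.
char = "I"
position = ord(char)                  # position of "I" in the character set

# Step 2: encode that position using a chosen encoding (here ISO 8859-1).
encoded = char.encode("iso-8859-1")   # a single byte holding the position

# Reversing the process: bytes -> code positions -> characters.
positions = list(encoded)             # undo the encoding
decoded = "".join(chr(p) for p in positions)  # look each position up again

print(position, encoded, positions, decoded)
```

The reverse step works only because the reader of the data assumes the same character set and encoding as the writer.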

The definition of a coded character set is a bit more complicated than this, because the definitions must also define the nature of the characters. For example, character set specifications define the directionality of characters (whether they are natively drawn from left to right or right to left), when characters are combining characters (such as an accent that should be combined with the previous character), and so on. Indeed, much of a formal character set specification is spent defining characteristics such as these.

A.1.1 Character Set Specifications

Each digital character set specification generally defines two things:

The set of characters and their positions (the coded character set)
One or more encodings by which the character indices can be stored in a binary format

For example, the ISO 8859-1 specification defines a character set (often called Latin-1, consisting of 191 characters common to Western European languages) and a single encoding for that character set (each character is stored in a single byte, encoded according to the position of the character in the set). Thus, ISO 8859-1 says that the capital letter Q lies at position 81 in the character set, and that this is digitally encoded as the binary string 01010001 (the binary representation of 81). Because the encoding places every character in a single byte, at most 256 characters can be defined in ISO 8859-1. However, ISO 8859-1 actually defines only 191 characters and their positions (the other positions contain nonprintable control characters, defined in other standards we will not discuss here).
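As a quick check of the Q example, the following Python sketch encodes the letter with the ISO 8859-1 codec and prints the resulting byte as a binary string. The variable names are illustrative only.

```python
char = "Q"
data = char.encode("iso-8859-1")   # ISO 8859-1 stores one character per byte
position = data[0]                 # the byte value is the code position
bits = format(position, "08b")     # eight binary digits, zero padded

print(position, bits)
```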

There are many other character sets that encode characters in a single byte. For example, ISO 8859-4 (Latin-4) defines a character set containing the accented Latin letters needed for Baltic languages such as Estonian, Latvian, and Lithuanian, which are missing from the Western European repertoire of ISO 8859-1. Thus, if writing an e-mail message in German, an author might choose to use 8859-1, whereas when writing a letter in Estonian, they might use 8859-4. Furthermore, there are many wide character sets that define many thousands of characters (e.g., for the Chinese, Korean, or Japanese writing systems) and that use more complex encodings (generally requiring multiple bytes per character, and sometimes more than one encoding is supported for a given character set) when storing the characters in digital form.

Consequently, if a document is created using a specific character set and encoding, the identity of this encoding must be sent with the document when the data is distributed via the Web. For example, if a document is created using the ISO 8859-4 character set, the document and an appropriate identifier (in fact, the string ISO-8859-4) must be distributed together. The recipient can use the ISO-8859-4 identifier to determine the character set and encoding, and can then decode the data to properly display the text. These identifiers are often called charset values, as they are often specified, in MIME content-type headers, using an expression such as:

content-type: text/html; charset=iso-8859-4

which indicates that the message contains HTML-format data, and that the text in the message was composed using the ISO 8859-4 character set and encoding.
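To see why the charset value matters, the sketch below encodes a short word with ISO 8859-4 and then decodes the same bytes twice: once with the correct charset and once assuming ISO 8859-1. The sample word and variable names are assumptions made for the illustration; the codec names are those understood by Python.

```python
text = "Rīga"                            # contains a letter absent from ISO 8859-1
data = text.encode("iso-8859-4")         # what the sender stores or transmits

right = data.decode("iso-8859-4")        # recipient was told the charset
wrong = data.decode("iso-8859-1")        # recipient guessed the wrong charset

print(right)   # the original text
print(wrong)   # same bytes, different characters
```

Both decodings succeed without error, which is why a wrong guess tends to go unnoticed until someone reads the text.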

The languages of the world support many tens of thousands of different characters. Unfortunately, many traditional character sets, such as ISO 8859-1, define and encode only a small number of those characters. Thus, although French, Estonian, and Chinese documents can be written using the ISO 8859-1, ISO 8859-4, and Big5 character set standards, respectively, the text of those documents cannot be mixed—none of these standards defines the characters used by all three writing systems. This is a big problem for universal document interchange, because text cannot be easily mixed together, nor distributed in a universal format.

A.2 The Universal Character Set

This problem was recognized many years ago, and over the past decade much work took place designing universal sets of characters. The results were specifications for two universal sets, formally known as ISO/IEC 10646:1993 [a specification developed by the International Organization for Standardization (ISO)] and Unicode 2.1 (a specification developed by the Unicode Consortium). Fortunately, the two organizations realized that it was neither sensible nor practical to have two different universal character sets. Consequently, the two schemes were merged such that the most recent versions of Unicode and ISO 10646 define the same sets of characters, at the same locations in a common character set. They are thus identical, for all practical purposes. Indeed, we now refer to a single character set, called the Universal Character Set (UCS), to indicate this single universal standard.

The UCS standard defines a character set that can contain over 1 million possible characters. This includes the characters from the Latin, Cyrillic, Arabic, Hebrew, and other alphabets; Japanese, Chinese, and Korean characters; plus many other characters, punctuation marks, and other symbols. However, many positions in the character set are not yet assigned characters, leaving room for characters and symbols that have not yet been added (such as the symbols used to encode Inuit languages in Northern Canada), and for possible future uses.
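The size of the code space, and the fact that many positions are unassigned, can be inspected from Python. This is a rough sketch: the figures reflect whichever version of the Unicode database ships with the Python interpreter, not the version discussed in this appendix, and the helper name is_assigned is invented here.

```python
import sys
import unicodedata

# Total number of code positions Python's string type can represent.
code_space = sys.maxunicode + 1

def is_assigned(position):
    """True unless the Unicode database marks the position as unassigned (Cn)."""
    return unicodedata.category(chr(position)) != "Cn"

assigned = sum(1 for p in range(code_space) if is_assigned(p))
print(code_space, assigned, code_space - assigned)
```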

A.2.1 Document Encoding and the Document Character Set

Formally, UCS is the document character set of all XML (and thus XHTML) and HTML documents. This means that such documents can contain only characters defined in UCS. It also means that numeric character references in an XML document always reference characters by their positions in UCS. Thus, the character reference &#233; refers to the character at position 233 in the UCS character set, which is the character é (e with an acute accent).

Historically, most Web documents have been written and encoded using the ISO 8859-1 (Latin-1) character set. This is not a problem—documents can be encoded using any character set, provided they only contain characters defined somewhere in UCS, and provided any character references refer to the position of the character in the UCS character set. Fortunately, the characters defined in ISO 8859-1 are defined at exactly the same positions in UCS [e.g., the character at position 233 in ISO 8859-1 is also é (e with an acute accent)], so that all character references in ISO Latin-1 documents are still valid. Unfortunately, this is not the case for many documents written using other character sets. In these cases, the character references often refer to the position of characters in the character set used to create the document, which is usually not the position of the character in UCS. To be valid HTML or XHTML, such documents must have their character references updated to reference the correct UCS code positions.
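The following Python sketch illustrates both points. It first resolves a numeric character reference against UCS, and then shows how a reference that was wrongly written using a position in a legacy single-byte character set could be rewritten to use the UCS position. The function name legacy_ref_to_ucs and the choice of windows-1251 as the legacy set are assumptions made for the example.

```python
import html

# A numeric character reference always names a position in UCS.
print(html.unescape("&#233;"))        # the character at UCS position 233

def legacy_ref_to_ucs(position, charset):
    """Turn a position in a single-byte legacy charset into a UCS reference."""
    char = bytes([position]).decode(charset)   # which character was meant
    return f"&#{ord(char)};"                   # its position in UCS

# Position 224 in windows-1251 is a Cyrillic letter, not a Latin one.
fixed = legacy_ref_to_ucs(224, "windows-1251")
print(fixed)

# For ISO 8859-1 the positions coincide, so nothing changes.
print(legacy_ref_to_ucs(233, "iso-8859-1"))
```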

A.2.2 Allowed Tokens in XML Names

As noted in Chapter 2, XML does not define specific tag or attribute names, but provides a framework by which largely arbitrary markup languages can be constructed, with the names (of elements and attributes) being chosen to match the types of data that the language will represent. However, the mechanisms for creating such names are not entirely arbitrary, because software must be able to easily recognize the tag boundaries, and because it must also be easy to process the text making up the element and attribute names.

For this reason, the XML specification carefully classifies the different UCS characters and defines those characters that can be used in element and attribute names. We do not go into the details here; they are found, of course, in the official XML specification document, listed at the end of this appendix. They are also described, in general detail, in Appendix B of the XML Specification Guide.

Note that this step is not required for HTML or XHTML, because these languages predefine the names of all the allowed elements and attributes, so that this flexibility is not available.

A.2.3 Binary Encodings of UCS

The UCS character set supports several different encodings. The main encoding, known as UTF-16 (UTF stands for Universal Character Set Transformation Format), stores each character in two bytes, although there is a mechanism for encoding some characters using consecutive two-byte sequences. This is the easiest encoding for software to handle and is often used when UCS text is stored in memory (e.g., by tools such as editors or browsers).

UCS also supports two encodings that use single bytes as the basic encoding unit. The first of these, known as UTF-8, represents each UCS character as a stream of one or more bytes—this encoding uses all the bits in the byte for encoding purposes. A second encoding, known as UTF-7, represents each character as one or more bytes, but uses only the seven least significant bits for encoding purposes.

The Unicode specification calls the seven- and eight-bit encodings transformation formats, because they correspond to a format suitable for storage or transmission. In this context, UTF-8 has the advantage of compactness for text that consists mostly of ASCII characters, as the file size is then small compared with a corresponding UTF-7- or UTF-16-encoded one. On the other hand, UTF-7 is best when a file is to be transported via older communications technologies, such as old e-mail systems, which may not properly transport information encoded in the most significant bit.
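A small Python comparison makes the trade-offs concrete. The sample characters and the helper name encoded_sizes are chosen for this sketch; utf-16-be is used so that no byte-order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Byte counts for one character in UTF-8, UTF-16 (big-endian), and UTF-7."""
    return tuple(len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-7"))

for ch in ["A", "é", "中", "😀"]:
    utf8_len, utf16_len, utf7_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}  utf-8={utf8_len}  utf-16={utf16_len}  utf-7={utf7_len}")

# UTF-7 output never sets the most significant bit of any byte.
utf7_bytes = "é中😀".encode("utf-7")
print(all(b < 128 for b in utf7_bytes))
```

Note that which encoding is most compact depends on the text: UTF-8 is shortest for ASCII, while UTF-16 can be shorter for characters such as the Chinese example.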

The existence of different encodings can create problems when files are stored on disk or sent over the Internet, because character set information must now tag along with the data and be available to subsequent software. If this information is not available, then the next program to see the text will not know how to decode the data and convert it back into the correct characters. Mechanisms for indicating the encoding when data are passed from machine to machine are discussed next.

A.2.4 ASCII, ISO 8859-1, and UCS

ASCII, ISO 8859-1, and UCS are the three most common character sets. ASCII is perhaps the oldest: it is a seven-bit character set that defines 95 printable characters (including the space) in a code space covering the range 0 to 127, inclusive, together with an encoding that stores every ASCII character in a single byte without using the most significant (eighth) bit. The various ASCII characters and their coded positions are shown in Table A.1. The grayed-out entries correspond to control characters that are not formally defined by the ISO 646 standard; these characters are defined by a separate standard, known as ISO 6429.

ISO 8859-1 defines 191 characters over a range from 0 to 255, and an encoding such that every character is encoded in a single byte. The characters and code positions for those characters defined in ISO 8859-1 are also shown in Table A.1. This is because ISO 8859-1 was defined to extend the ASCII character set, such that ISO 8859-1 defines all the ASCII characters at the same positions (0 through 127) as in the ASCII character set, and extra characters were added in the range from 128 through 255.

UCS was designed similarly: the first 256 positions of UCS (0 through 255) code for exactly the same characters as ISO 8859-1.
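This nesting of ASCII inside ISO 8859-1 inside UCS can be checked mechanically. In the Python sketch below, chr(p) gives the UCS character at position p, and each single byte is decoded with the codec in question; the variable names are illustrative.

```python
# Every ASCII position (0-127) maps to the same character in UCS.
ascii_matches_ucs = all(
    bytes([p]).decode("ascii") == chr(p) for p in range(128)
)

# Every ISO 8859-1 position (0-255) maps to the same character in UCS.
latin1_matches_ucs = all(
    bytes([p]).decode("iso-8859-1") == chr(p) for p in range(256)
)

print(ascii_matches_ucs, latin1_matches_ucs)
```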

Table A.1
ISO 8859-1 (Latin-1) characters and common control characters (ISO 6429), showing the positions in both decimal and hexadecimal notation. Note that these are exactly equivalent to the 256 characters defined in UCS (Unicode/ISO 10646) at positions 0 through 255. Similarly, the characters defined at positions 0 through 127 are exactly the same as those defined by ASCII. Control characters’ short names are shown in italics, and printable control characters that are allowed in XML and HTML documents are shown in boldface. The control characters that are forbidden in XML or HTML documents are shown against a gray background. Note that the printable space (32-decimal) and nonbreaking space (160-decimal) characters are denoted by the strings SP and NBSP, as they would otherwise be invisible.

[Table A.1: four column groups, each listing CHARACTER, DEC, and HEX, for positions 0 through 255. The individual table entries are not recoverable in this copy.]

A.3 Decoding Encoded Documents

As mentioned previously, although the characters in an XML document must be defined in the UCS character set, the document itself need not be encoded using UCS (although it is obviously much easier to process if it is). Indeed, a document can be encoded using any well-understood encoding scheme (ISO 8859-1, EUC-KR, Shift-JIS, etc.), provided the document contains only characters defined in UCS, and provided character references refer to characters by their positions in UCS.

An application reading such data must then know the encoding used to create the data, and it must be able to decode the data to create valid Unicode characters. The two steps required are

Determine the character set and encoding used for the specified data.
Decode the data stream and write it to memory, mapping each encoded item in the input data into the appropriate UCS character and storing this character in memory.

The key is to determine the character set and encoding: Once this is known, the rest is (relatively!) simple.

A.3.1 Indicating Character Encodings

When text data is sent to a destination, the delivery process must indicate the character set and encoding used to create the data. Fortunately, there is an easy way of identifying this information, as there is an accepted naming scheme for identifying character set/encoding pairs. These scheme identifiers (simple ASCII text strings like ISO-8859-1, UTF-8, Big5) are often called charsets. Table A.2 lists some common charset names, and the languages/writing systems with which they are associated.

There are two ways in which charset information can be included when text data is sent to someone:

As part of the message that contains the data (for example, as part of the header that precedes the data being sent)
Embedded directly within the data in an easy-to-recognize string

The former is preferred, as it is the most direct and is pretty well guaranteed to work. Indeed, the e-mail MIME mechanism and the Web HTTP communications protocol include mechanisms for specifying the charset of any block of data included in the message. These mechanisms use the MIME content-type headers to specify the type of the data and the charset used to create it. This approach is discussed in the next section.

The latter mechanism is a useful fallback, particularly because files are not always sent by mechanisms that provide charset information; for example, when accessing a file directly from the file system, or retrieving a file via FTP. Note, however, that without foreknowledge of the charset, the software may not be able to read the data to find the string that identifies the charset (this is very much a chicken-and-egg problem).

Specifying Character Encoding in the Message

Both the e-mail message syntax (MIME) and the HTTP protocol use MIME content-type headers for indicating the type of data (e.g., HTML, plain text, XML) being sent. This header is sent ahead of the actual data, and is encoded using characters (ASCII) and a character set (ISO 8859-1) understood by all Internet-aware software. This header supports a charset parameter to indicate the character encoding used within the following (or attached) text component. The form is

Content-type: text/subtype; charset=char-encoding

where subtype is the subtype of the text document (html, plain, xml, etc.), and char-encoding gives the character set and encoding used to create the data. Such headers are included with each part of a MIME-encoded mail message, and every HTTP request or response header can include a content-type header to indicate the type of the data being sent.
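A receiving program might extract the charset parameter from such a header and use it to decode the body, roughly as follows. This sketch uses Python's standard email package to parse the header value; the function name parse_content_type and the sample body are assumptions made for the example.

```python
from email.message import Message

def parse_content_type(header_value):
    """Return (media type, charset or None) from a Content-type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()

# Simulated message: a header value plus a body encoded to match it.
header = "text/html; charset=iso-8859-4"
body = "Rīga".encode("iso-8859-4")

media_type, charset = parse_content_type(header)
text = body.decode(charset)
print(media_type, charset, text)

# When the sender omits the parameter, no charset is reported.
print(parse_content_type("text/plain"))
```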

Unfortunately, some HTTP servers do not send charset information, while some older browsers and applications cannot handle content-type headers containing charset specifications and they misidentify the MIME type if a charset parameter is present. Thus, in some situations, it is necessary to omit the charset from the content-type header and to hope that the application receiving the data can infer the charset from the content of the document.

Specifying Character Encoding in the Document

For the reasons just mentioned, it is important that documents include markup indicating the encoding used to create them. With XML (and hence XHTML), this information must be placed in the XML declaration. HTML supports a special meta element that does the same thing. The forms in these two cases are

<?xml version="1.0" encoding="char-encoding" ?>

<meta http-equiv="Content-Type"
      content="text/subtype; charset=char-encoding" />

where subtype gives the subtype of the text document (html, xml, etc.), and where char-encoding is a well-known name for the character set and encoding used to create the document (see Table A.2). An XHTML document should include both these specifications. If the document is served as XML, then the value specified in the XML declaration is used. If the document is served as HTML, then the meta-element value is used. An example, assuming an HTML document encoded using UTF-8, is

<meta http-equiv="Content-Type"
      content="text/html; charset=utf-8" />

This approach is practical because most character sets place the standard ASCII characters in positions 0 to 127, and most encodings encode these positions in similar ways (as a single byte corresponding to their position in the character set). Consequently, a browser can assume just about any character encoding, guess the size of the smallest encoding unit (one or two bytes—this can usually be guessed by looking for patterns in the first few bytes), and then read the initial (ASCII-character) text until a markup string is encountered that gives the actual encoding.
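A simplified version of this sniffing step might look like the Python sketch below. It assumes an ASCII-compatible encoding with one-byte units (so it would not find a declaration in a UTF-16 document), and the regular expressions, the function name sniff_charset, and the 1024-byte limit are illustrative choices rather than anything prescribed by the XML or HTML specifications.

```python
import re

# Patterns are applied to raw bytes, relying on the ASCII-compatible prefix.
XML_DECL = re.compile(
    rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
META = re.compile(
    rb'<meta[^>]*\bcharset\s*=\s*["\']?([A-Za-z][A-Za-z0-9._-]*)',
    re.IGNORECASE,
)

def sniff_charset(raw, limit=1024):
    """Look for a declared charset in the first `limit` bytes of a document."""
    head = raw[:limit]
    for pattern in (XML_DECL, META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None   # nothing declared; the caller must fall back to a guess

xml_doc = b'<?xml version="1.0" encoding="UTF-8" ?>\n<html></html>'
html_doc = (b'<html><head><meta http-equiv="Content-Type"\n'
            b'      content="text/html; charset=iso-8859-4" /></head></html>')
print(sniff_charset(xml_doc), sniff_charset(html_doc), sniff_charset(b"<p>hi</p>"))
```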

The charset specified inside a document is ignored if the document is received as part of a message, and if the message itself uses a content-type header to indicate the charset.
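This precedence rule can be written as a short function. The sketch below is only illustrative: the name effective_charset is invented, and the ISO 8859-1 default used when neither source supplies a charset is an assumption of the example, not a rule stated in this appendix.

```python
def effective_charset(header_charset, document_charset, default="iso-8859-1"):
    """Pick the charset to decode with: message header first, then the document."""
    return (header_charset or document_charset or default).lower()

print(effective_charset("ISO-8859-4", "utf-8"))   # header wins
print(effective_charset(None, "UTF-8"))           # fall back to the document
print(effective_charset(None, None))              # assumed default
```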

A.4 Character Encoding of URL Strings

The issues associated with URL encoding are somewhat different. Here, instead of encoding characters in a binary format, a URL represents an encoding, as printed characters (actually, a limited set of ASCII characters), of some underlying text. This encoding-as-characters step is required because of the intended use of URLs: they are designed to be easily written down on paper, or to be digitally encoded and sent via old-style e-mail systems that cannot handle complex character encodings.

This issue is discussed in somewhat more detail in Chapter 8, Section 8.1.4.
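As a brief illustration of encoding text as printable ASCII characters, the Python sketch below uses the standard urllib.parse functions. The sample string is invented, and the choice of UTF-8 or ISO 8859-1 for the underlying bytes is an assumption of the example; the sender and receiver must agree on it.

```python
from urllib.parse import quote, unquote

text = "café au lait"

# The text is first turned into bytes, then each unsafe byte becomes %XX.
as_utf8 = quote(text)                             # bytes taken from UTF-8
as_latin1 = quote(text, encoding="iso-8859-1")    # bytes taken from ISO 8859-1
print(as_utf8)
print(as_latin1)

# Decoding reverses both steps, provided the same encoding is assumed.
print(unquote(as_utf8))
print(unquote(as_latin1, encoding="iso-8859-1"))
```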

A.5 Names for Common Character Sets and Encodings

There are dozens of character sets and encodings in common use. The ISO, for example, specifies several eight-bit character sets, in addition to ISO 10646 (UCS), and their appropriate encodings. The ISO, however, is not the only organization that defines character sets and encodings—national standards bodies, independent of the ISO, defined many sets. Table A.2 lists some of the more common ones. Many of the text labels used here (left-hand column) are not standardized names (note the leading x-). Where available, Table A.2 lists the Internet RFCs that document the encoding and associated coded character set. RFCs are available at:

http://www.rfc-editor.org/rfc.html

www.ietf.org/

www.rfc-editor.org/rfc/

It is important to note that most of these character sets/encodings are not widely supported—you may be able to encode documents using these character sets, but most users will not be able to view them. For portable documents, you should produce and send text encoded using UTF-8 or UTF-16, although ISO 8859-1 or US-ASCII—with character references for non-Latin characters—are useful options compatible with most current Web software.
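Producing US-ASCII or ISO 8859-1 output with character references for everything else can be done with the xmlcharrefreplace error handler of Python's codecs, as in this sketch. The sample text and variable names are chosen for the illustration.

```python
import html

text = "café 中文"

# Characters the target charset cannot represent become &#NNN; references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(ascii_bytes)
print(latin1_bytes)

# A reader that resolves the references recovers the original text.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == text)
```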

An official list of well-defined charset names is maintained by the Internet Assigned Numbers Authority (IANA). This list is available at www.isi.edu/in-notes/iana/assignments/character-sets. Note that having a name in this list does not guarantee that software understands the name or knows how to process data so encoded!
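Whether a particular piece of software understands a charset name has to be tested against that software. As one example, the Python sketch below asks Python's codec registry about a few labels. The helper name canonical_name is invented, and the names returned are Python's own canonical names, not IANA's.

```python
import codecs

def canonical_name(label):
    """Return Python's canonical codec name for a charset label, or None."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None   # this software does not know the label

for label in ["UTF-8", "utf-8", "ISO-8859-1", "latin1", "Big5", "x-no-such-charset"]:
    print(label, "->", canonical_name(label))
```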

Table A.2
Some common (and not always official) names for character set encodings, with descriptions. Note that the charset names are case-insensitive. For portable documents, you should avoid distributing text using most of these encodings. The names of encodings that are widely supported (or that must be supported by XML and HTML applications) are in boldface, on a gray background. In the descriptions, the parenthetical term (Windows) means that the charset is Microsoft Windows-specific, and the term (Macintosh) means that the charset is Macintosh-specific.

CHARSET LABEL	DESCRIPTION
US-ASCII	US ASCII
ISO 8859-1	ISO Latin-1
UTF-8	UCS [ISO 10646/Unicode], one-byte (8-bit) encoding; universal transformation format
UTF-16	UCS [ISO 10646/Unicode], two-byte encoding, including surrogate extension mechanism
Unicode 1-1	Unicode, Version 1.1, two-byte encoding (defined before Unicode and ISO 10646 were "merged")
Unicode-2.0	Unicode, Version 2.0, two-byte encoding (equivalent to UTF-16)
UTF-7	Unicode, one-byte (7-bit) encoding; universal transformation format
ISO 10646-UCS-4, or UCS-4	ISO 10646, four-byte encoding
ISO 10646-UCS-2, or UCS-2	ISO 10646/Unicode, two-byte encoding (same as UTF-16, but only encodes the first 65,536 characters)
x-mac-roman, mac-roman, or MacRoman	Like ISO 8859-1, with extra characters in positions 128 through 255 (Macintosh only)
windows-1250, win-1250, or CP-1250	Central European (Windows)
x-mac-ce, mac-CE, or MacCE	Central/East European (Macintosh)
ISO 8859-2	Central/East European (Czech, Croat, German, Hungarian, Polish, Romanian, Slovak, and Slovenian)
windows-1251, win-1251, or CP-1251	Russian and Central/Eastern European (Windows)
ISO 8859-3	Southern European (Esperanto, Galician, Maltese, and Turkish)
KOI8-R	Cyrillic (RFC 1489)
ISO 8859-4	Baltic (Estonian, Latvian, Lithuanian)
ISO 8859-5	Cyrillic (Bulgarian, Byelorussian, Macedonian, Serbian, and Ukrainian)
windows-1257, win-1257, or CP-1257	Baltic (Estonian, Latvian, Lithuanian) (Windows)
x-mac-cyrillic, mac-cyrillic, or MacCyrillic	Cyrillic (Macintosh)
ISO IR-111, or ECMA-Cyrillic	Cyrillic
CP-866	Cyrillic
KOI8-U	Ukrainian
x-mac-ukrainian, mac-ukrainian, or MacUkrainian	Ukrainian (Macintosh)
ISO 8859-6	Arabic
windows-1256, win-1256, or CP-1256	Arabic (Windows)
ISO 8859-7	Greek
x-mac-greek, mac-greek, or MacGreek	Greek (Macintosh)
windows-1253, win-1253, or CP-1253	Greek (Windows)
ISO 8859-8	Hebrew
windows-1255, win-1255, or CP-1255	Hebrew (Windows)
ISO 8859-14	Celtic
ISO 8859-15	Western
ARMSCII-8	Armenian
TIS-620 or Windows-874	Thai
ISO 8859-9	Turkish
x-mac-turkish, mac-turkish, or MacTurkish	Turkish (Macintosh)
windows-1254, win-1254, or CP-1254	Turkish (Windows)
VISCII	Vietnamese
Windows-1258, win-1258, or CP-1258	Vietnamese (Windows)
VIET-VPS	Vietnamese
VIET-TCVN5712	Vietnamese
x-mac-croatian, mac-croatian, or MacCroatian	Croatian (Macintosh)
x-mac-icelandic, mac-icelandic, or MacIcelandic	Icelandic (Macintosh)
x-mac-romanian, mac-romanian, or MacRomanian	Romanian (Macintosh)
ISO 8859-10	Greenlandic/Icelandic/Lapp
ISO 2022-jp	Japanese (RFC 1468; note that this encoding can use more than one coded character set; the encoding itself indicates the set being used)
Shift_JIS or x-sjis	Japanese Shift-JIS (Microsoft)
euc-jp or x-euc-jp	Japanese; Extended UNIX Code
ISO 2022-kr	Korean (RFC 1557)
euc-kr or x-euc-kr	Korean; Extended UNIX Code (RFC 1557)
gb_2312-80 or gb-2312	Chinese, Simplified (People’s Republic) (RFC 1345)
x-euc-tw or euc-tw	Chinese-Taiwan; Extended UNIX Code
Big5	Chinese, Traditional (Taiwan), multibyte set

A.6 References

www.w3.org/TR/REC-xml: This is the official specification for the XML language. Appendix B defines various classes of UCS characters, and Sections 2.2 and 2.3 define how various language components (attribute and element names, for example) can be written using characters from these different classes.
XML Specification Guide, by Ian Graham and Liam Quin, John Wiley & Sons (1999): Appendix B of this book provides a more detailed overview of character sets and encodings.
CJKV Information Processing, by Ken Lunde, O'Reilly & Associates, Inc. (1999): A definitive guide to character encoding issues, with particular emphasis on the problems of Chinese, Japanese, Korean, and Vietnamese text.
The Unicode Standard, Worldwide Character Encoding, Version 2.0 (1996): Complete description of the Unicode character set, with a CD-ROM illustrating all the defined characters. The book is not exactly fireside reading, but it is very useful for software developers. The standard can be purchased directly from the Unicode Consortium (www.unicode.org). Ordering information and access to online updates to the standard (Technical Report No. 8 brings the standard up to Version 2.1) are provided at the Unicode standards page, at www.unicode.org/unicode/uni2book/u2.html.
ISO/IEC 10646-1:1993, Information Technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane: There are several amendments to the original specification, which you will need to bring the standard up to date. Information about these amendments can be found (if you're lucky -- the ISO keeps moving documents around) at www.iso.ch/cate/dl8741.html. Information about the ISO and instructions for purchasing ISO standards documents (including the amendment text) can be accessed via the Web at www.iso.ch.

INTERNET RESOURCES
www.isi.edu/in-notes/iana/assignments/character-sets	IANA-assigned charset names
www.ietf.org/rfc/rfc1489.txt	KOI8-R charset
www.ietf.org/rfc/rfc1468.txt	ISO 2022-jp charset
www.ietf.org/rfc/rfc1557.txt	ISO 2022-kr, euc-kr, charset
www.ietf.org/rfc/rfc1345.txt	gb_2312-80 or gb-2312 charset
www.ietf.org/rfc/rfc2279.txt	UTF-8 charset
www.ietf.org/rfc/rfc1738.txt	URL specification
www.ietf.org/rfc/rfc2396.txt	URL syntax update