A.1 Character Sets and Character Set Encodings
Text, at its most basic, is a sequence of characters plus semantic rules for how those
characters flow (e.g., drawn from left to right, right to left, top to bottom, etc.) and
connect (i.e., how punctuation or accents join with adjacent letters). For example, the
document you are reading right now consists of so-called Latin letters and numbers (a-z,
A-Z, and 0-9), plus some punctuation and other common symbols, which flow
naturally from left to right and which, together, spell out meaningful words and sentences
(at least, I hope that is the case!).
When those sequences of characters are stored on a computer, they must be represented
in a digital format. This process requires two things:
- A clear specification of the actual characters being considered (the character set),
plus a definition of the position or index of each character in the character set
(for example, the first character is the letter a, the second b, and so on).
- A definition of how the character positions (i.e., their positions in the character set)
are digitally encoded when the characters are represented (and probably stored) in a
The first of these is more formally called a coded character set: a set of
characters under consideration, each character uniquely identified by its coded position
in the set.
The term coded character set is used instead of just character set, as
the second term is poorly (and often conflictingly) defined in a variety of
standards. Modern character set standards, such as ISO 10646 and Unicode (and also the XML
specification), chose a different formalism for describing character sets, and
use the term coded character set to refer to this more precise definition. Please
for a more detailed discussion of this issue.
More formally, one can think of a coded character set as being a function whose domain
is a subset of nonnegative integers, and whose range is a set of characters. For example,
a set might define the Latin capital letter I to lie at position 73. The second
issue is often called the character set encoding and refers to the manner in which
the coded position of a character is stored in a binary format.
Whenever a digital representation of text is created, both of these
issues are involved. In practice, the procedure works as follows: software takes a defined
character (defined, for example, by the user typing it!), finds the index corresponding to
the character (e.g., the character I lies at position 73 in the character set), and
then digitally encodes this position using the defined character encoding. The encoded
representation can then be stored in memory (as would be done when a document is being
processed), or it can be stored on disk for future use.
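The procedure can be sketched in Python, whose built-in ord() and chr() functions map between characters and their coded positions. This is only an illustration: the helper names to_bytes and from_bytes are hypothetical, and ISO 8859-1 is assumed as the charset.

```python
def to_bytes(text: str, charset: str = "iso-8859-1") -> bytes:
    """Find each character's coded position and encode it with the charset."""
    return text.encode(charset)


def from_bytes(data: bytes, charset: str = "iso-8859-1") -> str:
    """Reverse the process: bytes become positions, positions become characters."""
    return data.decode(charset)


position = ord("I")             # coded position of the character I
encoded = to_bytes("I")         # that position stored as a single byte
restored = from_bytes(encoded)  # reading the byte back yields the character
print(position, list(encoded), restored)
```

The round trip only works because both helpers are given the same charset name.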
Once this process is used to convert text into a digital format and store it in a file,
software can easily reverse the process, for example, to read in a file and then display
the text, by
- Undoing the encoding process, turning the digital data into a sequence of code positions
(essentially integers) that reference a sequence of characters.
- Determining, from knowledge of the character set being used, to which character each
Of course, this is only possible if the software knows the character set and encoding
used to create the data. This means that when documents are distributed over the Web, the
identity of the character set and encoding used to create the document must be sent as
The definition of a coded character set is a bit more complicated than this, because
the definitions must also define the nature of the characters. For example,
character set specifications define the directionality of characters (whether they are
natively drawn from left to right or right to left), when characters are combining characters
(such as an accent that should be combined with the previous character), and so on.
Indeed, much of a formal character set specification is spent defining characteristics
such as these.
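As a rough illustration of such per-character properties, Python's standard unicodedata module exposes some of the data published with the Unicode character set. The three sample characters below are chosen only for this sketch.

```python
import unicodedata

samples = {
    "latin_a": "a",               # Latin small letter a
    "hebrew_alef": "\u05d0",      # Hebrew letter alef
    "combining_acute": "\u0301",  # combining acute accent
}

properties = {
    label: {
        "name": unicodedata.name(ch),
        "bidi": unicodedata.bidirectional(ch),   # directionality class
        "combining": unicodedata.combining(ch),  # 0 means "not combining"
    }
    for label, ch in samples.items()
}

# A base letter followed by a combining accent can be composed into one character.
composed = unicodedata.normalize("NFC", "e\u0301")

for label, props in properties.items():
    print(label, props)
```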
Each digital character set specification generally defines two things:
- The set of characters and their positions (the coded character set)
- One or more encodings by which the character indices can be stored in a binary format
For example, the ISO 8859-1 specification defines a character set (often called Latin-1,
consisting of 191 characters common to Western European languages) and a single encoding
for that character set (each character is stored in a single byte, encoded according to
the position of the character in the set). Thus, ISO 8859-1 says that the character
capital letter Q is the 81st character in the character set, and that this is
digitally encoded as the binary string: 01010001 (the binary representation of 81).
Because the encoding places all characters inside a single byte, at most 256 possible
characters can be defined in ISO 8859-1. However, ISO 8859-1 actually defines only 191
characters and their positions (the other positions contain nonprintable control
characters, defined in other standards we will not discuss here).
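The arithmetic in this example can be checked with a short Python sketch. The printable ranges used below (32 to 126 and 160 to 255) are stated here as an assumption about how the 191 characters are distributed.

```python
position = ord("Q")                 # coded position of capital letter Q
bits = format(position, "08b")      # the position written as eight binary digits
stored = "Q".encode("iso-8859-1")   # ISO 8859-1 stores one byte per character

# Printable positions in the 0-255 code space; the rest are control characters.
printable = [p for p in range(256) if 32 <= p <= 126 or 160 <= p <= 255]

print(position, bits, len(stored), len(printable))
```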
There are many other character sets that encode characters in a single byte. For
example, ISO 8859-4 defines a character set (Latin-4) for Baltic languages (Estonian, Latvian, and
Lithuanian), as opposed to the Western European characters of ISO 8859-1. Thus, if writing
an e-mail message in German, an author might choose to use 8859-1, whereas when writing a
letter in Estonian, they might use 8859-4. Furthermore, there are many wide character sets
that define many thousands of characters (e.g., for the Chinese, Korean, or Japanese
writing systems) and that use more complex encodings (generally requiring multiple bytes
per character, and sometimes more than one encoding is supported for a given character
set) when storing the characters in digital form.
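Because single-byte character sets reuse the same 256 positions, one stored byte can stand for different characters depending on the set. The sketch below decodes an arbitrarily chosen byte value (224) under ISO 8859-1 and ISO 8859-4, using Python's bundled codecs.

```python
raw = bytes([0xE0])                    # a single byte with value 224

as_8859_1 = raw.decode("iso-8859-1")   # interpreted using ISO 8859-1
as_8859_4 = raw.decode("iso-8859-4")   # interpreted using ISO 8859-4

print(repr(as_8859_1), repr(as_8859_4))
```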
Consequently, if a document is created using a specific character set and encoding, the
identity of this encoding must be sent with the document when the data is distributed via
the Web. For example, if a document is created using the ISO 8859-4 character set, the
document and an appropriate identifier (in fact, the string ISO-8859-4) must be
distributed together. The recipient can use the ISO-8859-4 identifier to determine the
character set and encoding, and can decode the data to properly display the text. These
identifiers are often called charset values, as they are often specified, in MIME
content-type headers, using an expression such as:
content-type: text/html; charset=iso-8859-4
which indicates that a message contains HTML-format data, and that the text in the
message was composed using the ISO 8859-4 character set and encoding.
The languages of the world support many tens of thousands of different characters.
Unfortunately, many traditional character sets, such as ISO 8859-1, define and encode only
a small number of those characters. Thus, although French, Estonian, and Chinese documents
can be written using the ISO 8859-1, ISO 8859-4, and Big5 character set standards,
respectively, the text of those documents cannot be mixed: none of these standards
defines the characters used by all three writing systems. This is a big problem for
universal document interchange, because text cannot be easily mixed together, nor
distributed in a universal format.
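A small Python sketch makes the limitation concrete. The sample string is invented for illustration, and can_encode is a hypothetical helper; it simply reports whether every character of the text exists in a given charset.

```python
# French, Latvian, and Chinese words in one string (written with escapes).
mixed = "fran\u00e7ais latvie\u0161u \u4e2d\u6587"


def can_encode(text: str, charset: str) -> bool:
    """Return True only if every character of text exists in the charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


results = {cs: can_encode(mixed, cs) for cs in ("iso-8859-1", "iso-8859-4", "big5")}
print(results)
```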
This problem was recognized many years ago, and over the past decade much work took
place designing universal sets of characters. The results were specifications for two
universal sets, formally known as ISO/IEC 10646:1993 [a specification developed by the
International Organization for Standardization (ISO)] and Unicode 2.1 (a specification
developed by the Unicode Consortium). Fortunately, the two organizations realized that it
was neither sensible nor practical to have two different universal character sets.
Consequently, the two schemes were merged such that the most recent versions of Unicode
and ISO 10646 define the same sets of characters, at the same locations in a common
character set. They are thus identical, for all practical purposes. Indeed, we now refer
to a single character set, called the Universal Character Set (UCS), to indicate
this single universal standard.
The UCS standard defines a character set that can contain over 1 million possible
characters. This includes the characters from the Latin, Cyrillic, Arabic, Hebrew, and
other alphabets; Japanese, Chinese, and Korean characters; plus many other characters,
punctuation marks, and other symbols. However, many positions in the character set are not
yet assigned characters, leaving room for characters and symbols that have not yet been
added (such as the symbols used to encode Inuit languages in Northern Canada), and for
possible future uses.
Formally, UCS is the document character set of all XML (and thus XHTML) and HTML
documents. This means that such documents can contain only characters defined in UCS. It
also means that numeric character references in an XML document always reference
characters by their positions in UCS. Thus, the character reference &#233;
refers to the 233rd character in the UCS character set, which is the character é (e with
an acute accent).
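The link between a numeric character reference and a UCS position can be sketched with Python's standard html module. The helper make_reference is hypothetical and shown only to illustrate the reverse direction.

```python
import html

reference = "&#233;"                  # numeric character reference for position 233
resolved = html.unescape(reference)   # resolved against UCS positions


def make_reference(ch: str) -> str:
    """Build a decimal character reference from a character's UCS position."""
    return f"&#{ord(ch)};"


print(resolved, make_reference(resolved))
```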
Historically, most Web documents have been written and encoded using the ISO 8859-1
(Latin-1) character set. This is not a problem: documents can be encoded using any
character set, provided they only contain characters defined somewhere in UCS, and
provided any character references refer to the position of the character in the UCS
character set. Fortunately, the characters defined in ISO 8859-1 are defined at exactly
the same positions in UCS [e.g., the character at position 233 in ISO 8859-1 is also é (e
with an acute accent)], so that all character references in ISO Latin-1 documents are
still valid. Unfortunately, this is not the case for many documents written using other
character sets. In these cases, the character references often refer to the position of
characters in the character set used to create the document, which is usually not the
position of the character in UCS. To be valid HTML or XHTML, such documents must have
their character references updated to reference the correct UCS code positions.
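To illustrate the mismatch, the following sketch compares the position of one Cyrillic letter in a legacy character set with its position in UCS. KOI8-R is picked here purely as an example of such a set; it is an assumption of this sketch, not a case discussed above.

```python
import html

ch = "\u042f"                              # Cyrillic capital letter YA
legacy_position = ch.encode("koi8-r")[0]   # its byte position in KOI8-R
ucs_position = ord(ch)                     # its position in UCS

wrong_reference = f"&#{legacy_position};"  # built from the legacy position
right_reference = f"&#{ucs_position};"     # built from the UCS position

print(wrong_reference, html.unescape(wrong_reference))
print(right_reference, html.unescape(right_reference))
```

Only the reference built from the UCS position resolves to the intended letter.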
As noted in Chapter 2, XML does not define specific tag or attribute names, but
provides a framework by which largely arbitrary markup languages can be constructed, with
the names (of elements and attributes) being chosen to match the types of data that the
language will represent. However, the mechanisms for creating such names are not entirely
arbitrary, because software must be able to easily recognize the tag boundaries, and
because it must also be easy to process the text making up the element and attribute names.
For this reason, the XML specification carefully classifies the different UCS
characters and defines those characters that can be used in element and attribute names.
We do not go into the details here; they are found, of course, in the official XML
specification document, listed at the end of this appendix. They are also described, in
general detail, in Appendix B of the XML Specification Guide.
Note that this step is not required for HTML or XHTML, because these languages
predefine the names of all the allowed elements and attributes, so that this flexibility
is not available.
The UCS character set supports several different encodings. The main encoding, known as
UTF-16 (UTF stands for Universal Character Set Transformation Format),
stores each character in two bytes, although there is a mechanism for encoding some
characters using consecutive two-byte sequences. This is the easiest encoding for software
to handle and is often used when UCS text is stored in memory (e.g., by tools such as
editors or browsers).
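The two-byte unit, and the mechanism that uses two consecutive units, can be seen in a short Python sketch. The big-endian variant (utf-16-be) is chosen here only so that no byte-order mark is added to the output.

```python
bmp_char = "\u00e9"        # a character within the first 65,536 positions
high_char = "\U00010000"   # a character beyond them

bmp_bytes = bmp_char.encode("utf-16-be")    # one two-byte unit
high_bytes = high_char.encode("utf-16-be")  # two consecutive two-byte units

print(bmp_bytes.hex(), high_bytes.hex())
```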
UCS also supports two encodings that use single bytes as the basic encoding unit. The
first of these, known as UTF-8, represents each UCS character as a stream of one or
more bytes; this encoding uses all the bits in the byte for encoding purposes. A
second encoding, known as UTF-7, represents each character as one or more bytes,
but uses only the seven least significant bits for encoding purposes.
The Unicode specification calls the seven- and eight-bit encodings transformation
formats, because they correspond to a format suitable for storage or transmission. In
this context, UTF-8 has the advantage of compactness for text that is mostly ASCII, as the file size is small compared
with a corresponding UTF-7- or UTF-16-encoded one. On the other hand, UTF-7 is best when a
file is to be transported via older communications technologies, such as old e-mail
systems, which may not properly transport information encoded in the most significant bit.
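The trade-off can be measured with a brief Python sketch. The sample text is invented and mostly ASCII, which is the case where UTF-8 is most compact; text in other scripts can give different results.

```python
text = "caf\u00e9 au lait"   # mostly ASCII, with one accented character

# Encoded size in bytes under each transformation format.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-7", "utf-8", "utf-16-be")}

# UTF-7 never sets the most significant bit; UTF-8 does for non-ASCII characters.
seven_bit_safe = all(b < 128 for b in text.encode("utf-7"))
uses_high_bit = any(b >= 128 for b in text.encode("utf-8"))

print(sizes, seven_bit_safe, uses_high_bit)
```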
The existence of different encodings can create problems when files are stored on disk
or sent over the Internet, because character set information must now tag along with the
data and be available to subsequent software. If this information is not available, then
the next program to see the text will not know how to decode the data and convert it back
into the correct characters. Mechanisms for indicating the encoding when data are passed
from machine to machine are discussed next.
ASCII, ISO 8859-1, and UCS are the three most common character sets. ASCII is perhaps
the oldest, and is a seven-bit character set that defines 95 printable characters in a code space
covering the range 0 to 127, and an encoding that encodes every ASCII character in a
single byte, without using the most significant (eighth) bit. The various ASCII characters
and their coded positions are shown in Table A.1. The grayed-out entries correspond to control characters
that are not formally defined by the ISO 646 standard; these characters are defined by
a separate standard, known as ISO 6429.
ISO 8859-1 defines 191 characters over a range from 0 to 255, and an encoding such that
every character is encoded in a single byte. The characters and code positions for those
characters defined in ISO 8859-1 are also shown in Table A.1. This is because ISO 8859-1
was defined to extend the ASCII character set, such that ISO 8859-1 defines all the ASCII
characters at the same positions (0 through 127) as in the ASCII character set, and extra
characters were added in the range from 128 through 255.
UCS was designed similarly: the first 256 positions of UCS code for exactly the same
characters as ISO 8859-1.
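This nesting of the three character sets can be verified with Python's bundled codecs, as in the minimal sketch below.

```python
ascii_range = bytes(range(128))    # every ASCII position
latin1_range = bytes(range(256))   # every ISO 8859-1 position

# Positions 0-127 decode identically under ASCII and ISO 8859-1.
ascii_matches_latin1 = (
    ascii_range.decode("ascii") == ascii_range.decode("iso-8859-1")
)

# Positions 0-255 of ISO 8859-1 coincide with the same positions in UCS.
latin1_matches_ucs = (
    latin1_range.decode("iso-8859-1") == "".join(map(chr, range(256)))
)

print(ascii_matches_latin1, latin1_matches_ucs)
```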
Table A.1. ISO 8859-1 (Latin-1) characters and common control characters (ISO 6429), showing the
positions in both decimal and hexadecimal notation. Note that these are exactly equivalent
to the 256 characters defined in UCS (Unicode/ISO 10646) at positions 0 through 255.
Similarly, the characters defined at positions 0 through 127 are exactly the same as those
defined by ASCII. Control character short names are shown in italics, and printable
control characters that are allowed in XML and HTML documents are shown in boldface. The
control characters that are forbidden in XML or HTML documents are shown against a gray
background. Note that the printable space (32-decimal) and nonbreaking space (160-decimal)
characters are denoted by the strings SP and NBSP, as they would otherwise be invisible.
As mentioned previously, although the characters in an XML document must be defined in
the UCS character set, the document itself need not be encoded using UCS (although it is
obviously much easier to process if it is). Indeed, a document can be encoded using any
well-understood encoding scheme (ISO 8859-1, EUC-KR, Shift-JIS, etc.), provided the
document contains only characters defined in UCS, and provided character references refer
to characters by their positions in UCS.
An application reading such data must then know the encoding used to create the data,
and it must be able to decode the data to create valid Unicode characters. The two steps are:
- Determine the character set and encoding used for the specified data.
- Decode the data stream and write it to memory, mapping each encoded item in the input
data into the appropriate UCS character and storing this character in memory.
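These two steps can be sketched as follows, assuming the charset name has already been determined by some other means. The function read_as_ucs is a hypothetical helper, and the sample words are invented for the example.

```python
def read_as_ucs(data: bytes, charset: str) -> list:
    """Decode data of a known charset and return the UCS code positions."""
    text = data.decode(charset)       # step 2: map encoded items to UCS characters
    return [ord(ch) for ch in text]   # the in-memory UCS positions


# Step 1 is assumed done: each byte string arrives with its charset name.
korean_bytes = "\ud55c\uae00".encode("euc-kr")
japanese_bytes = "\u65e5\u672c".encode("shift_jis")

korean_positions = read_as_ucs(korean_bytes, "euc-kr")
japanese_positions = read_as_ucs(japanese_bytes, "shift_jis")
print(korean_positions, japanese_positions)
```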
The key is to determine the character set and encoding: Once this is known, the rest is
When text data is sent to a destination, the delivery process must indicate the
character set and encoding used to create the data. Fortunately, there is an easy way of
identifying this information, as there is an accepted naming scheme for identifying
character set/encoding pairs. These scheme identifiers (simple ASCII text strings like
ISO-8859-1, UTF-8, Big5) are often called charsets. Table A.2 lists some common
charset names, and the languages/writing systems with which they are associated.
There are two ways in which charset information can be included when text data is sent:
- As part of the message that contains the data (for example, as part of the header that
precedes the data being sent)
- Embedded directly within the data in an easy-to-recognize string
The former is preferred, as it is the most direct and is pretty well guaranteed to
work. Indeed, the e-mail MIME mechanism and the Web HTTP communications protocol include
mechanisms for specifying the charset of any block of data included in the message. These
mechanisms use the MIME content-type headers to specify the type of the data and
the charset used to create it. This approach is discussed in the next section.
The latter mechanism is a useful fallback, particularly because files are not always
sent by mechanisms that provide charset information; for example, when accessing a file
directly from the file system, or retrieving a file via FTP. Note, however, that without
foreknowledge of the charset, the software may not be able to read the data to find the
string that identifies the charset (this is very much a chicken-and-egg problem).
Both the e-mail message syntax (MIME) and the HTTP protocol use MIME content-type
headers for indicating the type of data (e.g., HTML, plain text, XML) being sent. This
header is sent ahead of the actual data, and is encoded using characters (ASCII) and a
character set (ISO 8859-1) understood by all Internet-aware software. This header supports
a charset parameter to indicate the character encoding used within the following
(or attached) text component. The form is
Content-type: text/subtype; charset=char-encoding
where subtype is the subtype of the text document (html, plain,
xml, etc.), and char-encoding gives the character set and
encoding used to create the data. Such headers are included with each part of a
MIME-encoded mail message, and every HTTP request or response header can include a
content-type header to indicate the type of the data being sent.
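A receiving program might extract the charset parameter along the following lines, using the MIME header parser in Python's standard email package. The function name charset_from_header and the fallback value of iso-8859-1 are assumptions made for this sketch, not rules taken from the text above.

```python
from email.message import Message


def charset_from_header(value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


charset = charset_from_header("text/html; charset=ISO-8859-4")
fallback = charset_from_header("text/plain")
print(charset, fallback)
```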
Unfortunately, some HTTP servers do not send charset information, while some older
browsers and applications cannot handle content-type headers containing
charset specifications: they misidentify the MIME type if a charset parameter is
present. Thus, in some situations, it is necessary to omit the charset from the
content-type header and to hope that the application receiving the data can infer the
charset from the content of the document.
For the reasons just mentioned, it is important that documents include markup
indicating the encoding used to create them. With XML (and hence XHTML), this information
must be placed in the XML declaration. HTML supports a special meta
element that does the same thing. The forms in these two cases are
<?xml version="1.0" encoding="char-encoding" ?>
<meta http-equiv="Content-Type" content="text/subtype; charset=char-encoding" />
where subtype gives the subtype of the text document (html,
xml, etc.), and where char-encoding is a well-known name for
the character set and encoding used to create the document (see Table A.2). An XHTML
document should include both these specifications. If the document is served as
XML, then the value specified in the XML declaration is used. If the document is served as
HTML, then the meta-element value is used. An example, assuming an HTML
document encoded using UTF-8, is
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This approach is practical because most character sets place the standard ASCII
characters in positions 0 to 127, and most encodings encode these positions in similar
ways (as a single byte corresponding to their position in the character set).
Consequently, a browser can assume just about any character encoding, guess the size of
the smallest encoding unit (one or two bytes; this can usually be guessed by looking
for patterns in the first few bytes), and then read the initial (ASCII-character) text
until a markup string is encountered that gives the actual encoding.
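A very rough sketch of this kind of sniffing is shown below. The function sniff_charset is hypothetical, the 1024-byte limit is arbitrary, and dropping zero bytes is only a crude stand-in for guessing a two-byte encoding unit; real parsers follow more careful rules.

```python
import re

# Matches encoding="..." in an XML declaration or charset=... in a meta element.
_LABEL = re.compile(
    rb"""(?:encoding|charset)\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def sniff_charset(data: bytes, limit: int = 1024):
    """Search the first bytes of data for a charset label; None if not found."""
    head = data[:limit].replace(b"\x00", b"")  # collapse two-byte ASCII units
    match = _LABEL.search(head)
    return match.group(1).decode("ascii").lower() if match else None


xml_doc = b'<?xml version="1.0" encoding="UTF-8" ?><doc/>'
html_doc = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-4" />'
print(sniff_charset(xml_doc), sniff_charset(html_doc))
```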
The charset specified inside a document is ignored if the document is received as part
of a message, and if the message itself uses a content-type header to indicate the
The issues associated with URL encoding are somewhat different. Here, instead of
encoding characters in a binary format, a URL represents an encoding, as printed characters
(actually, a limited set of ASCII characters), of some underlying text. This
encoding-as-characters step is required because of the intended use of URLs: they are
designed to be easily written down on paper, or to be digitally encoded and sent via
old-style e-mail systems that cannot handle complex character encodings.
This issue is discussed in somewhat more detail in Chapter 8, Section 8.1.4.
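As a brief illustration, Python's urllib.parse module performs this encoding-as-characters step. Note that quote() assumes UTF-8 for the underlying text by default; that choice belongs to the library and to this sketch, and the sample string is invented.

```python
from urllib.parse import quote, unquote

text = "caf\u00e9 au lait"
encoded = quote(text)        # unsafe characters become %XX escapes of their bytes
decoded = unquote(encoded)   # reversing the step recovers the original text

print(encoded, decoded == text)
```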
There are dozens of character sets and encodings in common use. The ISO, for example,
specifies several eight-bit character sets, in addition to ISO 10646 (UCS), and their
appropriate encodings. The ISO, however, is not the only organization that defines
character sets and encodings; national standards bodies, independent of the ISO,
have defined many sets. Table A.2 lists some of the more common ones. Many of the text labels
used here (left-hand column) are not standardized names (note the leading x-). Where
available, Table A.2 lists the Internet RFCs that document the encoding and associated
coded character set. RFCs are available at:
It is important to note that most of these character sets/encodings are not widely
supported: you may be able to encode documents using these character sets, but most
users will not be able to view them. For portable documents, you should produce and send
text encoded using UTF-8 or UTF-16, although ISO 8859-1 or US-ASCII (with character
references for non-Latin characters) are useful options compatible with most current
An official list of well-defined charset names is maintained by the Internet Assigned
Numbers Authority (IANA). The list is available at www.isi.edu/in-notes/iana/assignments/character-sets.
Note that having a name in this list does not guarantee that software understands the name
or knows how to process data so encoded!
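Whether a particular piece of software understands a charset name can only be found by asking it. The sketch below does so for Python's own codec registry; is_supported is a hypothetical helper, and the unknown name is invented for the example.

```python
import codecs


def is_supported(charset: str) -> bool:
    """Report whether this Python installation has a codec for the name."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False


# Lookups ignore case, so both spellings resolve to the same codec.
same_codec = codecs.lookup("ISO-8859-1").name == codecs.lookup("iso-8859-1").name

print(is_supported("UTF-8"), is_supported("x-no-such-charset"), same_codec)
```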
Table A.2. Some common (and not always official) names for character set encodings, with
descriptions. Note that the charset names are case-insensitive. For portable
documents, you should avoid distributing text using most of these encodings. The names of
encodings that are widely supported (or that must be supported by XML and HTML
applications) are in boldface, on a gray background. In the descriptions, the parenthetical
term (Windows) means that the charset is Microsoft Windows-specific, and the term (Macintosh)
means that the charset is Macintosh-specific.
||UCS [ISO 10646/Unicode], one-byte (8-bit) encoding; universal transformation format
||UCS [ISO 10646/Unicode], two-byte encoding, including surrogate extension mechanism
||Unicode, Version 1.1, two-byte encoding (defined before Unicode and ISO 10646 were "merged")
||Unicode, Version 2.0, two-byte encoding (equivalent to
||Unicode, one-byte (8-bit) encoding; universal
|ISO 10646-UCS-4, or UCS-4||ISO 10646, four-byte encoding
|ISO 10646-UCS-2, or UCS-2||ISO 10646/Unicode, two-byte encoding (same as UTF-16, but only encodes the first 65,536 characters)
||Like ISO 8859-1; with extra characters in positions through 255 (Macintosh only)
||Central European (Windows)
|x-mac-ce, mac-CE, or MacCE||Central/East European (Macintosh)
||Central/East European (Slavic: Czech, Croat, German, Hungarian, Polish, Romanian, Slovak, and Slovenian)
||Russian and Central/Eastern European (Windows)
||Southern European (Esperanto, Galician, Maltese, and Turkish)
||Cyrillic (RFC 1489)
||Baltic (Estonian, Latvian, Lithuanian)
||Cyrillic (Bulgarian, Byelorussian, Macedonian, Serbian, and
||Baltic (Estonian, Latvian, Lithuanian) (Windows)
|ISO IR-111, or ECMA-Cyrillic||
|TIS-620 or Windows-874||
||Japanese (RFC 1468; note that this encoding can use more than one coded character set; the encoding itself indicates the set being used)
|Shift_JIS or x-sjis||Japanese Shift-JIS (Microsoft)
|euc-jp or x-euc-jp||Japanese; Extended UNIX Code
||Korean (RFC 1557)
|euc-kr or x-euc-kr||Korean; Extended UNIX Code (RFC 1557)
|gb_2312-80 or gb-2312||Chinese, Simplified (People's Republic) (RFC 1345)
|x-euc-tw or euc-tw||Chinese-Taiwan; Extended UNIX Code
||Chinese, Traditional (Taiwan); multibyte set
- This is the official specification for the XML language. Appendix B defines various
classes of UCS, and Sections 2.2 and 2.3 define how various language components (attribute
and element names, for example) can be written using characters from these different
- XML Specification Guide, by Ian Graham and Liam Quin, John Wiley & Sons (1999)
- Appendix B of this book provides a more detailed overview of character sets and
- CJKV Information Processing, by Ken Lunde, O'Reilly & Associates, Inc.
- A definitive guide to character encoding issues, with particular emphasis on the
problems of Chinese, Japanese, Korean, and Vietnamese text.
- The Unicode Standard, Worldwide Character Encoding, Version 2.0 (1996)
- Complete description of the Unicode character set, with a CD-ROM illustrating all the
defined characters. The book is not exactly fireside reading, but it is very useful for
software developers. The standard can be purchased directly from the Unicode Consortium
(www.unicode.org). Ordering information and access to online updates to the standard
(Technical Report No. 8 brings the standard up to Version 2.1) are provided at the Unicode
standards page, at www.unicode.org/unicode/uni2book/u2.html.
- ISO/IEC 10646-1:1993, Information Technology: Universal Multiple-Octet Coded
Character Set (UCS), Part 1: Architecture and Basic Multilingual Plane.
- There are several amendments to the original specification, which you will need to bring
the standard up to date. Information about these amendments can be
found (if you're lucky -- the ISO keeps moving documents around)
at www.iso.ch/cate/dl8741.html. Information
about the ISO and instructions for purchasing ISO standards documents (including the
amendment text) can be accessed via the Web at www.iso.ch.