Page Last Updated: 5 September 2001

Appendix E

Tags for Identifying Languages--RFC 1766

In Internet applications, tags for identifying languages are specified by Internet RFC 1766. These are the tags used in HTML LANG attributes to specify the language used within the associated HTML element. The specification is based on the ISO standards for language codes (ISO 639) and country codes (ISO 3166), with extensions for situations not covered by these standards.

A language tag takes the general form

lang-subtag-subtag2...

where lang is a string of case-insensitive ASCII letters (a-z) specifying the language, and subtag is an optional, case-insensitive extension defining a subgroup of that language (there can be multiple subtag values, each separated from the previous one by a dash). Each string can contain at most eight characters. The prefix x- indicates a value defined for private use. Although uppercase letters are allowed, their use is discouraged within the lang portion of the tag. The following is a more detailed description of the meaning and allowed values of these fields.

Lang. This refers to the base language. If this string contains exactly two characters, it is one of the language codes specified in ISO 639. For example, fr refers to the French language, and ja to Japanese. The only other allowed values are the prefix i-, which RFC 1766 reserves for tags registered with the IANA, and the prefix x-, which marks private codes.

Subtag. This refers to a variant of a language, usually a national variant identified by a two-letter country code from ISO 3166 (e.g., fr-CA for Canadian French). Also allowed are values that refer to dialects (e.g., en-cockney), or that account for physical script variations appropriate to a language. National variant subtags are traditionally written in uppercase, although this is not required, since the value is case-insensitive. A language code can also be used without a country code, which implies generic settings appropriate to the language.
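
The rules above can be illustrated with a short Python sketch. It is not part of RFC 1766 and is not a complete validator: the function name parse_language_tag is hypothetical, it checks only the syntax described here (dash-separated strings of one to eight ASCII letters), and it does not look codes up in ISO 639 or ISO 3166. It also treats x as the first string of a private tag, and normalizes case following the conventions mentioned above.

```python
import re

# One tag component: 1 to 8 ASCII letters, as described in the text.
_COMPONENT = re.compile(r"[A-Za-z]{1,8}")


def parse_language_tag(tag):
    """Split a tag into (lang, subtags); raise ValueError if it is malformed.

    Hypothetical helper for illustration only. It checks syntax, not
    whether the codes actually appear in ISO 639 or ISO 3166.
    """
    parts = tag.split("-")
    for part in parts:
        if not _COMPONENT.fullmatch(part):
            raise ValueError(f"invalid component {part!r} in tag {tag!r}")

    # Tags are case-insensitive; lowercase is the preferred form for lang.
    lang = parts[0].lower()

    subtags = []
    for index, part in enumerate(parts[1:]):
        # A two-letter first subtag is conventionally a country code and is
        # traditionally written in uppercase (private x- tags are excluded).
        if index == 0 and len(part) == 2 and lang != "x":
            subtags.append(part.upper())
        else:
            subtags.append(part.lower())
    return lang, subtags


if __name__ == "__main__":
    for example in ("en-us", "FR", "en-cockney", "x-romulan"):
        print(example, "->", parse_language_tag(example))
```

Because the comparison of tags is case-insensitive, normalizing to one form before comparing (as done here) is a convenience, not a requirement of the specification.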

The following table illustrates some language tags:

Language Tag    Description
en-US           American English
en-cockney      Cockney dialect of English
x-romulan       Romulan language
ar-EG           Egyptian Arabic
fr              French (generic)

Most modern browsers support language tags within the browser's configuration menus, and use this information to compose the Accept-Language HTTP header that the browser sends to the server when requesting a resource. A few browsers, such as Tango from Alis Technologies (www.alis.com), also support language tags in the LANG attributes of HTML elements, and use this information to modify the presentation of the text.
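
As a rough illustration of how a list of preferred languages might be turned into an Accept-Language header value, consider the following Python sketch. The function name build_accept_language is hypothetical, and the ";q=" quality values come from the HTTP/1.1 specification, not from RFC 1766. The choice to lower the quality by 0.1 per position is simply an assumption made for this example; real browsers use their own schemes.

```python
def build_accept_language(tags):
    """Return an Accept-Language value for tags listed most-preferred first.

    Hypothetical helper: the first tag gets the implicit quality 1.0, and
    each later tag is ranked 0.1 lower, never dropping below 0.1.
    """
    items = []
    for position, tag in enumerate(tags):
        quality = max(10 - position, 1) / 10
        if quality == 1.0:
            # q=1 is the default, so it can be omitted.
            items.append(tag)
        else:
            items.append(f"{tag};q={quality:.1f}")
    return ", ".join(items)


if __name__ == "__main__":
    # A reader who prefers Canadian French, then generic French, then English.
    print("Accept-Language:", build_accept_language(["fr-CA", "fr", "en"]))
```

A server receiving such a header can then pick the available version of a resource whose language tag has the highest quality value.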

ISO 639 Language Codes

The ISO specifies, via ISO 639, two-letter codes for the world's various languages. In Web applications, language codes are used to select language-specific conventions, such as punctuation marks, currency symbols, numerical notation (e.g., commas instead of periods as the decimal separator), text direction (left to right or right to left), and so on. The use of these codes is discussed in Chapter 6. Note that language codes are not related to the character set used by an HTML document. In principle, the same language code can be used for HTML documents using different character sets, provided each character set supports the symbols required by the language.
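
The independence of language codes and character sets can be seen in a small Python sketch. The helper render_paragraph is hypothetical and exists only for this illustration; it marks the same French sentence with LANG="fr" and encodes it in two different character sets, both of which happen to support the accented letters required.

```python
def render_paragraph(text, lang, charset):
    """Return an HTML paragraph tagged with lang, encoded in charset.

    Hypothetical helper for illustration; it performs no escaping.
    """
    markup = f'<P LANG="{lang}">{text}</P>'
    return markup.encode(charset)


if __name__ == "__main__":
    sentence = "Où est la bibliothèque ?"
    latin1 = render_paragraph(sentence, "fr", "iso-8859-1")
    utf8 = render_paragraph(sentence, "fr", "utf-8")

    # The bytes on the wire differ between the two character sets...
    print(latin1 != utf8)
    # ...but both decode to the same markup, with the same language tag.
    print(latin1.decode("iso-8859-1") == utf8.decode("utf-8"))
```

A character set that lacks the required symbols (plain ASCII, in this case) would cause the encoding step to fail, which is the proviso noted above.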

URLs providing up-to-date (if unofficial) lists of the ISO 639 language codes are provided in the "References" section at the end of this appendix.

ISO 3166 Country Codes

In addition to codes for the world's languages, the ISO also specifies, via the ISO 3166 standard, two-letter codes for the different countries of the world. Like the language codes, country codes are case-insensitive. URLs that provide up-to-date (if unofficial) lists of these codes are provided in the "References" section at the end of this appendix.

ISO 3166 codes are, with a few exceptions (for example, the United Kingdom uses the domain uk rather than its ISO 3166 code, gb), identical to the codes used in the Internet domain name scheme to identify the country domain of an Internet address. However, the DNS also uses some non-national domain names, such as ARPA (old-style Arpanet--obsolete), COM (commercial), EDU (educational), GOV (government), INT (international), MIL (U.S. military), NATO (for NATO, largely unused at present), NET (network), and ORG (nonprofit organization). Additional domain names have also been proposed (WEB, etc.), but these are not yet in common use. Note that, when used as a part of a domain name, these strings are given in lowercase (e.g., server.net, machine.domain.org).
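
The distinction between country domains and non-national domains can be sketched in Python as follows. The function name classify_top_level_domain and the labels it returns are hypothetical, and the set of non-national names is simply the list given above. A two-letter result only looks like a country code; the sketch does not check it against the actual ISO 3166 list.

```python
# Non-national domain names, taken from the list in the text.
GENERIC_DOMAINS = {"arpa", "com", "edu", "gov", "int", "mil", "nato", "net", "org"}


def classify_top_level_domain(hostname):
    """Return (label, kind) for the last label of hostname.

    Hypothetical helper: kind is "non-national", "country" (any two-letter
    label, not verified against ISO 3166), or "unknown".
    """
    # Domain names are conventionally given in lowercase.
    label = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    if label in GENERIC_DOMAINS:
        return label, "non-national"
    if len(label) == 2 and label.isascii() and label.isalpha():
        return label, "country"
    return label, "unknown"


if __name__ == "__main__":
    for name in ("server.net", "machine.domain.org", "WWW.Example.CA"):
        print(name, "->", classify_top_level_domain(name))
```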

References

Tags for the Identification of Languages on the Internet (RFC 1766)
http://www.ietf.org/rfc/rfc1766.txt

ISO 639 Language Codes
http://www.iangraham.org/books/html4ed/appe/iso639.html

ISO 3166 Country Codes
ftp://ftp.isi.edu/in-notes.iana/assignments/country-codes
http://www.iangraham.org/books/html4ed/appe/iso3166.html

The HTML 4.0 Sourcebook © 1995-1998 by Ian S. Graham