APPENDIX B

The XHTML 1.0 Web Development Sourcebook

Multipurpose Internet Mail Extensions (MIME)

B.1 The MIME Content Type

B.1.1 Character Set Specifications for Text Types

B.2 Multipart MIME Messages

B.2.1 Content-Transfer-Encoding
B.2.2 Important Multipart MIME Content-Types

B.3 How Servers Determine Content Type

B.3.1 Default Type for Unknown File Name Extensions

B.4 How Browsers Determine Content Type
B.5 Browser Handling of Content Types

B.5.1 Browser Sending Data to a Server

B.6 References

Multipurpose Internet Mail Extensions (MIME) is the part of the Internet mail system that allows for multimedia electronic mail. MIME mechanisms are also used on the Web to define the type of a piece of data (e.g., text/html) and to send complex multipart messages (messages with multiple parts) via HyperText Transfer Protocol (HTTP). Thus, every Web expert should understand the basics of the MIME mechanisms and how they are integrated into Web server operation and Web application design. The aim of this appendix is to provide this basic knowledge.

The original Internet mail message protocol, defined in RFC 822, was designed with text mail messages in mind. A mail message was defined as a block of plain text preceded by specially defined headers specifying routing or other information about the message (for example, where the message was from, who it was being sent to, who copies were sent to, etc.). This specification said little about the format of the message content. At the time (which was not that long ago!), electronic mail messages were plain text files, so that concerns about the format of content were unwarranted. My, how things have changed!

Today there is enormous demand for electronic mail that can deliver messages containing components such as HyperText Markup Language (HTML) text documents, image files, sound, and even video data, in addition to regular text. However, such messages can be widely communicated only if all mail handling programs share a standard for constructing, encoding, and transporting such complex, multipurpose messages.

MIME provides this common standard. MIME defines, as an extension of the original mail protocol (defined in RFC 822), an extensible format for including multimedia components within a mail message. Thus, MIME defines how to code the content, while RFC 822 specifies how to package the message and get it to its destination. MIME defines several document headers, placed inside the document, that specify such things as the nature of a message (multipart or single part), how the message parts are separated, the data content of each part, and the encoding scheme used to encode each part. The following sections summarize those features that are most important for Web applications. This is only a brief summary, however; for details you are referred to the relevant documentation (RFC 2045 through 2049, and others) listed at the end of this appendix.


B.1 The MIME Content Type

Of primary importance is the MIME content-type header. This should be familiar from elsewhere in the book, since this is the same header used to indicate the type of data being transferred using the HTTP protocol (see Chapter 9). Whenever a client requests a document from an HTTP server, the server first determines the type of the document, and then sends the appropriate content type ahead of it. For example, if the file contains AIFF audio data, the server must send back the content-type header field:

Content-type: audio/aiff

Similarly, when a client browser uses the HTTP POST method to send data to a server, the data is preceded by an HTTP header that contains a content-type field to tell the server the format of the data being sent. The two supported types for POSTed form data are:

Content-type: application/x-www-form-urlencoded

Content-type: multipart/form-data
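As a rough illustration of the first of these types, the following Python sketch encodes a set of form fields the way a URL-encoded POST body is laid out. The field names and values are invented for the example, and the headers dictionary simply stands in for whatever HTTP library would actually send the request.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields, chosen only for illustration.
fields = {"name": "Ian Graham", "topic": "MIME & HTTP"}

# Spaces become "+" and reserved characters such as "&" are percent-escaped.
body = urlencode(fields)

# The content-type field that would precede this body in the POST request.
headers = {
    "Content-type": "application/x-www-form-urlencoded",
    "Content-length": str(len(body)),
}

print(headers["Content-type"])
print(body)
# A server-side program reverses the encoding to recover the fields.
print(parse_qs(body))
```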

How do content-type headers work? Each header has a minimum of two parts, giving the generic data type and also a specific subtype. The syntax is:

Content-type: type/subtype

The MIME specification defines type to be image, audio, text, video, application, multipart, message, or x-arbitrary-name (these names, as with the string content-type, are case insensitive). The meanings of the first four are obvious, and indicate the overall type of the data. The application type is for other data (perhaps binary) that needs to be processed in a special way. This could be a program to run, or perhaps a PostScript or PDF document to be displayed by a PostScript or PDF viewer. Multipart indicates a message containing more than one part, while message refers to an old-fashioned RFC 822 plain text e-mail message. X-arbitrary-name (i.e., any name beginning with the string X-) is called an extension token, and refers to experimental data types. This lets you, or anyone else, create special MIME type names that do not conflict with established ones. There are, in fact, many "experimental" MIME types in common use on the Web.
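The type/subtype syntax can be explored with a short Python sketch that leans on the standard library's email package to do the parsing. The helper names split_content_type and is_extension_token are made up for this example.

```python
from email.message import Message


def split_content_type(header_value):
    """Return (type, subtype), lower-cased, for a content-type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_maintype(), msg.get_content_subtype()


def is_extension_token(name):
    """True for experimental names, i.e. those beginning with X- or x-."""
    return name.lower().startswith("x-")


for value in ("audio/aiff", "TEXT/HTML", "application/x-www-form-urlencoded"):
    main_type, subtype = split_content_type(value)
    print(main_type, subtype, is_extension_token(subtype))
```

Because the names are case insensitive, TEXT/HTML and text/html are reported identically.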

Two new basic types were recently introduced. World is used for Virtual Reality Markup Language (VRML) data and for 2D/3D data sets used for generating 3D views, while chemical is designed for communicating information about chemical models and structures. Both of these are also commonly still seen using the "experimental" type names x-world/ and x-chemical/.

In the type/subtype string, the subtype defines a specific type of data, for example a specific type of text data or audio data. Thus, text/html means text that is HTML data, text/xml means that it is eXtensible Markup Language (XML) data, application/postscript indicates PostScript data, and so on. There are lots of content types: Table B.1 lists those that are particularly important in Web applications. A more complete list is found at www.iangraham.org/books/xhtml2/appb/mimetypes.html. Subtypes can also be experimental extension types, such as the x-www-form-urlencoded subtype shown previously.

Table B.1
Some MIME Content Types of Particular Importance for Web Applications. A more complete list is found at www.iangraham.org/books/xhtml2/appb/mimetypes.html.

MIME TYPE                            MEANING                                     TYPICAL FILE NAME EXTENSIONS
text/html                            HTML data                                   .html, .htm
text/xml                             XML data                                    .xml
text/plain                           Plain (non-marked-up) text                  .txt
text/css                             Cascading Style Sheets (CSS) style sheet    .css
application/javascript               JavaScript (or JScript) program code        .js
application/vbscript                 VBScript program code                       .vb, .vbs
multipart/form-data                  Multipart-encoded data from an HTML form
application/x-www-form-urlencoded    Uniform Resource Locator (URL)-encoded
                                     data from an HTML form
application/octet-stream             Binary data of unknown type

Note that, because of the nature of the HTTP client-server interaction, there are many content types used by Web applications that are not used in electronic mail, and vice versa.

B.1.1 Character Set Specifications for Text Types

Any text type (i.e., a MIME type beginning with text/) can take an optional charset parameter to specify both the character set and the character set encoding used to create the text data. It is no use receiving data representing text if the receiver does not know the relationship between the bytes and the desired characters. The format for including this parameter is

Content-Type: text/subtype; charset=char_set_name

where subtype is the text subtype (e.g., html, xml, or plain) and char_set_name is a name that indicates the character set and character set encoding used in the document. Note how the semicolon (;) separates the text/subtype field from the charset parameter. Some possible values are US-ASCII and ISO 8859-1 (ISO Latin-1) through 8859-9. (See Appendix A for more information on character sets.) Web applications often assume the ISO Latin-1 character set by default if this parameter is omitted.
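A receiver's handling of the charset parameter might look like the following Python sketch. The function name charset_of is invented for the example, and the ISO Latin-1 fallback simply mirrors the default behavior described above rather than any particular browser's code.

```python
from email.message import Message


def charset_of(header_value, default="iso-8859-1"):
    """Return the charset named in a content-type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the name and returns the
    # fallback when no charset parameter is present.
    return msg.get_content_charset(default)


print(charset_of("text/html; charset=UTF-8"))
print(charset_of("text/plain"))

# The same bytes only become the intended characters once the charset is known.
data = b"caf\xe9"
print(data.decode(charset_of("text/plain")))
```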

Of particular interest are charset values associated with the UCS (Unicode) character set. This character set is defined as the base character set for both XML and HTML, and the current Web specifications recommend distributing newer HTML documents using this character set and one of the supported encodings. (This means that any character reference in an XML or HTML document, such as &#915;, refers to the character at position 915 in the UCS character set.) The two main encodings/charset values are UTF-16 (16-bit encoding of UCS; each character stored in 2 bytes) and UTF-8 (8-bit encoding of UCS; characters stored in 1 to 6 bytes), with UTF-8 being the recommended choice for most cases.
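To make the distinction between character set and encoding concrete, this small Python sketch resolves the character reference for position 915 and then shows the bytes that the two encodings produce for that one character. It is only an illustration for a single character; other characters may occupy a different number of bytes.

```python
import html

# Resolve the numeric character reference to the character at position 915.
char = html.unescape("&#915;")
print(char, ord(char))

# The same character, stored under the two encodings discussed above.
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")  # big-endian, without a byte-order mark
print(utf8_bytes.hex(), len(utf8_bytes))
print(utf16_bytes.hex(), len(utf16_bytes))
```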

Some browsers, such as Netscape Navigator 4 and Internet Explorer 4, can handle these character sets. Unfortunately, a computer cannot display the desired characters unless you (the user) explicitly add the relevant character set’s fonts to your machine—most machines do not come with built-in Unicode fonts.

NOTE Older browsers do not understand charset parameters. At present, many older browsers (Navigator 3 and earlier, Internet Explorer 3 and earlier) cannot properly process content-type headers containing charset specifications, and assume that all documents are encoded in ISO Latin-1.

Appendix A discusses character sets and encodings in more detail.

HTML Level Specification

In principle, the text/html MIME type can take a version parameter. This optional parameter specifies the level of the HTML language used in the document. For example,

Content-type: text/html; version=4.0; charset=utf-8

indicates that the data is encoded using UCS and the UTF-8 character encoding, and that the HTML markup is consistent with the HTML 4.0 specification. This parameter is largely unused by current Web applications, and can sometimes cause problems if present, in the same manner as the charset parameter discussed previously. In general, it is safest to omit the version parameter and instead use a DOCTYPE declaration inside the actual HTML document to declare the version, as in:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
         "http://www.w3.org/TR/html4/strict.dtd">
<html>
         ...The head, body, etc. goes here...
</html>

B.2 Multipart MIME Messages

Multipart MIME messages play an important role in Web applications, both for browser-to-Web server transactions and for sending/receiving "Web-enabled" mail. The multipart type defines how multiple data parts can be included within a single message. These parts can be regular text files, HTML data, or binary data (such as images). The multipart specification defines how these parts are combined together, how binary data parts are encoded (i.e., stored as regular ASCII characters, so that the data can be safely transmitted in a mail message), and how parts can be referenced using special cid and mid URLs (discussed in Chapter 8).

In a multipart message, the different parts are placed in a single message, one after the other, separated by a special divider. This divider or boundary is a text string, defined in the message’s multipart content-type header field that precedes the entire message. For example, for the multipart/mixed content type (which indicates a multipart message containing a collection of unrelated parts of various types), the general form for the content-type field is

Content-Type: multipart/mixed; boundary=separatorstring

where separatorstring is a string of characters, guaranteed never to appear elsewhere in the message, that is used to separate the message parts.

The boundaries between adjacent message parts are then simple text lines consisting of the string

--separatorstring

that is, the boundary string preceded by two dash characters. This string is followed by the content-type declaration for the specific part, which is in turn followed by a blank line containing only CRLF (a carriage return and a line feed character) to indicate the end of the headers and the start of the data. The end of one part of the data and the beginning of the next part is indicated by another string of the form

--separatorstring

Finally, the end of the entire message is marked by the special string

--separatorstring--

which adds two hyphens at the end of the separator string to mark this as the special boundary at the end of the message.

The following simple example outlines such a message, where the strings CRLF denote the blank lines (containing only the carriage return and line feed characters) that follow the headers and precede the data:


MIME-Version: 1.0
Content-type: multipart/mixed; boundary=23xx1211
  CRLF
--23xx1211
Content-type: text/html
  CRLF
.... html document data  .(first "part" of the message)...
--23xx1211
Content-type: audio/aiff
Content-transfer-encoding: BASE64
  CRLF
..... audio data ..... (second "part" of the message)....
--23xx1211--

This simple example omits several details, but illustrates the general approach. It contains two parts: (1) a text file in HTML format, and (2) an audio file in AIFF format. The MIME multipart header indicates that there is more than one component to the message, and specifies the string used to divide the message parts.
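The structure of such a message can be checked with a short Python sketch that feeds an equivalent message to the standard library's email parser. The HTML fragment and the audio bytes are placeholders invented for the example (they are not real AIFF data); only the boundary string and the header layout follow the outline above.

```python
import base64
from email import message_from_string

# Placeholder bytes standing in for real audio data.
audio_bytes = b"\x00\x01\x02fake-audio"

raw = (
    "MIME-Version: 1.0\r\n"
    "Content-type: multipart/mixed; boundary=23xx1211\r\n"
    "\r\n"
    "--23xx1211\r\n"
    "Content-type: text/html\r\n"
    "\r\n"
    "<p>Hello</p>\r\n"
    "--23xx1211\r\n"
    "Content-type: audio/aiff\r\n"
    "Content-transfer-encoding: BASE64\r\n"
    "\r\n"
    + base64.b64encode(audio_bytes).decode("ascii") + "\r\n"
    "--23xx1211--\r\n"
)

msg = message_from_string(raw)
parts = msg.get_payload()  # a list of sub-messages for multipart data

print(msg.is_multipart(), msg.get_boundary())
for part in parts:
    print(part.get_content_type())

# decode=True undoes the content-transfer-encoding of the second part.
decoded_audio = parts[1].get_payload(decode=True)
print(decoded_audio == audio_bytes)
```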

B.2.1 Content-Transfer-Encoding

Note the content-transfer-encoding field in the second part of this message. This field indicates the mechanism used to encode the data included in the message part. Thus this field,

Content-transfer-encoding: base64

means that the data is encoded using base64 encoding. This encoding mechanism, defined in RFC 2045, essentially encodes any three adjacent bytes of binary data using 4 adjacent characters selected from a set of 65 "mail-friendly" ASCII characters. This encoding ensures that the binary attachment can be sent via e-mail without risk of the attachment being corrupted by intervening mail gateways.
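The three-bytes-to-four-characters relationship is easy to see with Python's standard base64 module. The input strings below are arbitrary; the alphabet variable is assembled here only to count the characters involved.

```python
import base64
import string

# Three bytes of input become exactly four output characters.
print(base64.b64encode(b"Man"))

# Shorter inputs are padded with "=" so the output is still four characters.
print(base64.b64encode(b"Ma"))
print(base64.b64encode(b"M"))

# 64 data characters plus the "=" padding character: 65 in total.
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
print(len(alphabet) + 1)

# Decoding restores the original bytes.
print(base64.b64decode("TWFu"))
```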

B.2.2 Important Multipart MIME Content-Types

Three multipart MIME types are of particular relevance for Web applications. These are:

Content-Type: multipart/x-mixed-replace. This multipart model, introduced by Netscape with the Navigator 2.0 browser, is discussed in Section 11.1.2 of Chapter 11. This type can be used to stream a continuous series of data from a server to the browser. Because the browser understands that the message is coming in multiple parts, it displays each part when that part arrives, with the multipart divider string telling the browser where one part finishes and the next part begins.
Content-Type: multipart/form-data. The multipart specification is also crucial to the multipart/form-data type (defined in RFC 1867; see the references at the end of this appendix), developed to support the uploading of complex form content, such as forms containing uploaded binary data files, to HTTP servers. When a form element specifies the POST method along with enctype="multipart/form-data", the data entered into the form’s input elements is encoded into distinct parts of a multipart message, each user input element in the form giving rise to its own specific part. When the form is submitted, the data is sent to the server as a single message of the type multipart/form-data.
Note in particular that a form containing an <input type="file" .../> element must use enctype="multipart/form-data", since this is the only mechanism that can encode a file and send it as a part of the form.
Content-Type: multipart/related. The multipart/related content type denotes a message containing parts that are related one to another. A typical example is a mail message consisting of an HTML document, along with all the images that appear inline in the document. In this case, the mail message must contain all the parts, but must also support URLs that link between the parts, so that the document can contain URLs referencing the images lying in the same message.

The specification for employing the third of these special types is found in the new MHTML (MIME-HTML, or M-HTML) specification defined in RFC 2557. Figure B.1 illustrates one case from this specification: Here, the content-base header indicates the base from which the document and its parts came. The browser uses this content base to locate all the related parts, including the image file included as the second part of the message. This mechanism is used by the Netscape Navigator 4 (and later) mail client when composing HTML mail messages, and is now widely used by mail clients capable of creating HTML-format mail messages.

Figure B.1
An illustration of a simple multipart/related message containing an HTML document and a related image file. Comments are in italics. Note how the JPEG image message part is encoded using the base64 encoding.

regular mail headers .... (see RFC 822)
Mime-Version: 1.0
Content-Base: http://www.utoronto.ca/ian/
Content-Type: multipart/related; boundary="113101231231"; type="text/html"

--113101231231
Content-Type: text/html; charset=ISO-8859-1

... here is the actual document, which contains reference to
the image file to be inlined when the document is displayed.
<IMG SRC="/images/ians-mug.jpeg" ALT="UGLY Picture!">
The browser uses the Content-base header to recognize that the
image is actually included below, in the next part of the document.

--113101231231
Content-Location: /images/ians-mug.jpeg
Content-Type: image/jpeg
Content-Transfer-Encoding: BASE64

APlGODlhGAGgAPEAAP/////ZRaCgoAZZDCH+PUNvcHlyaWdodCAoQykgMTk5
AQDVK32yilpdjladlsfg1116ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
and so on -- a BASE64 encoded image ...

--113101231231--

There are many other mechanisms allowed by the MHTML specification, including the use of cid and mid URLs (see Chapter 8) to reference specific parts within the message, or even specific parts within another mail message. See RFC 2557 (and also 2387) for details.

B.3 How Servers Determine Content Type

For the server to be able to send a content-type field in the header that precedes the data, it must know the type of the data being returned. When a server is returning data files stored on disk, this type is determined using conventions for the file name suffixes, or extensions—servers generally assume that the suffix defines the type of a document. For example, files with the suffix .mpeg are usually assumed to be MPEG movies (video/mpeg), while files ending in .html are assumed to be HTML documents (text/html), and so on. On some servers (historically, because of the three-character filename extension limits imposed by the Windows 3.1 operating system), these suffixes are often shortened to three letters; for example, .mpg for MPEG movies and .htm for HTML documents.

The actual relationship between file name suffixes/extensions and MIME types is configurable, and can be changed by modifying a server’s configuration database or files. Thus, when the server administrator defines a new type of data on a server, for example Kodak Photo-CD images, he or she must define an extension to use for such files (e.g., .pcd) and configure the server to associate the appropriate MIME type (e.g., image/x-photo-cd) with this extension. The server administrator can also associate multiple extensions with the same MIME type (e.g., .html and .htm, both corresponding to HTML documents). Of course, it is then up to the author, when placing documents on the server, to ensure that the documents use the extensions corresponding to each document’s content.
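The idea of a configurable extension table can be sketched with Python's standard mimetypes module, which plays a role loosely similar to a server's content-type configuration. The file names are invented, and a real server would read these associations from its own configuration files rather than from code.

```python
import mimetypes

# A private table, so the sketch does not alter the process-wide defaults.
server_types = mimetypes.MimeTypes()

# The administrator associates a new extension with a MIME type.
server_types.add_type("image/x-photo-cd", ".pcd")

# Several extensions can correspond to the same type (.html and .htm).
for name in ("index.html", "index.htm", "holiday.pcd"):
    content_type, _encoding = server_types.guess_type(name)
    print(name, "->", content_type)
```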

Of course, if the data is being returned by a gateway program of some sort, then the HTTP server does not know the type (a program can return practically anything). It is then up to the program designer to return, via the gateway program, a proper content-type header describing the data being returned. This issue is discussed in Chapter 10.
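In outline, a gateway program's output begins with the content-type header, then a blank line, then the data itself. The Python sketch below shows only that layout; the function name gateway_response and the HTML body are invented for the example, and a real gateway program would need to handle its input and any other headers as well.

```python
def gateway_response(body, content_type="text/html"):
    """Build gateway output: content-type header, blank line, then the data."""
    return "Content-type: " + content_type + "\r\n\r\n" + body


response = gateway_response("<html><body>Hello</body></html>")
print(response)
```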

B.3.1 Default Type for Unknown File Name Extensions

If the server is returning a regular file, but does not know the type of a file (for example, the file has no file name suffix, or the suffix has no entry in the content types database), the server then assumes a default content-type value, often text/plain. However, this default can be changed to another value, and many servers are configured to send unknown data out as MIME type application/octet-stream. This corresponds to unidentified binary data — a browser receiving data of this type will generally prompt the user to ask what to do, typically giving the option of canceling the download or saving the data file to disk.
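The fallback behavior can be summarized in a few lines of Python. The table KNOWN_TYPES and the function content_type_for are hypothetical stand-ins for a server's content types database and lookup routine, and the default is a parameter because, as noted above, it is a configuration choice.

```python
import os.path

# A tiny stand-in for the server's content types database.
KNOWN_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".mpeg": "video/mpeg",
    ".mpg": "video/mpeg",
}


def content_type_for(filename, default="text/plain"):
    """Look up the type by file name suffix, falling back to the default."""
    _root, ext = os.path.splitext(filename)
    return KNOWN_TYPES.get(ext.lower(), default)


print(content_type_for("movie.MPG"))
print(content_type_for("README"))  # no suffix at all
print(content_type_for("data.xyz", default="application/octet-stream"))
```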

B.4 How Browsers Determine Content Type

If a browser receives a file from an HTTP server, it is explicitly told the content type by the server’s response header. With File Transfer Protocol (FTP) or local file access, this information is not available, and the browser must itself determine—or guess at—the file type. Again, this is done by the file name extension. To support this, Web clients also maintain a database matching file name extensions to data types, which is used in the absence of any other content-type information. In general, when new software is loaded onto a computer—either an independent application or a browser plug-in—the operating system and the browser are automatically configured to know about this new data type and to know what file name extensions correspond to this new type of data.

The location of this database varies from browser to browser, but in most cases this database can be accessed and modified from within the browser via pull-down menus.

A Web designer of course has no control over the browser’s type database. Thus, if you are developing content that is likely to be delivered using FTP or local file access, be aware that some users may not be able to properly view the data. In this case you may wish to explicitly identify the type of nonstandard data files, for example by adding a text description of the type adjacent to the link to the data.

NOTE
INTERNET EXPLORER IGNORES SOME CONTENT-TYPE VALUES. When Internet Explorer 3, 4, or 5 receives data of MIME type text/plain, the browser checks the document to see if it contains HTML markup tags. If it finds such tags, it ignores the actual MIME type and treats the document as text/html. Thus, you cannot deliver an HTML document as type text/plain if you want to let the user view the actual markup.

B.5 Browser Handling of Content Types

When a browser receives data, it checks the MIME type to see if the browser itself can view the indicated data type. If it cannot, it then looks for an alternative mechanism for displaying the data. If the data was accessed via an anchor element, the browser checks to see if there is a registered external application (a program separate from the browser, or one that is available as a browser plug-in) designed to handle the indicated type of data. If one is available, then the other program is started, and the data is passed to it for processing.

In some cases, the user will be asked to confirm this procedure, particularly if there are security implications associated with downloading and displaying the data. For example, if the user downloads a Microsoft Word document, the browser (or Word plug-in) may query the user about disabling Word macros or other avenues by which viruses or worms can attack the local machine.

If there is no program or plug-in registered for the indicated type, then the browser informs the user of its predicament and asks what it should do (save the file to disk, cancel the download, etc.).
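The decision sequence for data reached through an anchor element can be summarized as a small Python sketch. Everything here is hypothetical: the function name choose_action, the set of built-in types, and the helper table are invented to illustrate the order of the checks, not to describe any particular browser.

```python
def choose_action(content_type, builtin_types, helpers):
    """Decide what to do with received data, following the order above."""
    content_type = content_type.lower()
    if content_type in builtin_types:
        return "display in browser"
    if content_type in helpers:
        return "launch " + helpers[content_type]
    # No registered program or plug-in: the user must decide.
    return "ask user (save to disk, cancel, ...)"


# Hypothetical browser configuration.
builtin = {"text/html", "text/plain", "image/jpeg"}
helpers = {"application/msword": "word processor"}

for ctype in ("text/html", "application/msword", "application/octet-stream"):
    print(ctype, "->", choose_action(ctype, builtin, helpers))
```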

If the data is accessed as an embedded object (via embed or object elements), the browser once again searches the user’s machine for an appropriate plug-in module. If no appropriate plug-in is available, the browser informs the user of the problem and asks the user what to do (save the file to disk, search the Web for an appropriate plug-in, cancel the download, etc.). The embed element in particular provides attributes (pluginspage, pluginsurl: see XHTML 1.0 Language and Design Sourcebook, Section 17.1.6) by which the document author can specify a URL providing information about appropriate plug-in software. A document author should always provide pluginspage and pluginsurl attribute values to support those who do not have appropriate plug-ins, and should also provide noembed element content to support browsers that simply do not support the desired plug-in.

B.5.1 Browser Sending Data to a Server

When a browser sends data (such as files included within a form using the input type="file" upload mechanism) to a server, the browser will infer the MIME type for the attached file from the local system’s table of file name extensions. Thus, if the user uses a form to upload the file testfile.doc, the browser will check to determine the type of this file, and will add the appropriate MIME content type within the appropriate part of the message sent to the server (for example, application/msword if a Microsoft Word document).

If the file name extension corresponds to no known type, then the browser will not include a content-type header (Navigator 4 and earlier), or will use the special string content-type: unknown (Navigator 5), or will assume the default type text/plain (Internet Explorer 5.5 and earlier).
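A rough Python sketch of how one file part of a multipart/form-data message might be assembled follows. The function name file_part, the field name, the boundary string, and the file contents are all invented for the example. The type lookup uses Python's mimetypes table as a stand-in for the local system's table, and the header is simply omitted for an unknown extension, which is only one of the behaviors described above.

```python
import mimetypes


def file_part(field_name, filename, data, boundary):
    """Build one multipart/form-data part for an uploaded file (bytes)."""
    guessed, _encoding = mimetypes.guess_type(filename)
    disposition = 'Content-Disposition: form-data; name="%s"; filename="%s"' % (
        field_name, filename)
    lines = [b"--" + boundary.encode("ascii"), disposition.encode("ascii")]
    if guessed is not None:
        # Type inferred from the file name extension.
        lines.append(("Content-Type: " + guessed).encode("ascii"))
    lines.append(b"")  # blank line: end of the part's headers
    lines.append(data)
    return b"\r\n".join(lines) + b"\r\n"


boundary = "xx-example-boundary-xx"  # hypothetical boundary string
known = file_part("upload", "testfile.doc", b"...file bytes...", boundary)
unknown = file_part("upload", "testfile.zzunknown", b"...file bytes...", boundary)
closing = b"--" + boundary.encode("ascii") + b"--\r\n"

print((known + closing).decode("ascii"))
```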

B.6 References

INTRODUCTORY DOCUMENTS ON MIME (Slightly Out of Date)
ftp://ftp.uu.net/networking/mail/mime/mime.ps	PostScript
ftp://ftp.uu.net/networking/mail/mime/mime.txt	Plain text
MIME FAQ DOCUMENTS
http://www.irvine.com/~mime
INTERNET MAIL MESSAGE SYNTAX SPECIFICATION
http://www.ietf.org/rfc/rfc0822.txt	Updated by RFCs 1123, 1138, 1148, 1327, and 2156
MIME SPECIFICATIONS
http://www.ietf.org/rfc/rfc2045.txt	MIME part 1
http://www.ietf.org/rfc/rfc2046.txt	MIME part 2
http://www.ietf.org/rfc/rfc2047.txt	MIME part 3
http://www.ietf.org/rfc/rfc2048.txt	MIME part 4
http://www.ietf.org/rfc/rfc2049.txt	MIME part 5
http://www.ietf.org/rfc/rfc2392.txt	cid and mid URLs
http://www.ietf.org/rfc/rfc2387.txt	Multipart/related
http://www.ietf.org/rfc/rfc2557.txt	MHTML:HTML mail
http://www.ietf.org/rfc/rfc2646.txt	Text/plain
http://www.ietf.org/rfc/rfc1867.txt	Multipart/form-data
LISTS OF MIME TYPES
http://www.isi.edu/in-notes/iana/assignments/media-types/media-types	IANA registry
http://www.iangraham.org/books/xhtml2/appb/mimetypes.html
MIME TEST PAGE (Example Data Files)
http://www-dsed.llnl.gov/documents/WWWtest.html