A Syndication Project

Packaging Data Content

This URL: http://www.iangraham.org/projects/news/issues-3.html

Created: 28 August, 2000
Last Update: 2 October, 2000

Author(s): Ian Graham

Data Element and Content

In this model, we assume that the data content of a syndicationMessage is inside a data element. However, the model also assumes that the software reading a syndicated data message can successfully process it without any knowledge of the content inside data. This means that the data element attributes, or the metadata content of the parent syndicationMessage element, must provide sufficient information about data that the processor can determine how to handle that content (e.g., ignore it, unpack/decode it, or direct it to another application for processing). For this to be possible, the consumer must know at least the following things about the data content:

  1. Its MIME type
  2. The compression algorithm (if any) used to compress the data prior to including it in the message
  3. The encoding algorithm (if any) used to encode the data (compressed or not) before including it as content
  4. If the content is XML, the primary namespace associated with the content.
  5. If the data content is unencoded, uncompressed XML, zero or more namespace declarations defining namespace prefixes used by content inside data.

Note that item 4 is not the same thing as an XML namespace declaration. It is instead a type identifier for subtypes of XML data. This is needed because the content of the element may be compressed or encoded XML data that the syndication system knows nothing about, other than that it is consistent with a specific namespace-defined XML dialect. The syndication system simply uses this namespace identifier to determine where the decoded, decompressed content of the data element should be sent.
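
The handling described above can be sketched in a few lines of Python. This is only an illustration, not part of the model: the attribute names (content-type, content-encoding, content-escaping, namespace) are taken from the examples later on this page, while the function names, the handlers table, and the assumption that the message charset is UTF-8 are hypothetical.

```python
import base64
import gzip


def unpack_data(text, attrs):
    """Undo the escaping and compression named in the data element's attributes."""
    escaping = attrs.get("content-escaping")
    compression = attrs.get("content-encoding")

    if escaping == "base64":
        # b64decode ignores the line breaks inside the element content
        raw = base64.b64decode(text)
    elif escaping in (None, "cdata"):
        # Assumption: the overall message charset is UTF-8
        raw = text.encode("utf-8")
    else:
        raise ValueError("unknown content-escaping: %r" % escaping)

    if compression == "gzip":
        raw = gzip.decompress(raw)
    elif compression is not None:
        raise ValueError("unknown content-encoding: %r" % compression)
    return raw


def route(attrs, handlers):
    """Choose a handler by primary namespace (item 4), else by MIME type (item 1)."""
    mime = attrs.get("content-type", "").split(";")[0].strip()
    key = attrs.get("namespace") or mime
    # None means no handler is registered, so the content is ignored
    return handlers.get(key)
```

Note that neither function looks at the content itself: every decision is made from the attributes alone, which is the point of the model.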

What about 'real' XML inside data?

Well, that too is possible. In that case, however, the processing tool assembling the message would need to ensure that markup inside data is appropriately namespace-qualified, and would need to add appropriate namespace prefix declarations to the data element. Note that the primary namespace identifier (4) is still useful, however, since it lets the application know the primary content of the element.

Note that the application assembling the message would need to ensure that the charset for the data content is the same as that of the overall syndicated data message. This may mean doing a charset-to-charset transformation of the original data content.
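
A charset-to-charset transformation of this kind might look as follows in Python. This is a minimal sketch: the function name and the choice of UTF-8 as the message charset are assumptions made for illustration.

```python
def transcode(raw, source_charset, message_charset="utf-8"):
    """Re-encode raw bytes from their original charset into the message charset."""
    # Decoding raises UnicodeDecodeError if raw is not valid in source_charset
    return raw.decode(source_charset).encode(message_charset)
```

For example, transcode(latin1_bytes, "iso-8859-1") would return the same text as UTF-8 bytes, ready to be placed inside data.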

Other Text Content (HTML, SGML or raw text) Inside data?

Clearly such data can be compressed/encoded and stuffed inside data, but this may not always be the desired choice. An alternative would be to place the content inside a CDATA section, and thereby 'escape' it from the XML processor. Of course, if this is done, the inserted content would need to be preprocessed to 'escape' any occurrence of the string ]]>, which would otherwise prematurely end the CDATA section.
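
One common way to do this preprocessing is to split each ]]> across two adjacent CDATA sections, so that the sequence never appears whole inside either one. The Python sketch below shows the idea; the function name is hypothetical, and other escaping schemes are possible.

```python
def wrap_cdata(text):
    """Wrap text in a CDATA section, splitting any ']]>' it contains."""
    # ']]>' becomes ']]' + end of section + new section + '>'
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"
```

An XML parser reading the result concatenates the adjacent sections, so the consumer sees the original text unchanged.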

Once again, the application assembling the message would need to ensure that the charset used in the data content is the same as that used in the overall syndicated data message. This may mean doing a charset-to-charset transformation of the original data to make this so.

Markup Model

It consequently makes sense to define the following attributes for the data element:

Data Element Content Examples

Here are some simple examples illustrating how this would work in the real world.

1. Contains unescaped PostScript data

<data content-type="application/postscript">
%!PS-Adobe-2.0
%%BeginProlog
%%BeginResource ShowcaseResource
1 setlinejoin
....
</data>

2. Contains HTML content

<data content-type="text/html;version=4"
   content-encoding="gzip"
   content-escaping="base64">
PlGODlhGAGgAPEAAP/////ZRaCgoAZZDCH+PUNvcHlyaWdodCAoQykgMTk5q
AQDVK32yilpdjladlsfg1116ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
and so on -- BASE64 encoded gzipped data.......
</data> 
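
Content such as that in example 2 could be produced by a sketch like the following: compress first, then BASE64-encode the compressed bytes. The function names, the 60-character line width, and the UTF-8 default are assumptions chosen for illustration, not part of the model.

```python
import base64
import gzip


def pack_payload(raw, width=60):
    """Gzip raw bytes, BASE64-encode the result, and break it into short lines."""
    compressed = gzip.compress(raw)
    encoded = base64.b64encode(compressed).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


def html_data_element(html, charset="utf-8"):
    """Build a data element carrying gzipped, BASE64-encoded HTML."""
    body = pack_payload(html.encode(charset))
    return (
        '<data content-type="text/html;version=4"\n'
        '   content-encoding="gzip"\n'
        '   content-escaping="base64">\n'
        + body + "\n</data>"
    )
```

The BASE64 alphabet contains no characters that are special in XML, so the packed body needs no further escaping.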

3. Contains 'raw' HTML content

<data content-type="text/html;version=4"
   content-escaping="cdata">
<![CDATA[
   <div align="right">
     <img src="image.gif" title="Hi Mommy!" >
     <p>This is a paragraph of useless text .... 
     <p>Here is another </p>
   </div> ]]>
</data> 

4. Contains XML data

<data content-type="text/xml"
      namespace="http://www.heml.org/schemas/heml1.0"
      xmlns:h="http://www.heml.org/schemas/heml1.0"
      xmlns:dc="dublin core URL">
  <h:event id="Att1915">
    <h:label xml:lang="en">Attack on Gallipoli</h:label>
    <h:date h:calendar="Gregorian" h:era="AD" xml:lang="en">
      <h:year>1915</h:year>
      <h:month>08</h:month>
      <h:day>9</h:day>
    </h:date>
    <h:location>Gallipoli Peninsula</h:location>
    <h:origin xml:link="http://www.lib.byu.edu/~rdh/wwi/1915/gallpoli.html"/>
  </h:event>
</data> 
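
A consumer could read content like example 4 with any namespace-aware XML parser. The Python sketch below uses a shortened copy of the example and assumes that the h prefix is bound to the same URI as the namespace attribute; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of example 4 (assumption: h is bound to the primary namespace)
message = """<data content-type="text/xml"
      namespace="http://www.heml.org/schemas/heml1.0"
      xmlns:h="http://www.heml.org/schemas/heml1.0">
  <h:event id="Att1915">
    <h:label xml:lang="en">Attack on Gallipoli</h:label>
    <h:date><h:year>1915</h:year></h:date>
  </h:event>
</data>"""

data = ET.fromstring(message)

# The primary namespace identifier tells the consumer which dialect this is
primary = data.get("namespace")

# Look elements up by namespace URI, not by the prefix the producer chose
ns = {"h": primary}
event = data.find("h:event", ns)
label = event.find("h:label", ns).text
year = event.find("h:date/h:year", ns).text
```

Because the lookup is by URI, the consumer still finds the elements if the producer binds the same namespace to a different prefix.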

5. Contains Encoded XML Data

<data content-type="text/xml"
     namespace="http://www.heml.org/schemas/heml1.0"
     content-encoding="gzip"
     content-escaping="base64">
PlGODlhGAGgAPEAAP/////ZRaCgoAZZDCH+PUNvcHlyaWdodCAoQykgMTk5q
AQDVK32yilpdjladlsfg1116ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
and so on -- BASE64 encoded gzipped data, in this case XML..
</data>