Personal Information Organizer
Last Modified: 5 June 1996
2.1 Introduction
Soon after starting to use the Web, every user encounters the same problem -- finding a way of archiving the locations of useful resources for future reference. This was recognized early in the design of Web software, so all browsers contain some mechanism for storing lists of interesting resources. Such lists are known as hotlists, or bookmarks. Originally such lists were simple flat files. More recently, browsers have supported hierarchical lists, such that items on similar topics can be grouped into a folder, a subfolder under a folder, and so on. This allows for increased flexibility in the storage of bookmarked entries.
The names or titles chosen for the entries are, by default, taken from the content of the TITLE element in the HTML document being archived. Thus to a large extent the cataloging of the entries is determined by the author of the document, and not by the archiver of the bookmark. Users can modify the titles associated with their bookmarks, but my observations show that this is only rarely done.
Folder titles must be selected by the user: in general, users choose folder titles that associate well with the folder content. Some examples from my own list are "Linux Material", "Software Libraries", "Web Server Info", "Restaurant Reviews", "Stuff to be Filed", "Food and Wine", etc.
2.2 Statement of the Problem
These methods of archiving work well, provided the lists do not grow too large or too stale in the user's mind. When the lists get very large (greater than 50 or so items), traditional retrieval problems start to occur -- the user knows that a URL was recorded, but cannot find it. In addition, the user will often add a bookmark for a resource that already exists in the bookmark collection, having forgotten where the original entry lay. Finally, the user may enter two bookmarks for the same collection, but referencing slightly different locations (e.g., one referencing the Table of Contents, the other the Introduction). To summarize, the possible problems are:
- The user cannot remember the TITLE of the desired resource.
- Some archived objects, such as images, FTP, or mail URLs, do not have a TITLE. The default is to use the URL, which is not terribly informative.
- The user cannot remember under which folder the item was stored.
- The user thinks the item was stored under one folder, but in fact it is in another (e.g. Linux HTTP server information being stored under "Linux Info", but not under "Web Server Info").
- The user has entered duplicate bookmark entries for the same resource, having forgotten about (or being unable to find) the earlier entry.
- The user has entered similar bookmark entries for the same resource -- for example, entries pointing to the Table of Contents, or Introduction, of the same collection.
- The link is no longer functional, because the original document has been deleted or moved.
- The link is no longer relevant, as the target resource has changed, and is no longer related to the original archived resource.
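The duplicate and near-duplicate problems in the list above lend themselves to a mechanical check. The following is a minimal sketch, not part of any existing bookmarking tool: the function names are hypothetical, and the "same directory means same collection" rule is only an assumed heuristic.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Comparison key: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def collection_key(url):
    """Assumed heuristic: documents in the same directory form one collection."""
    parts = urlsplit(normalize(url))
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def find_duplicates(urls):
    """Return (exact duplicate groups, groups of distinct entries in one collection)."""
    exact = defaultdict(list)
    similar = defaultdict(list)
    for url in urls:
        exact[normalize(url)].append(url)
        similar[collection_key(url)].append(url)
    exact_groups = [group for group in exact.values() if len(group) > 1]
    similar_groups = [
        group for group in similar.values()
        if len({normalize(u) for u in group}) > 1
    ]
    return exact_groups, similar_groups
```

A tool could run such a check when a bookmark is added and ask the user whether the new entry should replace, or sit beside, the existing one.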
Finally, there is a semantic problem associated with the very idea of a hierarchical bookmark list. Many entries do not belong in a single place in the hierarchy, but rather in multiple locations. Thus it would be nice to find another way of storing bookmarks that provides a better organizational model, along with a better interface for browsing or searching the bookmark collection.
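One possible alternative to a strict hierarchy is a many-to-many index, in which an entry may be filed under any number of categories. The sketch below only illustrates that idea; the class and method names are hypothetical, and the URL is a placeholder.

```python
from collections import defaultdict


class BookmarkIndex:
    """Files each URL under any number of categories, and looks up both ways."""

    def __init__(self):
        self._by_category = defaultdict(set)
        self._by_url = defaultdict(set)

    def add(self, url, *categories):
        for category in categories:
            self._by_category[category].add(url)
            self._by_url[url].add(category)

    def urls_in(self, category):
        return sorted(self._by_category.get(category, ()))

    def categories_of(self, url):
        return sorted(self._by_url.get(url, ()))


index = BookmarkIndex()
# A Linux HTTP server page belongs under both headings at once.
index.add("http://www.example.org/linux-httpd.html", "Linux Info", "Web Server Info")
```

With this layout the entry is found from either "Linux Info" or "Web Server Info", which is the case a single-parent folder tree cannot express.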
2.3 Resource Information Extraction
The first stage of analysis is to determine what information can be extracted for an object the user wishes to bookmark. At present, most bookmarking tools extract limited information, namely:
- The resource URL
- The TITLE (if an HTML document)
- The time and date the resource was bookmarked
- The time and date the resource was last visited
Some browsers, such as Netscape, also allow the user to add notes for each recorded URL. This is rarely used, as it places a heavy burden on the user to type in material at a point where this is the last thing they will want to do (on Netscape, this notes field is searchable, but searching it is not an easy process).
Additional information is also available, and perhaps should also be stored in a bookmark database:
- The type of the resource (HTML, plain text, image, mail address ....)
- The date the resource was last modified
Less easily obtained, but also useful, are
- The text document type of the resource, if a text document -- is it a contents list, mostly a list, a readable text document, or lots of graphics? This requires some parsing of the document to guess at the type
- Keywords extracted from the resource (if text)
- Other extracted material (e.g. text descriptions of image files, other indexing schemes for text documents, etc.)
- The language of the document (to allow for mixed languages)
- The character set used by the document
- URLs for document mirrors, should they exist
- Unaliased URL -- sometimes machines have several URLs, and it is good to be able to resolve the name down to its absolute address (this is hard, as the domain name may 'move' from machine to machine).
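Several of the items above (resource type, last-modified date, character set, language) can usually be read from the HTTP response headers without parsing the document itself. The following sketch assumes the headers have already been fetched into a plain dictionary; `describe_resource` is a hypothetical helper, and it relies only on the standard library's header parsing.

```python
from email.message import Message
from email.utils import parsedate_to_datetime


def describe_resource(headers):
    """Pull bookmark metadata out of a dict of HTTP response headers."""
    msg = Message()
    for name, value in headers.items():
        msg[name] = value
    last_modified = msg["Last-Modified"]
    return {
        # get_content_type() falls back to text/plain, so check for the header first.
        "mime_type": msg.get_content_type() if "Content-Type" in msg else None,
        "charset": msg.get_content_charset(),
        "language": msg["Content-Language"],
        "last_modified": parsedate_to_datetime(last_modified) if last_modified else None,
    }


info = describe_resource({
    "Content-Type": "text/html; charset=ISO-8859-1",
    "Last-Modified": "Wed, 05 Jun 1996 12:00:00 GMT",
    "Content-Language": "en",
})
```

Servers do not always send every header, so each field may come back as `None`; the harder items in the list (keywords, document type, mirrors) still need the document body or outside knowledge.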
2.4 Resource Information Archiving
Once the document and the above information are in hand, the user wants to index this information in some sort of database. This must be easy for the user to do, and the result must be easy for the user to access. Here we concentrate on the former; the latter problem is discussed in Section 2.5. Each entry in a bookmark list should have, associated with it, the following information:
- URL for the resource *
- The MIME type of the resource
- The time and date the resource was last modified
- Expiry information, if any (a server can send an Expires: header to indicate when a resource should be considered dead)
- The TITLE (if an HTML document) *
- The time and date the resource was bookmarked *
- The time and date the resource was last visited *
and, if possible
- The character set of the document (from the HTTP headers)
- The predominant language used in the document
- Equivalently aliased URLs -- aliased by domain name, for example (from DNS querying)
(The items marked by asterisks are the only ones stored by present bookmarking schemes.) These 10 items are extremely useful. The MIME type allows all bookmarked objects to be sorted by type, while the character set and language information allows sorting by these characteristics. The Expires: header is also useful, as it can be used to warn the user if they try to bookmark an item that is likely to expire, and subsequently to warn them when the item has indeed expired. Finally, the time and date that a resource was last modified is a hint to the user (or software) indicating resources that vary rarely, if at all.
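The ten fields listed in this section could be held in a single record per bookmark. The sketch below is one possible layout, not a description of any existing browser's format; the class name, field names, and the `is_expired` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class BookmarkEntry:
    # Fields already stored by present bookmarking schemes.
    url: str
    bookmarked: datetime
    title: Optional[str] = None
    last_visited: Optional[datetime] = None
    # Additional fields proposed in this section.
    mime_type: Optional[str] = None
    last_modified: Optional[datetime] = None
    expires: Optional[datetime] = None
    charset: Optional[str] = None
    language: Optional[str] = None
    aliases: List[str] = field(default_factory=list)

    def is_expired(self, now):
        """True once the server-supplied expiry time, if any, has passed."""
        return self.expires is not None and now >= self.expires


def sort_by_type(entries):
    """Group entries by MIME type; entries with no known type sort first."""
    return sorted(entries, key=lambda entry: entry.mime_type or "")
```

Keeping every field optional except the URL and the bookmarking time reflects the fact that servers often omit headers such as Expires:.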
Example (i) -- One could record, on each access to a document via the bookmark interface, the date of the access plus the last-modification date of the resource. If the last-modification dates do not vary, then you can infer (but not prove, of course) that the resource is generally stable, and unchanging.
Example (ii) -- One could design the bookmarking tool such that resources that have been bookmarked, but not explored after a fixed length of time, are "tested" to make sure the linked resource is still there. The user could then be warned of stale links, which in turn could be reworked, or culled. Often a page being moved will first be replaced by a page saying "this page has been moved" -- some sort of semi-intelligent parser could check for this condition, and use this information to warn the user that the URL is about to die.
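Both examples can be sketched in a few lines. The code below is only an illustration under stated assumptions: the function names, the three-visit minimum, the 30-day threshold, and the phrase pattern are all hypothetical choices, and the "moved" check is a crude text match that will produce false positives.

```python
import re
from datetime import datetime, timedelta

# Assumed phrases that suggest a "this page has been moved" notice.
MOVED_PATTERN = re.compile(
    r"\b(has|have|was|been)\s+moved\b|\bnew\s+(location|address|url)\b",
    re.IGNORECASE,
)


def seems_stable(observations, min_visits=3):
    """Example (i): observations is a list of (access_time, last_modified) pairs.

    The resource is guessed to be stable if every recorded access saw the
    same last-modification date. This is an inference, not a proof.
    """
    if len(observations) < min_visits:
        return False
    return len({modified for _, modified in observations}) == 1


def needs_check(last_visited, now, max_age=timedelta(days=30)):
    """Example (ii): flag a bookmark that has not been explored recently."""
    return now - last_visited > max_age


def looks_moved(page_text):
    """Example (ii): guess whether the page is a relocation notice."""
    return MOVED_PATTERN.search(page_text) is not None
```

A tool would fetch only the bookmarks for which `needs_check` is true, then warn the user about those that fail to load or for which `looks_moved` fires.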
This is similar to the SmartMarks add-on package, provided by Netscape. SmartMarks monitors bookmarked Web pages using programmable agents -- a fancy name for programs that check to see if pages have been modified, and if so prompt the user. It can also be configured to autodownload certain pages for local viewing. There is also a searching interface that will notify the user when new hits matching the search criteria appear on a search engine.
The above information is straightforwardly added to a bookmark database -- the hard part is the semantic structuring of the information. This must be done in a way that reflects the meaning the user associates with the bookmarked entry, while the interface by which entries are encoded into this index must be simple, as otherwise it will not be used.
What information is there to work with? We really have two things:
- The document text content
- User selection of some parameters
There are several ways these can be processed.
The Document Content -- Intelligent Software
- Determines the structural type of the document (resource list, text-based material, mostly graphics, FORM interface to tool). This could be based on text content, as well as information in the document head (LINK and META elements).
- Determines and extracts document keywords -- the software could look through the document and locate important keywords, and use these to index the content.
- Correlates the text with pre-defined categories or keywords -- the index may have predefined categories and/or keywords for indexing purposes, and the software could test the document against these, and choose appropriate categories.
User Selection of Parameters
- User selects arbitrary keywords and categories -- not very good, as the user is unlikely to do it, and the results are not well organized.
- User selects keywords and categories from predetermined list -- easier to do, but the user must also be able to add categories when necessary.
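The keyword extraction and category correlation steps described above can be approximated very simply. The following is a minimal sketch, assuming plain English text, word-frequency counting as the notion of "important", and a hand-written stopword list; the function names and the category trigger words are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real tool would need a fuller one.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to with".split()
)


def extract_keywords(text, limit=5):
    """Return the most frequent non-trivial words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


def suggest_categories(text, categories):
    """Rank predefined categories by how many of their trigger words appear.

    categories maps a category name to a set of lowercase trigger words.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {name: len(words & triggers) for name, triggers in categories.items()}
    matched = [name for name, score in scores.items() if score > 0]
    return sorted(matched, key=lambda name: (-scores[name], name))
```

The suggested categories could be offered to the user as a pre-ticked list, which combines the software-driven approach with selection from a predetermined list while still letting the user add a category.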
Related software:
- Netscape SmartMarks: http://home.netscape.com/home/add_ons/smrtmrks2_0_release.html
- WebCompass: http://arachnid.qdeck.com/qdeck/demosoft/webcompass_lite/
Ian S. Graham