[Updated: 22-Sep-2004 14:33 with link to list of implementations]
The other day I wrote that we really should be adopting RFC3229 "Delta Encoding in HTTP" in order to reduce the amount of bandwidth, etc. that is wasted in serving RSS and Atom files. I'm fairly convinced that if the folk at Microsoft had been using what I propose here, they would not have been forced to take the drastic measures that they did when they did.
Of course, use of RFC3229 would only have delayed, not eliminated, the day when the current practice of polling for updates to RSS files became excessively expensive for Microsoft. The real solution to the bandwidth problem is to move from polling to a push-based solution. But, at least by implementing RFC3229, we can take the polling solution just about as far as it can be taken in terms of efficiency. This is a good intermediate step on the way to the push-based solutions that we won't have much choice but to implement as the audience for RSS and Atom data grows.
This post is intended to provide additional detail on what I'm proposing. It is my intention to create an Internet Draft describing the ideas here once I've had reasonable time to receive comments from folk and work out the inevitable bugs. Please feel free to comment on what is below:
Feeds aren't like HTML
Atom and RSS files are members of a distinct class of files that we call "feeds." Conceptually, a "feed" is a potentially infinitely long and growing series of items or entries. In order to reduce the cost of distributing new entries, RSS and Atom feeds are typically implemented as "sliding window" feeds. Such feeds don't contain every entry that has ever been published. Rather, they only contain some number of the most recent changes to the feed.
Common practice today is for feed providers to establish a fixed "window size" which defines the maximum number of entries contained in any instance of the feed file. Once that maximum has been reached, every subsequent retrieval of the feed receives that number of entries, even if only a smaller number of entries has been inserted into the feed since the last time a specific client retrieved it. The result is a great deal of wasted bandwidth and processing resources.
In order to allow the number of entries returned in a feed to be no more than the number of new or modified entries inserted into the feed since the last time a specific client retrieved it, I propose that we rely on RFC3229 "Delta Encoding in HTTP" with a new instance-manipulation method defined to provide feed-specific delta encoding.
The "feed" instance-manipulation method
The "feed" instance-manipulation method is an abstract method for which concrete forms can be defined for use with various content types. In this document, I'll be speaking primarily about the use of the "feed" IM method as appropriate for Atom files.
Unlike the IM methods currently registered for use with RFC3229, the "feed" IM method is not byte-oriented. Rather, it is item or entry oriented, in that deltas are computed not in bytes but in whole items or entries. The definition of those items or entries depends on the underlying content type. For instance, if the content type is RSS, the delta unit is an "item". If the content type is Atom, the delta unit is an "entry". If the content type is "log file", the delta unit is a "line".
When the "feed" IM method is applied to an instance, the result should conform to the syntactic requirements of the instance's type. Thus, if the instance is Atom formatted, the result of applying the "feed" IM method would be Atom-conformant. This implies that the result would have atom:entry elements wrapped in an atom:feed element that contains an atom:head element.
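To make the entry-oriented delta concrete, here is a minimal sketch in Python. It is not part of the proposal itself: the `Entry` and `Feed` classes, their field names, and the use of a "modified" marker to detect changed entries are all assumptions made for illustration. The point it shows is that the delta is a whole feed of the same shape as the instance, carrying only new or modified entries.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    # Hypothetical minimal entry: an identifier plus a "modified" marker.
    entry_id: str
    modified: str
    title: str


@dataclass
class Feed:
    # Hypothetical stand-in for atom:feed; `head` stands in for atom:head.
    head: dict
    entries: list = field(default_factory=list)


def feed_delta(old: Feed, new: Feed) -> Feed:
    """Return a feed shaped like `new` that holds only the entries
    which are new or modified relative to `old`."""
    known = {e.entry_id: e.modified for e in old.entries}
    changed = [e for e in new.entries if known.get(e.entry_id) != e.modified]
    # The head is always included so the result is a complete feed document.
    return Feed(head=dict(new.head), entries=changed)


old = Feed({"title": "Example"}, [Entry("b", "t1", "B"), Entry("a", "t0", "A")])
new = Feed({"title": "Example"},
           [Entry("c", "t3", "C"), Entry("b", "t2", "B (edited)"), Entry("a", "t0", "A")])
delta = feed_delta(old, new)
print([e.entry_id for e in delta.entries])
```

Because the delta is itself a well-formed feed, a client that knows nothing about the "feed" IM method could still parse it as an ordinary (short) feed.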
The detailed rules for applying "feed" instance-manipulation for various types should be easily derived from what is said above.
Because applying the "feed" IM method to an instance yields a result of the same type as the instance, an interesting opportunity arises. Byte-oriented IM methods must never be used unless specifically requested by a client, since not every client may support the result or have the history needed to interpret it. This requirement need not exist for the feed IM method. Thus, servers that implement this method are free to apply it by default even if it is not requested.
feed: A worked example
The following shows what an RFC3229-compliant request to a server might look like:
GET /atom.xml HTTP/1.1
If-None-Match: "321"
A-IM: feed, gzip
- The client wants to obtain the current value of /atom.xml
- It has previously received an instance whose entity tag is "321"
- It is willing to accept delta-encoded updates using the "feed" IM method. (Note: It is not strictly necessary for the client to request the "feed" IM method in all cases since some servers may actually apply this method by default. Nonetheless, it is good form to request it since some servers may not use the method unless it is requested.)
- It is willing to accept responses that have been compressed using "gzip," whether or not these responses have been delta-encoded.
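A client might assemble such a request roughly as follows. This is only a sketch using Python's standard `urllib.request`; the function name `build_feed_request` and the example URL are hypothetical, and the request is built but not sent.

```python
import urllib.request


def build_feed_request(url, etag=None):
    """Build (but do not send) a GET request that offers the "feed"
    and "gzip" instance manipulations via the A-IM header."""
    headers = {"A-IM": "feed, gzip"}
    if etag is not None:
        # Entity tags are sent as quoted strings.
        headers["If-None-Match"] = '"%s"' % etag
    return urllib.request.Request(url, headers=headers)


req = build_feed_request("http://example.org/atom.xml", etag="321")
# urllib normalizes header-name capitalization; HTTP header names are
# case-insensitive, so this does not change the meaning of the request.
for name, value in req.header_items():
    print(name + ": " + value)
```

On a first fetch the client has no entity tag yet, so `etag` is left as `None` and only the `A-IM` header is sent.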
If, when this request is received, the server's current entity tag for the resource is still "321," then the server should simply return a 304 (not modified) response, as would a traditional server.
If the entity tag has changed, the server could compute the delta between the entity whose entity tag was "321" and the current instance. If the server no longer knows what the "321" entity tag corresponds to, it would probably send the entire feed.
If the client requests delta-encoding but the server doesn't support this form of instance manipulation, the server will simply ignore this aspect of the request.
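The three outcomes described above (304, a delta, or the full feed) can be sketched as a single decision function. Everything here is an assumption for illustration: the function name, the idea of keeping a `history` map from older entity tags to the entries they described, and the representation of entries as an id-to-marker dictionary. This sketch only applies the method when the client asks for it.

```python
def choose_response(if_none_match, a_im, current_etag, current_entries, history):
    """Return (status, headers, entry_ids) for a feed request.

    current_entries: dict of entry id -> last-modified marker (newest first)
    history: dict of older entity tag -> the entries dict it described
    """
    if if_none_match == current_etag:
        # Nothing changed: behave exactly like a traditional server.
        return 304, {"ETag": current_etag}, []

    methods = {m.strip().lower() for m in (a_im or "").split(",")}
    previous = history.get(if_none_match)
    if "feed" in methods and previous is not None:
        # Only entries that are new or modified since the client's instance.
        delta = [eid for eid, mod in current_entries.items()
                 if previous.get(eid) != mod]
        headers = {"ETag": current_etag, "IM": "feed",
                   "Cache-Control": "no-store, im"}
        return 226, headers, delta

    # Unknown entity tag, or no delta requested: send the entire feed.
    return 200, {"ETag": current_etag}, list(current_entries)


current = {"e3": "t3", "e2": "t2b", "e1": "t1"}
history = {'"321"': {"e2": "t2", "e1": "t1"}}
print(choose_response('"321"', "feed, gzip", '"400"', current, history))
```

A server that chose to apply the method by default would simply drop the check on `methods`.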
If the server responds with a delta encoded response, it would look something like this:
HTTP/1.1 226 IM Used
IM: feed, gzip
Date: Mon, 13 Sep 2004 18:30:05 GMT
Cache-Control: no-store, im
- The response status is 226 IM Used -- a success code.
- The entity tag given is that of the new state of the resource.
- The response carries an "IM" response-header field, indicating which delta encoding is used in the response.
- The Cache-Control "no-store" directive is used to ensure that caches that do not understand delta-encoding do not cache this response. However, the "im" directive allows a cache that does understand instance-manipulation to ignore the "no-store" directive, which would otherwise be mandatory.
- The message-body is first delta-encoded using the feed IM method appropriate for the type of feed and is then gzipped.
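On the receiving side, a client has to undo the manipulations in the reverse of the order in which they were applied. A minimal sketch, assuming the IM header lists methods in the order applied (as in the example response above) and using a hypothetical function name `decode_body`:

```python
import gzip


def decode_body(status, im_header, body):
    """Undo instance manipulations in reverse order of application.

    Returns (decoded_bytes, is_delta). `is_delta` tells the caller whether
    the decoded feed holds only new or changed entries (to be merged into
    its local copy) or is a complete instance.
    """
    methods = []
    if status == 226:
        methods = [m.strip().lower() for m in im_header.split(",") if m.strip()]

    is_delta = False
    for method in reversed(methods):
        if method == "gzip":
            body = gzip.decompress(body)
        elif method == "feed":
            # The result is already a well-formed feed; nothing to unpack.
            is_delta = True
        else:
            raise ValueError("unsupported instance manipulation: " + method)
    return body, is_delta


wire = gzip.compress(b"<feed><entry/></feed>")
print(decode_body(226, "feed, gzip", wire))
```

A plain 200 response passes through untouched, so the same code path handles servers that ignore the `A-IM` header.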
f-range: A feed-oriented Range
Just as it is appropriate to define a feed-specific delta IM method, it is appropriate to provide a feed-specific IM method for range selection. RFC3229 currently only supports byte-oriented range selection.
The f-range IM method uses the content type's concept of item or entry as its unit of selection. Thus, "F-range: entries=1-20" would specify that the client wants to receive a maximum of 20 items, starting at the "first" or most recent item in the feed. The specification "F-range: entries=20-" would indicate that all items, starting at the 20th counted from the most recent, should be returned in the result. All item offsets should be computed based on the state of the feed associated with the entity tag passed in the request, or with the If-F-Range if provided. Thus, it is possible for resource-limited clients to "chunk" their way through a large number of available items in a fast-moving feed.
When responding to an F-range request, the response should contain the entity tag associated with the feed at the time of the response and the cache control statements should be set to prevent caching.
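The range selection itself is simple to sketch. The code below assumes that entries are held most recent first and that offsets are 1-based and inclusive, which is one reading of the examples above; the function name `select_entries` is hypothetical.

```python
import re


def select_entries(entries, f_range):
    """Apply an F-range value such as "entries=1-20" or "entries=20-"
    to a list of entries ordered most recent first."""
    match = re.fullmatch(r"entries=(\d+)-(\d*)", f_range.strip())
    if match is None:
        raise ValueError("unrecognized F-range value: " + f_range)
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else None
    if first < 1 or (last is not None and last < first):
        raise ValueError("invalid F-range bounds: " + f_range)
    # Offsets are 1-based and inclusive; an open end means "to the last entry".
    return entries[first - 1:last]


entries = ["entry-%d" % n for n in range(1, 31)]  # entry-1 is the most recent
print(len(select_entries(entries, "entries=1-20")))
print(select_entries(entries, "entries=29-"))
```

A resource-limited client could walk a large feed in chunks by requesting `entries=1-20`, then `entries=21-40`, and so on, against the same entity tag.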
f-range: A worked example
GET /atom.xml HTTP/1.1
If-None-Match: "321"
F-range: entries=1-20
A-IM: feed, f-range, gzip
- This request asks first for all entries added since the entity tag "321"
- The set of items in the response is limited to the most recent 20
- The response should be gzipped.
HTTP/1.1 226 IM Used
IM: feed, f-range, gzip
Date: Mon, 13 Sep 2004 18:30:05 GMT
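The order of the three manipulations in this exchange matters: delta first, then range, then compression. A rough server-side sketch follows. The function name, the `known_ids` set standing in for "what the client's entity tag described", and the bare `<feed>` wrapper are all assumptions for illustration (a real Atom result would also carry the head element).

```python
import gzip


def apply_manipulations(entries, known_ids, limit):
    """Apply feed, f-range, and gzip in that order.

    entries: list of (entry_id, xml_text) pairs, most recent first
    known_ids: ids present in the instance the client already holds
    limit: upper bound from an "entries=1-N" range
    """
    delta = [(eid, xml) for eid, xml in entries if eid not in known_ids]  # feed
    window = delta[:limit]                                                # f-range
    document = "<feed>" + "".join(xml for _, xml in window) + "</feed>"
    body = gzip.compress(document.encode("utf-8"))                        # gzip
    return body, "feed, f-range, gzip"


entries = [("e%d" % n, "<entry>%d</entry>" % n) for n in range(30, 0, -1)]
body, im = apply_manipulations(entries, known_ids={"e1", "e2", "e3"}, limit=20)
print(im, gzip.decompress(body).count(b"<entry>"))
```

Here 27 entries are new to the client, but the range caps the response at the 20 most recent of them.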
Benefits of the approach
Implementing and deploying the "feed" IM method will provide the same general benefits as are provided by the various byte-oriented IM methods of RFC3229. These benefits are:
- A reduction in the mean size of HTTP responses, thereby improving latency and network utilization. Early implementations of RFC3229+feed have reported actual numbers showing these savings.
- Avoidance of any extra network round trips
- Minimization of per-request and per-response overheads.
- Support for a variety of encoding algorithms and formats.
- Interoperation with HTTP/1.0 and HTTP/1.1.
- Fully optional for clients, proxies, and servers.
- Moderately simple implementations are possible.