Microsoft's recent troubles with RSS files have focused the RSS/Atom communities squarely on fundamental problems with feed syndication as practiced today. While some of us have been warning about this problem for quite a while, it has taken Microsoft's recent embarrassment to convince folk that the problems are real -- not just theoretical. The fundamental problem is that polling, the mechanism used by today's news aggregators, is simply not scalable enough to take us where we need to go -- to a world where reading feeds is just as common as reading HTML pages. Eventually, we're going to have to move to a push-based architecture -- such as what is implemented by PubSub.com. However, in the meantime, there are things that we can do to give the old polling model a bit more life and save providers of popular feeds, like Microsoft, quite a bit of bandwidth expense. In the process, we'll be facilitating the continued growth of feed syndication rather than seeing it die off or be crippled just as it is finally gaining popularity.
RSS and Atom are examples of a special kind of network resource -- the content feed. Conceptually, a "feed" is a potentially infinitely long and growing series of items or entries inserted into the feed in sequential order. In many ways, you can think of a feed in much the same way you would an event log to which new entries (lines) are appended from time to time. But, just as event logs can grow very large, RSS and Atom feeds can grow very large. The result is, of course, that copying the entire feed whenever you want to see if something new has been added is simply not practical. This was recognized early in the history of RSS and Atom.
In order to reduce the cost of distributing new entries, RSS and Atom feeds are typically implemented as "sliding window" feeds. Such feeds don't contain every entry that has ever been published. Rather, they only contain some number of the most recent changes to the feed. You might find, for instance, that even though a feed has had thousands of entries inserted into it over the course of time, if you read the feed file from the network, you'll only see the most recent 10, 20, or 100 entries. The number of entries is what defines the "window" that slides constantly at the "front" of the feed. The size of the window is critical to determining how much bandwidth will be used when people monitor the feed. The larger the window, the more bandwidth will be used in copying versions of the feed since more data will be copied every time the feed is copied. Of course, the smaller the window is, the more likely it is that people will miss items if they don't check the feed frequently enough -- old items can slide out of the window...
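To make the trade-off concrete, here is a minimal sketch of a sliding-window feed and a reader that polls it too rarely. The window size of 3 and the helper names publish and poll are assumptions chosen for illustration; they do not come from any real feed server.

```python
from collections import deque

WINDOW_SIZE = 3  # hypothetical; real feeds tend to use 10, 20, or 100

def publish(window, entry_id):
    """Append a new entry; the oldest entry slides out once the window is full."""
    window.append(entry_id)

def poll(window, seen):
    """Return the entries currently in the window that this reader has not seen."""
    new = [e for e in window if e not in seen]
    seen.update(new)
    return new

window = deque(maxlen=WINDOW_SIZE)
seen = set()

for entry_id in (1, 2):
    publish(window, entry_id)
first = poll(window, seen)    # the reader picks up entries 1 and 2

for entry_id in (3, 4, 5, 6, 7):
    publish(window, entry_id)
second = poll(window, seen)   # by now the window holds only 5, 6, 7

# Entries that slid out of the window between the two polls.
missed = [e for e in range(1, 8) if e not in seen]
print(first, second, missed)
```

In this run the reader never sees entries 3 and 4. A larger window would have kept them available, but every full copy of the feed would then carry more bytes.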
Implementing RSS and Atom as sliding-window feeds has been very important in making these formats useful in feed syndication. However, the sliding-window alone isn't enough to reduce bandwidth utilization to an acceptable level. Because feed readers don't like to miss content that slides out of the window and because people like to see content while it is still fresh, RSS and Atom readers tend to poll or check feeds fairly frequently -- usually about once an hour -- to determine if they have new content. This frequent polling can, of course, present quite a challenge to feed serving sites since if not properly handled, they could be forced to retransmit copies of unchanged feeds over and over again to the same clients who then simply discard the data if it has not changed. The result would be a massive amount of wasted bandwidth and expense. So, "responsible" news aggregators use HTTP's "conditional-get" methods that allow a client to ask a server to only send a full response if the feed has been modified since the last time it was fetched. Thus, while a client may ask for updates every hour, the server would only send a copy of the new sliding-window if there has been new content added in the last hour. If no new content is added, the server responds with a very small message saying "No new content." The result is a massive savings in network bandwidth -- at the cost of some minor increase in the complexity of clients, which must now remember when they last fetched a feed, and servers that must keep track of when updates were made.
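The server side of a conditional get can be sketched roughly as follows. This is a simplified illustration and not any particular server's implementation: the function names make_etag and handle_get are hypothetical, the validator is derived from a hash of the feed body (one common approach, assumed here), and a real server would also handle "If-Modified-Since", lists of tags, and weak validators.

```python
import hashlib

def make_etag(feed_body: bytes) -> str:
    """Derive a validator from the feed content (an assumption for this sketch)."""
    return '"' + hashlib.sha256(feed_body).hexdigest()[:16] + '"'

def handle_get(request_headers: dict, feed_body: bytes):
    """Return (status, response_headers, body) for a request for the feed."""
    etag = make_etag(feed_body)
    if request_headers.get("If-None-Match") == etag:
        # Nothing has changed since the client's last fetch:
        # answer "304 Not Modified" with an empty body.
        return 304, {"ETag": etag}, b""
    # First fetch, or the feed changed: send the whole sliding window.
    return 200, {"ETag": etag}, feed_body

feed = b"<rss><item>first entry</item></rss>"

# First request: the client has no validator yet and gets the full feed.
status1, headers1, body1 = handle_get({}, feed)

# An hour later, nothing has changed: the client replays the ETag it was given.
status2, headers2, body2 = handle_get({"If-None-Match": headers1["ETag"]}, feed)

# Later still, a new entry has been published, so the old ETag no longer matches.
updated = b"<rss><item>second entry</item><item>first entry</item></rss>"
status3, headers3, body3 = handle_get({"If-None-Match": headers1["ETag"]}, updated)

print(status1, status2, status3)
```

The second response carries no body at all, which is where the savings come from. The third response resends the entire window, including the entry the client already had.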
While HTTP's "conditional-get" method allows a great savings of bandwidth, there is still much more that can be done. For instance, if someone is maintaining a feed that is only updated during "working hours," yet clients are asking for updates every hour of the day, then it is clear that if "working hours" are only eight hours of the day, normally each and every request made during the remaining 16 hours of the day is going to result in a response of "No new content." Similarly, if a feed is typically only updated once a day, then it doesn't make sense for people to ask for updates once an hour rather than once a day. In order to eliminate too frequent or badly timed requests, RSS provides a mechanism to define a "TTL" or "Time to Live" which defines how long a client should wait before asking for new information as well as "Skip Hours" and "Skip Days" that tell clients those times during which requests will almost inevitably result in no updates. Experience to date has shown that TTL and "Skip Hours/Days" have been useful in limiting unnecessary requests for some sites; however, not all sites can benefit from these mechanisms since they publish throughout the day and at irregular frequencies. For instance, since Microsoft updates its RSS feed on average once every five minutes and at all hours of the day, these mechanisms won't be useful to them.
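A client that honors these hints might decide whether to poll along the following lines. This is only a sketch: the function should_poll and its parameters are hypothetical names, and it assumes the RSS 2.0 conventions that TTL is given in minutes, skip hours are numbered 0 to 23 in GMT, and skip days are English day names.

```python
from datetime import datetime, timedelta, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def should_poll(now, last_fetch, ttl_minutes=None, skip_hours=(), skip_days=()):
    """Decide whether a polite client should request the feed at `now` (GMT)."""
    if now.hour in skip_hours:
        return False                      # publisher says nothing changes this hour
    if DAY_NAMES[now.weekday()] in skip_days:
        return False                      # publisher says nothing changes today
    if ttl_minutes is not None and last_fetch is not None:
        if now - last_fetch < timedelta(minutes=ttl_minutes):
            return False                  # the previous copy is still "live"
    return True

# A feed that is only updated between 09:00 and 17:00 GMT on weekdays.
OFF_HOURS = set(range(0, 9)) | set(range(17, 24))
WEEKEND = {"Saturday", "Sunday"}
UTC = timezone.utc

night = should_poll(datetime(2004, 9, 10, 3, 0, tzinfo=UTC), None,
                    ttl_minutes=60, skip_hours=OFF_HOURS, skip_days=WEEKEND)
too_soon = should_poll(datetime(2004, 9, 10, 10, 0, tzinfo=UTC),
                       datetime(2004, 9, 10, 9, 30, tzinfo=UTC),
                       ttl_minutes=60, skip_hours=OFF_HOURS, skip_days=WEEKEND)
due = should_poll(datetime(2004, 9, 10, 10, 45, tzinfo=UTC),
                  datetime(2004, 9, 10, 9, 30, tzinfo=UTC),
                  ttl_minutes=60, skip_hours=OFF_HOURS, skip_days=WEEKEND)
weekend = should_poll(datetime(2004, 9, 11, 10, 0, tzinfo=UTC), None,
                      ttl_minutes=60, skip_hours=OFF_HOURS, skip_days=WEEKEND)
print(night, too_soon, due, weekend)
```

Only the third request would actually go out over the network. For a feed that publishes around the clock, all three hints would be empty and the function would always answer True, which is exactly why these mechanisms don't help such a site.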
Whatever mechanisms are used to limit the number of entries in the sliding window, or limit the number of times that a file is transferred or the frequency and timing of update requests, we can still drastically reduce the number of bytes sent over the network by using compression schemes like gzip, etc. rather than sending raw uncompressed text. In fact, "responsible" news aggregators all support gzip and add an "Accept-Encoding" header to their HTTP requests that says they are willing and able to accept compressed content.
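The effect of compression is easy to try with Python's standard gzip module. The feed below is made-up, deliberately repetitive sample markup, so the ratio it shows is better than a real feed would achieve; the point is only that XML feed text compresses well and that the client gets the identical bytes back.

```python
import gzip

# Made-up sample feed: 20 identical entries (an assumption for illustration).
entry = ("<item><title>Example entry</title>"
         "<description>Some repetitive feed markup and text.</description></item>")
feed = ("<rss><channel>" + entry * 20 + "</channel></rss>").encode("utf-8")

# What a client would send to say it can handle a compressed response.
request_headers = {"Accept-Encoding": "gzip"}

compressed = gzip.compress(feed)        # what the server would put on the wire
restored = gzip.decompress(compressed)  # what the client recovers

print(len(feed), len(compressed))
```

Compression is lossless, so restored is byte-for-byte equal to feed, while far fewer bytes cross the network.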
But, as Microsoft has found, even all the bandwidth limiting mechanisms described above aren't enough to reduce the bandwidth requirements as much as they can or need to be reduced -- and certainly not to the same level of efficiency that can be obtained from a push-based PubSub system. However, the methods above are pretty much all that feed servers do today... While moving to a push-based solution would make a great deal of sense, there is still more that can be done to improve the efficiency of the existing polling or pull-based system.
The biggest remaining opportunity to reduce the amount of data sent in a feed is to reduce the size of the sliding-window. You can do this in two ways. First, you can truncate full entries to abbreviated entries as Microsoft did. (i.e. they went from publishing full entries to only publishing the first 100 words of each entry.) However, readers will often complain that this drastic reduction in quality of service is too painful to consider. Second, we could tailor the size of the sliding window to each reader.
Today, the window size is the same for all readers. Thus, if a feed publisher decides on a window-size of 20 entries, or entries which are less than 24 hours old, then every reader will get all 20 entries or every entry less than 24 hours old even if only one of the entries is something that they haven't seen before. While this is an improvement over sending the full feed (potentially containing thousands of entries many years old), we can do better by having clients tell the server when they last picked up new data and then having the server only send entries that have been added or modified since the last entry picked up by the client. Although it appears that no existing feed servers implement it, RFC3229 defines an IETF standard for "Delta encoding in HTTP" that provides a well-understood and well-defined means to provide precisely the custom, client-specific window size that we need to reduce bandwidth requirements to the minimum possible with a polling solution.
RFC3229 extends the "conditional-get" support in HTTP (discussed above) by having servers send only the changes or deltas in a feed when an appropriately formatted request is received. Basically, whenever a client picks up some new data, it also receives an "ETag" that gives it a "name" or "tag" that identifies the state of the feed at the time that it was read. Then, the next time the client asks for data, it includes the "ETag" in the request and the server will only send back the modifications to the feed that were made since the "ETag" was issued. Thus, instead of everyone receiving precisely the same number of items (and some clients receiving duplicate data) all clients only receive data that they have not yet seen. Effectively, window size has been optimized to the most efficient size for each client. You can't get much better than that with a polling solution....
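On the wire, an RFC3229 exchange looks much like a conditional get with a few extra headers: the client adds "A-IM" to list the instance manipulations it accepts, and a server that sends a delta answers with status 226 ("IM Used") plus "IM" and "Delta-Base" headers. The client-side sketch below only builds and interprets headers; the function names are hypothetical, and "vcdiff" and "diffe" are two of the delta formats named in the RFC, chosen here purely as examples.

```python
def build_request_headers(last_etag, manipulations=("vcdiff", "diffe")):
    """Headers for a delta-capable poll. Without an ETag this is a plain GET."""
    headers = {"Accept-Encoding": "gzip"}
    if last_etag is not None:
        headers["If-None-Match"] = last_etag        # the state we already hold
        headers["A-IM"] = ", ".join(manipulations)  # delta formats we can apply
    return headers

def interpret_response(status, headers):
    """Classify a response so the caller knows what the body contains."""
    if status == 304:
        return "unchanged"   # nothing new; keep the cached copy
    if status == 226:
        # "IM" names the manipulation used; "Delta-Base" names the state
        # the delta must be applied to.
        return "delta:" + headers.get("IM", "")
    if status == 200:
        return "full"        # the server ignored A-IM or could not build a delta
    raise ValueError("unexpected status %d" % status)

first_poll = build_request_headers(None)
later_poll = build_request_headers('"abc123"')

kind_delta = interpret_response(226, {"IM": "vcdiff", "Delta-Base": '"abc123"'})
kind_full = interpret_response(200, {})
kind_same = interpret_response(304, {})
print(later_poll, kind_delta, kind_full, kind_same)
```

Because a server that doesn't understand "A-IM" simply answers with 200 or 304 as before, a client written this way keeps working against servers that have not adopted delta encoding.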
As a nice side effect, use of RFC3229 will actually reduce at least one of the reasons people currently poll as frequently as they do... As mentioned above, one of the problems with a fixed window-size is that things inevitably "fall out of the window." Thus, clients who want to see everything will typically try to poll frequently enough so that they don't miss things that fall out of the window. Feed publishers, on the other hand, try to limit the window size so that they don't waste bandwidth by repeatedly sending duplicate entries to everyone. As window size is reduced and things fall out more quickly, people tend to poll more frequently... It is a vicious cycle... However, RFC3229 use would allow clients to poll less frequently yet still be confident that they weren't missing content. For instance, if you only read blogs in the morning or in the afternoon -- once a day, you could poll Microsoft just once a day and be confident that you were receiving all the entries since you last polled them. (Note: Microsoft would undoubtedly still enforce a fixed window size on those clients who don't use RFC3229 delta encoding!)
RFC3229 can be improved upon when used with feeds like RSS and Atom. I'll be writing about this in the future. The basic weakness with RFC3229 is that all the registered "instance-manipulation" methods for it are byte-oriented; however, an item- or entry-oriented method would be more appropriate and easier to implement for use with feeds. I'll be proposing a new instance-manipulation method, probably called "feed-delta" which will address this issue. But, even as defined today, RFC3229 use would provide very significant efficiency improvements over current practice. Fortunately, we can get this improvement by using mostly existing standards. We just need to use what we've got...
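To give a feel for what an entry-oriented method could look like, here is a toy server that answers with only the entries a client has not yet seen. Everything in it is an assumption made for illustration: the "feed-delta" name is the tentative one mentioned above and is not a registered instance manipulation, the class FeedServer is hypothetical, and using the count of published entries as the ETag is just the simplest scheme that makes the sketch work.

```python
class FeedServer:
    """Toy server: the ETag is the number of entries published so far."""

    def __init__(self, window_size=3):
        self.entries = []
        self.window_size = window_size   # fixed window for clients without deltas

    def publish(self, text):
        self.entries.append(text)

    def etag(self):
        return '"%d"' % len(self.entries)

    def get(self, headers):
        """Return (status, response_headers, entries) for a poll."""
        base = headers.get("If-None-Match")
        wants_delta = "feed-delta" in headers.get("A-IM", "")
        if base == self.etag():
            return 304, {"ETag": self.etag()}, []
        if base is not None and wants_delta:
            try:
                seen = int(base.strip('"'))
            except ValueError:
                seen = None              # not one of our tags; fall through
            if seen is not None and 0 <= seen <= len(self.entries):
                # Send exactly the entries published since the client's tag.
                reply = {"ETag": self.etag(), "IM": "feed-delta", "Delta-Base": base}
                return 226, reply, self.entries[seen:]
        # Legacy behavior: the same fixed sliding window for everyone.
        return 200, {"ETag": self.etag()}, self.entries[-self.window_size:]

server = FeedServer(window_size=3)
for name in ("e1", "e2", "e3", "e4", "e5"):
    server.publish(name)

frequent = server.get({"If-None-Match": '"4"', "A-IM": "feed-delta"})
daily = server.get({"If-None-Match": '"1"', "A-IM": "feed-delta"})
legacy = server.get({})
print(frequent[2], daily[2], legacy[2])
```

The frequent poller receives a single entry instead of the whole window, the once-a-day reader receives four entries even though the fixed window is only three, and the client that doesn't ask for deltas still gets the ordinary window.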
Once we've added RFC3229 support to the existing collection of methods used by feed servers and aggregators, we will have gotten just about as far as we can go in improving the efficiency of polling-based aggregation. At that point, we'll be sending the smallest possible amount of data to each reader (because of RFC3229 delta encoding) and we'll be sending it in an efficient format (because of compression). Further advancements in efficiency will only come from improving on the encoding efficiency (for instance by using Binary XML formats like the ASN.1 based "fast-infoset") or by introducing push-based components into the architecture. (These will include distributed checking systems, like the one Shrook uses to reduce unnecessary polling, and pushing updates to clients as PubSub.com does.)
Microsoft should be thanked for having focused the community of developers on this issue and for having provided unambiguous evidence that for even the largest of companies, the inefficiencies of the current system are, or will become, unacceptable. Fortunately, feed syndication is still "young" and we still know of many ways to improve on existing practices. It is going to be fun to see how all this develops over the next few years.
bob wyman
Would it be possible to get this benefit without any changes to existing clients or requiring RFC 3229? One such idea is explored here:
http://www.intertwingly.net/blog/2004/09/11/Vary-ETag
Posted by: Sam Ruby | September 11, 2004 at 23:55
"Further advancements in efficiency will only come from improving on the encoding efficiency or by introducing push-based components into the architecture."
You are forgetting one other efficiency mechanism: outsourcing the individual entries into their own retrievable resources, leaving only pointers in the feed itself. This is cache friendly, can be used in conjunction with conditional get and RFC3229 deltas, and has the added feature of providing efficiency even to those web hosts that only allow static serving (i.e. no cgi).
Posted by: eric scheid | September 12, 2004 at 02:42
Sorry Bob, I only saw this after responding to your list post. As a long-term solution I do think this makes sense.
Regarding item-oriented deltas, it might be worth having a look around the RDF literature - this is very close to a batch of requirements there. I'm sure work will have been done on statement-oriented or resource-oriented sync (like URIQA and CBDs - http://swdev.nokia.com/uriqa/URIQA.html), and given the interchange format is XML there may even be some stuff implemented.
Posted by: Danny | September 12, 2004 at 05:37