Missing vital metadata

Author: Stephen Tyler
Posted: 6/13/2000; 6:53:31 AM
Topic: RSS 0.91
Msg #: 47 (in response to 14)
Prev/Next: 46/49
Reads: 52473

While RSS 0.91 is extremely powerful, it strikes me as missing two vital pieces of metadata:

1) Ordering method

2) Categorisation

Ordering method
---------------


RSS defines a list of items, or more specifically an ordered sequence. But what is the ordering criterion?

Weblogs and news are ordered by time. Most current RSS channels fall into this category.

Top 10 lists are ordered by a popularity measure. Some examples might be "Letterman's top 10 reasons for ...", "Top selling CDs", "most popular pages". There is a sprinkling of these channels.

Other lists are ordered by degree of match. For example the results of a search might be presented in this manner.

To allow the encoding of this data, I propose the following:

<ordering>time</ordering> (other values: none, top, match)

A simple example. I gather several RSS streams about computer books. Using this new <ordering> element, I can automatically distinguish "top books" from "new books". I can merge multiple "new books" streams together, removing duplicates. On the other hand, I can merge "top books" streams together, weighting elements by duplication and order within each stream.
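To make those two merge strategies concrete, here is a minimal Python sketch. It assumes feeds carrying the proposed <ordering> element (no existing client supports it); the scoring for "top" feeds is one plausible weighting, not a defined standard:

```python
import xml.etree.ElementTree as ET

def parse_items(rss_xml):
    """Return (ordering, [links]) from an RSS document string.
    Falls back to 'time' when the proposed <ordering> element is absent."""
    channel = ET.fromstring(rss_xml).find("channel")
    ordering = channel.findtext("ordering", default="time")
    links = [item.findtext("link") for item in channel.iter("item")]
    return ordering, links

def merge(feeds):
    """Merge several parsed (ordering, links) feeds according to their
    declared ordering."""
    orderings = {o for o, _ in feeds}
    assert len(orderings) == 1, "cannot merge feeds with mixed orderings"
    if orderings == {"time"}:
        # "new books" case: concatenate, dropping duplicate links
        seen, merged = set(), []
        for _, links in feeds:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    merged.append(link)
        return merged
    # "top books" case: score each link by inverse rank, summed across
    # feeds, so items that appear high up in several lists float to the top
    scores = {}
    for _, links in feeds:
        for rank, link in enumerate(links):
            scores[link] = scores.get(link, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A client could then dispatch on the parsed ordering value instead of guessing from the channel's title or content.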

Categorisation
--------------


Content aggregators need to be able to categorise their content, or risk providing extremely long lists of channels (like userland's :-( ). How is a new user meant to select from a list of 2500 channels, presented as a flat list?

Unfortunately categorisation is EXTREMELY hard to do across a broad range of subjects, in a way that suits most people.

Rather than define the one true categorisation schema and taxonomy, I think we should permit the channel author some flexibility, but still allow content aggregators some real meat to kickstart their categorisation.

I propose the following new element, by way of an example for an RSS channel associated with a book on encryption software:

<category>
  <method>yahoo.com</method>
  <value>Computers_and_Internet/Internet/World_Wide_Web/Security_and_Encryption</value>
  <value>Business_and_Economy/Shopping_and_Services/Books/Booksellers/Computers/Internet/Titles/World_Wide_Web</value>
</category>

<category>
  <method>dmoz.org</method>
  <value>Computers/Security/Products_and_Tools/Cryptography/</value>
  <value>Business/Industries/Publishing/Publishers/Nonfiction/Computers</value>
</category>

Notes:

1) You can have multiple <value> items in each category.

2) You can have multiple <category> items.

3) Users can define their own methods. Yahoo and DMOZ are recommended, with DMOZ the more strongly recommended of the two.

4) The <value> string is a list of "/"-separated values, from broadest to most specific.

Now the pedantic among you will probably disagree with me as to the ideal place in Yahoo and DMOZ to categorise this content. But this is missing the point. The point is that armed with the above data, the job of classifying this RSS document in any new category tree is made vastly simpler.

Even if my category tree does not align precisely with Yahoo or DMOZ, there is going to be some overlap. And the <value> string contains some good keywords, which I can disambiguate using WordNet or similar to automatically align with my own arbitrary hierarchy.
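A minimal sketch of the keyword step in Python. It only splits the proposed <value> paths into words and computes a crude overlap score against one node of an aggregator's own taxonomy; the WordNet disambiguation mentioned above is deliberately left out:

```python
import re

def path_keywords(value):
    """Split a '/'-separated <value> path into lowercase keywords,
    broadest segment first; underscores separate words within a segment."""
    words = []
    for segment in value.strip("/").split("/"):
        for word in re.split(r"[_\s]+", segment):
            if word:
                words.append(word.lower())
    return words

def match_score(value, my_node_keywords):
    """Fraction of my own node's keywords that appear in the <value>
    path -- a crude stand-in for a real alignment step."""
    kws = set(path_keywords(value))
    return len(kws & set(my_node_keywords)) / float(len(my_node_keywords))
```

An aggregator could run every incoming <value> against each node of its own tree and file the channel under the best-scoring node above some threshold.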

For aggregation portals targeting narrow niches, it is a simple job to find relevant RSS channels using a hand-compiled list of relevant paths on Yahoo and DMOZ.

The presence of these new items should not upset existing RSS clients (I hope they have been coded to ignore unknown elements).

Perhaps it would be clearer with an explanation of what I am trying to do with RSS, and where I have been struggling to apply it.

I publish a vertical search portal. http://www.growinglifestyle.com/h/garden/index.html

Currently it covers 2 topics (more are coming), one of which is gardening.

I scrape the top gardening web sites for articles (and only the articles), and assemble them into a categorised hierarchy. So the user can browse the hierarchy (like Yahoo), or do a full-text search (like Altavista). But no matter which way they look, they will only get quality articles on gardening.

What I have just done is add an RSS file at each node (well a few thousand nodes anyway) on this hierarchy. For example, there is an RSS file for "gardening", for "plants", for "bulbs" and for "tulips" (progressively narrower topics). Each of these RSS files is a weblog, displaying a time ordered sequence of articles being added to the tree. I'm adding about 1000 articles a week at the top level, so as you go down the tree the RSS files get quieter until the final nodes may only get 1 article added per month or two.

Why have I created so many RSS files? Well, not everybody is interested in everything. You in effect customise the RSS feed that suits your needs. If all you are interested in are "Dahlias", then that is all you will get. And publishing it in RSS makes re-purposing the content so much easier.

Actually, I am even thinking of adding an RSS file for every possible search phrase. In this case, the RSS file would be ordered by rank rather than time. Would you subscribe to such an RSS channel? Probably not as it would not change so often, but you might want to fetch it on demand. For example, a shopping site might want to display articles about each of their products. They could create a unique url containing the keywords and phrases, grab the rss file corresponding to that search, and display it using an RSS-reading content module.
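The on-demand fetch might look something like this in Python. The /rss/search path and the q parameter are invented for this sketch — they are not a real endpoint on growinglifestyle.com or anywhere else:

```python
from urllib.parse import quote_plus

def search_feed_url(site, phrase):
    """Build a hypothetical per-search-phrase RSS URL of the kind
    described above. The '/rss/search?q=' scheme is an assumption made
    for this sketch, not an actual interface."""
    return "%s/rss/search?q=%s" % (site.rstrip("/"), quote_plus(phrase))
```

A shopping site would construct one such URL per product, fetch it (hourly or on demand), and hand the result to its RSS-reading content module.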

So what is the problem?

Well how can I let other sites know the RSS channels I offer?

It is not really a good idea to add several thousand narrow channels to userland, and then have userland hammer my site every hour.

I could (and am) creating an OCS description, but this does not describe the categorisation or even the hierarchy. And I run the risk of having aggregators blindly add every single available RSS channel, and fetch them all hourly.

I could (and am) entering some of the more generally useful channels into xmltree and userland. But unless people actually wander all over my site, they will not be aware of the RSS customisation possibilities available.

Another problem with content syndication by RSS is as follows. I am adding around 1000 items per week, with around 1 update event per week. Ideally it would be better to release the articles in real-time, but alas it is computationally (and mentally) much less burdensome to do my processing in batches.

RSS has an implied length of circa 15 items. I know this is not fixed, but an RSS file with 1000 items is definitely considered unfriendly (My.Netscape requests file sizes below 8kB). The problem is that 990 of these new items will never make it onto my 10 element RSS file. Thus 990 items will miss out on the opportunities for content syndication and repurposing that RSS allows.

I am still thinking about the best way to solve this last problem. Some possibilities:

a) Trickle feed the RSS file. Instead of instantly acknowledging the 1000 new articles, the RSS generator could be spoon-fed a steady dribble of articles (say 6 per hour = 1000/week). Clients reading the RSS file every single hour will get a chance to see all the new articles.

b) Track the IP address of clients reading the RSS file. Feed each IP address all the articles added since the last time they read the file, with some upper limit. So after a few big gulps of new articles the RSS file will settle down to a list of 10.

Neither of these approaches strikes me as being particularly clean.

I hope this stimulates some discussion about the application of RSS to search engines, instead of just the traditional areas of blogs and news feeds.

Steve




Last update: Tuesday, June 13, 2000 at 8:52:31 AM Pacific.
