Ugly metadata deployed

Peter Saint-André, a good person for preferring the public domain and much else, writes about Creative Commons metadata:

It’d be cool if smart search engines could automagically find web pages that are offered under one of the Creative Commons licenses.

I agree, which is why we (I work for Creative Commons, though I do not speak for them in this publication) built a prototype in early 2004 and a more robust beta based on Nutch later that year. March this year brought Yahoo! Search for Creative Commons, very recently also added to Yahoo! Advanced Search. I predict more and better for CC and other potentially metadata-enhanced searches.

For reasons unknown to mere mortals like me, CC recommends placing some RDF in an HTML comment as the proper way to “tag” a web page (Uche explains more here). Well, gosh, who thought that up? Are we not in possession of fine XHTML metadata technologies like the <meta/> tag?

Aaron Swartz thought it up, for good reasons. You can find a brief explanation, written by Aaron I believe, here (linked at the Wayback Machine for reference, as the current documentation may change). However, this doesn’t capture the most important reason, which I’ve had the pleasure of explaining a gazillion times, e.g., here

A separate RDF file is a nonstarter for CC. After selecting a license a user gets a block of HTML to put in their web page. That block happens to include RDF (unfortunately embedded in comments). Users don’t have to know or think about metadata. If we need to explain to them that you need to create a separate file, link to it in the head of the document, and by the way the separate file needs to contain an explicit URI in rdf:about … forget about it.

and here

Requiring metadata placed in the HEAD of an HTML page will dramatically decrease metadata adoption. The only reason so much CC metadata is out there now is that including it is a zero-cost operation. When the user selects a license and copies&pastes the HTML with a license statement and button into their web page, they get the embedded RDF without having to know anything about it. Getting people to take extra steps to include or produce metadata is very hard, perhaps futile. I tend to believe that good metadata must either be a side effect of some other process (e.g., selecting a license) or a collaborative effort by an interested community (e.g., Amazon book reviews, Bitzi, DMoz, MusicBrainz) (leaving out the case of $$$ for knowledge workers).

in reply to people who want CC metadata included with web documents in various fashions. On that, see my recent reply to someone else suggesting the same method Peter proposes:

There are zillions of options for sticking metadata into an [X]HTML document. If you must, use whatever you prefer. It is my concern to encourage dominant uses so that software can reliably find metadata. IMO there are now three fairly widely deployed schemes for CC licenses, not all mutually exclusive:

1. Embed RDF in HTML comment
2. rel="license" attribute on <a href="license-uri">
3. <link> to an external RDF file

#1 is our legacy format, the default produced by the licensing engine, and very widely deployed.
#2 is also now produced by the licensing engine, has the support of the small-s semantic web/semantic XHTML people, and will eventually be RDF-compatible via GRDDL.
#3 is used by other RDF apps and is the only non-controversial means of including RDF with an XHTML document. Wikipedia publishes CC-compatible metadata using this method.
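To make the three schemes concrete, here is a rough sketch of how a crawler might look for each of them in a single page. It is not how any CC or Yahoo! tool actually works; the class and function names are made up for illustration, it uses only Python's standard html.parser, and it assumes that scheme #3 links are marked with type="application/rdf+xml". It only detects the metadata and does not parse the RDF itself.

```python
from html.parser import HTMLParser


class LicenseMetadataFinder(HTMLParser):
    """Hypothetical helper: collects evidence of the three schemes in one page."""

    def __init__(self):
        super().__init__()
        self.rdf_comments = []  # scheme 1: RDF embedded in an HTML comment
        self.rel_license = []   # scheme 2: rel="license" on an <a> element
        self.rdf_links = []     # scheme 3: <link> to an external RDF file

    def handle_comment(self, data):
        # Crude check: treat any comment containing an rdf:RDF element as scheme 1.
        if "<rdf:RDF" in data:
            self.rdf_comments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # rel is a space-separated list of tokens; a valueless attribute is None.
        rels = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if tag == "a" and "license" in rels and href:
            self.rel_license.append(href)
        # Assumption: external RDF is advertised via its MIME type.
        elif tag == "link" and href and attrs.get("type") == "application/rdf+xml":
            self.rdf_links.append(href)


def find_license_metadata(html):
    """Return what was found for each scheme, keyed by scheme."""
    finder = LicenseMetadataFinder()
    finder.feed(html)
    finder.close()
    return {
        "rdf_comments": finder.rdf_comments,
        "rel_license": finder.rel_license,
        "rdf_links": finder.rdf_links,
    }
```

Because the schemes are not mutually exclusive, the function reports all three rather than stopping at the first hit; a real tool would then have to decide which source to trust when they disagree.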

In the future we’ll probably add a fourth, which will replace #1 and #2 in the licensing engine’s output once it is baked into a W3C standard; that work is ongoing.

Yes, RDF embedded in HTML comments is a horribly ugly hack. Eventually it’ll be superseded. In the meantime, massive deployment wins. Sorry.

One Response

  1. […] Ugly metadata deployed. See Metadata is technical debt. Deployment as an ugly hack makes it an even more obvious no-brainer no-go. Also, the uselessness of license-filtered crawl-based search. […]
