Post Semantic Web

Semantic Web Oligopsonies

Wednesday, January 12th, 2005

Google’s director of search quality bashes manual ontologies, with much justification.

However, his attempt to paint successful ontologies into a tiny niche doesn’t exactly work:

The best place where ontologies will work is when you have an oligarchy of consumers who can force the providers to play the game. Something like the auto parts industry, where the auto manufacturers can get together and say, “Everybody who wants to sell to us do this.” They can do that because there’s only a couple of them. In other industries, if there’s one major player, then they don’t want to play the game because they don’t want everybody else to catch up. And if there’s too many minor players, then it’s hard for them to get together.

Aren’t search engines and browsers in a sense oligopolistic (actually oligopsonistic) consumers of web content? There are only a few of each that matter anyway.

Barring interest from Google, Microsoft, Mozilla, and Yahoo!, there may still be oligopsonists (or monopsonists who want their standards adopted) in niches that can drive metadata adoption within those niches.

[Via Andrew Newman.]

ccPublisher 1.0

Monday, December 27th, 2004

Nathan Yergler just cut ccPublisher 1.0, a Windows/Mac/Linux desktop app that helps you license, tag, and distribute your audio and video works. I’m very biased, but I think it’d be a pretty neat little application even if it weren’t Creative Commons-centric.

  • It’s written in Python with a wxPython UI, but is distributed as a native Windows installer or Mac disk image with no dependencies. Install it and run it like any other program on your platform, with no implementation leakage. Drag’n’drop works.
  • Also invisible to the end user: it uses the Internet Archive’s XML contribution interface, FTP, and CC’s nascent web services.
  • RDF metadata is generated (hidden from the user if publishing at IA, or available for self-publishing) and ties into CC’s search and P2P strategies; a minimal sketch of this kind of generation follows this list.
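
Here’s a minimal sketch, not ccPublisher’s actual code, of what generating that license RDF might look like. The Work/license shape follows the CC RDF vocabulary; the work URL is a placeholder, and real code would XML-escape its inputs.

# Minimal sketch (assumed shape, not ccPublisher code) of generating
# CC license RDF for a work, using the cc: Work/license vocabulary.
WORK_RDF = """<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="%(work)s">
    <license rdf:resource="%(license)s"/>
  </Work>
</rdf:RDF>"""

def work_rdf(work_url, license_url):
    # Real code would XML-escape these values before substitution.
    return WORK_RDF % {"work": work_url, "license": license_url}

print(work_rdf("http://example.org/song.mp3",
               "http://creativecommons.org/licenses/by/1.0/"))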

Python and friends did most of the work, but the 90/10-10/90 rule applied (making a cross platform app work well still isn’t trivial, integration is always messy, and anything involving ID3v2 sucks). Props to Nathan.

Version 2 will be much slicker, support more media types, and be extensible for use by other data repositories.

Addendum 2005-01-12: Check out Nathan’s 1.0 post mortem and 2.0 preview.

Search 2005

Thursday, December 23rd, 2004

Many of John “Searchblog” Battelle’s predictions for 2005 seem like near certainties, e.g., a fractious year for the blogosphere and trouble for those who expect major revenues from blogging.

Two trends that Battelle’s predictions missed, and that I hope 2005 proves:

Metadata-enhanced search. It will be ad hoc and pragmatic, pulling useful bits from private sources and from people following officious Semantic Web and lowercase semantic web practices.

Proliferation of niche web-scale search engines. Anyone can be a small-scale Google, crawling the portions of the web they care about and offering search options specific to a niche. The requisite hardware and bandwidth are supercheap, and the Nutch open source search engine makes implementation straightforward.
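
As a sketch of how low the barrier is (illustrative Python, not Nutch; the URLs are placeholders and the regex crudely assumes attribute order): crawl a few seed pages, keep only those declaring a CC license via rel="license", and build a toy inverted index.

import re
import urllib.request

SEEDS = ["http://example.org/a", "http://example.org/b"]  # placeholders
# Crude: assumes rel appears before href within the same tag.
LICENSE_LINK = re.compile(r'rel="license"[^>]*href="([^"]+)"')

index = {}  # word -> set of (url, license URI) pairs

for url in SEEDS:
    try:
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    except OSError:
        continue
    match = LICENSE_LINK.search(html)
    if not match:
        continue  # no license claim found; skip the page
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    for word in set(text.lower().split()):
        index.setdefault(word, set()).add((url, match.group(1)))

print(index.get("music", set()))  # pages mentioning "music"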

The Creative Commons search engine is a harbinger of both trends.

Battelle’s look ahead spans the web, not just web search. Possibly the biggest trend missing from his list is the rise of weblications. Egads, I have to learn DHTML, and it isn’t 1997!

A few of my near certainties: lots of desktop search innovation, very slow progress on making multimedia work with the web and usable security, open source slogs toward world domination, and most things get cheaper and more efficient.

MusicBrainz Discovery (II)

Friday, October 15th, 2004

Continuation of MusicBrainz Discovery (I).

One notable thing about MusicBrainz is that Rob Kaye and a small number of core developers and supporters have pursued a consistent vision for roughly six years with very little funding or even understanding outside this small group. It isn’t easy to really “get” MusicBrainz (I think it took me two years), though I think that at some point in the next few years everyone will “get” MusicBrainz more or less all at once.

If you’re a geek it’s hard not to get hung up on MusicBrainz’s use of acoustic fingerprint-based technology. Acoustic fingerprinting is fragile in three ways: it is subject to false positives and false negatives, there is no open source implementation of the concept, and the technology MusicBrainz uses, Relatable TRM, is proprietary and requires a centralized server. Indeed, many of the technology questions at Tuesday’s music metadata panel concerned acoustic fingerprints.

It is important to understand that while MusicBrainz uses acoustic fingerprints, it does not rely on them. TRM matching is just one mechanism for track identification. File metadata included in the file (e.g., ID3 tags) or carried with it (the filename) can be, and I believe is, used to match existing records, as can track duration and file hashes (to see if Bitzi or a P2P network has any metadata for the file in question). Additionally, file identification is only one component of MusicBrainz.
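
To make that layering concrete, here’s a hypothetical sketch of identification falling back through those signals. The lookup_* callables stand in for queries against a metadata service and are assumptions, not real MusicBrainz APIs; the ID3v1 parsing does follow the real format (a fixed 128-byte block at the end of the file).

import hashlib
import os

def read_id3v1(path):
    """Return (title, artist) from an ID3v1 tag, or None."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < 128:
            return None
        f.seek(-128, os.SEEK_END)
        tag = f.read(128)
    if tag[:3] != b"TAG":
        return None
    title = tag[3:33].rstrip(b"\x00 ").decode("latin-1")
    artist = tag[33:63].rstrip(b"\x00 ").decode("latin-1")
    return title, artist

def identify(path, lookup_by_metadata, lookup_by_hash):
    # 1. Embedded metadata (ID3 tags).
    id3 = read_id3v1(path)
    if id3:
        hit = lookup_by_metadata(title=id3[0], artist=id3[1])
        if hit:
            return hit
    # 2. Filenames often encode "Artist - Title".
    stem = os.path.splitext(os.path.basename(path))[0]
    if " - " in stem:
        artist, title = stem.split(" - ", 1)
        hit = lookup_by_metadata(title=title, artist=artist)
        if hit:
            return hit
    # 3. A file hash, e.g. to ask Bitzi or a P2P network about the file.
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    return lookup_by_hash(digest)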

If you’re not a geek, you won’t notice acoustic fingerprints: you wouldn’t know to look, and you’re not likely to get that far anyway. So what the heck does MusicBrainz do? Here’s an attempt:

  • MusicBrainz can organize your music collection. Download the tagger.
  • MusicBrainz uniquely identifies artists, albums, and songs, facilitating rich and precise music applications, all on a level playing field.
    • Not-at-all-speculative potential: include a MusicBrainz song identifier in a blog post; cover art (with your Amazon referrer, of course) automagically appears in the post; a blog aggregator publishes top-n lists and personalized recommendations.
    • Another: publish a playlist of MusicBrainz identifiers and others can recreate the experience so defined, with no file transfer involved (see the sketch after this list).
    • There are several others, some that could be offered by MusicBrainz itself, outlined in MusicBrainz tomorrow. I have to quote one because it’s fun:

      Music Genealogy: MusicBrainz may keep track of which artists/performers/engineers contributed to a piece of music, and when these contributions took place. Combining this contribution data with data on how artists influenced each other will create a genealogy of modern music. Imagine being able to track Britney Spears back to Beethoven!

  • The MusicBrainz database, created by the community, will remain free, unlike others.
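
The playlist idea is almost embarrassingly simple in code. A sketch, with invented MBIDs and paths: a playlist is just an ordered list of identifiers, and anyone holding the same tracks, however obtained, can recreate it locally.

# Hypothetical: a playlist as a list of MusicBrainz track IDs (MBIDs).
playlist = [
    "c6f3ab0a-0000-0000-0000-000000000001",  # invented MBID
    "c6f3ab0a-0000-0000-0000-000000000002",
]

# Local collection keyed by MBID, e.g. as resolved by the tagger.
collection = {
    "c6f3ab0a-0000-0000-0000-000000000001": "/music/track_a.mp3",
}

for mbid in playlist:
    print(collection.get(mbid, "missing: %s" % mbid))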

Having been around for a while, MusicBrainz has run into many of the technical and social problems inherent in music metadata and an evolving community website, and has produced much good documentation on solutions, realized and potential.

By the way, as of Wednesday MusicBrainz has a blog.

MusicBrainz Discovery (I)

Wednesday, October 13th, 2004

Earlier this evening I gave a brief introduction (slides PDF) to MusicBrainz at SDForum’s Emerging Technology SIG meeting on music metadata in the stead of MusicBrainz founder and leader Rob Kaye, who couldn’t make it up to Palo Alto. (I’m fairly familiar with MusicBrainz, having worked with Rob at Bitzi and getting updates when we cross paths in this small world.)

If I could pick a theme for the meeting (which included two other very interesting speakers — Stephen Bronstein of the Independent Online Distribution Alliance and David Marks of Loomia), and for recent months in general, it would be that in case you haven’t noticed, it’s clearly now a discovery problem, not a delivery problem.

SIG leader William Grosso led off with some quotes from the much-discussed Wired magazine article The Long Tail, which seems to have captured this zeitgeist. (Grosso also had a novel-to-me presentation technique: a slideshow of potentially relevant slides plays while he speaks, and if a slide happens to be relevant to the current sentence, he uses it to augment the point. Is there a name for this?)

Obviously there was tremendous interest in Creative Commons in this context, and several people seemed to be happy to learn of CC’s search engine and the great services and products offered by the Internet Archive (free hosting for CC-licensed audio and video, built in format conversion), Magnatune (all CC-licensed music label) and more.

Unfortunately, in the eleven years I’ve been in the SF bay area I definitively recall attending only two previous SDForum events: a 1994 talk by Atari Jaguar developers in San Jose and, in 2001, an evening with Phil Zimmermann in San Francisco (I suspect others who were there would deem the “an evening with” cliche appropriate in this case). This evening’s meeting was a total geekfest. I hung around for well over an hour commiserating on all manner of software development topics (I think that’s what “SD” stands for) with a number of hardcore geeks (no whatever-Dilbert’s-boss’s-name-is there) while two guys were laughing their asses off whiteboarding issues with Unicode encoding (as far as I could tell). I’ll have to go back.

More about what I’ve learned about MusicBrainz over the years and in preparing for the evening in a future post.

Update: part 2

Creative Commons Search, useful to me

Tuesday, March 2nd, 2004

Yesterday on the Creative Commons weblog:

Today we announce a search engine prototype exclusively for finding Creative Commons licensed and public domain works on the web.

Indexing only pages with valid Creative Commons metadata allows the search engine to display a visual indicator of the conditions under which works may be used, as well as offer the option to limit results to works available under licenses allowing for derivatives or commercial use.

This prototype partially addresses one of our tech challenges. It still needs lots of work. If you’re an interested developer you can obtain the code and submit bugs via the cctools project at SourceForge. The code is GNU GPL licensed and builds in part upon Nathan Yergler’s ccRdf library.

We also have an outstanding challenge to commercial search engines to build support for Creative Commons-enhanced searches.

And it hasn’t melted down yet.

Ben Adida wrote most of the code that needed to be written in Python (not much — PostgreSQL with tsearch2 full-text indexing does all of the heavy lifting). Former government employee Justin Palmer wrote an earlier prototype in AOLserver/Tcl, also using PG/tsearch2. (Turns out we needed the flexibility of running under Apache. I’ll miss AOLserver/Tcl when I last touch it, but I’ll also be glad to be rid of it.) I did a PHP hack job on the front end, and Matt Haughey made it look good (for end users, not code readers) in a matter of minutes.
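
For the curious, the heavy lifting might look roughly like the following sketch, with a hypothetical schema (a pages table with a tsearch2 fti column and an allows_derivatives flag) and psycopg2 standing in for whatever database glue the prototype actually uses:

import psycopg2

conn = psycopg2.connect("dbname=ccsearch")  # hypothetical database
cur = conn.cursor()
cur.execute(
    """SELECT url, license_uri
         FROM pages
        WHERE fti @@ to_tsquery(%s)  -- tsearch2 full-text match
          AND allows_derivatives     -- hypothetical license-condition flag
        LIMIT 20""",
    ("photo & sunset",))
for url, license_uri in cur.fetchall():
    print(url, license_uri)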

Although this implementation sucks in every way possible, it is already a valuable tool for finding CC-licensed and public domain content: stuff you can reuse with permission already granted. Neeru Paharia was the visionary here, seeing that it would be valuable even if it sucked in every way technically.

Stephen Downes is exactly right about the long term goal:

Of course, this is only a step – such a search engine would not be useful for many purposes; copyright information needs, in the long run, to define a search field or a type of search, not a whole search engine.

With great justification, major search engines have ignored pure metadata for a long time, at least five years. Pure metadata, with no visibility, is nearly universally ill-maintained or fraudulent. I hope that this Creative Commons prototype inspires some people at major search engines to think again about metadata, but I think semantic HTML is what will finally prove useful to such folks, in no small part because it isn’t pure metadata. I’ll post on incremental semantic search engine features in the near future.

Real world 5emantic 3eb

Saturday, February 28th, 2004

Tantek Čelik comments on Creative Commons’ use of rel="license", citing my small-s semantic web and CC post to cc-metadata (reproduced below).

I agree with Tantek’s comments, though I wouldn’t advise removing the admittedly ugly and potentially redundant RDF-in-HTML comments, at least not until mozCC and ccRdf (and consequently some dependent code) go case insensitive.

There are currently at least two Creative Commons metadata cases where a simple rel="license" attribute won’t do and RDF is required, for example making license (or any other metadata) statements about resources other than the enclosing page.

In my view, metadata-enabled web tools will do well to include an RDF model layer, whether the statements be gleaned from semantic [X]HTML, parsed from human language, or mainlined from some RDF encoding, and whatever the tool’s internal knowledge representation. Content creators will do well to produce the simplest, most utilitarian metadata possible.

I’m turning a bit sour on the phrase “lowercase semantic web”. I like semantic [X]HTML. I like RDF. All in the service of our near-term goals. All in the service of the Semantic Web, which will surely be a superset of the RDF web. I dig real world semantics in any case.

As I mentioned
<http://lists.ibiblio.org/pipermail/cc-metadata/2003-December/000243.html>
I find "semantic HTML" very interesting — it keeps the metadata close
to the presentation, militating against "metacrap" and can be used to
populate the big-S Semantic Web through RDF generation.

Since then the RDF-in-XHTML proposal that builds on semantic HTML has
moved ahead and generalized, see <http://www.w3.org/2004/01/rdxh/spec>
"Gleaning Resource Descriptions from Dialects of Languages" is a pretty
good description.

Also, Kevin Marks and Tantek Celik headed up a very nice BoF at Etech
<http://wiki.oreillynet.com/etech/hosted.conf?RealWorldSemantics> in
which they discussed current small-semantic web implementations. See
that URL for some good links.

Largely, people are using the "rel" attribute of "a" elements
<http://www.w3.org/TR/xhtml1/dtds.html#dtdentry_xhtml1-strict.dtd_a>
to describe "the relationship from the current document to the URI
referred to by the element. The value of this attribute is a
space-separated list of link types."

<http://www.w3.org/TR/xhtml2/mod-attribute-collections.html#col_Hypertext>
(I can’t find the equivalent documentation for XHTML1, but rel is
supported, per the DTD above).

A neat thing on the presentation side is that CSS selectors can actually
change the document rendering based on rel attributes — making the
metadata not just close to the presentation, but part of it.

Anyway, a rel attribute on anchors removes the big problem with assuming
that a link to a license indicates that a page is available under that
license — the page could be linking to the license for any reason. <a
rel="license" href="http://creativecommons.org/licenses/by/1.0/"/> on
the other hand, is no more ambiguous than the following RDF snippet

<Work about="">
<license rdf:resource="http://creativecommons.org/licenses/by/1.0/"/>
</Work>

and can be used to generate the same.

The upshot is that I’m planning to recommend adding a rel="license"
attribute to links to CC licenses where the license applies to the
current page, have <http://creativecommons.org/license/> spit that out,
and encourage other apps to support the same.

Note that this is all entirely complementary with RDF. All apps should
continue to use/generate/support RDF, and RDF is required for making
license (or any metadata) statements about resources other than the
enclosing one.
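
To illustrate the CSS point in the message above (my sketch, not part of the original post): CSS attribute selectors can match on rel values, making the metadata literally part of the presentation. The styling choices are invented for the example.

/* ~= matches one token in a space-separated list, which is exactly
   how rel is defined. */
a[rel~="license"] {
  border-bottom: 1px dotted green;
}
a[rel~="license"]:after {
  content: " (license)";  /* CSS2 generated content */
}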

Mediachest Theory

Thursday, February 26th, 2004

Pleasant Blogger says “Bookcrossing + Orkut + Bitzi + HotOrNot = Mediachest”.

I don’t think addition is the correct operation. Perhaps we could say all of

  • Bookcrossing ∩ Mediachest ≠ ∅
  • Orkut ∩ Mediachest ≠ ∅
  • Bitzi ∩ Mediachest ≠ ∅
  • HotOrNot ∩ Mediachest ≠ ∅

or something more interesting if I actually knew set theory and notation.

Seriously though, lots of people realize that social networks can facilitate navigation, discovery, trust, filtering, communication and the like in many domains. Will people move on from sites that encourage building lists of “friends” for the sake of building such lists (and dating, I hear) to sites that use social networks to enhance other functions a la Mediachest or will the likes of Friendster and Orkut add more utility? Probably something else. Consider that

  • Orkut has only scratched the surface of what a pure social networking service could offer. There are no collaborative filtering or recommendation features for starters. I don’t think Orkut is near an 80/20 sweet spot, or wherever diminishing returns set in for a pure social networking site.
  • Sites with huge existing memberships haven’t added social networking to their offerings. Friends.yahoo.com does not exist.
  • I’m forgetting stuff, but not the decentralized path. See FOAF and XFN, atop which every value-add you can imagine (a minuscule subset of the total) will be built in the semi-near future, like by 2009.

Bitzi has had a very simple social network feature since May 2001: “interesting bitizens”. Mine (and those interested in me) are currently listed on the right side of my bitizen page. We still haven’t built any features using these relationships, apart from an ignored popularity contest. Eventually. Before 2009.

REGISTER NOW. IT’S FREE AND IT’S REQUIRED.

Thursday, February 26th, 2004

Experimenting with vote links:

Goodbye, WP, join the LAT in the infinite unread bin.
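
For the record, here’s roughly what such a link might look like in markup, assuming the VoteLinks convention of a rev attribute with values like vote-against; the URLs are my guesses at the papers in question.

Goodbye, <a rev="vote-against" href="http://www.washingtonpost.com/">WP</a>,
join the <a rev="vote-against" href="http://www.latimes.com/">LAT</a>
in the infinite unread bin.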

CC Etech BoF points

Tuesday, February 10th, 2004

Points mentioned at the Etech Creative Commons participant session (it’s a BoF!):

One Year Launch Anniversary

Watch Reticulum Rex AKA Remix Culture for an update.

License Versioning

International Commons

iCommons is porting licenses to multiple jurisdictions.

Content

New (and newly packaged) Licenses

Technology

Standards

Technology Challenges

The list

Hero Nathan Yergler, who created:

PROTOTYPE RDF-enhanced Creative Commons search