Post Semantic Web

Darkfox

Tuesday, December 27th, 2005

I hate to write about software that could be vaporware, but AllPeers (via Asa Dotzler) looks like a seriously interesting darknet/media sharing/BitTorrent (and more) Firefox extension.

It's sad, but simply sending a file between computers with no shared authority or intermediary (e.g., a web or FTP server) is still a hassle. IM transfers often fail in my experience, traditional filesharing programs are too heavyweight and are configured to connect to and share with any available host, and previous attempts at clients (e.g., ) were not production quality. Merely solving this problem would make AllPeers very cool.

Assuming AllPeers proves to be a useful mechanism for sharing media, perhaps it could also become a lightnet bridge, as a Firefox extension.

Do check out AllPeers CTO Matthew Gertner's musings on the AllPeers blog. I don't agree with everything he writes, but his is a very well-informed and well-written take on open source, open content, browser development and business models.

Songbird Media Player looks to be another compelling application built on the Mozilla platform (though run as a separate program rather than as a Firefox extension), to be released real soon now. 2006 should be another banner year for Firefox and Mozilla technology generally.

Lucas Gonze’s original lightnet post is now near the top of results for ‘lightnet’ on Google, Yahoo!, and MSN, and related followups fill up much of the next few dozen results, having displaced most of the new age and lighting sites that use the same term.

Annotating Wikipedia

Saturday, September 3rd, 2005

The Semantic MediaWiki proposal looks really promising.

Anyone who knows how to edit articles should find the syntax simple and usable:

Berlin is the capital of [[is capital of::Federal Republic of Germany|Germany]].

Berlin has about [[Population:=3.390.444|3.4 Mio]] inhabitants.
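Annotations like these become machine-readable statements about the article's subject. A rough sketch of the sort of RDF they might export (the property URIs and serialization here are illustrative, not necessarily what Semantic MediaWiki would actually emit):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:wiki="http://en.wikipedia.org/wiki/Property:">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Berlin">
    <wiki:is_capital_of rdf:resource="http://en.wikipedia.org/wiki/Germany"/>
    <wiki:Population>3390444</wiki:Population>
  </rdf:Description>
</rdf:RDF>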

All that fantastic data, unlocked. (I've been meaning to write a post on why explicit metadata is democratic.) Wikipedia database dump downloads will skyrocket.

There are also interesting proposals under Wikidata (though big forms make me uneasy), but those mostly seem more applicable to new data-centric projects, while the Semantic MediaWiki proposal looks just right for the encyclopedia. Gordon Mohr's Flexible Fields for MediaWiki proposal could probably serve both roles.

Once people get hooked on access to a semantic encyclopedia, perhaps they’ll want similar access to the entire web.

Via Danny Ayers.

Ontology is Underrated

Monday, August 8th, 2005

A couple months ago I checked to see if anyone had written the exact and obvious words “ontology is underrated” or “ontologies are underrated” in response to Clay Shirky’s somewhat overrated Ontology is Overrated. Nothing, and amazingly, still nothing (according to Google and Yahoo).

I don't feel up to writing a real Ontology is Underrated essay, not least because I don't have strong feelings either way, apart from wanting to see mischaracterization (link only tangentially relevant to the subject of this post) put to rest.

Peter Merholz’s Clay Shirky’s Viewpoints are Overrated would be a pretty good start on a definitive Ontology is Underrated.

Ugly metadata deployed

Friday, June 3rd, 2005

Peter Saint-André, a good person for preferring the public domain and much else, writes about Creative Commons metadata:

It’d be cool if smart search engines could automagically find web pages that are offered under one of the Creative Commons licenses.

I agree, which is why we (I work for Creative Commons, though I do not speak for them in this publication) built a prototype in early 2004 and a more robust beta based on Nutch later that year. March this year brought Yahoo! Search for Creative Commons, very recently also added to Yahoo! Advanced Search. I predict more and better of the same for CC and other metadata-enhanced searches.

For reasons unknown to mere mortals like me, CC recommends placing some RDF in an HTML comment as the proper way to “tag” a web page (Uche explains more here). Well, gosh, who thought that up? Are we not in possession of fine XHTML metadata technologies like the <meta/> tag?

Aaron Swartz thought it up, for good reasons. You can find a brief explanation I believe written by Aaron here (linked at the Wayback machine for reference as the current documentation may change). However, this doesn’t capture the most important reason, which I’ve had the pleasure of explaining a gazillion times, e.g., here

A separate RDF file is a nonstarter for CC. After selecting a license a user gets a block of HTML to put in their web page. That block happens to include RDF (unfortunately embedded in comments). Users don’t have to know or think about metadata. If we need to explain to them that you need to create a separate file, link to it in the head of the document, and by the way the separate file needs to contain an explicit URI in rdf:about … forget about it.

and here

Requiring metadata placed in the HEAD of an HTML page will dramatically decrease metadata adoption. The only reason so much CC metadata is out there now is that including it is a zero-cost operation. When the user selects a license and copies and pastes the HTML with a license statement and button into their web page, they get the embedded RDF without having to know anything about it. Getting people to take extra steps to include or produce metadata is very hard, perhaps futile. I tend to believe that good metadata must either be a side effect of some other process (e.g., selecting a license) or a collaborative effort by an interested community (e.g., Amazon book reviews, Bitzi, DMoz, MusicBrainz) (leaving out the case of $$$ for knowledge workers).

in reply to people who want CC metadata included with web documents in various fashions. On that, see my recent reply to someone else suggesting the same method Peter proposes:

There are zillions of options for sticking metadata into a [X]HTML document. If you must, use whatever you prefer. It is my concern to encourage dominant uses so that software can reliably find metadata. IMO there are now three fairly widely deployed schemes for CC licenses, not all mutually exclusive (rough examples of each are sketched below):

1. Embed RDF in HTML comment
2. rel="license" attribute on <a href="license-uri">
3. <link> to an external RDF file

#1 is our legacy format, the default produced by the licensing engine, and very widely deployed
#2 is also now produced by the licensing engine, has the support of small-s semantic web/Semantic XHTML people, and will be RDF-compatible via GRDDL eventually
#3 is used by other RDF apps and is the only non-controversial means of including RDF with an XHTML document. Wikipedia publishes CC-compatible metadata using this method

In the future we’ll probably add a fourth, which will replace #1 and #2 in license engine output, when it gets baked into a W3C standard, which is ongoing — http://www.formsplayer.com/notes/rdf-a.html
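For concreteness, here is roughly what each of the three schemes looks like in markup (the license URI, filename and RDF details are illustrative, not the exact output of the license engine):

1. RDF embedded in an HTML comment:

<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="">
    <license rdf:resource="http://creativecommons.org/licenses/by/2.0/"/>
  </Work>
</rdf:RDF>
-->

2. A rel="license" attribute on the license link:

<a rel="license" href="http://creativecommons.org/licenses/by/2.0/">Some Rights Reserved</a>

3. A <link> in the document head pointing at a separate RDF file:

<link rel="meta" type="application/rdf+xml" href="license-metadata.rdf" />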

Yes, RDF embedded in HTML comments is a horribly ugly hack. Eventually it’ll be superseded. In the meantime, massive deployment wins. Sorry.

H C

Wednesday, March 23rd, 2005

This music had every cell and fiber in my body on heavy sizzle mode.

Thurston Moore on mixtapes; he could be describing me listening to early Sonic Youth or one of my many ecstasy-inducing 120-minute cassettes, which I'm mostly afraid to touch and really need to digitize. Yes, Moore relates it all to MP3, P2P, etc., sounding like he's from the EFF:

Once again, we’re being told that home taping (in the form of ripping and burning) is killing music. But it’s not: It simply exists as a nod to the true love and ego involved in sharing music with friends and lovers. Trying to control music sharing – by shutting down P2P sites or MP3 blogs or BitTorrent or whatever other technology comes along – is like trying to control an affair of the heart. Nothing will stop it.

[Via Lucas Gonze.]

I'd like little more right now than for Sonic Youth or one of Moore's many avant projects to release some crack under a Creative Commons license. Had they already, you could maybe find it via the just-released Yahoo! Search for Creative Commons. (How's that for a lame segue?)

SemWeb, AI, Java: The Ontological Parallels

Friday, March 18th, 2005

"The Semantic Web: Promising Future or Utter Failure", the panel I took part in at SXSW, shed little light on the topic. Each panelist (including me) brought their own idiosyncratic views to bear, and we largely talked past each other. The overall SXSW Interactive crowd seemed to tend toward web designers and web marketers; I'm not sure about the audience for this panel. Some people, e.g., Chet Campbell, and others in person, apparently left with the impression that all of the panelists agreed that the semantic web is an utter failure (not my view at all).

Sam Felder and Josh Knowles have posted loose transcripts and Christian Bradford a photo of the panel.

The approximate (with links and a few small corrections) text of my introductory statement follows. I got a few laughs.

I want to draw some parallels between semantic web technologies and artificial intelligence and between semantic web technologies and Java.

AI was going to produce intelligent machines. It didn’t and since the late 80s we’ve been in an “AI winter.” That’s nearly twenty years, so web people who suffered and whined in 2001-3, your cup is more than half full. Anyway since then AI techniques have been used in all sorts of products, but once deployed the technology isn’t seen as AI. I mean, where are the conscious robots?

Semantic web technologies have a shorter history, but may play out similarly: widely used but not recognized as such. Machine "agents" aren't inferring a perfect date for me from my FOAF profile. Or something. This problem is magnified because there's a loose connection between semantic web grand visions and AI. People work on both at MIT after all.

Now Java. Applets were going to revolutionize the web. In 1996! Applets didn't work very well, but lots of people learned Java and it turns out Java is a pretty good solution on the server side. Java is hugely successful as the 21st century's COBOL. Need some "business logic"? You won't get fired for implementing it in Java, preferably using JDBC, JSP, JMX, JMS, EJB, JAXB, JDO and other buzzword-compliant APIs.

Semantic web technologies may be following a similar path. Utter failure to live up to initial hype in a sexy domain, but succeeding in the enterprise where the money is anyway. I haven’t heard anyone utter the word enterprise at this conference, so I won’t repeat it.

It turns out that semantic web technologies are really useful for data integration when you have heterogeneous data, as many people do these days. Just one example: Oracle will support a "Network Data Model" in the next release of their database. That may sound like a throwback if you know database history, but it basically means explicit support for storing and querying graphs, which are the data model of RDF and the semantic web.

If you talk to a few of the people trying to build intelligent machines today, who may use the term Artificial General Intelligence to distinguish themselves from AI, you may get a feeling that AI research hasn’t really moved us toward the goal of building an AGI.

Despite Java's success on the server, it is no closer to being important on the web client than it was in 1996. It is probably further away. If what you care about is sexy web browser deployment, all Java's server success has accomplished is to keep the language alive.

Semantic web technologies may be different. Usefulness in behind-the-scenes data integration may help these technologies gain traction on the web. Why? Because for someone trying to make use of data on the web, the web is one huge heterogeneous data integration problem.

An example of a project that uses RDF for data integration that you can see is mSpace. You can read a paper about how they use RDF inside the application, but it won't be obvious to an end user that it's a semantic web technologies application, and that's as it should be.

One interesting thing about mSpace is that they’re using a classical music ontology developed by someone else and found on SchemaWeb. SchemaWeb is a good place to look for semantic web schemas that can be reused in your project. Similarly, rdfdata.org is a good place to look for RDF datasets to reuse. There are dozens of schemas and datasets listed on these sites contributed by people and organizations around the world, covering beer, wine, vegetarian food, and lots of stuff you don’t put in your mouth.

I intended to close my statement with a preemption of the claim that use of semantic web technologies mandates hashing everything out in committees before deployment (wrong), but I trailed off with something I don’t recall. The committee myth came up again during the discussion anyway.

Perhaps I should’ve stolen Eric Miller’s The Semantic Web is Here slides.

SemWeb not by committee

Sunday, March 13th, 2005

At SXSW today Eric Meyer gave a talk on Emergent Semantics. He humorously described emergent as a fancy way of saying grassroots, ground-up (from the bottom, or like ground beef), or evolutionary. The talk was about adding rel attributes to XHTML <a> elements, or the lowercase semantic web, or Semantic XHTML, of which I am a fan.

Unfortunately Eric made some incorrect statements about the uppercase Semantic Web, or RDF/RDFS/OWL, of which I am also a fan. First, he implied that the lowercase semantic web is to the Semantic Web as evolution is to intelligent design, the current last redoubt of apologists for theism.

Very much related to this analogy, Eric stressed that use of Semantic XHTML is ad hoc and easy to experiment with, while the Semantic Web requires getting a committee to agree on an ontology.

Not true! Just using rel="foo" is equivalent to using a http://example.com/foo RDF property (though the meaning of the RDF property is better defined — it applies to a URI, while the application of the implicit rel property is loose).
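For instance, a minimal sketch (the URIs and the exact RDF reading are illustrative):

<a rel="foo" href="http://example.org/page">a page</a>

on a page at http://example.com/ says roughly the same thing as this RDF:

<rdf:Description rdf:about="http://example.com/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.com/">
  <ex:foo rdf:resource="http://example.org/page"/>
</rdf:Description>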

In the case of more complex formats, an individual can define something like hCard (lowercase) or vCard-RDF (uppercase).
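A minimal side-by-side sketch of the two (the names, URIs and namespaces are illustrative):

<div class="vcard"><span class="fn">Jane Example</span></div>

<rdf:Description rdf:about="http://example.com/jane"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#">
  <vCard:FN>Jane Example</vCard:FN>
</rdf:Description>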

No committee approval is required in any of the above examples. vCard-RDF happens to have been submitted to the W3C, but doing so is absolutely not required, as I know from personal experience at Bitzi and Creative Commons, both of which use RDF never approved by committee.

At best there may be a tendency for people using RDF to try to get consensus on vocabulary before deployment, while there may be a tendency for people using Semantic XHTML to throw keywords at the wall and see if they stick (however, Eric mentioned that the XFN (lowercase) core group debated whether to include "me" in the first release of their spec). Neither technology mandates either approach. If either of these tendencies exists, it must be cultural.

I think there is value both in the ad hoc culture and, more importantly, the closeness of Semantic XHTML assertions to human-readable markup of the lowercase semantic web, and in the rigor of the uppercase Semantic Web.

It may be useful to transform rel="" assertions to RDF assertions via GRDDL or a GRDDL-inspired XMDP transformation.
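The GRDDL hook is just a profile on the document head plus a link to a transformation, roughly like this (the stylesheet name is hypothetical):

<head profile="http://www.w3.org/2003/g/data-view">
  <link rel="transformation" href="grok-rel.xsl" />
  <title>My page</title>
</head>

A GRDDL-aware agent fetching the page can apply the named XSLT to extract RDF triples from the markup.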

I will find it useful to bring RDF into XHTML, probably via RDF/A, which I like to call Hard Core Semantic XHTML.
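A sketch of what that might look like, along the lines of the RDF/A drafts circulating now (attribute names and details may change as the spec evolves):

<p xmlns:dc="http://purl.org/dc/elements/1.1/">
  <span about="http://example.com/essay" property="dc:title">An Essay</span> by
  <span about="http://example.com/essay" property="dc:creator">Jane Example</span>.
</p>

The statements live in attributes on ordinary XHTML elements, so the RDF rides along with the human-readable markup.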

Marc Canter as usual expressed himself from the audience (and on his blog). Among other things Marc asked why Eric didn’t use the word metadata. I don’t recall Eric’s answer, but I commend him for not using the term. I’d be even happier if we could avoid the word semantic as well. Those are rants for another time.

Addendum: I didn't make it to the session this afternoon, but Tantek Çelik's slides for The Elements of Meaningful XHTML are an excellent introduction to Semantic XHTML for anyone familiar with [X]HTML.

Addendum 20050314: Eric Meyer has posted his slides.

SXSW & Etech

Saturday, March 12th, 2005

I’m in Austin now through Monday for SXSW and in San Diego Tuesday through Thursday for Etech. I’m sad that I won’t be around for any music showcases this year and that I have to leave Austin for one of my less favorite places, but Etech is the better conference.

I’m helping Matt Haughey with a SXSW panel, The Semantic Web: Promising Future or Utter Failure (I’ll be the SemWeb technologies advocate) and an Etech session, Remixing Culture with RDF: Running a Semantic Web Search in the Wild.

Creative Commons will have other events and a party at SXSW.

CodeCon Saturday

Sunday, February 13th, 2005

CodeCon is 5/5 today.

The Ultra Gleeper. A personal web page recommendation system. The promise of collaborative filtering has gone unfulfilled, in the dark ages since Firefly was acquired and shut down in the mid-90s. The presenter believes we're about to experience a renaissance in recommendation systems, citing Audioscrobbler recommendations (I would link to mine, but personal recommendations seem to have disappeared since the last time I looked; my audioscrobbler page) as a useful example (I have found no automated music recommendation system useful) and blogs as a use case for recommendations (I have far too much very high quality manually discovered reading material, including blogs, to desire automated recommendations for more, and I don't see collaborative filtering as a useful means of prioritizing my lists). The Ultra Gleeper crawls pages you link to, treating links as positive ratings, and pages that link to you (found via Technorati CosmosQuery and the Google API), and presents suggested pages to rate in a web interface. It uses a number of tricks to avoid showing obvious recommendations (it does not recommend pages that are too popular) and pages you've already seen (including those linked to in feeds you subscribe to). Some problems faced by typical recommendation systems (new users get crummy recommendations until they enter lots of data, early adopters get doubly crummy recommendations due to lack of existing data to correlate with) are obviated by bootstrapping from data in your posts and subscriptions. I suppose if lots of people run something like the Gleeper, robot traffic increases and more people complain about syndication-bandwidth-like problems (I'm skeptical about this being a major problem). I don't see lots of people running Gleepers, as automated recommendation systems are still fairly useless and will remain so for a long time. Interesting software and presentation nonetheless.

H2O. Primarily a discussion system tuned to facilitate professor-assigned discussions. Posts may be embargoed, and a professor may assign course participants specific messages or other participants to respond to. Discussions may include participants from multiple courses, e.g., to facilitate an MIT engineering-Harvard law exchange. Anyone may register at H2O and create their own group, acting as professor for the created group. Some of the constraints that may be imposed by H2O are often raised in mailing list meta discussions following flame wars, in particular posting delays. I dislike web forums but may have to try H2O out. Another aspect of H2O is syllabus management and sharing, which is interesting largely because syllabi are typically well hidden. Professors in the same school of the same university may not be aware of what the others are teaching.

Jakarta Feedparser. Kevin Burton gave a good overview of syndication and related standards and the many challenges of dealing with feeds in the wild, which are broken in every conceivable way. Claims SAX (event) based Jakarta FeedParser is an order of magnitude faster than DOM (tree) based parsers. Nothing new to me, but very useful code.

MAPPR. Uses Flickr tags and GNS to divine the geographic location of photos. REST web services modeled on Flickr's own. Flash front end, which you could spend many hours playing with.

Photospace. Personal image annotation and search service, with a focus on geolocation. Functionality available as a library; a web front end is provided. Photospace publishes RDF which may be consumed by RDFMapper.

Note above two personal web applications that crawl or use services of other sites (The Ultra Gleeper is the stronger example of this). I bet we’ll see many more of increasing sophistication enabled by ready and easily deployable software infrastructure like Jakarta FeedParser, Lucene, SQLite and many others. A personal social networking application is an obvious candidate. Add in user hosted or controlled authentication (e.g., LID, perhaps idcommons) …

Yesterday.

Not following tags

Thursday, January 20th, 2005

“Do not credit this link” is a useful assertion that cannot be gleaned from surrounding content.

Thus, rel="nofollow" is a good if old idea. At least one of my two search predictions for 2005 is already coming true.

Creator-assigned keywords or "tags", on the other hand, strike me as a contemporary implementation of HTML meta description tags, which failed because they placed a burden on good webmasters (classification is hard) and presented an open field for spammers, who tag[ged] their pages making a hard sell for whatever with completely unrelated keywords.
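For reference, the kind of markup in question is the old meta description and keywords tags (the values here are illustrative spam):

<meta name="description" content="The finest widgets at the lowest prices" />
<meta name="keywords" content="widgets, free mp3, celebrity gossip, mortgage rates" />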

Global classification strikes me as a case in which Google is right — metadata inferred from content beats explicit, manual metadata when it comes to categorization. From the Peter Norvig (Google Director of Search Quality) interview I cited:

This is a Google News page from last night, and what we’ve done here is apply clustering technology to put the news stories together in categories, so you see the top story there about Blair, and there’re 658 related stories that we’ve clustered together.

Now imagine what it would be like if instead of using our algorithms we relied on the news suppliers to put in all the right metadata and label their stories the way they wanted to. “Is my story a story that’s going to be buried on page 20, or is it a top story? I’ll put my metadata in. Are the people I’m talking about terrorists or freedom fighters? What’s the definition of patriot? What’s the definition of marriage?”

Folksonomies are great in limited domains, thus far most famously for organizing and sharing bookmarks (which could be decentralized using the same technology as Technorati's self-tagging) and organizing photos.
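That self-tagging technology is just a rel attribute on an ordinary link, e.g. (tag and URL illustrative):

<a href="http://technorati.com/tag/folksonomy" rel="tag">folksonomy</a>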

Keyword tagging is also a lightweight way to provide navigation for a website. I might categorize more posts on this weblog if I could do so in a similarly lightweight manner (now I have to create categories via an interface separate from posting). Haven't I come right back to the creator-assigned keywords that I criticized above? No, there's a subtle but very important difference: metadata as a side effect of useful work versus metadata as spammy make-work.