Post Microformats

Life in the possibly bright future of the federated social indieweb

Saturday, June 8th, 2013

After about five years (2.5 year update) it’s hard not to be disappointed in the state of the federated social web. Legacy silos have only increased their dominance, abetting mass spying, and interop among federated social web experiments looks bleak (link on different topic, but analogous).

In hindsight it was disappointing 5 years ago that blogs and related (semweb 1.0?) technologies hadn’t formed the basis of the federated social web (my pet theory is that the failure is in part due to the separation of blog post/comment writing and feed reading).

Another way of looking at it is that despite negligible resources focused on the problem, much progress has been made in figuring out how to do the federated social web over the past five years. Essentially nothing recognizable as a social web application federated five years ago. There are now lots of experiments, and two of the pioneers have learned enough to determine a rewrite was necessary — Friendica→Red and the occasion for this post, StatusNet→

Right now is a good time to try out a federated social web service (hosted elsewhere, or run your own instance) again, or for the first time:

My opinion, at the moment: has the brightest future, Diaspora appears the most featureful (inclusive of looking nice) to users, and Friendica is the best at federating with other systems. Also see a comparison of software and protocols for distributed social networking and the Federated Social Web W3C community group.

The Indie Web movement is complementary, and in small part might be seen as taking blog technologies and culture forward. When I eventually rebuild a personal site, or a new site for an organization, indieweb tools and practices will be my first point of reference. Their Publish (on your) Own Site, Syndicate Elsewhere and Publish Elsewhere, Syndicate (to your) Own Site concepts are powerful and practical, and I think what a lot of people want to start with from federated social web software.

*Running StatusNet as I write, to be converted to over the next hours. The future of StatusNet is to be at GNU social.

Web Data Common[s] Crawl Attribution Metadata

Monday, January 23rd, 2012

Via I see Web Data Commons which has “extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010″. WDC publishes the extracted data as N-Quads (the fourth item denotes the immediate provenance of each subject/predictate/object triple — the URL the triple was extracted from).

I thought it would be easy and fun to run some queries on the WDC dataset to get an idea of how annotations associated with Creative Commons licensing are used. Notes below on exactly what I did. The biggest limitation is that the license statement itself is not part of the dataset — not as xhv:license in the RDFa portion, and for some reason rel=license microformat has zero records. But cc:attributionName, cc:attributionURL, and cc:morePermissions are present in the RDFa part, as are some Dublin Core properties that the Creative Commons license chooser asks for (I only looked at dc:source) but are probably widely used in other contexts as well.

Dataset URLs Distinct objects
Common Crawl 2010 corpus 5,000,000,000a
1% sampled by WDC ~50,000,000
with RDFa 158,184b
with a cc: property 26,245c
cc:attributionName 24,942d 990e
cc:attributionURL 25,082f 3,392g
dc:source 7,235h 574i
cc:morePermissions 4,791j 253k
cc:attributionURL = dc:source 5,421l
cc:attributionURL = cc:morePermissions 1,880m
cc:attributionURL = subject 203n

Some quick takeaways:

  • Low ratio of distinct attributionURLs probably indicates HTML from license chooser deployed without any parameterization. Often the subject or current page will be the most useful attributionURL (but 203 above would probably be much higher with canonicalization). Note all of the CC licenses require that such a URL refer to the copyright notice or licensing information for the Work. Unless one has set up a side-wide license notice somewhere, a static URL is probably not the right thing to request in terms of requiring licensees to provide an attribution link; nor is a non-specific attribution link as useful to readers as a direct link to the work in question. As (and if) support for attribution metadata gets built into Creative Commons-aware CMSes, the ratio of distinct attributionURLs ought increase.
  • 79% of subjects with both dc:source and cc:attributionURL (6,836o) have the same values for both properties. This probably means people are merely entering their URL into every form field requesting a URL without thinking, not self-remixing.
  • 47% of subjects with both cc:morePermissions and cc:attributionURL (3,977p) have the same values for both properties. Unclear why this ratio is so much lower than previous; it ought be higher, as often same value for both makes sense. Unsurprising that cc:morePermissions least provided property; in my experience few people understand it.

I did not look at the provenance item at all. It’d be interesting to see what kind of assertions are being made across authority boundaries (e.g. a page on makes a statements with an URI as the subject) and when to discard such. I barely looked directly at the raw data at all; just enough to feel that my aggregate numbers could possibly be accurate. More could probably be gained by inspecting smaller samples in detail, informing other aggregate queries.

I look forward to future extracts. Thanks indirectly to Common Crawl for providing the crawl!

Please point out any egregious mistakes made below…

# a I don't really know if the October 2010 corpus is the
# entire 5 billion Common Crawl corpus

# download RDFa extract from Web Data Commons
wget -c

# Matches number stated at
wc -l ccrdf.html-rdfa.nq

# Includes easy to use no-server triplestore
apt-get install redland-utils

# sanity check
grep '<>' ccrdf.html-rdfa.nq |wc -l

# Import rejects a number of triples for syntax errors
rdfproc xyz parse ccrdf.html-rdfa.nq nquads

# d Perhaps syntax errors explains fewer triples than above grep might
# indicate, but close enough
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l

# These replicated below with 4store because...
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?o <> ?o }' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?o <> ?o }' |wc -l

# ...this query takes forever, hours, and I have no idea why
rdfproc xyz query sparql - 'select ?s, ?o where { ?s <> ?o ; <> ?o }'

# 4store has a server, but is lightweight
apt-get install 4store

# 4store couldn't import with syntax errors, so export good triples from
# previous store first
rdfproc xyz serialize > ccrdf.export-rdfa.rdf

# import into 4store
curl -T ccrdf.export-rdfa.rdf 'http://localhost:8080/data/wdc'

# egrep is to get rid of headers and status output prefixed by ? or #
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

#Of course please use instead.
#Should be more widely deployed soon.
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <> ?o ; <> ?n }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <> ?o ; <> ?n }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n, ?m where { ?s <> ?o ; <> ?n ; <> ?m }' |egrep -v '^[\?\#]' |wc -l
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <> ?o } UNION { ?s <> ?n } UNION { ?s <> ?m }  }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <> ?o } UNION { ?s <> ?n }}' |egrep -v '^[\?\#]' |wc -l

#b note subjects not the same as pages data extracted from (158,184)
4s-query wdc -s '-1' -f text 'select distinct ?s where { ?s ?p ?o }'  |egrep -v '^[\?\#]' |wc -l

# Probably less than 1047250 claimed due to syntax errors
4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?o }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?s }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?s }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?o }'  |egrep -v '^[\?\#]' |wc -l

Life in the kind of bleak future of HTML data

Thursday, January 12th, 2012

Evan Prodromou wrote in 2006:

I think that if and the RDFa effort continue moving forward without coordinating their effort, the future looks kind of bleak.

I blogged about this at the time (and forgot and reblogged five months later). I recalled this upon reading a draft HTML Data Guide announced today, and trying to think of a tl;dr summary to at least microblog.

That’s difficult. The guide is intended to help publishers and consumers of HTML data choose among three syntaxes (all mostly focused on annotating data inline with HTML meant for display) and a variety of vocabularies, with heavy dependencies between the two. Since 2006, people working on microformats and RDFa have done much to address the faults of those specifications — microformats-2 allows for generic (rather than per-format) parsing, and RDFa 1.1 made some changes to make namespaces less needed, less ugly when needed, and usable in HTML5, and specifies a lite subset. In 2009 a third syntax/model, microdata, was launched, and then in 2011 chosen as the syntax for (which subsequently announced it would also support RDFa 1.1 Lite).

I find the added existence of microdata and suboptimal (optimal might be something like microformats process for some super widely useful vocabularies, with a relatively simple syntax but permitting generic parsing and distributed extensibility; very much like what Prodromou wanted in 2006), but when is anything optimal? I also wonder how much credit microdata ought get for microformats-2 and RDFa 1.1, due to providing competitive pressure? And for invigorating metadata-enhanced web-scale search and vocabulary design (for example, the last related thing I was involved in, at the beginning anyway)?

Hope springs eternal for getting these different but overlapping technologies and communities to play well together. I haven’t followed closely in a long time, but I gather that Jeni Tennison is one of the main people working on that, and you should really subscribe to her blog if you care. That leaves us back at the HTML Data Guide, of which Tennison is the editor.

My not-really-a-summary:

  1. Delay making any decisions about HTML data; you probably don’t want it anyway (metadata is usually a cost center), and things will probably be more clear when you’re forced to check back due to…
  2. If someone wants data from you as annotated HTML, or you need data from someone, and this makes business sense, do whatever the other party has already decided on, or better yet implemented (assuming their decision isn’t nonsensical; but if so why are you doing business with them?)
  3. Use a validator to test your data in whatever format. An earlier wiki version of some of the guide materials includes links to validators. In my book, Any23 is cute.

(Yes, CC REL needs updating to reflect some of these developments, RDFa 1.1 at the least. Some license vocabulary work done by SPDX should also be looked at.)

Semantic ref|pingback for re-use notification

Sunday, May 15th, 2011

Going back probably all the way to 2003 (I can’t easily pinpoint, as obvious mail searches turn up lots of hand-wringing about structured data in/for web pages, something which persists to this day) people have suggested using something like trackback to notify that someone has [re]used a work, as encouraged under one of the Creative Commons licenses. Such notification could be helpful, as people often would like to know someone is using their work, and might provide much better coverage than finding out by happenstance or out-of-band (e.g., email) notification and not cost as much as crawling a large portion of the web and performing various medium-specific fuzzy matching algorithms on the web’s contents.

In 2006 (maybe 2005) Victor Stone implemented a re-use notification (and a bit more) protocol he called the Sample Pool API. Several audio remix sites (including ccMixter, for which Victor developed the API; side note: read his ccMixter memoir!), but it didn’t go beyond that, probably in part because it was tailored to a particular genre of sites, and another part because it wasn’t clear how to do correctly, generally, get adoption, sort out dependencies (see hand-wringing above), and resource/prioritize.

I’ve had in mind to blog about re-use notification for years (maybe I already have, and forgot), but right now I’m spurred to by skimming Henry Story and Andrei Sambra’s Friending on the Social Web, which is largely about semantic notifications. Like them, I need to understand what the OStatus stack has to say about this. And I need to read their paper closely.

Ignorance thusly stated, I want to proclaim the value of refback. When one follows a link, one’s user agent (browser) often will send with the request for the linked page (or other resource) the referrer (the page with the link one just followed). In some cases, a list of pages linking to one’s resources that might be re-used can be rather valuable if one wants to bother manually looking at referrers for evidence of re-use. For example, Flickr provides a daily report on referrers to one’s photo pages. I look at this report for my account occasionally and have manually populated a set of my re-used photos largely by this method. This is why I recently noted that the (super exciting) MediaGoblin project needs excellent reporting.

Some re-use discovery via refback could be automated. My server (and not just my server, contrary to Friending on the Social Web; could be outsourced via javascript a la Google Analytics and Piwik) could crawl the referrer and look for structured data indicating re-use at the referrer (e.g., my page or a resource on it is subject or object of relevant assertions, e.g., dc:source) and automatically track re-uses discovered thusly.

A pingback would tell my server (or service I have delegated to) affirmatively about some re-use. This would be valuable, but requires more from the referring site than merely publishing some structured data. Hopefully re-use pingback could build upon the structured data that would be utilized by re-use refback and web agents generally.

After doing more reading, I think my plan to to file the appropriate feature requests for MediaGoblin, which seems the ideal software to finally progress these ideas. A solution also has obvious utility for oft-mooted [open] data/education/science scenarios.

Semantic Web Web Web

Wednesday, November 21st, 2007

The and particularly its efforts do great, valuable work. I have one massive complaint, particularly about the latter: they ignore the Web at their peril. Yes, it’s true, as far as I can tell (but mind that I’m one or two steps removed from actually working on the problems), that the W3C and Semantic Web activities do not appreciate the importance of nor dedicate appropriate resources to the Web. Not just the theoretical Web of URIs, but the Web that billions of people use and see.

I’m reminded of this by Ian Davis’ post Is the Semantic Web Destined to be a Shadow?:

My belief is that trust must be considered far earlier and that it largely comes from usage and the wisdom of the crowds, not from technology. Trust is a social problem and the best solution is one that involves people making informed judgements on the metadata they encounter. To make an effective evaluation they need to have the ability to view and explore metadata with as few barriers as possible. In practice this means that the web of data needs to be as accessible and visible as the web of documents is today and it needs to interweave transparently. A separate, dry, web of data is unlikely to attract meaningful attention, whereas one that is a full part of the visible and interactive web that the majority of the population enjoys is far more likely to undergo scrutiny and analysis. This means that HTML and RDF need to be much more connected than many people expect. In fact I think that the two should never be separate and it’s not enough that you can publish RDF documents, you need to publish visible, browseable and engaging RDF that is meaningful to people. Tabular views are a weak substitute for a rich, readable description.

SXSW: Growth of Microformats

Saturday, March 17th, 2007

Monday afternoon’s packed The Growth and Evolution of Microformats didn’t strike me as terribly different from last year’s Microformats: Evolving the Web. Last year’s highlight was a Flock demo, this year’s was an Operator demo.

My capsule summary of the growth and (not much) evolution of Microformats over the past twelve months: a jillion names, addresses, and events have been marked up with hCard and hCalendar formatting.

SXSW: Semantic Web 2.0 and Scientific Publishing

Saturday, March 10th, 2007

Web 2.0 and Semantic Web: The Impact on Scientific Publishing, probably the densest panel I attended today (and again expertly moderated by Science Commons’ John Wilbanks), covered , new business models for scientific publishers, and how web technologies can help with these and data problems, but kept coming back to how officious Semantic Web technologies and controlled ontologies (which are not the same at all, but are often lumped together) and microformats and tagging (also distinct) complement each other (all four of ‘em!), even within a single application. I agree.

Nearly on point, this comment elsewhere by Denny Vrandecic of the Semantic MediaWiki project:

You are supposed to change the vocabulary in a wiki-way, just as well as the data itself. Need a new relation? Invent it. Figured out it’s wrong? Rename. Want a new category of things? Make it.

Via Danny Ayers, oringal posted to O’Reilly Radar, which doesn’t offer permalinks for comments. This just needs a catchy name. Web 2.0 ontology engineering? Fonktology?

Perils of a too cool name

Wednesday, February 14th, 2007

I’ve seen lots of confusion about microformats, but Jon Udell takes the cake in describing XMP:

It’s a bit of a mish-mash, to say the least. There’s RDF (Resource Description Framework) syntax, Adobe-style metadata syntax, and Microsoft-style metadata syntax. But it works. And when I look at this it strikes me that here, finally, is a microformat that has a shot at reaching critical mass.

How someone as massively clued-in as Jon Udell could be so misled as to describe XMP as a microformat is beyond me.

, which is basically a constrained RDF/XML serialization following super-ugly conventions that may be embedded in a number of file formats (most prominently PDF and JPEG, but potentially almost anything), is about as far from a as one could possibly get. Off the top:

  • XMP is RDF/XML and as such is arbitrarily extensible; each microformat covers a specific use case and goes through great lengths to favor interoperability among publishers of each microformat (sometime I will write about how microformat and RDF people mean completely different things by “interoperability”) at the expense of extensibility.
  • XMP is embedded in a binary file, completely opaque to nearly all users; microformats put a premium on (practically require) colocation of metadata with human-visible HTML.
  • XMP would be extremely painful to write by hand and there are very few tools that support publishing it; microformats, to a fault, put a premium on publisher ease–anyone with a passing familiarity with HTML could be writing microformats.

I don’t agree with everything the microformats folk have done, but they do have a pretty self-consistent approach, if one bothers to try to understand it. XMP ain’t it.

XMP is by far the most promising embedded metadata format for “media” files — which is mostly a testament to how terribly useless to non-existent the alternatives are (by some definitions there are none).

Addendum: I’m really only picking on one word from Udell’s post, the remainder of which is recommended. It is to learn that “There’s also good support in .NET Framework 3.0 for reading and writing XMP metadata.”

Update 20070215: Udell explains:

Now there is, as Mike points out, a big philosophical difference between XMP, which aims for arbitrary extensibility, and fixed-function microformats that target specific things like calendar events. But in practice, from the programmer’s perspective, here’s what I observe.

Hand me an HTML document containing a microformat instance and I will cast about in search of tools to parse it, find a variety of ones that sort of work, and then wrestle with the details.

Hand me an image file containing an XMP fragment and, lo and behold, it’s the same story!

Yes, for 99% of the .01% of the world that cares at all, microformats and XMP are the same: metadata, embedded data, or even just data. That’s what I was hinting at in the title of this post — in the minds of 99% of the .01%, microformats are becoming synonymous with metadata, i.e., genericized. This is either a marketing and naming coup or disaster, depending on one’s perspective (I don’t particularly care).

I considered this headline: If XMP is a microformat, RDFa sure the heck is a microformat.

Microformats are worse

Sunday, October 22nd, 2006

I almost entirely agree with Mark Birbeck’s comparison of RDFa and microformats. The only thing to be said in defense of is that a few of the problems Birbeck calls out are also features, from the microformats perspective.

But .

I will reveal what this means later.

Another quip: My problem with microformats is the s.

Evan Prodromou provided a still-good RDFa vs Microformats roundup (better title: “RDFa and Microformats, please work together”) in May. I somehow missed it until now.

Ah, metadata.

Update 20061204: I didn’t miss Prodromou’s roundup in May, I blogged about it. And forgot.

Long tail of metadata

Monday, May 29th, 2006

Ben Adida notes that people are writing about RDFa, which is great, and envisioning conflict with microformats, which is not. As Ben says:

Microformats are useful for expressing a few, common, well-defined vocabularies. RDFa is useful for letting publishers mix and match any vocabularies they choose. Both are useful.

In other words RDFa is a technology.

Evan Prodromou thinks the future is bleak without cooperation. I like his proposed way forward (strikeout added for obvious reasons):

  1. RDFa gets acknowledged and embraced by as the future of semantic-data-in-XHTML
  2. The RDFa group makes an effort to encompass existing microformats with a minimum of changes
  3. leaders join in on the RDFa authorship process
  4. becomes a focus for developing real-world RDFa vocabularies

I see little chance of points one and three occuring. However, I don’t see this as a particularly bad thing. Point three will occur, almost by default: the simplest and most widely deployed microformats (e.g., , and rellicense) are also valid RDFa — the predicate (e.g., tag, nofollow, license) appearing in the default namespace to a RDFa application. More complex microformats may be handled by hGRDDL, which is no big deal as a microformat-aware application needs to parse each microformat it cares about anyway. From an RDF perspective any well-crafted metadata is a plus (and the microformats group do very careful work) as RDF’s killer app is integrating heterogenous data sources.

From a microformats perspecitve RDFa might well be ignored. While transformation of any microformat to RDF is relatively straightforward, transformation of RDF (which is a model, not a format) to microformats is nonsensical (well, I suppose the endpoint of such a transformation could be , though I’m not sure what the point would be). Microformats, probably wisely, is not reinventing RDF (as many do, usually badly).

So why would RDFa be of interest to developers? In a word, laziness. There is no process to follow for developing an RDF vocabulary (ironic), you can freely reuse existing vocabularies and tools, not write your own parsers, and trust that really smart people are figuring out the hard stuff for you (I believe the formal background of the Semantic Web is a long-term win). Or you might just want to, as Ben says “express metadata about other documents (embedded images)” which is trivial for RDF as images have URIs.

Addendum 20060601: The “simplest” microformats mentioned above have a name: elemental microformats.