Post Semantic Web


Saturday, November 10th, 2012

Last week I attended CODATA 2012 in Taipei, the biennial conference of the Committee on Data for Science and Technology. I struggled a bit with deciding to go: I am neither a “data scientist” nor a scientist, and while I know a fair amount about some of the technical and policy issues of data management, their specific application to science has never been my expertise. It is all away from my current focus, and I’m skeptical of travel.

I finally went in order to see through a session on mass collaboration data projects and policies that I developed with Tyng-Ruey Chuang and Shun-Ling Chen. A mere rationalization as they didn’t really need my presence, but I enjoyed the conference and trip anyway.

My favorite moments from the panel:

  • Mikel Maron said approximately “not only don’t silo your data, don’t silo your code” (see a corresponding bullet in his slides), a point woefully and consistently underestimated and ignored by “open” advocates.
  • Chen’s eloquent polemic closing with approximately “mass collaboration challenges not only Ⓒ but distribution of power, authority, credibility”; I hope she publishes her talk content!

My slides from the panel (odp, pdf, slideshare) and from an open data workshop following the conference (odp, pdf, slideshare).

Tracey Lauriault summarized the mass collaboration panel (all of it, check out the parts I do not mention), including:

Mike Linksvayer, was provocative in stating that copyright makes us stupider and is stupid and that it should be abolished all together. I argued that for traditional knowledge where people are seriously marginalized and where TK is exploited, copyright might be the only way to protect themselves.

I’m pretty sure I only claimed that including copyright in one’s thinking about any topic, e.g., data policy, effectively makes one’s thinking about that topic more muddled and indeed stupid. I’ve posted about this before but consider a post enumerating the ways copyright makes people stupid individually and collectively forthcoming.

I didn’t say anything about abolishing copyright, but I’m happy for that conclusion to be drawn — I’d be even happier for the conclusion to be drawn that abolition is a moderate reform and boring (in no-brainer and non-interesting senses) among the possibilities for information and innovation policies — indeed, copyright has made society stupid about these broader issues. I sort of make these points in my future of copyright piece that Lauriault linked to, but will eventually address them directly.

Also, Traditional Knowledge, about which I’ve never posted, unless you count my claim that malgovernance of the information commons is ancient, for example cult secrets (mentioned in the first paragraph of the previous link), though I didn’t have contemporary indigenous peoples in mind, and TK covers a wide range of issues. Indeed, my instinct is to divide these issues between those where traditional communities are being excluded from their heritage (e.g., plant patents, institutionally-held data and items, perhaps copyrestricted cultural works building on traditional culture) and those where they would like a collective right to exclude information from the global public domain.

The theme of CODATA 2012 was “Open Data and Information for a Changing Planet” and the closing plenary appropriately aimed to place the entire conference in that context and question its impact and followup. That included the inevitable question of whether anyone would notice. At the beginning of the conference attendees were excitedly encouraged to tweet, and if I understood correctly, some conference staff were dedicated to helping people tweet. As usual, I find this sort of exhortation and dedication of resources to social media scary. But what about journalists? How can we make the media care?

Fortunately for (future) CODATA and other science and data related events, there’s a great answer (usually there isn’t one), but one I didn’t hear mentioned at all outside of my own presentation: invite data journalists. They could learn a lot from other attendees, have a meta story about exactly the topic they’re passionate about, and an inside track on general interest data-driven stories developing from data-driven science in a variety of fields — for example the conference featured a number of sessions on disaster data. Usual CODATA science and policy attendees would probably also learn a lot about how to make their work interesting for data journalists, and thus be able to celebrate rather than whinge when talking about media. A start on that learning, and maybe ideas for people to invite might come from The Data Journalism Handbook (disclaimer: I contributed what I hope is the least relevant chapter in the whole book).

Someone asked how to move forward and David Carlson gave some conceptually simple and very good advice, paraphrased:

  • Adopt an open access data publishing policy at the inception of a project.
  • Invest in data staff — human resources are the limiting factor.
  • Start publishing and doing small experiments with data very early in a project’s life.

Someone also asked about “citizen science”, to which Carlson also had a good answer (added to by Jane Hunter and perhaps others), in sum roughly:

  • Community monitoring (data collection) may be a more accurate name for part of what people call citizen science;
  • but the community should be involved in many more aspects of some projects, up to governance;
  • don’t assume “citizen scientists” are non-scientists: often they’ll have scientific training, sometimes full-time scientists contributing to projects outside of work.

What would have brought this full circle (and very much aligned with the conference’s theme and Carlson’s first recommendation above) is consideration of the scientist-as-citizen. Fortunately I had serendipitously titled my “open data workshop” presentation for the next day “Open data policy for scientists as citizens and for citizen science”.

Finally, “data citation” was another major topic of the conference, but semantic web/linked open data was not explicitly mentioned much, as someone observed in the plenary. I tend to agree, though I may have missed the most relevant sessions; they might have been my focus if I were actually working in the field. I did really enjoy happening to sit next to Curt Tilmes at a dinner, and catching up a bit on W3C Provenance (which I’ve mentioned briefly before), of which he is a working group member.

I got to spend a little time outside the conference. I’d been to Taipei once before, but failed to notice its beautiful setting — surrounded and interspersed with steep and very green hills.

I visited the National Palace Museum with Puneet Kishor. I know next to nothing about feng shui, but I was struck by what seemed an ultra-favorable setting taking advantage of some of the aforementioned hills (it made me think of feng shui, which I had never done before without someone else bringing it up). I think the more one knows about Chinese history, the more one gets out of the museum, but for someone who loves maps, the map room alone is worth the visit.

It was also fun hanging out a bit with Christopher Adams and Sophie Chiang, catching up with Bob Chao and seeing the booming Mozilla Taiwan offices, and meeting Florence Ko, Lucien Lin, and Rock of Open Source Software Foundry and Emily from Creative Commons Taiwan.

Finally, thanks to Tyng-Ruey Chuang, one of the main CODATA 2012 local organizers, and instigator of our session and workshop. He is one of the people I most enjoyed working with while at Creative Commons (e.g., a panel from last year) and given some overlapping technology and policy interests, one of the people I could most see working with again.

Falsifiable PR, science courts, legal prediction markets, web truth

Saturday, September 15th, 2012

Point of Inquiry podcast host Chris Mooney recently interviewed Rick Hayes-Roth of TruthMarket.

The site allows one to crowdfund a bounty for proving or disproving a claim that the sponsors believe to be a bogus or true statement respectively. If the sponsors’ claim is falsified, the falsifying party (challenger) gets the bounty, otherwise the initiating sponsor (campaign creator) gets 20% of the bounty, and other sponsors get about 80% of their contributions back. TruthMarket runs the site, adjudicates claims, and collects fees. See their FAQ and quickstart guide.
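The settlement rule as described can be sketched in a few lines of Python. This is my own reading of the description above (the function name, structure, and the exact interpretation of the 20%/80% split are assumptions, not anything TruthMarket publishes):

```python
def settle(contributions, initiator, falsified):
    """Split a TruthMarket-style bounty per the rule described above.

    contributions: dict mapping sponsor name -> amount contributed
    initiator: the sponsor who created the campaign
    falsified: True if a challenger successfully falsified the claim
    """
    total = sum(contributions.values())
    if falsified:
        # the successful challenger takes the entire bounty
        return {"challenger": total}
    # otherwise the initiating sponsor gets 20% of the bounty,
    # and other sponsors get about 80% of their contributions back
    payouts = {initiator: 0.20 * total}
    for sponsor, amount in contributions.items():
        if sponsor != initiator:
            payouts[sponsor] = 0.80 * amount
    return payouts
```

So under this reading, a $100 campaign creator joined by a $50 co-sponsor would recover $30 and $40 respectively if no challenger succeeds, with the remainder covering fees and adjudication.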

It seems fairly clear from the podcast that TruthMarket is largely a publicity mechanism. A big bounty for a controversial (as played out in the media anyway) claim could be newsworthy, and the spin would favor the side of truth. The claims currently on the site seem to be in this vein, e.g., Obama’s birth certificate and climate change. As far as I can tell there’s almost no activity on the site, the birth certificate claim, started by Hayes-Roth, being the only one funded.

The concept is fairly interesting though, reminding me of three things:

Many interesting combinations of these ideas are yet to be tried. Additionally, TruthMarket apparently started as TruthSeal, an effort to get web publishers to vouch monetarily for claims they make.

Web data formats Δ 2009-2012

Saturday, March 31st, 2012

In January I used Web Data Commons’ first data publication (triples, or actually quads, as the URL each triple was found at is retained, extracted from a small subset of a 2009-2010 Common Crawl corpus) to analyze use of Creative Commons-related metadata. On March 23 WDC announced a much, much larger data publication: all triples extracted from the entire Common Crawl 2009/2010 and 2012 corpora.

I’m not going to have time to perform a similar Creative Commons-related analysis on these new datasets (I really hope someone else does; perhaps see my previous post for inspiration), but the aggregate numbers about formats found by the two extractions are interesting enough, and not presented in an easy-to-compare format on the WDC site, for me to write the remainder of this post.


  • 2009/10 data taken from and 2012 data taken from Calculated values italicized. Available as a spreadsheet.
  • The next points, indeed all comparisons, should be treated with great skepticism — it is unknown how comparable the two Common Crawl corpora are.
  • Publication of structured data on the web is growing rapidly.
  • Microdata barely existed in 2009/2010, so it is hardly surprising that it has grown tremendously.
  • Overall microformats adoption seems to have stagnated but still holds the vast majority of extracted data. It is possible however that the large decreases in hlisting and hresume use and the increase in hrecipe use are due to one or a few large sites, or CMSes run by many sites. This (indeed everything) bears deeper investigation. What about deployment of microformats-2, with prefixed class names that I don’t think would be matched by the Web Data Commons extractor?
  • Perhaps the most generally interesting item below doesn’t bear directly on HTML data — the proportion of URLs in the Common Crawl corpora parsed as HTML declined by 4 percentage points. Is this due to more non-HTML media or more CSS and JS?
| | 2009/10 | 2012 | Change (% points or per URL) |
|---|---|---|---|
| Total Data (compressed) | 28.9 Terabyte | 20.9 Terabyte | |
| Total URLs | 2,804,054,789 | 1,700,611,442 | |
| Parsed HTML URLs | 2,565,741,671 | 1,486,186,868 | |
| Domains with Triples | 19,113,929 | 65,408,946 | |
| URLs with Triples | 251,855,917 | 302,809,140 | |
| Typed Entities | 1,546,905,880 | 1,222,563,749 | |
| Triples | 5,193,276,058 | 3,294,248,652 | |
| *% Total URLs parsed as HTML* | *91.50%* | *87.39%* | *-4.11%* |
| *% HTML URLs with Triples* | *9.82%* | *20.37%* | *10.56%* |
| *Typed Entities/HTML URL* | *0.60* | *0.82* | *0.22* |
| *Triples/HTML URL* | *2.02* | *2.22* | *0.19* |
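The calculated rows follow directly from the raw counts; a quick recomputation in Python (counts copied from the summary above, variable names mine):

```python
# Aggregate counts for the two Common Crawl corpora, from Web Data Commons
corpora = {
    "2009/10": dict(total_urls=2_804_054_789, html_urls=2_565_741_671,
                    urls_with_triples=251_855_917, triples=5_193_276_058),
    "2012":    dict(total_urls=1_700_611_442, html_urls=1_486_186_868,
                    urls_with_triples=302_809_140, triples=3_294_248_652),
}

for name, c in corpora.items():
    pct_html = 100 * c["html_urls"] / c["total_urls"]
    pct_with_triples = 100 * c["urls_with_triples"] / c["html_urls"]
    triples_per_html = c["triples"] / c["html_urls"]
    print(f"{name}: {pct_html:.2f}% of URLs parsed as HTML, "
          f"{pct_with_triples:.2f}% of HTML URLs with triples, "
          f"{triples_per_html:.2f} triples/HTML URL")
```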
| 2009/10 Extractor | Domains with Triples | URLs with Triples | Typed Entities | Triples | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
|---|---|---|---|---|---|---|---|---|---|
| html-rdfa | 537,820 | 14,314,036 | 26,583,558 | 293,542,991 | 0.56% | 5.68% | 2.81% | 1.72% | 5.65% |
| html-microdata | 3,930 | 56,964 | 346,411 | 1,197,115 | 0.00% | 0.02% | 0.02% | 0.02% | 0.02% |
| html-mf-geo | 244,838 | 5,051,622 | 7,326,516 | 28,831,795 | 0.20% | 2.01% | 1.28% | 0.47% | 0.56% |
| html-mf-hcalendar | 226,279 | 2,747,276 | 21,289,402 | 65,727,393 | 0.11% | 1.09% | 1.18% | 1.38% | 1.27% |
| html-mf-hcard | 12,502,500 | 83,583,167 | 973,170,050 | 3,226,066,019 | 3.26% | 33.19% | 65.41% | 62.91% | 62.12% |
| html-mf-hlisting | 31,871 | 1,227,574 | 25,660,498 | 88,146,122 | 0.05% | 0.49% | 0.17% | 1.66% | 1.70% |
| html-mf-hresume | 10,419 | 387,364 | 1,501,009 | 12,640,527 | 0.02% | 0.15% | 0.05% | 0.10% | 0.24% |
| html-mf-hreview | 216,331 | 2,836,701 | 8,234,850 | 84,411,951 | 0.11% | 1.13% | 1.13% | 0.53% | 1.63% |
| html-mf-species | 3,244 | 25,158 | 152,621 | 391,911 | 0.00% | 0.01% | 0.02% | 0.01% | 0.01% |
| html-mf-hrecipe | 13,362 | 115,345 | 695,838 | 1,228,925 | 0.00% | 0.05% | 0.07% | 0.04% | 0.02% |
| html-mf-xfn | 5,323,335 | 37,526,630 | 481,945,127 | 1,391,091,386 | 1.46% | 14.90% | 27.85% | 31.16% | 26.79% |
| Total | | | 1,519,975,911 | 4,898,536,029 | | | | 98.26% | 94.32% |
| 2012 Extractor | Domains with Triples | URLs with Triples | Typed Entities | Triples | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
|---|---|---|---|---|---|---|---|---|---|
| html-rdfa | 16,976,232 | 67,901,246 | 49,370,729 | 456,169,126 | 4.57% | 22.42% | 25.95% | 4.04% | 13.85% |
| html-microdata | 3,952,674 | 26,929,865 | 90,526,013 | 404,413,915 | 1.81% | 8.89% | 6.04% | 7.40% | 12.28% |
| html-mf-geo | 897,080 | 2,491,933 | 4,787,126 | 11,222,766 | 0.17% | 0.82% | 1.37% | 0.39% | 0.34% |
| html-mf-hcalendar | 629,319 | 1,506,379 | 27,165,545 | 65,547,870 | 0.10% | 0.50% | 0.96% | 2.22% | 1.99% |
| html-mf-hcard | 30,417,192 | 61,360,686 | 865,633,059 | 1,837,847,772 | 4.13% | 20.26% | 46.50% | 70.80% | 55.79% |
| html-mf-hlisting | 69,569 | 197,027 | 8,252,632 | 20,703,189 | 0.01% | 0.07% | 0.11% | 0.68% | 0.63% |
| html-mf-hresume | 9,890 | 20,762 | 92,346 | 432,363 | 0.00% | 0.01% | 0.02% | 0.01% | 0.01% |
| html-mf-hreview | 615,681 | 1,971,870 | 7,809,088 | 50,475,411 | 0.13% | 0.65% | 0.94% | 0.64% | 1.53% |
| html-mf-species | 4,109 | 14,033 | 139,631 | 224,847 | 0.00% | 0.00% | 0.01% | 0.01% | 0.01% |
| html-mf-hrecipe | 127,381 | 422,289 | 5,516,036 | 5,513,030 | 0.03% | 0.14% | 0.19% | 0.45% | 0.17% |
| html-mf-xfn | 11,709,819 | 26,004,925 | 163,271,544 | 441,698,363 | 1.75% | 8.59% | 17.90% | 13.35% | 13.41% |
| Total | | | 1,082,667,007 | 2,433,665,611 | | | | 88.56% | 73.88% |
2009/10 – 2012 Change (% points)

| Extractor | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
|---|---|---|---|---|---|
| html-rdfa | 4.01% | 16.74% | 23.14% | 2.32% | 8.20% |
| html-microdata | 1.81% | 8.87% | 6.02% | 7.38% | 12.25% |
| html-mf-geo | -0.03% | -1.18% | 0.09% | -0.08% | -0.21% |
| html-mf-hcalendar | -0.01% | -0.59% | -0.22% | 0.85% | 0.72% |
| html-mf-hcard | 0.87% | -12.92% | -18.91% | 7.89% | -6.33% |
| html-mf-hlisting | -0.03% | -0.42% | -0.06% | -0.98% | -1.07% |
| html-mf-hresume | -0.01% | -0.15% | -0.04% | -0.09% | -0.23% |
| html-mf-hreview | 0.02% | -0.48% | -0.19% | 0.11% | -0.09% |
| html-mf-species | 0.00% | -0.01% | -0.01% | 0.00% | 0.00% |
| html-mf-hrecipe | 0.02% | 0.09% | 0.12% | 0.41% | 0.14% |
| html-mf-xfn | 0.29% | -6.31% | -9.95% | -17.80% | -13.38% |
| Total | | | | -9.70% | -20.45% |
2009/10 – 2012 Change (%)

| Extractor | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
|---|---|---|---|---|---|
| html-rdfa | 718.95% | 294.55% | 822.40% | 134.99% | 144.98% |
| html-microdata | 81515.61% | 39220.30% | 29290.79% | 32965.42% | 53156.82% |
| html-mf-geo | -14.84% | -58.97% | 7.07% | -17.33% | -38.64% |
| html-mf-hcalendar | -5.34% | -54.39% | -18.73% | 61.45% | 57.22% |
| html-mf-hcard | 26.74% | -38.94% | -28.91% | 12.55% | -10.19% |
| html-mf-hlisting | -72.29% | -86.65% | -36.21% | -59.31% | -62.97% |
| html-mf-hresume | -90.75% | -95.54% | -72.26% | -92.22% | -94.61% |
| html-mf-hreview | 20.01% | -42.18% | -16.83% | 19.99% | -5.73% |
| html-mf-species | -3.70% | -53.61% | -62.99% | 15.76% | -9.55% |
| html-mf-hrecipe | 532.05% | 204.50% | 178.58% | 903.02% | 607.21% |
| html-mf-xfn | 19.63% | -42.36% | -35.72% | -57.13% | -49.94% |
| Total | | | | -9.87% | -21.68% |

Pinterest Exclusion Protocol

Tuesday, February 28th, 2012
<meta name="pinterest" content="nopin"/>

Weirdly vendor-specific and coarse at the same time. Will other sites follow this directive, which could mean something like “don’t repost images referenced in this page”, which does differ a bit from:

<meta name="robots" content="noimageindex"/>

Not to mention actually using the Robots Exclusion Protocol, and perish the thought, POWDER, or even annotating individual images with microdata/formats/RDFa.
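For contrast, per-image annotation in RDFa might look something like the following. This is a sketch, with illustrative URLs; `rel="license"` here is the same term the CC license chooser emits:

```html
<!-- License attached to the specific image resource, rather than a
     page-level, vendor-specific opt-out. URLs are illustrative. -->
<div about="http://example.org/photos/1234.jpg">
  <img src="/photos/1234.jpg" alt="A photo" />
  <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>
</div>
```

A consumer like Pinterest could then make an informed per-image decision instead of honoring (or not) a blanket page directive.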

Then there’s the Spam Pinterest Spam Protocol, I mean “pin this” button. I have not been following web actions/activities/intents development beyond the headlines, but please rid us of the so-called NASCAR effect.

Not entirely orthogonal to these vendor-specific exclusion and beg-for-inclusion protocols, are images released under public licenses — not entirely orthogonal as nopin seems to be aimed at countering copyright threats (supplementing DMCA takedown compliance), which public licenses, at least free ones, waive conditionally; and releasing work under a public license is a more general invitation to include.

As far as I can tell Pinterest relies on no public license, and thus complies with no public license condition (i.e., license notice and attribution). Nor should it, probably, given that its strategy appears to rely on safe harbors and on making it possible for those who want to opt out entirely to do so: public licenses are superfluous. Obviously Pinterest could have taken a very different strategy, relying on public copyright licenses and public domain materials, at a very high cost: pinteresters(?) would need to know about copyright, which is hugely less fun than pinning(?) whatever looks cool.

Each of these (exclusion, inclusion, copyright mitigation strategy) are fine examples of simple-ugly-but-works vs. general-elegant-but-but-but…

I’m generally a fan of general-elegant-but-but-but, but let’s see what’s good and hopeful about reality:

  • “Don’t repost images referenced in this page” is a somewhat understandable preference; let’s assume people get something out of expressing and honoring it. nopin helps people realize some of this something, using a <meta> tag is not ridiculous, and if widely used, maybe provides some in-the-wild examples to inform more sophisticated approaches.
  • I can’t think of anything good about site-specific “share” buttons. But of the three items in this list, I have by far the highest hope for a general-elegant mechanism “winning” in the foreseeable future.
  • Using copyright exceptions and limitations is crucial to maintaining them, and this is wholly good. Now it’d be nice to pass along the availability of a public license, even if one is not relying on it, as a feature for users who may wish to rely on the license for other purposes, though admittedly providing this feature is not cost-free. But I also want to see more projects and services (preferably also free and federated, but putting that issue aside) that do take the strategy of relying on public licenses (which does not preclude also pushing on exceptions and limitations), as such projects have rather different characteristics which I think have longer-term and large benefits for collaboration, policy, and preservation.


8 year Refutation Blog

Saturday, February 4th, 2012

I first posted to this blog exactly 8 years ago, after a few years of dithering over which blog software to use (WordPress was the first that made me not feel like I had to write my own; maybe I was waiting for 1.0, released January 2004).

A little over two years ago I had the idea for a “refutation blog”: after some number of years, a blogger might attempt to refute whatever they wrote previously. In some cases they may believe they were wrong and/or stupid; in all cases, every text and idea is worthy of all-out attack, given enough resources to carry one out, and the passing of time might allow attacks to be carried out a bit more honestly. I have little doubt this has been done before, and analogously for pre-blog forms; I’d love pointers.

The last two Februaries have passed without adequate time to start refuting. In order to get started (I could also write software to manage and render refutations, and figure out what vocabulary to use to annotate them; unlikely, but I might in the fullness of time; I just won’t accept that as an excuse for years more of delay right now) I’m lowering my sights from “all-out attack” to a very brief attack on the substance of a previous post, and will do my best to avoid snarky asides.

I have added a refutation category. I will probably continue non-refutation posts here (and hope to refute those 8 years after posting). I may eventually move my current blogging or something similar to another site.

Back to that first post, See Yous at Etech. “Alpha geeks” indeed. With all the implication, unintended at the time but fully apparent in the name, of status seeking and vaporware over deep technical substance and advancement. The “new CC metadata-enhanced application” introduced there was a search prototype. The enhancement was a net negative. Metadata is costly, and usually crap. Although implemented elsewhere since then, as far as I can tell a license filter added to text-based search has never been very useful. I never use it, except as a curiosity. I do search specific collections, where metadata, including license, is a side effect of other collection processes. Maybe as and if sites automatically add annotations to curated objects, aggregation via search with a license and other filters will become useful.

Web Data Common[s] Crawl Attribution Metadata

Monday, January 23rd, 2012

Via I see Web Data Commons, which has “extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010”. WDC publishes the extracted data as N-Quads (the fourth item denotes the immediate provenance of each subject/predicate/object triple: the URL the triple was extracted from).
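For a feel of the format, a quad and a naive parse might look like the following sketch. The example URLs are made up; the cc: property uses the real Creative Commons namespace:

```python
# A hypothetical WDC-style N-Quad: subject, predicate, object, plus the
# URL of the page the triple was extracted from. Example URLs are made up.
quad = ('<http://example.org/photo> '
        '<http://creativecommons.org/ns#attributionName> '
        '"Jane Doe" '
        '<http://example.org/gallery.html> .')

def parse_quad(line):
    """Naive split for well-formed, space-delimited N-Quads lines."""
    body = line.rstrip().removesuffix('.').rstrip()
    spo, graph = body.rsplit(' ', 1)   # the provenance URL is the last term
    s, p, o = spo.split(' ', 2)        # the object itself may contain spaces
    return s, p, o, graph
```

(A real parser must also handle escapes and literal datatypes; this only illustrates the four-part shape.)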

I thought it would be easy and fun to run some queries on the WDC dataset to get an idea of how annotations associated with Creative Commons licensing are used. Notes below on exactly what I did. The biggest limitation is that the license statement itself is not part of the dataset — not as xhv:license in the RDFa portion, and for some reason rel=license microformat has zero records. But cc:attributionName, cc:attributionURL, and cc:morePermissions are present in the RDFa part, as are some Dublin Core properties that the Creative Commons license chooser asks for (I only looked at dc:source) but are probably widely used in other contexts as well.

| Dataset | URLs | Distinct objects |
|---|---|---|
| Common Crawl 2010 corpus | 5,000,000,000^a | |
| 1% sampled by WDC | ~50,000,000 | |
| with RDFa | 158,184^b | |
| with a cc: property | 26,245^c | |
| cc:attributionName | 24,942^d | 990^e |
| cc:attributionURL | 25,082^f | 3,392^g |
| dc:source | 7,235^h | 574^i |
| cc:morePermissions | 4,791^j | 253^k |
| cc:attributionURL = dc:source | 5,421^l | |
| cc:attributionURL = cc:morePermissions | 1,880^m | |
| cc:attributionURL = subject | 203^n | |

Some quick takeaways:

  • The low ratio of distinct attributionURLs probably indicates HTML from the license chooser deployed without any parameterization. Often the subject or current page will be the most useful attributionURL (but the 203 above would probably be much higher with canonicalization). Note that all of the CC licenses require such a URL to refer to the copyright notice or licensing information for the Work. Unless one has set up a site-wide license notice somewhere, a static URL is probably not the right thing to request in terms of requiring licensees to provide an attribution link; nor is a non-specific attribution link as useful to readers as a direct link to the work in question. As (and if) support for attribution metadata gets built into Creative Commons-aware CMSes, the ratio of distinct attributionURLs ought to increase.
  • 79% of subjects with both dc:source and cc:attributionURL (6,836o) have the same values for both properties. This probably means people are merely entering their URL into every form field requesting a URL without thinking, not self-remixing.
  • 47% of subjects with both cc:morePermissions and cc:attributionURL (3,977p) have the same values for both properties. It is unclear why this ratio is so much lower than the previous one; it ought to be higher, as the same value for both often makes sense. It is unsurprising that cc:morePermissions is the least provided property; in my experience few people understand it.
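On the canonicalization point above: even trivial normalization makes many more subject/attributionURL pairs compare equal. A loose sketch (the particular normalization rules are my own choices, not anything the dataset prescribes):

```python
from urllib.parse import urlsplit

def canonicalize(url):
    """Loosely normalize a URL for comparison: ignore the scheme,
    lowercase the host, drop default ports and any trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    for port in (':80', ':443'):
        if host.endswith(port):
            host = host[:-len(port)]
    return host + parts.path.rstrip('/')

# These differ as raw strings but match once canonicalized:
assert canonicalize('http://Example.org/') == canonicalize('https://example.org')
```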

I did not look at the provenance item at all. It’d be interesting to see what kinds of assertions are being made across authority boundaries (e.g., a page on one site making statements with another site’s URI as the subject) and when to discard such. I barely looked directly at the raw data at all; just enough to feel that my aggregate numbers could possibly be accurate. More could probably be gained by inspecting smaller samples in detail, informing other aggregate queries.

I look forward to future extracts. Thanks indirectly to Common Crawl for providing the crawl!

Please point out any egregious mistakes made below…

# a I don't really know if the October 2010 corpus is the
# entire 5 billion Common Crawl corpus

# download RDFa extract from Web Data Commons
wget -c

# Matches number stated at
wc -l ccrdf.html-rdfa.nq

# Includes easy to use no-server triplestore
apt-get install redland-utils

# sanity check
grep '<>' ccrdf.html-rdfa.nq |wc -l

# Import rejects a number of triples for syntax errors
rdfproc xyz parse ccrdf.html-rdfa.nq nquads

# d Perhaps syntax errors explains fewer triples than above grep might
# indicate, but close enough
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l

# These replicated below with 4store because...
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?o <> ?o }' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?o <> ?o }' |wc -l

# ...this query takes forever, hours, and I have no idea why
rdfproc xyz query sparql - 'select ?s, ?o where { ?s <> ?o ; <> ?o }'

# 4store has a server, but is lightweight
apt-get install 4store

# 4store couldn't import with syntax errors, so export good triples from
# previous store first
rdfproc xyz serialize > ccrdf.export-rdfa.rdf

# import into 4store
curl -T ccrdf.export-rdfa.rdf 'http://localhost:8080/data/wdc'

# egrep is to get rid of headers and status output prefixed by ? or #
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

#Of course please use instead.
#Should be more widely deployed soon.
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <> ?o ; <> ?n }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <> ?o ; <> ?n }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n, ?m where { ?s <> ?o ; <> ?n ; <> ?m }' |egrep -v '^[\?\#]' |wc -l
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <> ?o } UNION { ?s <> ?n } UNION { ?s <> ?m }  }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <> ?o } UNION { ?s <> ?n }}' |egrep -v '^[\?\#]' |wc -l

#b note subjects not the same as pages data extracted from (158,184)
4s-query wdc -s '-1' -f text 'select distinct ?s where { ?s ?p ?o }'  |egrep -v '^[\?\#]' |wc -l

# Probably less than 1047250 claimed due to syntax errors
4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?o }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?s }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?s }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?o }'  |egrep -v '^[\?\#]' |wc -l
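Spelled out with a full property URI (the CC REL namespace is http://creativecommons.org/ns#), the query counting cc:attributionName values takes roughly this form, to be wrapped in one of the 4s-query invocations above:

```sparql
PREFIX cc: <http://creativecommons.org/ns#>
# one row per cc:attributionName statement; count rows with |wc -l as above
SELECT ?o
WHERE { ?s cc:attributionName ?o }
```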

Life in the kind of bleak future of HTML data

Thursday, January 12th, 2012

Evan Prodromou wrote in 2006:

I think that if and the RDFa effort continue moving forward without coordinating their effort, the future looks kind of bleak.

I blogged about this at the time (and forgot and reblogged five months later). I recalled this upon reading a draft HTML Data Guide announced today, and trying to think of a tl;dr summary to at least microblog.

That’s difficult. The guide is intended to help publishers and consumers of HTML data choose among three syntaxes (all mostly focused on annotating data inline with HTML meant for display) and a variety of vocabularies, with heavy dependencies between the two. Since 2006, people working on microformats and RDFa have done much to address the faults of those specifications — microformats-2 allows for generic (rather than per-format) parsing, and RDFa 1.1 made some changes to make namespaces less needed, less ugly when needed, and usable in HTML5, and specifies a lite subset. In 2009 a third syntax/model, microdata, was launched, and then in 2011 chosen as the syntax for (which subsequently announced it would also support RDFa 1.1 Lite).
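For a rough feel of the three syntaxes, here is one trivial fact (a person’s name) marked up in each; the vocabulary choices are illustrative:

```html
<!-- microformats-2: meaning carried by class names -->
<span class="h-card"><span class="p-name">Jane Doe</span></span>

<!-- RDFa 1.1 Lite: vocab/typeof/property attributes -->
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
</div>

<!-- microdata: itemscope/itemtype/itemprop attributes -->
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
</div>
```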

I find the added existence of microdata and suboptimal (optimal might be something like microformats process for some super widely useful vocabularies, with a relatively simple syntax but permitting generic parsing and distributed extensibility; very much like what Prodromou wanted in 2006), but when is anything optimal? I also wonder how much credit microdata ought get for microformats-2 and RDFa 1.1, due to providing competitive pressure? And for invigorating metadata-enhanced web-scale search and vocabulary design (for example, the last related thing I was involved in, at the beginning anyway)?

Hope springs eternal for getting these different but overlapping technologies and communities to play well together. I haven’t followed closely in a long time, but I gather that Jeni Tennison is one of the main people working on that, and you should really subscribe to her blog if you care. That leaves us back at the HTML Data Guide, of which Tennison is the editor.

My not-really-a-summary:

  1. Delay making any decisions about HTML data; you probably don’t want it anyway (metadata is usually a cost center), and things will probably be more clear when you’re forced to check back due to…
  2. If someone wants data from you as annotated HTML, or you need data from someone, and this makes business sense, do whatever the other party has already decided on, or better yet implemented (assuming their decision isn’t nonsensical; but if so why are you doing business with them?)
  3. Use a validator to test your data in whatever format. An earlier wiki version of some of the guide materials includes links to validators. In my book, Any23 is cute.

(Yes, CC REL needs updating to reflect some of these developments, RDFa 1.1 at the least. Some license vocabulary work done by SPDX should also be looked at.)

Penumbra of Provenance

Thursday, January 12th, 2012


Yesterday the W3C’s Provenance Working Group posted a call for feedback on a family of documents members of that group have been working on. Provenance is an important issue for the info commons, as I’ve sketched elsewhere. I hope some people quickly flesh out examples of application of the draft ontology to practical use cases.

Intellectual Provenance

Apart from some degree of necessity for current functioning of some info commons (obviously where some certainty about freedoms from copyright restriction is needed, but conceivably even moreso to outgrow copyright industries), provenance can also play an important symbolic role. Unlike “intellectual property”, intellectual provenance is of keen interest to both readers and writers. Furthermore, copyright and other restrictions make provenance harder, in ways both practical (barriers to curation) and attitudinal — the primacy of “rights” (as in rents, and grab all that your power allows) deprecates the actual intellectual provenance of things.

Postmodern Provenance

The umbra of provenance seems infinite. As we preserve scraps of information (or not), incomparably vast amounts disappear. But why should we only care for what we can record that led to current configurations? Consider independent invention and convergent evolution. Beyond what configurations and events actually led to current configurations: what are the recorded configurations that could have led to the current configuration; what are all of the configurations that could have led to the current configuration; what configurations are most similar (including history, or not) to a configuration in question?


In order to highlight the exposure of provenance information on the internet and provide added impetus for organizations to expose it in a way that can efficiently be found and accessed, I am exploring the possibility of a .prov TLD.

CSS text overlay image, e.g. for attribution and license notice

Sunday, January 8th, 2012

A commenter called me on providing inadequate credit for a map image I used on this blog. I’ve often seen map credits overlaid on the bottom right of maps, so I decided to try that. I couldn’t find an example of using CSS to overlay text on an image that only showed the absolute minimum needed to achieve the effect, and explained why. Below is my attempt.

Example 1

The above may be a good example of when to not use a text overlay (there is already text at the bottom of the image), but the point is to demonstrate the effect, not to look good. I have an image and I want to overlay «Context+Colophon» at the bottom right of the image. Here’s the minimal code:

<div style="position:relative;z-index:0;width:510px">
  <img src=""/>
  <div style="position:absolute;z-index:1;right:0;bottom:0">
    <a href="">Context</a>+<a href="">Colophon</a>
  </div>
</div>

The outer div creates a container which the text overlay will be aligned with. A position is necessary to enable z-index, which specifies how objects will stack. Here position:relative as I want the image and overlay to flow with the rest of the post, z-index:0 as the container is at the bottom of the stack. I specify width:510px as that’s how wide the image is, and without hardcoding the size of the div, the overlay as specified will float off to the right rather than align with the image. There’s nothing special about the img; it inherits from the outer div.

The inner div contains and styles the text I want to overlay. position:absolute as I will specify an absolute offset from the container, right:0;bottom:0, and z-index:1 to place above the image. Finally, I close both divs.

That’s it. I know precious little CSS; please tell me what I got wrong.

Example 2

Above is the image that prompted this post, with added attribution and license notice. Code:

<div xmlns:cc="http://creativecommons.org/ns#" about=""
     style="z-index:0;position:relative;width:560px">
  <a href=";lon=-122.2776&amp;zoom=14&amp;layers=Q">
    <img src=""/></a>
  <div style="position:absolute;z-index:1;right:0;bottom:0">
    <small>
      © <a rel="cc:attributionURL" property="cc:attributionName"
           href=";lon=-122.2776&amp;zoom=14&amp;layers=Q">OpenStreetMap contributors</a>,
        <a rel="license"
           href="http://creativecommons.org/licenses/by-sa/2.0/">BY-SA</a>
    </small>
  </div>
</div>

With respect to achieving the text overlay, there’s nothing in this example that isn’t in the first. Below I explain annotations added that aid (but are not required by) fulfillment of OpenStreetMap/CC-BY-SA attribution and license notice requirements.

The xmlns:cc attribute declares the cc: prefix, and even that may be superfluous, given cc: as a default prefix in RDFa 1.1.

about sets the subject of subsequent annotations.

small isn’t an annotation, but does now seem appropriate for legal notices, and is usually rendered nicely.

rel="cc:attributionURL" says that the value of the href property is the link to use for attributing the subject. property="cc:attributionName" says that the text (“OpenStreetMap contributors”) is the name to use for attributing the subject. rel="license" says the value of its href property is the subject’s license.
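For the record, a distiller should extract triples along these lines from the markup above. A sketch only: the subject and object URLs are hypothetical stand-ins for the real page, map, and license links.

```turtle
@prefix cc:  <http://creativecommons.org/ns#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .

# hypothetical URLs standing in for the real page, map, and license
<http://example.org/post>
    cc:attributionURL  <http://www.openstreetmap.org/> ;
    cc:attributionName "OpenStreetMap contributors" ;
    xhv:license        <http://creativecommons.org/licenses/by-sa/2.0/> .
```

Note that rel="license" comes out as the XHTML vocabulary’s license term, which CC REL treats as synonymous with cc:license.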

If you’re bad and not using HTTPS-Everywhere (referrer not sent due to protocol change; actually I’m bad for not serving this blog over https), clicking on BY-SA above might obtain a snippet of HTML with credits for others to use. Or you can copy and paste the above code into RDFa Distiller or checkrdfa to see that the annotations are as I’ve said.

Addendum: If you’re reading this in a feed reader or aggregator, there’s a good chance inline CSS is stripped — text intended to overlay images will appear under rather than overlaying images. Click through to the post in order to see the overlays work.

Encyclopedia of Original Research

Thursday, December 15th, 2011

As I’m prone to say that some free/libre/open projects ought strive to not merely recapitulate existing production methods and products (so as to sometimes create something much better), I have to support and critique such strivings.

A proposal for the Encyclopedia of Original Research, besides a name that I find most excellent, seems like just such a project. The idea, if I understand correctly, is to leverage Open Access literature and, using both machine- and wiki-techniques, create always-up-to-date reviews of the state of research in any field, broad or narrow. If wildly successful, such a mechanism could nudge the end-product of research from usually instantly stale, inaccessible (multiple ways), unread, untested, singular, and generally useless dead-tree-oriented outputs toward more accessible, exploitable, testable, queryable, consensus outputs. In other words, explode the category of “scientific publication”.

Another name for the project is “Beethoven’s open repository of research” — watch the video.

The project is running a crowdfunding campaign right now. They only have a few hours left and are far from their goal, but I’m pretty sure the platform they’re using does not require projects to meet a threshold in order to obtain pledges, and it looks like a small amount would help them continue to work and apply for other funding (eminently doable in my estimation; if I can help I will). I encourage kicking in some funds if you read this in the next couple hours, and I’ll update this post with other ways to help in the future if you’re reading later, as in all probability you are.

EoOR is considerably more radical than (and probably complementary to and/or ought consume) AcaWiki, a project I’ve written about previously with the more limited aim to create human-readable summaries of academic papers and reviews. It also looks like, if realized, a platform that projects with more specific aims, like OpenCures, could leverage.

Somehow EoOR escaped my attention (or more likely, my memory) until now. It seems the proposal was developed as part of a class on getting your Creative Commons project funded, which I think I can claim credit for getting funded (Jonas Öberg was very convincing; the idea for and execution of the class are his).