Post Semantic Web

8 year Refutation Blog

Saturday, February 4th, 2012

I first posted to this blog exactly 8 years ago, after a few years of dithering over which blog software to use (WordPress was the first that made me not feel like I had to write my own; maybe I was waiting for 1.0, released January 2004).

A little over two years ago I had the idea for a “refutation blog”: after some number of years, a blogger might attempt to refute whatever they wrote previously. In some cases they may believe they were wrong and/or stupid; in all cases, every text and idea is worthy of all-out attack, given enough resources to carry out such, and the passing of time might allow attacks to be carried out a bit more honestly. I have little doubt this has been done before, and analogously for pre-blog forms; I’d love pointers.

The last two Februaries have passed without adequate time to start refuting. In order to get started I’m lowering my sights from “all-out attack” to a very brief attack on the substance of a previous post, and will do my best to avoid snarky asides. (I could also write software to manage and render refutations, and figure out what vocabulary to use to annotate them; that is unlikely, though I might in the fullness of time, but I won’t accept it as an excuse for years more of delay.)

I have added a refutation category. I will probably continue non-refutation posts here (and hope to refute those 8 years after posting). I may eventually move my current blogging or something similar to another site.

Back to that first post, See Yous at Etech. “Alpha geeks” indeed. With all the unintended at the time, but fully apparent in the name, implication of status seeking and vaporware over deep technical substance and advancement. The “new CC metadata-enhanced application” introduced there was a search prototype. The enhancement was a net negative. Metadata is costly, and usually crap. Although implemented elsewhere since then, as far as I can tell a license filter added to text-based search has never been very useful. I never use it, except as a curiosity. I do search specific collections, where metadata, including license, is a side effect of other collection processes. Maybe as and if sites automatically add annotations to curated objects, aggregation via search with a license and other filters will become useful.

Web Data Common[s] Crawl Attribution Metadata

Monday, January 23rd, 2012

Via I see Web Data Commons which has “extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010”. WDC publishes the extracted data as N-Quads (the fourth item denotes the immediate provenance of each subject/predicate/object triple: the URL the triple was extracted from).
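
For readers who have not met N-Quads, here is a minimal sketch of splitting one line into subject, predicate, object and provenance. It is not a conforming parser: the regular expression only covers the common shapes, and the function name and sample line are made up for illustration.

```python
import re

# Deliberately simplified N-Quads pattern (an assumption, not the full grammar):
# <IRI> or _:bnode terms, and double-quoted literals with an optional
# language tag or datatype in the object position.
TERM = r'(<[^>]*>|_:\S+)'
LITERAL = r'("(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+(<[^>]*>)\s+(?:' + TERM + '|' + LITERAL
    + r')\s+(<[^>]*>)\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, provenance) or None if no match."""
    m = QUAD.match(line)
    if m is None:
        return None
    s, p, o_term, o_literal, g = m.groups()
    return (s, p, o_term if o_term is not None else o_literal, g)


# Hypothetical sample line; the fourth item is the page the triple came from.
sample = ('<http://example.org/photo> '
          '<http://creativecommons.org/ns#attributionName> '
          '"Example Author" <http://example.org/page.html> .')
print(parse_quad(sample))
```

Lines that do not match return None, which is roughly the situation the triple store import runs into with syntax errors.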

I thought it would be easy and fun to run some queries on the WDC dataset to get an idea of how annotations associated with Creative Commons licensing are used. Notes below on exactly what I did. The biggest limitation is that the license statement itself is not part of the dataset — not as xhv:license in the RDFa portion, and for some reason rel=license microformat has zero records. But cc:attributionName, cc:attributionURL, and cc:morePermissions are present in the RDFa part, as are some Dublin Core properties that the Creative Commons license chooser asks for (I only looked at dc:source) but are probably widely used in other contexts as well.

Dataset                                   URLs              Distinct objects
Common Crawl 2010 corpus                  5,000,000,000a
1% sampled by WDC                         ~50,000,000
with RDFa                                 158,184b
with a cc: property                       26,245c
cc:attributionName                        24,942d           990e
cc:attributionURL                         25,082f           3,392g
dc:source                                 7,235h            574i
cc:morePermissions                        4,791j            253k
cc:attributionURL = dc:source             5,421l
cc:attributionURL = cc:morePermissions    1,880m
cc:attributionURL = subject               203n

Some quick takeaways:

  • Low ratio of distinct attributionURLs probably indicates HTML from license chooser deployed without any parameterization. Often the subject or current page will be the most useful attributionURL (but 203 above would probably be much higher with canonicalization). Note all of the CC licenses require that such a URL refer to the copyright notice or licensing information for the Work. Unless one has set up a site-wide license notice somewhere, a static URL is probably not the right thing to request in terms of requiring licensees to provide an attribution link; nor is a non-specific attribution link as useful to readers as a direct link to the work in question. As (and if) support for attribution metadata gets built into Creative Commons-aware CMSes, the ratio of distinct attributionURLs ought increase.
  • 79% of subjects with both dc:source and cc:attributionURL (6,836o) have the same values for both properties. This probably means people are merely entering their URL into every form field requesting a URL without thinking, not self-remixing.
  • 47% of subjects with both cc:morePermissions and cc:attributionURL (3,977p) have the same values for both properties. Unclear why this ratio is so much lower than the previous one; it ought be higher, as often the same value for both makes sense. Unsurprising that cc:morePermissions is the least provided property; in my experience few people understand it.
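
The percentages in these takeaways are plain ratios of the counts reported above. A small sketch of the arithmetic, with the figures copied from the table and the list (the dictionary layout and the helper name are mine):

```python
# (URLs, distinct objects) per property, copied from the table above.
counts = {
    "cc:attributionName": (24942, 990),
    "cc:attributionURL": (25082, 3392),
    "dc:source": (7235, 574),
    "cc:morePermissions": (4791, 253),
}


def pct(part, whole):
    """Percentage of part in whole, rounded to the nearest integer."""
    return round(100 * part / whole)


# Ratio of distinct values to occurrences for each property.
for name, (urls, distinct) in counts.items():
    print(f"{name}: {pct(distinct, urls)}% distinct")

# Share of subjects where two properties carry the same value.
print("attributionURL = dc:source:", pct(5421, 6836), "%")
print("attributionURL = morePermissions:", pct(1880, 3977), "%")
```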

I did not look at the provenance item at all. It’d be interesting to see what kind of assertions are being made across authority boundaries (e.g. a page on makes a statement with an URI as the subject) and when to discard such. I barely looked directly at the raw data at all; just enough to feel that my aggregate numbers could possibly be accurate. More could probably be gained by inspecting smaller samples in detail, informing other aggregate queries.
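
One rough way to look for assertions across authority boundaries would be to compare the host of the subject with the host of the provenance item. A sketch under loud assumptions: "authority" is approximated by the hostname, no canonicalization is attempted (so www.example.org and example.org count as different), and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def host(term):
    """Lower-cased hostname of an N-Quads IRI term such as <http://a.example/x>."""
    return (urlsplit(term.strip("<>")).hostname or "").lower()


def crosses_authority(subject, provenance):
    """True when both terms have a host and the hosts differ."""
    s, g = host(subject), host(provenance)
    return bool(s) and bool(g) and s != g


print(crosses_authority("<http://a.example/photo>", "<http://b.example/page>"))
print(crosses_authority("<http://a.example/photo>", "<http://a.example/page>"))
```

Blank-node subjects have no host, so they are never reported as crossing a boundary.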

I look forward to future extracts. Thanks indirectly to Common Crawl for providing the crawl!

Please point out any egregious mistakes made below…

# a I don't really know if the October 2010 corpus is the
# entire 5 billion Common Crawl corpus

# download RDFa extract from Web Data Commons
wget -c

# Matches number stated at
wc -l ccrdf.html-rdfa.nq

# Includes easy to use no-server triplestore
apt-get install redland-utils

# sanity check
grep '<>' ccrdf.html-rdfa.nq |wc -l

# Import rejects a number of triples for syntax errors
rdfproc xyz parse ccrdf.html-rdfa.nq nquads

# d Perhaps syntax errors explains fewer triples than above grep might
# indicate, but close enough
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l

# These replicated below with 4store because...
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?o <> ?o }' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select distinct ?o where { ?s <> ?o}' |wc -l
rdfproc xyz query sparql - 'select ?o where { ?o <> ?o }' |wc -l

# ...this query takes forever, hours, and I have no idea why
rdfproc xyz query sparql - 'select ?s, ?o where { ?s <> ?o ; <> ?o }'

# 4store has a server, but is lightweight
apt-get install 4store

# 4store couldn't import with syntax errors, so export good triples from
# previous store first
rdfproc xyz serialize > ccrdf.export-rdfa.rdf

# import into 4store
curl -T ccrdf.export-rdfa.rdf 'http://localhost:8080/data/wdc'

# egrep is to get rid of headers and status output prefixed by ? or #
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

#Of course please use instead.
#Should be more widely deployed soon.
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o}' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?o where { ?o <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <> ?o ; <> ?n }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <> ?o ; <> ?n }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n, ?m where { ?s <> ?o ; <> ?n ; <> ?m }' |egrep -v '^[\?\#]' |wc -l
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <> ?o ; <> ?o ; <> ?o }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <> ?o } UNION { ?s <> ?n } UNION { ?s <> ?m }  }' |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <> ?o } UNION { ?s <> ?n }}' |egrep -v '^[\?\#]' |wc -l

#b note subjects not the same as pages data extracted from (158,184)
4s-query wdc -s '-1' -f text 'select distinct ?s where { ?s ?p ?o }'  |egrep -v '^[\?\#]' |wc -l

# Probably less than 1047250 claimed due to syntax errors
4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?o }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?s }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?s }'  |egrep -v '^[\?\#]' |wc -l

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?o }'  |egrep -v '^[\?\#]' |wc -l
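
For comparison, the "same value for both properties" counts can be approximated without a triple store by streaming parsed quads through a small function. This is a sketch, not what I ran: the property URIs are my assumption about what the elided predicates expand to, and it counts subjects where the queries above count solutions, so the numbers need not match exactly.

```python
from collections import defaultdict

# Assumed expansions of the cc: properties (not taken from the commands above).
CC = "http://creativecommons.org/ns#"
ATTRIBUTION_URL = "<" + CC + "attributionURL>"
MORE_PERMISSIONS = "<" + CC + "morePermissions>"


def count_shared_values(quads, pred_a, pred_b):
    """Return (subjects with both predicates, subjects sharing a value).

    quads is any iterable of (subject, predicate, object, provenance) tuples.
    """
    values = defaultdict(lambda: (set(), set()))
    for s, p, o, _g in quads:
        if p == pred_a:
            values[s][0].add(o)
        elif p == pred_b:
            values[s][1].add(o)
    both = sum(1 for a, b in values.values() if a and b)
    same = sum(1 for a, b in values.values() if a & b)
    return both, same


# Tiny made-up sample: subject 1 repeats its URL, subject 2 does not.
page = "<http://a.example/page>"
quads = [
    ("<http://a.example/1>", ATTRIBUTION_URL, "<http://a.example/>", page),
    ("<http://a.example/1>", MORE_PERMISSIONS, "<http://a.example/>", page),
    ("<http://a.example/2>", ATTRIBUTION_URL, "<http://a.example/>", page),
    ("<http://a.example/2>", MORE_PERMISSIONS, "<http://a.example/terms>", page),
    ("<http://a.example/3>", ATTRIBUTION_URL, "<http://a.example/>", page),
]
print(count_shared_values(quads, ATTRIBUTION_URL, MORE_PERMISSIONS))
```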

Life in the kind of bleak future of HTML data

Thursday, January 12th, 2012

Evan Prodromou wrote in 2006:

I think that if and the RDFa effort continue moving forward without coordinating their effort, the future looks kind of bleak.

I blogged about this at the time (and forgot and reblogged five months later). I recalled this upon reading a draft HTML Data Guide announced today, and trying to think of a tl;dr summary to at least microblog.

That’s difficult. The guide is intended to help publishers and consumers of HTML data choose among three syntaxes (all mostly focused on annotating data inline with HTML meant for display) and a variety of vocabularies, with heavy dependencies between the two. Since 2006, people working on microformats and RDFa have done much to address the faults of those specifications — microformats-2 allows for generic (rather than per-format) parsing, and RDFa 1.1 made some changes to make namespaces less needed, less ugly when needed, and usable in HTML5, and specifies a lite subset. In 2009 a third syntax/model, microdata, was launched, and then in 2011 chosen as the syntax for (which subsequently announced it would also support RDFa 1.1 Lite).

I find the added existence of microdata and suboptimal (optimal might be something like microformats process for some super widely useful vocabularies, with a relatively simple syntax but permitting generic parsing and distributed extensibility; very much like what Prodromou wanted in 2006), but when is anything optimal? I also wonder how much credit microdata ought get for microformats-2 and RDFa 1.1, due to providing competitive pressure? And for invigorating metadata-enhanced web-scale search and vocabulary design (for example, the last related thing I was involved in, at the beginning anyway)?

Hope springs eternal for getting these different but overlapping technologies and communities to play well together. I haven’t followed closely in a long time, but I gather that Jeni Tennison is one of the main people working on that, and you should really subscribe to her blog if you care. That leaves us back at the HTML Data Guide, of which Tennison is the editor.

My not-really-a-summary:

  1. Delay making any decisions about HTML data; you probably don’t want it anyway (metadata is usually a cost center), and things will probably be more clear when you’re forced to check back due to…
  2. If someone wants data from you as annotated HTML, or you need data from someone, and this makes business sense, do whatever the other party has already decided on, or better yet implemented (assuming their decision isn’t nonsensical; but if so why are you doing business with them?)
  3. Use a validator to test your data in whatever format. An earlier wiki version of some of the guide materials includes links to validators. In my book, Any23 is cute.

(Yes, CC REL needs updating to reflect some of these developments, RDFa 1.1 at the least. Some license vocabulary work done by SPDX should also be looked at.)

Penumbra of Provenance

Thursday, January 12th, 2012


Yesterday the W3C’s Provenance Working Group posted a call for feedback on a family of documents members of that group have been working on. Provenance is an important issue for the info commons, as I’ve sketched elsewhere. I hope some people quickly flesh out examples of application of the draft ontology to practical use cases.

Intellectual Provenance

Apart from some degree of necessity for current functioning of some info commons (obviously where some certainty about freedoms from copyright restriction is needed, but conceivably even moreso to outgrow copyright industries), provenance can also play an important symbolic role. Unlike “intellectual property”, intellectual provenance is of keen interest to both readers and writers. Furthermore, copyright and other restrictions make provenance harder, in both practical (barriers to curation) and attitudinal ways: the primacy of “rights” (as in rents, and grab all that your power allows) deprecates the actual intellectual provenance of things.

Postmodern Provenance

The umbra of provenance seems infinite. As we preserve scratches of information (or not) incomparably vast amounts disappear. But why should we only care for what we can record that led to current configurations? Consider independent invention and convergent evolution. Who cares what configurations and events led to current configurations: what are the recorded configurations that could have led to the current configuration, what are all of the configurations that could have led to the current configuration; what configurations are most similar (including history, or not) to a configuration in question?


In order to highlight the exposure of provenance information on the internet and provide added impetus for organizations to expose in a way that can efficiently be found and accessed, I am exploring the possibility of a .prov TLD.

CSS text overlay image, e.g. for attribution and license notice

Sunday, January 8th, 2012

A commenter called me on providing inadequate credit for a map image I used on this blog. I’ve often seen map credits overlaid on the bottom right of maps, so I decided to try that. I couldn’t find an example of using CSS to overlay text on an image that only showed the absolute minimum needed to achieve the effect, and explained why. Below is my attempt.

Example 1

The above may be a good example of when to not use a text overlay (there is already text at the bottom of the image), but the point is to demonstrate the effect, not to look good. I have an image and I want to overlay «Context+Colophon» at the bottom right of the image. Here’s the minimal code:

<div style="position:relative;z-index:0;width:510px">
  <img src=""/>
  <div style="position:absolute;z-index:1;right:0;bottom:0">
    <a href="">Context</a>+<a href="">Colophon</a>
  </div>
</div>


The outer div creates a container which the text overlay will be aligned with. A position is necessary to enable z-index, which specifies how objects will stack. Here position:relative as I want the image and overlay to flow with the rest of the post, z-index:0 as the container is at the bottom of the stack. I specify width:510px as that’s how wide the image is, and without hardcoding the size of the div, the overlay as specified will float off to the right rather than align with the image. There’s nothing special about the img; it inherits from the outer div.

The inner div contains and styles the text I want to overlay. position:absolute as I will specify an absolute offset from the container, right:0;bottom:0, and z-index:1 to place above the image. Finally, I close both divs.
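
Since the container width has to match the image width, the markup lends itself to being generated rather than typed by hand. A sketch of a hypothetical helper (the function name and parameters are mine) that emits the same structure described above:

```python
from html import escape


def overlay_html(img_src, width_px, overlay_text):
    """Return minimal markup placing overlay_text at the bottom right of an image.

    The outer div is the positioned container, sized to the image; the inner
    div is absolutely positioned inside it and stacked above the image.
    """
    return (
        f'<div style="position:relative;z-index:0;width:{int(width_px)}px">\n'
        f'  <img src="{escape(img_src, quote=True)}"/>\n'
        '  <div style="position:absolute;z-index:1;right:0;bottom:0">\n'
        f'    {escape(overlay_text)}\n'
        '  </div>\n'
        '</div>'
    )


# Hypothetical file name; 510 is the image width used in Example 1.
print(overlay_html("map.png", 510, "Context+Colophon"))
```

The overlay text is escaped, so this version only handles plain text; links in the overlay, as in the example, would need a variant that accepts markup.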

That’s it. I know precious little CSS; please tell me what I got wrong.

Example 2

Above is the image that prompted this post, with added attribution and license notice. Code:

<div style="z-index:0;position:relative;width:560px">
  <a href=";lon=-122.2776&amp;zoom=14&amp;layers=Q">
    <img src=""/></a>
  <div style="position:absolute;z-index:1;right:0;bottom:0;">
      © <a rel="cc:attributionURL" property="cc:attributionName"
           href=";lon=-122.2776&amp;zoom=14&amp;layers=Q">OpenStreetMap contributors</a>,
        <a rel="license" href="">BY-SA</a>
  </div>
</div>


With respect to achieving the text overlay, there’s nothing in this example not in the first. Below I explain annotations added that support (but are not required by) fulfillment of OpenStreetMap/CC-BY-SA attribution and license notice.

xmlns:cc declares the cc: prefix, and even that may be superfluous, given cc: as a default prefix.

about sets the subject of subsequent annotations.

small isn’t an annotation, but does now seem appropriate for legal notices, and is usually rendered nicely.

rel="cc:attributionURL" says that the value of the href property is the link to use for attributing the subject. property="cc:attributionName" says that the text (“OpenStreetMap contributors”) is the name to use for attributing the subject. rel="license" says the value of its href property is the subject’s license.
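
To see roughly what a consumer would pick up from these attributes, here is a sketch that lists rel/href and property/text pairs using Python's standard HTML parser. It is not a conforming RDFa processor (no prefix resolution, no subject tracking), and the class name and example.org URLs are placeholders of mine.

```python
from html.parser import HTMLParser


class AnnotationLister(HTMLParser):
    """Collect (rel, href) and (property, text) pairs from a snippet."""

    def __init__(self):
        super().__init__()
        self.found = []        # pairs in document order
        self._pending = None   # property value waiting for its text content

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        href = a.get("href")
        if href is not None:
            for rel in (a.get("rel") or "").split():
                self.found.append((rel, href))
        if a.get("property"):
            self._pending = a["property"]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.found.append((self._pending, data.strip()))
            self._pending = None


snippet = ('<a rel="cc:attributionURL" property="cc:attributionName" '
           'href="http://example.org/map">OpenStreetMap contributors</a>, '
           '<a rel="license" href="http://example.org/license">BY-SA</a>')
lister = AnnotationLister()
lister.feed(snippet)
lister.close()
for pair in lister.found:
    print(pair)
```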

If you’re bad and not using HTTPS-Everywhere (referrer not sent due to protocol change; actually I’m bad for not serving this blog over https), clicking on BY-SA above might obtain a snippet of HTML with credits for others to use. Or you can copy and paste the above code into RDFa Distiller or checkrdfa to see that the annotations are as I’ve said.

Addendum: If you’re reading this in a feed reader or aggregator, there’s a good chance inline CSS is stripped — text intended to overlay images will appear under rather than overlaying images. Click through to the post in order to see the overlays work.

Encyclopedia of Original Research

Thursday, December 15th, 2011

As I’m prone to say that some free/libre/open projects ought strive to not merely recapitulate existing production methods and products (so as to sometimes create something much better), I have to support and critique such strivings.

A proposal for the Encyclopedia of Original Research, besides a name that I find most excellent, seems like just such a project. The idea, if I understand correctly, is to leverage Open Access literature and, using both machine and wiki techniques, create always-up-to-date reviews of the state of research in any field, broad or narrow. If wildly successful, such a mechanism could nudge the end-product of research from usually instantly stale, inaccessible (multiple ways), unread, untested, singular, and generally useless dead-tree-oriented outputs toward more accessible, exploitable, testable, queryable, consensus outputs. In other words, explode the category of “scientific publication”.

Another name for the project is “Beethoven’s open repository of research” — watch the video.

The project is running a crowdfunding campaign right now. They only have a few hours left and are far from their goal, but I’m pretty sure the platform they’re using does not require projects to meet a threshold in order to obtain pledges, and it looks like a small amount would help them continue to work and apply for other funding (eminently doable in my estimation; if I can help I will). I encourage kicking in some funds if you read this in the next couple hours, and I’ll update this post with other ways to help in the future if you’re reading later, as in all probability you are.

EoOR is considerably more radical than (and probably complementary to and/or ought consume) AcaWiki, a project I’ve written about previously with the more limited aim to create human-readable summaries of academic papers and reviews. It also looks like, if realized, a platform that projects with more specific aims, like OpenCures, could leverage.

Somehow EoOR escaped my attention (or more likely, my memory) until now. It seems the proposal was developed as part of a class on getting your Creative Commons project funded, which I think I can claim credit for getting funded (Jonas Öberg was very convincing; the idea for and execution of the class are his).

Creative Commons hiring CTO

Monday, July 11th, 2011

See my blog post on the CC site for more context.

Also thanks to Nathan Yergler, who held the job for four years. I really miss working with Nathan. His are big shoes to fill, but also his work across operations, applications, standards, and relationships set the foundation for the next CTO to be very successful.

Semantic ref|pingback for re-use notification

Sunday, May 15th, 2011

Going back probably all the way to 2003 (I can’t easily pinpoint, as obvious mail searches turn up lots of hand-wringing about structured data in/for web pages, something which persists to this day) people have suggested using something like trackback to notify that someone has [re]used a work, as encouraged under one of the Creative Commons licenses. Such notification could be helpful, as people often would like to know someone is using their work, and might provide much better coverage than finding out by happenstance or out-of-band (e.g., email) notification and not cost as much as crawling a large portion of the web and performing various medium-specific fuzzy matching algorithms on the web’s contents.

In 2006 (maybe 2005) Victor Stone implemented a re-use notification (and a bit more) protocol he called the Sample Pool API. Several audio remix sites implemented it (including ccMixter, for which Victor developed the API; side note: read his ccMixter memoir!), but it didn’t go beyond that, probably in part because it was tailored to a particular genre of sites, and in part because it wasn’t clear how to do it correctly and generally, get adoption, sort out dependencies (see hand-wringing above), and resource/prioritize.

I’ve had in mind to blog about re-use notification for years (maybe I already have, and forgot), but right now I’m spurred to by skimming Henry Story and Andrei Sambra’s Friending on the Social Web, which is largely about semantic notifications. Like them, I need to understand what the OStatus stack has to say about this. And I need to read their paper closely.

Ignorance thusly stated, I want to proclaim the value of refback. When one follows a link, one’s user agent (browser) often will send with the request for the linked page (or other resource) the referrer (the page with the link one just followed). In some cases, a list of pages linking to one’s resources that might be re-used can be rather valuable if one wants to bother manually looking at referrers for evidence of re-use. For example, Flickr provides a daily report on referrers to one’s photo pages. I look at this report for my account occasionally and have manually populated a set of my re-used photos largely by this method. This is why I recently noted that the (super exciting) MediaGoblin project needs excellent reporting.

Some re-use discovery via refback could be automated. My server (and not just my server, contrary to Friending on the Social Web; could be outsourced via javascript a la Google Analytics and Piwik) could crawl the referrer and look for structured data indicating re-use at the referrer (e.g., my page or a resource on it is subject or object of relevant assertions, e.g., dc:source) and automatically track re-uses discovered thusly.
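
A minimal sketch of the automated part, assuming the referrer page has already been fetched: scan its HTML for elements that point at my resource with a relation suggesting re-use. The set of relation names, the class and the function names are assumptions for illustration, and prefixes are matched as literal strings rather than resolved as a real RDFa processor would.

```python
from html.parser import HTMLParser

# Relation names treated as evidence of re-use (an assumption, easily extended).
REUSE_RELS = {"dc:source", "dct:source", "cc:attributionURL"}


class ReuseFinder(HTMLParser):
    """Record elements whose href is my URL and whose rel suggests re-use."""

    def __init__(self, my_url):
        super().__init__()
        self.my_url = my_url
        self.evidence = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("href") != self.my_url:
            return
        matched = sorted(set((a.get("rel") or "").split()) & REUSE_RELS)
        if matched:
            self.evidence.append((tag, matched))


def referrer_indicates_reuse(referrer_html, my_url):
    """Return a list of (tag, matching rels) found in the referrer's HTML."""
    finder = ReuseFinder(my_url)
    finder.feed(referrer_html)
    finder.close()
    return finder.evidence


page = '<p>Remixed from <a rel="dc:source" href="http://example.org/photo/1">this</a>.</p>'
print(referrer_indicates_reuse(page, "http://example.org/photo/1"))
```

An ordinary link with no rel yields an empty list, which is the case where manual inspection of the referrer is still needed.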

A pingback would tell my server (or service I have delegated to) affirmatively about some re-use. This would be valuable, but requires more from the referring site than merely publishing some structured data. Hopefully re-use pingback could build upon the structured data that would be utilized by re-use refback and web agents generally.
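
For reference, an ordinary pingback is an XML-RPC call to a method named pingback.ping with two arguments, the source URI and the target URI. The sketch below only builds the request body; it sends nothing, the URLs are placeholders, and a re-use pingback as imagined here would need something beyond plain pingback, which says only that the source links to the target.

```python
import xmlrpc.client


def build_pingback_request(source_uri, target_uri):
    """Return the XML-RPC request body for a standard pingback.ping call."""
    return xmlrpc.client.dumps((source_uri, target_uri),
                               methodname="pingback.ping")


body = build_pingback_request("http://remixer.example/new-work",
                              "http://example.org/photo/1")
print(body)
```

The receiving server would then fetch the source and, in the re-use case, could look for the same structured data a refback crawler would use.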

After doing more reading, I think my plan is to file the appropriate feature requests for MediaGoblin, which seems the ideal software to finally progress these ideas. A solution also has obvious utility for oft-mooted [open] data/education/science scenarios.

Collaborative Futures 3

Thursday, January 21st, 2010

Day 3 of the Collaborative Futures book sprint and we’re close to 20,000 words. I added another chapter intended for the “future” section, current draft copied below. It is very much a scattershot survey based on my paying partial attention for several years. There’s nothing remotely new apart from recording a favorite quote from my colleague John Wilbanks that doesn’t seem to have been written down before.

Continuing a tradition, another observation about the sprint group and its discussions: an obsession with attribution. A current draft says attribution is “not only socially acceptable and morally correct, it is also intelligent.” People love talking about this and glomming on all kinds of other issues including participation and identity. I’m counter-obsessed (which Michael Mandiberg pointed out means I’m still obsessed).

Attribution is only interesting to me insofar as it is a side effect (and thus low cost) and adds non-moralistic value. In the ideal case, it is automated, as in the revision histories of wiki articles and version control systems. In the more common case, adding attribution information is a service to the reader — nevermind the author being attributed.

I’m also interested in attribution (and similar) metadata that can easily be copied with a work, making its use closer to automated. Creative Commons provides such metadata if a user choosing a license provides attribution information, and CC license deeds use that metadata to provide copy&pastable attribution HTML, hopefully starting a beneficent cycle.

Admittedly I’ve also said many times that I think attribution, or rather requiring (or merely providing in the case of public domain content) attribution by link specifically, is an undersold term of the Creative Commons licenses — links are the currency of the web, and this is an easy way to say “please use my work and link to me!”

Mushon Zer-Aviv continues his tradition for day 3 of a funny and observant post, but note that he conflates attribution and licensing, perhaps to make a point:

The people in the room have quite strong feelings about concepts of attribution. What is pretty obvious by now is that both those who elevate the importance of proper crediting to the success of collaboration and those who dismiss it all together are both quite equally obsessed about it. The attribution we chose for the book is CC-BY-SA oh and maybe GPL too… Not sure… Actually, I guess I am not the most attribution obsessed guy in the room.

Science 2.0

Science is a prototypical example of collaboration, from closely coupled collaboration within a lab to the very loosely coupled collaboration of the grand scientific enterprise over centuries. However, science has been slow to adopt modern tools and methods for collaboration. Efforts to adopt or translate new tools and methods have been broadly (and loosely) characterized as “Science 2.0” and “Open Science”, very roughly corresponding to “Web 2.0” and “Open Source”.

Open Access (OA) publishing is an effort to remove a major barrier to distributed collaboration in science — the high price of journal articles, effectively limiting access to researchers affiliated with wealthy institutions. Access to Knowledge (A2K) emphasizes the equality and social justice aspects of opening access to the scientific literature.

The OA movement has met with substantial and increasing success recently. The Directory of Open Access Journals lists 4583 journals as of 2010-01-20. The Public Library of Science’s top journals are in the first tier of publications in their fields. Traditional publishers are investing in OA, such as Springer’s acquisition of large OA publisher BioMed Central, or experimenting with OA, for example Nature Precedings.

In the longer term OA may lead to improving the methods of scientific collaboration, eg peer review, and allowing new forms of meta-collaboration. An early example of the former is PLoS ONE, a rethinking of the journal as an electronic publication without a limitation on the number of articles published and with the addition of user rating and commenting. An example of the latter would be machine analysis and indexing of journal articles, potentially allowing all scientific literature to be treated as a database, and therefore queryable — at least all OA literature. These more sophisticated applications of OA often require not just access, but permission to redistribute and manipulate, thus a rapid movement to publication under a Creative Commons license that permits any use with attribution — a practice followed by both PLoS and BioMed Central.

Scientists have also adopted web tools to enhance collaboration within a working group as well as to facilitate distributed collaboration. Wikis and blogs have been purposed as open lab notebooks under the rubric of “Open Notebook Science”. Connotea is a tagging platform (they call it “reference management”) for scientists. These tools help “scale up” and direct the scientific conversation, as explained by Michael Nielsen:

You can think of blogs as a way of scaling up scientific conversation, so that conversations can become widely distributed in both time and space. Instead of just a few people listening as Terry Tao muses aloud in the hall or the seminar room about the Navier-Stokes equations, why not have a few thousand talented people listen in? Why not enable the most insightful to contribute their insights back?

Stepping back, what tools like blogs, open notebooks and their descendants enable is filtered access to new sources of information, and to new conversation. The net result is a restructuring of expert attention. This is important because expert attention is the ultimate scarce resource in scientific research, and the more efficiently it can be allocated, the faster science can progress.

Michael Nielsen, “Doing science online”,

OA and adoption of web tools are only the first steps toward utilizing digital networks for scientific collaboration. Science is increasingly computational and data-intensive: access to a completed journal article may not contribute much to allowing other researchers to build upon one’s work; that requires publication of all code and data used during the research that produced the paper. Publishing the entire “research compendium” under appropriate terms (eg usually public domain for data, a free software license for software, and a liberal Creative Commons license for articles and other content) and in open formats has recently been called “reproducible research”. In computational fields, the publication of such a compendium gives other researchers all of the tools they need to build upon one’s work.

Standards are also very important for enabling scientific collaboration, and not just coarse standards like RSS. The Semantic Web and in particular ontologies have sometimes been ridiculed by consumer web developers, but they are necessary for science. How can one treat the world’s scientific literature as a database if it isn’t possible to identify, for example, a specific chemical or gene, and agree on a name for the chemical or gene in question that different programs can use interoperably? The biological sciences have taken a lead in implementation of semantic technologies, from ontology development and semantic databases to inline web page annotation using RDFa.

Of course not all of science, or even most of science, is digital. Collaboration may require sharing of physical materials. But just as online stores make shopping easier, digital tools can make sharing of scientific materials easier. One example is the development of standardized Materials Transfer Agreements accompanied by web-based applications and metadata, potentially a vast improvement over the current choice between ad hoc sharing and highly bureaucratized distribution channels.

Somewhere between open science and business (both as in for-profit business and business as usual) is “Open Innovation”, which refers to a collection of tools and methods for enabling more collaboration, for example crowdsourcing of research expertise (a company called InnoCentive is a leader here), patent pools, end-user innovation (documented especially by Eric von Hippel in Democratizing Innovation), and wisdom of the crowds methods such as prediction markets.

Reputation is an important question for many forms of collaboration, but particularly in science, where careers are determined primarily by one narrow metric of reputation — publication. If the above phenomena are to reach their full potential, they will have to be aligned with scientific career incentives. This means new reputation systems that take into account, for example, re-use of published data and code, and the impact of granular online contributions, must be developed and adopted.

From the grand scientific enterprise to business enterprise, modern collaboration tools hold great promise for increasing the rate of discovery, which sounds prosaic, but may be our best tool for solving our most vexing problems. John Wilbanks, Vice President for Science at Creative Commons, often makes the point like this: “We don’t have any idea how to solve cancer, so all we can do is increase the rate of discovery so as to increase the probability we’ll make a breakthrough.”

Science 2.0 also holds great promise for allowing the public to access current science, and even in some cases collaborate with professional researchers. The effort to apply modern collaboration tools to science may even increase the rate of discovery of innovations in collaboration!


Wednesday, December 17th, 2008

December 16 marked six years since the release of the first Creative Commons licenses. Most of the celebrations around the world have already taken place or are going on right now, though San Francisco’s is on December 18. (For CC history before 2002-12-16, see video of a panel recorded a few days ago featuring two of CC’s founding board members and first executive director or read the book Viral Spiral, available early next year, though my favorite is this email.)

I’ve worked for CC since April, 2003, though as I say in the header of this blog, I don’t represent any organization here. However, I will use this space to ask for your support of my and others’ work at CC. We’re nearing the end of our fourth annual fall public fundraising campaign and about halfway to our goal of raising US$500,000. We really need your support — past campaigns have closed out with large corporate contributions, though one has to be less optimistic about those given the financial meltdown and widespread cutbacks. Over the longer term we need to steadily decrease reliance on large grants from visionary foundations, which still contribute the majority of our funding.

Sadly I have nothing to satisfy a futarchist donor, but take my sticking around as a small indicator that investing in Creative Commons is a highly leveraged way to create a good future. A few concrete examples follow.

RDFa became a W3C Recommendation on October 14, the culmination of a 4+ year effort to integrate the Semantic Web and the Web that everyone uses. There were several important contributors, but I’m certain that it would have taken much longer (possibly never) or produced a much less useful result without CC’s leadership (our motivation was first to describe CC-licensed works on the web, but we’re also now using RDFa as infrastructure for building decoupled web applications and as part of a strategy to make all scientific research available and queryable as a giant database). For a pop version (barely mentioning any specific technology) of why making the web semantic is significant, watch Kevin Kelly on the next 5,000 days of the web.

Wikipedia seems to be on a path to migrating to using the CC BY-SA license, clearing up a major legal interoperability problem resulting from Wikipedia starting before CC launched, when there was no really appropriate license for the project. The GNU FDL, which is now Wikipedia’s (and most other Wikimedia Foundation Projects’) primary license, and CC BY-SA are both copyleft licenses (altered works must be published under the same copyleft license, except when not restricted by copyright), and incompatible widely used copyleft licenses are kryptonite to the efficacy of copyleft. If this migration happens, it will increase the impact of Wikipedia, Creative Commons, free culture, and the larger movement for free-as-in-freedom on the world and on each other, all for the good. While this has basically been a six year effort on the part of CC, FSF, and the Wikimedia Foundation, there’s a good chance that without CC, a worse (fragmented, at least) copyleft landscape for creative works would result. Perhaps not so coincidentally, I like to point out that since CC launched, there has been negative license proliferation in the creative works space, the opposite of the case in the software world.

Retroactive copyright extension cripples the public domain, but there are relatively unexplored options for increasing the effective size of the public domain — instruments to increase certainty and findability of works in the public domain, to enable works not in the public domain to be effectively as close as possible, and to keep facts in the public domain. CC is pursuing all three projects, worldwide. I don’t think any other organization is placed to tackle all of these thorny problems comprehensively. The public domain is not only tremendously important for culture and science, but the only aesthetically pleasing concept in the realm of intellectual protectionism (because it isn’t) — sorry, copyleft and other public licensing concepts are just necessary hacks. (I already said I’m giving my opinion here, right?)

CC is doing much more, but the above are a few examples where it is fairly easy to see its delta. CC’s Science Commons and ccLearn divisions provide several more.

I would see CC as a wild success if all it ever accomplished was to provide a counterexample to be used by those who fight against efforts to cripple digital technologies in the interest of protecting ice delivery jobs, because such crippling harms science and education (against these massive drivers of human improvement, it’s hard to care about marginal cultural production at all), but I think we’re on the way to accomplishing much more, which is rather amazing.

More abstractly, I think the role of creating “commons” (what CC does and free/open source software are examples) in nudging the future in a good direction (both discouraging bad outcomes and encouraging good ones) is horribly underappreciated. There are a bunch of angles to explore this from, a few of which I’ve sketched.

While CC has some pretty compelling and visible accomplishments, my guess is that most of the direct benefits of its projects (legal, technical, and otherwise) may be thought of in terms of lowering transaction costs. My guess is those benefits are huge, but almost never perceived. So it would be smart and good to engage in a visible transaction — contribute to CC’s annual fundraising campaign.