Creative Commons

Pinterest Exclusion Protocol

Tuesday, February 28th, 2012
<meta name="pinterest" content="nopin"/>

Weirdly vendor-specific and coarse at the same time. Will other sites follow this directive, which could mean something like “don’t repost images referenced in this page”, which does differ a bit from:

<meta name="robots" content="noimageindex"/>

Not to mention actually using the Robots Exclusion Protocol, and perish the thought, POWDER, or even annotating individual images with microdata/formats/RDFa.
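To make the two directives concrete, here is a minimal sketch of how a well-behaved reposting service might check a page before copying its images. The function name `may_repost_images` and the policy of also treating `noimageindex` as a "do not repost" signal are my assumptions for illustration; nothing here describes what Pinterest or any crawler actually does.

```python
from html.parser import HTMLParser


class MetaDirectiveParser(HTMLParser):
    """Collect the content tokens of <meta name="..."> tags, keyed by name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if not name:
            return
        # content may hold several comma-separated tokens, as with robots.
        tokens = {
            token.strip().lower()
            for token in attr.get("content", "").split(",")
            if token.strip()
        }
        self.directives.setdefault(name, set()).update(tokens)


def may_repost_images(html, agent="pinterest"):
    """Hypothetical policy: honor the vendor-specific nopin directive and,
    conservatively, the more general robots noimageindex directive."""
    parser = MetaDirectiveParser()
    parser.feed(html)
    if "nopin" in parser.directives.get(agent, set()):
        return False
    if "noimageindex" in parser.directives.get("robots", set()):
        return False
    return True


page = '<html><head><meta name="pinterest" content="nopin"/></head></html>'
print(may_repost_images(page))
```

Note how coarse both directives are: they apply to the whole page, which is exactly why per-image annotation would be the more elegant approach.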

Then there’s the Spam Pinterest Spam Protocol, I mean “pin this” button. I have not been following web actions/activities/intents development beyond the headlines, but please rid us of the so-called NASCAR effect.

Not entirely orthogonal to these vendor-specific exclusion and beg-for-inclusion protocols, are images released under public licenses — not entirely orthogonal as nopin seems to be aimed at countering copyright threats (supplementing DMCA takedown compliance), which public licenses, at least free ones, waive conditionally; and releasing work under a public license is a more general invitation to include.

As far as I can tell Pinterest relies on no public license, and thus complies with no public license condition (ie license notice and attribution). Nor should it, probably: given that its strategy appears to be relying on safe harbors and making it possible for those who want to make the effort to opt out entirely, public licenses are superfluous. Obviously Pinterest could have taken a very different strategy, and relied on public copyright licenses and public domain materials, at a very high cost: pinteresters(?) would need to know about copyright, which is hugely less fun than pinning(?) whatever looks cool.

Each of these (exclusion, inclusion, copyright mitigation strategy) is a fine example of simple-ugly-but-works vs. general-elegant-but-but-but…

I’m generally a fan of general-elegant-but-but-but, but let’s see what’s good and hopeful about reality:

  • “Don’t repost images referenced in this page” is a somewhat understandable preference; let’s assume people get something out of expressing and honoring it. nopin helps people realize some of this something; using a <meta> tag is not ridiculous; and, if widely used, nopin may provide some in-the-wild examples to inform more sophisticated approaches.
  • I can’t think of anything good about site-specific “share” buttons. But of the three items in this list, I have by far the highest hope for a general-elegant mechanism “winning” in the foreseeable future.
  • Using copyright exceptions and limitations is crucial to maintaining them, and this is wholly good. Now it’d be nice to pass along the availability of a public license, even if one is not relying on such, as a feature for users who may wish to rely on the license for other purposes, but admittedly providing this feature is not cost-free. But I also want to see more projects and services (preferably also free and federated, but putting that issue aside) that do take the strategy of relying on public licenses (which does not preclude also pushing on exceptions and limitations), as such projects have rather different characteristics, which I think have large, longer-term benefits for collaboration, policy, and preservation.

IMG_4346
Pin It

5 years of version 3.0 of Creative Commons licenses

Saturday, February 25th, 2012

img_0263.jpg

Version 3.0 of the main Creative Commons licenses was released 2007-02-23, that is 5 years and 2 days ago, and has remained current much longer than previous versions* (1.0: 17 months; 2.0: 12 months; 2.5: 20 months). Hopefully the eventual version 4.0 licenses will remain current much longer still.

Probably the most important developments that 3.0 contributed to were the adoption of CC-BY-SA as the primary license used by Wikimedia projects and the use of various CC licenses (most often CC-BY) by governments and other policy-making entities. 3.0 was probably not absolutely necessary and certainly only a small part of these developments, but it surely helped, e.g., by addressing some of the Debian community’s concerns (which overlap with the views of many Wikimedians) and through further internationalization.

Congratulations to everyone who worked on 3.0 5-6 years ago, especially CC’s General Counsel at the time, Mia Garlick, CC affiliates, and especially (and with some overlap) people who participated in public discussions!

* A further bit of trivia: depending on how one ‘counts’ (eg, not if one counts each edit on Wikipedias!), version 2.0 licenses probably remain the most used, as those are baked into Flickr. During 4.0’s long tenure, I very much hope to see new tools and forms which make it the most used CC license version, even if Flickr does not change at all and continues to grow.

Permissions are job 0 for public licenses

Saturday, February 25th, 2012

Copyright permission is the only mechanism that almost unambiguously is required to maximize social value realized from sharing and collaboration around intangible goods (given that copyright exists):

  • Some people think the addition of conditions that are in effect non-copyright regulation is also required, but others disagree, and given widespread ignorance about and noncompliance with copyleft regulation, I put such conditions in the class of probably important (is there anyone conducting serious research around this question?) rather than that of unambiguously required. In any case, current copyleft conditions would be nonsensical if not layered on top of permissions.
  • I’ve heard the argument made that no mechanism is needed: culture aided by the net will route around copyright and other restrictions, just ignore them. I can’t find a good example, but some exhortations and the like of copyheart and kopimi are a subset of the genre. But unless one can make the case that the participation of wealthy litigation targets (any significant organization, from IBM to Wikimedia) is a net negative (and that’s only the first hurdle for such an argument to clear), a mechanism for permissions that appears legally sound to the copyright regime seems unambiguously necessary.
  • There are lots of other real and potential restrictions that permission can be, or may possibly be, granted around, but so much progress has been made with only copyright permissions explicitly granted, and how other restrictions will play out is so largely a matter of speculation, that I put other permissions also in the class of probably important rather than unambiguously required.

Each of these merits much more experimentation and critique, but while any progress on the first two will inevitably be controversial, progress on the third ought be celebrated and demanded. (For completeness’ sake, progressive changes in social policy must also be celebrated and demanded, but are out of scope for this post.) I see few excuses for new licenses and dedications to not aggressively grant every permission that might be possible or needed, nor for new projects to use instruments that are not so aggressive (with the gigantic constraints that use of existing works and the non-existence of perfect instruments impose), nor for communities that vet instruments to give a stamp of approval to such instruments; indeed, if politics and path dependencies were not issues, such communities ought to push non-aggressive instruments to some kind of legacy status.

In this context I am happy with the outcome of the submission of CC0 to the Open Source Initiative for approval: due to not only lack of, but explicit exclusion of patent permissions, Creative Commons has withdrawn the submission. Richard Fontana’s and Bruce Perens’ contributions to the thread are instructive.

I still think that CC0 is the best thing Creative Commons has ever done (indeed I think that largely because of the above considerations; I don’t know of an instrument that makes as thorough an attempt to grant permission around all copyright, related, and neighboring restrictions, and patents aren’t in any of those classes), and remain very happy that the Free Software Foundation considers CC0 to be GPL-compatible (I put GPL-incompatibility in a class of avoidable failure separate from but not wholly unlike not granting all permissions that may be possible, unless one is experimenting with some really novel copyleft regulation).

From the OSI submission thread, I also highly recommend Carl Boettiger’s plea for a public domain instrument appropriate for heterogeneous (code/data/other) products. It will (and ought to) take Creative Commons a long time to vet any potential new version of CC0, but fortunately as I’ve pointed out before, there is plenty of room for experimentation with public domain mechanisms, especially around branding (as incompatibility is less of an issue; compare with copyleft (although if one made explicit compatibility a requirement, there is plenty of potentially beneficial exploration to be done there, too)). An example of such that attempts to include a patent waiver is the Ampify Unlicense (background post).

I hope that the CC0/OSI discussion prompts a race to the bottom for public domain instruments, as new ones attempt to carve out every possible permission. This also ought beneficially affect future permissive and copyleft licenses, which also ought grant every permission possible, whatever conditions they layer on top. Note that adding one such permission — around sui generis database restrictions, is probably the most pressing reason for Creative Commons to have started working on version 4.0 of its licenses. I also hope that the discussion leads to increased collaboration and knowledge sharing (at the very least) across domains in which public licenses are used, taking into account Boettiger’s plea and the realities that such licenses are very often used across several domains (a major point of my recent FOSDEM talk, see especially slides 8-11) and that knowledge concerning commons governance is very thin in every domain.

But keep in mind that most of this post concerns very small potential gains relative to merely granting copyright permission (assuming no non-free conditions are present) and even those are quite a niche subject.☻

FOSDEM 2012 and computational diversity

Saturday, February 11th, 2012

I spent day 1 of FOSDEM in the legal devroom and day 2 mostly talking to a small fraction of the attendees I would’ve liked to meet or catch up with. I didn’t experience the thing I find in concept most interesting about FOSDEM: devrooms (basically 1-day tracks in medium sized classrooms) dedicated to things that haven’t been hyped in ~20 years but are probably still doing very interesting things technically and otherwise, eg microkernels and Ada.

Ada has an interesting history that I’d like to hear more about, with the requirement of highly reliable software (I suspect an undervalued characteristic; I have no idea whether Ada has proven successful in this regard, would appreciate pointers) and fast execution (on microbenchmarks anyway), and even an interesting free software story in that history, some of which is mentioned in a FOSDEM pre-interview.

I suppose FOSDEM’s low cost (volunteer run, no registration) and largeness (5000 attendees) allows for such seemingly surprising, retro, and maybe important tracks — awareness of computational diversity is good at least for fun, and for showing that whatever systems people are attached to or are hyping at the moment are not preordained.

I also wanted to mention one lightning talk I managed to see — Mike Sheldon on Libre.fm [update 20120213: video], which I think is one of the most important software projects for free culture — because it facilitates not production or sharing of “content”, but of popularity (I’ve mentioned as “peer production of [free] cultural relevance”). Sheldon (whose voice you can hear on the occasional Libre.fm podcast) stated that GNU FM (the software libre.fm runs) will support sharing of listener tastes across installations, so that a user of libre.fm or a personal instance might tell another instance (say one set up for a local festival) to recommend music that instance knows about based on a significant history. Sounds neat. You can see what libre music I listen to at alpha.libre.fm/user/mlinksva and more usefully get recommendations for yourself.

Addendum: In preemptive defense of this post’s title, of course I realize neither microkernels nor Ada are remotely weird, retro, alternative, etc. and that there are many other not quite mainstream but still relevant and modern systems and paradigms (hmm, free software desktops)…

2012-02-03%2008.26.46
2012-02-04%2002.44.16
2012-02-05%2001.44.49

It started snowing as soon as I arrived in Brussels, and was rather cold.

2012-02-06%2002.44.32

I got on the wrong train to the airport and got to see the Leuven train station. I made it to the airport half an hour before my flight, and arrived at the gate during pre-boarding. Try that in a US airport.

FOSDEM 2012 Legal Issues DevRoom

Thursday, February 9th, 2012

I attended and spoke at the FOSDEM 2012 Legal Issues DevRoom (Update 20120217: slides, blog posts) organized by Tom Marble, Bradley Kuhn, Karen Sandler, and Richard Fontana. I understand the general idea was to gather people for advanced discussions of free/libre/open source software legal and policy issues, bypassing the “what is copyright?” panel that apparently afflicts such conferences (I haven’t noticed, but don’t go to many FLOSS conferences; I bet presenters usually get the answer only superficially correct). I thought the track mostly succeeded (consider this high praise) — presentations did cover contemporary issues that mostly only people following FLOSS policy would have heard of, but I wished for just a bit more that would be news or really provocative to such people. In part I think 30 minute time slots were to blame — long enough for presenters to belabor background points, short enough for no substantive discussion. Given only 30 minutes, I personally probably would have benefited from a 15 minute speaking limit, thus being forced to state only important points, and leaving a little time for participants to tear those apart. Yes, I should have imposed that discipline on myself, but did not think of it until now.

Philippe Laurent gave an overview of cases involving “Open Licences before European Courts”. He did not list one recent “open content” case, Gerlach vs. DVU.

Ambjörn Elder on “The Methods of FOSS Activism” spoke about political activism; a worthy topic, but I hope for more discussion of activism for software freedom, rather than against ever worse policy.

In place of Armijn Hemel’s “Goes into an Executable? Identifying a Binary’s Sources by Tracing Build Processes” (missed flight) Kuhn and Sandler excerpted from a presentation on and took questions regarding nonprofit homes for free software projects. Writing this reminded me to make a donation to Software Freedom Conservancy, of which Kuhn and Sandler are respectively ED and Secretary. Somewhat tangentially, I don’t find the topic boring, but I do find the lack of information, informed-ness (including mine), and tools regarding it boring. I don’t know of any libre documentation on running a nonprofit; I’d love to see a series of FLOSS Manuals on this. OneClickOrgs is a fairly new free software project to handle some aspects of governing a small organization, but I don’t know how useful it is at this point. Related to lack of documentation, some of the Q&A emphasized how little people know of these topics across jurisdictions: never mind rule minutiae, even the existence of relevant “home” organizations.

Dave Neary on “Grey Areas of Software Licensing” questioned whether one could legally do various things, using examples largely drawn from GIMP development. The answer is always maybe. Fortunately developers sometimes take that as yes.

Allison Randal gave an overview of FLOSS history with a focus on legal arrangements in “FLOSSing for Good Legal Hygiene: Stories from the Trenches”.

Michael Meeks on “Risks vs. Benefits on Copyright Assignment” made the case that assignment (and some non-assignment contributor agreements) is harmful to participation, and proprietary re-licensing has not proven a good business, so a corporate sponsored software project ought to either be free (sans assignment and potential for proprietary relicensing) or proprietary, and fully enjoy the benefits of one or the other, rather than neither. He also indicated that permissive licensing can be better than copyleft for a free software project with copyrights held by a corporation, as the former gives all effectively equal rights, while the latter abets proprietary relicensing and ridiculous claims that the corporate sponsor will protect the community. Meeks repeatedly called on the FSF to abandon assignment, as for-profits disingenuously cite FSF’s practice in support of their own (FSF ED John Sullivan responded that they are getting corrections made where FSF practice is inappropriately cited and will work on explaining their practice better). Finally, Meeks requested an “ALGPL” which would require sharing of modified sources used to provide a network service, like the AGPL, but allow modifications that only link to (or do the equivalent with) the ALGPL codebase to not be shared. I don’t know whether he wants GPL or LGPL behavior if such modifications are distributed. I was somewhat chagrined (but understanding; just not enough time, and maybe nobody submitted a decent proposal) that this was the only¹ discussion of network services!

Loïc Dachary on “Can for-profit companies enforce copyleft without becoming corrupt like MySQL AB?” said yes, if they aren’t the sole copyright holders; on projects he is hired to work on, he seeks out additional contributors who will hold copyright independently.

John Sullivan in “Is copyleft being framed?” presented some new data, apparently replicable (based on Debian package metadata), showing that GPL-family licenses are used in the vast majority (did I hear 87%?) of Debian packages. Update 20120217: I did hear 87%, in 2009, and 93% in 2011. Note that some software is available under multiple licenses. Slides.

Richard Fontana on “The (possible) decline of the GPL, and what to do about it” suggested the need to start thinking about GPLv4, but I’m not sure for what issues². It doesn’t matter: if the particulars of licenses can make a big difference, requirements for the next version of important ones should always be a relevant topic, even if there is no expectation of creating another version for many years. Fontana also indicated that perhaps the next (massively adopted, presumably) copyleft might not be created by an existing steward³ (meaning the FSF, or obviously CC in many non-software fields), which I take as an indication that license innovation is possibly more important than compatibility and non-proliferation.

I don’t remember much of panels with Hugo Roy, Giovanni Battista Gallus, Bradley Kuhn, Richard Fontana on application stores and Ciarán O’Riordan, Benjamin Henrion, Deb Nicholson, Karen Sandler on software patents, as I was probably preparing for my talk, but I trust that free software is still important if mode of delivery changes slightly and that software patents ought be abolished.

I spoke on “⊂ (FLOSS legal/policy ∩ CC [4.0])” (slides: odp, pdf, slideshare). Contrary to my apology I didn’t blog much of the talk beforehand. I will get to all of the topics eventually.

Most of the slides from the day should be available soon on the DevRoom’s page. Some audio might be available as well eventually.

Kuhn demonstrated his qualifications for another fallback career: crowd control. Fontana blogged a summary of the devroom. Sandler gave the most important talk on FLOSS policy (but not at FOSDEM). Marble apparently did almost all the organizing. Thanks to all! There will be another legal/policy devroom next year.

Addendum 20120210: Richard Fontana offered these corrections:

¹ “re network services, I mentioned rise as factor in possible GPL decline, coupled with AGPL pwned by dual-license hucksters”

² “main reason for GPLv4 right now is GPLv3 is needlessly complex, limiting popularity of strong copyleft.”

³ “growing concern that anti-license-proliferationism concentrates power in privileged Establishment organizations”

8 year Refutation Blog

Saturday, February 4th, 2012

I first posted to this blog exactly 8 years ago, after a few years of dithering over which blog software to use (WordPress was the first that made me not feel like I had to write my own; maybe I was waiting for 1.0, released January 2004).

A little over two years ago I had the idea for a “refutation blog”: after some number of years, a blogger might attempt to refute whatever they wrote previously. In some cases they may believe they were wrong and/or stupid, in all cases, every text and idea is worthy of all-out attack, given enough resources to carry out such, and passing of time might allow attacks to be carried out a bit more honestly. I have little doubt this has been done before, and analogously for pre-blog forms; I’d love pointers.

The last two Februaries have passed without adequate time to start refuting. In order to get started (I could also write software to manage and render refutations, and figure out what vocabulary to use to annotate them, and, though it is unlikely, I might in the fullness of time; but I won’t accept that as an excuse for years more of delay) I’m lowering my sights from “all-out attack” to a very brief attack on the substance of a previous post, and will do my best to avoid snarky asides.

I have added a refutation category. I will probably continue non-refutation posts here (and hope to refute those 8 years after posting). I may eventually move my current blogging or something similar to another site.

Back to that first post, See Yous at Etech. “Alpha geeks” indeed. With all the unintended at the time, but fully apparent in the name, implication of status seeking and vaporware over deep technical substance and advancement. The “new CC metadata-enhanced application” introduced there was a search prototype. The enhancement was a net negative. Metadata is costly, and usually crap. Although implemented elsewhere since then, as far as I can tell a license filter added to text-based search has never been very useful. I never use it, except as a curiosity. I do search specific collections, where metadata, including license, is a side effect of other collection processes. Maybe as and if sites automatically add annotations to curated objects, aggregation via search with a license and other filters will become useful.

Copyleft regulates

Tuesday, January 31st, 2012

Copyleft as a pro-software-freedom regulatory mechanism, of which more are needed.

Existing copyleft licenses include conditions that would not exist (unless otherwise implemented) if copyright were abolished. In other words, copyleft does not merely neutralize copyright. But I occasionally¹ see claims that copyleft merely neutralizes copyright.

A copyleft license which only neutralized copyright would remove all copyright restrictions on only one condition: that works building upon a copyleft licensed work (usually as “adaptations” or “derivative works”, though other scopes are possible) be released under terms granting the same freedoms. Existing copyleft licenses have additional conditions. Here is a summary of some of those added by the most important (and some not so important) copyleft licenses:

License                         Provide modifiable form²   Limit DRM   Attribution   Notify upstream³
BY-SA                           -                          y           y             -
FDL                             y                          y           y             -
EPL                             y                          -           y             -
EUPL                            y                          -           y             -
GPL (including LGPL and AGPL)   y                          -           y             -
LAL                             -                          -           y             -
MPL (and derivatives)           y                          -           y             -
ODbL                            y                          y           y             -
OFL                             -                          -           y             -
OSL                             y                          -           y             -
OHL                             y                          -           y             y

I’ve read each of the above licenses at some point, but could easily misremember or misunderstand; please correct me.

There’s a lot more variation among them than is captured above, including how each condition is implemented. But my point is just that these coarse conditions would not be present in a purely copyright neutralizing license. To answer two obvious objections: “attribution”⁴ in each license above goes beyond the bare minimum license notice that would be required to satisfy the condition of releasing under sufficient terms, and “limit DRM” refers only to conditions prohibiting DRM or requiring parallel distribution (which all of those requiring modifiable form do in a way, indirectly; I’ve only called out those that explicitly mention DRM), not permissions⁵ granted to circumvent.

I’m not sure there’s a source for the idea that copyleft only neutralizes copyright. Probably it is just an intuitive reading of the term that has been arrived at independently many times. The English Wikipedia article on copyleft doesn’t mention it, and probably more to the point, none of the main FSF articles on copyleft do either. The last includes the following:

Proprietary software developers use copyright to take away the users’ freedom; we use copyright to guarantee their freedom. That’s why we reverse the name, changing “copyright” into “copyleft.”

Copyleft is a way of using of the copyright on the program. It doesn’t mean abandoning the copyright; in fact, doing so would make copyleft impossible. The “left” in “copyleft” is not a reference to the verb “to leave”—only to the direction which is the inverse of “right”.

Copyleft is a general concept, and you can’t use a general concept directly; you can only use a specific implementation of the concept.

This is very clear: the point of copyleft is to promote and protect (“guarantee” is an exaggeration) users’ freedom, and that includes their access to source. The major reason I like to frame copyleft as regulation⁶ is that if access to source is important to software freedom (or otherwise socially valuable), it probably makes sense to look for additional regulatory mechanisms which might (and appreciate ones that do) contribute to promoting and protecting access to source, as well as other aspects of software freedom. Such mechanisms mostly aren’t/wouldn’t be “copyleft” (though at this point, some of them would simply mandate a copyleft license), but the point is not a relationship with copyright, but promoting and protecting software freedom.

If software freedom is important, surely it makes sense to look for additional mechanisms to promote and protect it. As others have said, licenses are difficult to enforce and/or few people are interested in doing it, and copyleft can be made irrelevant through independent non-copyleft implementation, given enough desire and resources (which the largest corporations have), not to mention the vast universe of cases in which there is no free software alternative, copyleft or not. I leave description and speculation about such mechanisms for a future post.


¹ For example, yesterday Rob Myers wrote:

Copyleft is a general neutralization of copyright (rather than a local neutralization, like permissive licences). Nothing more.

Only slightly more ambiguously, late last year Jason Self wrote:

Copyright gives power to restrict what other people can do with their own copies of things. Copyleft is about restoring those rights: It takes this oppressive law, which normally restricts people and takes their rights away, and make those rights inalienable.

Well said…but not exactly. I point these out merely as examples, not to make fun of Myers, who is one of the sharpest libre thinkers there is, or Self, who as far as I can tell is an excellent free software advocate.

² Note it is possible to have copyleft that doesn’t require source. As far as I know, such only exists in licenses not intended for software. But I think source for non-software is very interesting. The other obvious permutations (a copyleft license for software that does not include a source requirement, and a non-copyleft license that does include a source requirement) are curiosities that do not seem to exist at all, probably for the better, although one can imagine questionable use cases (e.g., self-modifying object code and transparency as only objective).

³ As I’ve mentioned previously, requiring upstream notification likely makes the TAPR OHL non-free/open. But I list the license and condition here because it is an interesting regulation.

⁴ One could further object that one ought to consider so-called “economic” and “moral” aspects of copyright separately, and only neutralize the former; attribution is perhaps the best known and least problematic of the latter.

⁵ Although existing copyleft licenses don’t only neutralize restrictions (one that did would be another curiosity; perhaps the License Art Libre/Free Art License currently comes closest), it is important that copyright and other restrictions are adequately neutralized: in particular, modern public software licenses include patent grants, and GPLv3 permits DRM circumvention (made illegal by some copyright-related legislation such as the DMCA), while version 4.0 of CC licenses will probably grant permissions around “sui generis” restrictions on databases. Such neutralization is only counter-regulatory (if one sees copyright as a regulation), not pro-regulatory, as are source and other conditions discussed above.

⁶ Regulation in the broadest sense, including at a minimum typical “government” and “market” regulation, as I’ve said before. By the way, it could be said that those who advocate only permissive licenses are anti-regulatory, and I imagine that if lots of people thought about copyleft as regulation, this claim would be made, but it would be a problematic claim, as permissive licenses don’t do much (or only do so “locally”, as Myers obliquely put it in the quote above) against the background regulation of copyright restrictions.

Someday knowing the ins and outs of copyright will be like knowing the intricate rules of internal passports in Communist East Germany

Thursday, January 26th, 2012

Said Evan Prodromou, who I keep quoting.

I repeat Evan as a reminder and apology. I’ve blogged many times about copyright licenses in the past, and will have a few detailed posts on the subject soon in preparation for a short talk at FOSDEM.

Given current malgovernance of the intellectual commons, public copyright licenses are important for freedom. They’re probably also important trials for post-copyright regulation (meant in the broadest sense, including at least “market” and “government” regulatory mechanisms), eg of ability to inspect and modify complete and corresponding source.

At the same time, the totemic and contentious role copyright licenses (and sometimes assignment or contributor agreements, and sometimes covering related wrongs and patents) play in free/libre/open works, projects, and communities often seems an unfortunate misdirection of energy at best, and probably looks utterly ridiculous to casual observers. I suspect copyright also takes at least some deserved limelight, and perhaps much more, from other aspects of governance, plain old getting things done, and activism around other issues (regarding the first, some good recent writings include those by Simon Phipps and Bradley Kuhn, but the prominence of copyright arrangements therein reinforces my point). But this all amounts to an additional reason it is important to get the details of public copyright licenses right, in particular compatibility between them where it can be achieved, so as to minimize the amount of time and energy projects put into considering and arguing about the options.

Obviously the energy put into public licenses is utterly insignificant against that spent on other copyright/patent/trademark complex activities. But I’m not going to write about that in the near future, so it isn’t part of my apology and rationalization.

Someday I hope that knowing the ins and outs of both Internal Passports of the mind and international passports will be like knowing the rules of internal passports in Communist East Germany (presumably intricate; I did not look for details, but hopefully they exist not many hops from a Wikipedia article on Eastern Bloc emigration and defection).

Web Data Common[s] Crawl Attribution Metadata

Monday, January 23rd, 2012

Via I see Web Data Commons which has “extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010”. WDC publishes the extracted data as N-Quads (the fourth item denotes the immediate provenance of each subject/predicate/object triple: the URL the triple was extracted from).
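To make the fourth item concrete, here is a minimal sketch that prints the provenance term of each statement. The function name nq_graph and the sample statement are hypothetical, not taken from the WDC extract, and the sketch assumes one statement per line ending in " ." with a graph IRI that contains no spaces.

```shell
# Print the fourth (provenance) term of each N-Quads statement on stdin.
# Assumption: one statement per line, terminated by " .", and the graph
# term is an IRI in angle brackets with no embedded whitespace.
nq_graph() {
  awk '$NF == "." && NF >= 5 && $(NF-1) ~ /^<.*>$/ { print $(NF-1) }'
}

# Hypothetical sample statement; the literal object contains a space,
# which is why the graph is taken from the end of the line.
printf '%s\n' '<http://example.com/photo> <http://creativecommons.org/ns#attributionName> "Example Author" <http://example.com/page.html> .' | nq_graph
```

Counting from the end of the line rather than the start avoids miscounting fields when a literal object contains whitespace; a real N-Quads parser is still the safer choice for anything beyond a quick look.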

I thought it would be easy and fun to run some queries on the WDC dataset to get an idea of how annotations associated with Creative Commons licensing are used. Notes below on exactly what I did. The biggest limitation is that the license statement itself is not part of the dataset — not as xhv:license in the RDFa portion, and for some reason rel=license microformat has zero records. But cc:attributionName, cc:attributionURL, and cc:morePermissions are present in the RDFa part, as are some Dublin Core properties that the Creative Commons license chooser asks for (I only looked at dc:source) but are probably widely used in other contexts as well.

Dataset                                   URLs                 Distinct objects
Common Crawl 2010 corpus                  5,000,000,000 [a]
1% sampled by WDC                         ~50,000,000
with RDFa                                 158,184 [b]
with a cc: property                       26,245 [c]
cc:attributionName                        24,942 [d]           990 [e]
cc:attributionURL                         25,082 [f]           3,392 [g]
dc:source                                 7,235 [h]            574 [i]
cc:morePermissions                        4,791 [j]            253 [k]
cc:attributionURL = dc:source             5,421 [l]
cc:attributionURL = cc:morePermissions    1,880 [m]
cc:attributionURL = subject               203 [n]

Some quick takeaways:

  • Low ratio of distinct attributionURLs probably indicates HTML from the license chooser deployed without any parameterization. Often the subject or current page will be the most useful attributionURL (but 203 above would probably be much higher with canonicalization). Note all of the CC licenses require that such a URL refer to the copyright notice or licensing information for the Work. Unless one has set up a site-wide license notice somewhere, a static URL is probably not the right thing to request in terms of requiring licensees to provide an attribution link; nor is a non-specific attribution link as useful to readers as a direct link to the work in question. As (and if) support for attribution metadata gets built into Creative Commons-aware CMSes, the ratio of distinct attributionURLs ought to increase.
  • 79% of subjects with both dc:source and cc:attributionURL (6,836 [o]) have the same values for both properties. This probably means people are merely entering their URL into every form field requesting a URL without thinking, not self-remixing.
  • 47% of subjects with both cc:morePermissions and cc:attributionURL (3,977 [p]) have the same values for both properties. Unclear why this ratio is so much lower than the previous one; it ought to be higher, as the same value for both often makes sense. Unsurprising that cc:morePermissions is the least provided property; in my experience few people understand it.
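The percentages in these takeaways can be rechecked from the counts reported in this post. A minimal sketch follows; the helper name pct is my own, and rounding to the nearest whole percent is an assumption about how the figures were derived.

```shell
# Percentage of the first count relative to the second, rounded to the
# nearest whole number. The helper name is hypothetical.
pct() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.0f\n", 100 * n / d }'
}

pct 5421 6836    # attributionURL = dc:source, over subjects with both
pct 1880 3977    # attributionURL = morePermissions, over subjects with both
pct 3392 25082   # distinct attributionURL objects over all attributionURL statements
```

The first two calls correspond to the 79% and 47% figures; the third is one way to put a number on the "low ratio of distinct attributionURLs".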

I did not look at the provenance item at all. It’d be interesting to see what kind of assertions are being made across authority boundaries (e.g. a page on example.com makes a statement with an example.net URI as the subject) and when to discard such. I barely looked directly at the raw data at all; just enough to feel that my aggregate numbers could possibly be accurate. More could probably be gained by inspecting smaller samples in detail, informing other aggregate queries.
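I did not run anything like this, but a first pass at the cross-authority question might look like the sketch below. It counts statements whose subject host differs from the host of the page the statement was extracted from. The function name cross_authority_count is hypothetical, and the sketch assumes well-formed lines with http(s) subjects and graph IRIs that contain no spaces. Hosts are compared as plain lowercase strings, so www.example.com and example.com count as different authorities until some canonicalization is added.

```shell
# Count N-Quads statements on stdin whose subject host differs from the
# host of the provenance (fourth) term. Blank-node and non-http subjects
# are skipped. Assumption: one statement per line, terminated by " .".
cross_authority_count() {
  awk '
    function host(term,   h) {
      h = term
      sub("^<[a-zA-Z][a-zA-Z0-9+.-]*://", "", h)  # drop the leading bracket and scheme
      sub("[/?#>].*$", "", h)                     # drop path, query, fragment, bracket
      return tolower(h)
    }
    $NF == "." && NF >= 5 && $1 ~ "^<https?://" && $(NF-1) ~ "^<https?://" {
      if (host($1) != host($(NF-1))) n++
    }
    END { print n + 0 }
  '
}

# Possible use against the downloaded extract:
# cross_authority_count < ccrdf.html-rdfa.nq
```

Deciding which of the counted statements to discard is a separate question that this sketch does not attempt to answer.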

I look forward to future extracts. Thanks indirectly to Common Crawl for providing the crawl!

Please point out any egregious mistakes made below…

# a I don't really know if the October 2010 corpus is the
# entire 5 billion Common Crawl corpus

# download RDFa extract from Web Data Commons
wget -c https://s3.amazonaws.com/ccrdf1p/data/ccrdf.html-rdfa.nq

# Matches number stated at
# http://page.mi.fu-berlin.de/muehleis/ccrdf/stats1p.html#html-rdfa
wc -l ccrdf.html-rdfa.nq
1047250

# Includes easy to use no-server triplestore
apt-get install redland-utils

# sanity check
grep '<http://creativecommons.org/ns#attributionName>' ccrdf.html-rdfa.nq |wc -l
26404 

# Import rejects a number of triples for syntax errors
rdfproc xyz parse ccrdf.html-rdfa.nq nquads

# d Perhaps syntax errors explain fewer triples than above grep might
# indicate, but close enough
rdfproc xyz query sparql - 'select ?o where { ?s <http://creativecommons.org/ns#attributionName> ?o}' |wc -l
24942

# These replicated below with 4store because...
rdfproc xyz query sparql - 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionName> ?o}' |wc -l
990
rdfproc xyz query sparql - 'select ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o}' |wc -l
25082
rdfproc xyz query sparql - 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o}' |wc -l
3392
rdfproc xyz query sparql - 'select ?o where { ?o <http://creativecommons.org/ns#attributionURL> ?o }' |wc -l
203
rdfproc xyz query sparql - 'select ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o}' |wc -l
4791
rdfproc xyz query sparql - 'select distinct ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o}' |wc -l
253
rdfproc xyz query sparql - 'select ?o where { ?o <http://creativecommons.org/ns#morePermissions> ?o }' |wc -l
12

# ...this query takes forever, hours, and I have no idea why
rdfproc xyz query sparql - 'select ?s, ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o ; <http://creativecommons.org/ns#attributionURL> ?o }'

# 4store has a server, but is lightweight
apt-get install 4store

# 4store couldn't import with syntax errors, so export good triples from
# previous store first
rdfproc xyz serialize > ccrdf.export-rdfa.rdf

# import into 4store
curl -T ccrdf.export-rdfa.rdf 'http://localhost:8080/data/wdc'

# egrep is to get rid of headers and status output prefixed by ? or #
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionName> ?o}' |egrep -v '^[\?\#]' |wc -l
24942

#f
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o}' |egrep -v '^[\?\#]' |wc -l
25082

#j
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o}' |egrep -v '^[\?\#]' |wc -l
4791

#h
#Of course please use http://purl.org/dc/terms/source instead.
#Should be more widely deployed soon.
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://purl.org/dc/elements/1.1/source> ?o}' |egrep -v '^[\?\#]' |wc -l
7235

4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://purl.org/dc/terms/source> ?o}' |egrep -v '^[\?\#]' |wc -l
4

#e
4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionName> ?o}' |egrep -v '^[\?\#]' |wc -l
990

#g
4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o}' |egrep -v '^[\?\#]' |wc -l
3392

#k
4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o}' |egrep -v '^[\?\#]' |wc -l
253

#i
4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://purl.org/dc/elements/1.1/source> ?o}' |egrep -v '^[\?\#]' |wc -l
574

#n
4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://creativecommons.org/ns#attributionURL> ?o}' |egrep -v '^[\?\#]' |wc -l
203

4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://creativecommons.org/ns#morePermissions> ?o}' |egrep -v '^[\?\#]' |wc -l
12

4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://purl.org/dc/elements/1.1/source> ?o}' |egrep -v '^[\?\#]' |wc -l
120

#m
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://creativecommons.org/ns#morePermissions> ?o }' |egrep -v '^[\?\#]' |wc -l
1880

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://creativecommons.org/ns#morePermissions> ?o }' |egrep -v '^[\?\#]' |wc -l
122

4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://creativecommons.org/ns#attributionURL> ?o ; <http://creativecommons.org/ns#morePermissions> ?o }' |egrep -v '^[\?\#]' |wc -l
8

#l
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?o }' |egrep -v '^[\?\#]' |wc -l
5421

4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?o }' |egrep -v '^[\?\#]' |wc -l
358

4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?o }' |egrep -v '^[\?\#]' |wc -l
11

#p
4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://creativecommons.org/ns#morePermissions> ?n }' |egrep -v '^[\?\#]' |wc -l
3977

#o
4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?n }' |egrep -v '^[\?\#]' |wc -l
6836

4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n, ?m where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?n ; <http://creativecommons.org/ns#morePermissions> ?m }' |egrep -v '^[\?\#]' |wc -l
2946
4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?o ; <http://creativecommons.org/ns#morePermissions> ?o }' |egrep -v '^[\?\#]' |wc -l
1604

#c
4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <http://creativecommons.org/ns#attributionURL> ?o } UNION { ?s <http://creativecommons.org/ns#attributionName> ?n } UNION { ?s <http://creativecommons.org/ns#morePermissions> ?m }  }' |egrep -v '^[\?\#]' |wc -l
26245

4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <http://creativecommons.org/ns#attributionURL> ?o } UNION { ?s <http://creativecommons.org/ns#attributionName> ?n }}' |egrep -v '^[\?\#]' |wc -l
25433

#b note subjects not the same as pages data extracted from (158,184)
4s-query wdc -s '-1' -f text 'select distinct ?s where { ?s ?p ?o }'  |egrep -v '^[\?\#]' |wc -l
264307

# Probably less than 1047250 claimed due to syntax errors
4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?o }'  |egrep -v '^[\?\#]' |wc -l
968786

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?s }'  |egrep -v '^[\?\#]' |wc -l
2415

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?s }'  |egrep -v '^[\?\#]' |wc -l
0

4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?o }'  |egrep -v '^[\?\#]' |wc -l
0

SOPA/PIPA protests on-message or artless?

Wednesday, January 18th, 2012

Go Internet! Instantly message the U.S. Congress! (Tell them to kill the so-called Research Works Act too!)

Another, much bigger, tiresome rearguard action. I’m impressed by protesters’ nearly universal and exclusive focus on encouraging readers to contact U.S. Congresspeople. I hope it works. SOPA and PIPA really, really deserve to die.

But the protest also bums me out.

1) Self-censorship (in the case of sites completely blacked out, as opposed to those prominently displaying anti-SOPA messages) is not the Internet at its best. If that claim weren’t totally ridiculous, the net wouldn’t be worth defending. It isn’t even the net at its political best — that would be creating systems which disrupt and obviate power — long term offensives, not short-term defenses.

2) Near exclusive focus on supplication before 535 [Update: 536] ultra-powerful individuals is kinda disgusting. But it needs to be done, as effectively as possible.

3) I haven’t looked at a huge number of sites, but I haven’t seen much creativity in the protest. Next time it would be fun to see an appropriate site (Wikipedia? Internet Archive?) take what Flickr has done and add bidding for the “right” to darken particular articles or media as a fundraiser. Art would be nice too — I’d love to hear about anything really great (and preferably libre) from this round.

4) While some prominent bloggers have made the point that “piracy” is not a legitimate problem, overwhelmingly the protest has stuck to defense — SOPA and PIPA would do bad things to the net, and wouldn’t “work” anyway. Google goes much further, saying “End Piracy, Not Liberty” and “Fighting online piracy is important.” Not possible, wrong, and gives away the farm.

5) Nobody making the point that everyone can help with long-term offensives which will ultimately stop ratcheting protectionism, if it is to be stopped. Well, this nobody has attempted:

[I]magine a world in which most software and culture are free as in freedom. Software, culture, and innovation would be abundant, there would be plenty of money in it (just not based on threat of censorship), and there would be no constituency for attacking the Internet. (Well, apart from dictatorships and militarized law enforcement of supposed democracies; that’s a fight intertwined with SOPA, but those aren’t the primary constituencies for the bill.) Now, world domination/liberation by free software and culture isn’t feasible now. But every little bit helps reduce the constituency that wishes to attack the Internet to possibly protect their censorship-based revenue streams, and to increase the constituency whose desire to protect the Internet is perfectly aligned with their business interests and personal expression.

I’d hope that at least some messages tested convey not only the threat SOPA poses to Wikimedia, but the long-term threat the Wikimedia movement poses to censorship.

Also:

Bad legislation needs to be stopped now, but over the long term, we won’t stop getting new bad legislation until policymakers see broad support and amazing results from culture and other forms of knowledge that work with the Internet, rather than against it. Each work or project released under a CC license signals such support, and is an input for such results.

And:

Finally, remember that CC is crucial to keeping the Internet non-broken in the long term. The more free culture is, the less culture has an allergy to and deathwish for the Internet.

Of the five items I list above, the first three are admittedly peevish. Four and five represent not so much problems with the current protest as they do severe deficiencies in movements for intellectual freedom. Actually they are flipsides of the same deficiency: lack of compelling explanation that intellectual freedom, however constructed and protected, really matters, really works, and is really for the good. If such were well enough researched and explained so as to become conventional wisdom, rather than contentious and seemingly radical, net freedom activists could act much more proactively, provocatively, and powerfully, rather than as they do today: with supplication and genuflection.

I am not at all well read, but my weak understanding is that the withdrawal of economists from studying intellectual protectionism in the late 1800s was a great tragedy. To begin to encourage rectification of that century plus of relative neglect, today is a good day to start reading Against Intellectual Monopoly.

In the meantime, the actual and optimal counterfactual drift further apart, without any help from SOPA and PIPA.