Post Semantic Web

AcaWiki non-summary

Sunday, October 25th, 2015

Six years ago I helped launch AcaWiki, a site based on Semantic MediaWiki (software for which I had very high expectations, mostly transferred to Wikidata) for summarizing academic research.

A substantial community failed to materialize. I’ve probably been the only semi-consistent contributor over its entire six years. The best contributions have come from Jodi Schneider, who summarized a bunch of papers related to her research on the semantic web and online discourse, Benjamin Mako Hill, who summarized his PhD qualification exam readings, and Nate Matias who did the same and added a bunch of summaries related to online harassment. Students of an archaeology course taught by Ben Marwick summarized many papers as part of the class. Thank you Jodi, Mako, Nate, Ben, and a bunch of people who have each contributed one or a few summaries.

I’m not going to try to enumerate the deficiencies of AcaWiki here. They boil down to lack of time dedicated to outreach and to improving the site, and zero effort to raise funds to support such work, following a small startup grant obtained by AcaWiki’s founder Neeru Paharia, who has since been busy earning a doctorate and becoming a professor. With Neeru, I’ve been the organization’s other long-term director, so I bear responsibility for this lack of effort. In retrospect, dedicating more time to AcaWiki these last years, at a cost to non-collaborative activity (e.g., this blog), would have been wise. I haven’t moved to take the other obvious course of shutting down the site, because I still believe something like it is badly needed, not least by me, as I wrote in 2009:

This could be seen as an end-run around access and copyright restrictions (the Open Access movement has made tremendous progress though there is still much to be done), but AcaWiki is a very partial solution to that problem — sometimes an article summary (assuming AcaWiki has one) would be enough, though often a researcher would still need access to the full paper (and the full dataset, but that’s another battle).

More interesting to me is the potential for AcaWiki summaries to increase the impact of research by making it more accessible in another way — comprehensible to non-specialists and approachable by non-speedreaders. I read a fair number of academic papers and many more get left on my reading queue unread. A “human readable” distillation of the key points of articles (abstracts typically convey next to nothing or are filled with jargon) would really let me ingest more.

This has held true even given AcaWiki’s tiny size to date: I regularly look back at summaries I’ve written to remember what I’ve read, and wish I summarized much more of what I’ve read, because most of it I’ve almost totally forgotten! I recommend summarizing papers even though it is hard.

Much harder still and more valuable are literature reviews. These were envisioned to be a part of AcaWiki, but I now think that every Wikipedia article should effectively be a literature review (and more). A year ago I blogged about an example of Wikipedia article as literature review led by James Heilman. Earlier this year Heilman wrote a call to action around a genre of literature review, Open Access to a High-Quality, Impartial, Point-of-Care Medical Summary Would Save Lives: Why Does It Not Exist? (which of course I summarized on AcaWiki). I have a partially written commentary on this piece but for now I can only urge you to read Heilman, or start with and improve my summary.

This brings me to one of my excuses for not dedicating more time to AcaWiki: hope that it would be superseded by a project directly under the Wikimedia umbrella, benefiting from that organization’s and movement’s scale. But, I’ve done almost nothing to make this happen, either. I imagine the current effort that could lead in that direction is WikiProject Open Signalling OA-ness, as I’ve noted at the top of a page on AcaWiki listing similar projects. By far the best project on the list is Journalist’s Resource, also launched in 2009, with vastly greater resources. The projects listed so far as “similar” must be only the tip of an iceberg of efforts to summarize academic research, for it’s widely recognized (yes, citation needed; I just created a placeholder on AcaWiki for gathering these) that summarization in various forms is valuable and much more is needed.

If this hasn’t been enough of a ramble already, I’ll close with miscellaneous notes about and unsorted to-dos for AcaWiki:

  • Very brief summaries, perhaps 140 characters or not much longer, would be useful complements to longer summaries. It would be easy to add a short summary field to AcaWiki.
  • For summaries of articles which are themselves freely licensed, it might be useful to include the author’s abstract in AcaWiki. Again, it would be easy to add a field.
  • There’s lots of research on automated summarization, some of it producing open source tools. These could be applied to initialize summaries, either as starting points for human summaries or for en masse bot summary creation (see the summarization sketch after this list).
  • I have added a field for an article’s Wikidata identifier. AcaWiki is one of a handful of sites potentially using Wikidata for authority control. There will be many more. But it’d be far more useful to do something with that identifier, most obviously to ingest article metadata from Wikidata (see the metadata-fetching sketch after this list) and create Wikidata items/push metadata to Wikidata where items corresponding to summarized articles do not exist. I’ve not yet seriously looked into how much of this can be currently accomplished using Wikibase Client.
  • Last month there was debate about a program giving some Wikipedia contributors gratis access to closed academic journals. Does this program help improve Wikipedia as a free resource, or promote non-free literature? It must do some of both; which is the bigger impact on long-term free knowledge outcomes probably depends on one’s perspective. My bias is that improving and promoting free resources is vastly more important than suppressing non-free ones. But I also think that free academic summaries could help in both respects. For Wikipedia readers, a reference with an immediately available summary would be more useful than one without. The summary would also reduce the need to access the original non-free article. AcaWiki in its current state is inadequate, but perhaps the debate ought motivate more work on free academic summaries, here or elsewhere.
  • Has any closed access publisher freed only article abstracts (including a free license; abstracts are almost always gratis access)? This would be useful to a site like AcaWiki at the least, especially if abstracts were more consistently useful.
  • Should the scope of AcaWiki be explicitly expanded to include summarizing material that is somehow academic but is not in the form of a peer-reviewed paper published in an academic journal? Some of the summaries I’ve contributed are for books or grey literature.
  • Periodically it’s been suggested to change the default license for AcaWiki summaries from CC-BY to CC-BY-SA. I should add updated thoughts at the link.
  • Some time ago in order to put a stop to the creation of spam accounts, I enabled the ConfirmAccount extension, which forces users who want to contribute to fill out an account request form. I admit this is hugely annoying. I have done zero research into it, but I would love to have an extension which auto-enables account creation based on some external authentication and reputation, e.g., Wikimedia wiki accounts or even users followed/subscribed to/endorsed by existing AcaWiki users on other sites, e.g., social networks.
  • Upgrade site to https when Let’s Encrypt becomes generally available. Alternatively, see if it is possible to move hosting (currently a $10/month Digital Ocean VPS) to Miraheze, which mandates https.
  • I intended to write an update on AcaWiki for Open Access Week (October 19-25). I only realized after beginning that AcaWiki was recently 6 years old.
  • I’m going to ping the people who have contributed to AcaWiki so far to look at this post and provide feedback. What would it take for them to feel good about recommending others do what they’ve done, e.g., summarizing PhD or research program readings, or assigning their classes to contribute or improve AcaWiki summaries? Or if something else entirely should be done to push forward free summarization of academic literature, what is that something?
  • For some time Fabricatorz did a bit of work on and hosted AcaWiki. From my email correspondence I see that Bassel Khartabil did some of that. As I’ve blogged before (1, 2, 3), Bassel has been detained by the Syrian government since 2012. Recently he has gone missing and presumably is in grave danger. Props to his Fabricatorz and many other friends who have done more to raise awareness of Bassel’s plight than I would have imagined possible when writing those previous posts. See freebassel.org for info and links, and spread the word. I’ll add a note about #freebassel to the AcaWiki home page (which badly needs a general revamp) shortly.
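
To illustrate the bullet about automated summarization above: the following is a minimal sketch of my own (not an existing AcaWiki feature, nor any particular open source tool), a crude frequency-based extractive summarizer of the kind that could seed a draft for a human to rewrite. Function and variable names are made up for illustration.

    import re
    from collections import Counter

    def draft_summary(text, max_sentences=3):
        """Pick the highest-scoring sentences, kept in their original order."""
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        freq = Counter(re.findall(r'[a-z]+', text.lower()))

        def score(sentence):
            tokens = re.findall(r'[a-z]+', sentence.lower())
            # Average corpus frequency of the sentence's words.
            return sum(freq[t] for t in tokens) / (len(tokens) or 1)

        top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
        return ' '.join(s for s in sentences if s in top)

Dedicated tools (and a human pass afterward) would of course do far better; the point is only that bootstrapping stub summaries by bot is cheap.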
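
And to illustrate the bullet about doing something with the Wikidata identifier field: a minimal sketch, again my own and assuming nothing about AcaWiki’s internals, of fetching basic article metadata from the Wikidata API given an item ID (property IDs: P356 is DOI, P577 is publication date).

    import requests

    def fetch_article_metadata(qid, lang="en"):
        """Return label, DOI, and publication date for a Wikidata item, if present."""
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbgetentities",
                "ids": qid,
                "props": "labels|claims",
                "languages": lang,
                "format": "json",
            },
            timeout=10,
        )
        entity = resp.json()["entities"][qid]
        claims = entity.get("claims", {})

        def first_value(prop):
            snak = claims.get(prop, [{}])[0].get("mainsnak", {})
            return snak.get("datavalue", {}).get("value")

        return {
            "title": entity.get("labels", {}).get(lang, {}).get("value"),
            "doi": first_value("P356"),
            "published": (first_value("P577") or {}).get("time"),
        }

Something like this could prefill an AcaWiki summary form; pushing metadata the other way (creating items) would go through the same API’s edit actions and is more involved.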

If any of this interests you, get in touch or merely watch for updates on the acawiki-general mailing list, AcaWiki on pump.io, Twitter, or Facebook, or blog comments below, or the AcaWiki site.

wikidata4research4all

Sunday, December 21st, 2014

Recently I’ve uncritically cheered for Wikidata as “rapidly fulfilling” hopes to “turn the universal encyclopedia into the universal database while simultaneously improving the quality of the encyclopedia.” In April I uncritically cheered for Daniel Mietchen’s open proposal for research on opening research proposals.

Let’s combine the two: an open proposal for work toward establishing Wikidata (including its community, data, ontologies, practices, software, and external tools) as a “collaborative hub around research data” responding to a European Commission call on e-infrastructures. That would be Wikidata for Research (WD4R), instigated by Mietchen, who has already assembled an impressive set of partner institutions and an outline of work packages. The proposal is being drafted in public (you can help) and will be submitted January 14.

4all

The proposal will be strong on its own merits, and very well aligned with the stated desired outcomes from the EC call, and the open proposal dogfood angle is also great. I added “for all” to this post’s title because I suspect WD4R will be great for pushing Wikidata toward realizing the aforementioned “universal database” hopes (which again means not just the data, but community, tools, etc.; “virtual research environment” is one catch-all term) and will make Wikidata much more useful for “research” most broadly construed (e.g., by students, journalists, knowledge workers, anyone), potentially much faster than would happen otherwise.

My suspicion has two bases (please correct me if I’m wrong about either):

  1. A database or virtual environment “for research” might give the impression of someplace to dump data from or perform experiments. Maybe that would be appropriate for Wikidata in some instance, but the overwhelming research-supporting use would seem to be mass collaboration in consolidating, annotating, and correcting data and ontologies which many researchers (and researchers-broadly-construed, everyone) can benefit from, either querying or referencing directly, or extracting and using elsewhere. The pre-existing Gene Wiki project which is beginning to use Wikidata is an example of such useful-to-all collections (as referenced in the WD4R pages).
  2. One of the proposed work packages is to identify and work on features needed for research but not on, or not prioritized on, the Wikidata development plan. I suspect other Wikimedia projects can tremendously benefit from Wikidata integration without Wikidata itself or external tools supporting complex queries and reporting that would be called for by a virtual research environment — and also called for to realize “universal database” hopes. Wikidata’s existing plan looks good to me; here I’m just saying WD4R might help it be even better, faster.

The previously linked Gene Wiki post includes:

For more than a decade many different groups have proposed and many have implemented solutions to this challenge using standards and techniques from the Semantic Web. Yet, today, the vast majority of biological data is still accessed from individual databases such as Entrez Gene that make no attempt to use any component of the Semantic Web or to otherwise participate in the Linked Open Data movement. With a few notable exceptions, the data silos have only gotten larger and problems of fragmentation worse.
[…]
Now, we are working to see if Wikidata can be the bridge between the open community-driven power of Wikipedia and the structured world of semantic data integration. Can the presence of that edit button on a centralized knowledge base associated with Wikipedia help the semantic web break through into everyday use within our community?

I agree that massive centralized commons-oriented resources are needed for decentralization to progress (link analogous but not the same — linked open data : federation :: data silos : messaging silos).

Check out Mietchen’s latest WD4R blog post and the WD4R project page.

Wikidata II

Thursday, October 30th, 2014


Wikidata went live two years ago, but the II in the title is also a reference to the first page called Wikidata on meta.wikimedia.org which for years collected ideas for first class data support in Wikipedia. I had linked to Wikidata I writing about the most prominent of those ideas, Semantic MediaWiki (SMW), which I later (8 years ago) called the most important software project and said would “turn the universal encyclopedia into the universal database while simultaneously improving the quality of the encyclopedia.”

SMW was and is very interesting and useful on some wikis, but turned out to be not revolutionary (the bigger story is wikis turned out to be not revolutionary, or only revolutionary on a small scale, except for Wikipedia) and not quite a fit for Wikipedia and its sibling projects. While I’d temper “most” and “universal” now (and should have 8 years ago), the actual Wikidata project (created by many of the same people who created SMW) is rapidly fulfilling general wikidata hopes.

One “improving the encyclopedia” hope that Wikidata will substantially deliver on over the next couple years, and that I only recently realized the importance of, is increasing trans-linguistic collaboration and availability of the sum of knowledge in many languages — when facts are embedded in free text, adding, correcting, and making available facts happens on a one-language-at-a-time basis. When facts about a topic are in Wikidata, they can be exposed in every language so long as labels are translated, even if on many topics nothing has ever been written in, nor translated into, many languages. Reasonator is a great demonstrator.
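
A concrete way to see this (my own illustration; the query and item are arbitrary examples): the same statement, capital (P36) of the United States (Q30), can be asked of the Wikidata query service and labeled in whichever language the reader requests.

    import requests

    CAPITAL_QUERY = """
    SELECT ?capital ?capitalLabel WHERE {
      wd:Q30 wdt:P36 ?capital .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "%s" . }
    }
    """

    def capital_label(lang):
        resp = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": CAPITAL_QUERY % lang, "format": "json"},
            headers={"User-Agent": "wikidata-label-demo"},
            timeout=10,
        )
        rows = resp.json()["results"]["bindings"]
        return rows[0]["capitalLabel"]["value"] if rows else None

    # capital_label("sw"), capital_label("hi"), capital_label("en") return the same
    # item's label in different languages where a translated label exists (the label
    # service falls back to the raw item ID where it does not).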

Happy 2nd to all Wikidatians and Wikidata, by far the most important project for realizing Wikimedia’s vision. You can and should edit the data and edit and translate the schema. Browse Wikidata WikiProjects to find others working to describe topics of interest to you. I imagine some readers of this blog might be interested in WikiProjects Source MetaData (for citations) and Structured Data for Commons (the media repository).

For folks concerned about intellectual parasites, Wikidata has done the right thing — all data dedicated to the public domain with CC0.

API commons

Thursday, May 29th, 2014

Notes for panel The API Copyright Emergency: What’s Next? today at API Con SF. The “emergency” is the recent decision in Oracle v. Google, which I don’t discuss directly below, though I did riff on the ongoing case last year.

I begin with, and come back to a few times, Creative Commons licenses, as I was on the panel as a “senior fellow” for that organization, but apart from such emphasis and framing, this is more or less what I think. I got about 80% of the below in on the panel, but it is hopefully still worth reading even for attendees.

A few follow-up thoughts after the notes.

Creative Commons licenses, like other public licenses, grant permissions around copyright, but as CC’s statement on copyright reform concludes, licenses “are not a substitute for users’ rights, and CC supports ongoing efforts to reform copyright law to strengthen users’ rights and expand the public domain.” In the context of APIs, default policy should be that independent implementation of an API never require permission from the API’s designer, previous implementer, or other rightsholder.

Without such a default policy of permission-free innovation, interoperability and competition will suffer, and the API community invites late and messy regulation at other levels intending to protect consumers from resulting lock-in.

Practically, there are things API developers, service providers, and API consumers can do and demand of each other, both to protect the community from a bad turn in default policy, and to go further in creating a commons. But using tools such as those CC provides, and choosing the right tools, requires looking at what an API consists of, including:

  1. API specification
  2. API documentation
  3. API implementations, server
  4. API implementations, client
  5. Material (often “data”) made available via API
  6. API metadata (e.g., as part of an API directory)

(depending on construction, these could all be generated from an annotated implementation, or could each be separate works)

and what restrictions can be pertinent:

  1. Copyright
  2. Patent

(many other issues can arise from providing an API as a service, e.g., privacy, though those are usually not in the range of public licenses and are orthogonal to API “IP”, so I’ll ignore them here)

1-4 are clearly works subject to copyright, while 5 and 6 may or may not be (e.g., hopefully not if purely factual data). Typically only 3 and 4 might be restricted by patents.

Standards bodies typically do their work primarily around 1. Relatively open ones, like the W3C, obtain agreement from all contributors to the standard to permit royalty-free implementation of the standard by anyone, typically including a patent license and permission to prepare and perform derivative works (i.e., copyright, to the extent such permission is necessary). One option you have is to put your API through an existing standards organization. This may be too heavyweight, or may yet be appropriate if your API is really a multi-stakeholder thing with multiple peer implementations; the W3C now has a lightweight community group venue which might be appropriate. The Open Web Foundation’s agreements allow you to take this approach for your API without involvement of an existing standards body. Lawrence Rosen has/will talk about this.

Another approach is to release your API specification (and necessarily 2-4 to the extent they comprise one work, ideally even if they are separate) under a public copyright license, such as one of the CC licenses, the CC0 public domain dedication, or an open source software license. Currently the most obvious choice is the Apache License 2.0, which grants copyright permission as well as including a patent peace clause. One or more of the CC licenses are sometimes suggested, perhaps because specification and documentation are often one work, and the latter seems like a “creative” work. But keep in mind that CC does not recommend using its licenses for software, and instead recommends using an open source software license (such as Apache): no CC license includes explicit patent permission, and depending on the specific CC license chosen, it may not be compatible with software licenses, contrary to the goal of granting clear permission for independent API implementation, even in the face of a bad policy turn.

One way to go beyond mitigating “API copyrightability” is to publish open source implementations, preferably production, though reference implementations are better than nothing. These implementations would be covered by whatever copyright and patent permissions are granted by the license they are released under — again Apache 2.0 is a good choice, and for software implementation CC licenses should not be used; other software licenses such as [A]GPL might be pertinent depending on business and social goals.

Another way to create a “thick” API commons is to address material made available via APIs, and metadata about APIs. There, CC tools are likely pertinent, e.g., use CC0 for data and metadata to ensure that “facts are free”, as they ought be in spite of other bad policy turns.

To get even thicker, consider the architecture, for lack of a better term, around API development, services, and material accessed and updated via APIs. Just some keywords: Linked Open Data, P2P, federation, Lots of Copies Keep Stuff Safe, collaborative curation.

The other panelists were Pamela Samuelson, Lawrence Rosen, and Annette Hurst, moderated by David Berlind.

I’m fairly familiar with Samuelson’s and Rosen’s work and don’t have comments on what they said on the panel. If you want to read more, I recommend among Samuelson’s papers The Strange Odyssey of Software Interfaces and Intellectual Property Law which shows that the “API copyright emergency” of the panel title is recurrent and intertwined with patent, providing several decades of the pertinent history up to 2008. Contrary to my expectation in the notes above, Rosen didn’t get a chance to talk about the Open Web Foundation agreements, but you can read his 2010 article Implementing Open Standards in Open Source which covers OWF.

Hurst is a lawyer for Orrick representing Oracle in the Oracle v. Google case, so understandably advocated for API copyright, but in the process made several deeply flawed assertions that could have consumed the entire duration of the panel, but Berlind did a good job of keeping the conversation moving forward. Still, I want to mention two high level ones here, my paraphrases and responses:

Without software copyright the software economy would go away. This is refuted by software development not for the purposes of selling licenses (which is the vast majority of it), especially free/open source software development, and services (e.g., API provision, the source of which is often never published, though it ought be, see “going beyond” recommendations above). Yes the software economy would change, with less winner-take-all monopoly and less employment for Intellectual Parasite lawyers. But the software economy would be huge and very competitive. Software is eating the world, remember? One way to make it help rather than pejoratively eat the world is to eject the parasites along for the ride.

Open source can’t work without software copyright. This is refuted by 1) software source sharing before software copyright; 2) preponderance of permissively licensed open source software, in which the terms do not allow suing downstream developers who do not share back; 3) the difficulty of enforcing copyleft licenses which do allow for suing downstream developers who do not share back; 4) the possibility of non-copyright regulation to force sharing of source (indeed I see the charitable understanding of copyleft as prototyping such regulation; for perspective on the Oracle v. Google case from someone with a more purely charitable understanding of copyleft, see Bradley Kuhn); and 5) demand and supply mechanisms for mandating sharing of source (e.g., procurement policies, distribution policies such as Debian’s).

These came up because Hurst seemed to really want the audience to conflate software copyright in general (not at issue in the case, settled in a bad place since the early 1980s) and API copyright specifically. Regarding the latter, another point which could have been made is the extent to which free/open source software has been built around providing alternatives to proprietary software, often API-compatible. If API copyright could prevent compatible implementation without permission of any sort, open source, competition, and innovation would all be severely hampered.

There is a recent site called API Commons, which seems to be an API directory (Programmable Web, which ran the conference, also has one). My general suggestion to both would be to implement and facilitate putting all elements of APIs listed above in my notes in the commons. For example, they could clarify that API metadata they collect is in the public domain, publish it as Linked Open Data, and encourage API developers and providers they catalog to freely license specifications, documentation, implementations, and data, and note such in the directories.

In order to get a flavor for the conference, I listened to yesterday morning’s keynotes, both of which made valiant attempts to connect big picture themes to day to day API development and provision. Allow me to try to make connections back to “API commons”.

Sarah Austin, representing the San Francisco YMCA, pointed out that the conference is near the Tenderloin neighborhood, the poorest in central San Francisco. Austin asked if kids from the Tenderloin would be able to find jobs in the “API economy” or would they be priced out of the area (many tech companies have moved nearby in the last years, Twitter perhaps the best known).

Keith Axline claimed The Universe Is Programmable. We Need an API for Everything, or to some extent, that learning about the universe and how to manipulate it is like programming. Axline’s talk seemed fairly philosophical, but could be made concrete with reference to the Internet of Things, programmable matter, robots, nanobots, software eating the world … much about the world will indeed soon be software (programmable) or obsolete.

Axline’s conclusion was in effect largely about knowledge policy, including mourning energy wasted on IP, and observing that we should figure out public support for science or risk a programmable world dominated by IP. That might be part of it, but keeps the focus on funding, which is just where IP advocates want it — IP is an off-the-balance-sheets, “free” taking. A more direct approach is needed — get the rules of knowledge policy right, put freedom and equality as its top goals, reject freedom infringing regimes, promote commons (but mandating all these as a condition of public and publicly interested funding is a reasonable starting place) — given these objectives and constraints, then argue about market, government, or other failure and funding.

Knowledge policy can’t directly address Austin’s concerns in the Tenderloin, but it does indirectly affect them, and over the long term tremendously affect them, in the Tenderloin and many other places. As the world accelerates its transition from an industrial to a knowledge dominated economy, will that economy be dominated by monopoly and inequality or freedom and equality? Will the former concentrations continue to abet instances of what Jane Jacobs called “catastrophic money” rushing into ill-prepared neighborhoods, or will the latter tendencies spread the knowledge, wealth, and opportunity?

WWW next 25: Universal, Secure, Resilient?

Wednesday, March 12th, 2014

Today folks seem to be celebrating the 25th anniversary of a 1989 proposal for what is now the web — implementation released to the public in August, 1991.

Q&A with web inventor Timothy Berners-Lee: 25 years on, the Web still needs work.

The web is pretty great, much better than easily imagined alternatives. Three broad categories it could improve in:

  • Universality. All humans should be able to access the web, and this should be taken to include being able to publish, collaborate, do business, and run software on the web, in any manner, in any language or other interface. Presently, billions aren’t on the net at all, activity outside of a handful of large services is very expensive (in money, expertise, or marketing), and machine translation and accessibility are very limited.
  • Security. All of the above, securely, without having to understand anything technical about security, and with lots of technical and cultural guards against technical and non-technical attacks of all kinds.
  • Resilience. All of the above, with minimal interruption and maximal recovery from disaster, from individual to planetary scale.

Three pet outcomes I wish for:

  • Collective wisdom. The web helps make better decisions, at all scales.
  • Commons dominance. Most top sites are free-as-in-freedom. Presently, only Wikipedia (#5) is.
  • Freedom, equality, etc.

Two quotes from the Berners-Lee Q&A that are on the right track:

Getting a nice user interface to a secure system is the art of the century.

Copyright law is terrible.

RDFa initial context & one dc:

Tuesday, February 4th, 2014

One of the nice things to come out of RDFa 1.1 is its initial context — a list of vocabularies with prefixes which may be used without having to define them locally. In other words, just write, e.g., property="dc:title" without having to first write prefix="dc: http://purl.org/dc/terms/".

In addition to making RDFa a lot less painful to use, the list is a good starting place for figuring out what vocabularies to use (if you must), perhaps even for non-RDFa applications — the list is machine-readable of course; I was reminded to write this post when giving feedback on a friend’s proposal to use prefix:property headers in a CSV file for a custom application, and by a recent announcement of the addition of three new predefined prefixes.

Survey data such as Linked Open Vocabularies can also help figure out what to use. Unfortunately LOV and the RDFa 1.1 initial context don’t agree 100% on prefix naming, and neither provides much in the way of guidance. I think there’s room for a highly opinionated and regularly updated guide to what vocabularies to use. I’m no expert, it probably already exists — please inform me!

dc:

The first thing I’d put in such an opinionated guide is to start one’s vocabulary search with Dublin Core. Trivial, right? But there is an under-documented subtlety which I find myself pointing out when a friend runs something like the aforementioned by me — DC means DC Terms. While it’s obvious that DC Terms is a superset of DC Elements, it’s harder to find evidence that using the former is best practice for new applications, and that the latter is not still the canonical vocabulary to start with. What I’ve gathered on this follows. I realize that the URIs for individual properties and classes, the prefixes used to abbreviate those URIs, and the documents which define (in English and RDF) properties and classes are distinct but interdependent. Prefixes are surely the most trivial and uninteresting, but for most people I imagine they’re important signals and documentation, thus I go on about them…

Namespace Policy for the Dublin Core Metadata Initiative (DCMI) (emphasis added):

The DCMI namespace URI for the collection of legacy properties that make up the Dublin Core Metadata Element Set, Version 1.1 [DCMES] is: http://purl.org/dc/elements/1.1/

Dublin Core Metadata Element Set, Version 1.1 (emphasis added):

Since 1998, when these fifteen elements entered into a standardization track, notions of best practice in the Semantic Web have evolved to include the assignment of formal domains and ranges in addition to definitions in natural language. Domains and ranges specify what kind of described resources and value resources are associated with a given property. Domains and ranges express the meanings implicit in natural-language definitions in an explicit form that is usable for the automatic processing of logical inferences. When a given property is encountered, an inferencing application may use information about the domains and ranges assigned to a property in order to make inferences about the resources described thereby.

Since January 2008, therefore, DCMI includes formal domains and ranges in the definitions of its properties. So as not to affect the conformance of existing implementations of “simple Dublin Core” in RDF, domains and ranges have not been specified for the fifteen properties of the dc: namespace (http://purl.org/dc/elements/1.1/). Rather, fifteen new properties with “names” identical to those of the Dublin Core Metadata Element Set Version 1.1 have been created in the dcterms: namespace (http://purl.org/dc/terms/). These fifteen new properties have been defined as subproperties of the corresponding properties of DCMES Version 1.1 and assigned domains and ranges as specified in the more comprehensive document “DCMI Metadata Terms” [DCTERMS].

Implementers may freely choose to use these fifteen properties either in their legacy dc: variant (e.g., http://purl.org/dc/elements/1.1/creator) or in the dcterms: variant (e.g., http://purl.org/dc/terms/creator) depending on application requirements. The RDF schemas of the DCMI namespaces describe the subproperty relation of dcterms:creator to dc:creator for use by Semantic Web-aware applications. Over time, however, implementers are encouraged to use the semantically more precise dcterms: properties, as they more fully follow emerging notions of best practice for machine-processable metadata.

The first two paragraphs explain why a new vocabulary was minted (so that the more precise definitions of properties already in DC Elements do not change the behavior of existing implementations; had only new terms and classes been added, maybe they could have been added to the DC Elements vocabulary, but maybe this is ahistoric, as many of the additional “qualified” DC Terms existed since 2000). The third paragraph explains that DC Terms should be used for new applications. Unfortunately the text informally (the prefixes aren’t used anywhere) notes the prefixes dc: and dcterms:, which I’ve found is not helpful in getting people to focus only on DC Terms.

Expressing Dublin Core metadata using the Resource Description Framework also notes the dc: and dcterms: prefixes for use in the document’s examples (which don’t ever actually use dc:).

Some of these documents have been updated slightly, but I believe their current versions are little changed from about 2008, a year after the proposal of the DC Terms refinements.

How to use DCMI Metadata as linked data uses the dc: and dcterms: prefixes and is clear about the ranges of properties of each: there is no incorrect usage of, e.g., purl.org/dc/elements/1.1/creator because it has no defined range nor domain, while purl.org/dc/terms/creator must be a non-literal, a purl.org/dc/terms/Agent. Perhaps this makes DC Terms seem scarier and partially explains the persistence of DC Elements. More likely, I’d guess few know about the difference, and lots of DC Terms properties with non-literal ranges are used with literals in the wild (I might be guilty on occasion).
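
To make the range difference concrete, here is a small sketch using the rdflib Python library (the identifiers are made up for illustration): the legacy element takes a bare name, while the DC Terms property points at an Agent resource that carries the name.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC, DCTERMS, FOAF

    g = Graph()
    paper = URIRef("http://example.org/paper/1")     # hypothetical identifiers
    author = URIRef("http://example.org/person/jane")

    # Legacy DC Elements: no declared range, so a plain literal is not formally wrong.
    g.add((paper, DC.creator, Literal("Jane Doe")))

    # DC Terms: dcterms:creator expects a non-literal (a dcterms:Agent),
    # so point at a resource and attach the name to it.
    g.add((paper, DCTERMS.creator, author))
    g.add((author, FOAF.name, Literal("Jane Doe")))

    print(g.serialize(format="turtle"))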

FAQ/DC and DCTERMS Namespaces:

It is not incorrect to continue using dc:subject and dc:title — alot of Semantic Web data still does — and since the range of those properties is unspecified, it is not actually incorrect to use (for example) dc:subject with a literal value or dc:title with a non-literal value. However, good Semantic Web practice is to use properties consistently in accordance with formal ranges, so implementers are encouraged to use the more precisely defined dcterms: properties.
Update, December 2011: It is worth noting that the Schema.org initiative is taking a pragmatic approach towards the formal ranges of their properties:

We also expect that often, where we expect a property value of type Person, Place, Organization or some other subClassOf Thing, we will get a text string. In the spirit of “some data is better than none”, we will accept this markup and do the best we can.

What constitutes “best practice” in this area is bound to evolve with implementation experience over time.

There you have people supplying literals for properties expecting non-literals. Schema.org RDF mappings do not formally condone this pragmatic approach; otherwise you’d see the likes of the following (the addition, bolded in the original post, being xsd:string in the range):

schema:creator a rdf:Property;
    rdfs:label "Creator"@en;
    rdfs:comment "The creator/author of this CreativeWork or UserComments. This is the same as the Author property for CreativeWork."@en;
    rdfs:domain [ a owl:Class; owl:unionOf (schema:UserComments schema:CreativeWork) ];
    rdfs:range [ a owl:Class; owl:unionOf (schema:Organization schema:Person xsd:string) ];
    rdfs:isDefinedBy ;
    rdfs:isDefinedBy ;

Also from 2011, a discussion of what prefixes to use in the RDFa initial context. Decision (Ivan Herman):

For the records: after having also discussed on yesterday’s telecom, I have made the changes on the profile files yesterday evening. The prefix set in the profile for http://purl.org/dc/terms/ is set to ‘dc’.

Read the expert input of Dan Brickley, Mikael Nilsson, and Thomas Baker. The initial context defines both dc: and dcterms: as prefixes for DC Terms, relegating DC Elements to dc11::

prefix    URI                                vocabulary
dc        http://purl.org/dc/terms/          Dublin Core Metadata Terms (DCMI Metadata Terms)
dcterms   http://purl.org/dc/terms/          Dublin Core Metadata Terms (DCMI Metadata Terms)
dc11      http://purl.org/dc/elements/1.1/   Dublin Core Metadata Element Set, Version 1.1

I found the above discussion on LOV’s entries for DC Terms and DC Elements, which use dcterms: and dce: prefixes respectively:

(2013-03-07) Bernard Vatant: Prefix restored to dcterms

(2013-06-17) Bernard Vatant: Although “dc” is often used as the prefix for this vocabulary, it’s also sometimes used for DC terms, so we preferred to use the less ambiguous “dce” and “dcterms” in LOV. See usage at http://prefix.cc/dc, http://prefix.cc/dce, http://prefix.cc/dcterms, and more discussion at http://bit.ly/uPuUTT.

I think the discussion instead supports using dc: and dc11: (because that’s what the RDFa initial context uses). LOV doesn’t have a public source repository or issue tracker currently, but I understand it eventually will.

Now I have this grab-bag blog post to send to friends who propose using DC Elements. Please correct me if I’m wrong, and especially if a more concise (on this topic) and credible document exists, so I can send that instead; perhaps something like an opinionated guide to metadata mentioned way above.

Another topic such a guide might cover, perhaps as a coda, would be what to do if you really need to develop a new vocabulary. One thing is you really need to ask for help. The W3C now provides some infrastructure for doing this. Or, some qualified dissent from a hugely entertaining blogger called Brinxmat.

Some readers of my blog who have bizarrely read through this post, or skipped to the end, might enjoy Brinxmat’s Attribution licences for data and why you shouldn’t use them (another future issue report for LOV, which uses CC-BY?); I wrote a couple posts in the same blogversation; also a relevant upgrade exhortation.

[Semi]Commons Coordinations & Copyright Choices 4.0

Monday, December 9th, 2013

CC0 is superior to any of the Creative Commons (CC) 4.0 licenses, because CC0 represents a superior policy (public domain). But if you’re unable or unwilling to upgrade to CC0, the CC 4.0 licenses are a great improvement over the 3.0 licenses. The people who did the work, led by Diane Peters (who also led CC0), many CC affiliates (several of whom were also crucial in making CC0 a success), and Sarah Pearson and Kat Walsh, deserve much praise. Bravo!

Below read my idiosyncratic take on issues addressed and not addressed in the 4.0 licenses. If that sounds insufferable, but you want to know about details of the 4.0 licenses, skip to the excellent version 4 and license versions pages on the CC wiki. I don’t bother linking to sections of those pages pertinent to issues below, but if you want detailed background beyond my idiosyncratic take on each issue, it can be found there.

Any criticism I have of the 4.0 licenses concerns policy choices and is not a criticism of the work done or people involved, other than myself. I fully understand that the feasible choices were and are highly constrained by previous choices and conditions, including previous versions of the CC licenses, CC’s organizational history, users of CC licenses, and the overall states of knowledge commons and info regulation and CC’s various positions within these. I always want CC and other “open” organizations to take as pro-commons of a stance as possible, and generally judge what is possible to be further than that of the conventional wisdom of people who pay any attention to this scene. Sometimes I advocated for more substantial policy changes in the 4.0 licenses, though just as often I deemed such advocacy futile. At this point I should explain that I worked for CC until just after the 4.0 licenses process started, and have consulted a bit on 4.0 licenses issues since then as a “fellow”. Not many people were in a better position to influence the 4.0 licenses, so any criticisms I have are due to my failure to convince, or perhaps incorrect decision to not try in some cases. As I’ve always noted on this blog, I don’t represent any organization here.

Desiderata

Pro-commons? As opposed to what? The title of the CC blog post announcing the formal beginning of work on the new licenses:

Copyright Experts Discuss CC License Version 4.0 at the Global Summit

My personal blog post:

Commons experts to develop version 4.0 of the CC licenses

The expertise that CC and similar organizations ought to bring to the world is commons coordination. There are many copyright experts in the world, and understanding public copyright licenses, and drafting more, are no great intellectual challenges. The copyright expertise needed to do so ought be purely instrumental, serving the purpose of commons coordination. Or so I think.

Throughout CC’s existence, it has presented itself, and been perceived as, to varying extents, an organization which provides tools for copyright holders to exercise their copyrights, and an organization which provides tools for building a commons. (What it does beyond providing tools adds another dimension, not unrelated to “copyright choice” vs. “commons coordination”; there’s some discussion of these issues in a video included in my personal post above.)

I won’t explain in this post, but I think the trend through most of CC’s history has been very slow movement in the “commons coordination” direction, and the explicit objectives of the 4.0 versioning process fit that crawl.

“Commons coordination” does not directly imply the usual free/open vs. proprietary/closed dichotomy. I think it does mostly fall out that way, in small part due to “license interoperability” practicalities, but probably mostly because I think the ideal universal copyregulation policy corresponds to the non-discriminatory commons that “free/open” terms and communities carve out on a small scale, including the pro-sharing policy that copyleft prototypes, and excluding any role for knowledge enclosure, monopoly, property, etc. But it is certainly possible, indeed usual, to advocate for a mixed regime (I enjoy the relatively new term “semicommons”, but if you wish to see it everywhere, try every non-demagogic call for “balance”), in which case [semi]commons tools reserving substantial exclusivity (e.g., “commercial use”) make perfect sense for [semi]commons coordination.

Continuing to ignore the usual [non-]open dichotomy, I think there still are a number of broad criteria for would-be stewards of any new commons coordinating license (and make no mistake, a new version of a license is a new license; CC introduced 6 new licenses with 4.0) to consider carefully, and which inform my commentary below:

  • Differentiation: does the new license implement some policy not currently available in existing licenses, or at least offer a great improvement in implementation (not to provide excuses for new licenses, but the legal text is just one part of implementation; also consider branding/positioning, understandability, and stewardship) of policy already available?
  • Permissions: does the new license grant all permissions needed to realize its policy objective?
  • Regulation: how does the license’s policy objective model regulation that ought be adopted at a wider scale, e.g., how does it align with usual “user rights” and “copyright reform” proposals?
  • Interoperability: is the new license maximally compatible with existing licenses, given the constraints of its policy objectives, and indeed, even at the expense of its immediate policy objectives, given that incompatibility, non-interoperability, and proliferation must fragment and diminish the value of commons?
  • Cross-domain impact: how does the license impact license interoperability and knowledge sharing across fields/domains/communities (e.g., software, data, hardware, “content”, research, government, education, culture…)? Does it further silo existing domains, a tragedy given the paucity of knowledge about governing commons in the world, or facilitate sharing and collaboration across domains?

Several of these are merely a matter of good product design and targeting, and would also apply to an organization that really had a primary goal of offering copyright holders additional choices the organization deems are under-provided. I suspect there is plenty of room for innovation in “copyright choice” tools, but I won’t say more in this post, as such have little to do with commons, and whatever CC’s history of copyright choice rhetoric and offering a gaggle of choices, creating such tools is distant from its immediate expertise (other than just knowing lots about copyright) and light years from much of its extended community.

Why bother?

Apart from amusing myself and a few others, why this writeup? The CC 4.0 licenses won’t change, and hopefully there won’t be CC 4.1 or 4.5 or 5.0 licenses for many years. Longevity was an explicit goal for 4.0 (cf. 1.0: 17 months, 2.0: 12 months; 2.5: 20 months; 3.0: 81 months). Still, some of the issues covered here may be interesting to people choosing to use one of the CC 4.0 licenses, and people creating other licenses. Although nobody wants more licenses, often called license proliferation, as an end in itself, many more licenses is the long term trend, of which the entire history of CC is just a part. Further, more licenses can be a good, to the extent they are significantly different from and better than, and as compatible as possible with, existing licenses.

To be totally clear: many new licenses will be created and used over the next 10 years, intended for various domains. I would hope, some for all domains. Proliferators, take heed!

Development tools

A 4.0 wiki page and a bunch of pages under that were used to lay out objectives, issues and options for resolution, and link to drafts. Public discussion was on the cc-licenses list, with tangential debate pushed to cc-community. Drafts and changes from previous drafts were published as redlined word processor files. This all seems to have worked fairly well. I’d prefer drafts as plain text files in a git repository, and an issue tracker, in addition to a mailing list. But that’s a substantially different workflow, and word processor documents with track changes and inline comments do have advantages, not limited to lawyers being familiar with those tools.

100% wiki would also work, with different tradeoffs. In the future additional tools around source repositories, or wikis, or wikis in source repositories, will finally displace word processor documents, but the tools aren’t there yet. Or in the bad future, all licenses will be drafted in word processors in the cloud.

(If it seems that I’m leaving a lot out, e.g., methodology for gathering requirements and feedback, in-person and teleconferences, etc., I merely have nothing remotely interesting to say, and used “tools” rather than “process” to narrow scope intentionally.)

Internationalization

The 4.0 licenses were drafted to be jurisdiction neutral, and there will be official, equivalent, verbatim language translations of the licenses (the same as CC0, though I don’t think any translations have been made final yet). Legal “porting” to individual jurisdictions is not completely ruled out, but I hope there will be none. This is a wholly positive outcome, and probably the most impactful change for CC itself (already playing out over the past few years, e.g., in terms of scope and composition of CC affiliates), though it is of small direct consequence to most users.

Now, will other license drafters and would-be drafters follow CC’s lead and stop with the vanity jurisdiction license proliferation already?

Databases

At least the EU, Mexico, Russia, and South Korea have created “database rights” (there have been attempts in other jurisdictions), copyright-like mechanisms for entities that assemble databases to persecute others who would extract or copy substantial portions of said databases. Stupid policies that should be abolished, copyright-like indeed.

Except for CC0 and some minor and inconsistent exceptions (certain within-EU jurisdiction “port” versions), CC licenses prior to 4.0 have not “covered” database rights. This means, modulo any implied license which may or may not be interpreted as existing, that a prior-to-4.0 (e.g., CC-BY-3.0) licensee using a database subject to database restrictions (when this occurs is a complicated question) would have permission granted by the licensor around copyright restrictions, but not around database restrictions. This is a pretty big fail, considering that the first job of a public license is to grant adequate permissions. Actual responses to this problem:

  • Tell all database publishers to use CC0. I like this, because everyone should just use CC0. But, it is an inadequate response, as many will continue to use less permissive terms, often in the form of inadequate or incompatible licenses.
  • Only waive or license database restrictions in “ports” of licenses to jurisdictions in which database restrictions exist. This is wholly inadequate, as in the CC scheme, porting involves tailoring the legal language of a license to a jurisdiction, but there’s no guarantee a licensor or licensee in such jurisdictions will be releasing or using databases under one of these ports, and in fact that’s often not the case.
  • Have all licenses waive database restrictions. This sounds attractive, but is mostly confusing — it’s very hard to discern when only database and not copyright restrictions apply, such that a licensee could ignore a license’s conditions — and like “tell database publishers to use CC0” would just lead many to use different licenses that do purport to conditionally license database rights.
  • Have all licenses grant permissions around database restrictions, under whatever conditions are present in the license, just like copyright.

I think the last is the right approach, and it’s the one taken with the CC 4.0 licenses, as well as by other licenses which would not exist but for CC 3.0 licenses not taking this approach. I’m even more pleased with their generality, because other copyright-like restrictions are to be expected (emphasis added):

Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.

The exclusions of 2(b)(1)-(2) are a mixed bag; see moral and personality rights, and patents below.

CC0 also includes a definition with some generality:

Copyright and Related Rights include, but are not limited to, the following:

  1. the right to reproduce, adapt, distribute, perform,
    display, communicate, and translate a Work;
  2. moral rights retained by the original author(s) and/or
    performer(s);
  3. publicity and privacy rights pertaining to a person’s
    image or likeness depicted in a Work;
  4. rights protecting against unfair competition in regards
    to a Work, subject to the limitations in paragraph 4(a),
    below;
  5. rights protecting the extraction, dissemination, use and
    reuse of data in a Work;
  6. database rights (such as those arising under Directive
    96/9/EC of the European Parliament and of the Council of 11
    March 1996 on the legal protection of databases, and under
    any national implementation thereof, including any amended
    or successor version of such directive); and
  7. other similar, equivalent or corresponding rights
    throughout the world based on applicable law or treaty, and
    any national implementations thereof.

As does GPLv3:

“Copyright” also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.

Do CC0 and CC 4.0 licenses cover semiconductor mask restrictions (best not to use for this purpose anyway, see patents)? Does GPLv3 cover database restrictions? I’d hope the answer is yes in each case, and if the answer is no or ambiguous, future licenses further improve on the generality of restrictions around which permissions are granted.

There is one risk in licensing everything possible, and culturally it seems, specifically in licensing database rights — the impression that licenses which do so ‘create obligations’ related to those rights. I find it an odd way of thinking to treat a conditional permission as the creation of an obligation, when the user’s situation without said permission is unambiguously worse, i.e., no permission at all. Further, this impression is a problem for non-maximally-permissive licenses around copyright, not only database or other copyright-like rights.

In my opinion the best a public license can do is to grant permissions (conditionally, if not a maximally permissive license) around restrictions with as much generality as possible, and expressly state that a license is not needed (and therefore conditions do not apply) if a user can ignore underlying restrictions for some other reason. Can the approach of CC version 4.0 licenses to the latter be improved?

For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.

These are all trivialities for license nerds. For publishers and users of databases: Data is free. Free the data!

Moral and personality rights

CC 4.0 licenses address them well:

Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.

To understand just how well, CC 3.0 licenses say:

Except as otherwise agreed in writing by the Licensor or as may be otherwise permitted by applicable law, if You Reproduce, Distribute or Publicly Perform the Work either by itself or as part of any Adaptations or Collections, You must not distort, mutilate, modify or take other derogatory action in relation to the Work which would be prejudicial to the Original Author’s honor or reputation. Licensor agrees that in those jurisdictions (e.g. Japan), in which any exercise of the right granted in Section 3(b) of this License (the right to make Adaptations) would be deemed to be a distortion, mutilation, modification or other derogatory action prejudicial to the Original Author’s honor and reputation, the Licensor will waive or not assert, as appropriate, this Section, to the fullest extent permitted by the applicable national law, to enable You to reasonably exercise Your right under Section 3(b) of this License (right to make Adaptations) but not otherwise.

Patents and trademark

Prior versions were silent, CC 4.0 licenses state:

Patent and trademark rights are not licensed under this Public License.

Perhaps some potential licensor will be reassured, but I consider this unnecessary and slightly harmful, replicating the main deficiency of CC0. The explicit exclusion makes it harder to see an implied license. This is especially troublesome when CC licenses are used in fields in which patents can serve as a barrier. Software is one, for which CC has long disrecommended use of CC licenses, largely because software is already well covered by licenses with which CC licenses are mostly incompatible; the explicit patent exclusion in the CC 4.0 licenses makes them even less suitable. Hardware design is another such field, but one with fragmented licensing, including use of CC licenses. CC should now explicitly disrecommend using CC licenses for hardware designs and declare CC-BY-SA-4.0 one-way compatible with GPLv3+ so that projects using one of the CC-BY-SA licenses for hardware designs have a clear path to a more appropriate license.

Patents of course can be licensed separately, and as I pointed out before regarding CC0, there could be curious arrangements for projects using such licenses with patent exclusions, such as only accepting contributions from Defensive Patent License users. But the better route for “open hardware” projects and the like to take advantage of this complementarity is to do both, that is, use a copyright and related rights license that includes a patent peace clause, and join the DPL club.

DRM

CC 4.0 licenses:

The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures.

This is a nice addition, which had been previously suggested for CC 3.0 licenses and rejected — the concept copied from GPLv3 drafts at the time. I would have preferred to also remove the limited DRM prohibition in the CC licenses.

Attribution

The CC 4.0 licenses slightly streamline and clarify the substance of the attribution requirement, all to the good. The most important bit, itself only a slight streamlining and clarification of similar language in previous versions:

You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.

This moves in-the-wild use from near-zero to-the-letter compliance to fairly high compliance.

I’m not fond of the requirement to remove attribution information if requested by the licensor, especially accurate information. I don’t know whether a licensor has ever made such a request; if none ever has, the clause is merely pointless rather than harmful. Not quite, though, as it does make for a talking point.

NonCommercial

The operative part of the definition now reads:

not primarily intended for or directed towards commercial advantage or private monetary compensation. For purposes of this Public License, the exchange of the Licensed Material for other material subject to Copyright and Similar Rights by digital file-sharing or similar means is NonCommercial provided there is no payment of monetary compensation in connection with the exchange.

Not intended to be a substantive change, but I’ll take it. I’d have preferred a more significantly narrowed definition and a rebranding, so as to increase the range of and differentiation among the licenses that CC stewards. But at the beginning of the 4.0 licenses process I expected no progress, so I am not disappointed. Branding and other positioning changes could come post-launch, if anyone is so inclined.

I think the biggest failure of the range of licenses with an NC term (and there are many preceding CC) is not confusion and pollution of the commons (very roughly, the complaints of people who would like NC to have a more predictable meaning and of those who think NC offers inadequate permissions, respectively), but lack of valuable use. Licenses with the NC term are certainly used for hundreds of millions of photos and web pages, and some (hundreds of?) thousands of songs, videos, and books, but in few of these cases does either the licensor or the public gain significant value above what would have been achieved if the licensor had simply offered gratis access (i.e., put stuff on the web, which is incredibly valuable even with no permissions granted). As far as I know, NC licenses haven’t played a significant role in enabling (again, relative to gratis access) any disruptive product or policy, and their use by widely recognized artists and brands is negligible (cf. CC-BY-SA, which Wikipedia and other mass collaboration projects rely on to exist, and CC-BY and CC0, which are part of disruptive policy mandates).

CC is understandably somewhat stuck between free/open norms, which make licenses with the NC term an embarrassment, and those licenses’ numerically large but low-value uses. A license steward or would-be steward that really believed a semicommons license regime could do much more would try to break out of this rut by completely rethinking the product (or that part of the product line), probably resulting in something much more different from the current NC implementation than the mere definitional narrowing and rebranding that I started out preferring. This could be related to my commentary on innovation in “copyright choice” tools above; whether the two are really the same thing would be a subject for inquiry.

NoDerivatives

If any licenses should not have been brought to version 4.0, at least not under the CC brand, they are CC-BY-NC-ND and CC-BY-ND.

Instead, an express permission to make derivatives so long as they are not shared was added. This change makes so-called text/content/data mining of any work under any of the CC version 4.0 licenses unambiguously permitted, and makes ND stick out a tiny bit less as an aberration from the CC license suite modeling some moderate copyright reform baseline.

There are some costs to this approach: surprise that a “no derivatives” license permits derivatives, a slight reduction in the range of and differentiation among the licenses that CC stewards, giving credence to ND licenses as acceptable for scholarship, and abetting the impression that text/content/data mining requires permission at all. The last is the most worrisome, but (as with similar worries around licensing databases) it can be turned into a positive to the extent CC and everyone knowledgeable emphasizes that you ought not to need, and probably do not need, a license; we’re just making sure you have the freedoms around CC-licensed works that you ought to have anyway, in case the info regulation regime gets even worse. But please, mine away.

ShareAlike

This is the most improved of the named (BY/NC/ND/SA) elements in the CC 4.0 licenses, and the work is not done yet. But first, I wish it had been improved even more, by making more uses unambiguously “trigger” the SA provision. This has been done once before, starting in 2.0:

For the avoidance of doubt, where the Work is a musical composition or sound recording, the synchronization of the Work in timed-relation with a moving image (“synching”) will be considered a Derivative Work for the purpose of this License.

The obvious next expansion would have been use of images (still or moving) in contextual relation to other material, e.g., illustrations used in a text. Without this expansion, CC-BY-SA and CC-BY-NC-SA are essentially identical to CC-BY and CC-BY-NC respectively for the vast majority of actual “reuse” instances. Such an expansion would have substantially increased the range of and differentiation among the licenses that CC stewards. The main problem with such an expansion (apart from specifying it exactly) would be increasing the cost of incompatibility, where texts and images use different licenses. This problem would be mitigated by increasing compatibility among copyleft licenses (below), or could be eliminated by broadening the SA licensing requirement for uses triggered by the expansion, e.g., to any terms granting at least equivalent permissions, such that a CC-BY-SA illustration could still be used in a text licensed under CC-BY or CC0. Such an expansion did not make the cut, but I think that together with the aforementioned broadening of licensing requirements, such a modulation (neither strictly “stronger” nor “weaker”) would make for an interesting and heretofore unimplemented approach to copyleft, in some future license.

Apart from a subtle improvement that brings SA closer to a full “or later versions” license and reflects usual practice and understanding (incidentally, “no sublicensing” in non-SA licenses remains pointless, is not found in most non-CC permissive licenses, and should not be replicated), the big improvements in the CC 4.0 licenses with the SA element are the addition of a mechanism for one-way compatibility to CC-BY-SA, the addition of the same compatibility mechanism to CC-BY-NC-SA, and discussions with stewards of potentially compatible licenses which make the realization of compatibility more likely. (I would have included a variation on the more complex but, in my view, elegant and politically advisable mechanism introduced in MPL 2.0, which allows continued use under the donor compatible license for as long as possible. Nobody demanded such, so not adding the complexity was perhaps a good thing.)

I hope that in 2014 CC-BY-SA-4.0 will be declared bilaterally compatible with the Free Art License 1.3 (or, if a new FAL version is required, that it is developed with bilateral compatibility as a hard requirement), and more importantly, that CC-BY-SA-4.0 is declared one-way compatible (as a donor) with GPLv3+. An immediate step toward those ends will be finalizing an additional statement of intent regarding the stewardship of licenses with the ShareAlike element.

Though I’ll be surprised if any license appears as a candidate for compatibility with CC-BY-NC-SA-4.0, adding the mechanism to that license is a good thing as a matter of general license stewardship: it reduces the barriers to someone else creating a better NC license (see above), and it keeps “porting” completely outside the 4.0 license texts (hopefully there will be no porting, but if there is any, compatibility with the international versions of licenses with the SA element would be exclusively via the compatibility mechanism used for any potentially compatible license).

Tech

All license clauses have id attributes, allowing direct linking to a particular clause. These direct links are used for references within the licenses. These are big usability improvements.

I would have liked to see an expansive “tech” (including to some extent design) effort synchronized with the 4.0 licenses, from the practical (e.g., a canonical format for license texts, from which HTML, plain text, and others are generated; that may be HTML, but the current license HTML is inadequate for the task) to the impractical (except for increasing CC’s reputation, e.g., investigating whether any semantic annotation and structure, preferably building on existing research, would be useful, in theory, for the license texts, and possibly even a practical aid to translation), to testing further upgrades to the ‘legal user interface’ constituted by the license texts and “deed” summaries (e.g., combining these), to just bringing various CC tooling and documentation up to date with RDFa 1.1 Lite. But, some of these things could be done post-launch if anyone is so inclined, and my understanding is that CC has only a single technology person on staff, dedicated to creating other products, and most importantly, the ability to directly link to any license clause probably has more practical benefits than anything on my wishlist.

Readability

One of the best things about the CC 4.0 licenses is their increased understandability. This is corroborated by the crude automated readability metrics below, though I suspect these do not adequately characterize the improvement: the measured 4.0 texts include three paragraphs of explanatory text not present in previous versions, the metrics probably don’t fully reflect the improvement of splitting hairball paragraphs into lists, and they have no way to account for how the ability to link to individual clauses contributes to understandability.

CC-BY-NC-SA (the license with the most stuff in it, usually used as a drafting template for others) from version 1.0 through 4.0, including 4.0 drafts (lower numbers indicate better readability, except in the case of Flesch; Chars/(Flesch>=1) is my gross metric for how painful it is to read a document; see license automated readability metrics for an explanation):

SHA1 License Characters Kincaid ARI Coleman-Liau Fog Lix SMOG Flesch Chars/(Flesch>=1)
39b2ef67be9e5b4e743e5269a31ad1691515eede CC-BY-NC-SA-1.0 10228 13.3 16.3 14.2 17.0 59.7 14.2 48.4 211
5800ac2d32e35ace035cdcae693423cd9ff5bb6f CC-BY-NC-SA-2.0 11927 13.3 16.2 14.7 17.1 60.0 14.4 47.0 253
e5f44c2df6b1391d1ddb6efb2db6f90670e4ae67 CC-BY-NC-SA-2.5 12013 13.1 16.0 14.6 16.9 59.6 14.2 47.7 251
a63b7e81e7b9e30df5d253aed1d2991af47992df CC-BY-NC-SA-3.0 17134 16.4 19.7 14.2 20.6 67.0 16.3 38.8 441
8b36c30ed0510d9ca9c69a2ef826b9fd52992474 by-nc-sa-4.0d1 12465 13.0 15.0 14.9 16.3 57.4 14.0 43.9 283
4a87c7af5cde7729e2e456ee0e8958f8632e3005 by-nc-sa-4.0d2 11583 13.1 14.8 14.2 16.8 56.2 14.4 44.7 259
bb6f239f7b39343d62440bff00de24da2b3d256f by-nc-sa-4.0d3 14422 14.1 15.8 15.1 18.2 61.0 15.4 38.6 373
cf5629ae38a745f4f9eca429f7b26af2e71eb109 by-nc-sa-4.0d4 14635 13.8 15.6 15.5 17.8 60.2 15.2 38.6 379
a5e1b9829fd287cbe255df71eb9a5aad7fb19dbc by-nc-sa-4.0d4v2 14808 14.0 15.8 15.5 18.0 60.6 15.2 38.1 388
887f9a5da675cf681421eab3ac6d61f82cf34971 CC-BY-NC-SA-4.0 14577 13.1 14.7 15.7 17.1 58.6 14.7 40.1 363

Versions 1.0 through 4.0 of each of the six CC licenses brought to version 4.0, and CC0:

SHA1 License Characters Kincaid ARI Coleman-Liau Fog Lix SMOG Flesch Chars/(Flesch>=1)
74286ae0dfea38c489437bf659b209737945145c CC0-1.0 5116 16.2 19.5 15.0 19.5 66.3 15.6 36.8 139
c766cc6d5e63277e46a3d83c6254e3528082587b CC-BY-1.0 8867 12.6 15.5 14.1 16.4 57.8 13.8 51.3 172
bf23729bec8ffd0de4d319fb33395c595c5c762b CC-BY-2.0 9781 12.1 14.9 14.3 16.1 56.7 13.7 51.9 188
024bb6d37d0a17624cf532bd14fbd42e15c5a963 CC-BY-2.5 9867 11.9 14.7 14.2 15.8 56.3 13.6 52.6 187
20dc61b94cfe1f4ba5814b340095b4c3fa23e801 CC-BY-3.0 14956 16.1 19.4 14.1 20.4 66.1 16.2 40.0 373
00b29551deee9ced874ffb9d29379b92f1487045 CC-BY-4.0 13003 13.0 14.5 15.4 16.9 57.9 14.6 41.1 316
e0c4b13ec5f9b5702d2e8b88d98b803e07d65cf8 CC-BY-NC-1.0 9313 13.2 16.2 14.3 17.0 59.3 14.1 49.3 188
970421995789d2e8189bb12071ab838a3fcf2a1a CC-BY-NC-2.0 10635 13.1 16.1 14.6 17.2 59.5 14.4 48.1 221
08773bb9bc13959c6f00fd49fcc081d69bda2744 CC-BY-NC-2.5 10721 12.9 15.8 14.5 16.9 59.0 14.2 48.9 219
9639556280637272ace081949f2a95f9153c0461 CC-BY-NC-3.0 15732 16.5 19.9 14.1 20.8 67.2 16.4 38.7 406
afcbb9791897e1e2f949d9d56ba64164746e0828 CC-BY-NC-4.0 13520 13.2 14.8 15.6 17.2 58.6 14.8 39.8 339
9ab2a3818e6ccefbc6ffdd48df7ecaec25e32e41 CC-BY-NC-ND-1.0 8729 12.7 15.8 14.4 16.4 58.6 13.8 51.0 171
966c97357e3b529e9c8bb8166fbb871c5bc31211 CC-BY-NC-ND-2.0 10074 13.0 16.1 14.7 17.0 59.7 14.3 48.8 206
c659a0e3a5ee8eba94aec903abdef85af353f11f CC-BY-NC-ND-2.5 10176 12.8 15.9 14.6 16.8 59.2 14.2 49.3 206
ad4d3e6d1fb6f89bbd28a44e263a89430b575dfa CC-BY-NC-ND-3.0 14356 16.3 19.7 14.1 20.5 66.8 16.2 39.7 361
68960bdf512ff5219909f932b8a81fdb255b4642 CC-BY-NC-ND-4.0 13350 13.3 14.8 15.7 17.2 58.4 14.8 39.4 338
39b2ef67be9e5b4e743e5269a31ad1691515eede CC-BY-NC-SA-1.0 10228 13.3 16.3 14.2 17.0 59.7 14.2 48.4 211
5800ac2d32e35ace035cdcae693423cd9ff5bb6f CC-BY-NC-SA-2.0 11927 13.3 16.2 14.7 17.1 60.0 14.4 47.0 253
e5f44c2df6b1391d1ddb6efb2db6f90670e4ae67 CC-BY-NC-SA-2.5 12013 13.1 16.0 14.6 16.9 59.6 14.2 47.7 251
a63b7e81e7b9e30df5d253aed1d2991af47992df CC-BY-NC-SA-3.0 17134 16.4 19.7 14.2 20.6 67.0 16.3 38.8 441
887f9a5da675cf681421eab3ac6d61f82cf34971 CC-BY-NC-SA-4.0 14577 13.1 14.7 15.7 17.1 58.6 14.7 40.1 363
e4851120f7e75e55b82a2c007ed98ffc962f5fa9 CC-BY-ND-1.0 8280 12.3 15.5 14.3 16.1 57.9 13.6 52.4 158
f1aa9011714f0f91005b4c9eb839bdb2b4760bad CC-BY-ND-2.0 9228 11.9 14.9 14.5 15.8 56.9 13.5 52.7 175
5f665a8d7ac1b8fbf6b9af6fa5d53cecb05a1bd3 CC-BY-ND-2.5 9330 11.8 14.7 14.4 15.6 56.5 13.4 53.2 175
3fb39a1e46419e83c99e4c9b6731268cbd1591cd CC-BY-ND-3.0 13591 15.8 19.2 14.1 20.0 65.6 15.9 41.2 329
ac747a640273815cf3a431be0afe4ec5620493e3 CC-BY-ND-4.0 12830 13.0 14.4 15.4 16.9 57.6 14.6 40.7 315
dda55573a1a3a80d294b1bb9e1eeb3a6c722968c CC-BY-SA-1.0 9779 13.1 16.1 14.2 16.8 59.1 14.0 49.5 197
9cceb80d865e52462983a441904ef037cf3a4576 CC-BY-SA-2.0 11044 12.5 15.3 14.4 16.2 57.9 13.8 50.2 220
662ca9fce7fed61439fcbc27ca0d6db0885718d9 CC-BY-SA-2.5 11130 12.3 15.0 14.4 16.0 57.5 13.6 50.9 218
4a5bb64814336fb26a9e5d36f22896ce4d66f5e0 CC-BY-SA-3.0 17013 16.4 19.8 14.1 20.5 67.2 16.2 38.9 437
8632363dcc2c9fc44f582b14274259b3a35744b2 CC-BY-SA-4.0 14041 12.9 14.4 15.4 16.8 57.8 14.5 41.4 339

It speaks well of the automated readability metrics that, from 3.0 to 4.0, CC-BY-SA is the most improved (the relevant clause was a hairball paragraph; CC-BY-NC-SA should have improved less, as it also gained the compatibility mechanism) and CC-BY-ND is the least improved (it gained the express permission for private adaptations).
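
For the curious, here is a minimal sketch of how the columns above can be approximated, assuming the third-party textstat Python package and a local plain-text copy of a license (both are my assumptions; the table’s numbers come from the tooling described at the “license automated readability metrics” link, so exact scores will differ). The Chars/(Flesch>=1) column is simply the character count divided by the Flesch reading ease score, floored at 1.

    # Sketch only: approximates the table columns for one license text.
    # Assumes the third-party "textstat" package (pip install textstat);
    # scores will not exactly match the table, which used different tooling.
    import hashlib
    import textstat

    def license_metrics(path):
        text = open(path, encoding='utf-8').read()
        chars = len(text)
        flesch = textstat.flesch_reading_ease(text)
        return {
            'sha1': hashlib.sha1(text.encode('utf-8')).hexdigest(),
            'characters': chars,
            'kincaid': textstat.flesch_kincaid_grade(text),
            'ari': textstat.automated_readability_index(text),
            'coleman_liau': textstat.coleman_liau_index(text),
            'fog': textstat.gunning_fog(text),
            'lix': textstat.lix(text),
            'smog': textstat.smog_index(text),
            'flesch': flesch,
            # "Chars/(Flesch>=1)": how painful a document is to read
            'pain': round(chars / max(flesch, 1)),
        }

    # Hypothetical local file name; use whatever plain-text copy you have.
    print(license_metrics('CC-BY-SA-4.0.txt'))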

Next

I leave a list of recommendations (many already mingled in or implied by the above) to a future post. But really, just use CC0.

Metadata is technical debt

Monday, September 9th, 2013

Rob Kaye of MusicBrainz writes about their RDFa dilemma. My summary of the short post and comments:

  • Someone paid to have RDFa added to MusicBrainz pages a few years ago.
  • The code adding RDFa is brittle, hasn’t been maintained through MusicBrainz schema changes, thus is now broken.
  • There are no known consumers of the RDFa in MusicBrainz pages.
  • Unless someone volunteers to fix and maintain the RDFa, “we’re ready to remove the broken code from our pages in an effort to remove technical debt that has accumulated over the past few years.”
  • A few people want RDFa in MusicBrainz pages maintained because “Very long term I think this is a sensible way forward – the web site as its own API” and compatibility with other semantic web initiatives.
  • Some people tentatively volunteer to help.

Kaye’s post is a model for how to remove features: inform the relevant community, and ask if anyone cares and is willing to maintain the feature in question. This could be applied in a commercial context, e.g., asking customers if they’re willing to pay to maintain a feature or to keep a service alive. It is somewhat odd that transaction costs are high enough, and coordination poor enough, that such asking is not as commonplace as feature removal and service shutdown.

I’ve long liked the notion of the web site as its own API (to the extent of strongly disliking many RPC APIs for web applications), and I like RDFa, but I think most metadata implementation is premature; as with choosing a metadata format, it is best to just ignore it until there’s an unambiguous and immediate gain to be had from implementation.
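
To make “the web site as its own API” concrete, here is a minimal sketch of what a consumer might do, assuming the third-party requests and extruct Python packages; the artist URL is merely illustrative, and whether MusicBrainz pages still carry RDFa depends on the outcome of the discussion above.

    # Sketch only: extract RDFa embedded in a public page instead of calling
    # a separate RPC API. Assumes the third-party "requests" and "extruct"
    # packages; the URL is an illustrative MusicBrainz artist page, and its
    # RDFa may have been removed (which is what the post above is about).
    import requests
    import extruct

    url = 'https://musicbrainz.org/artist/1f9df192-a621-4f54-8850-2c5373b7eac9'
    html = requests.get(url, timeout=30).text
    data = extruct.extract(html, base_url=url, syntaxes=['rdfa'])

    for item in data['rdfa']:
        print(item)  # each subject with its properties, if any RDFa is present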

People pitching metadata as a solution, public or private good, are frighteningly like SEO pushers, except usually not evil. The likeness is that the benefits are vague, confusing, apparently require experts to discern and implement, and almost everyone would be better off wholly ignoring the pitchers/pushers.

I apologize for doing a bit of pitching over the years: wasting people’s time, making whatever I was actually trying to sell more difficult, lowering my intelligence (had I woken up on the other side of the bed some day, I’d have been pitching another layer of snake oil), and adding technical debt to the world.

Mitigating all this: there’s often no clear separation of “data” and “metadata”. It’s all data, of course.

Perhaps people who prefix with “meta” are another class deserving a punch in the face. Note that Kaye’s post does not include the string “meta”; I’m just exploiting the appearance of “technical debt” and “RDFa” in the same text here!

Life in the possibly bright future of the federated social indieweb

Saturday, June 8th, 2013

After about five years (2.5 year update) it’s hard not to be disappointed in the state of the federated social web. Legacy silos have only increased their dominance, abetting mass spying, and interop among federated social web experiments looks bleak (link on different topic, but analogous).

In hindsight it was disappointing 5 years ago that blogs and related (semweb 1.0?) technologies hadn’t formed the basis of the federated social web (my pet theory is that the failure is in part due to the separation of blog post/comment writing and feed reading).

Another way of looking at it is that despite negligible resources focused on the problem, much progress has been made in figuring out how to do the federated social web over the past five years. Essentially nothing recognizable as a social web application federated five years ago. There are now lots of experiments, and two of the pioneers have learned enough to determine a rewrite was necessary — Friendica→Red and the occasion for this post, StatusNet→pump.io.

Right now is a good time to try out a federated social web service (hosted elsewhere, or run your own instance) again, or for the first time.

My opinion, at the moment: pump.io has the brightest future, Diaspora appears the most featureful (inclusive of looking nice) to users, and Friendica is the best at federating with other systems. Also see a comparison of software and protocols for distributed social networking and the Federated Social Web W3C community group.

The Indie Web movement is complementary, and in small part might be seen as taking blog technologies and culture forward. When I eventually rebuild a personal site, or build a new site for an organization, indieweb tools and practices will be my first point of reference. Their Publish (on your) Own Site, Syndicate Elsewhere and Publish Elsewhere, Syndicate (to your) Own Site concepts are powerful and practical, and I think they are what a lot of people want to start with from federated social web software.

*Running StatusNet as I write, to be converted to pump.io over the next hours. The future of StatusNet is to be at GNU social.

List of Wikimania 2013 Submissions of Interest

Saturday, May 4th, 2013

It’s unlikely I’ll attend Wikimania 2013 in Hong Kong (I did attend last year in DC). In lieu of marking myself as an interested attendee of proposed sessions, my list of 32 particularly interesting-to-me proposals follows. I chose them by opening the proposal page for each of the 331 submissions that looked interesting at first glance (about 50) and weeding out some of those.

I suspect many of these proposals might be interesting reading for anyone generally curious about possible futures of Wikipedia and related, similar, and complementary projects, but not following any of these things closely.