I see that Web Data Commons has “extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010”. WDC publishes the extracted data as N-Quads (the fourth item denotes the immediate provenance of each subject/predicate/object triple, i.e. the URL the triple was extracted from).
I thought it would be easy and fun to run some queries on the WDC dataset to get an idea of how annotations associated with Creative Commons licensing are used. Notes below on exactly what I did. The biggest limitation is that the license statement itself is not part of the dataset: it does not appear as xhv:license in the RDFa portion, and for some reason the rel=license microformat has zero records. But cc:attributionName, cc:attributionURL, and cc:morePermissions are present in the RDFa part, as are some Dublin Core properties that the Creative Commons license chooser asks for (I only looked at dc:source), though these are probably widely used in other contexts as well.
Dataset | URLs | Distinct objects |
---|---|---|
Common Crawl 2010 corpus | 5,000,000,000a | |
1% sampled by WDC | ~50,000,000 | |
with RDFa | 158,184b | |
with a cc: property | 26,245c | |
cc:attributionName | 24,942d | 990e |
cc:attributionURL | 25,082f | 3,392g |
dc:source | 7,235h | 574i |
cc:morePermissions | 4,791j | 253k |
cc:attributionURL = dc:source | 5,421l | |
cc:attributionURL = cc:morePermissions | 1,880m | |
cc:attributionURL = subject | 203n | |
Some quick takeaways:
- Low ratio of distinct attributionURLs probably indicates HTML from the license chooser deployed without any parameterization. Often the subject or current page will be the most useful attributionURL (but the 203 above would probably be much higher with canonicalization). Note all of the CC licenses require that such a URL refer to the copyright notice or licensing information for the Work. Unless one has set up a site-wide license notice somewhere, a static URL is probably not the right thing to request in terms of requiring licensees to provide an attribution link; nor is a non-specific attribution link as useful to readers as a direct link to the work in question. As (and if) support for attribution metadata gets built into Creative Commons-aware CMSes, the ratio of distinct attributionURLs ought to increase.
- 79% of subjects with both dc:source and cc:attributionURL (6,836o) have the same values for both properties. This probably means people are merely entering their URL into every form field requesting a URL without thinking, not self-remixing.
- 47% of subjects with both cc:morePermissions and cc:attributionURL (3,977p) have the same values for both properties. Unclear why this ratio is so much lower than the previous one; it ought to be higher, as often the same value for both makes sense. Unsurprising that cc:morePermissions is the least provided property; in my experience few people understand it.
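The percentages in these takeaways can be rechecked from the counts in the table. A minimal sketch, assuming only a POSIX shell and awk; the helper name `pct` is made up for illustration:

```shell
# pct NUMERATOR DENOMINATOR -> percentage rounded to a whole number
pct() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.0f\n", 100 * n / d }'
}

pct 5421 6836    # attributionURL = dc:source, of subjects with both: 79
pct 1880 3977    # attributionURL = morePermissions, of subjects with both: 47
pct 3392 25082   # distinct attributionURL objects per statement: 14
pct 990 24942    # distinct attributionName objects per statement: 4
```

The last two lines put a number on the "low ratio of distinct" observation: roughly one distinct attributionURL per seven statements.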
I did not look at the provenance item at all. It’d be interesting to see what kind of assertions are being made across authority boundaries (e.g., a page on example.com makes statements with an example.net URI as the subject) and when to discard such assertions. I also barely looked directly at the raw data, just enough to sense that my aggregate numbers could be reasonably accurate. A closer inspection of smaller samples might yield additional insights, refining other aggregate queries.
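A first pass at the cross-authority question could compare the host of each subject with the host of the fourth (provenance) item in the raw N-Quads. This is only a rough sketch I have not run against the real file: the function name `cross_authority` is made up, it assumes each quad sits on one line ending in a space-separated `.`, and it ignores blank-node subjects and anything that is not an http(s) IRI.

```shell
# Reads N-Quads on stdin; prints "subject-host provenance-host" for every
# quad whose subject IRI is on a different host than the page it came from.
cross_authority() {
  awk '
    function host(iri) {
      sub("^<[a-z]+://", "", iri)   # drop the opening bracket and the scheme
      sub("[/>?#].*$", "", iri)     # drop everything after the host
      return iri
    }
    $1 ~ "^<https?://" && $NF == "." && $(NF-1) ~ "^<https?://" {
      s = host($1); g = host($(NF-1))
      if (s != g) print s, g
    }
  '
}
```

Something like `cross_authority < ccrdf.html-rdfa.nq | sort | uniq -c | sort -rn | head` would then list the most common host pairs. Treating `www.example.com` and `example.com` as different authorities is a deliberate simplification here.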
I look forward to future extracts. Thanks indirectly to Common Crawl for providing the crawl!
Please point out any egregious mistakes made below…
    # a I don't really know if the October 2010 corpus is the
    # entire 5 billion Common Crawl corpus

    # download RDFa extract from Web Data Commons
    wget -c https://s3.amazonaws.com/ccrdf1p/data/ccrdf.html-rdfa.nq

    # Matches number stated at
    # http://page.mi.fu-berlin.de/muehleis/ccrdf/stats1p.html#html-rdfa
    wc -l ccrdf.html-rdfa.nq
    1047250

    # Includes easy to use no-server triplestore
    apt-get install redland-utils

    # sanity check
    grep '<http://creativecommons.org/ns#attributionName>' ccrdf.html-rdfa.nq |wc -l
    26404

    # Import rejects a number of triples for syntax errors
    rdfproc xyz parse ccrdf.html-rdfa.nq nquads

    # d Perhaps syntax errors explains fewer triples than above grep might
    # indicate, but close enough
    rdfproc xyz query sparql - 'select ?o where { ?s <http://creativecommons.org/ns#attributionName> ?o}' |wc -l
    24942

    # These replicated below with 4store because...
    rdfproc xyz query sparql - 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionName> ?o}' |wc -l
    990
    rdfproc xyz query sparql - 'select ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o}' |wc -l
    25082
    rdfproc xyz query sparql - 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o}' |wc -l
    3392
    rdfproc xyz query sparql - 'select ?o where { ?o <http://creativecommons.org/ns#attributionURL> ?o }' |wc -l
    203
    rdfproc xyz query sparql - 'select ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o}' |wc -l
    4791
    rdfproc xyz query sparql - 'select distinct ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o}' |wc -l
    253
    rdfproc xyz query sparql - 'select ?o where { ?o <http://creativecommons.org/ns#morePermissions> ?o }' |wc -l
    12

    # ...this query takes forever, hours, and I have no idea why
    rdfproc xyz query sparql - 'select ?s, ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o ; <http://creativecommons.org/ns#attributionURL> ?o }'

    # 4store has a server, but is lightweight
    apt-get install 4store

    # 4store couldn't import with syntax errors, so export good triples from
    # previous store first
    rdfproc xyz serialize > ccrdf.export-rdfa.rdf

    # import into 4store
    curl -T ccrdf.export-rdfa.rdf 'http://localhost:8080/data/wdc'

    # egrep is to get rid of headers and status output prefixed by ? or #
    4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionName> ?o}' |egrep -v '^[\?\#]' |wc -l
    24942

    #f
    4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o}' |egrep -v '^[\?\#]' |wc -l
    25082

    #j
    4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o}' |egrep -v '^[\?\#]' |wc -l
    4791

    #h
    #Of course please use http://purl.org/dc/terms/source instead.
    #Should be more widely deployed soon.
    4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://purl.org/dc/elements/1.1/source> ?o}' |egrep -v '^[\?\#]' |wc -l
    7235
    4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://purl.org/dc/terms/source> ?o}' |egrep -v '^[\?\#]' |wc -l
    4

    #e
    4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionName> ?o}' |egrep -v '^[\?\#]' |wc -l
    990

    #g
    4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o}' |egrep -v '^[\?\#]' |wc -l
    3392

    #k
    4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#morePermissions> ?o}' |egrep -v '^[\?\#]' |wc -l
    253

    #i
    4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://purl.org/dc/elements/1.1/source> ?o}' |egrep -v '^[\?\#]' |wc -l
    574

    #n
    4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://creativecommons.org/ns#attributionURL> ?o}' |egrep -v '^[\?\#]' |wc -l
    203
    4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://creativecommons.org/ns#morePermissions> ?o}' |egrep -v '^[\?\#]' |wc -l
    12
    4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://purl.org/dc/elements/1.1/source> ?o}' |egrep -v '^[\?\#]' |wc -l
    120

    #m
    4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://creativecommons.org/ns#morePermissions> ?o }' |egrep -v '^[\?\#]' |wc -l
    1880
    4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://creativecommons.org/ns#morePermissions> ?o }' |egrep -v '^[\?\#]' |wc -l
    122
    4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://creativecommons.org/ns#attributionURL> ?o ; <http://creativecommons.org/ns#morePermissions> ?o }' |egrep -v '^[\?\#]' |wc -l
    8

    #l
    4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?o }' |egrep -v '^[\?\#]' |wc -l
    5421
    4s-query wdc -s '-1' -f text 'select distinct ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?o }' |egrep -v '^[\?\#]' |wc -l
    358
    4s-query wdc -s '-1' -f text 'select ?o where { ?o <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?o }' |egrep -v '^[\?\#]' |wc -l
    11

    #p
    4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://creativecommons.org/ns#morePermissions> ?n }' |egrep -v '^[\?\#]' |wc -l
    3977

    #o
    4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?n }' |egrep -v '^[\?\#]' |wc -l
    6836
    4s-query wdc -s '-1' -f text 'select ?s, ?o, ?n, ?m where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?n ; <http://creativecommons.org/ns#morePermissions> ?m }' |egrep -v '^[\?\#]' |wc -l
    2946
    4s-query wdc -s '-1' -f text 'select ?s, ?o where { ?s <http://creativecommons.org/ns#attributionURL> ?o ; <http://purl.org/dc/elements/1.1/source> ?o ; <http://creativecommons.org/ns#morePermissions> ?o }' |egrep -v '^[\?\#]' |wc -l
    1604

    #c
    4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <http://creativecommons.org/ns#attributionURL> ?o } UNION { ?s <http://creativecommons.org/ns#attributionName> ?n } UNION { ?s <http://creativecommons.org/ns#morePermissions> ?m } }' |egrep -v '^[\?\#]' |wc -l
    26245
    4s-query wdc -s '-1' -f text 'select distinct ?s where { { ?s <http://creativecommons.org/ns#attributionURL> ?o } UNION { ?s <http://creativecommons.org/ns#attributionName> ?n }}' |egrep -v '^[\?\#]' |wc -l
    25433

    #b note subjects not the same as pages data extracted from (158,184)
    4s-query wdc -s '-1' -f text 'select distinct ?s where { ?s ?p ?o }' |egrep -v '^[\?\#]' |wc -l
    264307

    # Probably less than 1047250 claimed due to syntax errors
    4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?o }' |egrep -v '^[\?\#]' |wc -l
    968786
    4s-query wdc -s '-1' -f text 'select ?s where { ?s ?p ?s }' |egrep -v '^[\?\#]' |wc -l
    2415
    4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?s }' |egrep -v '^[\?\#]' |wc -l
    0
    4s-query wdc -s '-1' -f text 'select ?s where { ?s ?s ?o }' |egrep -v '^[\?\#]' |wc -l
    0
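The `?o <…attributionURL> ?o` queries (note n) compare subject and object as exact strings, so trivially different spellings of the same URL do not match. A crude canonicalization pass over the raw N-Quads might look like the sketch below. It is untested against the real extract, the function name `self_attribution_count` is hypothetical, and the normalization rules (drop the scheme, a leading `www.`, any fragment, and a trailing slash) are my own guesses at what matters rather than an established convention.

```shell
# Reads N-Quads on stdin; counts cc:attributionURL statements whose object
# equals the subject after a rough normalization of both IRIs.
self_attribution_count() {
  awk '
    function canon(iri) {
      gsub("^<|>$", "", iri)                # strip the angle brackets
      sub("#.*$", "", iri)                  # drop any fragment
      sub("^https?://(www[.])?", "", iri)   # drop scheme and leading www.
      sub("/$", "", iri)                    # drop one trailing slash
      return iri
    }
    $2 == "<http://creativecommons.org/ns#attributionURL>" &&
      canon($1) == canon($3) { n++ }
    END { print n + 0 }
  '
}
```

Run as `self_attribution_count < ccrdf.html-rdfa.nq`. Because it reads the raw file rather than the triplestore, it also sees lines that the rdfproc import rejected, so the result is not directly comparable with 203.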
Maybe Creative Commons could publish some samples of what XML snippets to include in the top of XML documents to license that XML document under the various CC licenses.
For example, I may have a generic XML document, or an HTML document, that I want to license CC0, but I want that as a machine-readable notice in the HTML/XML source without affecting the rendered display of the page. Hence I need to add something like:
image/svg+xml
Map Change Markers
Andrew Harvey <andrew.harvey4@gmail.com>
CC0
https://github.com/andrewharvey/map-change-markers
Inkscape does a great job at this. But where are the guidelines for applying this to HTML?
I also like the idea of using an HTTP header like X-License: http://creativecommons.org/publicdomain/zero/1.0/ to indicate the license of various resources served by a webserver.
Andrew, RDFa can be used to annotate XML; that’s what I’d recommend. I believe that’s what both ODF and EPUB (or maybe their next versions) do. http://rdfa.info/wiki/RDFa_Host_Languages isn’t entirely up to date but has some relevant references.
For HTML, RDFa also. See http://labs.creativecommons.org/2011/ccrel-guide/
Conveying license metadata in HTTP headers has been proposed many times. I don’t think it makes sense to do independently, but if people settle on a convention for specifying predicate-object in headers (subject presumably being the resource requested) I’d be happy to see it used, hopefully in addition to human-readable mechanisms.