In January I used WebDataCommons' first data publication (triples, or actually quads, since the URL at which each triple was found is retained, extracted from a small subset of a 2009-2010 Common Crawl corpus) to analyze use of Creative Commons-related metadata. On March 23 WDC announced a much, much larger data publication: all triples extracted from the entire Common Crawl 2009/2010 and 2012 corpora.
I’m not going to have time to perform a similar Creative Commons-related analysis on these new datasets (I really hope someone else does; perhaps see my previous post for inspiration) but aggregate numbers about formats found by the two extractions are interesting enough, and not presented in an easy to compare format on the WDC site, for me to write the remainder of this post.
Notes:
- 2009/10 data taken from https://s3.amazonaws.com/webdatacommons/stats/stats.html and 2012 data taken from https://s3.amazonaws.com/webdatacommons-2/stats/stats.html. Calculated values italicized. Available as a spreadsheet.
- The next points, indeed all comparisons, should be treated with great skepticism — it is unknown how comparable the two Common Crawl corpora are.
- Publication of structured data on the web is growing rapidly.
- Microdata barely existed in 2009/2010, so it is hardly surprising that it has grown tremendously.
- Overall microformats adoption seems to have stagnated, but microformats still hold the vast majority of extracted data. It is possible, however, that the large decreases in hlisting and hresume and the increase in hrecipe use are due to one or a few large sites, or to CMSes run by many sites. This (indeed everything) bears deeper investigation. What about deployment of microformats-2 with prefixed class names, which I don’t think would be matched by the WebDataCommons extractor?
- Perhaps the most generally interesting item below doesn’t bear directly on HTML data — the proportion of URLs in the Common Crawl corpora parsed as HTML declined by 4 percentage points. Is this due to more non-HTML media or more CSS and JS?
Measure | 2009/10 | 2012 | Change (% points or per URL) |
Total Data (compressed) | 28.9 Terabyte | 20.9 Terabyte | |
Total URLs | 2,804,054,789 | 1,700,611,442 | |
Parsed HTML URLs | 2,565,741,671 | 1,486,186,868 | |
Domains with Triples | 19,113,929 | 65,408,946 | |
URLs with Triples | 251,855,917 | 302,809,140 | |
Typed Entities | 1,546,905,880 | 1,222,563,749 | |
Triples | 5,193,276,058 | 3,294,248,652 | |
% Total URLs parsed as HTML | 91.50% | 87.39% | -4.11% |
% HTML URLs with Triples | 9.82% | 20.37% | 10.56% |
Typed Entities/HTML URL | 0.60 | 0.82 | 0.22 |
Triples/HTML URL | 2.02 | 2.22 | 0.19 |
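As a rough check on the calculated rows, the sketch below recomputes them from the raw totals in the summary table. The dictionary keys and the `derived` helper are illustrative names of my own, not anything published by WDC; it assumes each calculated value is a simple ratio of the raw counts.

```python
# Raw totals copied from the summary table (2009/10 and 2012 corpora).
totals = {
    "2009/10": {
        "urls": 2_804_054_789,
        "html_urls": 2_565_741_671,
        "urls_with_triples": 251_855_917,
        "typed_entities": 1_546_905_880,
        "triples": 5_193_276_058,
    },
    "2012": {
        "urls": 1_700_611_442,
        "html_urls": 1_486_186_868,
        "urls_with_triples": 302_809_140,
        "typed_entities": 1_222_563_749,
        "triples": 3_294_248_652,
    },
}


def derived(t):
    """Return the four calculated rows for one corpus (hypothetical helper)."""
    return {
        "% Total URLs parsed as HTML": 100 * t["html_urls"] / t["urls"],
        "% HTML URLs with Triples": 100 * t["urls_with_triples"] / t["html_urls"],
        "Typed Entities/HTML URL": t["typed_entities"] / t["html_urls"],
        "Triples/HTML URL": t["triples"] / t["html_urls"],
    }


old = derived(totals["2009/10"])
new = derived(totals["2012"])
# Change is a plain difference: percentage points for the % rows,
# and a per-URL difference for the two ratio rows.
change = {row: new[row] - old[row] for row in old}

for row in old:
    print(f"{row}: {old[row]:.2f} | {new[row]:.2f} | {change[row]:+.2f}")
```

The changes are computed from unrounded values, which is presumably why 20.37% minus 9.82% appears in the table as 10.56 rather than 10.55.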
2009/10 Extractor | Domains with Triples | URLs with Triples | Typed Entities | Triples | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
html-rdfa | 537,820 | 14,314,036 | 26,583,558 | 293,542,991 | 0.56% | 5.68% | 2.81% | 1.72% | 5.65% |
html-microdata | 3,930 | 56,964 | 346,411 | 1,197,115 | 0.00% | 0.02% | 0.02% | 0.02% | 0.02% |
html-mf-geo | 244,838 | 5,051,622 | 7,326,516 | 28,831,795 | 0.20% | 2.01% | 1.28% | 0.47% | 0.56% |
html-mf-hcalendar | 226,279 | 2,747,276 | 21,289,402 | 65,727,393 | 0.11% | 1.09% | 1.18% | 1.38% | 1.27% |
html-mf-hcard | 12,502,500 | 83,583,167 | 973,170,050 | 3,226,066,019 | 3.26% | 33.19% | 65.41% | 62.91% | 62.12% |
html-mf-hlisting | 31,871 | 1,227,574 | 25,660,498 | 88,146,122 | 0.05% | 0.49% | 0.17% | 1.66% | 1.70% |
html-mf-hresume | 10,419 | 387,364 | 1,501,009 | 12,640,527 | 0.02% | 0.15% | 0.05% | 0.10% | 0.24% |
html-mf-hreview | 216,331 | 2,836,701 | 8,234,850 | 84,411,951 | 0.11% | 1.13% | 1.13% | 0.53% | 1.63% |
html-mf-species | 3,244 | 25,158 | 152,621 | 391,911 | 0.00% | 0.01% | 0.02% | 0.01% | 0.01% |
html-mf-hrecipe | 13,362 | 115,345 | 695,838 | 1,228,925 | 0.00% | 0.05% | 0.07% | 0.04% | 0.02% |
html-mf-xfn | 5,323,335 | 37,526,630 | 481,945,127 | 1,391,091,386 | 1.46% | 14.90% | 27.85% | 31.16% | 26.79% |
html-mf-* | | | 1,519,975,911 | 4,898,536,029 | | | | 98.26% | 94.32% |
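The html-mf-* row can be reproduced by summing the Typed Entities and Triples columns over the nine microformat extractors, as in this minimal sketch using the 2009/10 figures. The variable names are my own. Domains and URLs are not summed here, on the assumption that a single page or domain can carry several microformats, so the per-extractor counts would overlap.

```python
# (typed entities, triples) per microformat extractor, 2009/10 table.
mf_2009 = {
    "html-mf-geo": (7_326_516, 28_831_795),
    "html-mf-hcalendar": (21_289_402, 65_727_393),
    "html-mf-hcard": (973_170_050, 3_226_066_019),
    "html-mf-hlisting": (25_660_498, 88_146_122),
    "html-mf-hresume": (1_501_009, 12_640_527),
    "html-mf-hreview": (8_234_850, 84_411_951),
    "html-mf-species": (152_621, 391_911),
    "html-mf-hrecipe": (695_838, 1_228_925),
    "html-mf-xfn": (481_945_127, 1_391_091_386),
}

# Corpus-wide totals from the summary table (all extractors, 2009/10).
all_entities = 1_546_905_880
all_triples = 5_193_276_058

mf_entities = sum(entities for entities, _ in mf_2009.values())
mf_triples = sum(triples for _, triples in mf_2009.values())

pct_entities = 100 * mf_entities / all_entities
pct_triples = 100 * mf_triples / all_triples

print(f"html-mf-* typed entities: {mf_entities:,} ({pct_entities:.2f}%)")
print(f"html-mf-* triples: {mf_triples:,} ({pct_triples:.2f}%)")
```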
2012 Extractor | Domains with Triples | URLs with Triples | Typed Entities | Triples | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
html-rdfa | 16,976,232 | 67,901,246 | 49,370,729 | 456,169,126 | 4.57% | 22.42% | 25.95% | 4.04% | 13.85% |
html-microdata | 3,952,674 | 26,929,865 | 90,526,013 | 404,413,915 | 1.81% | 8.89% | 6.04% | 7.40% | 12.28% |
html-mf-geo | 897,080 | 2,491,933 | 4,787,126 | 11,222,766 | 0.17% | 0.82% | 1.37% | 0.39% | 0.34% |
html-mf-hcalendar | 629,319 | 1,506,379 | 27,165,545 | 65,547,870 | 0.10% | 0.50% | 0.96% | 2.22% | 1.99% |
html-mf-hcard | 30,417,192 | 61,360,686 | 865,633,059 | 1,837,847,772 | 4.13% | 20.26% | 46.50% | 70.80% | 55.79% |
html-mf-hlisting | 69,569 | 197,027 | 8,252,632 | 20,703,189 | 0.01% | 0.07% | 0.11% | 0.68% | 0.63% |
html-mf-hresume | 9,890 | 20,762 | 92,346 | 432,363 | 0.00% | 0.01% | 0.02% | 0.01% | 0.01% |
html-mf-hreview | 615,681 | 1,971,870 | 7,809,088 | 50,475,411 | 0.13% | 0.65% | 0.94% | 0.64% | 1.53% |
html-mf-species | 4,109 | 14,033 | 139,631 | 224,847 | 0.00% | 0.00% | 0.01% | 0.01% | 0.01% |
html-mf-hrecipe | 127,381 | 422,289 | 5,516,036 | 5,513,030 | 0.03% | 0.14% | 0.19% | 0.45% | 0.17% |
html-mf-xfn | 11,709,819 | 26,004,925 | 163,271,544 | 441,698,363 | 1.75% | 8.59% | 17.90% | 13.35% | 13.41% |
html-mf-* | | | 1,082,667,007 | 2,433,665,611 | | | | 88.56% | 73.88% |
2009/10 – 2012 Change (% points) | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
html-rdfa | 4.01% | 16.74% | 23.14% | 2.32% | 8.20% |
html-microdata | 1.81% | 8.87% | 6.02% | 7.38% | 12.25% |
html-mf-geo | -0.03% | -1.18% | 0.09% | -0.08% | -0.21% |
html-mf-hcalendar | -0.01% | -0.59% | -0.22% | 0.85% | 0.72% |
html-mf-hcard | 0.87% | -12.92% | -18.91% | 7.89% | -6.33% |
html-mf-hlisting | -0.03% | -0.42% | -0.06% | -0.98% | -1.07% |
html-mf-hresume | -0.01% | -0.15% | -0.04% | -0.09% | -0.23% |
html-mf-hreview | 0.02% | -0.48% | -0.19% | 0.11% | -0.09% |
html-mf-species | 0.00% | -0.01% | -0.01% | 0.00% | 0.00% |
html-mf-hrecipe | 0.02% | 0.09% | 0.12% | 0.41% | 0.14% |
html-mf-xfn | 0.29% | -6.31% | -9.95% | -17.80% | -13.38% |
html-mf-* | | | | -9.70% | -20.45% |
2009/10 – 2012 Change (%%, i.e. relative change in each percentage) | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
html-rdfa | 718.95% | 294.55% | 822.40% | 134.99% | 144.98% |
html-microdata | 81515.61% | 39220.30% | 29290.79% | 32965.42% | 53156.82% |
html-mf-geo | -14.84% | -58.97% | 7.07% | -17.33% | -38.64% |
html-mf-hcalendar | -5.34% | -54.39% | -18.73% | 61.45% | 57.22% |
html-mf-hcard | 26.74% | -38.94% | -28.91% | 12.55% | -10.19% |
html-mf-hlisting | -72.29% | -86.65% | -36.21% | -59.31% | -62.97% |
html-mf-hresume | -90.75% | -95.54% | -72.26% | -92.22% | -94.61% |
html-mf-hreview | 20.01% | -42.18% | -16.83% | 19.99% | -5.73% |
html-mf-species | -3.70% | -53.61% | -62.99% | 15.76% | -9.55% |
html-mf-hrecipe | 532.05% | 204.50% | 178.58% | 903.02% | 607.21% |
html-mf-xfn | 19.63% | -42.36% | -35.72% | -57.13% | -49.94% |
html-mf-* | | | | -9.87% | -21.68% |
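The two change tables measure different things: the first is a difference in percentage points, the second is that difference relative to the 2009/10 share. A minimal sketch for one cell (html-rdfa, % Triples) follows. The `share_change` helper is hypothetical, and it assumes both tables were computed from unrounded shares.

```python
def share_change(old_part, old_total, new_part, new_total):
    """Return (old share %, new share %, change in % points, relative change in %)."""
    old_share = 100 * old_part / old_total
    new_share = 100 * new_part / new_total
    points = new_share - old_share
    relative = 100 * points / old_share
    return old_share, new_share, points, relative


# html-rdfa triples against all triples in each corpus.
old_share, new_share, points, relative = share_change(
    293_542_991, 5_193_276_058,  # 2009/10: rdfa triples, total triples
    456_169_126, 3_294_248_652,  # 2012: rdfa triples, total triples
)

print(f"share: {old_share:.2f}% -> {new_share:.2f}%")
print(f"change: {points:+.2f} points, {relative:+.2f}% relative")
```

The relative figure explains the enormous microdata numbers: a share that starts near zero produces a huge relative change even when the gain in percentage points is modest.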