Web data formats Δ 2009-2012

In January I used WebDataCommons' first data publication to analyze use of Creative Commons-related metadata. That publication consisted of triples (actually quads, as the URL at which each triple was found is retained) extracted from a small subset of a 2009-2010 Common Crawl corpus. On March 23 WDC announced a much, much larger data publication: all triples extracted from the entire Common Crawl 2009/2010 and 2012 corpora.

I’m not going to have time to perform a similar Creative Commons-related analysis on these new datasets (I really hope someone else does; perhaps see my previous post for inspiration), but the aggregate numbers about formats found by the two extractions are interesting enough, and not presented in an easy-to-compare form on the WDC site, that I wrote the remainder of this post.

Notes:

  • 2009/10 data taken from https://s3.amazonaws.com/webdatacommons/stats/stats.html and 2012 data taken from https://s3.amazonaws.com/webdatacommons-2/stats/stats.html. Calculated values italicized. Available as a spreadsheet.
  • The next points, indeed all comparisons, should be treated with great skepticism — it is unknown how comparable the two Common Crawl corpora are.
  • Publication of structured data on the web is growing rapidly: the share of parsed HTML URLs with triples roughly doubled, from 9.82% to 20.37%.
  • Microdata barely existed in 2009/2010, so it is hardly surprising that it has grown tremendously.
  • Overall microformats adoption seems to have stagnated, but microformats still account for the vast majority of extracted data. It is possible, however, that the large decreases in hlisting and hresume use and the increase in hrecipe use are due to one or a few large sites, or to CMSes run by many sites. This (indeed everything) bears deeper investigation. What about deployment of microformats-2, with prefixed class names that I don’t think would be matched by the WebDataCommons extractor?
  • Perhaps the most generally interesting item below doesn’t bear directly on HTML data — the proportion of URLs in the Common Crawl corpora parsed as HTML declined by 4 percentage points. Is this due to more non-HTML media or more CSS and JS?
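
One way to start checking whether a single site dominates a format: since each quad retains the URL where the triple was found, counting quads per host is straightforward. The following is a rough Python sketch, not a tested pipeline. It assumes the extraction is available as N-Quads with the page URL as the last term on each line (my assumption about the file layout). The helper name `quads_per_host` is hypothetical and the sample lines are invented for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the last <...> term before the terminating " ." of an N-Quads line,
# which is ASSUMED here to be the URL of the page the triple was found on.
GRAPH_RE = re.compile(r"<([^<>\s]+)>\s*\.\s*$")


def quads_per_host(lines):
    """Count quads per host name of the source page (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = GRAPH_RE.search(line)
        if not match:
            continue  # skip blank or malformed lines
        host = urlsplit(match.group(1)).hostname
        if host:
            counts[host] += 1
    return counts


# Invented sample data, not taken from the WDC files.
sample = [
    '_:a <http://example.org/p> "x" <http://recipes.example.com/1> .',
    '_:b <http://example.org/p> "y" <http://recipes.example.com/2> .',
    '_:c <http://example.org/p> "z" <http://other.example.net/a> .',
]
counts = quads_per_host(sample)
top_host, top_count = counts.most_common(1)[0]
print(f"{top_host}: {top_count / sum(counts.values()):.0%} of quads")
```

If one host accounts for most of the quads from a given extractor, a change in that extractor's totals says more about that site than about adoption in general.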

| | 2009/10 | 2012 | Change (% points or per URL) |
|---|---|---|---|
| Total Data (compressed) | 28.9 Terabyte | 20.9 Terabyte | |
| Total URLs | 2,804,054,789 | 1,700,611,442 | |
| Parsed HTML URLs | 2,565,741,671 | 1,486,186,868 | |
| Domains with Triples | 19,113,929 | 65,408,946 | |
| URLs with Triples | 251,855,917 | 302,809,140 | |
| Typed Entities | 1,546,905,880 | 1,222,563,749 | |
| Triples | 5,193,276,058 | 3,294,248,652 | |
| % Total URLs parsed as HTML | 91.50% | 87.39% | -4.11% |
| % HTML URLs with Triples | 9.82% | 20.37% | 10.56% |
| Typed Entities/HTML URL | 0.60 | 0.82 | 0.22 |
| Triples/HTML URL | 2.02 | 2.22 | 0.19 |
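
The calculated rows in this summary are simple ratios of the raw counts. A minimal Python sketch that reproduces them follows; the dictionary keys and the function name `derived` are my own labels, not anything taken from the WDC stats pages.

```python
# Raw counts copied from the summary figures in this post.
CORPORA = {
    "2009/10": {
        "total_urls": 2_804_054_789,
        "html_urls": 2_565_741_671,
        "urls_with_triples": 251_855_917,
        "typed_entities": 1_546_905_880,
        "triples": 5_193_276_058,
    },
    "2012": {
        "total_urls": 1_700_611_442,
        "html_urls": 1_486_186_868,
        "urls_with_triples": 302_809_140,
        "typed_entities": 1_222_563_749,
        "triples": 3_294_248_652,
    },
}


def derived(c):
    """Return the four calculated rows for one corpus (hypothetical helper)."""
    return {
        "pct_urls_parsed_as_html": 100 * c["html_urls"] / c["total_urls"],
        "pct_html_urls_with_triples": 100 * c["urls_with_triples"] / c["html_urls"],
        "typed_entities_per_html_url": c["typed_entities"] / c["html_urls"],
        "triples_per_html_url": c["triples"] / c["html_urls"],
    }


old, new = derived(CORPORA["2009/10"]), derived(CORPORA["2012"])
for key in old:
    # The change is computed from unrounded values and only rounded for display.
    print(f"{key}: {old[key]:.2f} -> {new[key]:.2f} ({new[key] - old[key]:+.2f})")
```

Because the change is taken before rounding, it can differ by 0.01 from the difference of the two rounded values shown (for example 20.37 minus 9.82 is 10.55, while the unrounded difference rounds to 10.56).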

| 2009/10 Extractor | Domains with Triples | URLs with Triples | Typed Entities | Triples | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
|---|---|---|---|---|---|---|---|---|---|
| html-rdfa | 537,820 | 14,314,036 | 26,583,558 | 293,542,991 | 0.56% | 5.68% | 2.81% | 1.72% | 5.65% |
| html-microdata | 3,930 | 56,964 | 346,411 | 1,197,115 | 0.00% | 0.02% | 0.02% | 0.02% | 0.02% |
| html-mf-geo | 244,838 | 5,051,622 | 7,326,516 | 28,831,795 | 0.20% | 2.01% | 1.28% | 0.47% | 0.56% |
| html-mf-hcalendar | 226,279 | 2,747,276 | 21,289,402 | 65,727,393 | 0.11% | 1.09% | 1.18% | 1.38% | 1.27% |
| html-mf-hcard | 12,502,500 | 83,583,167 | 973,170,050 | 3,226,066,019 | 3.26% | 33.19% | 65.41% | 62.91% | 62.12% |
| html-mf-hlisting | 31,871 | 1,227,574 | 25,660,498 | 88,146,122 | 0.05% | 0.49% | 0.17% | 1.66% | 1.70% |
| html-mf-hresume | 10,419 | 387,364 | 1,501,009 | 12,640,527 | 0.02% | 0.15% | 0.05% | 0.10% | 0.24% |
| html-mf-hreview | 216,331 | 2,836,701 | 8,234,850 | 84,411,951 | 0.11% | 1.13% | 1.13% | 0.53% | 1.63% |
| html-mf-species | 3,244 | 25,158 | 152,621 | 391,911 | 0.00% | 0.01% | 0.02% | 0.01% | 0.01% |
| html-mf-hrecipe | 13,362 | 115,345 | 695,838 | 1,228,925 | 0.00% | 0.05% | 0.07% | 0.04% | 0.02% |
| html-mf-xfn | 5,323,335 | 37,526,630 | 481,945,127 | 1,391,091,386 | 1.46% | 14.90% | 27.85% | 31.16% | 26.79% |
| html-mf-* | | | 1,519,975,911 | 4,898,536,029 | | | | 98.26% | 94.32% |

| 2012 Extractor | Domains with Triples | URLs with Triples | Typed Entities | Triples | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
|---|---|---|---|---|---|---|---|---|---|
| html-rdfa | 16,976,232 | 67,901,246 | 49,370,729 | 456,169,126 | 4.57% | 22.42% | 25.95% | 4.04% | 13.85% |
| html-microdata | 3,952,674 | 26,929,865 | 90,526,013 | 404,413,915 | 1.81% | 8.89% | 6.04% | 7.40% | 12.28% |
| html-mf-geo | 897,080 | 2,491,933 | 4,787,126 | 11,222,766 | 0.17% | 0.82% | 1.37% | 0.39% | 0.34% |
| html-mf-hcalendar | 629,319 | 1,506,379 | 27,165,545 | 65,547,870 | 0.10% | 0.50% | 0.96% | 2.22% | 1.99% |
| html-mf-hcard | 30,417,192 | 61,360,686 | 865,633,059 | 1,837,847,772 | 4.13% | 20.26% | 46.50% | 70.80% | 55.79% |
| html-mf-hlisting | 69,569 | 197,027 | 8,252,632 | 20,703,189 | 0.01% | 0.07% | 0.11% | 0.68% | 0.63% |
| html-mf-hresume | 9,890 | 20,762 | 92,346 | 432,363 | 0.00% | 0.01% | 0.02% | 0.01% | 0.01% |
| html-mf-hreview | 615,681 | 1,971,870 | 7,809,088 | 50,475,411 | 0.13% | 0.65% | 0.94% | 0.64% | 1.53% |
| html-mf-species | 4,109 | 14,033 | 139,631 | 224,847 | 0.00% | 0.00% | 0.01% | 0.01% | 0.01% |
| html-mf-hrecipe | 127,381 | 422,289 | 5,516,036 | 5,513,030 | 0.03% | 0.14% | 0.19% | 0.45% | 0.17% |
| html-mf-xfn | 11,709,819 | 26,004,925 | 163,271,544 | 441,698,363 | 1.75% | 8.59% | 17.90% | 13.35% | 13.41% |
| html-mf-* | | | 1,082,667,007 | 2,433,665,611 | | | | 88.56% | 73.88% |
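
Each percentage column in the per-extractor figures is a raw count divided by the matching corpus total, and the html-mf-* row sums typed entities and triples over the individual microformat extractors. A Python sketch for the 2012 numbers follows; the names `TOTALS`, `ROWS` and `shares` are my own, and the counts are copied from the figures in this post.

```python
# 2012 corpus totals, copied from the summary figures.
TOTALS = {
    "html_urls": 1_486_186_868,
    "urls_with_triples": 302_809_140,
    "domains_with_triples": 65_408_946,
    "typed_entities": 1_222_563_749,
    "triples": 3_294_248_652,
}

# extractor: (domains, URLs, typed entities, triples), 2012 figures.
ROWS = {
    "html-rdfa": (16_976_232, 67_901_246, 49_370_729, 456_169_126),
    "html-microdata": (3_952_674, 26_929_865, 90_526_013, 404_413_915),
    "html-mf-geo": (897_080, 2_491_933, 4_787_126, 11_222_766),
    "html-mf-hcalendar": (629_319, 1_506_379, 27_165_545, 65_547_870),
    "html-mf-hcard": (30_417_192, 61_360_686, 865_633_059, 1_837_847_772),
    "html-mf-hlisting": (69_569, 197_027, 8_252_632, 20_703_189),
    "html-mf-hresume": (9_890, 20_762, 92_346, 432_363),
    "html-mf-hreview": (615_681, 1_971_870, 7_809_088, 50_475_411),
    "html-mf-species": (4_109, 14_033, 139_631, 224_847),
    "html-mf-hrecipe": (127_381, 422_289, 5_516_036, 5_513_030),
    "html-mf-xfn": (11_709_819, 26_004_925, 163_271_544, 441_698_363),
}


def shares(row, totals=TOTALS):
    """Percentage columns for one extractor row (hypothetical helper)."""
    domains, urls, entities, triples = row
    return {
        "% HTML URLs": 100 * urls / totals["html_urls"],
        "% HTML URLs with Triples": 100 * urls / totals["urls_with_triples"],
        "% Domains with Triples": 100 * domains / totals["domains_with_triples"],
        "% Typed Entities": 100 * entities / totals["typed_entities"],
        "% Triples": 100 * triples / totals["triples"],
    }


# Aggregate html-mf-* row: sum entities and triples over all microformat extractors.
mf_rows = [row for name, row in ROWS.items() if name.startswith("html-mf-")]
mf_entities = sum(row[2] for row in mf_rows)
mf_triples = sum(row[3] for row in mf_rows)

for name, row in ROWS.items():
    cells = "  ".join(f"{value:6.2f}%" for value in shares(row).values())
    print(f"{name:<18}{cells}")
print(f"html-mf-*: {100 * mf_entities / TOTALS['typed_entities']:.2f}% of typed entities, "
      f"{100 * mf_triples / TOTALS['triples']:.2f}% of triples")
```

The sketch aggregates only typed entities and triples because those are the only html-mf-* cells the source figures fill in; presumably a URL or domain can carry several microformats, so summing those counts would double-count, though that is my guess rather than something the WDC pages state.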

2009/10 – 2012 Change (% points)

| Extractor | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
|---|---|---|---|---|---|
| html-rdfa | 4.01% | 16.74% | 23.14% | 2.32% | 8.20% |
| html-microdata | 1.81% | 8.87% | 6.02% | 7.38% | 12.25% |
| html-mf-geo | -0.03% | -1.18% | 0.09% | -0.08% | -0.21% |
| html-mf-hcalendar | -0.01% | -0.59% | -0.22% | 0.85% | 0.72% |
| html-mf-hcard | 0.87% | -12.92% | -18.91% | 7.89% | -6.33% |
| html-mf-hlisting | -0.03% | -0.42% | -0.06% | -0.98% | -1.07% |
| html-mf-hresume | -0.01% | -0.15% | -0.04% | -0.09% | -0.23% |
| html-mf-hreview | 0.02% | -0.48% | -0.19% | 0.11% | -0.09% |
| html-mf-species | 0.00% | -0.01% | -0.01% | 0.00% | 0.00% |
| html-mf-hrecipe | 0.02% | 0.09% | 0.12% | 0.41% | 0.14% |
| html-mf-xfn | 0.29% | -6.31% | -9.95% | -17.80% | -13.38% |
| html-mf-* | | | | -9.70% | -20.45% |

2009/10 – 2012 Change (%%)

| Extractor | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
|---|---|---|---|---|---|
| html-rdfa | 718.95% | 294.55% | 822.40% | 134.99% | 144.98% |
| html-microdata | 81515.61% | 39220.30% | 29290.79% | 32965.42% | 53156.82% |
| html-mf-geo | -14.84% | -58.97% | 7.07% | -17.33% | -38.64% |
| html-mf-hcalendar | -5.34% | -54.39% | -18.73% | 61.45% | 57.22% |
| html-mf-hcard | 26.74% | -38.94% | -28.91% | 12.55% | -10.19% |
| html-mf-hlisting | -72.29% | -86.65% | -36.21% | -59.31% | -62.97% |
| html-mf-hresume | -90.75% | -95.54% | -72.26% | -92.22% | -94.61% |
| html-mf-hreview | 20.01% | -42.18% | -16.83% | 19.99% | -5.73% |
| html-mf-species | -3.70% | -53.61% | -62.99% | 15.76% | -9.55% |
| html-mf-hrecipe | 532.05% | 204.50% | 178.58% | 903.02% | 607.21% |
| html-mf-xfn | 19.63% | -42.36% | -35.72% | -57.13% | -49.94% |
| html-mf-* | | | | -9.87% | -21.68% |
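
The two sets of change figures describe the same movement in two ways. The "% points" figures subtract the 2009/10 percentage from the 2012 percentage; the "%%" figures divide that difference by the 2009/10 percentage, giving a relative change. A small Python sketch using the share of all triples for three extractors follows (function and variable names are my own):

```python
def pct(part, whole):
    """Share of the whole, in percent."""
    return 100 * part / whole


def point_change(old_pct, new_pct):
    """Change in percentage points."""
    return new_pct - old_pct


def relative_change(old_pct, new_pct):
    """Change relative to the old percentage, in percent."""
    return 100 * (new_pct - old_pct) / old_pct


# Total triples per corpus, then (2009/10, 2012) triple counts per extractor.
TOTAL_TRIPLES = (5_193_276_058, 3_294_248_652)
TRIPLES = {
    "html-rdfa": (293_542_991, 456_169_126),
    "html-microdata": (1_197_115, 404_413_915),
    "html-mf-hresume": (12_640_527, 432_363),
}

results = {}
for name, (old_count, new_count) in TRIPLES.items():
    old_pct = pct(old_count, TOTAL_TRIPLES[0])
    new_pct = pct(new_count, TOTAL_TRIPLES[1])
    results[name] = (point_change(old_pct, new_pct), relative_change(old_pct, new_pct))
    print(f"{name}: {results[name][0]:+.2f} points, {results[name][1]:+.2f}% relative")
```

The relative figure is why microdata shows a five-digit percentage increase: its share of triples started from roughly 0.02%, so a gain of about 12 percentage points is an enormous multiple of the starting value.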
