Archive for March, 2012

Web data formats Δ 2009-2012

Saturday, March 31st, 2012

In January I used WebDataCommons’ first data publication (triples, or actually quads, as the URL where each triple was found is retained, extracted from a small subset of a 2009-2010 Common Crawl corpus) to analyze use of Creative Commons-related metadata. On March 23 WDC announced a much, much larger data publication: all triples extracted from the entire Common Crawl 2009/2010 and 2012 corpora.

I’m not going to have time to perform a similar Creative Commons-related analysis on these new datasets (I really hope someone else does; perhaps see my previous post for inspiration) but aggregate numbers about formats found by the two extractions are interesting enough, and not presented in an easy to compare format on the WDC site, for me to write the remainder of this post.


  • 2009/10 and 2012 data are each taken from the corresponding statistics on the WDC site. Calculated values italicized. Available as a spreadsheet.
  • The next points, indeed all comparisons, should be treated with great skepticism — it is unknown how comparable the two Common Crawl corpora are.
  • Publication of structured data on the web is growing rapidly.
  • Microdata barely existed in 2009/2010, so it is hardly surprising that it has grown tremendously.
  • Overall microformats adoption seems to have stagnated, but microformats still hold the vast majority of extracted data. It is possible, however, that the large decreases in hlisting and hresume and the increase in hrecipe use are due to one or a few large sites, or to CMSes run by many sites. This (indeed everything) bears deeper investigation. What about deployment of microformats-2, with prefixed class names that I don’t think would be matched by the WebDataCommons extractor?
  • Perhaps the most generally interesting item below doesn’t bear directly on HTML data — the proportion of URLs in the Common Crawl corpora parsed as HTML declined by 4 percentage points. Is this due to more non-HTML media or more CSS and JS?

| | 2009/10 | 2012 | Change (% points or per URL) |
| --- | --- | --- | --- |
| Total Data (compressed) | 28.9 Terabyte | 20.9 Terabyte | |
| Total URLs | 2,804,054,789 | 1,700,611,442 | |
| Parsed HTML URLs | 2,565,741,671 | 1,486,186,868 | |
| Domains with Triples | 19,113,929 | 65,408,946 | |
| URLs with Triples | 251,855,917 | 302,809,140 | |
| Typed Entities | 1,546,905,880 | 1,222,563,749 | |
| Triples | 5,193,276,058 | 3,294,248,652 | |
| % Total URLs parsed as HTML | 91.50% | 87.39% | -4.11% |
| % HTML URLs with Triples | 9.82% | 20.37% | 10.56% |
| Typed Entities/HTML URL | 0.60 | 0.82 | 0.22 |
| Triples/HTML URL | 2.02 | 2.22 | 0.19 |
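
The calculated rows of the summary table can be reproduced from the raw counts. A minimal Python sketch, assuming the counts exactly as quoted; the dictionary layout and the helper name `derived` are hypothetical and only for illustration:

```python
# Raw counts copied from the summary table.
counts = {
    "2009/10": {
        "total_urls": 2_804_054_789,
        "html_urls": 2_565_741_671,
        "urls_with_triples": 251_855_917,
        "typed_entities": 1_546_905_880,
        "triples": 5_193_276_058,
    },
    "2012": {
        "total_urls": 1_700_611_442,
        "html_urls": 1_486_186_868,
        "urls_with_triples": 302_809_140,
        "typed_entities": 1_222_563_749,
        "triples": 3_294_248_652,
    },
}


def derived(c):
    """Return the four calculated values for one corpus (hypothetical helper)."""
    return {
        "pct_urls_parsed_as_html": 100 * c["html_urls"] / c["total_urls"],
        "pct_html_urls_with_triples": 100 * c["urls_with_triples"] / c["html_urls"],
        "typed_entities_per_html_url": c["typed_entities"] / c["html_urls"],
        "triples_per_html_url": c["triples"] / c["html_urls"],
    }


old, new = derived(counts["2009/10"]), derived(counts["2012"])
for name in old:
    # The change is a plain difference: percentage points for the two
    # percentage rows, and a per-URL difference for the two ratio rows.
    print(f"{name}: {old[name]:.2f} -> {new[name]:.2f} ({new[name] - old[name]:+.2f})")
```

The changes are computed from unrounded values, which is why a printed difference can be off by 0.01 from what subtracting the rounded table cells would give.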

2009/10

| Extractor | Domains with Triples | URLs with Triples | Typed Entities | Triples | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| html-rdfa | 537,820 | 14,314,036 | 26,583,558 | 293,542,991 | 0.56% | 5.68% | 2.81% | 1.72% | 5.65% |
| html-microdata | 3,930 | 56,964 | 346,411 | 1,197,115 | 0.00% | 0.02% | 0.02% | 0.02% | 0.02% |
| html-mf-geo | 244,838 | 5,051,622 | 7,326,516 | 28,831,795 | 0.20% | 2.01% | 1.28% | 0.47% | 0.56% |
| html-mf-hcalendar | 226,279 | 2,747,276 | 21,289,402 | 65,727,393 | 0.11% | 1.09% | 1.18% | 1.38% | 1.27% |
| html-mf-hcard | 12,502,500 | 83,583,167 | 973,170,050 | 3,226,066,019 | 3.26% | 33.19% | 65.41% | 62.91% | 62.12% |
| html-mf-hlisting | 31,871 | 1,227,574 | 25,660,498 | 88,146,122 | 0.05% | 0.49% | 0.17% | 1.66% | 1.70% |
| html-mf-hresume | 10,419 | 387,364 | 1,501,009 | 12,640,527 | 0.02% | 0.15% | 0.05% | 0.10% | 0.24% |
| html-mf-hreview | 216,331 | 2,836,701 | 8,234,850 | 84,411,951 | 0.11% | 1.13% | 1.13% | 0.53% | 1.63% |
| html-mf-species | 3,244 | 25,158 | 152,621 | 391,911 | 0.00% | 0.01% | 0.02% | 0.01% | 0.01% |
| html-mf-hrecipe | 13,362 | 115,345 | 695,838 | 1,228,925 | 0.00% | 0.05% | 0.07% | 0.04% | 0.02% |
| html-mf-xfn | 5,323,335 | 37,526,630 | 481,945,127 | 1,391,091,386 | 1.46% | 14.90% | 27.85% | 31.16% | 26.79% |
| html-mf-* total | | | 1,519,975,911 | 4,898,536,029 | | | | 98.26% | 94.32% |

2012

| Extractor | Domains with Triples | URLs with Triples | Typed Entities | Triples | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| html-rdfa | 16,976,232 | 67,901,246 | 49,370,729 | 456,169,126 | 4.57% | 22.42% | 25.95% | 4.04% | 13.85% |
| html-microdata | 3,952,674 | 26,929,865 | 90,526,013 | 404,413,915 | 1.81% | 8.89% | 6.04% | 7.40% | 12.28% |
| html-mf-geo | 897,080 | 2,491,933 | 4,787,126 | 11,222,766 | 0.17% | 0.82% | 1.37% | 0.39% | 0.34% |
| html-mf-hcalendar | 629,319 | 1,506,379 | 27,165,545 | 65,547,870 | 0.10% | 0.50% | 0.96% | 2.22% | 1.99% |
| html-mf-hcard | 30,417,192 | 61,360,686 | 865,633,059 | 1,837,847,772 | 4.13% | 20.26% | 46.50% | 70.80% | 55.79% |
| html-mf-hlisting | 69,569 | 197,027 | 8,252,632 | 20,703,189 | 0.01% | 0.07% | 0.11% | 0.68% | 0.63% |
| html-mf-hresume | 9,890 | 20,762 | 92,346 | 432,363 | 0.00% | 0.01% | 0.02% | 0.01% | 0.01% |
| html-mf-hreview | 615,681 | 1,971,870 | 7,809,088 | 50,475,411 | 0.13% | 0.65% | 0.94% | 0.64% | 1.53% |
| html-mf-species | 4,109 | 14,033 | 139,631 | 224,847 | 0.00% | 0.00% | 0.01% | 0.01% | 0.01% |
| html-mf-hrecipe | 127,381 | 422,289 | 5,516,036 | 5,513,030 | 0.03% | 0.14% | 0.19% | 0.45% | 0.17% |
| html-mf-xfn | 11,709,819 | 26,004,925 | 163,271,544 | 441,698,363 | 1.75% | 8.59% | 17.90% | 13.35% | 13.41% |
| html-mf-* total | | | 1,082,667,007 | 2,433,665,611 | | | | 88.56% | 73.88% |
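
2009/10 – 2012 Change (% points)

| Extractor | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
| --- | --- | --- | --- | --- | --- |
| html-rdfa | 4.01% | 16.74% | 23.14% | 2.32% | 8.20% |
| html-microdata | 1.81% | 8.87% | 6.02% | 7.38% | 12.25% |
| html-mf-geo | -0.03% | -1.18% | 0.09% | -0.08% | -0.21% |
| html-mf-hcalendar | -0.01% | -0.59% | -0.22% | 0.85% | 0.72% |
| html-mf-hcard | 0.87% | -12.92% | -18.91% | 7.89% | -6.33% |
| html-mf-hlisting | -0.03% | -0.42% | -0.06% | -0.98% | -1.07% |
| html-mf-hresume | -0.01% | -0.15% | -0.04% | -0.09% | -0.23% |
| html-mf-hreview | 0.02% | -0.48% | -0.19% | 0.11% | -0.09% |
| html-mf-species | 0.00% | -0.01% | -0.01% | 0.00% | 0.00% |
| html-mf-hrecipe | 0.02% | 0.09% | 0.12% | 0.41% | 0.14% |
| html-mf-xfn | 0.29% | -6.31% | -9.95% | -17.80% | -13.38% |
| html-mf-* total | | | | -9.70% | -20.45% |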
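
2009/10 – 2012 Change (%%)

| Extractor | % HTML URLs | % HTML URLs with Triples | % Domains with Triples | % Typed Entities | % Triples |
| --- | --- | --- | --- | --- | --- |
| html-rdfa | 718.95% | 294.55% | 822.40% | 134.99% | 144.98% |
| html-microdata | 81515.61% | 39220.30% | 29290.79% | 32965.42% | 53156.82% |
| html-mf-geo | -14.84% | -58.97% | 7.07% | -17.33% | -38.64% |
| html-mf-hcalendar | -5.34% | -54.39% | -18.73% | 61.45% | 57.22% |
| html-mf-hcard | 26.74% | -38.94% | -28.91% | 12.55% | -10.19% |
| html-mf-hlisting | -72.29% | -86.65% | -36.21% | -59.31% | -62.97% |
| html-mf-hresume | -90.75% | -95.54% | -72.26% | -92.22% | -94.61% |
| html-mf-hreview | 20.01% | -42.18% | -16.83% | 19.99% | -5.73% |
| html-mf-species | -3.70% | -53.61% | -62.99% | 15.76% | -9.55% |
| html-mf-hrecipe | 532.05% | 204.50% | 178.58% | 903.02% | 607.21% |
| html-mf-xfn | 19.63% | -42.36% | -35.72% | -57.13% | -49.94% |
| html-mf-* total | | | | -9.87% | -21.68% |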
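
The two kinds of change reported here differ only in the final step: a percentage-point change is a difference of shares, while the "%%" change is the relative growth of the share itself. A minimal Python sketch for one cell (html-rdfa, % HTML URLs), assuming the counts quoted in this post; the helper name `share` is hypothetical:

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100 * part / whole


# html-rdfa URLs with triples, divided by parsed HTML URLs, for each corpus.
rdfa_2009 = share(14_314_036, 2_565_741_671)
rdfa_2012 = share(67_901_246, 1_486_186_868)

# "Change (% points)": difference between the two shares.
point_change = rdfa_2012 - rdfa_2009

# "Change (%%)": growth of the share relative to its 2009/10 value.
relative_change = 100 * (rdfa_2012 / rdfa_2009 - 1)

print(f"share: {rdfa_2009:.2f}% -> {rdfa_2012:.2f}%")
print(f"change: {point_change:+.2f} points, {relative_change:+.2f}% relative")
```

Because the relative change divides by the 2009/10 share, tiny starting shares (microdata, hrecipe) produce very large percentages, so those cells say more about the starting point than about absolute adoption.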

Bad Ideas of 2004 March

Thursday, March 15th, 2012

Morte di Cesare IDEA

I’ve already covered the main idea of Creative Commons Search, useful to me but here I’ll just restate that license filtered crawl-based search has not turned out to be useful to me, except as a demo. On the other hand, license filtered media/repository-specific search has proven useful (that’s what most of the search options currently on do), and it’s plausible that crawl- or at least some form of aggregation-based search that takes into account finer grained metadata could effectively perform the same useful service. I also must question the utility of search for finding works for using and sharing: unless one is looking for something representative (e.g., a picture of a rose, a foreboding audio track) the finding is more likely to occur through curation, recommendation, marketing, advertising. “Search” has taken up too much oxygen. Finally, why bother limiting one’s discovery, use, and sharing based on internal passport parameters? The copyright industries and advocates don’t; why should you or I and our communities?

クリエイティブ・コモンズ (Creative Commons) mentioned the availability of “Creative Commons license ports to Japanese law” which is an accurate description, but what does it mean? I won’t answer that directly here, but an effect has been massive license proliferation (560 distinct licenses!) and mistakes — including in the 2.0 Japan licenses mentioned — version 2.1 Japan licenses exist, and version 2.1 was only used for a few “ports” in order to correct errors.

DirectConnect increment[al download verification]: My lack of interest in Direct Connect (which seems still active) was indicative of a fetish for fully decentralized schemes (DC has distinct hubs and clients). Client-server (the web) has won, without even P2P downloading, which DC had, and THEX constituted an improvement to. Hash trees have found use in ever more applications, it just turns out that improved downloading of crap isn’t one of the significant ones.

Client-side remixing isn’t so loopy also evinces a wildly impractical fetish for a kind of decentralization. Copies are cheap and work, coordination of references is expensive and broken.

Hello Austin, Night of Bowed Strings and Cambodian Surf, Texas Alien Abductions Up After Chunnel Completion, CC-Austin, and Walking Austin all concern a visit to Austin, Texas for SXSW. All make it apparent I did exactly what I enjoy, rather than what SXSW might be good for: networking. The “Music Sharing License” introduced there was a confusing name that fortunately has been forgotten. “Remix Ready” and “4th Wall Films” ideas to make “source” for cultural works available are ones that I liked and a concept I continue to advocate. However, clearly there is very little demand for the concept, perhaps for three reasons 1) it often isn’t clear what constitutes source 2) providing source, or even retaining it as one works (often destructively — the silly image at the top of this post is an example — I forgot to save .xcf files, but I did save two .png files for some reason, which is better than I’d usually do) is often expensive and 3) final published works have some usefulness as source. Finally regarding a panel on CC and music: clearly public licenses are neither necessary for distributing music online nor sufficient for engendering creative use and peer production of cultural relevance.

WikiTravel vs. World66: WikiTravel wins more concerns copying text between those two sites, which used the same license, in theory making such copying legal and easy. But such copying occurs constantly with no license and no knowledge of such. And I didn’t even comply with the license conditions, if it were necessary (no attribution or license notice added).

More bad ideas from February 2004.

Staten Joseland

Tuesday, March 6th, 2012

As a followup to a post comparing the population densities of Manhattan and Brooklyn to those of San Francisco and Oakland (not even close): if San Jose (945,942, 2,000/km²) had the density of Staten Island (468,730, 3,151.8/km²), San Jose would have 1,490,710 residents.
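
The arithmetic behind that figure is a rescaling of population over a fixed land area. A minimal Python sketch, assuming land area can be inferred as population divided by density; the function name `population_at_density` is hypothetical:

```python
def population_at_density(population, density, target_density):
    """Population the same land area would hold at `target_density` (per km²)."""
    area_km2 = population / density  # implied land area
    return area_km2 * target_density


# San Jose's population and density, rescaled to Staten Island's density.
san_jose = population_at_density(945_942, 2_000, 3_151.8)
print(f"{san_jose:,.0f}")
```

The result should land within one resident of the 1,490,710 quoted above.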

Another bit of San Jose trivia, which I’ve meant to blog ever since I briefly lived there (2005): it is the largest suburb in the U.S. As of 2000 (I haven’t seen newer), it was the only U.S. city with a population above 500,000 with an estimated daytime population significantly lower than its resident population (5.6% daytime population loss).

For ease of reference, the daytime population gains of New York City (obviously if broken out Manhattan’s would be far higher, and Staten Island’s far lower), San Francisco, and Oakland were 7%, 21.7%, and 2.7% respectively.

altmetrics moneyball

Sunday, March 4th, 2012

I read Moneyball in 2004 and recall two things: a baseball team used statistics about individual contribution to team winning to build a winning team on the cheap (other teams at the time were selecting players based on gut feel and statistics not well correlated to team success; players that looked good in person to baseball insiders or looked good on paper to a naive analysis were overpaid) and some trivia about fan gathered versus team or league controlled game data.

The key thing for this post is that it was a team that was able to exploit better metrics, and profit (or at least win).

Just as many baseball enthusiasts were dissatisfied with simple player metrics like home runs and steals, and searched for metrics better correlated with team success (sabermetrics), many academia enthusiasts are dissatisfied with simple academic metrics like (actually all based upon) number of citations to journal articles, and are looking for metrics better correlated with an academic’s contribution to knowledge (altmetrics).

Among other things, altmetrics could lead researchers to spend time doing things that might be more useful than publishing journal articles, bring people into research who are good at doing these other things (creation of useful datasets is often given as an example) but not writing journal articles, and help break the lock-in and extraordinary profits enjoyed by a few legacy publishers. Without altmetrics, these things are happening only very slowly as career advancement in academia is currently highly dependent on journal article citation metrics.

As far as I can tell, altmetrics are in their infancy at best: nobody knows how to measure contribution to knowledge, let alone innovation, and baseball enthusiasts faced a much, much more constrained problem: contribution to winning baseball games. But, given that so little is known, and current metrics so obviously inadequate and adhered to, some academics who do well on journal article citation metrics are vastly over-recruited and overpaid, while many academics and would-be-academics who don’t, aren’t. This ought mean there could be big wins from relatively crude improvements.

Who should gamble on potential crude improvements over journal article citation metrics? Entities that hire academics, in particular universities, perhaps even more particularly ones that are in the “big leagues” (considered a research university?) but nowhere near the top, and without cash to recruit superstars per gut feel or journal article citation metrics. I vaguely understand that universities make huge, conscious, expensive efforts to create superstar departments. Nearly all universities aren’t Columbia hoping to spend their way to recruiting superstars from Harvard and Princeton. Instead, why not make a long-term gamble on some plausible altmetric? At best, such a university or department will greatly outperform over the next decade and get credit beyond that for pioneering new practices that everyone copies. At worst, such a university or department will remain mediocre and perhaps slip a bit over the next decade, and get a bit of skeptical press about the effort (if made public). The benefits to society from such experimentation could be large.

Are there universities or departments pursuing such a strategy? I am in no position to know. I did search a bit and came up with What do the social sciences have in common with baseball? The story of Moneyball, academic impact, and quantitative methods. I’m pretty sure the author is writing about hiring social scientists who specialize in quantitative methods, not hiring social scientists based on speculative quantitative methods. What about universities outside wealthy jurisdictions?

Speaking of baseball players and academics, just yesterday I realized that academics have the equivalent of player statistics pages when I discovered my brother’s (my only close relative in academia, as far as I know) via his new site. I’ll have to ask him how he feels about giving such a public performance. My first reaction is that it is good for society. Such would be good for more professions: for most we have not conventional metrics like home runs or citations, needing improvement, but zero metrics, only gut feel or worse. Lots of fields, employment and otherwise, are ripe for disruption.

Addendum: Another view is that metrics lead to bad outcomes and that rather than using more sophisticated metrics, academia should become more like most employment and shun metrics altogether, hiring purely based on gut feel, and that other fields should continue as before, and fight any encroachment of metrics. Of course these theories may also be experimented with on a team-by-university-by-organization basis.

How many people can Sanhattan hold?

Saturday, March 3rd, 2012

There’s a medium length but not very informative article today titled Everybody Inhale: How Many People Can Manhattan Hold? Very speculatively, if Manhattan remains one of the premier cities in the world into the post-human future, perhaps trillions.

But I mostly use this as an excuse to harp on an old point, closer to home. How many people can San Francisco hold? Oakland? Currently these places are horribly underpopulated, semi-rural outposts, with populations of 805,235 (6,632.9/km²) and 390,724 (2,704.2/km²) respectively. At current Manhattan (27,394/km²) and Brooklyn (14,037/km²) densities respectively, San Francisco’s population would be 3,325,635 and Oakland’s 2,028,175.
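
Those two hypothetical populations follow from holding each city's land area fixed and swapping in the denser borough's density. A minimal Python sketch, assuming land area is population divided by density; the names `cities` and `rescale` are hypothetical:

```python
def rescale(population, density, target_density):
    """Population the same land area would hold at `target_density` (per km²)."""
    return population / density * target_density


# (current population, current density, density of the comparison borough)
cities = {
    "San Francisco at Manhattan density": (805_235, 6_632.9, 27_394),
    "Oakland at Brooklyn density": (390_724, 2_704.2, 14_037),
}

results = {name: rescale(*figures) for name, figures in cities.items()}
for name, people in results.items():
    print(f"{name}: {people:,.0f}")
```

Depending on whether the last digit is rounded or truncated, the output may differ from the quoted figures by a single resident.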

That’s right, Brooklyn is twice as dense as San Francisco: this isn’t about skyscrapers.

Considering the immense benefits of density for both creativity and energy efficiency, it is a horrible shame that there does not exist a reasonably dense city in the U.S. outside of New York. Autonomous vehicles will be the next chance to significantly reconfigure cities, not least by vastly reducing the amount of space needed for cars. There are a couple obvious ways to get started in that direction now. Whether a city makes good on this opportunity for reconfiguration will globally be the most significant determinant of success or failure in the coming decades. Pity it is getting zero attention relative to circuses.

no copyright law in the universe is going to stop me [from demanding compliance with various UN human rights and cultural diversity declarations]

Saturday, March 3rd, 2012

Currently the first autocompletion result upon typing “no copyright” into YouTube’s search is “no copyright law in the universe is going to stop me”, which is apparently a string used in the description of 108 videos on YouTube, and the title of at least one. It seems this phrase is primarily an anti-SOPA expression rather than an admonition to not take down whatever video is described.

Andy Baio pointed out late last year that disclaimers of intent to infringe others’ copyrights and claims of fair use frequently appear in the descriptions of videos on YouTube. He noted 489,000 and 664,000 results for the queries "no copyright" and "copyright" "section 107". Those numbers may have grown significantly in the nearly 3 months since, but should be taken with a huge grain of salt. Yesterday for me, “no copyright” returned 906,000 results, while today YouTube has reported both 972,000 and 3,850,000 for the same query. For “copyright” “section 107”, yesterday 771,000, today 418,000. I don’t know how many videos were on YouTube 3 months ago, but yesterday an empty query claimed 567,000,000; today I’ve seen 537,000,000 and 550,000,000. So maybe on the order of 1% of videos have some sort of copyright disclaimer. But there are variations that might not be picked up by the queries Baio used, including for example two of the descriptions I posted a few days ago.
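
Given how much the reported counts swing, the share of videos carrying a disclaimer is easier to bound than to pin down. A rough Python sketch using only the result counts quoted above; the bounding method is an assumption made for illustration (complete overlap of the two queries at the low end, no overlap at the high end), not something YouTube reports:

```python
# Result counts quoted above (YouTube's own estimates, which fluctuate).
no_copyright = [906_000, 972_000, 3_850_000]
section_107 = [771_000, 418_000]
all_videos = [567_000_000, 537_000_000, 550_000_000]

# Low end: smallest counts, the two queries overlapping completely,
# divided by the largest total.
low_pct = 100 * max(min(no_copyright), min(section_107)) / max(all_videos)

# High end: largest counts, no overlap between the two queries,
# divided by the smallest total.
high_pct = 100 * (max(no_copyright) + max(section_107)) / min(all_videos)

print(f"between {low_pct:.2f}% and {high_pct:.2f}% of videos")
```

Both ends come out below 1%, at roughly 0.2% and 0.9%, so "on the order of 1%" reads as a generous ceiling for these two queries alone; disclaimer wordings the queries miss would push the true share up.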

Although they’re probably completely useless in preventing automated takedowns and in court (though it’s not entirely clear they ought be useless in either case), as expression they’re at the very least interesting, and perhaps more. Though they can be seen as “voodoo charms”, so can the ubiquitous “all rights reserved”, and even meaningful public copyright licenses can be seen as such to the extent they are misunderstood or totemic. My main objection to the disclaimers Baio brought attention to is that they’re clutter to the extent they crowd out writing or reading other information about works; and just about anything else is more useful, from provenance to expressions of appreciation, e.g., “In my opinion, one of the greatest songs of the ’80s.”

But my first reaction to such disclaimers is the wish that they would be more expressive, even substantial. Regarding the latter, in many cases the uploader has added something to or rearranged the work in question — e.g., where the work is a song, the addition of images, or the performance of a cover. How often does the uploader grant permissions to use whatever expression they’ve added? (I don’t know; one aggregate tool for exploring such might be the addition of &creativecommons=1 to the aforementioned queries, which will limit results to those marked as CC-BY.) One fairly well known case of something like this is Girl Talk’s All Day:

All Day by Girl Talk is licensed under a Creative Commons Attribution-Noncommercial license. The CC license does not interfere with the rights you have under the fair use doctrine, which gives you permission to make certain uses of the work even for commercial purposes. Also, the CC license does not grant rights to non-transformative use of the source material Girl Talk used to make the album.

Too bad with the NonCommercial condition, and I really don’t like Girl Talk’s music (for something kind of similar that I prefer aesthetically and in terms of permissions granted, check out xmarx), but otherwise that’s a great statement.

Over the past few months someone or some people have made me aware of another example, one that replaces disclaimers with demands. You can see some of this on my English Wikipedia user talk page (start at “Common IP” — unfortunately doesn’t pass through internal links, so you’ll have to scroll down). It may appear that my correspondent is religious and communicating poorly through automated translation between Russian and English, but there’s a kernel of something interesting there. If I understand correctly, they think that without listening to the Beatles, one cannot develop morally (that comes from elsewhere, not on my talk page) and that per a variety of United Nations declarations concerning human rights and especially cultural diversity, anyone has the legal right and moral duty to share Beatles material. As far as I know they started this campaign at and moved on to other sites, including Wikipedia. It is pretty clear that they’re not looking for links to or some other site they control — I think they’re sincerely promoting something they believe in, not a money-making scam.

The flaws in their campaign are legion, not least of which is that there could hardly be a worse body of work than that of the Beatles around which to plead for rights to share in the name of cultural diversity. As the Beatles are one of if not the most popular acts ever, the most obvious conclusion is that more Beatles exposure must lower global cultural diversity. On the related issue of cultural preservation, super-famous material like that of the Beatles is going to survive for a long time in spite of copyright restrictions, even vigorously enforced (see James Joyce).

As to their persistent requests for some kind of permission from me to proceed with their campaign, I say two things:

  1. As far as the copyright regime is concerned, the permissions I have to grant to you are nil.
  2. As far as demands made in the name of human rights, no human requires permission from any other to pursue those. Godspeed to you, or perhaps I should say, Beatlespeed!

I want to thank my correspondent for causing me to look at the and subsequent documents. The way they address “intellectual property”, to the extent that they do, is more curious than I would’ve thought. I leave that to a future post.

p.s. My favorite Beatles.