Yesterday on the Creative Commons weblog:
Today we announce a search engine prototype exclusively for finding Creative Commons licensed and public domain works on the web.
Indexing only pages with valid Creative Commons metadata allows the search engine to display a visual indicator of the conditions under which works may be used, as well as offer the option to limit results to works available under licenses allowing derivatives or commercial use.
This prototype partially addresses one of our tech challenges. It still needs lots of work. If you’re an interested developer, you can obtain the code and submit bugs via the cctools project at SourceForge. The code is GNU GPL licensed and builds in part upon Nathan Yergler’s ccRdf library.
We also have an outstanding challenge to commercial search engines to build support for Creative Commons-enhanced searches.
And it hasn’t melted down yet.
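The “valid Creative Commons metadata” the announcement mentions was, at the time, an RDF/XML block embedded in the page, typically inside an HTML comment. A minimal sketch of the kind of check a crawler might do before indexing a page; the regexes, function name, and sample page are illustrative, not the actual cctools/ccRdf code:

```python
import re

# Illustrative patterns, not the real validator: grab the embedded RDF/XML
# block and pull the license URL out of its <license rdf:resource="..."/>.
CC_RDF = re.compile(r"<rdf:RDF.*?</rdf:RDF>", re.DOTALL)
LICENSE = re.compile(r'<license\s+rdf:resource="([^"]+)"')

def find_cc_license(html):
    """Return the license URL from a page's embedded CC RDF, or None."""
    block = CC_RDF.search(html)
    if block is None:
        return None
    m = LICENSE.search(block.group(0))
    return m.group(1) if m else None

page = """<html><body>Some work.
<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="">
    <license rdf:resource="http://creativecommons.org/licenses/by-nc/1.0/" />
  </Work>
</rdf:RDF>
-->
</body></html>"""

print(find_cc_license(page))  # -> http://creativecommons.org/licenses/by-nc/1.0/
```

Only pages where this lookup succeeds get indexed, which is what makes the visual license indicator and the derivatives/commercial filters possible.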
Ben Adida wrote most of the code that needed to be written in Python (not much; PostgreSQL with tsearch2 full-text indexing does all of the heavy lifting). Former government employee Justin Palmer wrote an earlier prototype in AOLserver/Tcl, also using PG/tsearch2. (It turns out we needed the flexibility of running under Apache. I’ll miss AOLserver/Tcl once I’ve touched it for the last time, but I’ll also be glad to be rid of it.) I did a PHP hack job on the front end, and Matt Haughey made it look good (for end users, not code readers) in a matter of minutes.
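To give a sense of why so little application code was needed: tsearch2 adds a tsvector column type, a to_tsvector()/to_tsquery() pair, and an indexable @@ match operator, so license-filtered full-text search reduces to a few SQL statements. A hedged sketch with illustrative table and column names (not the actual schema):

```sql
-- Hypothetical crawl table; names are illustrative.
CREATE TABLE pages (
    url     text PRIMARY KEY,
    license text,          -- license URL taken from the page's CC metadata
    body    text,
    fts     tsvector       -- tsearch2 full-text index column
);

UPDATE pages SET fts = to_tsvector(body);
CREATE INDEX pages_fts_idx ON pages USING gist (fts);

-- Full-text search, limited to licenses that permit commercial use, e.g.:
SELECT url, license
FROM pages
WHERE fts @@ to_tsquery('sample & query')
  AND license LIKE 'http://creativecommons.org/licenses/by/%';
```

The front end then only has to translate the user’s license checkboxes into the WHERE clause.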
Although everything about this implementation sucks, it is already a valuable tool for finding CC-licensed and public domain content: stuff you can reuse with permission already granted. Neeru Paharia was the visionary here, seeing that it would be valuable even if it sucked technically in every way.
Stephen Downes is exactly right about the long term goal:
Of course, this is only a step – such a search engine would not be useful for many purposes; copyright information needs, in the long run, to define a search field or a type of search, not a whole search engine.
With great justification, major search engines have ignored pure metadata for a long time, at least five years. Pure metadata, with no visibility, is nearly universally ill-maintained or fraudulent. I hope this Creative Commons prototype inspires some people at major search engines to think again about metadata, but I think semantic HTML is what will finally prove useful to such folks, in no small part because it isn’t pure metadata. I’ll post on incremental semantic search engine features in the near future.
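One concrete contrast: an invisible meta tag can rot or lie without anyone noticing, while a human-visible license link (the kind of markup later standardized as rel="license") is seen and maintained by readers and authors alike. A hedged sketch of how a search engine might pick up such visible metadata; the class and sample page are illustrative:

```python
from html.parser import HTMLParser

class LicenseLinkFinder(HTMLParser):
    """Collect href targets of <a rel="license"> links -- metadata that is
    part of the visible page, unlike an invisible <meta> tag."""
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "license" in (a.get("rel") or "").split():
            self.licenses.append(a.get("href"))

page = '''<p>Photo by A. Author, licensed under a
<a rel="license" href="http://creativecommons.org/licenses/by/2.0/">
Creative Commons Attribution license</a>.</p>'''

finder = LicenseLinkFinder()
finder.feed(page)
print(finder.licenses)  # -> ['http://creativecommons.org/licenses/by/2.0/']
```

Because the link is rendered to readers, a stale or fraudulent license claim gets noticed, which is exactly the feedback loop pure metadata lacks.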