Google, Kosmix, and The Deep Web – A Love Triangle, by Abhishek Gattani, Kosmix (Nov 13)
Two different ways of penetrating the deep web were described at a session of the SDForum Search SIG in Palo Alto, where people from Google and Kosmix spoke.
The definition of the deep web is simplified to "The Deep Web is simply the Web behind HTML forms".
[Of course there is still the "invisible" web, of which the deep web is a part.]
Google finds the HTML forms, feeds them queries, and indexes the results. This is called "surfacing" the deep web.
Google’s DeepWeb Crawl was described in a 2008 paper:
"Our work illustrates three principles that can be leveraged in further investigations. First, the test of informativeness for a form input can be used as basic building block for. exploring techniques for indexing the DeepWeb. Second, we believe efforts should be made to crawl well chosen subsets of Deep-Web sites in order to maximize traffic to these sites, reduce the burden on the crawler, and alleviate possible concerns of sites about being completely crawled. Third, while devising domain-specific methods for crawling is unlikely to scale on the Web, developing heuristics for recognizing certain common data types of inputs is a fruitful endeavor. We believe that building on these three principles it is possible to offer even more Deep-Web content to users."
Kosmix does it differently: it responds to queries with API calls to predetermined sources of information.
"If you wanted to look up “Pumpkin Pie” on Kosmix, for example, the system would bring you fresh content from recipe sites like the Food Network, “How To” baking videos, real-time tweets about pumpkin pie from Twitter, and information about the caloric profile of pumpkin pie from diet sites like FatSecret."
It seems Kosmix has been tagging the web.
"Over the past three years, Kosmix has created a taxonomy of several million nodes, which we organized into a graph, using a combination of humans and algorithms. Editors discover, integrate, and tag Web services to taxonomy nodes in a semi-automated fashion. Algorithms route the user’s query through the set of taxonomy nodes, which enable the engine to decide which Web service to call."
Posted by Gwen at November 17, 2009 11:29 AM