Google launches tool for searching public data, by Tom Krazit, Relevant Results (Mar 8)
Could Google do for public-data search what it did for scholarly?
Public Data Explorer is new tool from Google Labs - "The site takes public data regarding schools, population, crime, and even names to construct charts and graphs that help illustrate trends."
There are several examples on show. International data comes from the World Bank, OECD, and Eurostat. U.S. data comes from the Bureau of Labor Statistics, the Census Bureau, the Centers for Disease Control, and the Bureau of Economic Analysis.
Google post - Statistics for a changing world: Google Public Data Explorer in Labs
Connotate lets users create applications to search deeply and analyze the findings. It's all in creating the agents to go to the right places, and mash up the information.
The Connotate website describes how this can be used for business intelligence, competitive intelligence, and industry applications. It will work with any source, internal or external. The data page notes that:
Barbara Brynko described the product in Information Today - Connotate: Letting Agents do the Work - (Nov 2009) available through AllBusiness - requires free registration.
There is a two-week free trial for monitoring and extracting data from any web source.
Researchers unleash crawlers into Deep Web data by Jennifer Foreshew, The Australian (Jan 19)
Professor Halevy, head of Google's structured data management research group in the US, discussed the difficulties in indexing structured data on the web - aka deep web. This summary of his keynote speech in Australia gives us some clues on what Google is doing.
"Google has two research projects on these problems.
The first, WebTables, compiles a huge collection of databases by crawling the web and finding small relational databases that use the HTML table tag.
"By performing data mining on the resulting extracted information, we can also introduce a number of brand-new data-centric applications," the paper says. The second project attempts to extract information from the Deep Web, which refers to data on the web that is only available by filling web forms, and therefore invisible to traditional search crawlers."
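The WebTables idea of finding small relational tables inside ordinary HTML can be illustrated with a minimal sketch. This is not Google's code: the `TableExtractor` class, the `looks_relational` filter and its thresholds, and the sample page (with made-up figures) are all assumptions for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> in a page as a list of rows of cell text.

    Simplification: nested tables are not supported.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # table currently being read
        self._row = None    # row currently being read
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def looks_relational(table, min_rows=2, min_cols=2):
    """Hypothetical filter: keep tables with a regular grid of cells."""
    if len(table) < min_rows:
        return False
    widths = {len(row) for row in table}
    return len(widths) == 1 and widths.pop() >= min_cols


# Made-up sample page: one data table and one layout table used for navigation.
SAMPLE_HTML = """
<table><tr><th>City</th><th>Population</th></tr>
<tr><td>Oslo</td><td>580000</td></tr>
<tr><td>Bergen</td><td>250000</td></tr></table>
<table><tr><td>Home | About | Contact</td></tr></table>
"""

candidates = [t for t in extract_tables(SAMPLE_HTML) if looks_relational(t)]
```

The filter matters as much as the extraction: most table tags on the web are used for page layout, so a crawler of this kind has to discard the majority of what it finds.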
New Features in Wolfram|Alpha: Year-End Update, Wolfram Alpha (Dec 21)
Describes features and functionality added to Wolfram Alpha during the past half year in the following three areas. Some examples noted below.
+ mathematics, statistics and computation - advanced support for many mathematical functions
+ science, biology, health - physical exercise calculator, and carbon footprint data
+ socioeconomics and culture - FBI statistics on crime, international healthcare statistics, salary data in the US.
There's more to come in 2010.
"In 2010, we will continue to improve and update all of these domains, and to tackle entirely new areas of knowledge. Efforts are already underway to add data on an incredibly diverse array of subjects: automobiles, energy consumption and prices, fictional characters, wars and battles, and Academy Awards, to name just a small fraction."
Internous asks librarians of the tangled web to unite. It has a proposal for adoption of an Internet Search Environment Number that would be applied to all databases on the web and make searching the deep web possible.
"You know how the ISBN is assigned to books. Over 1 million books are assigned ISBNs each year. What ISEN plans to do is emulate that system for databases. We would assign over 1 million databases ISEN or Internet Search Environment Numbers once the system is in place in its first year. There may be as many as 5 million in the backlog for cataloging by a social network of librarians. Life Science databases would be cataloged by life science librarians, law resources by law librarians, etc..."
Very ambitious - view the video at least to see the dream.
Go Beyond Search with Deep Web Engine BrightPlanet, Altsearchengines (Dec 7)
BrightPlanet is in the news again. Over 10 years ago it was doing deep indexing of databases and released papers on the Deep Web. The deep harvesting technology was showcased at its public site, CompletePlanet, but unfortunately the search interface was difficult to use, and it was unclear whether the indexing was fresh. I doubt that many used it.
But we might try again.
"BrightPlanet’s patented software harvests Deep Data from: (1) documents from the conventional (or surface) Web, (2) the much larger, more authoritative Deep Web, (3) proprietary data sources (such as LexisNexis and Dow Jones/Factiva), and (4) customers’ own internal data sources."
Today it serves the U.S. Intelligence Community with harvesting deep web content, and proposes to help businesses do the same.
BrightPlanet is a good starting point for learning about Deep Web. For example, the description of deep web in this video on The Virtual Private Library and Deep Web
I'm not convinced that CompletePlanet will be attractive to searchers. Firstly - the aim is to find the database rather than the particular page (as we do with Google), and secondly, it doesn't seem fresh. Harvested dates are March 05 - is that 2005, or March 5 in 2009? A query for social media (what could be hotter?) as a phrase finds two weak databases.
Google, Kosmix, and The Deep Web – A Love Triangle, by Abhishek Gattani, Kosmix (Nov 13)
Two different ways of penetrating the deep web are described here from a session at the SDForum Search SIG in Palo Alto where Google and Kosmix people spoke.
The definition of the deep web is simplified to "The Deep Web is simply the Web behind HTML forms".
[Of course there is still the "invisible" web of which deep web is a part.]
Google finds the HTML forms, feeds them questions and indexes the results - it's called "surfacing" the web.
Google’s DeepWeb Crawl was described in a 2008 paper:
"Our work illustrates three principles that can be leveraged in further investigations. First, the test of informativeness for a form input can be used as basic building block for exploring techniques for indexing the DeepWeb. Second, we believe efforts should be made to crawl well chosen subsets of Deep-Web sites in order to maximize traffic to these sites, reduce the burden on the crawler, and alleviate possible concerns of sites about being completely crawled. Third, while devising domain-specific methods for crawling is unlikely to scale on the Web, developing heuristics for recognizing certain common data types of inputs is a fruitful endeavor. We believe that building on these three principles it is possible to offer even more Deep-Web content to users."
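One way to picture the "test of informativeness" for a form input is to submit several values and see whether the result pages actually differ. The sketch below is a simplified reading of that idea, not Google's implementation: the `fake_form` stand-in, the hash-based `signature`, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import hashlib


def signature(page_text):
    """Cheap content signature: hash of the whitespace-normalised page text."""
    normalised = " ".join(page_text.split()).lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def distinct_fraction(fetch, input_name, values):
    """Submit each value for one input; return distinct pages / submissions."""
    if not values:
        raise ValueError("need at least one candidate value")
    signatures = {signature(fetch({input_name: value})) for value in values}
    return len(signatures) / len(values)


def is_informative(fetch, input_name, values, threshold=0.5):
    """Assumed rule: an input is informative if enough result pages differ."""
    return distinct_fraction(fetch, input_name, values) >= threshold


def fake_form(params):
    """Hypothetical form: 'state' filters the results, any other input is ignored."""
    listings = {
        "CA": "3 listings in California",
        "NY": "5 listings in New York",
        "TX": "2 listings in Texas",
    }
    return listings.get(params.get("state"), "All 10 listings")


state_ok = is_informative(fake_form, "state", ["CA", "NY", "TX"])
lang_ok = is_informative(fake_form, "lang", ["en", "fr", "de"])
```

Here the `state` input yields three different pages and passes, while `lang` returns the same page every time and is dropped, which saves the crawler from submitting combinations that add nothing to the index.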
Kosmix does it differently. It responds to queries with API calls to pre-determined sources of information.
"If you wanted to look up “Pumpkin Pie” on Kosmix, for example, the system would bring you fresh content from recipe sites like the Food Network, “How To” baking videos, real-time tweets about pumpkin pie from Twitter, and information about the caloric profile of pumpkin pie from diet sites like FatSecret."
Kosmix has been tagging the web - it seems.
"Over the past three years, Kosmix has created a taxonomy of several million nodes, which we organized into a graph, using a combination of humans and algorithms. Editors discover, integrate, and tag Web services to taxonomy nodes in a semi-automated fashion. Algorithms route the user’s query through the set of taxonomy nodes, which enable the engine to decide which Web service to call."
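The routing step in that description can be sketched as a toy. Kosmix's real taxonomy had several million nodes and its algorithms are not public, so everything here is hypothetical: the three-node `TAXONOMY`, the `SERVICES` names, and the naive substring matching in `match_nodes`.

```python
# Toy taxonomy: child node -> parent node; None marks the root.
TAXONOMY = {
    "food": None,
    "dessert": "food",
    "pie": "dessert",
}

# Web services tagged to taxonomy nodes (hypothetical service names).
SERVICES = {
    "food": ["recipe_search", "nutrition_lookup"],
    "dessert": ["baking_videos"],
}

# Phrases that map a query onto a node (naive substring matching).
KEYWORDS = {"pie": "pie", "pumpkin pie": "pie", "cake": "dessert"}


def match_nodes(query):
    """Return the taxonomy nodes whose phrases appear in the query."""
    text = query.lower()
    return sorted({node for phrase, node in KEYWORDS.items() if phrase in text})


def services_for(node):
    """Collect services tagged to a node and its ancestors, most specific first."""
    found = []
    while node is not None:
        for service in SERVICES.get(node, []):
            if service not in found:
                found.append(service)
        node = TAXONOMY[node]
    return found


def route(query):
    """Decide which services to call for a query."""
    calls = []
    for node in match_nodes(query):
        for service in services_for(node):
            if service not in calls:
                calls.append(service)
    return calls


pie_calls = route("Pumpkin Pie")
```

Walking up to the ancestors is the design choice worth noting: a narrow node such as "pie" needs no services of its own, because it inherits whatever was tagged to "dessert" and "food".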
OCLC Ingests OAIster: Pearls to Follow, by Barbara Quint, Newsbreaks (Nov 12)
OAIster has been a project by the University of Michigan to provide links to the metadata of hard-to-find electronic scholarly resources such as books, articles, technical reports, preprints, white papers, as well as some multimedia for images of paintings, movies and audio files of speeches.
"It now has more than 23 million records from more than 1,100 organizations worldwide, including digitized books and journal articles, digital text, audio and video files, photographic images, data sets, and theses and research papers."
This database has been absorbed into OCLC and is searchable through Worldcat.org along with library holdings.
"The WorldCat.org service now includes all the OAIster records. Users of the old OAIster.org site will now be automatically shifted over to an OCLC-based site (www.oclc.org/oaister). This is just the beginning, however. OCLC has also merged the content of two other open access files-ArchiveGrid and CAMIO-into WorldCat.org. In January, OCLC will launch a separate OAIster file, allowing users to reach just this repository content guide. As with WorldCat.org, the new OAIster-only file will be accessible for free. The experience gained from handling OAIster has led to improvements in the flexibility of WorldCat.org's infrastructure itself. More improvements are in the offing for OAIster from applying other OCLC features."
Newsbreaks article mentions that search engines can crawl and index this material but that the data is behind a "CGI script and that made it [difficult] for harvesting. We didn't have an API interface. It was cumbersome."
This is a fine example of Deep Web - it's on the web but not easily accessible by search engines.
Other content in OCLC now includes
+ "ArchiveGrid helps locate historical documents, personal papers, and family histories held in archives across the world. "
+ "CAMIO (Catalog of Art Museum Images Online) identifies high-quality art images contributed and described by leading museums worldwide with all rights cleared for educational use."
+ CONTENTdm metadata - "This also allows users to download records to their local systems"
The Government Domain: A Handful of Classics by Peggy Garvin, LLRX (Oct 31)
Peggy Garvin is the author of e-Government and Web Directory: U.S. Federal Government Online, an annually updated directory to more than 2,000 Web site records, organized into 20 subject-themed chapters.
In this article for LLRX she describes seven important US government web resources that may not be universally known. In addition there is USA.gov as a specialty search tool for US federal, state, and local government sites.
She makes the point that these are resources in the "deep web". Google and others don't index them at all or not well. The searcher must know that they exist.
Deep Web Tech Relaunches ScienceResearch.com by Paula J. Hane, Newsbreaks (June 15)
Deep Web Technologies revamped its ScienceResearch.com for Chemistry, Earth and Environmental Sciences, Health and Medicine, and Physics.
"The ScienceResearch.com portal aims "to unify the World Wide Web's dispersed science to become the world's most comprehensive portal for science." Additionally, the portal seeks to make "long tail science," the very specialized science that may appear to be of limited interest, available to a larger audience through which applications may be found. Hopefully, the portal is designed to serve as a catalyst for scientific discoveries and innovative solutions. "Our goal is to make more science research available to more individuals than any other portal," says Abe Lederman, founder, president, and CTO of Deep Web Technologies."
Wolfram Alpha Research Secrets, Zack Stern, PC World (June 10)
A primer on using Wolfram Alpha - start small and grow the search. That's good advice at any time.
"You have to think differently to begin tapping into Wolfram Alpha’s abilities. Here are some tips on how to get started with this new kind of comparison engine."
DeepDyve announced in its newsletter two new features for its "deep web" search.
"We're pleased to introduce RSS Alerts and Email Alerts which will allow you to have your searches running ‘in the background' with new results brought to you by your RSS reader of choice, or via an email in your inbox. We're also pleased to announce many new publishers who have added their content to our growing index of deep web information: "
New Publishers:
+ American Association of Cancer Research
+ American Association for Clinical Chemistry
+ American Psychiatric Publishing, Inc.
+ American Society for Nutrition
+ American Society for Pharmacology and Experimental Therapeutics (Aspet)
+ Information Sources, Inc. (TecTrends)
+ International Union of Crystallography
+ Journal of Bone & Joint Surgery
+ National Academy of Sciences (PNAS)
+ The Scientist
DeepDyve does not put these announcements into its blogs or its page of news releases, or at least not immediately. And it has not listed all of these publishers or publications on its Experts page.
Resource of the Week — Fast Facts Anyone? A Brief Users Guide to Wolfram|Alpha, Gary Price, ResourceShelf (May 17)
Report and examples of search at Wolfram|Alpha - done by a search master.
5 problems Wolfram Alpha can solve for you, Pandia Search News (May 27)
Pandia has figured out five types of questions that Wolfram Alpha can handle well. WA is a "computational engine" that finds facts and figures, but not all of them.
Wolfram Alpha has many examples too - best to review them rather than taking a stab at something you think should be there.
Google I/O: New Advances In The Searchability of JavaScript and Flash, But Is It Enough?, by Vanessa Fox, Search Engine Land (May 29)
How searchable are the new "cloud" applications and web sites developed with HTML 5 and AJAX? Google can do some indexing of Flash, but it still has problems with it and JavaScript. There are a host of issues described in this article on dealing with navigation at websites coded in JavaScript.
One thing - the search engines are now very good at interpreting dynamic URLs.
Wolfram Alpha searching for its niche by Tom Krazit, Webware (May 22)
Computational search engine Wolfram Alpha is disappointing people. In a CNET survey, respondents gave it a low rating (3.55 on a satisfaction scale where 5 was least).
"For the most part, readers were dissatisfied with Wolfram Alpha's ability to produce results for anything outside of a relatively narrow set of queries related to math, science, or statistics. Forty percent said they would not recommend Wolfram Alpha to friends, while 28 percent thought it was only appropriate for "serious data nerds." (Percentages based on 1,459 responses.)"
But, as the article points out, WA is not search as we know it with word matches and ranking. It works with data to compute things. But this is only week 1. It needs time to build the databases, and to teach people how to use it.
Wolfram Alpha—Semantic Search Is Born, by Woody Evans, Newsbreaks (May 21)
Article describes some of the workings of Wolfram Alpha, which is described as "being a really smart way to access most of the best reference shelves on the planet."
"Wolfram Alpha relies on the data sets it has acquired (much of which are freely available from governments and public domain sources) and computers in data centers maintained by Wolfram Research. With all that power, the engine can interpret keywords such as weather and oakland into meaningful categories by way of "input interpretation." The term "weather" remains weather, but the term "oakland" is interpreted to mean "Oakland, California" by default, according to rules set by the program. Heavily symbolic and mathematical queries are interpreted and computed even more handily: "water 550C 3 atm," another sample search Wolfram demonstrated, becomes the substance water at 550° centigrade and at "3 atmospheres" of pressure. Alpha can then tell the user useful facts about the nature of water under such conditions, including density, molecular weight, and boiling point. And all this is presented in easy-to-read, aesthetically pleasing boxes with a dignified non-san-serif font and lots of white space."
Reminder - Alpha doesn't have all data. It won't answer every question. I found it impossible to get some detailed export trade data for Canada. But it's still worth trying.
WolframAlpha a whiz at numbers, stats and stocks, Chris Keall, Business Review (May 16)
Search examples and screenshots of what WolframAlpha can do. This is completely different from any search tool devised so far. It doesn't do matches and links, it does data.
"Its number and data-driven approach covers all manner of areas, from health, books, movies and sports, among many others - and for some narrowly defined searches in these areas it will become people's first stop. Typing a movie name, for example, gives you its box office and a cast list, but not links to reviews, which is what many would be after."
One Man’s Answer By Barrett Sheridan, NEWSWEEK (May 16)
Wolfram Alpha -- "Wolfram says his creation is not so much a search engine as a "computational knowledge engine." It has a single input field, like a search engine, but users can pose complex questions. What is the date of the next total solar eclipse visible from Paris? (Answer: Sept. 23, 2090.)"
Background on the physicist Stephen Wolfram and his reasons for developing the Alpha - in his word, "a new paradigm for using computers and the Web."
Wolfram Alpha Live Review: The Un-Google, by Chris Sherman, Search Engine Land (May 15)
One of many reviews to come on the soon-to-be-launched Wolfram Alpha.
I’ve been playing around with Wolfram Alpha for a couple of days. In all, I agree with Danny’s “impressive” verdict. And it functions in a way that’s very different than any other search engine I’ve ever used. Even beyond impressive, the words that best describe the experience of using Wolfram Alpha are “fun” and “enchanting.” Wolfram Alpha’s interface allows you to enter queries in natural language. It then tries to disambiguate your query and present relevant facts, charts, illustrations and other supporting “tools.” If it can’t understand your query, you’ll see a “Wolfram Alpha isn’t sure what to do with your input” message. Sometimes it doesn’t have enough information to work with; in these cases you see a “development of this topic is under investigation…” message.
Has several examples with screenshots. We need time to study this.
Google is taking on data search - not just text search, but numbers and patterns. This isn't available yet except for the demo on US population, but it could be a winner.
From: Google turns searches into spreadsheets, PCPro (May 13)
"Google will automatically generate detailed spreadsheets of data from search terms as part of a revamp of its market-leading service. Dubbed Google Squared, the forthcoming feature will deliver tables full of factual data on the topics people search for."
Also described in Will Google Squared make GOOG a better research tool?, ZDnet
Wolfram Alpha shows data in a way Google can't, by Stephen Shankland and Rafe Needleman, Webware (May 5)
Has any new thing on the Web been this much anticipated? Shankland and Needleman respond to some key questions about Wolfram Alpha in a debate about its potential and use.
See Wolfram Alpha in Action: Our Screenshots, ReadWriteWeb (Apr 30)
Peek at the kinds of data the new Wolfram Alpha promises to produce as seen through screenshots. There is also a link to a demo video at Berkman Center.
New online search tool developed for legal research, Altsearchengines (Apr 29)
MetaJuris - federated engine for 6 US law databases.
"The Information and Telecommunication Technology Center and the School of Law at the University of Kansas have developed a powerful online search tool for legal researchers. MetaJuris, a metasearch engine, simultaneously searches various legal databases for cases, statutes and literature citations. The free service, metajuris.ittc.ku.edu, is open to the public."
Google crashes Wolfram Alpha debut party by Stephen Shankland, Webware (Apr 28)
Several are trying very hard to dig into the web.
+ Stephen Wolfram is developing Wolfram Alpha engine - "designed to process data from controlled, vetted sources of data--many not on the Web--then present the results in a way that lets people dig deeper into the subject. It's something of a cross between a graphing calculator, repositories of scientific data, and a system to interpret questions posed in human terms. " Watch for it later in the year.
+ Google has just released a feature to search public data and graph it. This is US only. It's described more fully at the Official Google Blog - Adding search power to public data.
"Thus far, Google's service includes data only from U.S. Bureau of Labor Statistics and the U.S. Census Bureau's Population Division. "
DeepDyve has announced improvements to its deep web search engine.
From the announcement:
Today, we released a new version of our search engine containing several new features including:
* Spell check: if your query looks a bit off, DeepDyve will suggest alternative words and spellings
* Bolded keywords: the matched keywords or phrases in your query are now bolded in your search results for easier identification
* Search auto-fill: as you type in your query, DeepDyve will auto-display other searches from your history that resemble your query
* Advanced filter: we've updated our Advanced filter experience to make it much more intuitive and easier to use.
* New Detail view: did you know that you can 'preview' your search result? By clicking Detail, you can see a couple sentences or more of your search result to make sure it's what you intended. In this release, we've refined the look and feel of this page and increased the amount of content we can now display
* Related searches: in your search results page, you will see related, popular searches
* Clickable publisher, journal and author name: you can now click on any of these fields and immediately generate a search on just those documents
Federated Search Blog (by Sol Lederman) has a series of interviews with "federated search luminaries": Erik Selberg, Michael Berman, Todd Miller. Kate Noerr of MuseGlobal, a fourth, is on her own page.
From the Michael Berman interview
"Search engines work best in the discovery phase, when searching is a fast, give-and-take, contact sport. Real-time performance is important and interaction and testing are the user mode. I frankly feel deep Web search is not terribly useful or helpful in this phase. Identifying candidate searchable databases can be very important in this phase, but that can be accomplished from a search engine for databases such as CompletePlanet or the DQM rather than going to the site directly (reserving deep Web search for the purposeful harvest mode.)
Once the researcher has got a good bead on their capture requirements, harvesting and the deep Web come to the fore. But, this can be scheduled, and need not meet a real-time criterion. "
Some presentations from the Computers in Libraries 2009 Conference are available.
Under Information Discovery & Search
+ A Super Searcher Shares 25 Search Tips/Thoughts by Mary Ellen Bates
+ Searching Google Earth by Ran Hock
+ Searching Conversations: Twitter, Facebook, & the Social Web - by Greg Notess [Not available yet]
+ Information Discovery: Science & Health by Walter Warnick - entering the deep web by using federated search of scientific databases. [Available at site]
+ Seeking Health by Tamas Doszkocs - lists several health search engines and identifies some with semantic search capabilities (Medstory, Healthline, GoPubMed) [Available at site]
Conference also had a track on Search and Search Engines - federated search, mobile search, RSS, emerging search technologies.
Google Starts Ranking Twitter Search Results Pages by Patrick Altoft, Blogstorm (Mar 23)
Patrick Altoft has evidence that Google will pick up search results pages from Twitter, probably on news-related items.
The actual pages are likely being discovered in two different ways. The first is via the usual link discovery method where Google spots lots of links to a page and ranks it based on link data. The second method Google might be using is to generate the Twitter results pages themselves. If there is a particular keyword that Google wants more results for they can just plug that keyword into the Twitter search page and generate a brand new page to suit.
People Search Engines: They Know Your Dark Secrets…And Tell Anyone by JR Raphael, PC World (Mar 10)
This is a chilling article on how much the social search engines can find out about you - including Amazon wish list, music preferences, political contributions (in the US), photos of family (if you make them public) and much more.
The editor enriched the article with notes on actual discoveries of very personal information.
There is some "deep web" in this - the engines specialize in digging into the sources and some in using "linguistic analysis" to improve the results.
Spokeo goes a step further and will monitor activity of your contacts through blogs, video sharing, image sharing, playlists, wish lists, social networking services.
Engines mentioned:
+Spokeo - searches 41 social networks
+Pipl - claims use of "advanced language-analysis and ranking algorithms"
+CVGadget - kind of meta-searcher
+Also - Rapleaf, a for-fee service for gathering and consolidating information on a person - see demo of its use for contacts through SalesForce.com
Main message: "Whether they target businesses or individuals, the services have one thing in common: Unlike the public record-driven search tools of the past, the new people-tracking utilities build a highly detailed dossier about you solely from information that you yourself published--a circumstance that may give you a distinct feeling of discomfort."
Companion article: People Search Engines: Slam the Door on What Info They Can Collect, JR Raphael, PC World (Mar 10)
"Take these steps to stop the new generation social search engines from telling the world everything about you."
Federated Search: A Year of Blogging by Sol Lederman, FUMSI (
Sol Lederman talks about his Federated Search Blog, a must read for anyone following deep web. Sol used to work for Deep Web Technologies, which also sponsors this blog. In this article he mentions the postings that have been most popular.
DeepDyve announced by email that it has "A new look and feel" and that it is now free for anyone to use. DeepDyve searches specialty databases in a "deep web" way mainly in the fields of science, engineering and some business.
From the email: "Specifically: We've simplified the user interface to make it easier, faster and more intuitive. You can quickly refine or add filters to your query with an easy to use drop-down menu directly from the search bar
By clicking on the "Details" button from any search result, you can now read an Abstract of every document as well as see the best matching portion of text from the document
You can now Share your results to email, Digg, MySpace, Facebook, Twitter and other channels.
And, we've removed the registration and login requirement and are now in "open" Beta"
This press release about “The Flu Season Is Coming – Tips to Research Prevention and Treatment” gives you an idea of how DeepDyve might be used.
DeepDyve invites your comments.
Reminder: Some documents you find at DeepDyve may be viewed directly, but many will require a for-fee subscription with the journal publisher.
A Federated Search Primer - Part III of III by Darcy Pedersen, Deep Web Technologies Blog (Feb 24, 2009)
Last of a three-part series that defined federated search (submitting queries to databases to get information), described its value, identified features, and listed some examples (some of which come from Deep Web Technologies).
Other resources are listed including an excellent video on Searching the Deep Web from Office of Scientific and Technical Information.
Will the “Deep Web” Slay Google? by Greg Sterling, Search Engine Land (Feb 23)
What will happen to web search? Greg Sterling refers to two articles from the New York Times
Everyone Loves Google, Until It’s Too Big by Randall Stross (Feb 21)
Theme: "“You almost feel sorry for Google,” said Danny Sullivan, editor in chief of Search Engine Land. “They’re doing a good job and people are turning to them. But when they pass 70 percent share, people are going to be uncomfortable about Google becoming a monopoly.”"
Exploring a ‘Deep Web’ That Google Can’t Grasp by Alex Wright (Feb 22)
What deep web search is: "“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources."
What Google Does: "Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — “Rembrandt,” “Picasso,” “Vermeer” and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains."
It can get quite complicated. Will we even recognize deep web results? Will the engine show us the patterns?
Sterling says, "Microsoft and Yahoo (assuming it doesn’t sell search to Redmond) will continue to make improvements in their algorithms, indexes and interfaces. The more competition the better because search will only become more important as the “deep web” is unlocked."
DeepDyve, the new service that promises to get into deeper web content, has a new blog - http://blog.deepdyve.com/
First posting on Jan 29 described the vision and approach.
"The vision behind our company is that Search is in its infancy and today’s ‘traditional’ search engines meet only our most basic, albeit common, needs – the fat part of the long tail. "
It uses its blog to post announcements about new content sources - such as The Scientist.
How Google crawls the deep web by Greg Linden (Jan 31)
Refers to a paper in which Google describes how it fills in web forms to query databases.
"This paper describes a system for surfacing Deep-Web content; i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index."
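Pre-computing submissions for a form amounts to enumerating URLs that a crawler can later fetch and index. The sketch below assumes a form that submits with GET, and the action URL and candidate values are hypothetical; choosing good candidate values is the hard part and is not shown.

```python
from itertools import product
from urllib.parse import urlencode


def surfacing_urls(action_url, inputs):
    """Yield one GET URL per combination of candidate values for the form inputs.

    `inputs` maps each form input name to a list of candidate values.
    """
    names = sorted(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield action_url + "?" + urlencode(list(zip(names, combo)))


# Hypothetical used-car search form with two inputs.
urls = list(
    surfacing_urls(
        "http://example.com/search",
        {"make": ["ford", "honda"], "zip": ["94301"]},
    )
)
```

Because the number of combinations is the product of the value counts, even a few inputs with many values produce far more URLs than are worth crawling, which is why pruning uninformative inputs matters.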
Interestingly, there was also this today --
Google: "We're Not Doing a Good Job with Structured Data" by Sarah Perez, Read Write Web (Feb 2)
"Google's Alon Halevy admitted that the search giant has "not been doing a good job" presenting the structured data found on the web to its users. By "structured data," Halevy was referring to the databases of the "deep web" - those internet resources that sit behind forms and site-specific search boxes, unable to be indexed through passive means."
Yahoo and Google are both working on automating the extraction of information from databases on the Web.
Darcy Pedersen, an expert in federated search engines, describes the workings and benefits of federated search engines in a three-part series at AltSearchEngines.
Part 1 -- Federated Search Finds Content that Google Can’t Reach
Key points:
+ federated search engines will dig deeply into the target databases - more deeply than Google because they will fill in the search interfaces.
+ this saves searchers time of seeking out the individual services and searching them.
+ quality is higher - "Federated search engines show their value best in environments in which the quality of results matters, such as libraries, corporate research environments, and the federal government."
A Federated Search Primer - Part II of III - offers a definition and describes important features.
"Federated search is the process of performing a simultaneous real-time search of multiple diverse and distributed sources from a single search page, with the federated search engine acting as intermediary."
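That definition maps onto a small fan-out-and-merge routine. The sketch below is an illustration only: the three stub sources stand in for real database connectors, and sorting by score assumes the sources return comparable scores, which real federated engines cannot take for granted.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, sources):
    """Send the query to every source at once and merge the hits by score.

    `sources` maps a source name to a callable(query) -> list of (title, score).
    A source that raises an exception is skipped.
    """
    merged = []
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(search, query) for name, search in sources.items()}
        for name, future in futures.items():
            try:
                hits = future.result()
            except Exception:
                continue  # one failing source should not sink the whole search
            merged.extend((title, name, score) for title, score in hits)
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged


# Hypothetical stub sources standing in for real database connectors.
def catalog_a(query):
    return [(f"{query} handbook", 0.9), (f"{query} review", 0.4)]


def catalog_b(query):
    return [(f"{query} dataset", 0.7)]


def broken_source(query):
    raise RuntimeError("source unavailable")


results = federated_search(
    "solar", {"A": catalog_a, "B": catalog_b, "C": broken_source}
)
```

The engine acts as the intermediary in the sense of the quoted definition: the user fills in one search box, and each source sees a query as if the user had gone to it directly.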
Firms Push for a More Searchable Federal Web by Peter Whoriskey, Washington Post (Dec 11)
Much of the public information from governments in the United States is not indexed by the major search engines. Google's chief executive Eric Schmidt may be able to change that in his new position as an "informal advisor" to President-elect Barack Obama.
For example: "A person using one of the search engines, for example, can't find Environmental Protection Agency enforcement actions against a given company, can't discover the picture of a specific ancient Egyptian artifact at the Smithsonian and can't search by name for the details of a Vietnam War casualty. "
But - searchers note this - "EPA enforcement actions can be found through a portal on the agency's site, details on Egyptian artifacts can be found through a search of the National Museum of Natural History and details of a Vietnam War casualty may be found by searching the National Archives site."
It is mainly because the data is stored in a database, accessible only by posting a query directly.
Deep Web Technologies - An Interview with Abe Lederman, Stephen Arnold, ArnoldIT (June 2008)
Abe Lederman is CEO of Deep Web Technologies - creator of federated search for science.gov, scitopia.org, biznar.com, mednar.com, and others.
The interview offers us an inside view of what really constitutes the deep web (databases), and why Google and the other general purpose engines are unlikely to succeed anytime soon at indexing it.
Lederman can't give a figure on the extent of "deep web", but he can say -- "Let’s just say that at least 90% of the information on the Web lives in databases that you just won’t see on Google."
The information in the databases is only extracted through queries. Google announced last spring (2008) that it was filling in forms at the search engines to retrieve information and index it. Lederman -- "Google will not be able to download every document in a database as it is only going to be issuing random or semi-random queries."
Deep Web as a federated search engine does the query in real time -- "Deep Web goes out and in real-time sends out search requests to information sources. Each such request is equivalent to a user going to the search form of an information source and filling the form out."
There is a huge difference between USA.gov, which is content that MSN crawls and indexes, and Science.gov, which is content obtained from "high-quality scientific" sources that Google and others can't index.
Search engine dives into cleantech by Sara Stroud, Sustainable Industries (Nov 17)
More news about DeepDyve (formerly Infovell) -- "the startup says it is expanding its focus from life sciences and patents to cleantech and energy, and will begin indexing material on such topics in late 2008."
"Targeting information-savvy users, DeepDyve claims to dig into unstructured online material, including technical and scholarly publications, databases and proprietary information. The company has also partnered with publishers—mostly of academic journals— to index materials. Users will still have to purchase articles from the publishers, Park says; but he notes that DeepDyve’s query system offers a greater assurance that an article would contain the sought after information."
ReadWriteWeb reviewed DeepDyve (Nov 11) -- DeepDyve: Indexing the Deep Web - and called it "an interesting technical experiment".
DeepDyve Dips Into the Deep Web by Marydee Ojala, Newsbreaks (Nov 17)
Detailed article by an expert on the new deep-web service - DeepDyve.
Current coverage - "To expand beyond biosciences, DeepDyve added the open access site arXiv, which covers physics and computer science. It anticipates greater coverage of the physical sciences, particularly information technology, clean technology, and energy, by the end of 2008. Next on its agenda is business information."
Technology is "KeyPhrase" - as explained - "It’s purely statistical, there are no semantics involved, no synonyms, no metadata. We get content and relevancy without taxonomy."
Hmm - that gives one pause.
Marydee Ojala reported on her search experience and posted comments of others. Overall - mixed.