
WSG Newsletter: Clustering Web Search Results

Issue: July 21, 2002

Note: Since this letter was written, AllTheWeb has dropped Fast Topics (but promises to reinstate them someday), WiseNut has deteriorated under LookSmart ownership and is badly out of date, and Northern Light has been lost to divine. However, Ez2www is a new meta-search engine that will cluster results.
February 5, 2003


Wonderful as Google is, there is one thing it can't yet do for Web searchers: group the pages into topics for easier browsing (though it does do this in its news search). From the earliest days of Web searching, searchers have complained about being overwhelmed by results. An alternative has been to use a subject directory where web sites are classified according to main topic. While useful for obtaining an overview of a topic, the subject directory has never been good for satisfying specific and precise queries. A subject directory such as the very large Open Directory Project can find forest management but not pine tree fungus. Google, on the other hand, finds 18,000 pages with these words and presents a good first page of likely possibilities. But from what angle? Is this a gardening interest, an environmental concern, a natural history interest?

In the past year several search engines have started to group results around dynamically generated topics. Vivisimo, on the Web in the fall of 2000, was among the first. Asked about pine tree fungus, Vivisimo groups Web pages into disease, gardening, photos, white pine blister rust, forestry - and more.

Vivisimo search results

This is document clustering. Clustering works from the premise that "closely associated documents tend to be relevant to the same requests". [p 45 Information Retrieval by C.J. van Rijsbergen] Close association is determined by analyzing how similar the documents are in the words and phrases they use. Each cluster can be labeled with a short descriptive phrase derived from the co-occurrence of significant words.
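To make the idea of "close association" concrete, here is a minimal sketch in Python. It assumes a simple word-count comparison (cosine similarity) and a greedy one-pass grouping; the stopword list, the 0.3 threshold and the function names are illustrative choices, not the method used by any engine discussed here.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}  # illustrative only

def term_vector(text):
    """Count the non-stopword terms in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(texts, threshold=0.3):
    """Greedy one-pass clustering: join the first cluster whose first document is similar enough."""
    vectors = [term_vector(t) for t in texts]
    clusters = []  # each cluster is a list of indexes into texts
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def label(texts, members, size=2):
    """Label a cluster with the terms found in the most member documents (ties alphabetical)."""
    doc_freq = Counter()
    for i in members:
        doc_freq.update(set(term_vector(texts[i])))
    return sorted(doc_freq, key=lambda t: (-doc_freq[t], t))[:size]

docs = [
    "White pine blister rust disease",
    "Blister rust disease in white pine forests",
    "Garden care for pine trees",
    "Pine trees in the home garden",
]
groups = cluster(docs)                       # two groups: rust pages and garden pages
labels = [label(docs, g) for g in groups]
```

The first two texts share almost all their words and fall into one group, while the two gardening texts form another. The labels are simply the words shared by the most documents in each group, which is a much cruder description than the phrases a real clustering engine would produce.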

Vivisimo says that technically speaking it is doing "on-the-fly, conceptual, hierarchical, document clustering".

  • On-the-fly - results are analyzed and grouped dynamically into topics at the time of the search. Vivisimo analyzes only the title, URL and short description - not the full document.
  • Conceptual - clusters are identified by what they are most about. If Vivisimo can't describe the cluster concisely it discards it.
  • Hierarchical - clusters are organized in a tree structure so that one can see further refinements.

In the case of pine tree fungus, the disease cluster contained sub-clusters for pests, root, blister rust, and other.
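The three properties above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: it reads just the title and description of each result (URL text is left out for brevity), names each cluster after a single widely shared word, drops any candidate cluster that fewer than `min_size` results share, and recurses to build the tree. The function names, the `ignore` set and the sample results are hypothetical; Vivisimo's actual algorithm is not described in this newsletter.

```python
import re
from collections import Counter

def terms(snippet):
    """Distinct lowercase words from a result's title and short description only."""
    text = snippet["title"] + " " + snippet["description"]
    return set(re.findall(r"[a-z]+", text.lower()))

def build_tree(snippets, ignore, depth=2, min_size=2):
    """Group snippets under their most widely shared term, then recurse for sub-clusters."""
    if depth == 0 or len(snippets) < min_size:
        return {}
    remaining = list(snippets)
    tree = {}
    while remaining:
        freq = Counter()
        for s in remaining:
            freq.update(terms(s) - ignore)
        # Most widely shared term; ties are broken alphabetically for repeatable output.
        best = min(freq, key=lambda t: (-freq[t], t), default=None)
        if best is None or freq[best] < min_size:
            break  # no label covers enough results, so no further cluster is made
        members = [s for s in remaining if best in terms(s)]
        remaining = [s for s in remaining if best not in terms(s)]
        tree[best] = {
            "results": [s["title"] for s in members],
            "children": build_tree(members, ignore | {best}, depth - 1, min_size),
        }
    if remaining and tree:
        tree["other"] = {"results": [s["title"] for s in remaining], "children": {}}
    return tree

results = [
    {"title": "Blister rust disease", "description": "disease of white pine"},
    {"title": "Root disease", "description": "root rot disease"},
    {"title": "Blister rust control", "description": "disease management"},
    {"title": "Garden planting", "description": "garden advice"},
    {"title": "Garden photos", "description": "photos"},
]
# Query words and filler words are ignored so they cannot become cluster names.
tree = build_tree(results, ignore={"pine", "tree", "fungus", "of"})
```

With this sample data the top level holds a disease cluster and a garden cluster, and the disease cluster contains a blister sub-cluster plus an "other" bucket for the leftover root disease page.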

Benefits

Document clustering has long been proven to be an aid to searchers.

  • Done well, it saves the searcher time and effort in assessing the variety of possible meanings and aspects of a very long list, and provides quick identification of the clusters that best match interests.
  • It is especially helpful to people who are new to a subject area and don't know the key terms.
  • Most notably, it can disambiguate words that have multiple meanings depending on the context. Jaguar is a classic example. Is that the cat (Panthera onca), the car, the club, or the football team?

Search Engines with Topic Groups

Today there are several search services that offer some form of grouping by topic. Northern Light is the only one to pre-tag documents. Strictly speaking, that is not document clustering - it is classification. All the others do dynamic clustering, and all have had to find ways to curtail CPU use - by restricting the amount of text analyzed and possibly the number of pages.

Northern Light might be best known for its Custom Search folders. People who have been searching the Web for a couple of years or more will remember the Web search engine. Search results are organized into folders by subject, type of document, language, and country. Each page is examined at the time of indexing and matched to a controlled vocabulary of subject terms and definitions and tagged with a subject term. For a specific search, qualifying pages are organized into the folders based on the tags. Regrettably only subscribers to the Enterprise product line can avail themselves of the custom folders for web searching, but users of Northern Light's free Current News Search (www.northernlight.com/news.html) can still see it in action.
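The difference between this kind of classification and dynamic clustering is where the work happens. A minimal Python sketch, assuming a tiny made-up controlled vocabulary and example URLs (none of which come from Northern Light), shows pages being tagged once at indexing time and then merely sorted into folders by that stored tag at search time:

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: subject term -> words that define it.
VOCABULARY = {
    "Forestry": {"forest", "timber", "logging"},
    "Plant diseases": {"fungus", "rust", "blight"},
    "Gardening": {"garden", "planting", "mulch"},
}

def tag_page(text):
    """Index-time step: pick the subject whose defining words overlap the page most."""
    words = set(text.lower().split())
    best = max(VOCABULARY, key=lambda subject: len(VOCABULARY[subject] & words))
    return best if VOCABULARY[best] & words else "Unclassified"

def build_index(pages):
    """Store each page together with its pre-assigned subject tag."""
    return [{"url": url, "text": text, "subject": tag_page(text)} for url, text in pages.items()]

def search_into_folders(index, query):
    """Query-time step: no text analysis, just group matching pages by their stored tag."""
    query_words = set(query.lower().split())
    folders = defaultdict(list)
    for page in index:
        if query_words <= set(page["text"].lower().split()):
            folders[page["subject"]].append(page["url"])
    return dict(folders)

pages = {
    "example.org/a": "pine fungus and rust in the forest",
    "example.org/b": "pine timber logging forest report",
    "example.org/c": "planting a pine in your garden with mulch",
}
index = build_index(pages)
folders = search_into_folders(index, "pine")
```

Because the tags are fixed when the page is indexed, the folder names are consistent from search to search, but they can only ever be terms that already exist in the vocabulary.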

AlltheWeb added Fast Topics in November 2001. It used Open Directory's collection of categorized web sites as a training base for grouping and describing web results. Documents matching a profile from Open Directory are grouped into OD-like categories. Documents that don't match well are grouped into clusters and described from key terms. AlltheWeb analyzes and clusters only the first 200 results. (See Fast Topics FAQ)
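The two-stage approach described here can be sketched in Python as follows. The category profiles, the overlap threshold and the "longer than three letters" rule for key terms are all assumptions made for illustration; only the 200-result cap comes from the description above, and nothing here is Fast's actual code.

```python
from collections import Counter, defaultdict

MAX_ANALYZED = 200  # the article says only the first 200 results are analyzed

# Hypothetical category profiles standing in for ones trained on a directory.
PROFILES = {
    "Science/Biology": {"species", "fungus", "habitat"},
    "Home/Gardening": {"garden", "planting", "soil"},
}

def key_words(text):
    """Distinct words longer than three letters, a crude stand-in for key terms."""
    return {w for w in text.lower().split() if len(w) > 3}

def group_results(snippets, min_overlap=2):
    """Use a directory-style category when a profile matches well, else cluster by key term."""
    groups = defaultdict(list)
    leftovers = []
    for text in snippets[:MAX_ANALYZED]:
        words = key_words(text)
        category = max(PROFILES, key=lambda c: len(PROFILES[c] & words))
        if len(PROFILES[category] & words) >= min_overlap:
            groups[category].append(text)   # good match: use the directory-style name
        else:
            leftovers.append(text)          # poor match: handled by key term below
    freq = Counter(w for text in leftovers for w in key_words(text))
    for text in leftovers:
        # Name the cluster after the text's most widely shared key word.
        key = min(key_words(text), key=lambda w: (-freq[w], w), default="other")
        groups[key].append(text)
    return dict(groups)

snippets = [
    "fungus species and habitat notes",
    "garden soil planting tips",
    "jaguar car dealers",
    "jaguar car club news",
]
topics = group_results(snippets)
```

In this toy run the first two snippets land in directory-style categories, while the two jaguar snippets match no profile and are grouped under the key term they share.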

AltaVista enhanced its search service with Prisma in July. Described as a 360 degree experience, AV Prisma suggests twelve additional and related terms for the search query. Words are drawn from an analysis of the first 50 results - the title, url, and short description - at least for now. (See Gary Price's report.)

AV Prisma - 6 of the 12 terms

For our query for pine tree fungus Prisma suggests allergy, bark, fungus gnats as the first three. White pine blister rust is number 12. Clicking on a term adds it to the search query. Prisma is not full-fledged document clustering but it does help one refine a search to a more specific meaning.
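Term suggestion of this kind is simpler than clustering, as a short Python sketch shows. It assumes that suggestions are just the words appearing in the most results; it reads only titles and descriptions (URL text is left out for brevity) and skips words shorter than four letters. The limits of 50 results and 12 terms come from the description above, while the ranking rule, function names and sample results are hypothetical and not AltaVista's.

```python
import re
from collections import Counter

ANALYZED_RESULTS = 50   # per the description: the first 50 results are analyzed
SUGGESTED_TERMS = 12    # per the description: twelve terms are suggested

def suggest_terms(query, results, limit=SUGGESTED_TERMS):
    """Rank words by how many of the top results mention them, skipping the query's own words."""
    query_words = set(query.lower().split())
    counts = Counter()
    for r in results[:ANALYZED_RESULTS]:
        text = r["title"] + " " + r["description"]
        words = set(re.findall(r"[a-z]{4,}", text.lower()))
        counts.update(words - query_words)
    return sorted(counts, key=lambda w: (-counts[w], w))[:limit]

def refine(query, term):
    """Clicking a suggested term adds it to the search query."""
    return f"{query} {term}"

results = [
    {"title": "Pine bark beetles", "description": "bark damage and fungus"},
    {"title": "Fungus gnats", "description": "gnats on pine bark"},
    {"title": "Pine allergy", "description": "pollen allergy season"},
]
suggestions = suggest_terms("pine tree fungus", results, limit=3)
refined = refine("pine tree fungus", suggestions[0])
```

Unlike clustering, nothing here groups the results themselves; the searcher gets a list of words and reruns the search with one added.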

Teoma claims to use dynamic topic clustering along with popularity and text analysis. It will organize results along "naturally occurring communities that are about the subject of each search query". These are shown under Refine. It works best with one- or two-word queries that generate a large results set. Teoma's small database of 200,000 documents reduces the usefulness of Refine because there are too few results to group. A search for jaguar doesn't find cats at all.

WiseNut introduced its topic groupings about the same time as Teoma. Here they are called WiseGuides. However, pine tree fungus seems to baffle WiseNut and the Jaguar has no presence as a cat. Groupings seem to be based entirely on words in title.

KillerInfo is a better performer. It uses Vivisimo's technology for clustering and like Vivisimo is a meta-search engine. It searches Open Directory, Fast, MSN, Yahoo, Altavista. In fact, perhaps because its roster of search engines is different than Vivisimo's, it produced a more detailed list of topics. It also has the very handy Quick Peek feature to get a preview of a page.

Vivisimo also has browsing aids for clicking through results or viewing in a new window. In addition to metasearch against six search services (a dwindling number for Vivisimo), there are Vivisimo front-ends for New York Times, Yahoo News, Ebay, PubMed and some other specialty databases.

There is yet another meta-searcher - iBoogie (www.iboogie.tv) - which may challenge Vivisimo's lead. It also uses "linguistic clustering and statistical clustering" and generates "hierarchical clustering as opposed to a simple 'flat' grouping of similar documents". People commenting in forums say this meta-searcher "rocks". The company is something of a mystery - no press releases, no about us. It seems to be affiliated with Quigo Technologies, which supplies the database for iBoogie's Deep Web search.

Infonetware uses Real Term search from Infogistics. It has a finer grain and appears to use the phrases in the documents more to identify the clusters. It too is a meta-search engine - clustering tools need large datasets - and calls on Yahoo, Lycos.uk, and MSN. Its list of topics for pine tree fungus is a bit overwhelming in detail but we can drill down on pine and then diseases as a sub-grouping to find the pine blister rust. Infonetware provides for a very fine definition of topics and close examination of the pages without having to click through each result.

When To Use

All searches are enhanced through the additional browsing and search refinements offered through topic groupings. But some queries may benefit more when you want:

  • A profile on a person. Infonetware seems especially good because of the detail it can extract. Vivisimo and KillerInfo do well also.
  • An overview of a new topic. AlltheWeb provides a good topic analysis of electronic records management.
  • The right ballpark when using words and phrases that have multiple meanings. In theory all the search engines mentioned here should have been able to list cats as one of the topics for jaguar. In actuality, only KillerInfo could do it. It's almost as if the jaguar's extinction on the Web precedes its extinction in the jungle.

Conclusion

Each of these has its own strengths and weaknesses. AlltheWeb may have more reliable topic names through its use of Open Directory's category descriptions. Vivisimo and KillerInfo present a good range of topics hierarchically arranged. Infonetware is more precise in picking out phrases for a cluster. AltaVista's Prisma is mainly an aid for adding search terms. Teoma and WiseNut can help take the searcher into the right ballpark. Use them all and pick a couple as favourites.


Also of Interest

Tara Calishain wrote about Clustering with Search Engines in LLRX in June 2002.
www.llrx.com/features/clusteringsearch.htm

She also mentions SurfWax for its FocusWords - words that might be used to refine the search.


Web Document Clustering: A Feasibility Demonstration by Oren Zamir, Oren Etzioni. Department of Computer Science and Engineering University of Washington Seattle (1998) -- definitive article.

Textquest: Document Clustering of Medline Abstracts for Concept Discovery in Molecular Biology by I. Iliopoulos, A. J. Enright, C. A. Ouzounis in PSB Online Proceedings. (2001)

Evaluating Document Clustering for Interactive Information Retrieval by Anton Leuski. Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts (2002)

Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering by Jerome Moore, Eui-Hong (Sam) Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis, Vipin Kumar, and Bamshad Mobasher Department of Computer Science and Engineering University of Minnesota, Minneapolis

 

 

 


Newsletter by Gwen Harris


Copyright Gwen Harris
A service to subscribers of WebSearchGuide (http://www.websearchguide.ca)

