WSG Newsletter: Clustering Web Search Results
Issue: July 21, 2002
Note: Since this letter was written, AlltheWeb has dropped Fast Topics
(but promises to reinstate them someday), WiseNut has deteriorated badly under
LookSmart ownership and is badly out of date, and Northern Light has been lost
to divine. But Ez2www is a new meta-search engine that will cluster results.
February 5, 2003
Wonderful as Google is, there is one thing it can't yet do for Web searchers
- and that is group the pages into topics for easier browsing (though it does
do this in its news search). From the earliest days of Web searching,
searchers have complained
about being overwhelmed by results. An alternative has been to use a subject
directory where web sites are classified according to main topic. While useful
for obtaining an overview on a topic, the subject directory has never been good
for satisfying specific and precise queries. A subject directory such as the
very large Open Directory
Project can find forest management but not pine tree fungus.
Google, on the other hand, finds 18,000 pages with these words, and presents a
good first page of likely possibilities. But from what angle? Is this a
gardening interest, an environmental concern, a natural history interest?
In the past year several search engines have started to group results around
dynamically generated topics. Vivisimo, on the Web in the fall of 2000, was among the
first. Asked about pine tree fungus, Vivisimo groups Web pages into disease,
gardening, photos, white pine blister rust, forestry - and more.
This is document clustering.
Clustering works from the premise that "closely associated documents tend
to be relevant to the same requests" [Information Retrieval by C.J. van
Rijsbergen, p. 45]. Close association is
determined by analyzing the text for similarity among the documents in words
and phrases used. Each cluster can be labeled by a short phrase description
derived from the co-occurrence of significant words.
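As a rough illustration of "similarity among the documents in words", the sketch below compares word counts with cosine similarity. This is one common textbook measure, not necessarily what any engine mentioned here uses; the function names and the sample snippets are invented for the example.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text and count its words (a deliberately crude tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical snippets, invented for illustration.
rust = term_counts("white pine blister rust is a fungus disease of pine trees")
canker = term_counts("pitch canker is a fungus disease of pine trees")
car = term_counts("jaguar unveils a new luxury sports car")

sim_related = cosine_similarity(rust, canker)   # two tree-disease snippets
sim_unrelated = cosine_similarity(rust, car)    # a tree disease vs. a car
```

The two tree-disease snippets score far higher with each other than either does with the car snippet, which is the signal a clustering routine would group on. A real system would also drop common words such as "a" and "is" and weight rare terms more heavily.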
Vivisimo says that technically speaking it is doing "on-the-fly,
conceptual, hierarchical, document clustering".
- On-the-fly - results are analyzed and grouped dynamically at the time of
the search into topics. Vivisimo analyzes only the title, URL and short
description - not the full document.
- Conceptual - clusters are identified by what they are most about. If
Vivisimo can't describe the cluster concisely it discards it.
- Hierarchical - clusters are organized in a tree structure so that one can
see further refinements.
In the case of pine tree fungus, the disease cluster contained
sub-clusters for pests, root, blister rust, and other.
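The three properties can be mimicked in a toy sketch. Vivisimo's actual method is proprietary and is not described here; the code below simply groups short result snippets under shared words at search time, drops any group it cannot label usefully, and repeats the process inside each group to get one level of sub-clusters. The function names, the stopword list and the snippets are all assumptions made for the example.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def terms(text):
    """Distinct lowercase words in a snippet, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def cluster(snippets, exclude, min_size=2, depth=2):
    """Group snippets under shared terms.

    Returns {label: {"docs": [...], "sub": {...}}}. A snippet may appear
    under more than one label.
    """
    by_term = defaultdict(list)
    for snippet in snippets:
        for term in terms(snippet) - exclude:
            by_term[term].append(snippet)
    tree = {}
    for label, docs in by_term.items():
        # A term shared by too few snippets, or by all of them, makes a
        # poor label, so the would-be cluster is discarded.
        if len(docs) < min_size or len(docs) == len(snippets):
            continue
        sub = cluster(docs, exclude | {label}, min_size, depth - 1) if depth > 1 else {}
        tree[label] = {"docs": docs, "sub": sub}
    return tree

# Hypothetical result snippets for the query "pine tree fungus".
snippets = [
    "White pine blister rust disease",
    "Blister rust disease control",
    "Pine root disease and pests",
    "Root disease in pine forestry",
    "Pine tree fungus photos",
]
tree = cluster(snippets, exclude=terms("pine tree fungus"))
```

Here `tree["disease"]` holds four snippets, with sub-clusters for blister, rust and root. The same words also surface as top-level groups, which a production system would merge or rank away.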
Benefits
Document clustering has long been proven to be an aid to searchers.
- Done well, it saves the searcher time and effort in assessing the variety
of possible meanings and aspects of a very long list, and provides quick
identification of the clusters that best match interests.
- It is especially helpful to people who are new to a subject area and don't
know the key terms.
- Most notably it can disambiguate words that have multiple meanings
depending on the context. Jaguar is a classic example. Is that the cat
(Panthera onca), the car, the club, or the football team?
Search Engines with Topic Groups
Today there are several search services that offer some form of grouping by
topics. Northern Light is the only one to pre-tag documents. Strictly speaking,
it is not document clustering - it is classification. All others do dynamic
clustering and all have had to find ways to curtail use of CPU - by restricting
the amount of text analyzed and possibly the number of pages.
Northern Light
might be best known for its Custom Search folders. People who have been
searching the Web for a couple of years or more will remember the Web search
engine. Search results are organized into folders by subject, type of document,
language, and country. Each page is examined at the time of indexing and
matched to a controlled vocabulary of subject terms and definitions and tagged
with a subject term. For a specific search, qualifying pages are organized into
the folders based on the tags. Regrettably only subscribers to the Enterprise
product line can avail themselves of the custom folders for web searching, but
users of Northern Light's free Current News Search (www.northernlight.com/news.html) can still see it in action.
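The difference between this kind of classification and clustering is easier to see in code. The sketch below tags each page once, at indexing time, against a fixed vocabulary, then builds folders from the stored tags at search time. The vocabulary, the page URLs and the function names are invented for illustration and are not Northern Light's.

```python
import re
from collections import defaultdict

# Hypothetical controlled vocabulary: subject term -> words that signal it.
VOCABULARY = {
    "Forestry": {"forest", "forestry", "timber"},
    "Plant diseases": {"fungus", "rust", "blight"},
    "Automobiles": {"car", "sedan", "engine"},
}

def tag_page(text):
    """Return the subject whose words overlap the page most, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best = max(VOCABULARY, key=lambda subject: len(VOCABULARY[subject] & words))
    return best if VOCABULARY[best] & words else None

def build_index(pages):
    """Tag every page once, at indexing time."""
    return {url: tag_page(text) for url, text in pages.items()}

def folders(index, matching_urls):
    """At search time, file the matching pages into folders by stored tag."""
    out = defaultdict(list)
    for url in matching_urls:
        out[index[url] or "Other"].append(url)
    return dict(out)

# Hypothetical pages.
pages = {
    "a.example/rust": "White pine blister rust is a fungus",
    "b.example/timber": "Timber and forest management",
    "c.example/misc": "Holiday photos",
}
index = build_index(pages)
result = folders(index, list(pages))
```

Because the tags are computed ahead of time, the folder step at search time is only a lookup. The price is that the folders can never name a topic that is missing from the vocabulary.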
AlltheWeb added Fast
Topics in November 2001. It used Open Directory's collection of categorized web
sites as a training base for grouping and describing web results. Documents
matching a profile from Open Directory are grouped into OD-like categories.
Documents that don't match well are grouped into clusters and described from
key terms. AlltheWeb analyzes and clusters only the first 200 results. (See
Fast Topics FAQ)
AltaVista enhanced its
search service with Prisma in July. Described as a 360 degree experience,
AV Prisma suggests
twelve additional and related terms for the search query. Words are drawn from
an analysis of the first 50 results - the title, URL, and short description -
at least for now. (See Gary Price's report.)
For our query for pine tree fungus, Prisma suggests allergy, bark, and fungus
gnats as the first three. White pine blister rust is number 12.
Clicking on a term adds it to the search query. Prisma is not full-fledged
document clustering but it does help one refine a search to a more specific
meaning.
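A term-suggestion feature of this kind can be approximated in a few lines. The sketch below counts how many of the top result snippets contain each word, leaves out the words already in the query, and offers the most frequent of the rest. AltaVista has not published how Prisma picks its terms, so the counting rule, the function names and the snippets are assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def words(text):
    """Distinct lowercase words in a text, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def suggest_terms(query, snippets, max_results=50, max_terms=12):
    """Suggest the words found in the most top-ranked snippets."""
    query_words = words(query)
    doc_freq = Counter()
    for snippet in snippets[:max_results]:
        # Each snippet counts a word once, however often it repeats.
        doc_freq.update(words(snippet) - query_words)
    return [term for term, _ in doc_freq.most_common(max_terms)]

def refine(query, term):
    """Clicking a suggestion adds it to the query."""
    return f"{query} {term}"

# Hypothetical snippets for the query "pine tree fungus".
snippets = [
    "pine bark fungus",
    "bark beetle and fungus",
    "fungus gnats on pine bark",
    "fungus gnats",
]
suggestions = suggest_terms("pine tree fungus", snippets)
refined = refine("pine tree fungus", suggestions[0])
```

Unlike clustering, nothing here assigns pages to groups. The output is only a list of words, which fits the point that Prisma refines a query rather than organizing the results.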
Teoma claims to use dynamic
topic clustering along with popularity and text analysis. It will organize
results along "naturally occurring communities that are about the subject
of each search query". These are shown under Refine. It works best
with one- or two-word queries that generate a large result set. Teoma's small
database of 200,000 documents reduces the usefulness of Refine because there
are few pages to group. A search for jaguar doesn't find cats at all.
WiseNut introduced its
topic groupings about the same time as Teoma. Here they are called
WiseGuides. However, pine tree fungus seems to baffle WiseNut, and
jaguar has no presence as a cat. Groupings seem to be based entirely on words
in the title.
KillerInfo is a better
performer. It uses Vivisimo's technology for clustering and like Vivisimo is a
meta-search engine. It searches Open Directory, Fast, MSN, Yahoo, AltaVista. In
fact, perhaps because its roster of search engines is different from
Vivisimo's, it produced a more detailed list of topics. It also has the very
handy Quick Peek feature to get a preview of a page.
Vivisimo also has
browsing aids for clicking through results or viewing in a new window. In
addition to metasearch against six search services (a dwindling number for
Vivisimo), there are Vivisimo front-ends for New York Times, Yahoo News, Ebay,
PubMed and some other specialty databases.
There is yet another meta-searcher - iBoogie (www.iboogie.tv) - which may
challenge Vivisimo's lead. It also uses "linguistic clustering and statistical
clustering" and generates "hierarchical clustering as opposed to a simple
'flat' grouping of similar documents". People commenting in forums say this
meta-searcher "rocks". The company is something of a mystery - no press
releases, no about us. It seems to be affiliated with Quigo Technologies,
which supplies the database for iBoogie's Deep Web search.
Infonetware uses
Real Term search from Infogistics. It has a finer grain and appears to
use the phrases in the documents more to identify the clusters. It too is a
meta-search engine - clustering tools need large datasets - and calls on Yahoo,
Lycos.uk, and MSN. Its list of topics for pine tree fungus is a bit
overwhelming in detail but we can drill down on pine and then diseases as a
sub-grouping to find the pine blister rust. Infonetware provides for a very
fine definition of topics and close examination of the pages without having to
click through each result.
When To Use
All searches are enhanced by the additional browsing and search
refinements offered through topic groupings. But some queries benefit more than
others, particularly when you want:
- A profile on a person. Infonetware seems especially good because of the
detail it can extract. Vivisimo and KillerInfo do well also.
- An overview of a new topic. AlltheWeb provides a good topic analysis of
electronic records management.
- The right ballpark when using words and phrases that have multiple
meanings. In theory all the search engines mentioned here should have been able
to list cats as one of the topics for jaguar. In practice, only KillerInfo
could do it. It's almost as if the extinction of the jaguar on the Web precedes
its extinction in the jungle.
Conclusion
Each of these has its own strengths and weaknesses. AlltheWeb may
have more reliable topic names through its use of Open Directory's category
descriptions. Vivisimo and KillerInfo present a good range of topics
hierarchically arranged. Infonetware is more precise in picking out phrases for
a cluster. AltaVista's Prisma is mainly an aid for adding search terms. Teoma
and WiseNut can help take the searcher into the right ballpark. Use them all
and pick a couple as favourites.
Also of Interest
Tara Calishain wrote about Clustering with Search Engines in LLRX
in June 2002.
www.llrx.com/features/clusteringsearch.htm
She also mentions SurfWax
for its FocusWords - words that might be used to refine the search.
Web Document Clustering: A Feasibility Demonstration by Oren
Zamir and Oren Etzioni, Department of Computer Science and Engineering,
University of Washington, Seattle (1998) -- definitive article.
Textquest: Document Clustering of Medline Abstracts for Concept
Discovery in Molecular Biology by I. Iliopoulos, A. J. Enright, and
C. A. Ouzounis, in PSB Online Proceedings (2001)
Evaluating Document Clustering for Interactive Information
Retrieval by Anton Leuski, Center for Intelligent Information Retrieval,
Department of Computer Science, University of Massachusetts (2002)
Web Page Categorization and Feature Selection Using Association
Rule and Principal Component Clustering by Jerome Moore, Eui-Hong (Sam)
Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis,
Vipin Kumar, and Bamshad Mobasher, Department of Computer Science and
Engineering, University of Minnesota, Minneapolis