WSG Newsletter: Taxonomies Rule
Report on Infonortics Search Engine Conference April 2002
Issue: June 1, 2002
Each year for the past seven, Infonortics has organized the
Search Engine Conference to bring together developers, users,
and observers of the search scene to look at issues and trends in search
technology. This year's meeting was held in San Francisco, April 15-16,
2002. It was titled "The Agony and the Ecstasy," possibly referring to the
sometimes agonizing problems in information retrieval and the ecstasy of
actually finding an answer.
Taxonomies and categorization made up the refrain throughout the two days. As
Sue Feldman of IDC pointed out, taxonomies (classification schemas)
are needed for browsing: they answer the question "What's in
this database?" Taxonomies are the base for creating retrieval systems
that will deliver "information in context," a point driven home by
Clare Hart of Factiva. This is the ultimate objective: to increase the relevancy of
results. Further, users should be able to express questions more naturally, as
Elizabeth Liddy of Syracuse University outlined, and search tools should
deliver answers, not mere matches on a topic. Natural language processing
attempts to understand text in order to extract implicit and explicit meaning.
Accomplishing this requires rules and a taxonomy of entity types. In
linguistic analysis as well, as explained by Laurent Proulx of Nstein, a
taxonomy is needed as a frame of reference. Taxonomies figure as the
central structural support of the search edifice.
Presentations are online at
infonortics.com (with the exception of Google's). The
following account summarizes most of those with big-picture views of
developments in search technology. Most of the products mentioned by the
speakers are intended for use in an enterprise. The company sites will have
descriptions and often demos. A few may be seen in action on a public Web site,
such as the Flamenco project at University of California, Berkeley, for
access to the UCB architecture image collection. More about examples on the
public web is the subject of the Web Search Alert column in the July / August
issue of Information Highways.
Future of Search Engines
Clare Hart, of Factiva, talked of the Future of Search Engines in her
keynote address. Factiva is the marriage of Reuters and Dow Jones to provide
news and business information to companies around the world. They have studied
and been part of the rising tide of information for many years.
Some facts:
- newspapers account for 25 terabytes of information a year, magazines 10,
and office documents 195
- it took 300,000 years to generate the first 12 exabytes of information.
Another 12 exabytes will be generated in the next 2.5 years.
- 610 billion emails are sent in a year, representing 11 terabytes
To convert this to something we know, a public library of 300,000 books
might represent 1 terabyte. An exabyte is 1,000,000 terabytes. [Sources:
www.archive.org/xterabytes.html and
www.webstreetstudios.com/school/bitsbytes.htm]
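To make the scale concrete, the conversion above can be worked through in a few lines. This is only a back-of-envelope sketch using the figures quoted in this report; the constant and function names are invented for illustration.

```python
# Figures as quoted in the report above; names are illustrative only.
TERABYTES_PER_EXABYTE = 1_000_000
BOOKS_PER_LIBRARY = 300_000  # a library of this size is taken as roughly 1 terabyte
YEARLY_TERABYTES = {"newspapers": 25, "magazines": 10, "office documents": 195}


def exabytes_to_libraries(exabytes: int) -> int:
    """Number of 300,000-book libraries (about 1 terabyte each) in `exabytes`."""
    return exabytes * TERABYTES_PER_EXABYTE


libraries = exabytes_to_libraries(12)
books = libraries * BOOKS_PER_LIBRARY
print(f"12 exabytes is roughly {libraries:,} such libraries, or {books:,} books")

total_tb = sum(YEARLY_TERABYTES.values())
print(f"Newspapers, magazines and office documents: {total_tb} terabytes a year")
```

On these numbers, the 12 exabytes expected over the next 2.5 years would fill about 12 million such libraries.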
Hart noted that since 1995 searching has become more sophisticated with
indexed databases, indexing, filters to deliver alerts, dynamic metadata, and
natural language processing. Attention post-2002 will be on knowledge
management and content management, with taxonomies providing the means for
categorization and organization of information, and the discovery of
information in context providing the relevance.
The need is great. Clare Hart observed that users are "search illiterate,"
spending a lot of time searching poor and inadequate sources. They may soon be
overwhelmed by the growth in the number of formats and, as mentioned, in volume.
Getting to information in context - viewing only that which is needed
for a particular task - is the ultimate goal, achievable, it is felt,
through an understanding of the work flows and by normalizing data so that it
can be commonly interpreted. One example is a Sales and Marketing
department able to see its functions interconnected with information and
events in other areas because there is a taxonomy that captures the elements of
the business. A salesperson preparing for a customer call would be able to
easily bring in relevant material from legal, finance, marketing research, and
other functions to create a customer profile and competitive assessment. The
result would be "sales information in context." The aim is to
surface relevant information without a search box. Some slides
illustrating "PR Information in Context" are online.
What Content and Knowledge Management Require of Search Engines
Jim Bair of Strategy Partners also sees a future of information in context.
He predicts search technology will evolve into knowledge technology where, instead
of today's "bag of words and tricks (patterns)," we'll have
"retrieval in context, ontologies, domain-specific semantic
networks, lexicons and leprecons (agents)." His vision of knowledge
technology is a contextual one where connected people use a Process and
Knowledge Base to produce decisions. Imagine an organization where all
documents - email, office documents, transactions - are processed by
an Electronic Document Manager to be tagged and stored for extraction according
to personal profiles.
Bair has identified eleven groups in the Knowledge Technology market by
which he has classified the products of over 90 vendors. These are listed
below, each with a few examples (see the Web version of the presentation for
the full list of examples):
- Portals (Autonomy)
- Hybrids (OpenText)
- Search Plus (Albert, iPhrase, Google)
- Ontological Management (Applied Semantics, Ask Jeeves, Semio)
- Knowledge Synthesis (ClearForest, Nstein)
- Knowledge Aggregators (The Brain, Thoughtshare, Enfish)
- Visualizers (WebMap, Antarcti.ca)
- Collaboration (Groove Networks)
- Knowledge Mining (ClearForest)
- Information Architecture (Adobe, Interwoven)
- Information Aggregators (Factiva, Dialog, Northern Light)
These are mapped in a schematic to Knowledge Management in the present and
over the next three years, Relationship Management (when experts will be in
greater demand) and Meaning Management up to 2005. The use of Metadata and the
application of Ontologies and Domain Specific Lexicons will dominate 2003 and
into 2004. By 2005 we may reach an age of meaning management and
sense-making. The vendors are plotted on a cluster diagram over the
timespan, with the visualizers and knowledge aggregators coming into their own
as sense-making tools in 2005, when these tools will deal with the
"aboutness" of the documents and not merely the individual words.
Find What I Mean, Not What I Say
Susan Feldman is Research Vice-President of Content Management and Retrieval
Software at IDC and has been a frequent writer of articles about information
retrieval technologies. She asks search tools to "Find What I Mean, Not
What I Say." Finding information is a dialog. When seeking information, people
ask questions, interpret answers according to the
context, process and categorize the information, and from this extract an
answer. Search systems are getting better at this too. Feldman sees an emerging
retrieval model of linguistic tools, categorization, taxonomies, rule bases,
and machine learning, all of these being added to standard techniques.
Long a proponent of natural language processing, Feldman reviewed the levels
of NLP: the morphology of the word, its syntactic role in the sentence, its
semantic meaning, the text structure that marks the most important sections,
and heuristics for picking up the most important points. The presentation
includes an example of a sentence tagged for parts of speech, bracketing and
categorization. In addition, new tools are using linguistic analysis to expand
the terms, and categorization to analyze the content and alert a user to a
topic or deliver focused advertisements. Extraction can pick up concepts,
relationships, and even mood and intention. Attensity works with unstructured
text, and MetaMarker can detect mood in text.
She sees new retrieval applications becoming more conversational: capable of
understanding the query, asking questions or detecting mood, and coming up with
an answer. Linguistic tools to analyze, extract, summarize, and categorize are
part of the whole.
In the short term we will see better and expanded search tools,
and in the long term, unified search and machine learning.
On the eve of 2000 Sue Feldman wrote "The Answer Machine" for the
Millennium edition of Searcher Magazine. I asked her how
close we are to the answer machine. "It has all come true," she said,
"but in different combinations." Examples are EasyAsk, which handles
natural language questions in a context; FAST, which now categorizes web search
results; and Google, which has done a lot of work with languages.
In her presentation Feldman lists leading companies with products that are
answer engines, pattern finders, and monitors.
Why Settle for A List, When You Want an Answer?
Natural language processing was the focus of "Why Settle for A List,
When You Want an Answer?" by Elizabeth Liddy, Director of the Center for
Natural Language Processing at Syracuse University. People shouldn't have
to look for "Human Resources" in the hope of finding out if adoptive
fathers qualify for family leave. They should be able to ask the question
and have it understood. Liddy went through the steps of query processing -
identifying the focus of the question, recognizing names and boundaries,
expanding the query, and representing it in tagged format - and showed it
at work for a travel question, a search for statistical information, and
scientific questions of undergrads.
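The four steps can be pictured with a toy sketch. It is not Liddy's system: the cue table, the synonym list and the function name `process_query` are all made up for illustration, and real natural language processing relies on far richer resources than these.

```python
import re

# Hypothetical, hand-made resources; assumptions for illustration only.
QUESTION_FOCUS = {"who": "PERSON", "where": "LOCATION", "when": "DATE",
                  "do": "YES_NO", "does": "YES_NO"}
SYNONYMS = {"fathers": ["dads", "parents"], "leave": ["time off", "absence"]}


def process_query(question: str) -> dict:
    """Return a crude tagged representation of a natural language question."""
    text = question.strip().rstrip("?")
    lowered = text.lower()

    # 1. Focus: guess the kind of answer wanted from the opening word.
    focus = next((label for cue, label in QUESTION_FOCUS.items()
                  if lowered.startswith(cue + " ")), "UNKNOWN")

    # 2. Names and boundaries: runs of capitalized words after the first word.
    _, _, rest = text.partition(" ")
    names = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", rest)

    # 3. Expansion: attach alternative terms from the synonym list.
    terms = re.findall(r"[a-z]+", lowered)
    expansions = {term: SYNONYMS[term] for term in terms if term in SYNONYMS}

    # 4. Tagged format: the pieces gathered into one labelled structure.
    return {"focus": focus, "names": names, "expansions": expansions}


print(process_query("Do adoptive fathers qualify for family leave?"))
print(process_query("Where is Syracuse University?"))
```

Even this crude version shows why the question form carries more than a keyword such as "Human Resources": the focus, the names and the expanded terms each give the retrieval step something extra to work with.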
Tame the Terabyte Terror
Jay Van Eman, CEO of Access Innovations in Seattle, sought to "Tame the Terabyte
Terror" by examining the key terminology in building access:
- categorization: an object can be put in several buckets
- classification: an object can be associated with only one class
- taxonomy: the science of classification
- taxonomy may also be a controlled vocabulary that describes the subject area
and is structured hierarchically
- thesaurus: a controlled vocabulary of terms that provides for
post-coordinate indexing
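The difference between the first two terms shows up clearly in a data structure. The following is a minimal sketch under the definitions above; the `Item` class and its method names are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Item:
    """A document that can sit in many categories but only one class."""
    title: str
    categories: Set[str] = field(default_factory=set)  # categorization: several buckets
    assigned_class: Optional[str] = None               # classification: one class only

    def categorize(self, bucket: str) -> None:
        # Adding another bucket never displaces the earlier ones.
        self.categories.add(bucket)

    def classify(self, class_name: str) -> None:
        # A second, different class is refused rather than silently replaced.
        if self.assigned_class not in (None, class_name):
            raise ValueError(f"{self.title!r} is already classed as {self.assigned_class!r}")
        self.assigned_class = class_name


report = Item("Search Engine Conference report")
report.categorize("taxonomies")
report.categorize("natural language processing")
report.classify("conference reports")
print(sorted(report.categories), report.assigned_class)
```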
People intending to build a thesaurus will be interested in his examples,
the types and names of tools available, and the features to look for. Perhaps
best of all is his list of 10 signs you need a thesaurus.
Flexible Search and Navigation using Faceted Metadata
Metadata was addressed by Professor Marti Hearst of the University of
California, Berkeley, in "Flexible Search and Navigation using Faceted
Metadata." Facets are orthogonal categories - distinct aspects
of the subject. Metadata describes the data elements and can be used to
establish a set vocabulary of terms. People on the Flamenco project team have
been investigating the use of hierarchical metadata to improve navigation and
display of results. The project has centered on helping architects and city
planners find images in the UCB Architecture slide library.
The interface is on the Flamenco project site and can be viewed there. There are nine
hierarchical facets (such as people, location, structure type). Users
have found the interface useful for seeing relationships and felt it helped
them find all that they needed.
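Navigation by hierarchical facets can be sketched in a few lines. This is not the Flamenco code: the records, the facet values and the function names are invented for illustration, with only the facet names "location" and "structure type" taken from the examples above.

```python
from collections import Counter

# Hypothetical image records; each facet value is a path down a hierarchy.
IMAGES = [
    {"id": 1, "location": ("Europe", "Italy"), "structure type": ("religious", "church")},
    {"id": 2, "location": ("Europe", "France"), "structure type": ("religious", "church")},
    {"id": 3, "location": ("Europe", "Italy"), "structure type": ("civic", "town hall")},
    {"id": 4, "location": ("Asia", "Japan"), "structure type": ("religious", "temple")},
]


def refine(items, selection):
    """Keep items whose facet paths start with every selected path."""
    return [item for item in items
            if all(item[facet][:len(path)] == path
                   for facet, path in selection.items())]


def facet_counts(items, facet, depth=1):
    """Count the items under each value of `facet` at the given depth."""
    return Counter(item[facet][:depth] for item in items
                   if len(item[facet]) >= depth)


european = refine(IMAGES, {"location": ("Europe",)})
print([item["id"] for item in european])
print(facet_counts(european, "structure type"))
```

Because the facets are orthogonal, a choice in one (location) leaves the others (structure type) free to narrow the result further, and the counts show the user what remains before the next click.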
To Seek and Not To Find. What is Your Question?
An appreciation of how all these pieces fit - taxonomy, metadata,
linguistic analysis - may be obtained from Laurent Proulx's
presentation. Proulx is Senior Vice-President and Chief Technological Officer
at Nstein, a Montreal-based
company. Nstein develops content management software based on linguistic
artificial intelligence that will categorize, summarize, and tag unstructured
text with XML. We see examples of taxonomies (e.g., MeSH), use of an
organization thesaurus (controlled vocabulary), a categorization process, and the
extraction of named entities. The final slide lays out the requirements for an
integrated framework, relevant indexing, and the capability to extract new types
of information, namely taxonomy, organizational thesaurus,
linguistic-based concept extraction, categorization, and named-entity
extraction.
Conclusion
We'll be seeing more of these in new services on the Web and product
suites for the enterprise. Categorization will become better - more services
will be like Epicurious,
given as an example by Hearst of what to do, and fewer out of control like
Yahoo. Improved categorization will mean better filtering and personalization.
At search engines, searchers won't have to jump through the hoops Amanda Spink
of Penn State documented in her study of Excite users (see sidebar). Better
days are on the way.
This Just In - July 15, 2002
Also of Interest
- User Behaviour
- Inktomi and Google