WSG Newsletter: Taxonomies Rule

Report on Infonortics Search Engine Conference April 2002

Issue: June 1, 2002

Each year for the past seven, Infonortics has organized the Search Engine Conference to bring together developers, users, and observers of the search scene to look at issues and trends in search technology. This year’s meeting was held in San Francisco, April 15-16, 2002. It was titled The Agony and the Ecstasy, possibly referring to the sometimes agonizing problems in information retrieval and the ecstasy of actually finding an answer.

Taxonomies and categorization made up the refrain throughout the two days. As Sue Feldman of IDC pointed out, taxonomies (classification schemas) are needed for browsing: they answer the question “what’s in this database”. Taxonomies are the base for creating retrieval systems that will deliver “information in context”, a point driven home by Clare Hart of Factiva. The ultimate objective is to increase the relevancy of results. Further, users should be able to express questions more naturally, as Elizabeth Liddy of Syracuse University outlined, and search tools should deliver answers, not mere matches on a topic. Natural language processing attempts to understand text in order to extract implicit and explicit meaning. Accomplishing this requires rules and a “taxonomy” of entity types. In linguistic analysis as well, as explained by Laurent Proulx of nStein, a taxonomy is needed as a frame of reference. Taxonomies figure as the central structural support of the search edifice.

Presentations are online at infonortics.com (with the exception of Google’s). The following account summarizes most of those with big-picture views of developments in search technology. Most of the products mentioned by the speakers are intended for use in an enterprise. The company sites will have descriptions and often demos. A few may be seen in action on a public Web site, such as the Flamenco project at University of California, Berkeley, which provides access to the UCB architecture image collection. Examples on the public web are the subject of the Web Search Alert column in the July / August issue of Information Highways.

Future of Search Engines

Clare Hart, of Factiva, talked of the Future of Search Engines in her keynote address. Factiva is the marriage of Reuters and Dow Jones to provide news and business information to companies around the world. They have studied and been part of the rising tide of information for many years.

Some facts:

  • newspapers account for 25 terabytes of information a year, magazines 10, and office documents 195
  • it took 300,000 years to generate the first 12 exabytes of information. Another 12 exabytes will be generated in the next 2.5 years.
  • 610 billion emails are sent in a year, representing 11 terabytes

To convert this to something we know, a public library of 300,000 books might represent 1 terabyte. An exabyte is 1,000,000 terabytes. [Sources: www.archive.org/xterabytes.html and www.webstreetstudios.com/school/bitsbytes.htm]
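
The conversion above can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted in this section (roughly 1 terabyte per 300,000-book library, and 1,000,000 terabytes per exabyte); the constant and function names are illustrative and not from the presentation.

```python
# Figures quoted in the text; the names are illustrative assumptions.
TERABYTES_PER_EXABYTE = 1_000_000
TERABYTES_PER_LIBRARY = 1  # a public library of 300,000 books is roughly 1 terabyte


def exabytes_to_libraries(exabytes):
    """Express a volume in exabytes as a number of 300,000-book libraries."""
    terabytes = exabytes * TERABYTES_PER_EXABYTE
    return terabytes / TERABYTES_PER_LIBRARY


libraries = exabytes_to_libraries(12)
print(f"12 exabytes is roughly {libraries:,.0f} public libraries")
```

On these figures, the 12 exabytes Hart mentioned would correspond to about twelve million such libraries.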

Hart noted that since 1995 searching has become more sophisticated with indexed databases, indexing, filters to deliver alerts, dynamic metadata, and natural language processing. Attention post 2002 will be on knowledge management and content management, with taxonomies providing the means for categorization and organization of information, and the discovery of information in context providing the relevance.

The need is great. Clare Hart observed that users are search illiterate, spending a lot of time searching poor and inadequate sources. They may soon be overwhelmed by the growth in the number of formats and, as mentioned, in volume.

Getting to information in context – viewing only what is needed for a particular task – is the ultimate goal. It is felt to be achievable through an understanding of work flows and by normalizing data so that it can be commonly interpreted. One example shows a Sales Marketing department able to see its functions interconnected with information and events in other areas because there is a taxonomy that captures the elements of the business. A salesperson preparing for a customer call would be able to easily bring in relevant material from legal, finance, marketing research, and other functions to create a customer profile and competitive assessment. The result would be “sales information in context”. The aim is to “surface relevant information without a search box”. There are some slides online of “PR Information in Context” to illustrate.

What Content and Knowledge Management Require of Search Engines

Jim Bair of Strategy Partners also sees a future of information in context. He predicts search technology will evolve into knowledge technology where, instead of today’s bag of words and tricks (patterns), we’ll have “retrieval in context”, “ontologies, domain specific semantic networks, lexicons and leprecons” (agents). His vision of knowledge technology is a contextual one where connected people use a Process and Knowledge Base to produce decisions. Imagine an organization where all documents – email, office documents, transactions – are processed by an Electronic Document Manager to be tagged and stored for extraction according to personal profiles.

Bair has identified eleven groups in the Knowledge Technology market by which he has classified the products of over 90 vendors. These are: [See the Web version of the presentation for the full list of examples].

  1. Portals (Autonomy …)
  2. Hybrids (OpenText …)
  3. Search Plus (Albert, iPhrase, Google …)
  4. Ontological Management (Applied Semantics, Ask Jeeves, Semio …)
  5. Knowledge Synthesis (ClearForest, nStein)
  6. Knowledge Aggregators (The Brain, ThoughtShare, Enfish, …)
  7. Visualizers (WebMap, Antarcti.ca …)
  8. Collaboration (Groove Networks)
  9. Knowledge Mining (ClearForest …)
  10. Information Architecture (Adobe, Interwoven …)
  11. Information Aggregators (Factiva, Dialog, Northern Light …).

These are mapped in a schematic to Knowledge Management in the present and over the next three years, Relationship Management (when experts will be in greater demand) and Meaning Management up to 2005. The use of Metadata and the application of Ontologies and Domain Specific Lexicons will dominate 2003 and into 2004. By 2005 we may reach an age of “meaning management” and “sensemaking”. The vendors are plotted on a cluster diagram over the timespan with the visualizers and knowledge aggregators coming into their own as sense-making tools in 2005 when these tools will deal with the “aboutness” of the documents and not merely the individual words.

Find What I Mean, Not What I Say

Susan Feldman is Research Vice-President of Content Management and Retrieval Software at IDC and has been a frequent writer of articles about information retrieval technologies. She asks search tools to “Find What I Mean, Not What I Say”. “Finding information is a dialog”. People when seeking information will ask questions, interpret answers according to the context, process and categorize the information and from this extract an answer. Search systems are getting better at this too. Feldman sees an emerging retrieval model of linguistic tools, categorization, taxonomies, rule bases, and machine learning – all of these being added to standard techniques.

Long a proponent of natural language processing, Feldman reviewed the levels of NLP: the morphology of the word, its syntactic role in the sentence, its semantic meaning, the text structure for most important sections and heuristics for picking up most important points. The presentation includes an example of a sentence tagged for parts of speech, bracketing and categorization. But in addition, new tools are using linguistic analysis to expand the terms, and categorization to analyze the content and alert a user to a topic or deliver focused advertisements. Extraction can pick up concepts, relationships, and even mood and intention. Attensity works with unstructured text, and MetaMarker can detect mood in text.

She sees new retrieval applications becoming more conversational, capable of understanding the query, asking questions or detecting mood, and coming up with an answer. Linguistic tools to analyze, extract, summarize, and categorize are part of the whole.

In the short term we will see “better and expanded search tools” and in the long, “unified search” and “machine learning”.

On the eve of 2000 Sue Feldman wrote The Answer Machine for the Millennium edition of Searcher Magazine. I asked her how close we are to the answer machine. “It has all come true”, she said, “but in different combinations”. Examples are EasyAsk, which handles natural language questions in a context; FAST, which now categorizes web search results; and Google, which has done a lot of work with languages.

In her presentation Feldman lists leading companies with products that are answer engines, pattern finders, and monitors.

Why Settle for A List, When You Want an Answer?

Natural language processing was the focus of "Why Settle for A List, When You Want an Answer?" by Elizabeth Liddy, Director of the Center for Natural Language Processing at Syracuse University. People shouldn’t have to look for Human Resources in the hope of finding out if “adoptive fathers qualify for family leave”. They should be able to ask the question and have it understood. Liddy went through the steps of query processing – identifying the focus of the question, recognizing names and boundaries, expanding the query, and representing it in tagged format – and showed it at work for a travel question, a search for statistical information, and scientific questions of undergrads.

Tame the Terabyte Terror

Jay Van Eman, CEO of Access Innovations in Seattle, sought to Tame the Terabyte Terror by examining the key terminology in building access:

  • categorization – an object can be put in several buckets
  • classification – an object can be associated with only one class
  • taxonomy is the science of classification
  • taxonomy may be a controlled vocabulary that describes the subject area and be structured hierarchically
  • thesaurus is a controlled vocabulary of terms and provides for post-coordinate indexing
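
The distinction between the first two terms can be sketched as two small data structures. This is a minimal illustration of the definitions as reported above, not anything shown in the presentation; the document name and the class and category labels are invented, and replacing the old class on reassignment is only one possible way to enforce "only one class".

```python
# Hypothetical sketch: object names and labels are invented for illustration.
classification = {}  # object -> exactly one class
categorization = {}  # object -> set of categories ("buckets")


def classify(obj, cls):
    """Classification: one class per object, so a new class replaces the old one."""
    classification[obj] = cls


def categorize(obj, category):
    """Categorization: an object may sit in several buckets at once."""
    categorization.setdefault(obj, set()).add(category)


classify("contract-review.doc", "Finance")
classify("contract-review.doc", "Legal")    # overwrites "Finance"
categorize("contract-review.doc", "Finance")
categorize("contract-review.doc", "Legal")  # added alongside "Finance"

print(classification["contract-review.doc"])
print(sorted(categorization["contract-review.doc"]))
```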

People intending to build a thesaurus will be interested in his examples, the types and names of tools available, and the features to look for. Perhaps best of all is his list of “10 signs you need a thesaurus”.

Flexible Search and Navigation using Faceted Metadata

Metadata was addressed by Professor Marti Hearst of University of California, Berkeley in “Flexible Search and Navigation using Faceted Metadata”. Facets are “orthogonal categories”, that is, distinct aspects of the subject. Metadata describes the data elements and can be used to establish a set vocabulary of terms. People on the Flamenco project team have been investigating the use of hierarchical metadata to improve navigation and display of results. The project has centered on helping architects and city planners find images in the UCB Architecture slide library, and the Flamenco project interface can be viewed online. There are nine hierarchical facets (such as people, location, structure type …). Users have found the interface useful for seeing relationships and felt it helped them find all that they needed.
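
The idea of navigating by facets can be sketched in a few lines. This is a simplified, flat illustration and not the Flamenco implementation: the facet names follow the examples given above, while the records, their values, and the function names are invented, and the hierarchy within each facet is left out.

```python
from collections import Counter

# Hypothetical records; facet values are invented for illustration.
IMAGES = [
    {"id": 1, "people": "Architect A", "location": "Italy", "structure_type": "church"},
    {"id": 2, "people": "Architect A", "location": "France", "structure_type": "bridge"},
    {"id": 3, "people": "Architect B", "location": "Italy", "structure_type": "bridge"},
    {"id": 4, "people": "Architect B", "location": "Italy", "structure_type": "church"},
]


def filter_by_facets(items, **selected):
    """Keep items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count the values of one facet, e.g. to label navigation links."""
    return Counter(item[facet] for item in items if facet in item)


italian = filter_by_facets(IMAGES, location="Italy")
print([img["id"] for img in italian])
print(facet_counts(italian, "structure_type"))
```

Because the facets are independent of one another, a user can narrow by any of them in any order, and the counts show what remains under each choice.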

To Seek and Not To Find. What is Your Question?

An appreciation of how all these pieces fit – taxonomy, metadata, linguistic analysis – may be obtained from Laurent Proulx’s presentation. Proulx is Senior Vice-President and Chief Technological Officer at Nstein, a Montreal-based company. Nstein develops content management software based on linguistic artificial intelligence that will categorize, summarize, and tag unstructured text with XML. We see examples of taxonomies (e.g. MeSH), use of an organization thesaurus (controlled vocabulary), a categorization process, and the extraction of named entities. The final slide lays out the requirements for an integrated framework, relevant indexing, and the capability to extract new types of information – namely taxonomy, organizational thesaurus, linguistic-based concept extraction, categorization, and named-entity extraction.

Conclusion

We’ll be seeing more of these in new services on the Web and product suites for the enterprise. Categorization will become better: more services will be like Epicurious, given as an example by Hearst of what to do, and fewer out of control like Yahoo. Improved categorization will mean better filtering and personalization. At search engines, searchers won't have to go through the hoops Amanda Spink at Penn State documented in her study of Excite users (see sidebar). Better days are on the way.

This Just In - July 15 2002

Donald T. Hawkins reported on this conference - 2002 Search Engine Meeting - in the June 2002 issue of Information Today. The article nicely condenses the proceedings and explains some complex topics.

Also of Interest

User Behaviour

Amanda Spink of The Pennsylvania State University studies Information Behaviour in searching. Her latest study had a Penn State press release titled New web searching trends: Sex is out, e-commerce is in (April 1, 2002). More about the study may be found at Spink's faculty page.

In watching users at Excite.com from 1997 to 2001, the researchers found that users do “successive searching”, that is, they use more than one tool, and that they multitask, searching on more than one topic concurrently. Successive searches “refine and enhance results”. Users in their second search use more commands but may reduce use of Boolean.

I think this is very much in line with what Internet trainers teach.

Spink asked the vendors to give users better tools to do what they want to do.

Inktomi and Google

Andrew Littlefield showed some user interfaces being considered by Inktomi for intranet use. The variety and richness of content types is one challenge. Visual cues – mouse-over previews for a document – might help in quickly assessing relevance.

Matt Cutts, a software engineer, enumerated the qualities that have made Google so popular – depth, freshness, relevancy, search features, and, not to be overlooked, humour. As well, there are the specialized applications to search news, mail order catalogs, and images. For the future, Google intends to continue to focus on the users – keeping the interface clean, helping with misspellings, and returning search results quickly. Google has been broadening the definition of search to include multiple file types, dynamic pages, and other content on the Internet. They also intend to extend their reach to mobile access. Under consideration are categorization (although multiple languages may make it difficult to do topic analysis well), personalization, natural language, and premium content. Much depends on the market.


Newsletter by Gwen Harris


Copyright Gwen Harris
A service to subscribers of WebSearchGuide (http://www.websearchguide.ca)

