What Is Search Engine Ranking Authority?, Tom Schmitz, Search Engine Land (May 10)
Relevance is still the retrieval of pages according to search terms; authority is how good they are. The question is how do search engines determine authority? This article has a few clues.
+ Google does something with brands. "Just as more links convey more trust, so do more and more brand mentions."
+ Social media probably has some influence. "Search engines can crawl these looking for links and mentions."
+ Site quality means "performance, user experience, and trust signals."
Google’s Cutts Explains How Google Search Works Barry Schwartz, Search Engine Land (Apr 23)
Matt Cutts explains in this video of 8 minutes how Google crawls, uses page rank, sets priorities, indexes, and filters. Even explains the dance.
Google Becomes Answer Engine With Semantic Technology − Great News For Retailers, Paul Bruemmer, Search Engine Land (Apr 12)
Google is going to do what Ask.com was very good at - match search queries against a database for the answers.
"Search Engine Land author and Ontologica semantic services provider Barbara Starr said, “It’s inevitable that lots of verified structured data will give rise to the ability of search engines to become answer engines.” And that’s happening now. Google wants to better match search queries with a database containing hundreds of millions of entities on people, places and things that the company has been collecting over the last two years, while focusing more on structured data."
Google will add (more) semantic technology to web search to interpret the keywords, rather than depend on straight match on keywords.
Retailers will love the use of structured data to present information on products -- "GoodRelations RDFa is a semantic markup technology designed specifically for ecommerce. It allows retailers to send precise data on their products, items or services when communicating it to search engines"
Google lifts the kimono on what goes on behind search, releases uncut video of meeting, Drew Olanoff, The Next Web (Mar 12)
This behind-the-scenes video of a Google search quality meeting is fascinating. The discussion was about spelling corrections in very long queries. These account for only 0.1% of queries, but everyone present considered the implications for the searcher and the service. We get to see Senior VP Amit Singhal chair the meeting - quality was uppermost.
Note also the Apple laptops, and the crowded room.
The video uses YouTube’s annotation to identify the speaker and provide reader board text. Worth the 8 minutes.
12 Google Link Analysis Methods That Might Have Changed, Bill Slawski, SEO by the Sea (Mar 1)
Expert Bill Slawski tackles the question - what aspects of link analysis might Google be dropping or diminishing? He gives us a list of 12 - but he doesn't know either. Nonetheless, reading his list is a lesson in what link analysis does.
This one was interesting -- "Google has probably clustered similar web pages by looking at other pages that link to pages appearing in search results, and seeing what other pages they link to.
Google might have replaced this clustering approach with one that focuses instead more upon the content and/or the concepts contained on those pages."
How Google Evaluates Links by David Naylor (Feb 28)
Google announced that it would be "turning off a method of link analysis that we used for several years." David Naylor speculates on what that method might be. The most likely candidate is anchor text, since it has been the mechanism behind Google bombing - but maybe Google intends to cut back much more on link analysis.
Interesting piece - essentially a primer on the factors affecting link analysis.
Google Knowledge Graph Could Change Search Forever, Lance Ulanoff, Mashable (Feb 14)
Search will not always be about matching on words and statistical relationships between words. It could become about entities and concepts - ideas that those in the semantic technology field (AI, natural language) have been talking about for years.
Amit Singhal, SVP at Google, describes where Google has been and where it is headed.
"Google, Amit said, was the first to use links as “recommendation surrogates.” In those early days, Google based its results on content links and the authority of those links. Over time, Google added a host of signals about content, keywords and you to build an even better query result."
....
"Google now wants to transform words that appear on a page into entities that mean something and have related attributes. It’s what the human brain does naturally, but for computers, it’s known as Artificial Intelligence.
It’s a challenging task, but the work has already begun. Google is “building a huge, in-house understanding of what an entity is and a repository of what entities are in the world and what should you know about those entities,” said Singhal."
Manu Sporny, chair of the RDF Web Applications Working Group at the World Wide Web Consortium, shows that webmasters could take advantage of what Google is doing today to prepare for this future. Google is identifying RDFa and schema.org markup on pages and enhancing search results that include those pages. The markup is used to add structural information. See articles for examples and screenshots.
Bing and the Plus Sign, Stephen Arnold, Beyond Search (Nov 16)
Some people still like to do "boolean" search - use AND, NOT, OR - construct their sets and methodically do research.
Google annoyed those users when it dropped the +. But Bing still has it - as well as AND.
But that + is also being used to stop Bing from delivering semantic expansions. This is not new - but people are gradually discovering it - as did Glen Cathey in the article that Arnold cites - Bing’s Semantic Search, Phonetics and Undocumented Operator. The article has many good screenshots of where semantic expansion distorted results and misled.
At Google you have to use " " marks to look for the exact word.
Gartner's Top 10 tech trends for 2012, ITWorld Canada (Oct 20)
The "contextual user experience" is what search engines are aiming at too.
"Social and contextual user experience: According to Gartner, context-aware computing uses information about an end user's or object's environment, activities connections and preferences to improve the quality of interaction with that end user or object. A contextually aware system anticipates the user's needs and proactively serves up the most appropriate and customized content, product or service. The tipping point here could be technology such as near-field communications getting into more and more devices. Some interesting facts here: By 2015, 40% of the world's smartphone users will opt in to context service providers that track their activities with Google, Microsoft, Nokia and Apple continuously tracking daily journeys and digital habits for 10% of the world population by 2015, Cearley says."
Employing Microformats & Structured Data For Enhanced Search Engine Visibility, Aaron Bradley, Search Engine Land (Sep 29)
The rich snippets we see in Google and Bing for recipes, reviews, and online products are the result of web designers coding in attribute-based markup formats. It's good for visual appeal and it might improve visibility.
"While employing structured data by no means guarantees superior rankings, the provision of metadata can potentially provide the search engines with a better understanding of what any given Web resource is about."
Adapting Search to You, Bing Community (Sep 14)
Bing tries harder to understand your search intent and the meaning of documents using semantic technologies. Mind, Bing learns from you - so you need to be a frequent user for this to work for you.
"As an example, let’s say you’re in the process of planning a vacation – you might decide to search for “Australia”. In this case, you’re most likely to be looking for websites specifically about the country Australia, or information about travel. " -
Then you switch.
"Now suppose, instead, you’re a movie-buff and are trying to decide on a movie to rent for the evening. With this context, the smart technology powering this feature will infer that you’re probably looking for the movie “Australia”, and begin to adapt the search page to your intent by showings results relevant to the movie Australia higher up on the page than they were previously:"
It's called Adaptive Search - and there's a video.
The Language Problem: Jaguars & The Turing Test, by Gord Hotchkiss, Search Engine Land (Sep 9)
When someone says "I love jaguars" - how do we know what that person means - and how would a search engine? This article is an excellent exploration of the linguistic analysis we do intuitively.
"The first is studying language structure, or grammar. We look at how words are formed (morphology), how phrases and sentences are structured (syntax) and determining the meaning of the word based on how it sounds (phonology). If there is no ambiguity with the words involved, that should be sufficient to interpret the meaning. "
But a jaguar could mean many things - the cat, car, sports team - and much else.
"Here, we look at the context of words. First, we would look at semantics. Are there inherent clues in how the sentence itself is structures that would help us resolve ambiguity?"
Gord Hotchkiss promises us a fascinating series of articles in natural language processing.
"Regardless of the date, understanding natural language represents one of the most significant challenges facing artificial intelligence. In the next few articles, I’ll be looking at some of the people tackling the challenge and look at how Google and other search engines currently process language."
How Google Might Introduce Job, Recipe, and Other Search Modes into Web Search Results, SEO by the Sea (Sept 6)
This is very promising - Google appears to be working on discerning the type of search desired and responding to that need.
Quoted from the patent - "Methods, systems, and computer program products feature determining a plurality of search result items responsive to a search query. A plurality of search modes are identified based on the query or the plurality of search result items or both. Each search mode is associated with a respective collection of records."
It’s time for a revolution in Web search, UW professor says, Nick Eaton, Seattle PI (Aug 3)
We keep waiting for search engines to be able to answer questions. It's still the promise.
Microsoft's head of search, Qi Lu, said - “Search is still essentially a website finder.” “It’s all nouns. But the future of search is verbs — computationally discerning user intent to give them the knowledge to complete tasks.”
Science fair gold medalist, 17, invents better way to search Internet, Emily Jackson, Globe and Mail (Aug 3)
Google will likely scoop this young man up.
"Seventeen-year-old Nicholas Schiefer has found a better way to search small documents, such as tweets and Facebook statuses – all for his Grade 11 science fair project. "
Google Now Supports Authorship Markup, Angela Guess, Semantic Web (Jun 8)
Metadata comes to the web. Google is supporting authorship markup that enables connections from an authored article to information about the author.
Google said -- "If you’re already doing structured data markup using microdata from schema.org, we’ll interpret that authorship information as well. We wanted to make sure the markup was as easy to implement as possible. To that end, we’ve already worked with several sites to markup their pages, including The New York Times, The Washington Post, CNET, Entertainment Weekly, The New Yorker and others."
Today author, tomorrow dates? Dates are still deplorable on the web.
Google, Yahoo! and Bing Announce Schema.org, Eric Franzon, Semanticweb.com (June 2)
Google, Bing, and Yahoo have each announced their support for standardizing on the schemas defined at schema.org for page markup.
From schema.org -- "A shared markup vocabulary makes easier for webmasters to decide on a markup schema and get the maximum benefit for their efforts. So, in the spirit of sitemaps.org, Bing, Google and Yahoo! have come together to provide a shared collection of schemas that webmasters can use. "
Introduction to Google PageRank: Myths & Facts, Rob Chant, Search Engine Watch (Apr 15)
One presumes that PageRank - Google's method of assessing the importance of a page - still pertains - but how much has it been altered from the original, and how much is it used today?
"However, it is also clear that PageRank has evolved a lot over the last 10 years. The basic structure is probably pretty much the same as it has ever been, but it has become much more sophisticated in many respects (for example, introducing an additional factor for detecting and valuing links based on where they are on a page, as well as on the strength of the page)."
Let's hope this works. Google is assessing quality of the site in the rankings and has advised webmasters on how to improve that quality.
Google Panda Update Tip: Remove Low-Quality Content, Danny Goodwin, Search Engine Watch (Mar 10)
Quoting Google's Michael Wyszomierski. "Our recent update is designed to reduce rankings for low-quality sites, so the key thing for webmasters to do is make sure their sites are the highest quality possible. We looked at a variety of signals to detect low quality sites. Bear in mind that people searching on Google typically don't want to see shallow or poorly written content, content that's copied from other websites, or information that are just not that useful. In addition, it's important for webmasters to know that low quality content on part of a site can impact a site's ranking as a whole. "
Using Influence To Tune Signal To Noise On The Social Web, Rishab Ghosh, Search Engine Land (March 9)
In the real time world of the Internet influence matters. This article defines influence, describes how assessing "influence" and using it will remove noise from search results.
"We define influence as the likelihood that, each time you say something, people will pay attention. "
"Applying this definition to Twitter, influence does not reward the person with the most followers, or the person who tweets the most.
Rather, influence should measure attention (such as retweets and replies) and be computed based on the degree of such attention given to each individual tweet, quantified to the keyword & domain level."
It makes the concluding point that - "Social networks provide the new search signals on the web. Whether you are a consumer, marketer or publisher, you should be using these signals to your benefit."
Consumers need to learn how to search, publishers need to keep social content fresh and relevant, and marketers need to use content to "educate" consumers.
My conclusion: Likes and streams may be edging out "links" as the main ranking factor. This is surely going to complicate SEO.
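The definition of influence quoted above (attention earned per tweet, not follower count or volume) can be sketched roughly as below. This is only an illustration under stated assumptions: the function name, the dictionary keys, and the simple "retweets plus replies" sum are hypothetical, and the keyword and domain weighting the article mentions is not modelled.

```python
def influence(tweets):
    """Average attention per tweet.

    tweets: list of dicts with hypothetical 'retweets' and 'replies' counts.
    Follower count and tweet volume are deliberately ignored.
    """
    if not tweets:
        return 0.0
    attention = sum(t["retweets"] + t["replies"] for t in tweets)
    return attention / len(tweets)


# A prolific account that nobody reacts to...
noisy = [{"retweets": 0, "replies": 0} for _ in range(100)]
# ...versus an account that tweets rarely but gets a response.
quiet = [{"retweets": 8, "replies": 2}]

noisy_score = influence(noisy)
quiet_score = influence(quiet)
```

Under this toy measure the quiet account scores higher than the noisy one, which is the point the quote is making.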
Is 2011 The Year for Semantic Technology?, Luca Scagliarini, Information Management (Feb 24)
The huge volume of content arising from increasing use of social media tools to broadcast (or share) news and opinion is - boggling and overwhelming. Can one even begin to search it? Companies and organizations want to though - in all that noise customers and competitors are saying things they need to know.
Semantic technologies for analyzing the content are likely the solution (at least for a while).
"At its most basic level, semantic technology is able to understand the meanings of words expressed in their proper context no matter the number (singular or plural), gender, verb tense or mode (indicative or imperative). But this is just the starting point. Semantic technology incorporates morphological, logical, grammatical and natural language analysis that translates into higher precision and recall when searching for information, delivering all of the most important and accurate data to the user.
Semantic technology allows enterprises to monitor and assess information contained in Web-based conversations and unstructured information so that companies can get the lay of the land when it comes to overall sentiment. And while the enterprise is gradually adopting semantic technologies, the consumer realm has a mind of its own."
Expect to see use of semantic technology at Facebook - "Facebook is a kind of parallel Web universe, with its own content, search functions, applications and games. It has content of every type and of every quality, so it’s necessary to render that content more useful in an effective and realistic way."
Search Engine Optimization: Anchor Text Flows Just Like PageRank, Bill Hartzer, Search Engine Marketing (Feb 21)
Describes some of the workings of the Google algorithms to rank results based on anchor text and page rank.
[Anchor text is the visible, clickable wording of a link - the words that describe the destination. SEO people work very hard to build up relationships so that the anchor text is positive and carries the desired keywords.]
Excerpts
"PageRank traditionally flows from one web page to another. For example, if one web page has a PageRank of 5 and it links to another web page, PageRank should (under normal circumstances) flow to the page it links to: and that page would normally be a PageRank of 4. There are a lot of “factors” that may or may not cause PageRank to flow from one page to another, but you get the idea.
But now, I have been able to prove (at least to myself) that anchor text now flows from one web page to another, just like PageRank. "
And further on - "Anchor text is flowing from one site to another–and then to another. Google appears counting the anchor text of links not directly to your site, but they’re counting the anchor text of links that link to pages that then link to your site. This might be one way that they’ve been able to cut down on “google bombing”, by allowing anchor text to flow, just like PageRank."
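To make the "flow" idea concrete, here is a minimal sketch of the PageRank iteration as originally published. It is not Google's production algorithm, and it does not model the anchor text flow Hartzer describes. The scores are probabilities that sum to 1, not the coarse 0-10 toolbar numbers in the excerpt. The toy link graph, the damping factor of 0.85, and the function name are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B, B to C, C to both A and B.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
ranks = pagerank(toy_web)
```

Page B ends up with the highest score here because it receives rank from both A and C, while A only gets half of what C passes on.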
A Computer Called Watson, IBM 100 Icons of Progress (Feb 14)
The week was abuzz with news that Watson, the IBM "Deep QA" computer, bested humans at Jeopardy.
Watson doesn't think. Instead it finds the answer. "Watson does a remarkable job of understanding a tricky question and finding the best answer." ... "“The goal is to build a computer that can be more effective in understanding and interacting in natural language, but not necessarily the same way humans do it.”"
The article invites you to take a trivia challenge against Watson.
Computer crushes competition on Jeopardy! , AP (Feb 16) - has an account of Game 1 of the Man vs. Machine competition on “Jeopardy!”
But Watson blew one question under US cities - picking Toronto (in Canada) as the answer.
The main point is that this has huge significance for search. Do easy digital answers put us in jeopardy? , Ivor Tossell (Feb 16)
If you haven't had enough of the Bing - Google controversy, there is this 40 minute segment at Big Think on the Bing and Google Face Off in which Matt Cutts of Google, Dr Harry Shum of Microsoft, and Rich Skrenta of Blekko talk. Some of it centres on the data collected by Internet Explorer's suggested sites and the Bing toolbar and also the Google Toolbar. (Feb 2)
There are several other short tantalizing clips about search - going beyond the search box with more AI - Peter Diamandis; promise of search technology - in which Malcolm Gladwell finds it overrated; and some others.
How a Search Engine Might Fight Googlebombing, Bill Slawski, SEO by the Sea (Jan 11)
If enough pages use the same anchor text to link to a page and there is a query using those keywords, the linked-to page will show well in search results. This is the essence of Googlebombing. Google has been trying to defuse that ever since the famous bomb in which "miserable failure" brought up George Bush.
Bill Slawski is on the patent trail of what Yahoo, and possibly Google, are doing about it.
Google did change its algorithm to do more "phrase based indexing". But Yahoo may be onto something else in using sentiment analysis.
The subject matter disclosed herein relates to mitigation of search engine hijacking. In one example implementation, a sentiment value associated with anchortext in a search engine result may be determined.Similarly, a sentiment value of one or more web pages referenced by the anchor text may also be determined. A divergence between sentiment values associated with the anchortext and a web page may then determined.
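The patent excerpt does not say how the sentiment values are computed, so the following is only a rough sketch of the idea: score the anchor text, score the linked page, and flag the link when the two diverge sharply. The word lists, the threshold, and all function names are hypothetical.

```python
# Tiny hypothetical sentiment lexicon; a real system would use far more.
POSITIVE = {"great", "excellent", "successful", "good"}
NEGATIVE = {"miserable", "failure", "terrible", "bad"}


def sentiment(text):
    """Return a score from -1.0 (negative) to 1.0 (positive); 0.0 if neutral."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    scored = [1 if w in POSITIVE else -1
              for w in words if w in POSITIVE or w in NEGATIVE]
    return sum(scored) / len(scored) if scored else 0.0


def divergence(anchor_text, page_text):
    """Absolute gap between anchor text sentiment and page sentiment."""
    return abs(sentiment(anchor_text) - sentiment(page_text))


def looks_like_bomb(anchor_text, page_text, threshold=1.0):
    """Flag a link whose anchor text sentiment is far from the page's own."""
    return divergence(anchor_text, page_text) >= threshold


suspicious = looks_like_bomb("miserable failure",
                             "A great and successful term in office.")
ordinary = looks_like_bomb("good biography", "An excellent biography.")
```

In this toy case the hostile anchor text pointing at a positive page is flagged, and the link whose wording matches the tone of the page is not.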
Google’s decreasingly useful, spam-filled web search, Marco.org (Jan 5)
Is Google losing the war with spam? A few think so - and I'm suspicious as well. Many recent searches on products / service queries have brought up aggregators and spam home.
From the article: "But recently, spam has taken over the “guide” query results, and even many “reference” queries. It wouldn’t surprise me if spam even started defeating the “address bar” queries — Google’s ranking algorithms recently have had a lot of trouble detecting the canonical source of duplicated content."
But - is this the solution?
"One solution may be for Google to radically change their algorithms and policies for web search to de-emphasize phrase-matching and more strongly prioritize inbound links and credibility. And, in what’s probably a huge departure for them, have human employees use their opinions of site quality to manually adjust the relevance of domains."
How Google May Use Categories as a Search Ranking Factor, Bill Slawski, SEO by the Sea (Oct 12)
The days of search engines doing exact word match are long past. The search technologists have been working on meaning for a long time. A recently approved Google patent shows that it has thought about employing a kind of categorization on the query and on the results to find the right ballpark.
Essentially - "Pages and query terms may be associated with specific categories based upon some relative strength of correlation to those categories. The strength of those association with categories might vary from one page to another, or from one query to another. "
It is not known if Google is doing this today. The posting does tell us some of the techniques search engines do employ.
+ "return transactional sites in search results when someone’s query term might be written like “buy xxxxx”. "
+ "returning informational pages when a query begins with a phrase such as “How to”. "
+ recognize navigational searches to a site
+ recognize "situational intent" - query for pizza probably means you want to order one.
+ including synonyms (which Google does do) to not be so literal.
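The techniques listed above can be sketched as a few simple rules. This is a toy illustration, not what the patent or any search engine actually does; the lookup tables, category labels, and function name are all hypothetical.

```python
# Hypothetical lookup tables; a real engine would learn these from data.
KNOWN_SITES = {"facebook", "youtube", "seo by the sea"}
SITUATIONAL_TERMS = {"pizza"}


def classify_query(query):
    """Guess the intent category of a query from simple surface patterns."""
    q = " ".join(query.lower().split())  # normalise case and spacing
    if q.startswith("buy "):
        return "transactional"
    if q.startswith("how to "):
        return "informational"
    if q in KNOWN_SITES:
        return "navigational"
    if q in SITUATIONAL_TERMS:
        return "situational"
    return "unclassified"


examples = {q: classify_query(q)
            for q in ["buy running shoes", "How to tie a tie", "YouTube", "pizza"]}
```

The patent language talks about a strength of correlation to each category, so a fuller version would return scores per category and not a single label.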
Does Bing Time Shift Search Results?, Bill Slawski, SEO by the Sea (Oct 8)
Search engines are better at picking up the current qualities of a query. A search on the olympics will pick up the most recent games - and the upcoming ones.
For example -- "Search for “Independence Day,” around July 4th and chances are you want to learn about the holiday in the United States. Do the same search around August 15th, and you may be more likely to want to learn about the holiday in India. The same search back on July 3, 1996 might have been about the Will Smith movie of the same name."
Bill Slawski analyzes a recent patent by Microsoft on how time-sensitive queries can be identified.
He concludes - "It’s not unusual to see news items for some queries in all of the major search results these days. But, the idea of changing suggested query refinements, and shifting web pages in search results in response to timely and topical events is something that may not be as noticeable as those news results. "
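The "Independence Day" example above can be sketched as a date-window lookup. This is only an illustration of the idea of time-sensitive queries, not the method in the Microsoft patent; the event table, the seven-day window, and the function name are assumptions, and one-off events such as the 1996 movie release are not modelled.

```python
from datetime import date

# Hypothetical table of recurring events: query -> [((month, day), meaning)].
EVENTS = {
    "independence day": [
        ((7, 4), "holiday in the United States"),
        ((8, 15), "holiday in India"),
    ],
}


def likely_meaning(query, when, window_days=7):
    """Return the meaning whose yearly date falls within the window, else None."""
    for (month, day), meaning in EVENTS.get(query.lower(), []):
        event = date(when.year, month, day)
        if abs((when - event).days) <= window_days:
            return meaning
    return None


early_july = likely_meaning("Independence Day", date(2010, 7, 2))
mid_august = likely_meaning("Independence Day", date(2010, 8, 14))
december = likely_meaning("Independence Day", date(2010, 12, 1))
```

Outside either window the function returns None, which would leave the ordinary, non-time-shifted ranking in place.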
Google, Facebook Tout Serendipity Engines, Personalized Media, Clint Boulton, eWeek (Sep 30)
This idea of personalized search results predates Telidon in the late 1970s. Here it is being predicted again.
"Facebook Chief Operating Officer Sheryl Sandberg said the future of media will be personalized."
Google CEO Eric Schmidt introduced the idea of "autonomous search as a serendipity engine where information comes to users instead of tracking them down".
"Schmidt's and Sandberg's points are that computers and the software that empower them are getting smart enough to tailor content for users based on a number of signals, including tastes for food, entertainment activities and other personal preferences. "
Having my computer or handheld knowing that I like pizza and sending me alerts about specials at the local Pizza Pizza doesn't shake my world.
Search As Conversation: Surf Canyon’s Mark Cramer, Gord Hotchkiss, Search Engine Land (Oct 1)
Mark Cramer, the creator of the search plug-in, Surf Canyon, talks about how search is a conversation. Peter Norvig, of Google, has adopted this view as well.
Until recently, search has been a display of some number of results which we review and select from - they don't change (and we'd prefer that they didn't). But with interfaces like Google Instant, results change as we act.
In fact, Surf Canyon introduced this - that search results changed depending on what you clicked on - Surf Canyon learned from you.
But there may still be a lot of results. Some recommend using the right vertical for the topic. Cramer suggests this is too much cognitive load for users and that the general search engine will get better at recommending the vertical.
"So if you go to Bing, for example, and you type in “Continental SFO LaGuardia,” you can actually get the flight information, which is very vertical in nature, on the top of the search page. There are ways to navigate to the vertical search site, which is the travel site, and then go ahead and actually book that flight or to get arrival information or whatever you’re looking for. My guess is there’s going to be increasing integration of that into the horizontal search. "
Hotchkiss asked about a meta-search app - to be the middleware while also being our agents.
Maybe the direction is search as conversation - or maybe more QnA. But this assumes a certain kind of information request - and not necessarily "research".
Are all Results on Search Engines Equal? A Surprising Journey Within the SERPs, Dominik Johnson, Search Engine Watch (Aug 31)
Search scene now includes web (Google, Bing, Yahoo), real time search (Twitter, OneRiot), large foreign language search (Baidu in China) - to name the big areas. This study examined the effectiveness of the search interface and search results.
Questions were from the point of view of online marketing.
* Are all results equal?
* Most engines have the same basic layout; however does a pixel change here and there make a difference?
* Are universal search type results on Bing and other engines as attractive as Google?
* Are trending topics a natural way to navigate the new breed of search engines?
* Do Baidu and Yandex have a design edge?
* Are moving/scrolling elements on the page effective?
It's unclear if they were successful at answering all of these - and I'm not sure of the value of evaluating placement of search engine logo, though placement of the AdBox is clear given the interest in advertising.
Not Brands but Entities: The Influence of Named Entities on Google and Yahoo Search Results, SEO by the Sea (Aug 19)
Google may be identifying entities (person, place, or thing) as part of the search process.
"The process in that patent may mean that if Google recognizes when a search query involves a particular entity, and if the entity can be associated with a specific web site, it might show multiple results for that site. For example, Google recognizes that “SEO by the Sea” is an entity, and when I perform a search such as “SEO by the Sea entities,” (without the quotation marks), Google will show a number of search results from SEO by the Sea:"
Thanks to Metaweb, Google might be able to keep a cross-reference / see-also index that identifies different names for a particular entity - such as UN = United Nations.
Yahoo is doing something with named entities and interpreting the query accordingly.
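The cross-reference / see-also index mentioned above can be pictured as a lookup from every known name to one canonical entity. The sketch below is a guess at the general shape only; the alias table and function name are hypothetical and say nothing about how Metaweb or Google actually store entities.

```python
# Hypothetical alias table: canonical entity -> names it is known by.
ALIASES = {
    "United Nations": ["UN", "U.N.", "United Nations"],
}

# Invert the table so that any alias (case-insensitive) finds its entity.
LOOKUP = {alias.lower(): canonical
          for canonical, names in ALIASES.items()
          for alias in names}


def resolve_entity(text):
    """Return the canonical entity for a name, or None if it is unknown."""
    return LOOKUP.get(text.strip().lower())


short_form = resolve_entity("UN")
long_form = resolve_entity("united nations")
```

Both forms resolve to the same entity, so a query using either name could be matched against the same set of facts.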
Interface: Where We're Headed with Web 3.0, William Laurent, Information Management Magazine (Jul/Aug 2010)
This is a very rosy forecast of a very connected Web 3.0 information world of colleagues, resources, and browsers that act as your personal smart agent.
"In the semantic Web, all information is categorized and stored in such a way that both a computer and human being can fathom what it empirically represents. Unlike Web 2.0 - where keywords are used to organize data into digestible nuggets for search engines - Web 3.0 will effectively categorize and present digital information to users in a visually improved manner that enhances interaction, analysis, intuition and search functions. The key driver in this scenario is the concept of taxonomies - standardized and self-describing classifications with codified semantics that are related to one another via highly normalized and descriptive metadata, not by a pastiche of static hyperlinks. For information on the World Wide Web to have a solid degree of relevance to users and live up to the 3.0 hype, it must contain a new magnitude of (artificial) intelligence."
How Does Google Work - crawling, indexing, retrieving - as laid out in this graphic from PPCBlog.
Infographic by the Pay Per Click Blog
Facebook, Google to Battle Over Smarter Web?, John P Mello Jr, PC World (Jul 16)
Google acquired MetaWeb - "a company that maintains an open database of things in the world. Working together we want to improve search and make the web richer and more meaningful for everyone."
Meanwhile Facebook has Open Graph - "Open Graph is also trying to tie together the far corners of the Web into packages that can be more meaningful to its users."
Good video about Metaweb that explains "entities". [3.26 min]
Also - Google Buys Metaweb To Bolster Answers, Google Squared & Rich Snippets, Barry Schwartz, Search Engine Land (Jul 16)
Of interest - Google claims to "understand facts about real people and real events out in the world" - meaning it can recognize that kind of query. For example - events toronto august 1 2010
Google: Caffeinating the Real-Time Web By Stacey Higginbotham, Business Week (June 13)
Google's Caffeine is described here as a "real time search engine" that makes possible "continuing real-time index of the entire Web". This is accomplished through custom-built hardware to reduce the storage time before a file is indexed, and through software that handles data in smaller chunks.
"By indexing in seconds instead of weeks, Google's juiced-up new search index (on Caffeine) may be one key to enhancing the real-time Web."
Google's new search index Caffeine goes live by Tom Krazit, Relevant Results (Jun 8)
Google has switched over to a new web search index called Caffeine. Tested for over a year, this promises to index more and to do so faster.
From Google's post on Our new search index
"Our old index had several layers, some of which were refreshed at a faster rate than others; the main layer would update every couple of weeks. To refresh a layer of the old index, we would analyze the entire web, which meant there was a significant delay between when we found a page and made it available to you.
With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index. That means you can find fresher information than ever before—no matter when or where it was published."
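The difference Google describes, updating the index page by page instead of rebuilding a whole layer, can be sketched with a toy inverted index. This is an illustration of incremental indexing in general, not of Caffeine itself; the class and method names are hypothetical.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated one page at a time."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs
        self.docs = {}                    # URL -> terms currently indexed

    def add_or_update(self, url, text):
        # Drop postings from the previous version of this page, if any.
        for term in self.docs.get(url, set()):
            self.postings[term].discard(url)
        terms = set(text.lower().split())
        self.docs[url] = terms
        # The page is searchable as soon as this loop finishes.
        for term in terms:
            self.postings[term].add(url)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


index = IncrementalIndex()
index.add_or_update("a.html", "fresh news today")
first_hit = index.search("news")
index.add_or_update("a.html", "old archive")  # the page changed
after_update = index.search("news")
```

Only the changed page is touched on each update, so there is no wait for a full rebuild before new content shows up in results.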
From Google’s New Indexing Infrastructure “Caffeine” Now Live, by Vanessa Fox, Search Engine Land (Jun 8)
What's it mean? Fresher content, for one.
"Google told me that this change doesn’t make any of the crawling, indexing, or ranking factors more or less important than before. It simply makes crawled content available in search results more quickly before and paves the way for added flexibility in taking advantage of the whatever may come as the web evolves."
And potentially more information about a document that may, in the future, feed into ranking algorithms.
"Matt Cutts told me “It’s important to realize that caffeine is only a change in our indexing architecture. What’s exciting about Caffeine though is that it allows easier annotation of the information stored with documents, and subsequently can unlock the potential of better ranking in the future with those additional signals.”"
The Fate of the Semantic Web, Janna Anderson and Lee Rainie, Pew Internet (May 4, 2010)
Experts discussed the future direction of the web at the FutureWeb 2010 conference. This was accompanied by a survey of the experts for their views on the prospects for the "semantic web".
The report and survey and the summaries of sessions at the conference are rich in analysis of today's state of information and direction.
"Technology experts and stakeholders who participated in a recent survey believe online information will continue to be organized and made accessible in smarter and more useful ways in coming years, but there is stark dispute about whether the improvements will match the visionary ideals of those who are working to build the semantic web."
Sir Tim Berners-Lee was among the first to envision a Semantic web where programs could recognize relationships between data and connected related items. There are several scenarios - but imagine beginning to plan a trip and have the computer present suggestions for flights, hotels, sights - the whole package - the computer (or program) would be the agent. But this presumes a structure and use of universal standards that the Web has not reached, and some feel it never will.
This survey showed a nearly even split of views - where 47% felt that "by 2020, the semantic web envisioned by Tim Berners-Lee will not be as fully effective as its creators hoped and average users will not have noticed much of a difference" and 41% thought the opposite - that the semantic web "will have been achieved to a significant degree and have clearly made a difference to average internet users."
This report includes the comments from the experts about the direction of the web. Some note that the semantic web already exists, or shortly will, in some niches (health is one); that semantic technologies such as those in semantic search will do the heavy lifting; that there will never be widespread adoption of the necessary standards. The respondents' thoughts make fascinating reading.
SEOs have the inside track on how search engines rank and display results. Searchers will be interested.
Google Increased The Title Tag Limit To 70 Characters?, Search Engine Roundtable (Jun 1)
Google seems to be showing 70 characters of a title tag in search results, up from 66.
How a Search Engine Might Rerank Search Results Based upon Time-Based Data in Query Logs, Bill Slawski, SEO by the Sea (May 31)
It's likely that search engines will rank results based on currency. Search for world cup and you should see the upcoming 2010 World Cup first. It could do this by watching query logs to interpret current interest.
"It’s quite possible that a search engine would look through its query logs, and see if a particular query is often included in more specific searches that include some kind of temporal data such as a year, or month, or day or time of day, and rewrite a searcher’s query to include that time-based information. "
Wrong Page Ranking in the Results? 6 Common Causes & 5 Solutions, SEOMoz (Jun 1)
Wonder why something turns up and not something else that is more relevant? SEOs ask the same question. It could be:
+ anchor text of incoming links
+ links from external sites - they could be out of date
+ link authority may outweigh other relevance ranking signals from the page
+ on-page optimization suggests something that it isn't
+ search engine interprets a page to be "about" something - implied here is that the search engine (especially Google) actively interprets the content in relation to the searcher's keywords.
Understanding Semantic Search and SEO, Search Engine Journal (
Written for SEO, good for searchers, this article explains that semantic search is the "probabilistic/statistical approach to understanding concepts/meanings of a web page/document". Most (all ?) search engines do it today to some degree.
Semantic search is about the search engine inferring meaning from the words available - the ones you use in your query, and the ones on the web page.
David Harry shows how this might work out by using an example of an automobile - there are synonyms (car, vehicle, auto etc) and there are associations (engine, tires, hood) and then things happen (car is sold, rented, insured, driven).
Where the words occur adds another level of meaning. Search engines use several clues in analysing a page.
* TITLE of page
* Content of page (phrase ratios)
* Prominence factors (Headings, italics, lists)
* Anchor of inbound links
* TITLE and content of pages linking in
* Spam detection
* Duplicate content detection
* Personalization
What does this mean for SEO? Build the content around concepts - create a semantically rich page.
He lists some tools that might help for finding words and phrases.
I think it boils down to well written content by subject experts - rather than extended keyword research.
But for the searcher, it means being open to meaning-based search - not expecting or relying on strict keyword match.
How To Craft the Perfect Meta Tags, Stay on Search (May 18)
Title tag is that important - "Try to put your main keywords/phrases towards the front of your title tag. Search engines (especially Google) will place more emphasis on the beginning of the tag."
Has other advice about the meta-description tag.
10 Things that Make Search a Semantic Search, Hakia Blog (May 17)
This is the clearest list of capabilities or characteristics of a semantic search engine I've ever come upon. Some to note:
+ morphological variations - words from a root such as invest, investing, investment. Google, btw, does fairly well on this too.
+ synonyms with correct senses according to the context - Hakia will do this. Google can too - much of the time - when you use the ~.
+ Handling generalizations - searcher uses the general word, and the search engine fills in the detail. In theory, a Scandinavia search will pull in Norway, Sweden, Denmark.
Hakia compares a search done with its technology to that of PubMed's index search.
Google Using Whois Data For Keyword Matching?, Search Engine Roundtable (May 12)
This seems far-out and would certainly be annoying. There is some evidence that Google uses WHOIS data on a site for ranking and for spam detection.
"There are many ways Google could use the data, from checking who owns which web sites, checking site age, checking transfer of ownership of domains, and possibly for some sort of keyword relevancy factor."
But what if the Whois registration data is the only match on a keyword - will Google show that as a search result?
Google Defines Semantic Closeness as a Ranking Signal, by Bill Slawski, SEO by the SEA (May 12)
Google filed a patent back in 2004 on using page structure - lists in particular - to assist in semantic analysis of the page and in ranking the search result. Google would be able to assess that keywords in a header are related to the list below and consider those words as "close" to items in the list when ranking search results.
"In other words, the search engine is attempting to locate and understand visual structures on a page that might be semantically meaningful, such as a list of items associated with a heading. We’re told that this process may also look for other kinds of meaningful semantic structures other than just lists as well."
Google’s Usability Fixation Reveals Local Ranking Factors, by Chris Silver Smith, Search Engine Land (Apr 26)
Article identifies 10 usability elements of business websites which the author believes could be leveraged for rankings in local search. Relates to quality scores that Google uses in ranking paid listings and organic results. Some of these are connected to page / site performance.
"You don’t really need to know the next “secret” technical trick to be edging out your competition in Google Maps rankings—you need to be including all elements which consumers would desire to find on your website when they’re attempting to select one business from the pack to go to for products and services."
Can Search Engines Sniff Out "Remarkable"?, Andrew Goodman, Traffick (Apr 7)
There are probably well over a trillion web pages now. No one is going to index all of them - and we are well past simple matching on words.
Andrew Goodman writes - "The task of measuring relevancy, quality, and intent is far more complex than it looks at first."
"In light of all this, search engines have done a pretty good job of looking at off-page signals to tell what's useful, relevant, and interesting. The major push began with the linking structure of the web, and now the effort has vastly expanded to many other emerging signals; especially, user behavior (consumption of the content; clickstreams; user trails) and new types of sharing and linking behavior in social media."
That's where Twitter comes in as another "signal" - "But another type of remarkable happens when your contribution truly makes non-confidantes want to retweet and otherwise mention you. When your article or insight achieves "breakout" beyond your circle of confidantes, and further confirming signals of user satisfaction later on when people stumble on it."
Web Browsing History Better Search Engine Ranking Signal than PageRank?, Bill Slawski, SEO by the SEA (April 6)
Interesting patent submitted by Yahoo - determine importance of page from what is browsed by users rather than from link analysis.
"This web site browsing history information might include the history of web pages visited by a user and when those pages were accessed. It might also include demographic information describing the user. A pending patent application from Yahoo, published last week explores the use of web browsing history as an alternative to looking at links to determine how important web pages might be."
Search Based upon Concepts: Applied Semantics and Google
By Bill Slawski, SEO by the SEA, on March 30th, 2010
"A recently granted Google patent from the founders of Applied Semantics discusses a search interface that could help searchers find web pages based upon the meanings of their queries rather than just pages that include those keywords."
Yusuf Mehdi's Too-Candid Comments About Abandoning the Long Tail, by Andrew Goodman, Traffick (Mar 29)
Argues that "curated" search tools don't scale - meaning subject directories, subject guides - anything that is "hand edited". Right, they don't - but maybe that is not the point - we just need a starting list of good resources.
When you want the "long tail" - go to a web search engine. But maybe search engines aren't coping well with this.
Goodman writes:
"A significant aspect of the PR rollout of Bing was focused on the fact that Microsoft knew it would be most effective -- again -- at doing better for users in the realm of more popular types of searches, ceding long tail excellence to Google."
"Google itself is no saint when it comes to long tail accomplishments and relevance. On many counts, all search engine companies have waved white flags on truly scaling to address all potential content, because there is just too much of it (and too much spam). Dialing back on the ambitions of comprehensiveness, to devote more screen real estate to trusted brands and search experiences that are tantamount to paid inclusion, is Google's current trend, much as it was for companies like Inktomi and Yahoo in the past."
Article on the Microsoft Bing situation is - Microsoft Ignored the Long Tail in Search, Bing Boss Says
It's hard to tell whether Bing intends to tackle "long tail" or if it feels "head" is what it should be doing.
Mehdi said: "Think about the explosion in the long tail. You have to crawl, index and make sense of that. On any given a month, one-third of queries that show up on Bing, it's the first time we've ever seen that query. A huge chunk of those, we'll never see again. They're like gone with the wind. The challenge of being able to be up to speed to understand that new flow of data and to be able to index the right thing so you can respond in subsecond time is a very, very hard problem."
But then - "Mehdi said the tide is turning back around to the head of popular queries. This is why Microsoft has partnered with Twitter, Wolfram Alpha and, as of today, popular location-sharing service Foursquare."
One thing is clear - Microsoft wants to understand user intent - ""It's more of a dialogue with the consumer," Mehdi said. "We are about understanding user intent, and in mapping the intent into tasks and into actions." "
As does everyone else.
Peter Norvig offers an insider's look at Google Research during SMX West, Search Engine Watch (Mar 3)
"... Peter Norvig, who spearheads Google's wide-ranging research efforts, offered a behind-the-scenes look at the cool technology projects Google is developing for future products and services."
On the list of 21 projects Norvig showed the audience were image swirl, Google Squared, clustering, attribute extraction.
Google To Begin Indexing The Internet In Real-Time? by Alex Wilhelm, The Next Web (March 4)
Is real-time indexing a good thing?
"In a move that might rewrite the entire search market, Google is rumored to be creating a system that will let allow web publishers to submit content to Google for search indexing in real-time."
It's a kind of PubSubHubBub for moving syndicated content quickly online and into the readers.
"This move by Google, if it comes to fruition, would be a super-PubSubHubBub, not just moving your content into Google Reader at light speed, but also into the hands of the tens of millions of people searching Google every few hours. It would be a bigger move towards a real-time web than Twitter will ever be."
Google Index to Go Real Time, Marshall Kirkpatrick, ReadWriteWeb (Mar 3)
Apparently there are significant benefits.
"PuSH is much more computationally efficient for Google but Slatkin says that even more important is the impact of such a move for small publishers. Right now many small sites get visited by Google maybe once a week. With a PuSH system in place, they would be able to get their content to Google automatically right away.
A richer, faster, more efficient internet would be good for everyone, but the benefits in search wouldn't be limited to Google, either. The PubSubHubbub is an open protocol and the feeds would be as visible to Yahoo and Bing as they would be to Google."
Search is the Web's fun and wicked problem, Mac Slocum, O'Reilly Radar (Feb 19)
Mac Slocum interviewed Peter Morville, author of the new book "Search Patterns" which looks at the next wave of search.
"He shows how "weird ideas" will shape search's future, and he also reveals the one recent innovation that unlocked a watershed moment for search (it's not what you'd expect)."
+ "web search works well for basic lookup" - a la Google - but not for much else
+ "Search is a complex, adaptive system and an iterative, interactive experience."
+ watch for emerging technologies as the base for changes in search
+ "search works best as a conversation"
+ "Social search" - not a threat to Google - but social search is changing the web search experience
+ users have to take some responsibility - information literacy is critical - "I'm convinced that information literacy is among the most important subjects we can teach our kids. They must learn where to search and how to evaluate what they find."
+ "Plus, search isn't only about findability. We created a searcher's edition of the user experience honeycomb to argue that search must also be useful, usable, desirable, accessible, credible, and valuable."
+ autocomplete - "new life in Web and mobile search."
Book's website - http://searchpatterns.org/ - Chapter 1 free plus pages about behaviour patterns, design patterns, and some illustrations.
Location Aware Browsing
Some sites can tell you more and personalize your results if they know where you are. Firefox has "location aware browsing" for this.
It assures us that privacy is protected - you as user do get the last word.
Firefox says this on its location aware browsing page.
"Your privacy is extremely important to us, and Firefox never shares your location without your permission. When you visit a page that requests your information, you’ll be asked before any information is shared with the requesting website and our third-party service provider."
konsrtr is such a site - it shows upcoming concerts for a city - with images and videos. (Unfortunately it's not searchable - and the music events are all of the bands-in-town kind.) It does recognize Toronto.
Charles Knight commented on konsrtr's simple design - When Less is More – concert search engine konsrtr (Feb 26)
Kngine - Web 3.0 Search Engine
I nearly missed this new search engine that aims to "unlock meaning" in search. Amazingly, this search engine comes to us from Cairo, Egypt. The about page says -
"We are working on next generation of searching technologies to unlock meaning; rather than indexing the document in Inverted Index fashion, Kngine tries to understand the documents and the search queries in order to provide customized meaningful search result.""Our goal is to build Web 3.0 Web Search Engine on the advances of Web Search Engine, Semantic Web, Data Representation technologies -- a new form of Web Search Engine that will unleash a revolution of new possibilities."
The Tour page shows the ways this approach can assist in searches.
* Read Perception Words with Multiple Meanings.
* Smart Information.
* Available Results. NEW !
* Concept List (List of things).
* Answer your questions.
* Comparisons.
* Updated Information (Weather, Stock, Currency Price, and Sport Matches Results).
* Link the data, and view direct data.
There could be a Canadian connection. One of the sample questions is When did the Toronto Dominion Centre open (but the link that Kngine provides for this is a bad search - somebody is not good with detail.)
Kngine gets its reference and question-answering information from Freebase, web results from Google, and maps from Google.
You have to stay high level in your queries to see the concepts. There are none for exploring the Canadian Arctic, but Canadian Arctic as a search identifies one concept and several "nearest" concepts. The full treatment of a topic shows in this query for green living.
Choices for search are Web, Web with full information, and Photos. I haven't seen any difference in the two webs. The concept analysis doesn't apply to Photos - it seems to be the standard photo search.
I'm quite surprised that Kngine can answer questions like top 10 countries for oil production, or top 10 countries for wheat - with a clickable map, no less.
Since the web results are coming from Google, we can use Google's syntax. This can somewhat defeat the purpose of a semantic search engine, but may be helpful if you want pages from a domain: eg gov for US Government (site:gov), or ca for Canada (site:ca).
Kngine is very promising. For now it seems to deal with high level topics very well, and can handle some kinds of factual questions. I don't know how far we can push that, but it does a very good job on population of Toronto.
A note on the site reveals that "Kngine contains 1.2+ billion of pieces of data about more than 8 million concepts".
This is one to use and watch. Let's hope Kngine succeeds.
Entity Cube from Microsoft Asia is an extremely interesting experiment that works with entities and a faceted view.
From the site: "EntityCube is a research prototype for exploring object-level search technologies, which automatically summarizes the Web for entities (such as people, locations and organizations) with a modest web presence."
" EntityCube is just a prototype service for collating and summarizing the specific entity information appearing in the Internet web content. Its technology is built to help users to perform Web data mining for text contents around the names and get the search results based on the search of available web contents. "
It's very good on names of people who are very present on the web. See this profile on Stephen Abram, or Elizabeth May, leader of the Green Party. It will show related names and academic references (though on most searches this part seems completely unrelated).
See Entity Cube, Experiencing Information (Dec 19, 2009) for further comments.
Is Google Getting Too Personal?, SEOMoz (Feb 16)
Dr Pete ran some searches to find out how much Google actually personalizes results.
Key finding: "This data suggests that being logged out has very little impact on rankings, assuming that you're on the same machine with the same IP. Move to a new machine/IP, and the difference is much more substantial."
Web 3.0 and Semantic Search by Abhishek Gattani, AltSearchEngines (Jan 31)
Guest writer, Abhishek Gattani, of Kosmix gives his take on the semantic web. Kosmix has been utilizing some of the semantic analysis and technology to present search results, and notes that Google, Yahoo and Bing have recently been adopting aspects.
He wrote - "Semantic web is about annotating facets and attributes associated with web content and linking data. In other words, semantic web is about teaching machines to read web pages, which are designed to be read by humans. So how can semantics improve search?"
One manifestation of semantic understanding is the new ability of search engines to present the structure of a web page. "For instance, for events we [Kosmix] list the date, time, event snippet, and even ticket prices, which really let you decide if you should be clicking to book a ticket or it is not aligning with your schedule and budget."
Semantic technologies come into play in ranking results too - in the occurrence of related words.
Bing, Google, And The Enigmatic T2: The Race For A Complete Semantic Search Engine, Erick Schonfeld, TechCrunch (Jan 22)
Bing, in introducing its recipe search, is getting closer to semantic search. But there are others interested in doing this well.
"Bing is big on guided search (showing relevant search categories to help narrow results), but this goes one step further towards semantic search (the ability to index and search the Web by different facets). Recipes are just the beginning, and it’s not just Bing. Google and a handful of startups, including Evri, Hakia, and Radar Networks, are hard at work on making semantic search a reality. The race is on to bring this type of semantic filtering for nearly every category of search across the Web."
Cracking the Google Algorithm, and Understanding Search Patents with Ted “tedster” Ulle, Stuntdubl (Jan 28)
Outside of Google itself, Ted Ulle of WebMasterWorld is the expert on Google's algorithms. In this interview he answers questions about "5 most significant algorithm change" and "top 5 changes in the next 5 years"
Of particular interest:
+ "Phrase-based indexing, as described in the 2006 patents, brought a deeper level of semantic intelligence to the search results."
+ "Geo-located results began to create different rankings even for various areas of the same US and UK city somewhere around 2005 or so."
+ "Google’s user "intention engine" has had a major effect, and that rolled out in a big way in 2009. This was coupled with a kind of automated taxonomy of query terms."
This was especially interesting because it suggests that Google is clustering results in the background but not showing the "taxonomy labels". Instead it selects from the clusters. See Ted's post from August 2009
I've been studying one of the "phrase-based indexing" patents that Google filed, in particular Automatic taxonomy generation in search results using phrases [patft1.uspto.gov]. It's giving me new thoughts on how search results can be blended to include representatives from different clusters, or different taxonomies related to the original query phrase. Walking through the patent's logic: a search phrase is associated with several clusters of web pages. Each one of those clusters is a group that includes some other phrase, in addition to the requested keyword phrase. This assumes that the phrases that create a cluster are groups of words that offer what the patent calls "information gain".
This patent would automatically create a taxonomy label for each cluster, based on that second phrase. A given web page could be a member of more than one cluster, and therefore be part of several different taxonomies related to the principal search term.
From Webmaster World - Google Search News - Blended Results, QDF and User Intention at Google
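The logic Ted walks through can be sketched very loosely like this: pages that match the query are grouped by a second phrase they also contain, that second phrase becomes the cluster's label, and the result list takes a representative from each cluster. The function names, the sample pages and the naive substring matching are all assumptions for the illustration; the patent's "information gain" test for choosing phrases is not modelled.

```python
from collections import defaultdict


def cluster_by_phrase(pages, query, phrases):
    """Group pages matching `query` by the secondary phrases they also contain.

    `pages` maps URL -> page text. A page can land in more than one cluster.
    The cluster label is simply the secondary phrase.
    """
    clusters = defaultdict(list)
    for url, text in pages.items():
        text = text.lower()
        if query.lower() not in text:
            continue
        for phrase in phrases:
            if phrase.lower() in text:  # naive substring match
                clusters[phrase].append(url)
    return dict(clusters)


def blend(clusters, per_cluster=1):
    """Take up to `per_cluster` unseen pages from each cluster, in label order."""
    seen, results = set(), []
    for label in sorted(clusters):
        taken = 0
        for url in clusters[label]:
            if taken >= per_cluster:
                break
            if url not in seen:
                seen.add(url)
                results.append(url)
                taken += 1
    return results


pages = {
    "a": "jaguar car dealers",
    "b": "jaguar big cat habitat",
    "c": "jaguar car and big cat",
    "d": "lion habitat",
}
clusters = cluster_by_phrase(pages, "jaguar", ["car", "big cat"])
print(clusters)         # {'car': ['a', 'c'], 'big cat': ['b', 'c']}
print(blend(clusters))  # ['b', 'a']
```

The searcher never sees the labels "car" and "big cat"; they only see a first page of results that covers both meanings, which matches the behaviour described above.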
+ "The beginnings of sentiment analysis may begin to show up in the next few years. I expect to see it first on the level of rating for where content falls on a fact-to-opinion spectrum. Full sentiment analysis (rating content on a "favorable-to-critical" opinion spectrum) is already in use for some social media monitoring, but that is probably too big a technical challenge to expect Google to go with it in the general search results."
Bing’s Stefan Weitz: Rethinking The Search Experience, by Gord Hotchkiss, Search Engine Land (Jan 22)
Gord Hotchkiss, through a series of articles, will explore where web search might be going from a user's perspective. He begins by talking with Stefan Weitz, Director of Bing Search, about using search to make decisions.
Weitz described the main concept behind Bing as a decision engine.
" The Decision Engine was built around three big areas. The first was providing great core results. That’s the standard “block and tackle,” keyword to keyword algo based search. ... The Decision Engine comes when you add in the other two big things.
The first is that organization of results to help people explore topics that they don’t understand. Can we do a better job with related searches? Can we organize results using categorized search. Can we semantically break down the 160 million results in a way that makes more sense. The third thing is how can we provide tools that help you make decisions? We focused in the initial release on the travel, the shopping, the local, the health. We built fairly complex computer science tools to help you when you do decide you want to a search engine and book that trip to Florida. What can we do differently that will help you get that job done faster? In the travel vertical, it’s the Farecast technology, the ranking within airfares… all those types of things. That’s how it practically manifests in the engine and it is designed to respond to those data points I mentioned earlier."
Helping Children Find What They Need on the Internet
By STEFANIE OLSEN, International Herald Tribune (Dec 25)
Search engines have been mainly designed for adults who are equipped to think up keywords. Children haven't reached that stage. Several search engines are trying to serve them better, and in doing so will probably help adults too.
"Children’s choices of search engines differ only slightly from the preferences of adults. Google ranks most popular among children, followed by Yahoo, Google Image search, Microsoft’s Bing and Ask.com, according to the research firm Nielsen. (Among adults, Bing is ahead of Google Image.) "
From this may come new and better visual aids and search prompts.
How Google May Expand Searches Using Synonyms for Words in Queries, SEO by the Sea (Dec 22)
Google sometimes searches on words related to the ones you use. This patent discovery confirms that.
"A patent granted to Google this week explores how the search engine might expand the search terms that searchers use to include synonyms in searches, to make it easier for searchers to locate information on the Web. In the Ft. Wayne example, this could mean that Google would look for pages on the Web that were relevant for both [web hosting Fort Wayne] and [web hosting Ft. Wayne]."
This posting describes the process for finding the synonyms (or related words) and evaluating the quality of the words in context.
"What does this mean for you as a searcher or as a site owner if Google is using this process?For searchers, it might mean that Google may add pages to your search results based upon words it perceives as synonyms to words you used in your query. Search for something while including the words “District of Columbia” in your search, and you may see also see pages that use “Washington, D.C.” or “D.C.” instead of “District of Columbia.” "
How a Search Engine May Choose Search Snippets By Bill Slawski, SEO by the Sea (Dec 4)
Today search engines show a snippet from the page - this might be a summary of the page, the section where your search terms occur, a description from DMOZ (true at Google), or the publisher's meta description (rare).
Yahoo has a patent that should make those snippets more relevant.
"The Yahoo patent filing tells us that it could look at the following for each line on a page, to come up with a score for each line to use as a snippet:* A query-independent relevance for each line of text – a degree to which the line of text of the document summarizes the document.
* A query-dependent relevance of each of the lines of text – a relevance of the line of text to the query.
* The intent behind a query.
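The first two signals in that list can be combined in a simple scoring sketch. The formulas below are stand-ins chosen for the illustration (average word frequency for "summarizes the document", share of query terms matched for "relevance to the query"), not what the Yahoo filing specifies, and query intent is left out entirely. `choose_snippet`, `tokenize` and the `weight` parameter are hypothetical names.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def choose_snippet(lines, query, weight=0.5):
    """Pick the line with the best blend of summary value and query match.

    `weight` is the share given to the query-independent score; the rest
    goes to the query-dependent score. `lines` must not be empty.
    """
    doc_counts = Counter(w for line in lines for w in tokenize(line))
    total = sum(doc_counts.values())
    query_terms = set(tokenize(query))

    def score(line):
        words = tokenize(line)
        if not words:
            return 0.0
        # Query-independent: how typical the line's words are of the page.
        independent = sum(doc_counts[w] for w in words) / (len(words) * total)
        # Query-dependent: share of the query terms found in the line.
        if query_terms:
            dependent = len(query_terms & set(words)) / len(query_terms)
        else:
            dependent = 0.0
        return weight * independent + (1 - weight) * dependent

    return max(lines, key=score)


page = [
    "Toronto hotels near the airport.",
    "Contact us for booking.",
    "Cheap hotels in downtown Toronto with parking.",
]
print(choose_snippet(page, "downtown parking"))
# Cheap hotels in downtown Toronto with parking.
```

Setting `weight=1.0` ignores the query and returns the line that best represents the page as a whole, which is roughly the "summary of page" kind of snippet mentioned earlier.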
Search User Interfaces by Marti Hearst, Cambridge University Press 2009.
Marti Hearst has been involved in the Flamingo Project and knows a lot about search behaviour and designing search interfaces. This book is available online for free.
"This book has two intended primary audiences. The first is academic researchers, graduate students, and those teaching graduate level courses in information retrieval, user interfaces, and other information management-related topics. The second intended audience is practitioners who design and build search interfaces."
There are also some webcast videos from the course on Search Engines: Technology, Society, and Business (2005)
Book was mentioned in A Roundup Of 2009’s Best SEO Books by Chris Sherman, Search Engine Land.
Check that article for short reviews of four titles on search engine optimization.
Not all good products succeed.
+ Groxis - maker of the visual search tool, Groker - has closed. Happened early in 2009. Groker was a front end for Yahoo (in a demo), and was used by Internet Public Library.
+ Siderean Software - notable for relational navigation that was once used by LII.org - is floundering. The web site is still up but the news is from 2008.
Source: Information Today (Oct 2009)
Microformats: What, How, and Why by Steven Bradley, vanseodesign (Nov 3)
Steven Bradley to the rescue - in this post he describes microformats clearly and succinctly - it's a markup (code) that makes it easier to "share and reuse information across different applications and websites."
"The goal of this post is to introduce you to the what, how, and why of microformats and point you in the right direction so you can begin using them where appropriate."
Microformats are used for various purposes, but one that will matter to searchers is in the creation of rich snippets.
"Google recently introduced the idea of rich snippets into search results. Rich snippets make use of microformats to add additional details about your site in the snippet below your link in search results. Reviews about your products or contact information for ordering might be included directly on the search results page."
Cognition Technologies Parses Its Way to a Better Understanding of Language, at Altsearchengines (Nov 1)
Cognition is doing a lot of work on semantic mapping to make natural language processing effective. This posting at Altsearchengines talks about a new syntactic parser.
"Cognition Technologies has recently added an advanced syntactic parser module to its language-understanding technology. “What does that mean?”, you ask? It means that Cognition can now “parse”, or break down, the component parts of sentences to deliver an even more accurate and complete understanding of the content. Since words often have more than one meaning, the ability to parse sentences enhances the technology’s ability to understand the context and sentence structure of the material being analyzed."
Try it at http://www.cognition.com/. This is one of the directions search technology is taking.
Two posts from the Kosmix Blog about Semantic Web.
Basically, Digvijay Lamba observes "that Wikipedia can provide a global and ever improving vocabulary bloggers and other content creators to provide richer context around what they write." Wikipedia can help with identifying the entities and providing context.
Why Wikipedia Can Make a Giant Leap Ahead for the Semantic Web
Wikipedia and the Semantic Web – Part 2
The future:
"In the end, we have to take baby steps in our goal for rich semantic annotation of Web content. Automated tools are already attempting to do this for content that has already been created. Will the automated methods improve fast enough that there will never be a need for content creators to annotate? Or will having a vocabulary and an easy method of annotation give enough advantage to the content creators that we will see widespread adoption? My guess is that the answer lies somewhere in between."
Google Removes PageRank Data From Webmaster Tools, Search Engine Roundtable (Oct 15)
Reports that Google has dropped PageRank information from the Google Webmaster Tools - presumably to encourage webmasters to attend to other ways to improve rankings. Why, then, does it remain on the Google Toolbar for searchers to use?
But I don't think searchers use it much, if at all. And in my analysis of search results, albeit loosely done, pages with high PageRank very often do not hold the top 3 results. In fact I was beginning to suspect that PageRank was much less of a factor than people have thought. It may be that Google really has changed its algorithms so much that PR - or at least the "many links" understanding of it - has been downgraded. This posting in Search Engine Roundtable might be more evidence of that.
However, Google's technology page puts PageRank first. It is clear that this is no longer a simple links-in algorithm (if it ever was):
"PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.
PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page's importance."
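The "votes weighted by the importance of the voter" idea in that quote can be illustrated with the textbook power-iteration form of PageRank. This is only a sketch of the classic published formulation, not Google's production algorithm. The four-page link graph, the damping factor of 0.85 and the function name are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its current rank among the pages it votes for.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: a, b and d all link to c; c links back to a only.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page "a" has only one inbound link, yet it outranks "b" and "d" because that single vote comes from "c", the most linked-to page.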
An Evolution of Search, by John D. Holt and David J. Miller, ASIS&T (Oct/Nov 2009)
John D. Holt and David J. Miller, senior architects in the LexisNexis Risk and Information Analytics Group, review the progression in information search and retrieval technologies since the early days of Boolean searching. The tugging match is still between precision (precise results) and recall (comprehensive results). Search has evolved to entity search - retrieval by the attributes of the information item.
"This paper provides a brief review of some of the earlier stages of search evolution in the context of the evolutionary pressures of the concurrent improvement of both precision and recall. "
Most especially note the conclusion:
Entity search is another step in the evolution of information retrieval systems. Entity search builds upon Boolean and relevance ranking techniques. Entity search provides improvements in both precision and recall over traditional Boolean and relevance ranked search techniques.
Boolean search techniques require the researcher to be knowledgeable of the words and expressions used in the document or record collection. Precise results can be obtained, but at the cost of a significant drop in recall. Recall can be achieved, but only at a significant drop in precision.
Relevance ranking via statistical techniques can be used to improve apparent precision in some cases. However, the statistical techniques do not apply well to searching structured and semi-structured data with attribute values.
The linking or clustering of the documents or records into sets of references that describe an entity can be used for much more than just reporting on an entity. The information from the set can be used in some cases to improve recall by broadening the search. Alternatively, and more powerfully, the entity can become the object of the search.
A search expression that specifies a set of attribute values can be used when the entity is the object of the search. Both precision and recall are improved. Precision is improved because the entities returned are all consistent with the attribute values supplied in the search. Recall is improved because the combination of entity values specified in the search expression need not appear in any particular underlying reference document or record.
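The recall claim in that last paragraph is easier to see with a toy example. The sketch below assumes the records have already been linked to entities (the "entity_id" field). The records, field names and both search functions are made up for illustration and are not from the Holt and Miller paper.

```python
from collections import defaultdict

# Hypothetical records. entity_id is assumed to come from an earlier
# linking/clustering step that decided which records describe one entity.
records = [
    {"entity_id": "E1", "name": "J. Smith", "city": "Dayton"},
    {"entity_id": "E1", "name": "John Smith", "phone": "555-0100"},
    {"entity_id": "E2", "name": "John Smith", "city": "Boston"},
]


def record_search(records, **wanted):
    """Record-level search: all wanted values must sit in one record."""
    return [r for r in records
            if all(r.get(key) == value for key, value in wanted.items())]


def entity_search(records, **wanted):
    """Entity-level search: wanted values may be spread over an entity's records."""
    merged = defaultdict(lambda: defaultdict(set))
    for record in records:
        for key, value in record.items():
            if key != "entity_id":
                merged[record["entity_id"]][key].add(value)
    return sorted(
        entity_id for entity_id, attrs in merged.items()
        if all(value in attrs.get(key, set()) for key, value in wanted.items())
    )


record_hits = record_search(records, city="Dayton", phone="555-0100")
entity_hits = entity_search(records, city="Dayton", phone="555-0100")
print(record_hits)  # no single record holds both values
print(entity_hits)  # the entity as a whole does
```

No single record mentions both Dayton and the phone number, so the record-level search comes back empty, while the entity-level search finds E1. E2 is excluded because nothing in its records matches the phone number, which is the precision side of the argument.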
Business Week ran a lengthy series on Google Search in which Silicon Valley bureau chief Rob Hof interviewed CEO Eric Schmidt and the major heads of search technology.
The main article was Can Google Stay on Top of the Web? (Oct 1)
Below are the four interviews with the company's search gurus plus one with CEO Eric Schmidt.
Matt Cutts: How Google Deals With Web Spam, Rob Hof, Business Week (October 04)
This interview with Google's Matt Cutts tells us more about how Google search departments work together: ranking, spam control, and ads. It's all part of their mission to deliver quality results.
Evaluating search results:
+ "we’ve built up a lot of evaluation metrics"
Understanding search intent:
+ "We try to do a lot so we can understand queries better. Some people will mistype queries, so we try to do a real good spell-check system. A lot of people will type in synonyms, like "automobile" instead of "cars" when the name of the business is Cars R Us. So we try to take the query as a suggestion."
+ "We used to require an absolute perfect match, but over time we’ve gotten better at spelling, morphology, synonyms, all these sorts of things like stemming, where somebody types in “runners” and maybe they meant “runner,” or “running.”"
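A rough sketch of what "taking the query as a suggestion" could look like: reduce words to a crude stem and map synonyms to one canonical term before matching. The suffix list, the synonym table and the function names are assumptions for illustration only. Real engines use far more careful morphology and learned synonyms, and nothing here reflects Google's actual code.

```python
# Suffixes tried in order; deliberately tiny and crude.
SUFFIXES = ("ers", "ing", "er", "s")

# Hypothetical synonym table mapping a stem to a canonical term.
SYNONYMS = {"automobile": "car", "auto": "car"}


def crude_stem(word):
    """Strip one common suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "runn" -> "run"
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def normalize(text):
    """Turn text into a set of canonical stems."""
    stems = (crude_stem(token) for token in text.split())
    return {SYNONYMS.get(stem, stem) for stem in stems}


query_terms = normalize("automobile")
title_terms = normalize("Cars R Us")
shared_terms = query_terms & title_terms
print(crude_stem("runners"), crude_stem("runner"), crude_stem("running"))
print(shared_terms)
```

With an exact-match engine the query "automobile" would never find "Cars R Us". After normalizing, both sides share the term "car", so the business can be considered a candidate result.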
Delivering freshness:
+ "But in general, Google is fresher. Google is not only fresher but more comprehensive. Those are three key things: freshness; comprehensiveness (you want to crawl as much of the Web as possible); and relevance (core ranking and Web spam). And you want the user experience to be really clean."
Detecting hackers:
+ "We write detectors. We’ve written classifiers—an algorithm, a heuristic that essentially takes a bunch of signals and tries to say yes, this site has been hacked or no, it hasn’t, and at what level of the directory and things like that."
Other articles in this series on Google search:
Google's Udi Manber: Search Is About People, Not Just Data
Udi Manber is VP of technology for search.
Excerpts:
Q: Can you give me a sense of the types of methods you use to improve search?
A: Humans are involved, formulas are involved, experiments are involved. We often do A/B tests, give one set of people an algorithm, give another set of people another set of algorithms and see how they behave. We measure lots of things, not just clicks.
Q: So you have to determine what does change and focus on indexing that?
A: We have to determine from the query whether it can benefit from something in real-time. Like “history of the Renaissance.” It’s possible that somebody on Twitter just mentioned that. But a) it’s not that likely and b) it’s probably not what you want. You want the best article on the Renaissance. So time is not as important on that kind of query.
But search for “earthquake” and time is much more important. Or a particular celebrity that had news in the last five minutes. So we have to change the algorithm based on the query. We do that now.
Google Search Guru Singhal: We Will Try Outlandish Ideas
Amit Singhal looks after ranking algorithms. His team ran 6,000 experiments last year which led to roughly 500 changes in how search works.
Google's Scott Huffman: Many More Search Features Coming
Scott Huffman's team evaluates the effects of every proposed change to Google. Last year there were 6,000 experiments.
"Huffman explained in detail how Google runs all those experiments—which include the use of hundreds of human evaluators in addition to Google’s massive computer infrastructure."
Google uses people and statistical analysis of clicks to evaluate the results. It especially works on relevance for a country or locale.
Excerpts:
Q: What does the evaluation unit do?
A: We try to measure every possible which way we can think of how good is Google, how good are our search results, how well are they serving our users. And we break that down all kinds of ways—by 100 locales [country plus language pairs], by different genres (product queries, health queries, local queries, long queries, queries that don’t happen very often, queries that are very popular) times how are we doing on those in France and Switzerland and other places
Q: Can you give me a sense for how you approach evaluation?
A: We use two main kinds of evaluation data. One kind is we have human evaluators all over the world for whom we have a workflow system. They come to it and are fed things to evaluate. A typical thing is: Here is a query, you’re speaking French in Switzerland, here’s a URL, tell us on some kind of scale or some set of flags and description how good of a URL is that for that query.
The other data source we use is live experimentation with our users. A typical example where we use that more is for user interface changes to search. It’s hard to guess what people’s reaction will be to any particular UI change.
Q: How are personalized search results evaluated—any differently?
A: ... Another thing that we spend a lot of time on is at the country level. Many countries speak English, but when I type in, say “bank,” I want pretty different answers if I’m in the U.S. vs. the U.K. vs. India vs. Australia. And today Google gives you very different answers for those. It also applies inside the country—in Dallas and Atlanta, you’ll get different results for “First Baptist Church.” Those kind tend to be a little trickier for us.
How Google Plans to Stay Ahead in Search
"CEO Eric Schmidt discusses how Google is handling challenges from Microsoft and upstarts Twitter and Facebook—and why search remains its priority "
Q: You said recently that you worry about where growth for a large company such as Google comes next. Where will that growth come from, and what does that say about what Google will be in five to 10 years?
A: We are first and foremost a search company. Of course, search changes. Location will become more important, for example. As long as we can be first to invent the new solutions to search, we'll be fine. We're still investing a lot in search and search quality. In our case, growth will come from businesses we're already in.
Organizing the Web around Concepts by Mitul Tiwari, Kosmix Blog (Sept 30)
Identifies the next wave of web search as being one which reorganizes the "Internet by topic or concept". Examples of the concept orientation are "Freebase, Google Squared, DBLife, and Kosmix topic pages".
Kosmix sees web pages being comprised of three types: search pages, topic/concept pages, and articles. Searchers benefit from seeing concepts related to those in their query.
People / editors organize pages by concept - this is the essence of a directory. But today the topical approach is done algorithmically through:
+ concept extraction
+ relationship mining
+ linking data with concepts
Concluded - "In short, organizing the web around concepts is a promising area and a stepping stone to bring meaning behind the web data."
17 Ways Search Engines Judge the Value of a Link, Rand Fishkin, SEOmoz Blog (Sept 10)
Illuminating article on the most important factors to a search engine in ranking results. It opens with the importance of links between domains.
"As you've likely noticed, search engines have become more and more dependent on metrics about an entire domain, rather than just an individual page. It's why you'll see new pages or those with very few links ranking highly, simply because they're on an important, trusted, well-linked-to domain. In the ranking factors survey, we called this "domain authority" and it accounted for the single largest chunk of the Google algorithm (in the aggregate of the voters' opinions). Domain authority is likely calculated off the domain link graph, which is unique from the web's page-based link graph (upon which Google's original PageRank algorithm is based). In the list below, some metrics influence only one of these, while others can affect both."
Choose Your Own Adventure: Alternatives to Bing and Google by Jason W Bunyan, DMB (July 30)
This article responds to the question "Is Bing better than Google?", but it is really about whether, and how much, search will evolve.
Nice analogy from Sue Feldman, as retold by Elaine Toms:
"Search has a long way to go, according to Dalhousie University Canada Research Chair in Management Informatics and associate professor Dr. Elaine Toms. “Sue Feldman used a wonderful analogy: we have ovens, microwaves, toasters and barbeques, which all have heating technology, and we use each for its specific purpose. So why not multiple search tools? In her example, I think the common outdoor fire is the current search engine. And we have not yet developed tools for task-specific environments that need and use rich information. [Compare] what a scholar needs and what a health consumer needs. We are not even close.”"
HealthBase--medical search engines maturing by Elizabeth Armstrong Moore, CNet (Sept 2)
HealthBase uses a "content intelligence platform" as semantic technology to understand health content.
"Culling through 10 million health articles and sorting search results on two types of data, "conditions" and "treatments," into manageable subsets, HealthBase includes "causes of," "treatments for," "complications of," and "pros and cons of treatment." Content sources are also provided and ranked. And Jens Tellefsen vice president of marketing and product strategy, said it might include user collaboration akin to Digg's voting articles up or down in the near future."
For more about Content Intelligence see Is Content Intelligence the New Business Intelligence?
"Content intelligence is about creating new content and information services derived from a company’s own premium content, and then optionally combining and enriching it with insights from the Internet, resulting in new sets of content that can power new and differentiated information services. But how is this achieved? By using semantic technologies to mine the breadth and depth of relevant, targeted information from the Web, or proprietary or enterprise sources."
Postscript
Comments from Gary Price - Netbase Debuts HealthBase Demo (Sept 2)
It’s semantic – easier solution to annotate and search images, ICT Results (Aug 27)
Indicates the direction of image search - a mix of text mining (surrounding text and name), object identification and face identification - plus semantic annotation or additional assigned terms.
Search: The Last Frontier by Barbara Brynko, Information Today (June 2009) - via AllBusiness.com
Report from the two-day Infonortics Conference in Boston in April 2009. This conference is always cutting edge. Semantic search was the main topic.
Why Semantic Search?
Since searchers have begun wading through the quagmire of information, their needs have changed and so have their tolerance levels. There are many times when page-ranking results just don't produce what users are searching for on the web. Dmitri Soubbotin from Semantic Engines elaborated on three reasons users need semantic search. First, he says users deal with insufficient relevance of traditional search results; users just spend too much time searching for information but not always finding what they want. Second, users are pressed for time and have short attention spans; users want relevant information retrieved quickly. Third, most users only look at the first page of the results and don't even peek at the useful sources beyond. Far too many users say they will "settle for what I have here," he says.
But what's really under the hood? Instead of using ranking algorithms as Google does to try to predict relevancy for the user, semantic search uses the science of the meanings in language to produce point-on results. Natural language processing, linguistics, and text mining can be matched against an ontology that works especially well for verticals. Homogeneous content yields better results; there's just "less noise" and less disambiguation for users to deal with.
After all, the goal of the web is to extract more relevant results and to retrieve accurate answers for users while discovering additional content and digging deeper for pertinent data.
A search engine such as Sensebot provides an overview of a topic's hard facts interspersed in text results. Users receive a multidocument summary and links that go beyond simple information search and retrieval.
For Diane Burley of Nstein, "Search is so yesterday. ... It's now all about the finding."
But to make the process of finding information easier, we need to take a look at how people seek information, how they orient themselves, and what their sources of frustration may be, she says.
"Until users are inconvenienced, they don't see the value in the search process," she says. If concepts and entities are extracted, links give users more reason to stay on a site and make it easier for them to mine and to aggregate results, even across different languages and country borders.
Bringing Clarity to the Mix
For semantic search to work effectively, users need to maximize relevance and minimize disambiguation. Kathleen Dahlgren from Cognition Technologies explored approaches to tagging, ontology, syntax parsing, and a semantic map. The most common words are the most ambiguous, she says, using the word "lemon" as an example. A word string for "lemon" produces a number of possible definitions: It could be a citrus fruit, a poorly manufactured item, a yellow color property, or a behavioral property.
But word definitions are just part of the puzzle for semantic technology. Add concepts to the mix (lemons are typically yellow) and personal ignorance (pythons are dangerous, but what are pythons anyway?) and social ignorance (the sun revolves around the Earth), and users have the beginnings of a deeper search. In semantic analysis, the word is not only defined by its relevancy, it also takes into account the other words that are present in the sentence and as part of the context of the complete text. Less disambiguation means more-relevant results and a better understanding for the user.
Other topics:
+ Image-driven search and visualization
+ Mobile - voice search
+ Enterprise search - and e-discovery
+ Meaning extraction
+ Aids for engaging in a dialogue with the user
The Semantic Web is getting much more attention. Richard MacManus interviewed Tim Berners-Lee at MIT in July.
Part 1: Linked Data - this is the base - "The Semantic Web and Linked Data connect because when we've got this web of linked data, there are already lots of technologies which exist to do fancy things with it. But it's time now to concentrate on getting the web of linked data out there."
Part 2: Search Engines, User Interfaces for Data, Wolfram Alpha, And More... - Tim Berners-Lee describes how search will be --
"So I think people will search using a search text engine, and find a webpage. On the front of the webpage they'll find a link to some data, then they'll browse with a data browser, then they'll find a pattern which is really interesting, then they'll make their data system go and find all the things which are like that pattern (which is actually doing a query, but they'll not realize it), then they'll be in data mode with tables and doing statistical analysis, and in that statistical analysis they'll find an interesting object which has a home page, and they'll click on that, and go to a homepage and be back on the Web again. "
Advanced Custom Search Configuration, Google Custom Search Blog (June 29)
Presentation at Google I/O by Nick Weininger on Advanced Custom Search Configuration. [46 minutes]
Key tools for building and presenting - including new features for rich snippets and microformats.
Shows About.com's uses of CSEs on topics. Also - the Google Blogger search gadget for creating a search on the blog's domain of interest.
In the last 20 minutes Adobe showed a use case of Custom Search for community help.
Search leaders debate semantics, by Tom Krazit, Webware (June 17)
"Panelists from the four major search engines--Google, Yahoo, Bing, and Ask.com--joined Web search start-ups TrueKnowledge and Hakia at the W3C's Semantic Technology Conference to discuss the rise of semantic technology as the engine behind the still nascent Internet search industry. Semantic search, or the idea of divining a user's true intent from how they enter their queries and how Web data is structured, is an unfamiliar concept to the majority of Web surfers who tend to think Internet search is actually pretty good as it is."
Semantic technology for search is about:
+ structuring data - Andrew Tomkins, chief scientist at Yahoo Search - "Today on any major search engine, you'll see structured information about a restaurant," he said, basic things like phone numbers, address, or maybe a link to a map of its location. All of those things require agreement on standards to make it happen."
+ analyzing the meaning of plain text
+ answering questions - "The goal of all this work is to make search more intuitive, more like asking a friend or colleague a question, said Riza Berkan, CEO of semantic start-up Hakia. "We believe search is going to move to more conversational techniques," he said."
Yahoo! Announces Common Tag: Like The Meta Keywords Tag, But Even Better, Vanessa Fox, SearchEngineLand (Jun 15)
Common Tag - an effort to create a semblance of structured data (or semantic tagging) but hard to know now what will come of it. Somewhat replicates the simpler meta keywords and social bookmarking tagging.
"Not only does Common Tag seem to replicate the purpose of the meta keywords tag, it seems to also replicate Delicious-style tagging and external anchor text."
Google and the Evolution of Search I: Human Evaluators, by John Paczkowski, Digital Daily (June 3)
Are people involved in adjusting the ranking of Google's search results?
"Google, for example, employs a vast team of human search “Quality Raters” (You’ll find a copy of an old training manual here). Spread out around the world, these evaluators, mostly college students, review search returns against established criteria–testing different algorithms and see which works “best” in predicting the quality of a site (though not directly judging the quality of any individual site itself).
They’re aided by Google’s own registered users, who can now, when logged into their Google accounts, promote and delete sites from their own search returns according to their preferences."
Would be helpful to have an estimate of the number of registered users who bother to adjust the rankings.
This is a three-part series of interviews with engineering director Scott Huffman of the search evaluation team, senior Google software engineer Matt Cutts, and Google Fellow Amit Singhal.
Amit Singhal closed with "AS: I believe that the role of the human evaluator in search will be there until we can understand language by computers, which is a far distance from where we are today. You know, we have made great advances but by no means is our language understanding technology close to saying this person really meant to get this document or not."
Presentations from the April 2009 Infonortics conference held in Boston are available for viewing. Most are pdfs. There are also interviews with the speakers by Stephen Arnold.
Topics
+ Semantic approach to search
+ Semantic web
+ Visualization of results
+ Classification
+ Voice search
+ E-discovery
+ Enterprise search
Ask.com Searches Smarter, Ask.com Blog (May 19)
It's a case of blowing its own horn, but Ask answers some questions quite well by being able to search structured data. This blog entry points to an article by Jennifer Zaino - Ask.com Answers the Data Extraction Question at SemanticWeb.com
"Ask.com is putting a focus on the structured data search problem, helping searches extract web data that is often not in text but in database tables and XML feeds where keyword searches don’t cut it. For example, a table might have data points around the words Toyota, Prius, and hybrid, and price, but if you ask most search engines to what is the price of a 2009 Toyota hybrid Prius that table won’t come up because those keywords aren’t together in the table."
Interesting - but Ask has concentrated on consumer interests, and the consumer is pretty loyal to Google.
Changes to search tools are coming on fast and furious this spring.
New search engines aspire to supplement Google by John D Sutter, CNN (May 12) notes some themes or trends.
+ "Some sites, like Twine and hakia, will try to personalize searches, separating out results you would find interesting, based on your Web use.
+ Others, like Searchme, offer iTunes-like interfaces that let users shuffle through photos and images instead of the standard list of hyperlinks.
+ Kosmix bundles information by type -- from Twitter, from Facebook, from blogs, from the government -- to make it easier to consume."
+ Wolfram Alpha crunches data
+ community ranking (Wikia) is fading
+ real-time search is at Twitter and Twitter-related engines - and more will do this
+ social search has a future (even if community ranking doesn't) - Twine
+ Google's "show options" - new ways of viewing results
Infonortics Search Engine Meeting, Boston, April 27-28, 2009 - this is one of the preeminent conferences in the year on search and information technology. Presentations are available for many of the sessions. Topics include:
+ several on semantic web
+ visualization of search results
+ classifying images
+ mobile search
+ e-discovery
+ natural language based text mining
+ text analytics
+ information seeking process
Very rich - spend an hour or two.
9 Semantic Search Engines That Will Change the World of Search, by Arun Radhakrishnan, Search Engine Journal (April 13)
We all hope that semantic technology will change search so that results are closer to what we mean (and not necessarily what we said). This article describes nine contenders - briefly but well.
+ Hakia - that uses "concept relations", a list of possible queries connected to answers, and ranking based on sentence analysis.
+ Kosmix - creates a "dashboard of content" - though I prefer to say dossier on a topic.
+ Exalead image search (not the web search) - narrow selection by facet (I'm not sure how "semantic" that is).
+ Sensebot - creates a summary of the top results
+ Cognition Search - maps the English language - has some trial content areas (legal, health, wiki, bible)
+ Lexxe - prefers short questions - and then it applies its natural language analysis. The clustering helps.
+ Swoogle - searches "semantic web" documents created in RDF. Useful mainly to the specialist.
+ Factbites - returns results with understandable sentences (one of my favourite engines).
+ Powerset - studies meaning of sentences rather than word relationships. It built its pilot using Wikipedia. It is now owned by Microsoft, so we expect that some of the technology will be used to improve Live Search.
Interesting point: "The appeal of semantic search engines is that the content of a page alone decides its utility. This means lesser spam and of course more relevant ads. It would be harder to game a semantic web engine."
Federated Search Blog (by Sol Lederman) has a series of interviews with "federated search luminaries": Erik Selberg, Michael Berman, Todd Miller. Kate Noerr of MuseGlobal, a fourth, is on her own page.
From the Michael Berman interview
"Search engines work best in the discovery phase, when searching is a fast, give-and-take, contact sport. Real-time performance is important and interaction and testing are the user mode. I frankly feel deep Web search is not terribly useful or helpful in this phase. Identifying candidate searchable databases can be very important in this phase, but that can be accomplished from a search engine for databases such as CompletePlanet or the DQM rather than going to the site directly (reserving deep Web search for the purposeful harvest mode.)
Once the researcher has got a good bead on their capture requirements, harvesting and the deep Web come to the fore. But, this can be scheduled, and need not meet a real-time criterion. "
Video of My Semantic Web Talk by Nova Spivack (Feb 2008)
Where we were and where we are going, from Web 1.0 to Web 4.0 - what we do and how we search. Nova Spivack at Radar Networks speaks to a group of students about semantic web. Points to the weaknesses of Google and describes alternatives: linguistic approach (expensive), semantic web (using metadata to describe items), artificial intelligence (ontological and reasoning engine). Some "make the software smarter", and others have higher component of "making the data smarter".
Nova Spivack - Semantic Web Talk from Nicolas Cynober on Vimeo.
Google Next Victim Of Creative Destruction? (GOOG) by John Borthwick, Business Week (Feb 8)
John Borthwick, who watched AOL fall from innovator grace, offered this observation: " I now see search as fragmenting and Twitter search doing to Google what broadband did to AOL."
(Mind, as commenters to the article did point out, John is CEO of betaworks, a Twitter shareholder.)
Search has moved into two main streams: video (YouTube and more) and real-time (Twitter watching).
Video:
* "YouTube generates domestically close to 3BN searches per month — it’s a bigger search destination than Yahoo. "
* "44% of YouTube views happen in the embedded YouTube player (ie off YouTube.com) and late last year they added search into the embedded experience. YouTube is clearly a very different search experience to Google.com. "
* "Video search now represents 26% of Google’s total search volume."
Notificator (the electronic message board)
This really means getting the buzz of the moment whether it's about friends or events and developments.
"Yet at http://search.twitter.com the conversations are right there in front of you. The same holds for any topical issues — lipstick on pig? — for real time questions, real time branding analysis, tracking a new product launch — on pretty much any subject if you want to know whats happening now, search.twitter.com will come up with a superior result set."
It's the social context that is important - people you know (or know of), people you trust.
The post refers to an article by Gerry Campbell on the role of social inference in search. Search is broken – really broken. (Feb 6)
"Our daily lives are rich with social inference, and they happen in real time. Search from Google, Yahoo… you name it – they are all based on published (e.g. considered, thought-through) documents that take minutes-to-weeks to update in the search index."
Campbell wants to see "Realtime search, using social inference for discovery, ranking and prioritization."
Do Search Engines Look at Keywords in URLs? By William Slawski, SEO by the SEA (Mar 26)
Judging from this Yahoo patent application, search engines do consider words in the url.
"Keywords may also be extracted from the URLs of pages, by using an algorithm that can break the URL into components, understanding the structure of those URLs, and removing candidate keywords from the different parts found within the URL."
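As a rough illustration of the quoted idea, a URL can be split into its components and then broken into word-like tokens. This is a minimal sketch and not the method in the Yahoo patent application. The ignore list, the function name and the example URL are all assumptions.

```python
import re
from urllib.parse import urlsplit

# Hypothetical list of tokens that say nothing about the page's topic.
IGNORED = {"www", "com", "org", "net", "html", "htm", "php", "index"}


def url_keywords(url):
    """Return candidate keywords from the host, path and query of a URL."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    # Split on anything that is not a letter or digit (dots, slashes, hyphens...).
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token and not token.isdigit() and token not in IGNORED and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


keywords = url_keywords("http://www.example.com/used-cars/toyota_prius.html?year=2009")
print(keywords)
```

Hyphens and underscores both act as word separators here, which is one reason readable URLs can give a search engine something to work with.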
Automated Categorization of Search Results, a New Era? Hakia (Mar 23)
Hakia calls the categorization that we see in the Galleries - aspect categorization.
Aspect categorization is different than what some search engines are already doing. For example, dividing the SERP into Web Results, Videos, News, Images, etc., is not aspect categorization. However, when the categories are related to the query, such as Obama’s Speeches and Quotes, Obama’s Fans, etc., (for the query Obama) then it is aspect categorization.
How Searchers’ Queries Might Influence Customized Google Search Results by William Slawski, SEObytheSea (Mar 19)
Slawski presents a possible explanation for how Google personalizes results by considering earlier queries by you and similar ones by others.
You might see this message from Google:
Recent Searches You or someone else recently searched for infinity auto using this browser.
Possible ways that searches might be found to be related:
+ "if they are typed in by a searcher consecutively"
+ "if they are performed by a searcher with a certain period of time, such as within 30 minutes of one another"
Google has more on this -- Features: Search customization details
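The two relatedness signals quoted above (queries typed consecutively, or within a set period such as 30 minutes) can be sketched as a pass over one searcher's query log. This is only an illustration under assumptions: the function name `related_pairs` and the log format (time-sorted pairs of timestamp and query) are invented here and do not come from the patent or from Google.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # the 30-minute example given in the quote


def related_pairs(log, window=WINDOW):
    """Return pairs of distinct queries that are consecutive or close in time.

    log: list of (timestamp, query) tuples for one searcher, sorted by time.
    """
    pairs = set()
    for i, (t1, q1) in enumerate(log):
        for j in range(i + 1, len(log)):
            t2, q2 = log[j]
            consecutive = j == i + 1
            # The log is sorted, so once we leave the window we can stop.
            if not consecutive and t2 - t1 > window:
                break
            if q1 != q2:
                pairs.add((q1, q2))
    return pairs


log = [
    (datetime(2009, 3, 19, 10, 0), "infinity auto"),
    (datetime(2009, 3, 19, 10, 10), "infiniti dealers"),
    (datetime(2009, 3, 19, 10, 20), "car loans"),
]
for pair in sorted(related_pairs(log)):
    print(pair)
```

Counting how often each pair turns up across many searchers would be the natural next step before using the pairs to customize results.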
British search engine 'could rival Google' by Bobbie Johnson, Guardian (Mar 9)
Watch for a new natural language search engine called Wolfram Alpha to be released in May 2009. Stephen Wolfram, a British scientist, aims at succeeding with natural language, not through ontologies (semantic web), but through computations.
Wolfram described this as "explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable" in his blog entry on Wolfram / Alpha is coming.
We'll see.
Go3R - search technology from Germany that sorts results into facets. The information page must be striking a humorous pose with its opening paragraph -
"The project aims at developing a knowledge-based search engine for alternative methods to animal experiments in order to provide optimal search options for alternatives to animal experimentation. The first step consists of developing an ontology for the knowledge domain of alternative methods to animal experiments. Such an ontology represents a system of knowledge which permits logical deductions as a result of the numerous relationships between terms describing alternative methods it contains - in rough analogy to the possible connections between synapses in the brain."
Putting aside the 'animal experiments', you can experiment with PubMed searches at GoPubMed.
Marissa Mayer On Charlie Rose: The Future Of Google, Future Of Search by Michael Arrington, TechCrunch (Mar 6)
Marissa Mayer, VP at Google of Search Product and User Experience, spoke to Charlie Rose about search and technology. This posting has the transcript and video. Duration: 54 minutes.
Opening question: Is it fair to say that search is in its infancy? - YES
Understanding Intentions and Microsoft Search Personalization by Bill Slawski, SEO by the Sea (Mar 6)
Kumo, which means spider or cloud in Japanese, is the code name for a new version of Live Search. Let's hope that Microsoft is going for cloud - the spider theme is tired.
Bill Slawski has studied a patent filing that points to Microsoft's intent in understanding the searcher's intent.
"The basic premise is that when two different people are searching for the same query term, chances are that the answers that they are trying to find or the sites that they might want to see are different, and that a search engine might be able to help each of those searchers find what they are looking for based upon past experience, and past searches and search result selections."
The search engine will need to get to know you to do this - cookies, and history of search queries (words used) and results clicks (what did you look at).
The nugget: "If you tend to search using the same or a substantially similar query term or phrase, and tend to select the same page or pages in response to that search, don’t be surprised if at some point it might be highlighted or bolded or placed at the top of the search results in the future."
[Google does this today in its personalized search.]
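As a rough picture of the "nugget", the sketch below moves pages that a searcher has repeatedly selected for the same query to the top of the list. It is not Microsoft's or Google's method; the function name `personalize`, the `min_clicks` threshold, and the history format are assumptions for illustration only.

```python
from collections import Counter


def personalize(results, click_history, query, min_clicks=2):
    """Promote results this searcher has repeatedly clicked for this query.

    results: list of URLs in their original ranked order.
    click_history: list of (query, clicked_url) tuples for one searcher.
    """
    clicks = Counter(url for q, url in click_history if q == query)
    # Pages chosen at least min_clicks times go first, most-clicked first.
    favored = [u for u in results if clicks[u] >= min_clicks]
    favored.sort(key=lambda u: -clicks[u])
    # Everything else keeps its original order.
    rest = [u for u in results if clicks[u] < min_clicks]
    return favored + rest


history = [("jaguar", "c"), ("jaguar", "c"), ("jaguar", "b"), ("python", "d"), ("python", "d")]
print(personalize(["a", "b", "c", "d"], history, "jaguar"))
```

Note that clicks made for a different query ("python" here) have no effect, which is exactly the weakness the Bottom Line below points at: past behavior is only a guess at present intent.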
Bottom Line: "The question though is, whether past searches are a good indication of intent for searches in the future? Sometimes the statement “It’s cold in here,” isn’t an invitation to a hug, but rather a request to turn up the thermostat. "
Also - First screenshot of Microsoft's Kumo by Ina Fried, CNet News (Mar 2)
Has the text of an internal Microsoft memo - "Announcement: Internal Search Test Experience " - and a screenshot of Kumo that shows: topical groupings (one word), related searches, your history; and it's in the universal style with a mix of web, images, videos. It looks as if there is some organization of results, perhaps along the lines of what Kosmix does - creating a package.
Dear Monica, We Changed our Algo - Google's Matt Cutts by Andrew Goodman, Traffick (Mar 5)
Has some information on Google's recent tweaks to ranking algorithms to give "trusted" sites a small boost.
"Matt [Cutts] says that Google doesn't think brand when it thinks about quality and authority ("if we did, you'd see Mitsubishi Eclipse ranking #1 for [eclipse]"), but this is disingenous. Indirectly, when you take that VW example, they are thinking brand when they take a shortcut that calls the VW.com domain "known information" and put a higher threshold of "track record required" on pages of sites that aren't as known and trusted. "
Video with Matt Cutts and comment are in this posting at Seroundtable -
Google Confirms Algorithm "Change" But Down Plays Brand Push
Top 5 Semantic Search Engines, Pandia (Feb 16)
Semantic search is defined in this article as being able to "make sense of search results based on context" - to identify the concepts.
Makes the excellent point that "Semantic search has the power to enhance traditional web search, but it will not replace it. A large portion of queries are navigational and semantic search is not a replacement for these. Research queries, on the other hand, will benefit from semantic search."
Describes five candidates - but it is a mixed bag.
+ hakia - general web search
+ Sensebot - summarizes search results - demo on the web but better as a plugin
+ Powerset - prototype - searches Wikipedia. Best for a defined subject area rather than overall web. Incidentally, this is owned by Microsoft.
+ DeepDyve - digs more deeply into scientific databases. Author does not say why this is considered a semantic tool. It might be because it has a more-like-this option (and one presumes it does more than just match on text); and can cluster results based on concept analysis - but only DeepDyve Pro ($) users will see this.
+ Cognition - has done a "semantic mapping of the English language". Demos are available including one using Wikipedia.
Microsoft plans Google-killing search site - Experimental search offering out this summer By Nancy Gohring, Computerworld UK (Feb 25)
Microsoft is still trying to make its mark in search. It has the Live Search database - now for an interface. This article previews a new site called Viveri where developers will test out new ideas.
"The site will serve Live Search results and is being built using Silverlight, Microsoft's technology for designing online user interfaces. "
The first aim is to dig into databases (everyone wants to do deep web now).
"One technology aims to better deliver search results from vertical search engines. When a user types a search item into the field, a typical list of results pops up. But on the right hand side of the screen several boxes appear. Each box contains results from within a specific domain that is relevant to the search term. The domain could be, for instance, Amazon.com, Craigslist, Consumer Reports or WebMD, depending on relevancy. "
So - this is and will be a direction for web search developments. I wonder if we are prepared to handle all the information that will be extracted from this "deep web", and what that will do to relevance ranking algorithms that have been quite finely honed?
Riza Berkan: The Search for Quality on the Web, AltSearchEngines (Feb 5)
Hakia CEO Riza Berkan argues that semantic search technology (based on analysis of meaning) is far superior to the established statistical relevance ranking method. The example given of the old method is Google, of course, and the point is made that the old style is the basis for search engine marketing and associated revenues. Berkan does not acknowledge Google's work with semantic technologies or personalization, both of which are throwing SEM off.
Berkan posits that semantic technology will address information quality issues and describes in clear language how it works.
"The underlying idea behind semantic technology is to teach computers how the world operates. For example, when a computer encounters the word “bill,” it would know that “bill” has 15 different meanings in English. When the computer encounters the phrase “killed the bill,” it would deduce that “bill” can only be a proposed law submitted to a legislature, and that “kill” could mean only “stop.”"
The promised benefit is that results will reflect meaning and be independent of popularity (i.e., the number of links to a site).
"The answer is simple: precision. Once computers can handle natural languages with semantic precision, high-quality information will not need to become popular before it reaches the end user, unlike what is required by Web search today."
How Google crawls the deep web by Greg Linden (Jan 31)
Refers to a paper in which Google describes how it fills in web forms to query databases.
"This paper describes a system for surfacing Deep-Web content; i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index."
Interestingly, there was also this today --
Google: "We're Not Doing a Good Job with Structured Data" by Sarah Perez, Read Write Web (Feb 2)
"Google's Alon Halevy admitted that the search giant has "not been doing a good job" presenting the structured data found on the web to its users. By "structured data," Halevy was referring to the databases of the "deep web" - those internet resources that sit behind forms and site-specific search boxes, unable to be indexed through passive means."
Yahoo and Google are both working on automating the extraction of information from databases on the Web.
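The form-surfacing idea quoted from the Google paper (pre-computing submissions for an HTML form and indexing the resulting pages) can be pictured with a small sketch that enumerates submission URLs. It assumes a form submitted via GET, and the function name `surface_urls`, the form action, and the candidate values are all hypothetical; choosing good candidate values is the hard part and is not attempted here.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action, inputs):
    """Build one GET URL per combination of candidate input values.

    action: the form's action URL.
    inputs: dict mapping each input name to a list of candidate values.
    """
    names = sorted(inputs)  # fixed order so the output is reproducible
    urls = []
    for combo in product(*(inputs[n] for n in names)):
        urls.append(action + "?" + urlencode(list(zip(names, combo))))
    return urls


for url in surface_urls("http://example.com/search",
                        {"make": ["ford", "honda"], "year": ["2008"]}):
    print(url)
```

Each generated URL could then be fetched and handed to the indexer like any other page. The number of combinations grows multiplicatively with each extra input, so a crawler would have to be selective about which ones to try.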
AnswerFarm Technology from Ask.com, Ask.com Blog (Jan 14)
Ask.com has been blogging about its "semantic" technologies. This post gives us some idea of how the Q&A works.
"The technology behind the Ask Q&A channel is called AnswerFarm technology. We built it by crawling and extracting question/answer pairs from across the web – more than 100 million question/answer pairs from several hundred thousand sources – and it is, no doubt, the most comprehensive and diverse repository of question/answer pairs in the world."
From Semantic Search Technology Advances from Ask.com we learn that Ask can get into databases with its DADS technology - Direct Answers from Databases.
"With DADS, we no longer rely on text-matching simple keywords, but rather we parse users’ queries and then we form database queries which return answers from the structured data in real time. Front and center. Our aspiration is to instantly deliver the correct answer no matter how you phrased your query."
But they are "trialing" all this good stuff on sports, and specifically NASCAR.
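The DADS description (parse the user's query, form a database query, return the answer from structured data) can be illustrated with a toy example. Nothing here reflects Ask.com's actual implementation: the table, the placeholder rows, the single question pattern, and the function names `build_db` and `answer` are assumptions for the sketch.

```python
import re
import sqlite3


def build_db():
    """Create a tiny in-memory table of placeholder race results."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE wins (driver TEXT, year INTEGER, race TEXT)")
    db.executemany(
        "INSERT INTO wins VALUES (?, ?, ?)",
        [("Driver A", 2007, "Race X"), ("Driver B", 2008, "Race X")],
    )
    return db


# One hypothetical question shape: "who won [the] <race> in <year>?"
PATTERN = re.compile(r"who won (?:the )?(?P<race>.+?) in (?P<year>\d{4})\??$", re.I)


def answer(db, question):
    """Parse the question and turn it into a parameterized SQL lookup."""
    m = PATTERN.match(question.strip())
    if not m:
        return None  # query shape not recognized; fall back to keyword search
    row = db.execute(
        "SELECT driver FROM wins WHERE lower(race) = lower(?) AND year = ?",
        (m.group("race"), int(m.group("year"))),
    ).fetchone()
    return row[0] if row else None


db = build_db()
print(answer(db, "Who won the Race X in 2008?"))
```

A single regular expression obviously cannot deliver "the correct answer no matter how you phrased your query"; covering many phrasings is where the real parsing work would lie.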
Obama Is “Failure” At Google & “Miserable Failure” At Yahoo by Danny Sullivan, Search Engine Land (Jan 22)
That "miserable failure" bomb never seems to go away. This detailed explanation by Danny Sullivan shows how complicated it can get with redirecting URLs. Sullivan found that Obama was coming up in Google for the word 'failure', which seems to have stopped since, and in Yahoo for 'miserable failure', which is still true (Jan 24). Frankly, White House IT staff should just hire Sullivan to fix it.
Google To Push Semantic Search In 2009? by Matt McGee, Search Engine Land (Jan 23)
Matt McGee spotted a few words spoken by Google CEO Eric Schmidt that suggest that Google will do better at understanding "the meaning of your phrase rather than just the words that are in that phrase".
Natural (foreign) language search, Enterprise Search Center (Dec 17, 2008)
"InQuira has introduced Version 8.1 with Multilingual Dictionary (MLD) of its namesake Web self-help software, which is said to improve searching for content for multinational audiences and to facilitate the ability of users to review, translate and post content in their native language."
New eLibrary Interface Shows the Path for ProQuest by Barbara Quint, Newsbreaks (Jan 15)
Detailed description of eLibrary - market, content, and new features.
"The eLibrary service (www.proquestk12.com/productinfo/elibrary.shtml) leads ProQuest’s outreach to the K–12 and community college market. " ... "A new interface platform launched by eLibrary throughout this year will introduce special features such as Smart Content and Content Creators that support the multimedia, multisource eLibrary content and user experience."
Of interest - there will be editors working to provide "best of" answers.
"The Smart Content feature, which Beach says was called the "aggretorial" internally, has teams of editors preparing overviews of the most-queried and most-studied topics, a "smart page" that provides top document and multimedia picks, plus suggestions for other research lines. Smart Content combines eLibrary’s 65 million documents with editorially prepared content. This means what users need first, they see first. The editors bring best-of material "above the fold" to provide comprehensive, foundational understanding of the topic as well as pathways for further exploration. For most-queried and most-studied topics searched, results sets will display a "smart page" that provides biographical, historical, and other contextual information, in addition to eLibrary editors’ top document and multimedia picks."
And with Content Creators, users will be able "to create customized web applications within the eLibrary system. "
Best of both worlds - editors organizing material, and users customizing personal views or collaborating.
Did Someone Just Expose Semantic Data?, Dr. Riza C Berkan, CEO, Hakia Blog (Jan 12)
CEO Dr. Riza C Berkan of Hakia takes issue with Marshall Kirkpatrick's posting, Did Google Just Expose Semantic Data in Search Results?.
Google seems to be able to answer questions like capital city of oregon, or marlene dietrich's husband, possibly based on some analysis of language. Is Google figuring out meaning?
It could be (and likely is) a simple extraction.
These examples don't show deep understanding of a subject or any attempt to present possible meanings to the searchers.
Google Tech Talk: Reconsidering Relevance by Daniel Tunkelang, The Noisy Channel (Jan 8, 2009)
Daniel Tunkelang, Chief Scientist at Endeca, has posted slides from a presentation on Reconsidering Relevance. "We’ve become complacent about relevance", he says. Perhaps search has become a kind of "fast food". We are too easily satisfied by the results we get from web search engines and don't appreciate that there is deeper and better content. Exploring information through tags and facets is one important method by which searchers can search (and learn) more effectively.
The future of search by Digvijay Lamba, Kosmix blog (Dec 11)
Opens with what search was like in 1900, compares to today, and looks forward to 2100.
Prediction is that "... we are clearly moving in a direction where a machine will automatically create the perfect article that precisely and completely covers the searched topic."
Of course, this is exactly what Kosmix tries to do.
Yahoo technology will offer abstracts of search results John Ribeiro (IDG News Service) via PCWorld (Dec 5)
This will be a dramatic improvement for Yahoo Search - "Yahoo India is developing information extraction technology that will offer abstracts of URLs when users do a search."
"The Bangalore lab is working in the area of automated information extraction, which involves going into the URLs, going through billions of pages, and extracting the relevant information, he said."
How the Semantic Web Will Change Information Management: Three Predictions by Silver Oliver, FUMSI (Oct 2008)
Semantic Web means adding structure - making the web of unstructured data more accessible through explicit connections along the lines of Dublin Core metadata. Silver Oliver makes three predictions - the third points to a changing role for the information professional - probably in "modeling the domain of information we are dealing with".
Google and the Real Search for Meaning on the Web By Saul Hansell, New York Times (July 17)
Some insight on how Google works at finding meaning is revealed (somewhat) by posts from Amit Singhal in Google's search quality group. Essentially - Google can derive concepts from context.
Key post in the Google Blog - Technologies behind Google ranking
Academics sink teeth into Yahoo search service by Stephen Shankland, Webware (Oct 10)
"... Yahoo, is trying to give a little more power back to the professors and grad students through a program called BOSS (Build Your Own Search Service). The service lets academics and start-ups build their own search sites around Yahoo's search engine for free, manipulating results however they want. "
How a Search Engine Might Add Related Information about People, Places, and Things into Search Results, Seo by the Sea (Sept 23)
Yahoo has filed a patent that shows interest (and maybe intention) in identifying "data entities" in search results and expanding on them.
"This kind of expansion of search results, to include names of people, places, events, and things found in a search for an original search query is described in a patent filing from Yahoo. While it doesn’t presently appear in use, it’s a possible approach from the search engine."
Microsoft shifts R&D for search engine technology to Norway news. domain-b (Sept 30)
Microsoft's announcement that search engine technology will be centered in Norway could be significant.
"Software giant Microsoft Corporation has decided to move its main centre for search engine technology to Norway. This was announced by Microsoft's Steve Ballmer in Oslo after a meeting with Prime Minister Jens Stoltenberg."
Norway is the home of Fast Search & Transfer, a specialty information technology company, which Microsoft recently acquired.
Two new semantic engines: Cognition and Eeggi, by Rafe Needleman, Webware (Sept 18)
Two new semantic search engines - Cognition (not entirely new - it has three demo applications and promotes CognitionSearch for the enterprise), and Eeggi, in very early stages, with a small demo.
"Rather, they are databases and algorithms that hold the structure of language (in both cases, the English language). At the most basic level semantic engines tell you what's synonymous with what. At the advanced end of the spectrum they know how grammatically similar phrases like "take a seat," "take a stand," and "take a lollipop," mean completely different things. "
New SemantiFind Enhances Search Engine Experience, Newsbreaks (Sept 15)
"Semanti Corp., a web services provider offering "find" technology to enhance the results of search engines, announced its flagship product, SemantiFind (www.semantifind.com), at the recent DEMOfall08 conference in San Diego. SemantiFind is a web service that enhances the search engine experience by letting users indicate the precise meaning of their search queries."
Requires registration.
Search 3.0 - Web Search in Evolution by Yihong Ding, Alt Search Engines (Jul )
Distinguishes between types of links in search results according to whether they are 1.0 type resources (hard coded links to pages), 2.0 (tagged, threads, dynamic), 3.0 (???).
"Applying this criterion to evaluate the current Web search engines, we may surprisingly find that almost all of them belong to just Search 1.0, no matter how they have labeled themselves. Some search engines may have great records on its performance (such as Google), some search engines claim to be advanced in integrating new technologies (such as Hakia), or some search engines declare to bring revolutionary new experiences to the world (such as Powerset). "
If Google had Semantic Technology… by Dr. Riza C Berkan, Hakia blog (July 18th, 2008)
Semantic technology for search such as used at Hakia is completely different from the link-based technology used by Google (and others).
In this posting Dr. Berkan of Hakia compares search results of Hakia to Google. "If Google had semantic technology", then why doesn't it do better on the query examples?
Main point: "We have no idea how Google’s algorithm works, and it does a great job in so many ways. But, one thing is clear. The results show no sign of systematic performance to understand the meaning of concepts. They don’t show ranking based on quality. They don’t show aspect categorization beyond statistical clustering. They don’t show question type detection."
SEO For Semantic Search Engines by Pierre Far, Search Engine Land (Jul 14)
Compares 3 semantic search engines to Google: Powerset, Hakia, and Cognition. As the writer notes, these engines are still in beta and very young. Still they shine in their own way and can "beat" Google on the search examples.
Conclusion: "One way or another, semantic search engines will be part of the future of search engines in terms of natural language queries and indexing. This is new to our industry and we have to sit up and pay attention. Failure to do so may mean that you will miss the next big thing."
Eric Enge Interviews Yahoo's Priyank Garg by Eric Enge, Stone Temple Consulting (July 7)
Some aspects of how Yahoo ranks results are demystified in this interview.
+ anchor text matters -- "What we look for are links that would be naturally useful to users in context, and that add to their experience browsing on the Web."
+ importance of links in ranking is declining -- "New sources of data and new features that Yahoo! has built and developed have made our ranking algorithm better. Consequently, as a percentage contribution to our ranking algorithm, links have been going down over time."
+ Yahoo evaluates the quality of the site and looks for signs of "spamminess".
+ social media sites are being figured in - such as del.icio.us
+ Yahoo uses human editors to watch for new spam techniques. Claims -- "We show the least spam among the search engines, because both of our techniques are in action. Our spam detection techniques run on every page, every time we crawl it. Those detection algorithms are fed directly into our ranking function, where the spam detection is actually pretty high in importance." -- I'm not sure Yahoo has the least spam.
How Google Universal Search and Blended Results May Work Seo By The Sea (
A patent sheds some light on how Google can return "a mix of results from different types of searches, including Web pages, news stories, images, videos, book listings, and others"
Indexing Dynamic Databased Content by Stephen Arnold, Beyond Search (April 20, 2008)
Considers the question of "deep web" content in response to Google's announcement that it will begin to index forms to extract and index information from dynamic databases.
Has some startling figures on the growth in use of dynamic databases for web site design -- "Today, more than half of the Web sites created each month are dynamic, and the ratio of static to dynamic sites is changing. I can’t reproduce the data I obtained from one of my clients, but I can highlight two facts. First, the number of dynamic sites is growing more rapidly than this time last year. At soime point in 2009, static sites will be in minority, essentially becoming brochureware that no one will pay much attention to. Second, the people operating dynamic sites want to protect their data from aggregators. Once structured data have been sucked out of a dynamic site, the value of the information decreases sharply."
Mentions BrightPlanet and Deep Web Technologies as two companies that can dig past subscriber logins and through query forms.
May 2008 InfoTip: Powerset.com Mary Ellen Bates (May 2008)
Sees potential in using Powerset, the new semantic search tool that helps one make sense of search results by identifying the facts and creating summaries.
"PowerSet is best used for those searches that cover a number of topics or areas. It's not perfect, and it only searches Wikipedia, but I find it an exciting new approach in the efforts of search engines to make sense out of web content."
Matt Larkin has some comments on Powerset too - Smarter isn't better...yet Traffick.com (June 6)
"While it’s silly not to consider a search engine that “understands” us an exciting prospect, the effectiveness of existing methods makes me wonder if we “need” semantic search yet. Powerset claims it works best for research, for those not searching for specific items but instead seeking general information on a topic. Well, which types of users are most likely to use search for research and inductive gathering of information? Anyone in the educational field. The last time I checked, they have large internal databases through which they can gather boatloads of literature on their topics of study, be it government documents, journal articles or online writings. In other words, they’re doing just fine. Is there a demand yet for a smarter search?"
Search Illustrated: Spider Traps Search Engine Land (Jun 3)
New infographic shows why a search engine may be unable to index a site. Session-based coding and SEO-unfriendly CMS systems are two reasons that pages stay invisible.
Semantic Search: The Myth and Reality by Alex Iskold, Read Write Web (May 29)
On a grid of Structured Data vs Query Complexity, where do Google, SearchMonkey, Freebase, and the two semantic engines, Hakia and Powerset, sit? Alex Iskold has done the analysis - fascinating reading that helps to place these tools.
Makes the point that we are misled by the user interface - the simple search box used nearly universally hides too much.
"Semantic search is an upcoming technology that has set the expectations way too high. We have all been misled into thinking that these technologies are here to dethrone Google by delivering better search results. Neither of those things are true. What is true, however is that semantic search is going to be big and it is going to help us answer questions that we simply cannot answer today - complex, inferencing queries asked over the entire web as if it was a database."
Evri building a data graph of the Web by Dan Farber, Webware (May 28)
Evri may guide us through a new search experience for navigating through information resources graphically.
CEO Neil Roseman said, "We read sentences, extracting the subject, objects and verbs, and map to other content on the Web."
"Evri creates profile pages, which are like search results, that include a variety of lenses for an entity, such as top connections (entities most closely associated with the target entity), people, location, products, organizations, and events."
Evri may become available in beta in early summer.
Meta-Tag Optimization Tips: A Search Usability Perspective by Shari Thurow, Search Engine Land (May 29)
Meta-tag content can matter for ranking results of non-web pages - and specifically video. And the meta descriptions are sometimes shown in the search results.
"Some commercial web search engines use meta-tag content to determine page relevancy. Some do not. Most of the time on a text-based document, meta-tag descriptions and keywords are not used to determine whether or not a page ranks." ... "Meta-tag keywords and descriptions become more important when the search engines are not able to determine (or have a difficult time determining) the "aboutness" of a file, such as a video file."
Article distinguishes between navigational searches (get to the site or a page - the URL is important) and informational (get the answer immediately in the snippet).
Making the Web Searchable: The Story of SearchMonkey by Alex Iskold, Read Write Web (May 29)
Notes from talk by Peter Mika on Yahoo's SearchMonkey search platform initiative.
"The motivating question for Mika's presentation was: How can we make web search better by leveraging web annotation? There are many kinds of annotations, but Mika focused on simple data and lightweight semantics, and began by reviewing the history and evolution of annotations to explain how we got to where we are today."
Google Offers Peek At How It Controls Search Quality by Eric Zeman, Information Week (May 21)
Udi Manber, VP of engineering at Google, Search Quality, has begun a series of postings on the Google Search Blog about the Search Quality team and its work (full posting).
This InformationWeek article has a few excerpts of the high points. It's not simple and Google works hard at it. We know that from articles that appear from time to time, and from the results.
The most famous part of our ranking algorithm is PageRank, an algorithm developed by Larry Page and Sergey Brin, who founded Google. PageRank is still in use today, but it is now a part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it's not just the language, it's how people use it today), time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time), and personalized models (not all people want the same thing).
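For readers who want a feel for the link-analysis piece named in the excerpt, here is a minimal sketch of the textbook PageRank power iteration. It illustrates the published idea only and is not Google's production system; the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute simplified PageRank scores by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to about 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


scores = pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph page "a" comes out on top because it receives links from two pages, while "d", which nobody links to, is left with only the random-jump share. As the excerpt stresses, a score like this would be just one input among the language, query, time, and personalization models.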
How Search Engines May Substitute Other Search Terms for Yours, SEO by the Sea (May )
Search engines sometimes expand on your search terms by using some "related words". We see that at Google, Yahoo, Ask and Live. Bill Slawski explains some of the process and refers to a new patent filing by Yahoo on generating and using substitute words.
+ Use query logs to see reformulations of queries with those words.
+ Use a dictionary.
+ Use statistics on "other phrases that tend to show up in documents with the original query".
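The first technique in the list above, mining query logs for reformulations, can be sketched as counting single-word swaps between consecutive queries in a session. This is a simplified illustration rather than the method in Yahoo's filing; the function names and the sample sessions are made up for the example.

```python
from collections import Counter, defaultdict


def substitution_counts(sessions):
    """Count how often one word replaces another between consecutive queries.

    sessions: list of sessions, each a list of queries in the order typed.
    """
    counts = defaultdict(Counter)
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            a, b = before.split(), after.split()
            if len(a) != len(b):
                continue  # keep it simple: only same-length rewrites
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:  # exactly one word was swapped
                old, new = diffs[0]
                counts[old][new] += 1
    return counts


def best_substitute(counts, term):
    """Return the most frequent replacement seen for term, or None."""
    if term not in counts:
        return None
    return counts[term].most_common(1)[0][0]


sessions = [
    ["cheap flights", "inexpensive flights"],
    ["cheap hotels", "inexpensive hotels"],
    ["cheap cars", "used cars"],
]
counts = substitution_counts(sessions)
print(best_substitute(counts, "cheap"))
```

The dictionary and co-occurrence approaches in the list could feed the same `counts` structure, so that several weak signals are combined before a substitute word is trusted.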
Yahoo beckons coders to gussy up search results By Stephen Shankland, WebWare (May 15)
We'll start to see new search aids and applications thanks to SearchMonkey. This is an "application foundation" - developers will build on Yahoo search results.
"The company will offer developer tools to let programmers start using SearchMonkey, technology to make search results more elaborate and, the company hopes, more useful. SearchMonkey lets programmers write applications that can turn dry textual listings in search results into a much more elaborate display, and Yahoo hopes its search business will benefit."
Of interest: "SearchMonkey also is interesting because it fits into the broader sweep of Internet history. Tim Berners-Lee, who initially developed the protocols behind the World Wide Web, has for more than a decade been advocating a move toward a more advanced sequel called the Semantic Web. SearchMonkey specifically takes advantage of Web site features designed to fulfill some of the promise of the Semantic Web."
Article explains microformats.
Related: Eric Enge interviews Yahoo's Andrew Tomkins Stone Temple Consulting (Apr 28)
"This interview expands upon the keynote presentation that Andrew gave at SES NY on the future of search. The presentation covered some very interesting ideas on how to improve the presentation of results within a search results page. The discussion relates to the initiative Yahoo referred to as "SearchMonkey"."
A Personalized Search Using Advanced Search Operators SEO By the Sea (May 13)
"A personalized search method described in a Yahoo patent application published last week collects information about a searcher’s interests from their search history, their browsing history, and their interests listed in profiles from places like MySpace and other social networks."
Yahoo's interest in this worried the readers of this blog entry.
Search Illustrated: How A Search Engine Determines Duplicate Content, Eastern by Elliance, Search Engine Land (May 13)
Love this series. "This week's infographic shows how search engines make distinctions between original and duplicate content"
Powerset brings the Semantic Web to Wikipedia By Dan Farber, Webware (May 11)
Powerset is now in public beta showing off its semantic search capabilities on Wikipedia.
"Amid speculation that Microsoft is looking to make an acquisition, Powerset launched a public beta of its Wikipedia search engine. It brings a new, rich semantic dimension via natural language query processing to Wikipedia that greatly improves the search and reading experience."
Longer term plan is to index and analyze 20 billion documents. Might it do that with Microsoft's help?
Several entries from Search Engine Land on search technology that were submitted at the WWW2008 conference --
WWW2008: Search Research Paper Roundup - research papers from WWW2008, the 17th International World Wide Web Conference that concern image search at Google, local aspects of web search, using search history to identify relevant information sources, using query patterns, wisdom of crowds, tag-based social interest discovery, and several more.
Also Microsoft Paper: Improving Search Results By Mining Web Surfing Activity by Danny Sullivan - Summarizes a new research paper from Microsoft about how surfing behavior -- as logged by a search toolbar -- can be used to improve search results.
Yahoo Paper: Finding The Local "Center" Of Search Queries by Danny Sullivan -- "A new research paper from Yahoo and Cornell University -- with search legend Jon Kleinberg as one of the coauthors -- provides a fascinating look at how a search query such as "red sox" or "hurricane deal" can be centered around a physical location -- including one that changes over time."
Yahoo! Launches SearchMonkey Developer Tool in Limited Preview by Vanessa Fox, Search Engine Land (Apr 24)
SearchMonkey is a tool for developers and site owners to use to enhance the listings of their pages in Yahoo. Article describes two types and provides instructions on how to set these up.
- enhanced with extra links and navigation
- infobar with additional information
Will this improve quality of search results?
"How does this impact the future of the web and search? On first glance, it appears to be a strong move to advance the semantic web and the beginning of a whole new way to view search results beyond ten blue links. But at least for the short term, Yahoo! is taking things more slowly than that. Their plan seems to be to use the presentation applications that others create as a test of search quality compared to the current results. Will searchers choose to opt in to these enhanced listings? (With Google's Subscribed Links, that answer seems to be no.) Will the additional meta data from the semantic web be useful or spammy? This may be a test to find that out."
Powerset: Don’t call us a search engine by Chris Morrison, Venture Beat (Apr 10)
Powerset could make 2008 a significant year in semantic search.
"It appears 2008 might well be shaping up to be the year that semantic technology kicks off: Semantic search engine Hakia has begun licensing its technology, the intelligent organizer Twine is readying for launch, and now natural language search engine Powerset is also considering a near-term launch, as TechCrunch recently noted."
It won't be a Google killer (no one is asking for that), but it could fill a gap and need.
"The answer is all in possibilities. Google is still the best way to hunt through vast numbers of silos (web pages) containing information when you’re looking for a specific fact. No new technology will seriously challenge that ability for a year or two, at least. But a technology like Powerset could short-circuit Google’s process by just giving you the damn fact, already instead of listing relevant websites."
The Secret Google Quality Raters’ Handbook Pandia (Apr 2)
Information about adjustments that Google makes in ranking results - what they deem as important on a page - was leaked in March. Pandia has a summary of the guidelines that are given to human editors in assessing quality of selected sites.
There are degrees of quality: vital (official site), useful (good and authoritative content), and relevant (but not as good as useful). Everything else is rated not relevant or flagged for problems in content.
Google watches for "thin affiliates" - essentially spam -- "A thin affiliate is a site that gives you no original content and that only provides copied descriptions of products with affiliate links." If this is so, why does so much of this turn up? The article has a list of other types of spam and PPC that Google tries to winnow out.
Sol Lederman is the master blogger on Federated Search. He explains --
"This blog exists to serve the federated search industry. This includes vendors, customers, and potential customers. While Deep Web Technologies (DWT) is sponsoring this blog, i.e. they are paying me to manage the blog and produce content for it regularly, don’t dismiss this blog as a marketing piece for Deep Web. My intent is to produce quality original content that educates all of us about the offerings of all federated search providers, addresses the issues and concerns of vendors and their customers, and keeps us all abreast of happenings in the industry."
This is one of the best looking / designed weblogs I've seen in a long time. It covers several interesting aspects of search: collaborative search, incremental results, deep web, verticals, federated search in libraries.
There is a sense of humour too such as in this April 1 2008 posting - Google to stop crawling the web: will federate it instead.
How Google Sets Works SEO by the Sea (Mar 30)
Google Sets was developed in Google Labs a few years ago. It allows you to “automatically create sets of items from a few examples.”
A newly published patent reveals how it works.
The simple explanation of how the program works is that Google attempts to identify lists on the web as it crawls pages. It may look for these lists by considering:
* HTML tags for unordered lists, ordered lists, definition list, headings
* Items placed in a table,
* Items separated by commas or semicolons,
* Items separated by tabs.
* Other ways.
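As a rough illustration of the list-spotting heuristics above, here is a minimal Python sketch. It is not Google's method: the class `ListSpotter` and the helper `split_delimited` are hypothetical names, and the sketch only handles `<ul>`/`<ol>` items and text separated by commas, semicolons or tabs.

```python
from html.parser import HTMLParser
import re


class ListSpotter(HTMLParser):
    """Collects the text of <li> items, grouped by their enclosing <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.lists = []       # finished lists, each a list of item strings
        self._stack = []      # lists that are currently open (handles nesting)
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._stack[-1].append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._stack:
            items = [i.strip() for i in self._stack.pop() if i.strip()]
            if items:
                self.lists.append(items)
            self._in_item = False
        elif tag == "li":
            self._in_item = False

    def handle_data(self, data):
        # Only text inside an open <li> is kept.
        if self._in_item and self._stack and self._stack[-1]:
            self._stack[-1][-1] += data


def split_delimited(text):
    """Split on commas, semicolons or tabs; ignore runs of fewer than 3 items."""
    items = [p.strip() for p in re.split(r"[,;\t]", text) if p.strip()]
    return items if len(items) >= 3 else []


spotter = ListSpotter()
spotter.feed("<ul><li>apple</li><li>pear</li><li>plum</li></ul>")
print(spotter.lists)
print(split_delimited("red; green; blue"))
```

The three-item minimum in `split_delimited` is an arbitrary assumption to avoid treating every comma in a sentence as a list.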
One person commenting on this post also recommended Google Adwords Keyword tool for finding related terms.
I tried Google Sets on three Canadian women writers -- In the set of 15 there were 12 Canadian authors - not bad.
How Yahoo Could Avoid Microsoft - Part 1 Andrew Goodman, Traffick.com (March 28)
Written for search engine marketers but has some interesting musings on the breakdown of Google's PageRank system and the possibilities of Yahoo's use of microformats (or open formats) for selective tagging. This becomes more important in the face of a deluge of user generated content (UGC).
"The reality of the massive growth in web content (most of it user-gen) is - something must change so that search engines work better with formatted, quality content, rather than their own proprietary, generic, semi-intuiting way of trying to sort out what's what. Google long ago broke with the majority of "troglodyte metadata" conventions, but nothing really solid has risen to take its place (Google Base is a failure). I see the new adoption of contemporary open formats by Yahoo as a big step in an evolution towards a more usable web, much more so than, say, the SiteMaps protocol."
Goodman promises more in Part 2 -- "I'll explain how startups like Mahalo are on the right track, but ultimately, utterly wrong. I'll talk about how Yahoo has it right, if they move forward in a certain direction. And I'll discuss their target audience and the potential that yes, they could still come back to be a credible alternative to Google in many markets. ..."
Yahoo and the Future of Search by Eric Enge, Search Engine Watch (Mar 26)
Yahoo is introducing ways by which webmasters can more fully describe websites that will be less subject to the abuse that rendered metatags nearly unusable.
"Some of these include:
* Microformats
* RDFa and eRDF markup
* OpenSearch
* Atom/RSS Feeds
Yahoo says this information won't be used to affect ranking results. Yahoo wants to use the information to provide a better search listing in their results. "
Methods that have been developed for vetting local search submissions could be applied to web sites in general.
"Trust-based systems play a critical role in that. Keyword meta tags may be dead as a ranking signal, but there's no reason why a search engine can't implement something new and more robust (such as an extension of the Microformats protocol) to allow the Webmaster to provide lots of information about their site."
Search Engine Ranking Factors V2 SEOmoz.org (2007)
Excellent reference for understanding the key factors in search results ranking. Lists top 10 positive factors, most controversial factors, and top 5 negative factors.
"This document represents the collective wisdom of 37 leaders in the world of organic search engine optimization. Together, they have voted on the various factors that are estimated to comprise Google's ranking algorithm (the method by which the search engine orders results). The result is a resource of incredible value - although not every one of the estimated 200+ ranking elements are included, it is my opinion that 90-95% of the knowledge required about Google's algorithm is contained below."
A Chat with Hakia’s CEO Dr. Riza C. Berkan Natalya Murakhver, AltSearchEngines (Feb 29)
In this interview, Dr. Riza C. Berkan, Co-Founder & Chief Executive Officer of Hakia described semantic search:
"Semantic search introduces “understanding” where the algorithm analyzes both the Web page and the query to match and rank meaning. To give an example, if you are looking to find out the answer to the question, “What drug treats headache?,” you have to enter various combinations of these words to be able to search all relevant text, such as “drug, treat, headache” ; “drug treat migraine”; “drug help headache”; “Tylenol treat headache”: etc. You get the drift. When responding to the same question, semantic search can deliver a search result that states “aspirin helps migraine” where no words match but the concepts do."
Also mentioned Hakia's Galleries - "For short, discovery type queries, hakia brings categorized results (galleries) to offer a wide range of aspects of the search term."
More on Hakia, a meaning-based search engine at Pandia (Aug 2007)
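Berkan's headache example can be sketched in a few lines of Python. This is only a toy under stated assumptions, not Hakia's algorithm: the `CONCEPTS` table is a hypothetical, hand-made mapping from words to concept labels, and it treats "headache" and "migraine" as the same concept for simplicity.

```python
# Hypothetical toy concept table: each surface word maps to a concept label.
CONCEPTS = {
    "drug": "medication", "aspirin": "medication", "tylenol": "medication",
    "treat": "relieve", "treats": "relieve", "help": "relieve", "helps": "relieve",
    "headache": "head_pain", "migraine": "head_pain",
}


def tokens(text):
    """Lowercase the text and split it into words, dropping question marks."""
    return text.lower().replace("?", " ").split()


def keyword_overlap(query, doc):
    """Number of literal words shared by query and document."""
    return len(set(tokens(query)) & set(tokens(doc)))


def concept_overlap(query, doc):
    """Number of concept labels shared by query and document."""
    q = {CONCEPTS[w] for w in tokens(query) if w in CONCEPTS}
    d = {CONCEPTS[w] for w in tokens(doc) if w in CONCEPTS}
    return len(q & d)


query = "What drug treats headache?"
doc = "aspirin helps migraine"
print(keyword_overlap(query, doc))  # 0: no words in common
print(concept_overlap(query, doc))  # 3: medication, relieve, head_pain
```

A real system would have to build the concept mapping at scale rather than by hand, which is where the difficulty lies.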
Yahoo Set to Open Search Engine to Third Parties Heather Havenstein, Computerworld via PC World (Feb 26)
"New open-source application programming interfaces will allow Web site owners to add information directly to the Yahoo Search results Web page."
Example given is of a restaurant adding information about itself that goes beyond the address and information snippet.
"Code-named "Search Monkey," the new open-source application programming interfaces (API) will allow Web site owners to add information such as ratings and reviews, images, deep links and other data directly to the Yahoo Search results Web page."
Google does something similar with Subscribed Links -- "allows users to create custom search results that users can add to their own Google search pages. Matt Cutts, a Google software engineer and head of Google's Webspam team, noted that Subscribed Links, which Google debuted in 2006, allows users to "display links to your services, answer questions, and calculate useful quantities and more."
Both seem to be mainly about local search for commercial businesses and services, but you could see schools, libraries, governments and social services making use of this.
Also see Yahoo Announces Open Search Platform TechCrunch
Has an example of " a screenshot of a different search, for “hillary clinton.” The New York Times has altered the result to include links to other election news, debate analysis, and added data for current delegate count and total money raised:"
Sir Tim Berners-Lee: Semantic Web is open for business, Paul Miller, ZDNet UK (Feb 26)
Update on the progress of the use of semantic web concepts in data integration. Links to a 20 minute podcast with Tim Berners-Lee.
"We spent some time (almost 15 minutes, from about 20 minutes in, for those listening along) talking about the ways in which data holders will gain benefits from their data being visible to a new generation of Semantic Web applications -"
Interesting comment by Berners-Lee about Web 2.0 and social networking sites -- "Now if you look at the social networking sites which, if you like, are traditional Web 2.0 social networking sites, they hoard this data. The business model appears to be, ‘We get the users to give us data and we reuse it to our benefit. We get the extra value."
Web 3.0 will change that -- “Web 2.0 is a stovepipe system. It’s a set of stovepipes where each site has got its data and it’s not sharing it. What people are sometimes calling a Web 3.0 vision where you’ve got lots of different data out there on the Web and you’ve got lots of different applications, but they’re independent. A given application can use different data. An application can run on a desktop or in my browser, it’s my agent. It can access all the data, which I can use and everything’s much more seamless and much more powerful because you get this integration. The same application has access to data from all over the place.”
Rethinking Recommendation Engines by Alex Iskold, ReadWriteWeb (Feb 26)
Recommendation systems are often put forward as a form of personalizing search results or as an example of social search (recommendations from community). Alex Iskold identifies three types of systems (personal, social and fundamental) and the main difficulties with each.
Key -- "Building a recommendation engine is a complex endeavor, which we discussed here a year ago. But in addition to being a technical challenge, there are also fundamental psychological questions: do people want recommendations and if so, then when are they open to them? Perhaps an even bigger question is: what happens when the user receives one or more bad recommendations? How tolerant will they be?"
How Search Really Works - an ongoing series by Ruud Hein at Search Engine People.
Short, pithy and illustrated descriptions of the under-the-hood operation of search engines. The Keyword Density Myth is especially informative.
11 Things To Know About Semantic Web Bernard Lunn, ReadWriteWeb (Feb 16)
Lunn gives a shape to the semantic web (aka Web 3.0) in this article.
His definition of Web 3.0 is “the combination of Web 2.0 mass collaboration with structured databases”. Instead of data-modelled relational databases, structure will be obtained "on the fly".
"Structure on the fly is done by people adding structure as they use the service and by engines that automatically create structure from unstructured content."
Vertical search will be the first place where semantic web will show: he called it the "pragmatist’s Semantic Web".
"Vertical Search businesses use whatever techniques they need - basic search engines, scrapers, APIs, human editors - to create some meaningful/useful structure in a single domain. Over time these cobbled together pragmatic solutions will be replaced by a semantic web platform, probably by an API that enables human editors to leverage their valuable domain expertise".
Stephen Abram sees this as "made for librarians' skills" - Semantic Web - Web 3.0, Stephen's Lighthouse
A three-year study on the freshness of Web search engine databases Lewandowski, Dirk, E-Prints in Library and Information Science (Jan 19, 2008)
This paper looked at the index freshness of Google, Yahoo and MSN/Live, and found practices uneven and certainly not 100% fresh.
Study methods: "We conducted a test of the updates of 40 daily updated pages and 30 irregularly updated pages, respectively. We used data from a time span of six weeks in the years 2005, 2006, and 2007. "
Findings: "We found that the best search engine in terms of up-to-dateness changes over the years and that none of the engines has an ideal solution for index freshness. Frequency distributions for the pages’ ages are skewed, which means that search engines do differentiate between often- and seldom-updated pages. This is confirmed by the difference between the average ages of daily updated pages and our control group of pages. Indexing patterns are often irregular, and there seems to be no clear policy regarding when to revisit Web pages. A major problem identified in our research is the delay in making crawled pages available for searching, which differs from one engine to another."
Of interest:
Surmised from the findings that MSN updates its entire index within a certain time span, whereas Google updates those it considers important frequently and leaves the rest for later.
In 2007 Google had a mean of 14.8 days and median of 6, and MSN had 9.3 and 9.
Re indexing of the German Wikipedia, MSN showed a regular update pattern. Google suffered from a lag of 2 days between time crawled and time it showed in search results. Yahoo was erratic.
Concluded: "all search engines investigated have large shortcomings in updating their databases. None of the engines offers the ideal solution for the user (ie a comprehensive database of the Web that is updated according to the actual updates of the pages themselves). We found that none of the engines provides up-to-date copies even for the daily updated pages".
From Keyword Search to Exploration: How Result Visualization Aids Discovery on the Web Kules, B., Wilson, M., Schraefel, M., Shneiderman, B., Human-Computer Interaction Lab, University of Maryland (February 2008)
Discusses ways to meet the needs of the searcher who is engaged in exploratory search, situations "in which users need to learn, discover, and understand novel or complex topics". Looks at the information retrieval models, use of classification and of data visualization - includes examples from tools in use on the Web.
"This monograph offers fresh ways to think about search-related cognitive processes and describes innovative design approaches to browsers and related tools. For instance, while key word search presents users with results for specific information (e.g., what is the capitol of Peru), other methods may let users see and explore the contexts of their requests for information (related or previous work, conflicting information), or the properties that associate groups of information assets (group legal decisions by lead attorney)."
PageRank Is The Primary Google Search Ranking Factor Andy Beard, Niche Marketing (Feb
Long article with many comments that examines the importance of PageRank (or inbound links) in getting pages to show in search results.
Several revealing points.
+ "the toolbar PageRank has very little to do with rankings, and is manually manipulated based on Google's commercial goals." - so don't bother with the PR you see in the Google Toolbar.
+ There are many aspects to ranking through links: within a site, to a site, topical relevance of inbound links, popularity of the linking site, and others.
+ The point - "no page rank, no google juice - no index".
+ All content is not indexed, content that is indexed will be dropped as it gets older, links from old content won't keep it in the index. "To be in Google's index, pages really have to have a certain undefined amount of juice, no matter what other factors you gain merit for."
+ Pages that are deeper in the structure tend not to get indexed or are dropped from the index. A flat structure seems to be better.
"Drew if you divert juice into various archive pages, and keep them flat as if they are sitemaps, and also have an HTML sitemap, you can keep far more pages in the primary index."
+ To be found the page must be indexed, and to be indexed it must be linked to.
"Sure it is a little bit of a chicken and egg thing, but if you have a 750 word article and you want it to appear in the Google search results, the primary factor is that it receives enough juice to get in the index in the first place.
Other factors govern what it will rank for and how high in the results. PageRank can also affect that, at least possibly, but it is the only thing that is 100% required to appear in results, providing you have a document that can be addressed using some kind of URI."
+ Beard questions whether Google has really ended its practice of keeping a supplemental index. Google introduced a supplemental index, separate from the main one, to store more "unusual" documents, and would show results from it only when there were few from the main index (the supplemental index also held duplicate content). Supposedly Google has merged the two, or at least it no longer labels results as supplemental, but Beard's tests raise a number of questions.
+ John Honeck says that PageRank also determines how Google treats the pages -- "Crawling speed, indexing speed, updating frequency, and even the statistics that they display for a site in webmaster tools is determined by PageRank and not relevance or quality."
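To make the "juice" idea in the points above concrete, here is a minimal sketch of the textbook PageRank iteration in Python. It follows the published formula with the conventional damping factor of 0.85. It is not Google's production system, and the page names in the example are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank


# Hypothetical four-page site: "orphan" links out but nothing links to it.
site = {
    "home": ["archive"],
    "archive": ["post"],
    "post": ["home"],
    "orphan": ["home"],
}
for page, score in sorted(pagerank(site).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph the page with no inbound links never rises above the baseline, which is the situation Beard describes as having no juice.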
Lots to digest.
Advancing Advanced Search by Stephen Turbek, Boxes and Arrows (Jan 16)
Advanced search at search engines and sites has never worked well for a variety of reasons. But presenting filtering aids helps searchers refine a search for greater specificity. Presents some examples from product shopping searches. Excellent comments on the nature of search and the search interface follow.
Google Looks to Tech That Recognizes Text in Images Heather Havenstein, Computerworld via PCWorld (Jan 4)
Google has filed a patent for extracting text from images. This would help in indexing images of book pages, documents, product packaging, and much else.
"The search giant in June filed a patent application for technology that can recognize text in images. It could be used to retrieve text from video or from photographs that may show up as part of a street scene."
The Two Flavors of Google by Stephen Baker, Business Week (Dec 13)
Describes the two leading platforms for cloud computing: Hadoop, which is open source, and MapReduce, which is proprietary to Google.
"Why are search engines so fast? They farm out the job to multiple processors. Each task is a team effort, some of them involving hundreds, or even thousands, of computers working in concert. As more businesses and researchers shift complex data operations to clusters of computers known as clouds, the software that orchestrates that teamwork becomes increasingly vital. The state of the art is Google's in-house computing platform, known as MapReduce. But Google (GOOG) is keeping that gem in-house. An open-source version of MapReduce known as Hadoop is shaping up to become the industry standard."
A battle could be shaping up between the two leading software platforms for cloud computing, one proprietary and the other open-source.
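The programming model that gives MapReduce its name can be sketched in a single Python process. This is only an illustration of the map, shuffle and reduce steps using the usual word-count example. It uses neither Hadoop's nor Google's API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document (or chunk) would be mapped on a
    # different machine; here the "machines" are just loop iterations.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["the cloud", "the open cloud"]))
```

Because each map call looks at one document and each reduce call looks at one key, the work can be farmed out to many processors, which is the point the article makes about speed.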
Google uses human evaluators to improve search results Pandia (Dec 20)
Pandia recaps the main points from an interview with Peter Norvig, director of research at Google published in the Technology Review. It clarifies several points:
+ Google is working to understand "concepts" but doesn't plan on "natural language" queries
+ Google has started to apply personalized search to News.
+ Google does employ humans to look at queries and results and make adjustments to the algorithms and possibly raise the ranking of a site that has the answer or to block spam.
Q&A: Peter Norvig - The evolution of Web search. by Kate Green, Technology Review (Jan/Feb 2008)
How does Google Pick Snippets for Your Pages to Show in Search Results? SEO by the Sea (Dec 19)
Google's search results consist of a title, a URL, and a description. The description is "A summary of the page in the form of a snippet or snippets, taken from either a meta description tag, or a description of the page from a directory like the DMOZ, or actual text from the page itself."
"A patent granted to Google today describes some of the process behind the choosing of text from a page to summarize the content of that page in relation to the keywords that it was found for in a search."
Some points: snippets need to be small, they may be pre-generated, keywords and surrounding text will be important, there is some weighting of possible snippets based on number of words or textual meaning.
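One simple way to picture "keywords and surrounding text" is a sliding window over the page text. The Python sketch below is an assumption-laden illustration, not the method in Google's patent: `best_snippet` is a hypothetical function that scores each fixed-width window by how many distinct query terms it contains and returns the first best-scoring one.

```python
def best_snippet(text, query, width=8):
    """Return the `width`-word window containing the most distinct query terms."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - width + 1)):
        window = words[start:start + width]
        # Strip trailing punctuation so "terms." still matches "terms".
        found = {w.lower().strip(".,;:!?") for w in window} & terms
        if len(found) > best_score:
            best_start, best_score = start, len(found)
    return " ".join(words[best_start:best_start + width])


page = ("Search engines crawl the web. A snippet summarizes the page "
        "for the query terms. Other text follows here.")
print(best_snippet(page, "snippet query"))
```

A fuller version might weight windows by sentence boundaries or pre-generate candidate snippets, as the summary points suggest.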
Search 2010: Thoughts on the Future of Search by Leading Experts Enquiro Research (Dec 11, 2007)
"On December 11, 2007 leading experts on search met to discuss the future. In fact they met to share their thoughts on the future of Search in the year 2010." Webex Webinar - 65 minutes
Gord Hotchkiss was the host for a panel discussion with several prominent search people from the main engines.
Participants:
Marissa Mayer - VP, Search Products and User Experience, Google
Larry Cornett - VP, Search Experience, Yahoo
Justin Osmer - Senior Product Manager, Live Search, Microsoft
Daniel Read – Senior VP of Site Product Management and User Experience, ASK
Jakob Nielsen - User Advocate and Principal of Nielsen Norman Group
Chris Sherman - Executive Editor, Search Engine Land
Greg Sterling - Founding Principal, Sterling Market Intelligence.
Projections:
+ more dramatic universal search / blended search. (Google and Microsoft). Ask says it will be about the "interface".
+ Yahoo - search engines will be able to understand "user intent".
+ Greg Sterling - personalization and more structure.
+ Jakob Nielsen - mobile phone
+ Compares a "discovery" search for the singer-songwriter Feist at each engine: Ask, Google, Live, Yahoo. All try to identify intent and provide "contextual relevance". Idea of disambiguation is being adopted by all engines. Even Google sometimes shows words at the bottom of the page.
+ Marissa Mayer (Google) commented on improvements in creating snippets at Google.
+ Another test question - when is iphone coming to Canada? Ask was weakest for current material, other three were stronger. But all poor in the summaries.
+ Yahoo, through Search Assist, tries to have a conversation with the searcher.
+ Mayer doesn't think there will be a great breakthrough in figuring out user's intent in the next year.
+ Promise of personalization - "how much traction can we get in disambiguation through personalization?" Related Search at Ask (Daniel Read) begins to help user refine the search - combine this with patterns in use of verticals. May see payoff in three years. (Justin Osmer, Microsoft).
+ Local search will advance further through personalization - and benefit mobile.
+ How people are looking at pages - divide page into sections and dealing with them independently. E shaped scanning rather than F. More attention to the left than the right. Google uses the "golden triangle" to tightly hold to the margin on the left. But new challenges as images are introduced - images and videos are disruptive to the scanning. Ask found that when they interleaved video and images with web results saw some loss of ease of use and therefore moved to the three columns. Enquiro in its studies on the same query at Google and Ask found people spent the same amount of time.
+ How to put important marketing messages in front, and keep balance between sponsor results and organic results? Marissa Mayer says there are sophisticated algorithms for selecting and ranking - and some new methods of presentation - richer content such as video and maps. Richer format ads are coming and higher relevance with more "brand relevance". Still there are dangers that people will be turned off. Nielsen advises targeting ads based on personalization elements or simple profiling.
+ How will community patterns influence results? Larry Cornett at Yahoo has several social services - finds that people will trust others. Gord asked "Will there be a merging of Facebook and search?" Greg Sterling - social search in a vertical may work better because of shared interests and values. Jakob Nielsen - social networks will be too small - need to be able to capture across a large number of people - Facebook etc won't cut it.
Google on Desktop Search and Personal Information Management SEO by the Sea (Dec 2)
Bill Slawski sketches out some scenarios based on his readings of Google patents.
"You sit down at your computer, and start working on a document, and visiting the Web to find information.
A program on your computer considers the way that you move your mouse, and the speed at which you type, and recognizes you as one of the people who use that computer, and looks through your past computing sessions to see what kinds of things you are interested in, what web pages you may have visited, which documents you’ve printed, whether you prefer HTML or PDF documents when given a choice."
10 Semantic Apps to Watch Richard McManus, Read / Write Web (Nov 29)
Highlights 10 semantic applications - they "all try to determine the meaning of text and other data, and then create connections for users". Describes this as being top-down - analyze the text, or bottom-up - embed meta-data.
Several are still in private testing. Hakia is one of the few search engines that while still in beta is open for use by the public.
Of particular interest - a Firefox extension named Gnosis from ClearForest.
"The Firefox extension is called Gnosis and it enables you to "identify the people, companies, organizations, geographies and products on the page you are viewing." With one click from the menu, a webpage you view via Gnosis is filled with various types of annotations. For example it recognizes Companies, Countries, Industry Terms, Organizations, People, Products and Technologies. Each word that Gnosis recognizes, gets colored according to the category."
Sensebot gets an upgrade Pandia (Nov 15)
Update on Sensebot - the engine that "takes results from Google, Yahoo! and [or] Live and summarizes them into one concise digest on the topic of your query."
There is a test page where you can choose the engine. Search results are in sentences.
Searching the web using text mining, Pandia (Nov 14)
"What if you could get a search engine to summarize all the information found for you?"
Power Text has two text mining search engines - iResearch Reporter and NewsFeed Researcher. Demo versions of both are available.
Pandia describes both and has some additional information about text mining.
Google and Personalization in Rankings by Bill Slawski, SEObytheSea (Nov 16)
Slawski detects a possible behind-the-scenes move to personalization that operates not from search history or expressed likes and dislikes, but from overall searcher activity in selecting certain results.
"We often talk about the ranking of Web pages with terms like PageRank or relevancy, meaning how relevant terms on a page might be to a query used by a searcher.
Many patent filings coming from Google refer to statistical models, like a probabilistic model that can learn about how words are related to each other, and how pages might be similar. Those models might tell us something about searchers. "
Information would be obtained from "user query sessions".
TrueKnowledge Demos Its Semantic Search Engine Marshall Kirkpatrick, Read/Write Web (Nov 7)
Has a video that demos the new TrueKnowledge search. Search engine is still under wraps.
"TrueKnowledge combines natural language analysis, an internal knowledge base and external databases to offer immediate answers to various questions. Instead of just pointing you to web pages where the search engine believes it can find your answer, it will offer you an explicit answer and explain the reasoning patch by which that answer was arrived at. There's also an interesting looking API at the center of the product. "Direct answers to humans and machine questions" is the company's tagline."
TrueKnowledge will be inviting users to add to the "knowledge" by adding what they know. Video briefly describes the process. But any contribution from users invites the Wikipedia (and Yahoo Answers) problem of authority, correctness, and worth.
True Knowledge Launches Natural Language Search Engine Michael Arrington, TechCrunch (Nov 8)
The new True Knowledge from the UK "aims to give appropriate answers to natural language queries, even if key query terms are not included in the data being indexed. Current search engines are unable to return appropriate results for these queries."
True Knowledge is using structured databases - it isn't indexing the web. "Results can be returned based on inference of the intended meaning. So a question about if someone is married or not can be answered even if there is no specific structured data about that question."
The Quaero project - new European search technology Pandia (Oct 30)
Pandia will be running a series of articles about European search engines beginning with Quaero, being developed by the French.
"Quaero is to take part in this market [blogs, podcasts, multimedia], by developing technologies for finding, accessing, manipulating and processing multimedia and multilingual content."
Rewriting the Beginner's Guide - Part I: How Search Engines Operate SEOMoz (Oct 10)
What a good series this will be - "For the next few weeks, my blog posts will primarily consist of re-authoring and re-building the Beginner's Guide to Search Engine Optimization, section by section". Starts with How Search Engines Operate.
Web Searching with Advanced Commands Genie Tyburski, Virtual Chase (Oct 11)
"This article examines ... using advanced search commands to manipulate or improve search results."
Nice to have all these advanced commands in one place with examples from an expert on how to use them.
I'm not sure, however, that * will work well as a wildcard at Ask or Live. It does at Google and will at Yahoo when inside a phrase (eg "three * mice").
At Exalead you can use NEAR to get words within 16 words of each other, and specify the nearness of the words with NEAR/number eg NEAR/2.
Also, Live.com has a partial stemming capability in its new "related terms" feature - it can do a reasonable job of picking up the singular on a plural word (eg - markets and market), and sometimes finds combinations (health care and healthcare).
Websearchguide has comparison charts for these engines, and a section on the use of syntax.
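The NEAR/number idea mentioned above can be sketched in a few lines of Python. This is only an illustration of proximity matching, not Exalead's implementation: the function name near, the tokenizing rule, and the reading of "within N words" as a difference in word positions are all my assumptions.

```python
import re

def near(text, a, b, max_gap=16):
    """Return True if words a and b occur within max_gap word positions of each other.

    Assumption: 'nearness' is the absolute difference between word positions.
    """
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

doc = "Three blind mice ran after the wife of the farmer"
print(near(doc, "three", "mice", max_gap=2))    # True: positions 0 and 2
print(near(doc, "three", "farmer", max_gap=2))  # False: positions 0 and 9
```

A real engine would read positions from its index rather than rescanning the text, but the test is the same.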
Powerset: Move Over, Google by Robert Hof, BusinessWeek (Sept 17, 2007)
Powerset, a new "natural language" engine that hopes to challenge Google, has set up Powerset Labs and is asking people to help improve its search before its launch in 2008.
Where Google and others do keyword search, Powerset will - "analyze the actual meaning of words and phrases that it indexes on the Web. It then will analyze the linguistic meaning of the query and find the best matches between the two—theoretically, at least, producing more meaningful results. "Our system reads every single sentence in every single document and extracts meaning from them," says Powerset Chief Executive Barney Pell. "
Natural language engines thrive on words - the more the better. Searchers will have to change their ways from the skimpy 1 to 3 word queries they use today.
View to the future: "Google executives have said that natural-language search could be years away from practical use and that linguistic analysis hasn't produced notably better results so far, which Powerset disputes. At the same time, there's little doubt Google's search wizards are examining the possibilities and are positioned to take swift advantage if the technology pans out. But even if Google isn't threatened by the competition anytime soon, it's clear the search game is far from over."
See demo about Powerlabs. The Powerset Labs demo site will use Wikipedia as its database during the trial.
10 Future Web Trends by Richard MacManus, Read / Write Web (Sep 6)
MacManus gives a time frame of 10 years for these changes, but I think some will be well developed in the next 5.
+ Semantic Web for making connections between blocks of information - it has always been thought that this will require metadata.
+ Artificial Intelligence for computers to do what humans do - especially in seeing patterns.
+ Virtual worlds - live in them, create them.
+ Mobile web and location aware devices.
+ Attention Economy - "personalized news, personalized search, alerts and recommendations to buy"
+ Web sites as web services - starting to see this in widgets.
+ Online video / tv - get the television programming you want.
+ Rich Internet Apps (RIA)
+ International web - China, Korea, India - surely growth areas but will they use US-based sites and services?
+ Personalization - more and more of this unless people fear for privacy and turn it off.
At SEOMoz, randfish added his thoughts on trends - Where are Search Engines Most Likely To Innovate? - more query intent detection, more use of social, more verticals but the search engine has to recommend it. Some of this is good - I'd like to see some figures on how much searchers use suggested phrases especially those based on a log of queries - I'd rather an engine that can make sense of pages that are returned in a search set.
Search In The Year 2010 by Gord Hotchkiss, Search engine land (Aug 10)
Hotchkiss brought together his dream team of 8 people for looking down the pipe to what search will be like in three years. Among them were Jakob Nielsen, usability guru; Marissa Mayer, Google VP for interface design; Michael Ferguson from Ask; Larry Cornett at Yahoo; Justin Osmer at Live search; Chris Sherman, Greg Sterling, and Danny Sullivan - all SE pundits.
Topics:
+ Search results page - maybe more mixed content.
+ Personal portal page - can results come back organized into a portal-like page? Will iGoogle be able to do that some day?
+ Social experience - much about Stumbleupon.
+ Personalization - Chris Sherman said, "I don't really see any kind of dramatic breakthrough on the horizon. I think as long as we’re limited to the current search form factor, if you will, where we’re encouraged to do the slot machine approach, where we punch in a few keywords, pull the lever and hope to hit the jackpot."
It's hard to do and some people will worry about privacy, but all the search engines are going to work on this anyway.
+ Usefulness as part of the algorithm - Maybe, if searchers will agree to indicate what is useful to them.
+ Contextual search - Chris Sherman challenges search engines to come up with "search by example"
+ Semantic search - none of the participants talked about linguistic analysis and meaning extraction. Instead there is the interesting idea from Mayer that results might be presented with different views - on a map, on a timeline - with models taken from what has been developed in local search.
+ Hands-on - replace advanced search with buttons and sliders. But Jakob Nielsen thought users would ignore those too, as they did with the sliders on MSN.
Nielsen observed, "The basic information foraging theory, which is, I think, the one theory that basically explains why the web is the way it is, says that people want to expend minimal effort to gain their benefits. And this is an evolutionary point that has come about because the people, or the creatures, who don’t exert themselves, are the ones most likely to survive when there are bad times or a crisis of some kind."
Interesting.
The Ultimate Search Engine by J. Nicholas Hoover, InformationWeek (August 4, 2007)
"Google, Microsoft, Yahoo and others are developing next-generation technologies that automate and personalize information search."
Nicholas Hoover is rather harsh in describing the effectiveness of search engines today suggesting that users must "dumb down their queries with the pidgin language understood by first-generation search engines", but he does provide a good overview of trends in search technology and design. Query making has always been working with words - thinking of them, combining them, getting into the mind of the writer. The main change today is that search engines are adding capabilities to understand the words and make some suggestions, or to group the results so that the searcher can try some new words. But there are other developments as I've noted below from the article.
"Search results will be more accurate and automatically summarized, with relevance determined by individual preferences. New methods of presentation such as clustering, tag clouds, graphical scales that widen or narrow searches based on parameters, and automated categorization will make it easier to navigate results. And search engines will be enhanced by human intelligence and the wisdom of crowds through tagging, social bookmarking, and shared searches."
+ Learning language - Hakia and Powerset are two that are applying linguistic analysis to interpret and analyze the question, content, and results. Autonomy and IBM are also adopting this.
+ Queryless search - where the search engine anticipates need based on what you are working on. Watson is one tool which watches in the background. StumbleUpon will use Web history to make recommendations as does the new Google Dice.
[Re Google Dice, see Searching without a query]
+ Personalization - iGoogle has personalized pages including a recommendation service that is based on the user's search history. I've never found it to be useful, but there must be potential.
+ Social skills - essentially means getting answers from other Web users whether you know them or not such as through Yahoo Answers or any of the social bookmarking services.
Google has something like this now - "iGoogle, "magic tabs" present a menu of gadgets and feeds deemed relevant to a search query--the word "travel," for example--based on the tabs other Googlers have created".
Collarity identifies communities-of-interest and uses "collaborative filtering" for relevance ranking. It goes to some trouble to pick up suggestions from users it has identified as showing a deep interest in a subject.
+ Results oriented - cites engines that cluster results (Clusty) or otherwise categorize them (Endeca) and those that have smart answers (such as Windows Live being able to show a map). These are both very important, and the examples for smart answers should have included Ask.com.
+ Multifaceted - really multimedia and being better able to discern content from patterns rather than being limited to metatags and surrounding text.
Some of the developments and players mentioned in this article aren't new. Watson, as an example, has been trying for broad acceptance for some time.
Of interest is the update on Watson -- "Watson got a second life in MediaRiver's ClickSurge widgets, which determine important concepts on a Web page and embed relevant links elsewhere on the page." Most Web searchers do not want to download software, and many will be wary of gadgets and widgets too.
Nonetheless, this is a good description of the main vectors in play for improving web search, the query interface and the results.
Powerset and hakia - Quest For The Semantic Web by Phil Butler, Read/Write Web (July 20)
Hakia, the meaning-based engine, and Powerset, which promises "semantic search", are quite different, explains Butler, in the index that is used, the processing, and the horsepower.
Of interest, Barney Pell, CEO of Powerset, said that Facebook is one of the key innovations of late and that it "will become one of the primary communications platforms of the future".
Butler opined that "Facebook is one heck of a representation of information for a social network. Essentially, hakia, Powerset, Facebook and others are bending the machines to engage humans. And in a way, Facebook is the semantic Web in a microcosm - but in it's infancy."
I think that is a stretch but perhaps it does depend on how important personalization will turn out to be.
Google's Research Director Peter Norvig On 'The Future Of Search' by Greg Sterling, Search Engine Land (July 17)
Several excerpts from an interview with Peter Norvig, Google's director of research, published in the MIT Technology Review -- The Future of Search - The head of Google Research talks about his group's projects.
+ emphasis on machine translation and speech - will be useful for video search.
+ want to know more about the searcher's intentions.
+ want to be able to understand contents beyond word matching. Google does understand synonyms and place names but can't parse a sentence yet.
New in the Demo Center -- Cognition Linguistic Search, EContent (July 11)
"Linguistic search technology employs a unique mix of linguistics and mathematical algorithms which has, in effect, "taught" the computer the meanings (or associated concepts) of nearly all the words and the frequent phrases within the common English language. Unlike all of the popular search engines in use today, which utilize mathematically-based pattern-matching technology (i.e., they search for a particular word pattern), CognitionSearch understands the meaning of words in context; in both the query and in the document base."
CognitionSearch looks for concepts in your query and identifies alternate meanings. You select from the meanings it finds in order to refine the search. It reminds me of Oingo of long ago. It's a bit too much work for the searcher - would be better if the engine could figure out the meaning from the context.
Eric Enge interviews Udi Manber about Search Quality, Stone Temple (July 9)
Udi Manber is VP of Engineering at Google. He talked about Google's work to improve quality of search results through changes to the algorithms.
Manber explained, "we [Google] use more than a hundred different parameters. PageRank is still an important parameter, but it's just one parameter. And, there are all kinds of parameters, such as whether the word appears in the title and whether the two words are close together and all the obvious traditional information retrieval parameters. There are many others that we invented and there is the combination of all of them, which is really where the hard work is being done, figuring out when and how to put them all together, of course, all of which is being done in real time."
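To make the "many parameters combined" idea concrete, here is a toy scorer in Python. It is a sketch only: the three signals (title match, word proximity, a link score standing in for PageRank), the weights, and the sample pages are all invented for illustration and say nothing about how Google actually weights things.

```python
def proximity(words, a, b):
    """1 / smallest distance between a and b in the word list, or 0.0 if either is missing."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return 0.0
    return 1.0 / max(1, min(abs(i - j) for i in pos_a for j in pos_b))

def score(page, query_terms, weights):
    """Weighted sum of three hypothetical signals for a two-word query."""
    title_words = page["title"].lower().split()
    body_words = page["body"].lower().split()
    title_match = sum(t in title_words for t in query_terms) / len(query_terms)
    close = proximity(body_words, query_terms[0], query_terms[1])
    return (weights["title"] * title_match
            + weights["proximity"] * close
            + weights["links"] * page["link_score"])

weights = {"title": 0.5, "proximity": 0.3, "links": 0.2}   # assumed weights
page_a = {"title": "Search engine ranking", "body": "how a search engine ranks pages", "link_score": 0.2}
page_b = {"title": "Home", "body": "engine parts and a search for spares", "link_score": 0.9}
query = ["search", "engine"]

for name, page in (("A", page_a), ("B", page_b)):
    print(name, round(score(page, query, weights), 3))
```

Page A comes out ahead even though page B has the much higher link score, which is the point Manber makes: PageRank is one parameter among many.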
People often repeat web searches by Greg Linden, Geeking with Greg (Jul 7)
As many as 40% of queries are repeats of earlier searches to refind results or get new ones. If this is the case, then having tools to remember your searches and the results would be useful. This ties into personalization -- "A first and necessary step toward personalization is to start maintaining search and viewing history for each user. "
Proof Google is Using Behavioral Data in Rankings, SEOMoz.org (June 12)
Worth a careful read - two points of interest:
+ experiment by Visio "proved that Google was using the data from Google Analytics to improve the ranking algorithm". Google Analytics is a tool to use on a website to see where visitors come from and how they interact with a site.
+ Google's purchase of Feedburner will provide even more data to Google from the feeds to subscribers - "Obviously a site with 10,000 readers is going to have more authority than one with 100 readers! I would say it's a safe bet that this new data will eventually find its way into the ranking algorithms."
Powerset Meets the Press, SEW Blog (Jun 29)
Powerset showed its natural language / semantic search technology to the press in San Francisco. People were impressed. Release is expected for September.
From Powerset: The natural language search mashup platform, by Dan Farber, ZDNet (Jun 28)
Steve Newcomb, COO, is quoted as saying, “Imagine a mashup between Facebook, Digg and Google Apps, but you get to participate in the building of the products that sit on top of our platform. You log into a social network, like you would Facebook, and you get certified to be a Powerlabber. Once certified you can join different interest groups, such as travel, and participate in idea and mashup competitions. QA is embedded and its all bloggable.”
Search Engine Friendly URL’s explained, SEO (June 27)
The dynamically generated url is no longer a problem for search engines to index - "Google and the other top search engines have figured out how to do deal with dynamic URL’s. There are many sites that rank well using them. Having a “?” or “&” in your URL is not considered a negative or a positive by the search engine algos. What they do have a problem with is session ID’s. Do not use those."
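For webmasters who want to act on the session ID advice, a minimal Python sketch of cleaning a dynamic URL follows. The parameter names in SESSION_PARAMS are assumptions (common conventions, not a list from the article), and strip_session_ids is a made-up helper name.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter names; adjust to whatever your own site emits.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def strip_session_ids(url):
    """Drop session-ID query parameters but keep the other dynamic parameters."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session_ids("http://example.com/item?id=42&PHPSESSID=abc123&cat=books"))
# http://example.com/item?id=42&cat=books
```

The "?" and "&" stay, as the quote says they are harmless; only the per-visitor value that makes every crawl see a different URL is removed.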
What's next for the Internet By Michael V. Copeland, Business 2.0 via CNN Money (July 3)
Nova Spivack is working on making the semantic web real through his company Radar Networks. Here they are working on an artificial intelligence that will make connections between blocks and bits of information to create an order and reveal meaning.
"For Spivack, however, the semantic Web begins now with the data engine and user applications he and his team are prepping for launch -- and ends somewhere in the future with artificially intelligent software agents handling all the online drudgery of your business and professional life."
Radar will be launching "a sort of personal data organizer. It will allow you to bring in e-mail, contacts, photos, video, music --anything digital, really -- from anywhere on the Web, turn it into RDF, and access it in one place."
Sensebot summarizes search engine results on the fly, Pandia (June 28)
"Sensebot is a new search engine that takes results from Google and Yahoo! and summarizes them into one concise digest on the topic of your query."
Still in beta, but if it succeeds students and researchers will love it.
Semantic Search: An antidote for poor relevancy, by Dr. Riza C. Berkan, Founder & CEO, hakia.com, at Read / Write Web (May 29)
Of course, Hakia.com search technology is a form of semantic search.
"The option of "Semantic Search Engine" has yet to be tested. My company hakia, along with others like Powerset, Cognition Search, and Lexxe are taking steps in this new direction. There are challenges with this approach as well. First and foremost, the knowledge of languages must be built in a structure that would allow a scalable and speedy search process. Building such resources is an expensive, tedious, and time consuming endeavor. Then, all the Web pages must be analyzed using this system to prepare for a retrieval platform; another time-consuming process. But when all of this is done properly, the users will start to experience something totally new. Let me emphasize the word "properly" here, which is an entirely new discussion point."
The .edu domain that generally represents universities and colleges (educational institutions) in the United States could become contaminated with spam. Rebecca at SEOMOZ.org seems to answer yes to her own question - Will .edu Links Ever Lose Their Luster?. SEO people look for ways to post material to an edu domain (such as jobs), and to get links. The comments on this are also interesting, with some feeling that edu and gov links offer a boost and another quoting Matt Cutts at Google, who said definitely not.
A Smarter Web by John Borland, Technology Review (Mar/Apr 2007)
A view of the future - "The next wave of technologies might ultimately blend pared-down Semantic Web tools with Web 2.0's capacity for dynamic user-generated connections. It may include a dash of data mining, with computers automatically extracting patterns from the Net's hubbub of conversation. The technology will probably take years to fulfill its promise, but it will almost certainly make the Web easier to use."
Web Search Results: Something to Keep in Mind, ResourceShelf (May 25)
Refers to an article in Forbes - Google-Proof PR? (May 25) about a company called Reputation Defender that "pads the Web with friendly-sounding content like flattering blog entries, personal sites and other positive pages, and then pushes those sites to the top of the Google".
Gary Price at Resourceshelf points out that people should be taught how to spot this kind of manipulation as part of information literacy. Indeed. He also has pointers on search strategies people can use to circumvent it (somewhat): use advanced features, have more specific queries, use more than one tool and preferably specialty tools.
Will Universal Search Mean Universal Domination? , Eric Enge, Searchday (May 17)
Google's new universal results presentation is a complete integration from web, video, news, local and books. Enge points out that this requires a "relevance scoring system that would work on the same numerical scale across all of their properties." ... "The key thing that Google needed to do was to normalize these results, putting them all on a common scale. " ... "But once they succeeded in normalizing and extracting their relevance scoring systems, the rest was relatively easy."
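A small Python sketch shows why a common scale matters before results from different properties can be interleaved. Enge does not say how Google normalizes its scores; the min-max rescaling used here, the function names, and the sample scores are my assumptions.

```python
def normalize(results):
    """Rescale one vertical's raw scores to the 0..1 range (min-max, an assumed method)."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0          # avoid dividing by zero when all scores are equal
    return [(doc, (s - lo) / span) for doc, s in results]

def blend(verticals):
    """Merge all verticals into one list ordered by normalized score, best first."""
    merged = []
    for name, results in verticals.items():
        merged += [(score, name, doc) for doc, score in normalize(results)]
    return sorted(merged, reverse=True)

# Raw scores on incompatible scales (hypothetical numbers).
verticals = {
    "web":  [("web-a", 120), ("web-b", 80), ("web-c", 40)],
    "news": [("news-a", 7), ("news-b", 5), ("news-c", 4)],
}
for score, name, doc in blend(verticals):
    print(f"{score:.2f} {name:5} {doc}")
```

Without the rescaling step every web result would outrank every news result simply because the web scores are bigger numbers.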
Top 17 Search Innovations Outside of Google by Nitin Karandikar, Read / Write Web (May 7)
Excellent round-up of new search technologies roughly grouped into 4 categories - Query Pre-processing; Information Sources; Algorithm Improvement; Results Visualization and Post-processing. Has all the themes and excellent examples.
What Is Google PageRank? A Guide For Searchers & Webmasters by Danny Sullivan, Search engine land (Apr 26)
The underlying concept of PageRank is that links to a site are like votes, and some votes are more important than others. Google also uses other factors that involve matching on text to rank results - as Sullivan explains.
PageRank is not something the searcher can see unless using the Google Toolbar and with the ranking meter turned on. People don't turn it on for privacy reasons - Google is tracking what you do - but that may change as people opt for Web History tracking.
Sites in the Google Directory are sorted by PageRank - that's what the green bar is all about. Sadly, Sullivan says that the directory has not been updated with changes from the Open Directory Project for months (maybe years).
Sullivan presents proof that PageRank is not the most important factor in rankings. He also distinguishes between search rank (on the fly ranking) and toolbar ranking (periodic snapshot of the page).
Knowing the search rank could confuse searchers, but knowing the rank of a single page may help in assessing its quality.
"PageRank is one of many, many factors used to produce search rankings. Highlighting PageRank in search results doesn't help the searcher. That's because Google uses another system to show the most important pages for a particular search you do. It lists them in order of importance for what you searched on. Adding PageRank scores to search results would just confuse people. They'd wonder why pages with lower scores were outranking higher scored pages.
In contrast, if you're looking at a single page, such as when you are surfing the web, you no longer want the search ranking but rather an idea of how important or reputable that page might be. This is where PageRank makes more sense."
All in all, PageRank is mainly of interest to SEO specialists.
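For readers who want to see the "links are votes, and some votes count more" idea in action, here is a minimal Python sketch of the published PageRank iteration. The damping value of 0.85 comes from the original PageRank paper; the four-page link graph and the function name are invented, and this is the textbook calculation, not Google's production ranking.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank along links.

    links maps each page to the list of pages it links to; every target
    must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages       # a page with no links spreads its vote evenly
            share = damping * rank[page] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

# Hypothetical graph: A, B and D all link to C; C links only to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, value in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

C ranks first because it collects three votes. A has only one inbound link, but it comes from C, so A ends up far ahead of B and D, which nobody links to.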
Why I Love the Google's Supplemental Index, Aaron Wall, SEO Book (May 5)
Supplemental pages that show in Google results are a great concern to search engine marketers. According to this post, Forbes says these are pages that Google "deems to be of low quality or designed to appear artificially high in search results" - and so Google doesn't index them. But Matt Cutts at Google says no - these pages are supplemental because of PageRank - presumably that they don't have sufficient links to them. Aaron Wall in this post posits that it has to do with duplicate content.
Whatever the cause, looks like searchers can ignore those results most of the time.
Danny Sullivan wrote about this last January - "Basically, the supplemental index is a way for Google to hit less important pages in specific instances when it can't find matches in the main index. Trying to search against tens of billions of pages all at once is time consuming and expensive. Far easier to hit just the "best of the web," exactly as Inktomi used to do -- and for exactly the same reasons. But it's a continuing reminder that Google can't do it all. No matter how great those machines are, they have to divide up that index. The "best of the web" might still be tens of billions of pages, but divisions still raise concerns."
From January 2007 Update On Google Indexing & Ranking Issues Search engine land (Jan 11, 2007)
Is Relevance Relevant? Market, Science, and War: Discourses of Search Engine Quality by Elizabeth Van Couvering, Department of Media and Communications, London School of Economics, Journal of Computer-Mediated Communication, 12(3), article 6.
Ouch - search engines are not unbiased or neutral in ranking results, says this writer, who interviewed senior management in several search engine companies between November 2002 and May 2004.
"The evidence presented here suggests that resources in search engine development are overwhelmingly allocated on the basis of market factors or scientific/technological concerns. Fairness and representativeness, core elements of the journalists' definition of quality media content, are not key determiners of search engine quality in the minds of search engine producers. Rather, alternative standards of quality, such as customer satisfaction and relevance, mean that tactics to silence or promote certain websites or site owners (such as blacklisting, whitelisting, and index "cleaning") are seen as unproblematic."
Also see John Battelle for comments - Search Paper: Is Relevance Relevant?
What Is Google PageRank? A Guide For Searchers & Webmasters, by Danny Sullivan, Searchengineland (Apr 26)
Probably the definitive explanation of page rank to date - "Let's start with how PageRank is used by Google for searchers. First and foremost, it is one of many factors used for ranking pages. You can't see PageRank when you search (ordinarily, that is. further below I'll explain how you CAN see it), but behind the scenes, it helps in part to decide if a page will show up in the top search results or not."
Also explains PageRank for the Directory (Google uses the Open Directory (ODP) but is slow to pick up changes, and ODP itself is poorly maintained).
Main message - "PageRank is one of many, many factors used to produce search rankings. Highlighting PageRank in search results doesn't help the searcher. That's because Google uses another system to show the most important pages for a particular search you do. It lists them in order of importance for what you searched on. Adding PageRank scores to search results would just confuse people. They'd wonder why pages with lower scores were outranking higher scored pages."
Autonomy To Reclaim Blinkx, Then Spin It Off, Danny Sullivan, Searchengineland (Apr 25)
"Autonomy is to exercise an option to take over Blinkx, then appears to be spinning some consumer-facing search technology that its owns (and I believe Blinkx was licensing) into an independent company Blinkx, that will go public in London."
Interesting review of the history of Blinkx and of Autonomy with some discussion of textual analysis technology - or "meaning-based search". Refers to a page at Autonomy that describes the IDOL Server and very broadly the technology - "Autonomy's strength lies in advanced pattern-matching techniques (non-linear adaptive digital signal processing), rooted in the theories of Bayesian Inference and Claude Shannon's Principles of Information, that enable identification of the patterns that naturally occur in text, based on the usage and frequency of words or terms that correspond to specific concepts."
Google Declares Stephen Colbert As Greatest Living American, Danny Sullivan, Search engine land (Apr 20)
Google changed its ranking algorithms to prevent pages being ranked high because of terms in the anchor text linking to those pages - or so it was thought. This was to prevent the "miserable failure" bomb against President George Bush (and others). But Stephen Colbert has managed to have himself show up as the greatest living American. Sullivan explains this as "I suspect the answer will be that the link bomb fix Google uses is more sophisticated than just looking to see if the words people are using in links, when a lot of links suddenly point at a page, actually appear on a page." - which doesn't help much. Yahoo, Live, and Ask didn't give Colbert the top spot, and Ask no spot at all.
Google Categories Prototype, Google Blogscoped (Apr 16) - screenshot of Google categorizing (or at least grouping) results - but it was a fleeting glance.
The power of links – non-indexed pages out-ranking optimised ones. Search Engine War (UK) (Apr 10)
We get a glimpse of Google's ranking algorithm in this posting - "...when a page is linking to another that is blocked by the robots.txt file Google opts to display the link text from the linking page as the result title, which when you think about it is actually quite a serious flaw in their treatment of the robots.txt protocol ... Google is showing priority to the decision of the linker to link, over the content owner who wants it excluded."
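The behaviour described in that posting can be mimicked with Python's standard robots.txt parser. This is a sketch of the logic as the posting describes it, not Google's code: the robots.txt rules, the ExampleBot user agent, the URLs, and the result_title helper are all made up.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one directory for all crawlers.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def result_title(url, anchor_text, page_titles, agent="ExampleBot"):
    """Pick the title to show for a URL in a result list.

    If the crawler may fetch the page, use the page's own title; if robots.txt
    blocks it, the page was never read, so fall back to the link text that
    another page used when pointing at it.
    """
    if rules.can_fetch(agent, url):
        return page_titles.get(url, url)
    return anchor_text

titles = {"http://example.com/public/index.html": "Example Home Page"}
print(result_title("http://example.com/public/index.html", "click here", titles))
print(result_title("http://example.com/private/report.html", "leaked annual report", titles))
```

The second result is titled entirely by the linker, which is the complaint in the posting: the content owner asked for exclusion, yet the URL can still appear under words chosen by someone else.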
The semantic web - the next upgrade to the web. Neal Goldman, CEO of Inform, speaks to BusinessWeek Online about semantic web technologies and what they may mean for the Web. Semantic search technology analyzes the use of words in text to make it possible to understand "conceptually" what you want. It links words and phrases together and makes more connections in order to fill out meaning. Personalization, as an added component, will use what is known about your interests to further refine results. The process is a blend of human direction to train the algorithms for word connections, and machine learning.
Watch the BusinessWeek video on The Semantic Web. (April 6)
This is part of a special report at Business Week - CEO Guide to Technology.
Taming the World Wide Web - A rising tide of companies are tapping Semantic Web technologies to unearth hard-to-find connections between disparate pieces of online data, Rachel King
Some semantic technologies are in use today for special applications in order to make linkages between data. As the article states, "Those tools are the stuff of the Semantic Web, a method of tagging online information so it can be better understood in relation to other data—even if it's tucked away in some faraway corporate database or software program. Today's prominent search tools are adept at quickly identifying and serving up reams of online information, though not at showing how it all fits together. "When you get down to it, you have to know whatever keyword the person used, or you're never going to find it," says Dave McComb, president of consulting firm Semantic Arts."
ZoomInfo is an example of a semantic search engine. "The engine automatically crawls publicly available business information—from corporate Web sites to press releases and electronic news services to SEC filings—adding semantic tags and organizing information so that it can be easily found later."
Article makes the point that we shouldn't call this Web 3.0 as if it is a software release. But there is a progression from what people are doing with tagging today. "In many user-generated sites grouped under Web 2.0, users often tag their own data, be it photos, bookmarks, videos, or other content. "Web 2.0 is the messy way that the Semantic Web is actually happening," says O'Reilly."
Business Week also has a slide show for Weaving a Web Around Technology and a podcast.
Do you really need meta tags? You bet by Jennifer Slegg, (Apr 4)
Of interest - "Google sometimes uses the description you place in the meta description tag as the snippet when certain criteria is met. This sometimes includes site search or keyword searches when keyword(s) that are searched for also are contained within the meta description."
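A rough Python sketch of that snippet rule: use the meta description when the searched keywords appear in it, otherwise fall back to text from the page. The exact criteria Google applies are not given in the article, so the simple "any query word appears in the description" test, the class and function names, and the sample page are assumptions.

```python
from html.parser import HTMLParser

class MetaDescription(HTMLParser):
    """Collects the content of <meta name="description"> if the page has one."""
    def __init__(self):
        super().__init__()
        self.description = ""

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "description":
            self.description = a.get("content") or ""

def snippet(html, query, fallback):
    """Return the meta description when it contains a query word, else the fallback text."""
    parser = MetaDescription()
    parser.feed(html)
    desc = parser.description
    if desc and any(term.lower() in desc.lower() for term in query.split()):
        return desc
    return fallback

page = '<html><head><meta name="description" content="Hand-made oak tables and chairs."></head></html>'
print(snippet(page, "oak tables", "...text taken from the page body..."))
print(snippet(page, "walnut desk", "...text taken from the page body..."))
```

The first query gets the author's own description; the second does not mention anything in it, so the engine would have to build a snippet from the page text instead.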
Cognition Launches New Linguistic Search Engine by Barbara Quint, Newsbreaks (Apr 2)
Seems to be a breakthrough on linguistic search with Cognition.
"Cognition Technologies (www.cognition.com) has launched CognitionSearch, a linguistic search engine that supports ontology, morphology, and synonymy, tapping one of the world's largest computational dictionaries. Initially, the company will market a vertical enterprise service for legal litigation support and for life science and health research. It also offers an open Web service (www.cognitionsearch.com) to demonstrate the technology as applied to MEDLINE and PubMed content, to judicial and legislative sources, and to political blog content."
"In the current launch of the CognitionSearch open Web service, the company selected three subject areas to showcase and demonstrate the technology: health (MEDLINE, PubMed, etc.), legal (U.S. Supreme Court cases, a million Enron emails, etc.), and politics (key political blogs)."
Google and the deep web by Greg Linden (Mar 23)
Through papers Google has released recently, Greg Linden has gleaned much about Google's intentions on whether to index structured data in order to reach into the "deep web" (aka invisible web).
There is specific mention of the "content that lies hidden behind queryable HTML forms", i.e. dynamically generated answers. Google envisages generating queries against those databases based on the user's keywords and, possibly, anticipating these by "surfacing" information beforehand and adding it to the Google index.
In the followup posting - The end of federated search? (Mar 24) - Linden concludes that Google will not do federated search (a meta search of other search engines such as those powering specific databases), but will opt for copying what it can.
Federated search would require a "virtual schema" with the domains mapped into a common view. That's not going to happen.
From the quote: "The third limitation is our reliance on structured queries. Since queries on the web are typically sets of keywords, the first step in the reformulation will be to identify the relevant domain(s) of a query and then mapping the keywords in the query to the fields of the virtual schema for that domain. This is a hard problem that we refer to as query routing."
A9.com, as Linden explained, was an attempt at large scale federated searching, and presumably it has failed. And even though its creator, Udi Manber, is now with Google, Google seems to prefer the surfacing approach, probably for performance reasons.
What does this mean for searchers? We're not going to see a search service that can automatically direct us to the best resource and then exploit the structure and organization of that resource to deliver answers. We will get clues from Google, but not the in-depth search that is required for digging into the deep web. Searchers - it is still up to you to find the resources and learn how to use them.
Addendum: There is an excellent discussion of federated search and metasearch in the comments to the End of Federated Search.
Comments on the value of metasearch (the type that fuses results from different but similar search engines) - "On the other hand, with metasearch, each search engine is working across the same corpus, and the whole point is that duplicate content is a good thing. The more often that independent search engines retrieve the same document, the higher our confidence is that the document is truly relevant."
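The voting idea in that comment can be sketched in a few lines of Python. This is only an illustration under assumptions: the engine names and result lists are made up, and counting votes with a reciprocal-rank tie-breaker is one simple fusion rule, not the method of any particular metasearch engine.

```python
from collections import defaultdict

def fuse_results(result_lists):
    """Fuse ranked lists from several engines searching the same corpus.

    A document gets one vote per engine that retrieved it. Ties are broken
    by the sum of reciprocal ranks, then by document id for stability.
    Assumes each engine lists a document at most once.
    """
    votes = defaultdict(int)
    rank_score = defaultdict(float)
    for results in result_lists.values():
        for rank, doc in enumerate(results, start=1):
            votes[doc] += 1
            rank_score[doc] += 1.0 / rank
    return sorted(votes, key=lambda d: (-votes[d], -rank_score[d], d))

# Hypothetical engines and documents, for illustration only.
engine_results = {
    "engine_a": ["doc1", "doc2", "doc3"],
    "engine_b": ["doc2", "doc4", "doc1"],
    "engine_c": ["doc2", "doc5"],
}
fused = fuse_results(engine_results)
print(fused)  # doc2 comes first: all three engines retrieved it
```

Note that this only makes sense when the engines cover the same corpus, which is the distinction the commenter draws between metasearch and federated search.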
Powerset Aims to Leapfrog Google by David Needle, Internet News (Feb 9)
In this article about Powerset and its work on a natural-language search engine we get an example of what will happen with a consumer question.
"Bobrow gave a consumer example of how the Powerset service works. When someone types in "Who was Spielberg married to before Kate Capshaw" Google and others give results related to the movie director Steven Spielberg and actress Kate Capshaw.
"Google doesn't give you the answer, Amy Irving, because it's not part of the question. What you really want is the answer, not hundreds or thousands of links. We give you the answer." "
Hakia is another natural-language processing search engine. It had an answer for the Spielberg question -- "The top of its results page said: "You are very curious today. Spielberg was once married to Amy Irving and is now married to Kate Capshaw." That was followed by links to pages related to Spielberg."
Personalized Search - The Feature No one is Asking For, Graywolf's SEO Blog (Feb 8)
Here's a scary thought: as more search engines apply proprietary personalizing routines to ranking results, how do you work through a search query online with a customer or a student? They may get different results from you, and if you move to another computer, the results may be different again. How do you help the customer or teach the student?
Michael Gray gets it - "I’ve never met a single person who’s said “wow searching for something at home gives me different results than searching for something at work, that’s not confusing at all, in fact I think that’s an improvement, why can’t my calculator work like that”."
But the search engines don't seem to understand the chaos they will create by using personal factors to rank results. There will have to be a way to turn these off.
In a Search Refinement, a Chance to Rival Google, Miguel Helft, New York Times (Feb 9)
At PARC scientists have been working on natural language search technologies. PARC is licensing that technology to others.
"The start-up, Powerset, is licensing PARC’s “natural language” technology — the art of making computers understand and process languages like English or French. Powerset hopes the technology will be the basis of a new search engine that allows users to type queries in plain English, rather than using keywords."
But it will be tough. Powerset doesn't expect to release an engine until late 2007. Many doubt that it will be possible to get an engine to answer real questions such as "what companies did I.B.M. acquire in the last five years". Marissa Mayer, Google’s vice president for search and user experience, was quoted as saying: “Natural language is really hard. I don’t think it will happen in the next five years.”
Evolution of a Search Engine by Philipp Lenssen, Google Blogoscoped (Feb 2)
Very interesting article by Lenssen on how search discovery might develop into delivering "knowledge" answers, providing personalized content, and ultimately performing some analysis. Google is the engine under study, but Ask.com may have some of the "knowledge" capabilities today.
"Right now, to answer your queries, Google quotes from the web, and orders the quotes in a list. In the future, Google may combine these quotes into a free-style text for a more direct answer. When the Google AI advances beyond that, it may analyze the texts available to it to come up with conclusions of its own."
Future of Search: The European View, Frank Watson, SEW blog (Jan 24)
Richard Firminger, Director of Northern European Sales for Yahoo Search Marketing, sees moves to integrated results from various sources, including social search and natural language search.
"From a single search we will soon be able to receive answers incorporating text (sponsored and algorithmic), video, images and even human knowledge – the latter coming from social search products like Yahoo! Answers, ...."
"Additionally, natural language, or Semantic Search – which enables users to pose queries as a properly phrased question, not with a couple of words – may come to the fore."
Search Scent in the Search Engines by Kevin Lee, Clickz (Jan 19)
Searchers will be interested in what's on the minds of search-engine marketers to improve the searcher's experience (and sell the product). PARC scientists posited the idea that Web users pick up an "information scent" when navigating between sites; Lee converts that idea to a "search scent" that searchers have for a particular piece of information, a scent that advertisers and site designers can plant.
"Search scent is an extension of the information scent concept, initially developed by scientists at Xerox Palo Alto Research Center (PARC). Information scent centers on the how users navigate the Web, both within sites and from one site to the next while pursuing information on a specific topic. The research illustrates that humans forage for information on the Internet in much the same way animals follow scent and visual cues to find food. Scent is essentially an application of user interface optimization best practices, and search scent is a specific niche based on the fact searchers are even more wedded to a particular information-gathering mission than surfers or casual browsers."
Eye Tracking in MSN Search, Search Engine Land (Jan 12)
Microsoft may be using eye-tracking methods to assess effectiveness of snippets for results. "Adding information to result snippets significantly improved performance for informational tasks but degraded performance for navigational tasks."
Live Search is going to have to do more than this to improve its worth as a search engine.
U.S. Library of Congress Selects Autonomy for Its Enhanced Website Search Features (Dec. 14, 2006)
... "U.S. Library of Congress has selected Autonomy's enterprise search infrastructure platform to offer enhanced search features on several of its websites, including Thomas and the Legislative Information System" Features are "framework for organizing and managing the legislative information, providing multiple guided navigation paths, and new flexibility in searching.."
Search 2.0 - What's Next? Written by Emre Sokullu and edited by Richard MacManus, Read/Write (Dec 13)
Good end-of-year article for looking at trends in search: user interface (sees promise in Snap and the new Live.com), technology (clustering, natural language), and vertical engines.
Comments by others worth a quick browse.
Revenge of the meta-tag!, SEOmoz.org (Nov 17) - those metatags and descriptions can be important in getting indexed at all.
Google adds indexing tools for News portal - "Publishers and webmasters will use a site map to indicate the articles they want Google News to index" By Juan Carlos Perez and Mike Barton, InfoWorld (Nov 21)
"This means that publishers and webmasters will be able to specify through a site map the articles they want Google News to index. A site map is a file that webmasters and publishers put on their sites to guide search engines' automated Web crawlers in properly indexing their Web pages."
Google, Yahoo, Microsoft partner on open source search protocol "Rivals team on how sites are indexed, easing the game for webmasters, improving search results for users" By Juan Carlos Perez, IDG News Service (November 15, 2006)
An "open source, Sitemap Protocol based on XML (Extensible Markup Language)" could improve indexing of web sites significantly and make some of the "invisible" web visible. Webmasters would create a sitemap that will guide the Web crawlers to index areas of the site. These "site maps are particularly useful in highlighting to crawlers the dynamic Web content that is served up on the fly." In the end, crawlers will be able to do deeper indexing.
[Added Nov 19] - Search Engines Unite On Unified Sitemaps System by Danny Sullivan, SEW Blog (Nov 16) - has the complete press release and some comments.
From the press release:
"Las Vegas, November 16, 2006 - In the first joint and open initiative to improve the Web crawl process for search engines, Google, Yahoo! and Microsoft today announced support for Sitemaps 0.90 (www.sitemaps.org), a free and easy way for webmasters to notify search engines about their websites and be indexed more comprehensively and efficiently, resulting in better representation in search indices. For users, Sitemaps enables higher quality, fresher search results. An initiative initially driven by Yahoo! and Google, Sitemaps builds upon the pioneering Sitemaps 0.84, released by Google in June of 2005, which is now being adopted by Yahoo! and Microsoft to offer a single protocol to enhance Web crawling efforts."
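To make the idea concrete, here is a minimal sketch of building such a file with Python's standard library. The element names (urlset, url, loc, lastmod) and the namespace come from the protocol published at sitemaps.org; the page URLs and dates are made-up examples, and build_sitemap is a hypothetical helper, not part of any official tool.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Sitemap protocol at sitemaps.org.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML as a string for a list of (url, lastmod) pairs.

    lastmod may be None, in which case the optional element is omitted.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Example pages (assumed): a static home page and a dynamically
# generated catalogue page that a crawler might not find on its own.
pages = [
    ("http://www.example.com/", "2006-11-16"),
    ("http://www.example.com/catalog?item=12", None),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The second entry is the interesting one: listing a dynamically generated URL is how a webmaster can point crawlers at content that is served up on the fly.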
Republicans hit in 'Google bombing' by Tom Zeller, New York Times via IHT (Oct 26)
"Fifty or so other Republican candidates have also been made targets in a sophisticated "Google bombing" campaign intended to game the search engine's ranking algorithms. By flooding the Web with references to the candidates and repeatedly cross-linking to specific articles and sites on the Web, it is possible to take advantage of Google's formula and force those articles to the top of the list of search results."
Making Search More Relevant, By Bruce Clay, Search Engine Guide - October 18, 2006
"In recognition of these limitations, search engines are constantly innovating to make search more relevant. Some are providing a means to personalize your search results with shared knowledge, some are experimenting with a new and different results page, and others want to improve relevance with the human touch."
How Can Search Engines Rank Results? Let Bill Count The Ways by Danny Sullivan, SEW Blog (Oct 16) There are over 100 factors a search engine could consider in ranking results. For Sullivan the main takeaway is that "... we are moving further into that world ... where not everyone will see the same search results for the same query." Essentially, ranking search results is getting personal.
Points to an excellent article - 20 Ways Search Engines May Rerank Search Results - by Bill Slawski, SEO by the Sea (Oct 14). Article describes ways that results may be re-ranked after the basics of matching on terms and link analysis.
Why Search Sucks & You Won't Fix It The Way You Think, Danny Sullivan, Daggle (Sep 19) - screenshot tour of search interfaces since the early days of Altavista with substantial coverage of design efforts to cluster results and some on the information visualization efforts at Kartoo, Grokker, and Ujiko. What works? People prefer simple keyword entry and results display. One person commented -- "Search is a verbal process for the most part, so effective display of results is going to be blocks or columns of text. In my experience, anything else is just annoying."
Potential of web search personalization, Geeking with Greg (Sep 27) - Summarizes points from KDD 2006 paper, "A Large-Scale Analysis of Query Logs for Assessing Personalization Opportunities", by Steve Wedig and Omid Madani from Yahoo Research.
Basically there are two ways - "using a searcher's short-term history to change search results, which they call "adjustment", and modifying searcher results using a profile built from their long-term history, which they refer to as "personalization""
Either way, search personalization is something we will be seeing more of, but I fear we won't know when it is being applied. When search engines adopt this, it would be nice if they offered an on/off switch.
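As a toy illustration of the two approaches, and of what an off switch might look like, here is a small Python sketch. Everything in it is assumed for the example: the result tuples, the topic labels, the weights, and the additive scoring rule. It is not taken from the Yahoo Research paper.

```python
def rerank(results, interests, enabled=True):
    """Re-rank (url, topic, base_score) results using topic weights.

    interests maps a topic to a bonus added to the base score. Passing
    enabled=False returns the unpersonalized ranking.
    """
    if not enabled:
        return sorted(results, key=lambda r: -r[2])
    return sorted(results, key=lambda r: -(r[2] + interests.get(r[1], 0.0)))

# Hypothetical results for an ambiguous query.
results = [
    ("jaguar-cars.example", "autos", 0.9),
    ("jaguar-cats.example", "wildlife", 0.8),
]

# "Adjustment": weights derived from the current session's recent queries.
session_interests = {"wildlife": 0.3}
# "Personalization": weights derived from a long-term profile.
profile_interests = {"autos": 0.5}

print([r[0] for r in rerank(results, session_interests)])
print([r[0] for r in rerank(results, profile_interests)])
print([r[0] for r in rerank(results, profile_interests, enabled=False)])
```

The same query yields a different top result depending on which history is consulted, which is exactly why two people comparing results side by side can end up confused.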
Peter Morville, author of Ambient Findability, spoke to the Library of Congress on July 20, 2006. The presentation is available in this webcast. Runs for 45 minutes.
"Peter Morville, widely recognized as a founding father of information architecture, discussed his recent book, "Ambient Findability," in a program sponsored by the Science, Technology and Business Division. Morville describes Ambient Findability as a safari of how people search for information and how they now find their way through a world of information overload. His previous book, which he co-authored with Louis Rosenfeld, "Information Architecture" was named "Best Internet Book of 1998." Morville's work has been featured in many publications including Business Week, The Economist, Fortune, MSNBC and The Wall Street Journal. He blogs at findability.org."
Spying an intelligent search engine by Stephanie Olsen, CNet (Aug 18)
"While most would agree that Google has set the current standard for Web search, some technologists say even better tools are on the horizon thanks to advances in artificial intelligence."
Medstory applies AI techniques to healthcare. "Rappaport [CEO] won't disclose the secret sauce of the company's technology; however, he said, it's a 24/7 process in computing that connects valuable pieces of information together, such as linking one document that explains symptoms of a disease to another document with analysis of a therapeutic drug for that disease."
AI is also being used at Riya to find photos by matching on characteristics - density, patterns, colours.
Microsoft Researchers Inventing New Techniques to Improve Search Engine Accuracy and Relevance --
Papers presented at the 2006 SIGIR conference describe new techniques for analyzing rich patterns of user interactions with search to improve the overall search experience. (Aug 7)
Improving search seems to lie in understanding and anticipating the searcher.
"“Most search engines today use a somewhat two-dimensional approach, matching user queries with the content and link structure of Web pages to return a list of results,” said Eugene Agichtein, a researcher in the Text Mining, Search and Navigation Group within Microsoft Research. “We’re looking at how to add a third dimension — the users themselves — to improve the search experience. By examining click-through and browsing patterns across a large number of users, we are able to learn a great deal about how people interact with search technologies and can thereby improve our accuracy dramatically.”"
Web Search Engines: Part 1 and Part 2, by David Hawking, Computer - How Things Work (June 2006)
"In this two-part series, we go behind the scenes and explain how this data processing "miracle" is possible. We focus on whole-of-Web search but note that enterprise search tools and portal search interfaces use many of the same data structures and algorithms."
"Part 1 of this two-part series (How Things Work, June 2006, pp. 86-88) described search engine infrastructure and algorithms for crawling the Web. Part 2 reviews the algorithms and data structures required to index 400 terabytes of Web page text and deliver high-quality results in response to hundreds of millions of queries each day."
Researchers look to semantic Web to drive Internet -- "Computer scientists discuss ideas for organizing the Internet's growing mass of data" By Jeremy Kirk, IDG News Service via Infoworld (May 24)
The idea of the semantic web is still strong, although it won't be accomplished by adding keywords to metatags. Labelling is still needed, but people have new hopes on how that might be done.
"Labeling information on the Internet involves tagging it with code and then classifying it into a taxonomy. Customized taxonomies and ontologies, or data models, could be created for different subject matters to connect disparate, rich information tucked away on servers.
It's an approach that differs vastly from current search engine technology, which may be able to find all instances of a keyword and rank a document's popularity but not interpret the context. "
I, for one, am not holding my breath in expectation that such taxonomies will be developed and used.
User Behaviour and Google Site Profiles By Jim Hedger, Search Engine Guide - April 17, 2006
There is a sense that Google is relying less on link analysis for ranking results and more on what it figures out about user behaviour and preferences.
"The term “user behaviours” describes any number of actions taken by people while using a Google branded search tool, while visiting a particular site in Google’s index, and while moving from site to site or document to document.
Basically, Google wants to know what its users like and dislike. Those user-judgements have become important factors in how Google ranks sites in its index and in personalized search results shown to registered users. "
Google Talks the Talk for Search by Ben Charny, eWeek (Apr 12) - Will we be able to ask Google questions by voice, talking to our computers or through a cell phone? "Google co-founder Sergey Brin and three others on April 11 were granted a U.S. patent for technology to let the human voice command Internet search engines."
Search Engine Meeting Caters To Serious Seekers by David Gardner, Information Week (Mar 9) The Infonortics Search Engine Conference will be held in Boston from April 24-25.
"The heart of the conference, Collier said, is still centered on new and offbeat developments, some of which are likely to become mainstream in the future, and he noted that Google's participation this year won't deflect from the conference's main objective of getting search freethinkers and pioneers together."
A Search Engine For Every Subject. "Google and Yahoo rule, but a flock of upstarts is offering new ways to find info". Business Week Online (Feb 20)
There has been a boom in startups that are seeking to change search from one-box one-million hits. Instead people might use specialized engines.
"Instead, people may use several different search engines, each tailored to a specific task. One might specialize in blog postings, another in video clips, and a third in general information. The shift may look like the evolution of TV, from a world dominated by the Big Three networks to one in which hundreds of cable channels specialize in topics from cooking to history. "People are looking for targeted, specific information that search engines can't provide," says Michael Yang, CEO of Become.com, a search engine focused on Internet shopping."
"Social search" is one angle. Niche engines with a narrow focus are another, already in use for shopping, real estate, health, and several other areas.
Cos. Tackle Online Searches at Conference by Matthew Fordahl, AP Via Yahoo News (Feb 9) DEMO Tech Conference saw three new search engines analyse content.
- Plum "lets users group Web pages, e-mail, music, pictures and files from their desktop computers into online collections that can be kept private or made public for others to find."
- Kaboodle for shopping
- Riya for searching photos.
Search engines to be key technology in 2006: Report by Jack Kapica, Globe Technology (Feb 1)
Deloitte's Technology, Media and Telecommunications Predictions 2006 sees search as being an increasingly important technology as digital content increases.
"The reason for the rising importance of search engines is the increase of the volume of digital content on-line — as much as 20 billion gigabytes in 2006 alone. Search tools will be needed to sift through such a volume of data. Searches will also extend to include data held on devices such as PCs, mobile phones, digital cameras and personal video recorders."
Expect changes -- "Technology will change people's behaviour, in the same way MP3 players now enable owners to carry their entire music collection wherever they go, game consoles created a new leisure category, and mobile devices and broadband connectivity have made working at home a reality, Deloitte says."
Europe's 'Google killer' goes into hiding -- Project to launch a European search engine imposes 'news blackout' to avoid scrutiny -- By James Niccolai, IDG News Service (Jan 13)
Thomson didn't like all the news coverage surrounding its search engine Quaero and has shut down the web site. "It was unclear how far the work has progressed, but it seems unlikely that users will be searching the Web with Quaero any time soon. The participants are still determining how they will divide up and manage the various parts of the project, according to one source. And Waibel suggested that some of the language technologies he is working on may be years away."
Quaero, a new search engine being developed in Europe, is getting press, but what is there to see? This project to provide multimedia search solutions is being billed as a future challenger to Google, but there is nothing to see at the moment. The Quaero home page is blocked by a password sign-in (presumably Thomson, the owner, will remove that), and the technology is still under wraps. One thing: it will need to buy a domain, since Quaero.com is already in use.
Quaero, the European Developed Multimedia Engine, Gets Press Attention - overview from Gary Price, Search Engine Watch Blog (Jan 11)
European Tech Giants Craft Search Engine by Angela Charlton, AP via Washington Post [registration] (Jan 11) - expresses some doubt about ultimate success -- "Quaero is the latest in a string of largely French-led efforts to compete with America's dominance of the global marketplace, a theme of Chirac's foreign policy."
Search is About Communication Aaron Wall, SEO Book (Jan 6)
Suggests that one good way to improve search results is to get more information from other people, whether intentionally or as part of the system. States that, "Many of the major search and internet related companies are looking toward communication to help solve their problems. They make bank off the network effect by being the network or being able to leverage network knowledge better than the other companies." There are several examples given for the major search engines showing this direction. The technical apparatus now in place for ranking may break down under the weight of the web's size and of spamming. Sites will thrive if they can build relationships with people.
Google's first Newsletter for Librarians was published on Dec 19 on the topic of How does Google collect and rank results?. Nothing new here but the explanation is clear and would help new searchers understand the principles.
Information Extraction: Distilling Structured Data from Unstructured Text by Andrew McCallum, University of Massachusetts, Amherst, ACM Queue vol. 3, no. 9 - November 2005 -- describes information extraction techniques.
"Information extraction ... is the process of filling the fields and records of a database from unstructured or loosely formatted text. Thus (as shown in figure 1), it can be seen as a precursor to data mining: Information extraction populates a database from unstructured or loosely structured text; data mining then discovers patterns in that database. Information extraction involves five major subtasks (which are also illustrated in figure 2):"
The article includes some examples such as ZoomInfo.com for extracting information about people from Web sources, CiteSeer.org for citation information from academic papers, and FlipDog.com for job openings.
Comments on the accuracy of automated extraction, and looks to future developments. Concludes that methods for information extraction will be critical in being able to access what we need in an ever growing mass of data.
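The core idea, filling database fields from loosely formatted text, can be shown with a deliberately tiny Python sketch. The hand-written pattern, the field names, and the sample sentence are all assumptions made for this example; it is a toy stand-in, not the techniques the article describes.

```python
import re

# Toy pattern: "<First Last>, <lower-case title> of|at <Capitalized Company>".
PERSON_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+), "
    r"(?P<title>[a-z][a-z ]*?) (?:of|at) "
    r"(?P<company>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)

def extract_people(text):
    """Return one record (dict) per person mention matched in the text."""
    return [m.groupdict() for m in PERSON_PATTERN.finditer(text)]

# Made-up text; the people and companies are fictional.
text = (
    "Speakers include Pat Rivera, president of Example Arts, "
    "and Ann Jones, director of research at Sample Labs."
)
records = extract_people(text)
for record in records:
    print(record)
```

Once the records exist as rows with name, title, and company fields, they can be queried like any database, which is the step that makes data mining possible afterwards. A pattern this rigid misses most real phrasing, which hints at why accuracy is the central concern.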
Eurekster Introduces Swickis - Community-Powered Search Engines for Personal and Small-Business Websites; Swickis Are a Powerful New Way to Improve Search Relevance and Advertising Revenue by Harnessing the Knowledge of Online Communities, Business Wire via Marketwatch (Nov 16)
Eurekster has developed a new search engine it calls a Swicki to be used on individual web sites. "Swickis automatically learn from search behavior, without collecting or identifying individual user information, to deliver content and advertising that is highly relevant and valuable to a specific community. "
"Publishers are invited to create their own swickis -- free of charge -- with the Eurekster SwickiBuilder at http://swicki.eurekster.com, and can opt to share in the search-related advertising revenue, a feature that will be available soon."
It's the Google of police tools "Canadian experts invent search engine to find, track down terrorists" by Sarah Staples, The Ottawa Citizen (Nov 17) [Thanks to LT for this story.]
Defence R&D Canada has developed a new search engine called TerroGate for tracking down references to terrorism in documents. At present it works on documents that have already been collected, but it is to be rolled out to analyze web content and eventually real-time news feeds. The algorithms work with the "vocabulary of terrorism" on five main themes: terrorist tactics, weapons, locations, targets, groups and individuals. Researchers identified 3,000 terms that are exclusively related to terror.
"TerroGate melds two emerging search trends. An "entity extraction" component sifts through documents tagging relevant words for easy retrieval. And the system is one of a handful in the world capable of performing "conceptual" searches, which don't merely hunt for keywords the way Google or Yahoo do, but also notions more vaguely associated with the keyword."
The software grew out of a project by the University of Sheffield, in England, on "entity extraction" done in the mid 1990s.
There are two commercial systems - "AeroText, by a subsidiary of Lockheed Martin, and ThingFinder, by Inxight Software, Inc., which is used by the U.S. Defence Department and the U.S. army -- but they only annotate generic proper or place names in a document."
Plans for TerroGate include:
- "incorporating link analysis software that analyses relationships between references to terrorism in different documents."
- web crawlers
- displaying results in map form.
- languages other than English
The Google Story: An Excerpt "Chapter 26: Googling Your Genes" Washington Post (Nov 14) [subscription] - Excerpt from the Google Story by David A Vise and Mark Malseed - goes on sale on Nov 15.
Reveals that Google founders, Sergey Brin and Larry Page, hope to "empower millions of individuals and scientists with information that will lead to healthier and smarter living through the prevention and cure of a wide range of diseases". Specifically describes a project involving biological and genetic research.
Ambient Findability: Libraries at the Crossroads of Ubiquitous Computing and the Internet By Peter Morville, Online (Nov / Dec)
Peter Morville, author of Information Architecture for the World Wide Web, has a new book - Ambient Findability.
"I envision a future of ambient findability in which we can find anyone or anything from anywhere at anytime. At the heart of this brave new world is a library, or rather a multitude of libraries, that help us find what we need, whether the objects sought (and the libraries themselves) are physical, digital, or in between."
Autonomy's Consumer Division Announces Creation of Conceptual Index of World Wide Web Press Release (Oct 25) - This would be something to see - "brings next generation retrieval features to the web, including conceptual clustering, implicit query, video search and Autonomy's unique Automatic Query Guidance (AQG). Autonomy's AQG automatically returns categories of results based on the meaning of the query, providing an easy navigation facility directing users to the results they require based on a conceptual and contextual understanding of their query." But doesn't look like this will be public - it's intended for enterprises.
Quintura Search - Pandia Search (Oct ) Quintura promises "revolutionary web search software". Software will use dynamic clusterization and semantic maps.
Surfwax Offers Look-Ahead Technology for Web Sites Gary Price, SearchDay (Sept 19) -- "Today SurfWax is introducing a dynamic query suggestion tool that can be easily installed and customized on any web site." Price says that "Technology like this has the potential to save a user a large amount of time and aggravation by helping create a more focused and precise query, thereby getting better results. It can also help when a searcher enters general terms when they're looking for something specific."
A Beautiful, Networked World? Sap Info (Aug 29)
"In a conversation with Andreas Blumauer, project manager at the Semantic Web School in Vienna, SAP INFO online illustrates how far the vision of Berners-Lee, the founder of the Web, has already taken form in the real world and the actual advantages of the Semantic Web of the future – far from any technological infatuation."
On the Frontier of Search by Terry McCarthy, Time magazine (Aug 28) -- Predicts a future where search engines are "smarter and more tailored to the individual, embrace video and music--and be accessible from any device with a chip."
+ Singingfish for image and video is mainstream. But Viisage will recognize faces.
+ Cell phone search facilities to find local services as you walk down the street and even give information on an object you've just pointed the camera-phone at (from Mobot).
+ KnowItAll for getting answers.
+ More tagging and finding through tagging.
+ Blinkx.TV for tracking down video clips.
+ Satellite online maps - Google, MSN, A9.
+ Personalized starting with Findory for news, and now adopted in the new Google desktop.
Autonomy positions itself for content wars By Maija Palmer, FT.com (Aug 28) -- Autonomy, noted for its technology for handling unstructured content, is working with the Chinese to create a service for searching news and video. Prior to this Autonomy had been working with Blinkx for video search.
"Where Google and Yahoo rely on having video clips manually catalogued and tagged so that they can be searched using key words, Autonomy uses voice recognition software – also used by the US Department of Homeland Security to eavesdrop on terrorists – which automatically catalogues every spoken word in hundreds of thousands of hours of footage."
Friday Book Excerpt: More on Perfect Search by John Battelle's SearchBlog (Aug 19) - preview of final chapter of book about search - Search Everywhere.
E-Gang - Eight Masters Of Information Edited By Elizabeth Corcoran, Forbes (Aug 18) - This seventh E-Gang review by Forbes presents "the Masters of Information--those entrepreneurs and companies figuring out how to separate the gold from the gravel on the Web."
+ Barry Diller, IAC/InterActiveCorp which now owns Ask Jeeves.
+ Caterina Fake and Stewart Butterfield - created Flickr for sharing photos
+ Jeffery Jonas - IBM Entity Analytics
+ Ellen Siminoff - Efficient Frontier for picking keywords for online ads.
+ Peter Norvig - Google's director of search quality
+ Jimmy Wales - father of Wikipedia. It has 2.2 million articles in 100 languages.
Entry about Peter Norvig mentions clustering - "Now Google's statisticians develop algorithms that look at how closely one query links to another and how groups of queries interact. Studying word "clusters" helps determine whether a search term like "Blondie" means the comic strip or the punk-pop band from the 1980s. Norvig's crew also aims to accelerate results by learning which irrelevant words (like "like") to discard when indexing a Web page."
Diving deep into the Web by Michael Bazeley, Mercury News (Aug 17)
At Glenbrook Networks, the Komissarchik father and daughter team are developing a search engine that will do "custom data extraction" from databases that standard search engines can't touch.
"Komissarchik and her father, Edward Komissarchik, say they have figured out how to analyze the forms on Web pages and understand the type of information the sites are looking for. Then, Glenbrook's Web crawlers use artificial intelligence to walk themselves through sometimes complex Web forms, answering questions, such as the location of their desired job, in the same way a human would."
What's Cooking in Search Engine Labs by Chris Sherman, SearchDay (Jun 30) - Lists the various labs at the major search engines and blogs that discuss developments in search features. Google, Microsoft, Yahoo, Ask Jeeves are here as well as CiteSeer for computer scientists in academia and, at the opposite end, Shopzilla's Robozilla for developments in shopping.
Louis Monier On Why He's Going To Google by John Battelle (June 24) - Louis Monier was one of the great search minds at Altavista when it was the best. After four years at eBay, he is moving again, this time to Google. In explaining the move he said, "So rather than chewing on variations of e-commerce for the next few years, I'm very tempted to play with radically new stuff: satellites images, machine translation, ways to extract knowledge from giant bodies of data ... who knows what else?" This might give us some hints about what Google will be doing.
MSN Search and Learning to Rank by Greg Linden, Geeking with Greg (June 21) - translates into layman's language a paper written by Microsoft's Chris Burges (and others) about using neural networks in relevance ranking - as it seems Microsoft intends to do.
Also see Danny Sullivan's comments in MSN Search Gets Neural Net/RankNet Technology & (Potentially) Awesome New Search Commands. MSN Search may have adopted new ranking algorithms to improve its search, but on the search that Sullivan ran, Google, Yahoo and Ask Jeeves were just as good.
Google's War on Hierarchy, and the Death of Hierarchical Folders
by John Hiler, Microcontent blog (May 10)
Finds that hierarchical organization of information (subject trees or taxonomies) is under attack by the believers of keyword searching and in particular Google. Google, Hiler finds, is anti-hierarchical - witness the lack of folders in GMail and Google Desktop Search.
Article reviews the history of web hierarchies starting with Yahoo Directory, Looksmart and the Open Directory Project (ODP). Google's page-ranking algorithm based on linkages vastly improved relevance (at least for a time), and people left the directories in droves to use Google (tho we should remember that Altavista was a strong search engine then too). In March 2004 Google sidelined its use of ODP and, according to Hiler, killed directories.
"As Google's Director of Search Quality put it, "We analyzed what people were using, and [directories] become less popular over time. As the web grows, directory structures get harder [for consumers] to use.""
The last is an interesting statement. I don't think directory structures get harder to use at all, and in a world of unmediated search results, some classification is an aid for providing context. However, it is true that manual classification is very labour intensive.
Hiler reviews the history of folders used for organizing email in Outlook, Hotmail and other web mail programs. And then came Google's GMail where the bins have been reduced to just an inbox and an archive (tho you can add labels). You keyword search for "conversations".
Desktop is the third area. People could relate to the filing cabinet metaphor but who could find the right drawer? Desktop search from Google, MSN and others makes it much easier to search across folders - in fact to ignore folders. Hiler doesn't mention, though, that you might wish to restrict the indexing to specific folders.
Article concludes -- "But Folders rarely solve the core problem that they address - and often create new ones, like forcing you to create new folders just to manage new information. Solutions like Search, Archives, Stars and Labels get more directly at the core problem... and promise that the future of information management will look very different from its past."
Enough Keyword Searches. Just Answer My Question by James Fallows, New York Times (June 12). James Fallows finds - "Search engines are so powerful. And they are so pathetically weak." He describes the difficulty of determining the right keywords to find information on changes in California's spending on its schools - and of "trying to outguess the engines". How much better it would be to use something like Aquaint, a project by US federal intelligence bodies, that will handle "advanced question answering for intelligence".
Fallows mentions several engines whose added features he does appreciate - Ask Jeeves for broadening and narrowing results and offering suggestions, Vivisimo for categorizing, Grokker for visual presentation, and his favourite, Mr Sapo, "because it allows quick, easy comparisons of the results of the same search on virtually any major engine."
While I appreciate his frustration with word guessing, this would be an occasion for bringing in an information professional who knows how and where to look for statistical and specialized databases, free and for-fee. Using Google or any other general purpose search engine with or without search aids would only find some bits and pieces on this question.
In the article he also endorses Roboform for handling the myriad of usernames and passwords. And mentions that Google Maps' satellite views of some places in the US are camouflaged: the vice-president's compound in Washington DC (though not the White House) and downtown Albany.
Longhorn goes beyond search By Rafe Needleman, CNet Reviews (June 1) - Advance look at the user interface for Longhorn, the next Windows operating system. Describes folders, search, tags, and visual aids and mentions that many of these are available as add-ons today.
Sees a future of -- "I'm betting that contextual and audio/visual searching can't be far behind. And at some point in the future, we'll be able to search for documents on our hard disk "about rent" without having to match search terms, or direct our system to find pictures of Grandma given just one picture of her, or find orchestral-sounding music given a sample of it. Whether Microsoft ships these tools first is an open bet; but I'd wager that this is what Google, Yahoo, Apple, and other search companies will try to do to stay ahead."
Also, the next generation of tools will have to be able to handle "digital assets" in general - on mobile devices, online services, digital media - not just the home computer.
SearchTHIS: Clutter, Relevancy, & Search - by Kevin Ryan, IMediaConnection (May 10) - there is so much going on in search with new services for multimedia, new and varied applications for social groups, and smart answers, (to name a few) that Ryan asks -- "With all of this activity, ... Just how thick with relevant results can a search engine results page become before relevancy gives way to clutter? How will the searching public react to all of these changes? History has taught us that clutter equals disaster in search, and we might just have to take a breath before integrating everything but the kitchen sink into search."
If Search Engines Could Read Your Mind by Chris Sherman, SearchDay (May 11) - Artificial Intelligence is almost here for search. Sherman tips us off to 20Q.net, a program that asks 20 questions about an object you think of and can often guess the object. It's based on neural networking, as Sherman explains.
"To a certain degree, search engines already employ similar systems. Just as 20Q.net starts out with broad questions (is it animal, mineral, or vegetable) to "prune the tree" of possible branches, search engines do the same thing with the few clues offered by your search terms, eliminating thousands or millions of possibilities before even considering possible matches."
Presentations from the 2005 Search Engine Meeting are Now Available Online - listing of presentations from the Infonortics Search Engine Conference held in April 2005. This is always an excellent conference with analysis and a view to the future. Gary Price has picked out the presentations most relevant to search. Full listing is at Search Engine Meeting 2005.
Search Engine Watch Forums has a threaded discussion about the future of search and indexing as visualized by the Microsoft Research project, Stuff I've Seen (SIS). In particular, it points to a presentation by Susan Dumais delivered to the Infonortics search engine conference.
http://www.infonortics.com/searchengines/sh05/slides/dumais.pdf (April 2005)
PC Users Drowning in Data, Microsoft Says - Ted Bridis, AP in Globe and Mail. (Apr 26)
"Computer storage technology is getting so cheap a person could record every conversation of a lifetime and decades of photographs, but experts must improve search systems so users can make sense of such mind-boggling amounts of information, Microsoft's top research executive said Tuesday."
Search Engine Algorithms & Research By Christine Churchill, SearchDay (April 14) -- a peek into the ways the algorithms work to rank results.
+ Ask Jeeves / Teoma: "Lahiri confirmed that Ask Jeeves looks at the web as a graph and looks at the link relationships between them, attempting to map clusters of related information. By breaking down the web into different communities of information, Ask Jeeves can rely on the "knowledge" from authorities in each community to better understand a query and present more on-topic results to the searcher. If you have a smaller site, but one that is very relevant within your community, your site may rank higher than some larger sites that provide relevant information but are not part of the community."
+ Co-occurrence: identify and use semantic associations between terms.
+ Future: "introduction of probabilistic latent semantic indexing and probabilistic hyper text induced topic search"
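As a rough illustration of the co-occurrence idea in the list above, the Python sketch below counts how often pairs of terms turn up in the same document. It is not the method any of these engines actually uses; the sample documents and the function name cooccurrence_counts are made up for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered pair of distinct terms, how many documents contain both."""
    counts = Counter()
    for text in documents:
        # Use a set so a term repeated within one document is counted once.
        terms = sorted(set(text.lower().split()))
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts


# Hypothetical mini-corpus, for illustration only.
docs = [
    "jaguar car dealer",
    "jaguar cat habitat",
    "car dealer prices",
]
counts = cooccurrence_counts(docs)
print(counts[("car", "dealer")])  # pairs are stored in alphabetical order
```

Pairs that share many documents ("car" and "dealer" here) are candidates for a semantic association; a real system would normalize for how common each term is.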
Yahoo focuses on research CNet.com (Apr 12) - Competition in the search labs at Google, Yahoo and MSN. Yahoo has hired Usama Fayyad from NASA to head up the Yahoo Research Labs. Of interest, Gary Flake who used to be principal scientist at Yahoo Research Labs has moved to Microsoft.
"Yahoo's lab will be developed into a center for innovation with scientists from all over the world, the company said. The lab will tackle scientific problems in search and information navigation, personalization and mobility, Yahoo said. It also will work on designing algorithms to support new technologies."
Yahoo Next is Yahoo's showcase for new tools.
Seroundtable.com noticed that Google is Showing Dynamic Titles. That means the title it shows for a page may vary with the search terms you use. The example given was rustybrick. Search for rustybrick alone and get the home page with that as the title. Search for rustybrick web - get another title, in fact the actual title of the page. The first title comes from the entry in the Google Directory for Rustybrick.
Shows that Google is using its directory for something. But what happens if you search for the word in the title -- intitle:rustybrick? You get neither version, although Rustybrick does show as a Sponsored link.
The Evolution Of Web Search by David M. Ewalt, Forbes (Apr 6) -- Future of search is in the Semantic Web - tagging and identifying relationships.
"A more familiar example of tagging might be Froogle, Google's comparison shopping service. Retailers who want their Web sites to show up in Froogle searches have to update their product pages with hidden labels on things like price, name and manufacturer. Everyone uses the same tag for price, regardless of what they actually call it, so Google can easily collect product information from thousands of different stores, even if they're in different languages. "
A Conversation with Tim Bray in ACM Queue vol. 3, no. 1 - February 2005 -- "Searching for ways to tame the world’s vast stores of information".
Tim Bray, co-founder of Open Text, is director of Web technologies at Sun Microsystems. In this interview he talks about his work with the OED (Oxford English Dictionary) project at the University of Waterloo, Open Text, use of SGML and development of XML and RDF.
Search For Tomorrow - by Thomas Claburn, Information Week (Mar 28) -- "Google may lead in Web searches, but investment in emerging technologies will open up new ways of searching digital information. Part 3 in the series The Future Of Software"
This is the future: "Google may have the market lead looking for Web pages, but fast-growing business and government investment in emerging IT areas such as Internet phone calls, electronic medical records, and anti-terrorism technology is driving demand for new ways of searching digital information. The goal is to extract information from databases, Web pages, documents, or audio and video clips automatically; recognize the names of people, places, organizations, dates, and dollar amounts; and find the relationships among them. Mining sounds and images for meaning is also important as companies expand call centers and switch to Internet-based phone calls and as the government pours money into IT for intelligence and homeland security."
Yahoo's game of photo tag -- Stefanie Olsen, Cnet (Mar 22) -- Discusses the free tagging of photos at Flickr, the online photo sharing service that Yahoo just bought, and the possible expansion of such "folksonomies" to a "global categorization of information".
""The future of folksonomies involves meshing these user-generated categorizations with more standardized categorizations, such as the Library of Congress or the Getty Thesaurus of place names, so you could start to connect data to allow more of these associations to be made," Merholz [Peter Merholz, a founder at Adaptive Path] said."
Next big step for the Web--or a detour? by Paul Festa, ZDNet (Mar 9)
Speakers at the Semantic Technology Conference discussed whether enterprise applications for the Semantic Web will be the next wave.
"Just as the Web encompassed existing Internet technologies while adding its revolutionary system of hyperlinks, so, they claim, will the Semantic Web give birth to vastly more powerful ways of gleaning information from the world's computer network."
First they have to sell the concept -- "The Semantic Web protocols aim to let computers distinguish different kinds of data. Armed with those distinctions, applications could more automatically trade information, for example between an online address book and a cell phone. A Web site could automatically reconfigure itself on the fly based on the needs of a particular visitor. Search engines could narrow down results with greater precision."
Article points to a few real-world implementations of the Semantic Web. But a world of interchangeable data does come with concerns about security and privacy.
Folksonomies - Cooperative Classification and Communication Through Shared Metadata by Adam Mathes, at Computer Mediated Communication, University of Illinois Urbana-Champaign (December 2004) - examines the user classification done at services like Furl.net, Flickr, and del.icio.us. Argues that "The primary problem with this approach is scalability and its impracticality for the vast amounts of content being produced and used, especially on the World Wide Web." On the other hand, involving users in organizing information may mean picking up new terms earlier and patterns of use.
Slashdot has an entry on Google's Technology Explored (Mar 3) gleaned from several articles.
Especially Peeking Into Google By Susan Kuchinskas, InternetNews (Mar 2)
Of interest: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
Google is also applying machine learning to know that one thing can relate to another even though there isn't an exact match on words. Clustering is part of the process.
"To do this, the system tries to cluster concepts into "reasonably coherent" subclusters that seem related. These clusters, some tiny and some huge, are named automatically. Then, when a query comes in, the system produces a probability score for the various clusters. " Google uses this for its contextual ads and to cluster news stories in Google News. Now - if they would just add it to web search results.
Net searcher has its ears to the blog: Faster information on trends promised by prototype tool, by Sarah Staples, CanWest News Service (Feb 28) -- Accenture Technology Labs, in Palo Alto, Calif., has a tool - Online Search -- that "focuses on several thousand influential sources of online news and gossip that have traditionally been less accessible to search algorithms - from chat rooms and bulletin boards, to Usenet groups, fan sites and blogs written by amateur scribes. From those, it identifies hot topics and monitors people's positive or negative reaction to the next new thing."
"In conversation with..." Jim Lanzone & Apostolos Gerasoulis of Ask Jeeves/Teoma by Mike Grehan, e-Marketing News (Feb 2005) -- In conversation with Jim Lanzone, Senior Vice President, Search Properties at Ask Jeeves and Apostolos Gerasoulis, founder of the Teoma search engine. Lots of gems in this. Grehan undertook this conversation, reproduced here as a transcript, as part of his research for a book on search engine marketing.
+ Reviews ranking technologies. Grehan refers to another research paper he wrote about use of link analysis in Google's PageRank and the topic clustering done by Teoma based on Kleinberg's algorithm. Gerasoulis says that Google is only using its PageRank to break ties -- "The importance has diminished because PageRank is just one piece of the ranking algorithm over there. The ranking algorithm is so much more complex now. And PageRank is just used when they want to break ties."
+ Ask Jeeves doesn't intend to absorb Excite, iWon and MyWay, but it might switch these portals over to the Teoma search.
+ Is it wise for Yahoo to index XML feeds and web sites? These three men say not. "It's mixing apples and oranges, the structured data with the unstructured."
+ "Majority of searches on the web are non-commercial".
+ Gerasoulis expects 2005 will be an exciting year for search engines. "Now it's not just about communities, it's about the users. There are new technologies coming in which will change the way that people access information."
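Gerasoulis's remark that PageRank is "just used when they want to break ties" can be pictured with a small Python sketch. Everything in it is hypothetical: the field names relevance and link_score, the numbers, and the sort rule are assumptions for illustration and say nothing about how Google's ranking is really built.

```python
def rank_results(results):
    """Order results by relevance (highest first); use the link score only to break ties."""
    # Negating both values sorts them in descending order. The second key
    # matters only when two results have exactly the same relevance.
    return sorted(results, key=lambda r: (-r["relevance"], -r["link_score"]))


# Made-up results: two pages tie on relevance and differ only in link score.
results = [
    {"url": "a.example", "relevance": 0.9, "link_score": 0.2},
    {"url": "b.example", "relevance": 0.9, "link_score": 0.7},
    {"url": "c.example", "relevance": 0.95, "link_score": 0.1},
]
ranked = [r["url"] for r in rank_results(results)]
print(ranked)
```

Here c.example comes first despite its weak link score, and the link score decides only the order of the two tied pages.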
Web searching made more successful with automated, personalized assistance system - from Penn State, PHYSOrg.com (Feb 18) -- search software in the future might give better advice by watching what the searcher does.
Of interest >> "A Penn State researcher has developed software that improves Web searching with a personalized system that offers automated assistance for structuring and refining queries, evaluating search results and finding more relevant information. "Research shows 50 percent of all Web results retrieved are not relevant, pointing to a need for improved searching techniques," said Jim Jansen, assistant professor of information sciences and technology. "This technology enabled a 20-percent performance increase.""
Google Watchers See Shift In Algorithm by Shankar Gupta, Online Media Daily (Feb 22) -- Signs that Google has changed its relevance ranking algorithm -- "... new formula appears to give more weight to sites that have content, not just sponsored links and a navigation bar. And Google apparently now evaluates the anchor text to determine if it's related to the site content, or is just the same word over and over again--in which case the site's rank would fall."
In search of more: the ‘friendly’ engines that will manage the data of daily life By Richard Waters, FT.com (Feb 1) Futuristic view of what searching may become.
Of interest -- "Users will want more direct responses to their search queries, the experts acknowledge. "The biggest change we will see in the next five years will be in the way people use computers," says Mr Silverstein. Mobile handsets will become the most common way to find information on the internet, he adds. At that point, most queries would best be made and answered by voice. If the search companies become a more integral part of everyday life, how far will their influence eventually extend - and what impact will they have on other companies that exist to create or distribute information?"
Seeking Better Web Searches "Deluged with superfluous responses to online queries, users will soon benefit from improved search engines that deliver customized results" By Javed Mostafa, Scientific American (Jan 24) Sweeping article about the trends in search, starting with a review of the ranking algorithms, personalization initiatives and the potential for full customization that will include location, plus advances in searching for images and music.
Conclusion: "By leveraging advances in machine learning and classification techniques that will be able to better understand and categorize Web content, programmers will develop easy-to-use visual mining functions that will add a highly visible and interactive dimension to the search function. Industry analysts expect that a variety of mining capabilities will be available, each tuned to search content from a specialized domain or format (say, music or biological data). Software engineers will design these functions to respond to users' needs quickly and conveniently despite the fact they will manipulate vast quantities of information. Web searchers will steer through voluminous data repositories using visually rich interfaces that focus on establishing broad patterns in information rather than picking out individual records. Eventually it will be difficult for computer users to determine where searching starts and understanding begins."
Some bits are turning up about clustering search results including a paper written about why and how to do it -- Learning to Cluster Web Search Results, Microsoft Research Asia. Paper finds that current clustering approaches don't produce good labels, and they propose a new method that uses and ranks "salient names".
"Our method is more suitable for Web search results clustering because we emphasize the efficiency of identifying relevant clusters for Web users. It generates shorter (and thus hopefully more readable) cluster names, which enable users to quickly identify the topics of a specified cluster. Furthermore, the clusters are ranked according to their salience scores, thus the more likely clusters required by users are ranked higher."
Other experimental bits are mentioned in Web Search Clustering from Microsoft (and other Clustering Tools) Search Engine Watch Blog (Jan 11)
A Look Ahead by John Battelle (Dec 22) - predictions for 2005: the blogosphere will get more fractious; Firefox will win over 15% of the browser market but Microsoft will release a good upgrade; Yahoo and Google will do even more for merchants - and several more.
What’s Next for Google By Charles H. Ferguson. Technology Review (Jan 2005) Sees that the "search industry is ready for an architecture war" -- "Architecture wars (also known as standards wars) occur because information technology markets require standards in order to manage complexity, communication, and technological change." Google and Microsoft are the main contenders. Examines strategies, past and present, of each and observes Google to be in the more precarious position. Shareholders, take note.
Google Now Indexing Up to Six Url Variables Search Engine Roundtable (Dec 7) Google has been seen spidering URLs that contain 6 variables, showing that it is getting better at penetrating into databases.
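To make "six variables" concrete, here is a minimal Python sketch that counts the query-string variables in a URL using the standard library. The example URL and the function name count_url_variables are invented for illustration; how Google's spider actually decides which URLs to follow is not described in the report.

```python
from urllib.parse import parse_qsl, urlsplit


def count_url_variables(url):
    """Return the number of name=value pairs in the URL's query string."""
    query = urlsplit(url).query
    # keep_blank_values=True also counts variables with an empty value (e.g. "sort=").
    return len(parse_qsl(query, keep_blank_values=True))


# Hypothetical database-driven URL with six variables.
url = "http://www.example.com/catalog?cat=12&item=7&color=red&size=m&lang=en&page=2"
print(count_url_variables(url))
```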
Searching Smarter, Not Harder by John Gartner, Wired (Nov 30) Some organizations are constructing topic maps to categorize content and show aspects and relationships. An example given was William Shakespeare -- " ... would be mapped to essays about him, his plays and his famous quotes." Topic maps are created by computers and modified by humans. Mentions work in Europe at Ontopia, Mondeca and Empolis to develop commercial applications.
Narrowing the search November 22, 2004, By Raul Valdes-Perez, News.com -- Notes several drawbacks to the personalization of search, most particularly that it's difficult to infer interest from what people click on. He sees a better future in clustering techniques. Mind, Valdes-Perez is CEO and co-founder of Vivisimo and is responsible for the leading technology for clustering results.
"Advanced Search Techniques using Natural Language Processing" by Tony Rose, Freepint Nov 25, 2004 - overview article about work to improve information retrieval using natural language processing techniques.
Basis Technology to Enhance Multilingual Search in New MSN Search Engine Business Wire via CBS Marketwatch (Nov 17)
"Basis Technology today announced that Microsoft Corp. has chosen the Rosette Linguistics Platform to support Web searches in its new MSN search engine." ... "The Rosette Linguistics Platform uses state of the art Natural Language Processing techniques to improve information retrieval, text mining and other applications and apply them to global markets. Rosette provides capabilities like identifying the language of incoming text, providing a normalized representation in Unicode, and locating names, places and other key concepts."
A Google-Microsoft War by John Dvorak, PC Magazine (Nov 16) Predicts an all-out war between Microsoft and Google with similarities to the Netscape - Microsoft war. Who's Netscape in this competition? Google? Is Google trying to create a browser-centric online environment?
A Conversation with Matthew Koll by Gary Price, SearchDay (Oct 18) Matthew Koll, once CEO of Personal Library Software and now of Wondir, spoke to Gary Price about the state of the web search industry. Some comments:
+ Google is in the business of advertising and maybe 50% in information retrieval.
+ Searchers do need specialized tools but " knowing where to look is the first and biggest obstacle to overcome in searching."
+ Future - "voice access and task integration"
The Answer Search natural language search engine - "The Norwegian company Stochasto is getting ready to launch their natural language search engine, Answer Search, in English." Pandia (Oct 15) - looks promising but won't be available until Q1 2005.
The Meta Description Tag and Search Engines Jill Whalen. ISEDB.com (Oct 14) "The keywords and phrases you use in your Meta description tag don't affect your page's ranking in the search engines (for the most part), but this tag can still come in handy in your overall SEO campaigns."
Author tested the use of meta description tags at Google, Yahoo, Teoma, MSN.
Google - will use a snippet from the meta description tag if the search term is used in the text and in the description tag.
Yahoo - does show the meta description tag on some keyword queries depending on occurrence of words in the text (exact rules are not clear). It will also search on the tag and display the record even if the search term appears only in the description tag. And lastly, on a url search it shows the meta description tag if available.
Teoma - looks at the meta description but does not necessarily display it.
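The Google behaviour Whalen observed (the description is shown when the search term occurs in both the page text and the description tag) can be written as a simple rule. The Python sketch below is only a toy version of that observation: the function name choose_snippet, the 30-character window, and the fallback to a slice of body text are assumptions, not anything the engines have documented.

```python
def choose_snippet(term, body_text, meta_description):
    """Pick a result snippet following the rule observed in the test above."""
    needle = term.lower()
    in_body = needle in body_text.lower()
    in_meta = needle in meta_description.lower()
    if in_body and in_meta:
        # Term appears in both places: show the meta description.
        return meta_description
    if in_body:
        # Assumed fallback: a window of body text around the first match.
        idx = body_text.lower().find(needle)
        start = max(0, idx - 30)
        return body_text[start: idx + len(needle) + 30].strip()
    return ""


print(choose_snippet("widgets", "We sell blue widgets online.", "Blue widgets at low prices."))
```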
H-Bot is an "automated historical fact finder" developed at Center for History and New Media. It responds to natural language questions. When did Scott go to Antarctica? When did Louis Riel die? (But it can't tell you what Louis Riel did.) Interesting.
Reviewed by Tara Calishain -- H-Bot Answers Historical Questions (Oct 12)
Google's Web 2 Demo and the UI Plunge by John Battelle, SearchBlog (Oct 12) Reports on Google's demos at the Web 2 conference for language translation (seemed powerful), named entities and clustering.
Named entity extraction: "essentially identifying semantically important concepts and the meaning wrapped around them".
Also predicts that Google will follow Ask Jeeves, A9, and Yahoo in using search history and personal data to filter and rank search results.
Google Sets Sights on Clustering, Translation By Matt Hicks, EWeek (October 7, 2004) - Finally, the improvement we've all been waiting for. Google previewed work in clustering entities and words at the Web 2.0 conference. Unfortunately a beta version is not available yet.
Search Me: Online Search Shifts from a Navigational Tool to a Customer Service and Educational Tool Tim Carpenter, Senior Analyst, Watchfire GómezPro, Insurance Technology Online (Sep 22)
"... search has taken a different turn in the financial services industry and is being used increasingly as both a customer service and educational tool, with the goal being to answer precise questions rather than to direct users to a specific product or area of the site".
We should start seeing more FLASH (swf) files in search results now that Macromedia has made it easier for search engine spiders to read and index the files. Major search engines are said to have adopted the patch provided by Macromedia.
Search Engines Can See the Movies - Macromedia FLASH SDK Internet Search Engine Database (Sept 20)
Web IR & IE - Information Retrieval and Information Extraction Has publications, mailing lists, newsgroups, and names of people active in this area.
Reviewed by Chris Sherman in Search Engines 201 (Sept 13)
Kozoru wants to give relevant answers to your questions Lars Iselid, Pandia (Aug 22) John Flowers hopes to create a natural language search engine - Kozoru - by building up a knowledge database. Good luck.
Next-generation search tools to refine results By Michael Kanellos, CNET News.com (Aug 9)
Report from New Paradigms for Using Computers Conference, held at IBM's Almaden research lab. New ways for searching for information will involve connections either assigned (classification) or discovered (latent). Mentions work by University of California at Berkeley on Flamenco for searching art and antiques that uses faceted classification. Also Inxight's software to find connections between people and institutions according to information on the Web. There are also the many projects to index the desktop especially the MyLifeBits by Microsoft. Predicts the end of the file system.
Has figures on amount of information in the world.
- 100 million written books
- 2 million to 3 million audio recordings
- 100,000 to 200,000 theatrical movies - 1/2 from India
So Much Information, So Little Relevance by Steve Johnson. Computerworld (Aug 2) - Consumers are more interested in receiving personalized Web services and the Web services - especially search - are interested in presenting the right advertisements (if not the right search results). Collaborative filtering was an early approach used for recommending music and books but it is notably error prone. Attributized Bayesian Choice Modeling (ABCM) is better at understanding why people like the content. Still, it is not for every web site. Companies must know when personalization will be most useful.
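Collaborative filtering, as mentioned above, recommends items that similar users liked. A minimal sketch, assuming a simple overlap count as the similarity measure (the function name and the sample data are illustrative and do not come from the article):

```python
def recommend(likes, user):
    """Rank items the user has not seen by how much their fans overlap with the user.

    likes maps each user name to the set of items that user liked.
    Assumption: similarity is the number of shared liked items; real
    systems use more refined measures.
    """
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared taste, so this user contributes nothing
        for item in theirs - mine:
            scores[item] = scores.get(item, 0) + overlap
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))

likes = {
    "ann": {"jazz", "blues"},
    "bob": {"jazz", "blues", "opera"},
    "cat": {"jazz", "polka"},
    "dan": {"metal"},
}
print(recommend(likes, "ann"))
```

The sketch only sees that tastes co-occur, not why, which is one way to read the article's complaint that the approach is error prone.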
Gary Price interviewed Dr Gary Flake, Principal Scientist & Head of Yahoo! Research Labs. Part 1 starts in Behind the Scenes at Yahoo Labs (June 24) Flake describes the work of the Yahoo! Research Lab and reflects on the state of web search engines -- "Today, search engines have almost no understanding of words or language in any significant way. " His intention is to get closer to the perfect engine -- "If web search were perfect, then it would produce an answer to every query that would be as good -- or better -- than if the smartest people in the world had as much time, data, and contextual information (about the user) required to fulfill the query; and it would do all of this in a split second. "
In Behind the Scenes at Yahoo Labs, Part 2 Flake discusses structured and unstructured data and the possibility of extracting implied data from pages. Personalization is an important development area - he foresees more tailoring of the relevance ranking functions.
Behind the Scenes at Yahoo Labs, Part 3 (July 7) covers a variety of topics - Yahoo! shortcuts as answers, local search, filtering out spam, and new features. Flake is certain that personalization will make the difference.
Perhaps good decisions can come from a crowd. That is the message of The Wisdom of Crowds: Why the Many are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations, a new book by James Surowiecki. According to Nigel Waters in his review of the book - A crowdy crystal ball [The Globe and Mail Book Section for July 17, 2004] -- Surowiecki "is able to show that, for certain types of problems, the group is wiser than the individual. " He cites as one example Google's ranking according to links or votes from other sites.
Contrast this to conclusions by Terrence Brooks of the University of Washington's Information School that Google's method for ranking search results replaces judgements by experts with that of the crowd. [See The Nature of Meaning in the Age of Google Sitelines (July 11)]
But if Surowiecki is right about crowd wisdom maybe Google's approach is actually better.
The nature of meaning in the age of Google by Terrence Brooks, University of Washington. Information Research April 2004
Apple unveils its answer to users' searching questions by Laurie Flynn. International Herald Tribune (June 29) Apple will be introducing an all-purpose search in its next version of the Macintosh Operating System. The search is called Spotlight and will be able to find data anywhere on the hard drive.
"GIS Enabling the Internet" By Chris Kutler. FreePint (July 1, 2004) Geographic identifiers for web sites would greatly help in localizing information. This article describes current situation where web site owners must "register" their sites by location. It looks to Dublin Core to establish location as metadata, and to the search engines to use it.
A ResourceShelf Interview: 20 Questions with Gary Flake, Head of Yahoo Research Labs (June 3) Gary Price asked what is wrong with web search today, to which Flake replied, "Today, search engines have almost no understanding of words or language in any significant way. They exploit the statistical properties of words and links, but in no way is there anything going on akin to understanding. Search engines don't recognize user intent, can't distinguish goal-oriented search from browsing search, and are completely ignorant of the subtleties of how different concepts relate to one another. Moreover, they completely lack wisdom -- i.e., they are very poor at distinguishing between trivia and something profound."
That said, sounds like Yahoo will be pushing into personalization and expanding content.
Quoting Gary Flake in part 2 of the interview: "My hunch is that personalization will be so good that most users will look back to web search circa 2004 as ridiculously outdated."
From IR to Search and Beyond ACM Queue vol. 2, no. 3 - May 2004 by Ramana Rao, Inxight Software -- History of search and information retrieval from the 1960s to the present. Describes several models and considerations. Sees a future with a "richer user model of information space".
There's a running list of Google Bombs at Google Blogscoped -- Googlebomb Watch.
What Lies Ahead For Local Search Engine Technology by Andy Beal webpronews.com (June 2004) Andy Beal spoke with Arnaud Fischer, head of the Search & Directory division at Infospace, about developments in local search. Search engines are putting their R&D dollars into finding ways to better deliver results that are specific to the area you are in - especially the advertisements. Infospace is one of these players, as a yellow pages service.
"Geo-targeting Web search content, both organic and paid, requires search engines to better understand users and queries, inferring local intent by extracting geo-signals and leveraging implicit and explicit user profiles. "
Fischer also commented on desktop search saying that, "Both Microsoft Longhorn and IBM WebFountain will eventually make search a lot more transparent and integrated to end-users' broader task-centric activities. "
SEARCH AND DESTROY by James Surowiecki. New Yorker (May 24) - about the manipulation and misuse of Google's ranking system. Considers Google bombs a prank, but search engine optimization a "racket". Prospects are not good. "Google works best when no one knows it’s there—when people are making their own decisions about which sites are useful or good". But that is no longer the case.
"The Semantic Web is Your Friend" By Libby Miller and Simon Price in Freepint (May 27) - finds that semantic web is emerging through social networking software - refers to the Friend Of A Friend (FOAF) project
Building Nutch: Open Source Search by Mike Cafarella and Doug Cutting, Nutch. ACM Queue vol. 2, no. 2 - April 2004 - about the experimental search engine Nutch and writing "an open source search engine".
Abebooks Selects Endeca - Abebooks, an online marketplace of 12,000 bookstores, will be using Endeca InFront to power search and navigation at its international sites. Endeca uses taxonomies to improve navigation. Barnes and Noble uses Endeca as well.
Gary Price said, "What I like most about Endeca is the ease with which a user can refine their results by simply pointing and clicking the refinements listed on the right side of a results page." ResourceShelf
Google Moves Toward Clash With Microsoft By JOHN MARKOFF New York Times (May 19) Google has been testing a "powerful file and text software search tool for locating information stored on personal computers." This puts Google head-to-head with Microsoft, which will have a similar function in the new Longhorn system. Microsoft's intentions seem to be to remove the need for a browser. This is Google's response -- "The disappearance of the Web browser and the integration of both Web search and PC search into the Windows operating system could potentially marginalize Google's search engine. Google, well aware of this threat, hired a Microsoft product manager last year to oversee the Puffin project as part of its strategy to compete with Microsoft's incursion into its territory." But will embedded search also mean advertising? Likely. Are there privacy issues? Yes. The article did not ask whether indexing personal files will slow down a personal computer.
Also available at IHT as Google Invades Microsoft's Turf.
Web Search: On to "Sense-Making" by Ben Elgin. Business Week Online (May 6) "IBM's Dan Gruhl and Andrew Tomkins explain how Big Blue's WebFountain technology tries to answer "why" questions"
Of interest:
"Just as intriguing, WebFountain is attempting to bring a time axis to Internet search. Today, search engines provide a snapshot of how the Web views a certain topic. But it's largely a medium without a memory. That makes it next to impossible to spot trends or easily analyze how things shift over time -- which could be compelling information. Imagine the value a marketer would get from an answer to the question: "How have mentions of my brand changed over the last six months?" "
Search engine tackles tricky lists New Scientist Print Edition (07 May 04) Work by Oren Etzioni at the University of Washington to create a search engine that can understand sentences well enough to extract lists of scientists, or botanists, or anything.
"Etzioni's ultimate aim is to have KnowItAll answer questions such as "list all British scientists born before 1900". The software cannot do that yet, because it lacks a module that can understand "natural-language" questions of this type. That will come later, he says. "
Do What I Mean by Robert Cringely. PBS.org (April 22) - MeaningMaster is a technology developed for natural language queries based on the use of a lexicon that has 200,000 words interconnected based on meaning. It has been years in the making. Article mentions that both Google and MSN are interested. First reason for interest, of course, is for contextual advertising.
Two excellent articles on problems with searching in the April 2004 issue of ACM Queue about Enterprise Search.
Enterprise Search: Tough Stuff - When searching fewer documents, shouldn't it be easier to find what you're looking for? by Rajat Mukherjee and Jianchang Mao, Verity.
Searching vs Finding: How do you help computers find the information people really want? by William Woods, Sun Microsystems Laboratories
Source: TVC Alert
Web Search for Tomorrow by Ben Elgin. Business Week Online (May 6) -
Describes some new developments in search technology to watch for:
- personalization, but it may take another couple of years to get it right.
- trend searching - taking snapshots over time. IBM's WebFountain does this but it will be a long time before the technology is available for consumer search.
- desktop search. Microsoft has the lead with "Stuff I've Seen" being tested by staff.
- better results through clustering. Vivisimo excels at this. ixMatch is mentioned - has software for corporate use.
Presentations from the annual Infonortics Search Engine Meeting held for 2004 in The Hague are now available for viewing. All will be fascinating but in particular:
The Subtle Side of Retrieval - Elizabeth Liddy, Syracuse University, New York, USA
Search and Guided Navigation for Unstructured Content? - Peter Bell at Endeca
A Holistic Approach to Search - Tuoc Luong, Ask Jeeves
Convera: Fundamental Approaches to Categorisation - Iain Fletcher
Social Software and New Search - Stephen Arnold
Human Intervention in the Search Process - Martin Belam, BBCi Search
Turbo10: The Mechanics of a Deep Net Metasearch Engine - Nigel Hamilton, Turbo10.com, UK
Google In Controversy Over Top-Ranking For Anti-Jewish Site By Danny Sullivan, SearchDay (April 25) Sullivan tells a blow-by-blow story of the appearance, disappearance and re-appearance of the anti-Jewish website, Jew Watch, in Google's top-ranked results, all the while trying to divine Google's actions. Bottom line: Google states "The only sites we omit are those we are legally compelled to remove or those maliciously attempting to manipulate our results." Controversy over ranking may be with us for a very long time.
Humans vs. Computers, Again. But There's Help for Our Side. By JAMES FALLOWS. New York Times (April 18) - the next great breakthrough in search will be of internal documents - the stuff on our own computers. Fallows looks forward to the day when there is a tool that has the clarity of Google to solve this. Microsoft says the solution will be in Longhorn. Others are trying too -- One Note, ADM, askSam, BrainStorm, Chandler, Enfish, InfoSelect, iRider, Lookout, Onfolio, TheBrain and Zoot - and many more.
Search prototype gets the picture By Michael Kanellos and Stefanie Olsen
CNET News.com (March 30) -- "Researchers at Purdue University have developed a search engine that retrieves results based on an image or a sketch." - reviews standard image search (largely based on text) and the implications of being able to search based on a sketch of a shape. Article was unaware of the work of Idée in Toronto.
Mapping Knowledge Domains is the major subject of the April 6, 2004 issue of PNAS Online - Proceedings of the National Academy of Sciences of the United States of America. Some articles are free.
Of interest:
Extracting knowledge from the World Wide Web by Monika Henzinger and Steve Lawrence - talks about communities on the Web.
The world of geography: Visualizing a knowledge domain with cartographic means by André Skupin
[Source: ResourceShelf Professional Reading]
Winning the Name Game Technology tools are helping companies monitor their reputations on the Internet. Alan R Earls. ComputerWorld (April 5) - features WebFountain and Factiva collaboration to track corporate reputations by analyzing Factiva's information sources daily. Also Biz360 - it monitors some 50,000 print, online and broadcast sources.
Susan Feldman, a Director at IDC, offered some technical explanation about how these tools work -- "The key to Factiva and some of the other reputation management offerings is text analytics ... That capability lets you look inside documents and pull out the information you need on a specific topic -- it parses the document the way you would parse a sentence in fifth grade" Tools use syntactic analysis - "It can distinguish the difference in meaning between the statements 'Bill hit Fred' and 'Fred hit Bill.'" "If you want to look for ideas rather than just words, you can store them as a block that includes the subject, object and verb relationship" ... "Then you can match those similar concepts."
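Feldman's point about storing "the subject, object and verb relationship" can be sketched as follows. This is a toy illustration only: the Triple type and the naive_triple helper are hypothetical, and a real text analytics tool would use a full syntactic parser rather than a three-word split.

```python
from collections import namedtuple

# A stored "block" holding the subject, verb and object of a statement.
Triple = namedtuple("Triple", ["subject", "verb", "object"])

def naive_triple(sentence):
    """Turn a three-word 'Subject verb Object' sentence into a Triple.

    Assumption: the sentence has exactly three words in that order.
    """
    words = sentence.rstrip(".").split()
    if len(words) != 3:
        raise ValueError("this toy parser only handles three-word sentences")
    subject, verb, obj = (w.lower() for w in words)
    return Triple(subject, verb, obj)

def same_concept(a, b):
    """Two statements match only if subject, verb and object all agree."""
    return a == b

first = naive_triple("Bill hit Fred.")
second = naive_triple("Fred hit Bill.")

# A plain keyword comparison cannot tell the two statements apart ...
print(set(first) == set(second))
# ... while comparing the stored relationships can.
print(same_concept(first, second))
```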
Gopher is still alive. This was the directory-like service used in the very early days of the Internet pre-Web for organizing and retrieving documents. There are still 250 active gopher services. Gopher: Underground Technology in Wired (April 12) - John Goerzen in Kansas is doing the most to preserve it. He sees some future for gopher in data exchange.
The nature of meaning in the age of Google by Terrence A. Brooks at the Information School, The University of Washington, Seattle, USA
Brooks argues that Google succeeds with PageRank in dealing with an unruly and wild Web. By aggregating links, it captures fairly well the "subjective sense of Web-page importance" and serves the average searcher. But it cannot extract meaning well. The article looks at the efforts to create a Semantic Web and contrasts them with "historical ambitions" such as the World Brain by H.G. Wells in 1937. The Web is unruly and won't be easily wrestled into control through metadata.
Abstract: "The culture of lay indexing has been created by the aggregation strategy employed by Web search engines such as Google. Meaning is constructed in this culture by harvesting semantic content from Web pages and using hyperlinks as a plebiscite for the most important Web pages. The characteristic tension of the culture of lay indexing is between genuine information and spam. Google's success requires maintaining the secrecy of its parsing algorithm despite the efforts of Web authors to gain advantage over the Googlebot. Legacy methods of asserting meaning such as the META keywords tag and Dublin Core are inappropriate in the lawless meaning space of the open Web. A writing guide is urged as a necessary aid for Web authors who must balance enhancing expression versus the use of technologies that limit the aggregation of their work."
From Information Research Volume 9 No 3 April 2004 published by Professor Tom Wilson of the Department of Information Studies, University of Sheffield.
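The idea of "using hyperlinks as a plebiscite" can be illustrated with the published PageRank formulation, computed by repeated iteration. This is a minimal sketch: the damping factor of 0.85 is the commonly cited textbook value, the four-page web is made up, and Google's production ranking is not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Assumption: pages with no outgoing links spread their rank evenly
    over all pages, one common way to handle them.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical web: three pages "vote" for c, and c votes for a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page c collects the most votes and passes much of its weight on to a, which is why a outranks b and d even though each of them has the same number of outgoing links.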
Beyond Googling: tech industry generating next generation of search engines by ANICK JESDANUN Canadian Press (March 26) Article mentions Mooter, Dipsie (that claims it will get past dynamic web pages), Eurekster (social networking), Superpages, Factiva.
Search tool aids browsing By Kimberly Patch, Technology Research News ( March 10/17, 2004 ) Researchers from Carnegie Mellon University have developed software that will help assess relevance of links in a search.
"The software, dubbed ScentTrails, shows a user how strongly the links generated by a Web search correlate with the topics she is searching for. The software grades the links a search engine returns by increasing the font size of links that have more connections to relevant pages. "
It's still in the prototype stage.
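The font-size idea can be sketched as a mapping from a relevance score to a pixel size. The linear scale, the size limits and the function names below are assumptions for illustration; they are not taken from the ScentTrails prototype.

```python
import html

def font_size(score, min_px=10, max_px=24):
    """Map a relevance score in [0, 1] to a font size in pixels (linear scale)."""
    clamped = max(0.0, min(1.0, score))
    return round(min_px + clamped * (max_px - min_px))

def render_links(scored_links):
    """Return one HTML anchor per (url, score) pair, sized by its score."""
    anchors = []
    for url, score in scored_links:
        safe = html.escape(url, quote=True)
        anchors.append(
            '<a href="%s" style="font-size:%dpx">%s</a>'
            % (safe, font_size(score), safe)
        )
    return anchors

# Hypothetical scores for two search results.
results = [("http://example.com/strong", 0.9), ("http://example.com/weak", 0.1)]
for anchor in render_links(results):
    print(anchor)
```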
Google to find place for Orkut network in search by Michael Kanellos. CNet (March 22) Google will do a Eurekster thing and integrate social networks into search. "Schmidt [Google CEO Eric Schmidt] said that such services are a natural complement to the sort of automated searches that Google now provides, because it allows visitors to connect to experts or at least to people with knowledge. "
Federated Searching: A Viable Alternative to Web Surfing by Barbara Fiehn. TechNews World (March 22) "A possible solution to the Google-only research approach is making its way into schools via library media center automation systems. Imagine searching your local library media center and other library collections, Web sites, and subscription databases with a single click of the mouse. " - examines the good, bad, and ugly - there are still some wrinkles to work out.
The high cost of not finding information By Susan Feldman of International Data Corporation (IDC). KM World (March 2004) There is a cost to not finding information. Fifty percent of searches are likely abandoned. Studies show that knowledge workers spend up to 35% of their time looking for information. IDC extrapolated from these studies some estimated costs of not finding information. The figures are shocking. Can technology help? Susan Feldman does name a few companies -- "Autonomy autonomy.com ClearForest clearforest.com Convera convera.com Endeca endeca.com FAST fastsearch.com InQuira inquira.com Inxight inxight.com iPhrase iphrase.com Mindfabric mindfabric.com Siderean siderean.com Verity verity.com".
First Monday (March 2004) has two articles
Finders, keepers? The present and future perfect in support of personal information management by William Jones. Looks at costs of deciding what information to keep or destroy. Wants to "Develop tools that decrease the likelihood that "keeping" mistakes are made in the first place."
Do you "google"? Understanding search engine use beyond the hype
by Eszter Hargittai - warns against drawing conclusions about search engine behaviour based on seeing Google as the most popular. Millions of people don't use it.
David Seuss, CEO of the revived Northern Light gave a presentation to Computers in Libraries about "Ten Years into the Web: The Search Problem is Nowhere Near Solved." (March 2004) [Powerpoint]
Opened with a history of access to information beginning with the Puritans in Massachusetts. Noted that there are unintended consequences to information technology. Web search began as a good thing but today we are seeing that "web search results decline with the size of the Web databases". Junk is a big problem. Also, innovations in search are driven by revenue objectives - it's all about improving advertisements. Seuss does have an alternative -- "organize content into many high quality databases for professionally-oriented Web searching".
Spotted at the ResourceShelf.
In search of the deep Web by Alex Wright. Salon (March 9)
"Today, the deep Web remains invisible except when we engage in a focused transaction: searching a catalog, booking a flight, looking for a job. That's about to change. In addition to Yahoo, outfits like Google and IBM, along with a raft of startups, are developing new approaches for trawling the deep Web. And while their solutions differ, they are all pursuing the same goal: to expand the reach of search engines into our cultural, economic and civic lives. "
Considers implications of this on publishers and prices.
Here we go again - a search tool that will learn from us (it says). AOMI's Artificial Intelligence Search Product Will Render Most Traditional Internet Search Technologies Obsolete. PR Newswire via News Alert (March 1, 2004) AOMI is just vaporware so far - it's coming later in 2004.
New Web tools aim to customize searches By Michael Bazeley. Mercury News (March 1, 2004) - review of work at Google and Yahoo in personalizing search.
The Future Of Search Engine Technology Andy Beal WebProNews.com (Jan 29) - notes that search engines are trying to "anticipate the intentions of the searcher" but so far this tends to mean finding the neighbourhood pizza shop. Argues that the future is based on personalization. "However, in order to achieve this new search nirvana we, as consumers, must quell our fears and trepidations surrounding the protection of our privacy. In order for the search engines to develop technology that will be intuitive and anticipate our every need, we must first relinquish at least some of the privacy that we currently hold so dear. Let’s take a look at some of the ways that search technology could improve and you’ll soon get the idea why it will require us to cooperate with the search engine providers. " Has other scenarios of a rosy future in which search improves because the operating system monitors all activities. Let's remember - all solutions bring new problems.
Gary Price interviewed Jason Wiener, CEO of Dipsie. Dipsie is working on indexing the invisible web - "We can index pages that utilize cookies, database backends, forms and client-side scripting, among others. Our scalable technology will allow us to have over 10 billion pages within our first year alone." Ranking methods will be "language based".
Yahoo Keyword Density Analysis Comparison to Google Research by goRank.com compiled on Feb 17 comparing "keyword density elements of Yahoo's new algorithm with Google's algorithm".
Found that Yahoo seemed to have a preference for more words on a page and more frequent exact word matches. Google's lower figure for keyword density (2% vs Yahoo at 2.8%) may be because it does semantic word matching.
Both engines care about keyword density in the title. (Google = 16.9% and Yahoo = 19.6%).
Link text is a factor too, where Yahoo may prefer less text and better matches in the links.
Bolding could make a small difference. Yahoo likes it.
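Keyword density is usually taken to mean the share of words in a text that match the keyword. A minimal sketch under that assumption (the function name and the tokenizing rule are illustrative, and goRank's own counting method is not described in the source):

```python
import re

def keyword_density(text, keyword):
    """Return the percentage of words in text that exactly match keyword.

    Assumption: words are runs of letters, digits or apostrophes,
    compared case-insensitively.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == keyword.lower())
    return 100.0 * hits / len(words)

# Hypothetical page title: 3 of its 8 words are the keyword.
title = "Cheap flights and cheap hotels for cheap holidays"
print(keyword_density(title, "cheap"))
```

The same function can be applied separately to the title, the body text and the link text to get the per-element figures the comparison reports.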
LeanIndex from 312inc.com might be the solution for anyone who needs to create their own niche search engine that will also alert them to new information.
312, Inc. Launches LeanIndex and LeanSwap, a Powerful Personal and Social Search Solution for Windows, UNIX, Linux and Macintosh Users Press Release (Feb 23)
"LeanIndex personal search engine is simple to use and finds information fast. It runs from a profile created by the user that contains keywords to look for, Web sites to search, the time between searches and how the user wants to be alerted. LeanIndex only searches Web sites the user pre-selects and trusts to keep them up-to-date with reliable news and information."
"LeanIndex simplifies a user’s ability to find what they need allowing them to make better-informed personal and business decisions. Three Twelve’s LeanSwap service creates a new Web community for sharing LeanIndex search profiles, tips, tricks and ideas. “312 created LeanSwap so people searching the Internet can now find other people who have similar interests and exchange ideas, tips and Web information sources,” said Brian Neilson, 312’s co-founder and chief executive officer."
Search For Tomorrow by Joel Achenbach, Biz Report (Feb 16) - presents a history of web searching as background to some comments about its future. The future is to be ruled by agents.
""I often use the analogy of Web agents being like travel agents," says James Hendler, a computer science professor at the University of Maryland. "When I go to my travel agent and say where I want to go, they don't usually just say, 'Yes, you can get there.' They give me some options of different ways to get there. They think about some things I might have forgotten. Do I need a car, do I need a hotel reservation? And then they go do it for me.""
Looks to the metadata of the promised Semantic Web to make it easier for search engines to "understand" what they are looking at.
Search Beyond Google by Wade Roush. Technology Review (March 2004) [Requires free registration] -- "Google reigns supreme as the search engine of choice—but for how long? A pack of startups—and Microsoft—are developing technologies to find what you want, faster."
Excellent article on the challenges of search in an ever expanding web of information. Notes that Google has reason to be anxious. Page ranking by popularity, while it was a huge boost 2 years ago, is now plagued by spammers and may also not scale well. Many are working on alternatives.
"For example, there’s Teoma, which ranks results according to their standing among recognized authorities on a topic, and Australian startup Mooter, which studies the behavior of users to better intuit exactly what they’re looking for. And then there’s the gorilla from Redmond: Microsoft is turning to search as one of its next big business opportunities. Its researchers are devising a new operating system that melds Google-like search functions into all Windows programs, as well as software that scours the Web for definitive answers to questions you phrase in everyday English. Meanwhile, Yahoo! launched its own research laboratory in January, and Cutting himself is building an open-source alternative to Google (see “Keeping an Eye on Google”). “Nowadays,” he says, “I’m not convinced [Google is] markedly better.”"
Article describes how Mooter works - a clustering search engine that learns from what you look at.
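Learning "from what you look at" can be sketched as re-ranking results by the topics a user has clicked. The class name, the additive boost and the sample data are assumptions for illustration; the article does not give Mooter's actual ranking method.

```python
from collections import Counter

class ClickProfile:
    """Remembers which topic terms a user has clicked on."""

    def __init__(self):
        self.term_clicks = Counter()

    def record_click(self, terms):
        """Note one click on a cluster or link described by these terms."""
        self.term_clicks.update(terms)

    def rerank(self, results, boost=0.1):
        """Sort (url, base_score, terms) results, favouring clicked terms.

        Assumption: each earlier click on a term adds a fixed boost to
        the base score of any result carrying that term.
        """
        def personalized(result):
            url, base_score, terms = result
            return base_score + boost * sum(self.term_clicks[t] for t in terms)
        return sorted(results, key=personalized, reverse=True)

# Hypothetical results for the query "dog".
results = [
    ("http://example.com/training", 0.9, {"dog", "training"}),
    ("http://example.com/schnoodles", 0.8, {"dog", "breeds", "schnoodle"}),
]

profile = ClickProfile()
before = [url for url, _, _ in profile.rerank(results)]
profile.record_click(["breeds"])
profile.record_click(["breeds", "schnoodle"])
after = [url for url, _, _ in profile.rerank(results)]
print(before)
print(after)
```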
"Mooter analyzes the potential meanings and permutations of the starting keywords and, behind the scenes, ranks the relevance of the resulting Web pages within broad categories called clusters. The user first sees an on-screen “starburst” of cluster names. ... "To develop a more precise understanding of what the user is probably looking for, the Mooter engine notes which clusters and links get clicked and uses that information to improve future responses. Suppose a user enters the term “dog,” clicks on a cluster called “breeds,” and then spends a lot of time looking at sites about Schnoodles (a popular Schnauzer-Poodle mix). When the user clicks on a new search result, Mooter will personalize the ranking to reflect this apparent pattern of interest, which might, for example, lead to sites about “dogs” plus “breeds” plus “Schnoodles” appearing higher. A refined set of results appears on every page; the engine continues to adjust the rankings based on the user’s behavior."
Another newcomer, Dipsie, intends to index the Deep Web of content in databases.
Teoma has been using its analysis of links between sites to identify web communities.
The upside down of search "Commentary: At what point is search too good? " by Bambi Francisco. CBS Marketwatch (Feb 10)
Search can be improved by utilizing social networking, but should we do it? The article talks about Spoke Network and its work to "make the search process, or at least the searching-for-people process, more personalized and relevant".
"By organizing information based on social networks drawn from members' address books and the people they communicate with through e-mails (and instant messaging in the future, I'm told), Spoke improves upon the average search engine's results. ... On the other hand, the data it pulls together includes information about millions of people who are not members and suggests a dark underside to search precision." Also mentions Vivisimo's clustering as a search technology that will improve search. Concludes -- "The consequence of it all: There is no privacy left. We're more accessible. We're more targeted (Do we really need improved targeting for spam?). The channels to get to us are better defined. "
Microsoft's plans for a new search engine technology by Andy Beal. Pandia.com (Feb 2004) -- "Guest Writer Andy Beal talks to Robert Scoble from Microsoft about the future of search engine technology, Google and how search will be handled by the next incarnation of Windows. "
Microsoft is working hard at improving searching of the hard drive but what about the Internet? Robert Scoble sees "social behaviour analysis tools like Technorati becoming far more important". Also search engines will become more specialized - just RSS, just news etc. And users want more ways for search results to be delivered.
Monster librarian at work By Dean Takahashi. Mercury News (Feb 5) - says that IBM computers gather 250 million web pages a week as grist for WebFountain's high-powered analysis. WebFountain looks for associations of names and words.
"Now IBM has begun licensing the technology to create ``buzz reports'' for corporate clients. WebFountain scours Web logs, chat rooms, newspaper stories and every other source of information to determine whether the chatter about a new product is good or bad; is a certain rock group on the way up or a one-hit wonder?"
Hack Your Own Search Engine Crawler By Chris Sherman. SearchDay (Feb 4) - Reviews the new book - Spidering Hacks by Kevin Hemenway and Tara Calishain. The book "offers "100 Industrial Strength Tips and Tools" for creating and running your own spiders. Among these tips and tools, of course, are instructions for creating your own personal web crawler that works much like those used by the major search engines."
The Future of Search Engine Technology by Andy Beal. Pandia (Jan 28, 2004) - foresees changes in personalized results - more tuned to your real interests. Related to this will be advertisements in web-based email that are more relevant - especially if Google does go ahead with an email service. Desktop search is sure to develop. (Google deskbar sure is handy.)
On Search, the Series By Chris Sherman. SearchDay (Jan 29) - describes a series of essays that Tim Bray, CEO of Antarctica, has written about search as "almost a virtual textbook on search engine technology ... highly readable, and replete with Tim's personal insights and opinions."
There are 15 installments to On Search, the Series.
FAST Debuts Enterprise Search Platform by Paula Hane. Newsbreaks (Feb 2) -- "FAST ESP (Enterprise Search Platform) creates a single point of access for all information across an enterprise—in real time, regardless of data format, structure, or location." Susan Feldman, a Director at IDC and author of the article "The Answer Machine" (Jan 2000) said “FAST ESP is the first approximation of an ‘answer machine’ that I have seen.”
Fast Search & Transfer Seeks More Customers With New Service
BY PETE BARLAS, INVESTOR'S BUSINESS DAILY (Jan 27)
"Fast's new search service provides a speedy and more efficient system for companies and their customers to retrieve information on Web sites and private intranets. The service also helps businesses abide by federal compliance laws by locating key relevant documents."
Learning About Search Engines From Google Engineers By Chris Sherman, SearchDay (Jan 26) -- "A new archive of publications by Google employees offers deep insights into many aspects of the search engine's operation. " See Papers Written by Googlers.
Yahoo! bets on search by Stephanie Olsen. Silicon.com (January 21 2004)
Gary Flake, previously of Overture, will head up Yahoo's new Research Lab.
"Much of the research is designed to improve web search and the relevancy of sponsored listings so these companies can win the loyalty of visitors and advertisers."
"Related to search, for example, the lab will focus on how to personalise the experience for people across the Yahoo! network."
"We're here to help, not just in one or two areas, but across the whole spectrum of Yahoo! products," such as finance, news, IM and email, Flake said.
Work is described at the Yahoo Research Lab website - http://labs.yahoo.com/
A Fountain of Knowledge 2004 will be the year of the analysis engine By Stephen Cass, IEEE Spectrum Online (Jan 4, 2004)
Cass describes the intentions and workings of IBM's WebFountain. Search engines list documents with matching words. WebFountain will analyze the documents to make sense of what they say.
"WebFountain works by converting the myriad ways information is presented online into a uniform, structured format that can then be analyzed. The goal is to provide a general-purpose platform that can allow any number of so-called analytic tools to sift the structured data for patterns and trends. "
WebFountain will convert to structured data the content of web sites, blogs, newsgroups, mailing lists and more.
"WebFountain is not intended for casual surfers. Its target audience includes the business executives who have already shown they are willing to pay for the insights that mining corporate databases can supply. Analytic tools can ferret out patterns in, say, a sales receipt database, so that a retail store might see that people tend to buy certain products together and that offering a package deal would help sales. WebFountain will allow executives to go beyond their own databases and analyze up-to-date information from any online source. "
Factiva has partnered with IBM and will be launching a WebFountain-based service to track the online reputation of companies.
Google's (and Inktomi's) Miserable Failure by Danny Sullivan. Search Engine Report (Jan 6, 2004)
The practice Google introduced of link analysis for ranking results seems to have broken under the strain of Google Bombing. Google Bombing is where bloggers (and others) mischievously or maliciously use links and related text to jack up a target site (often a spoof) to top ranking. The latest in this is "miserable failure" to bring up the official biography page of George W Bush. Sullivan finds that Google and Inktomi have failed to counteract the undue influence these blogger bombers have on search results. He notes that Teoma is unaffected.
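To make the mechanics of a Google bomb concrete, here is a minimal sketch of how counting link text can push a page to the top for a phrase that never appears on the page itself. It is a toy illustration only, not Google's or Inktomi's actual ranking; the function names and the example URLs are made up for this note.

```python
from collections import Counter, defaultdict

def anchor_text_scores(links):
    """Count, for each anchor phrase, how many links point at each target.

    links: iterable of (source_page, anchor_text, target_url) tuples.
    """
    scores = defaultdict(Counter)
    for _source, anchor, target in links:
        scores[anchor.lower()][target] += 1
    return scores

def top_result(scores, phrase):
    """Return the target with the most links using this phrase, or None."""
    ranked = scores[phrase.lower()].most_common(1)
    return ranked[0][0] if ranked else None

# Hypothetical link data: three blogs agree to use the same link text.
links = [
    ("blog-a.example", "miserable failure", "bio.example/president"),
    ("blog-b.example", "miserable failure", "bio.example/president"),
    ("blog-c.example", "Miserable Failure", "bio.example/president"),
    ("news.example", "miserable failure", "dictionary.example/failure"),
]
scores = anchor_text_scores(links)
print(top_result(scores, "miserable failure"))
```

A handful of coordinated links is enough to win in this toy model, which is the weakness Sullivan describes. A real engine would weigh many other signals as well.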
Danny Sullivan picked out an article in JimWorld as a gem because it pointed out that the patent for the famous PageRank is owned by Stanford University. Has Google been trying other algorithms for ranking results in order to end its dependence on that patent?
The "Florida Update" ... Exposed ? in JimWorld by J Cokos (Dec 22, 2003)
In the Wake of the "Florida" Update by Karon Thackston. High Rankings Advisor (Dec 31, 2003) -- More about how Google is moving to semantic-based algorithms for ranking results and how this will affect copywriting by search engine optimizers (SEO). Mentions that Google is picking up more information-based directory sites and information pages and possibly fewer commercial ones.
"The reports are true... Google IS moving to a semantic-type system. But that doesn't mean keywords are on their way out at all. After the changes are made, Google will be going beyond *just* looking for keywords on your page. They'll want well-written copy... actual language that speaks to your site visitors. That means your copy will take on a more important role than ever before. And that's great news!"
How Search Engines Make Money By Grant Crowell, Guest Writer SearchDay (Dec 16) -- report from "Search Economics, Search Monetization Strategies," at the Search Engine Strategies conference in San Jose, August 2003.
Google's Florida Update: One Month Later By Gord Hotchkiss. SearchEngine Guide (Dec 15) It's widely recognized that Google has changed its algorithms for ranking results. Hotchkiss refers to Danny Sullivan's observation that Google could have two systems working now, one for more competitive (commercial) searches and the other for less competitive ones. But Hotchkiss goes another step and wonders if Google is starting to use the concept technology it acquired from Applied Semantics.
"Applied Semantics Concept Server used language patterns, including semantics and ontology to try to both determine the real meaning of the words on a website page and also to anticipate what people are looking for. It tries to interpret concepts based on the use of words, their proximity and the patterns they occur in. What if Florida was Google's first attempt to start introducing this concept to their search engine?
The other unique aspect about Concept Server is that it can refine results on an ongoing basis as it becomes "smarter". It starts by feeding concepts or results that it feels matches the searchers intentions. If the response isn't positive, it will try to do a better job next time. "
Is this what is really at work, and will the system become self-regulating?
Meta Tags - What Are They and Which Search Engines Use Them? By Richard Zwicky. SearchGuild (Nov 28) Meta tags are used in creating web pages to provide additional information about the page - author, description, keywords, perhaps copyright information. This article describes what they are and how to use them but doesn't identify which search engines use them. In general, search engines don't use meta tags for retrieving or ranking results but may use them for the description shown with a result.
VOICE: A bluffer's guide to search (Nov 12) NetImperative.com -- Ask Jeeves VP of production and technology Chris Martin gives a primer on web search. He identified three challenges search engines must contend with - the user's query, matching query to indexed pages, and weeding out spam.
Relevancy is determined -- "Through looking at the language and words used in a web page, its context, and discovering associations between them. Secondly, through checking incoming links to a page to assess its link popularity. Discovering domain expert pages through subject specific linkages. Checking where the site is also referenced elsewhere - and 'spidering beyond the page', going to other linked sites, then going back to the original site and checking the association. Finally, through seeing if another search engine is listing the site."
Microsoft news site to customise content NewScientist (Nov 18) -- Raul Valdes-Perez, president of Vivisimo, commented on the customization that is to be part of the new news search engine from Microsoft. (uk.newsbot.msn.com). Specifically - "Now the way to improve the user experience is to work on the next layer of algorithms that determine the presentation of the "search and rank" results." Microsoft has not revealed how it will do the personalization - possibly something similar to Amazon's recommender system or through a system that looks for more-like-this. Vivisimo is also working on a news search and will be introducing news search that "spontaneously clusters links to news articles according to subject."
Sitelines comments on the effects of the use of link analysis by most search engines in ranking search results -- Rich-Get-Richer with Link Analysis (Nov 12)
Serge Thibodeau explains The Google API's and their uses at ISEDB.com (Nov 11). It's directed to programmers who need to access Google's web search database to build queries.
Never mind the talk about Microsoft wanting to buy Google or develop its own search engine. Microsoft is going full barrel into managing search through MS Office 2003, judging from announcements at the ResourceShelf -- "Microsoft links Excel to Edgar Online company data" and "eLibrary Integrated Into MS Office 2003". See ResourceShelf Business Research (Nov 4)
It's in the algorithms A glimpse into the future of mapping the Web By Paula MacKinnon. Information Highways (Nov/Dec 2003) - Search engine technologists continue to seek methods for improving relevance of results. Google is exploring personalization. MITACS (Mathematics of Information Technology and Complex Systems) in Halifax, NS is investigating the focused crawler that takes its clues from the user's Web browsing behaviour. The IBM WebFountain, Nutch, and Netnose are three others entering the fray.
Queries Guide Web Crawlers Technology Research News October 22, 2003
"Researchers from Contraco Consulting and Software Ltd., T-Online International and Siegen University in Germany have written an algorithm that improves Internet search results by factoring in what people are looking for. ... The algorithm, dubbed Vox Populi, picks up trends by analyzing patterns in people's Web search behavior. The algorithm might flag an increase in queries about soccer near the time of the World Cup, for instance."
Local Search Part 2: Google & Mobilemaps Bring Back Geosearching by Danny Sullivan. SearchDay (Oct 21) -- "crawler-based methods being used by Google and Mobilemaps to improve local searching when tapping into a web-wide database of content." But it is still all very tentative and experimental. The article reviews the earlier work on being able to find local listings and map them.
But while Google may not have geo-searching entirely figured out for web searches it can do more regional placement for advertisers -- Google Launches Local Search Targeting & Search Forum Spotlight (Oct 24) People will see ads for their local geographic areas first; if there are none, they'll see national ads.
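The local-first, national-fallback rule is simple enough to sketch in a few lines. This is only an illustration of the behaviour described above, not Google's implementation; the `region` field, the "national" label, and the sample ads are assumptions made up for the example.

```python
def pick_ads(ads, user_region):
    """Return ads targeted at the user's region, else fall back to national ads."""
    local = [ad for ad in ads if ad["region"] == user_region]
    if local:
        return local
    return [ad for ad in ads if ad["region"] == "national"]

# Hypothetical ad inventory.
ads = [
    {"name": "Halifax Pizza", "region": "halifax"},
    {"name": "Big Chain Pizza", "region": "national"},
]
print([ad["name"] for ad in pick_ads(ads, "halifax")])
print([ad["name"] for ad in pick_ads(ads, "toronto")])
```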
The State of the Search Engine Industry by Dana Todd. SearchDay (Oct 22)
This article is a short account of a panel discussion at the Search Engine Strategies Conference in August 2003. Topics touched on were paid inclusion, vertical engines (travel is doing very well, and Singingfish's multimedia search), and mobile search on cell phones and PDAs. Panelists were asked for their wishlists. Greg Notess of Searchengineshowdown asked for "truncation and proximity locators". Brett Tabke of Webmaster World hopes for "a subscription service for an ad-free search environment".
The Web: Search engines still evolving By Gene J. Koprowski. UPI Technology News (Oct 21)
Of interest -- "Using a combination of statistical mathematics, heuristics, artificial intelligence and new computer languages, researchers are developing a "Semantic Web," as it is called, which responds to online queries more effectively. The new tools are enabling users -- now on internal corporate networks and, within a year, on the global Internet -- to search using more natural language queries. "
"Key word searching is common today," Wiener said. "But the next generation of the Web is making documents more contextually relevant. The relevance of each document to a particular topic, or search, will be related by the semantic tagging language that developers are working on now in fields from artificial intelligence to relational databases to statistics. People have been actively pursuing this for two or three years now to evolve the Web. Several efforts are starting to rollout. I predict that in the next six months to a year, you will begin to see semantic relationship searching on the 'Net."
Mentions the work of ClearForest with unstructured data.
August 2009: How Google beat Amazon and Ebay to the Semantic Web by Paul Ford. (July 26, 2002) fTrain.com -- Futuristic article on how Google succeeded in becoming the largest online marketplace, easing out Amazon and eBay by using semantic web constructs. Mentioned by Stephen Downes in the OLWeekly Oct 17, 2003
Researchers search for faster searches AP via Globe and Mail (Oct 16) -- Carnegie Mellon University researchers are trying to make image search better by attaching labels that have been created through a computer game played by people. Some are skeptical this will work.
IBM WebFountain - taking web search to the next level it-analysis.com (Oct 15)
Describes Web Fountain as a "text analytics system".
Of interest -- "WebFountain runs on an IBM supercomputer and monitors everything on the World Wide Web. WebFountain contains over a petabyte of storage with over 3 billion pages indexed, 2 billion pages stored and the ability to mine 20 million pages a day."
"Web Fountain is not about building a better search engine; it is about identifying patterns, trends and relationships that can be used by businesses to transform the way they work. WebFountain can spot trends in public opinion and popular culture as they emerge and watch them catch hold around the world. WebFountain can be used as a surrogate for public opinion, providing instant, comprehensive virtual market research in the place of newspapers, Web page research or a professional report."
Google also owns pattern-finding, meaning-extracting technology through its recently acquired Applied Semantics, but it is being used to deliver targeted ads.
Study: Personalization not Secret to E-Commerce by Sharon Gaudin. Datamation (Oct 14)
"Jupiter Research released a study today that shows that only 14 percent of consumers say a personalized Web site lead them to buy more often from online stores. And just 8 percent say personalization makes them more apt to visit news, entertainment and content sites more frequently."
Study found that consumers want improvements in site navigation and more contact information. When they go online they have a task in mind - to find a particular CD - they don't need to be bothered by suggestions and distractions.
Surprisingly the article did not say anything about the privacy issues - people not wanting to give the information that would assist in personalization or have their activity tracked.
However, this was covered in Report slams Web personalization by Paul Festa in CNet News.
"More than 25 percent of consumers surveyed by Jupiter said they avoided Web site customization because of concerns that marketers would misuse the information. A similar proportion avoided registering with a Web site, for the same reasons."
Study indicated that personalization costs too much for little to no gain. However, there were some individual successes such as at Rand McNally, and 35% of the surveyed companies intend to go ahead with personalization plans.
The next thrust for web search engines is local search - being able to let you narrow your search to a particular city or even zip code.
Google has a beta site in its lab area for Location Search in the US -- http://labs.google.com/location
The little Gigablast will work with geo-sensitive metatags.
Overture has been testing localized search.
Pandia had an overview article Google and AltaVista test local search (Sept 23)
Danny Sullivan reported on New Developments In Local Search: Part 1, Moves By Overture in SearchDay (Oct 14)
Mainly the effort seems to be to make web search engines serve as yellow pages. But then - why not use yellow pages? Because the search engines want to serve up "localized sponsored matches" - ads for your area. If there are none, the search engine may be able to pull from the yellow pages.
Sullivan said, "Overture has a separate database of listings that involves a small number of its US national advertisers taking part in a pilot program. Additional "backfill" results are also provided by yellow pages and data provider Acxiom."
So far all the work is being done for US locations.
Copernic Launches Enterprise Search Product for the SME Market by Paula J. Hane, Information Today Newsbreaks (Oct 6)
"Copernic, a company known for its consumer metasearch product, Copernic Agent, has officially launched its first enterprise search product, Copernic Enterprise Search. ... a product that is specifically designed to meet the needs of the Small-to-Medium-sized Enterprise (SME) and departments of larger enterprises. "
"Copernic Enterprise Search uses advanced linguistic and statistical technologies that can identify the key concepts and the key sentences of indexed documents. It is able to rank a document whose main theme corresponds to search keywords higher than a document that only contains search keywords once or twice. The results ranking can be fine-tuned by altering the weight of different ranking factors. The software also does automatic indexing of new and updated documents in real time, ... "
The search for 'smart data' pays off - Business will benefit: Customers will be able to find data much quicker by Danny Bradbury, Financial Post (Sept 29, 2003)
Of interest ..
"Wouldn't it be useful to have a network of "smart" information -- data that understood itself? Web searches would be easier if a Web browser was able to search for related concepts rather than just looking for key words without their context." - describes the objective of a semantic Web.
"The biggest problem for semantic Web technology is that it is mostly aimed at specialist applications. It needs a dictionary of concepts called an ontology that will enable it to hook information together. It is relatively easy to create an ontology for a specific subject such as healthcare, aerospace or tigers, but introducing semantic technology on to the wider Web would require a huge ontology, or at least many ontologies linked together." - identifies why the semantic web will be largely limited to specialist applications.
Web searches tap databases By Kimberly Patch, Technology Research News (Sep 24)
Birkbeck University researchers have developed software that makes it possible to search different types of databases / sources at the same time.
"The researchers' software automatically constructs trails across tables in relational databases, according to Wheeldon. The software treats each database row as a virtual Web page, and builds links according to database settings,..."
The spokesman, Richard Wheeldon, said the software could be ready in less than a year.
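The idea of treating each database row as a virtual web page can be sketched roughly as follows. This is not the Birkbeck software; the table layout, the foreign-key map, and the function names are assumptions invented for illustration, and the "trail" here simply follows the first link from each page.

```python
def build_pages(tables, foreign_keys):
    """Turn every row into a virtual page with links derived from foreign keys.

    tables: {table_name: {primary_key: row_dict}}
    foreign_keys: {(table_name, column): target_table_name}
    """
    pages = {}
    for table, rows in tables.items():
        for pk, row in rows.items():
            links = []
            for column, value in row.items():
                target = foreign_keys.get((table, column))
                if target is not None and value in tables[target]:
                    links.append(f"{target}/{value}")
            text = " ".join(str(v) for v in row.values())
            pages[f"{table}/{pk}"] = {"text": text, "links": links}
    return pages

def trail(pages, start):
    """Follow the first outgoing link from each page until a dead end or a loop."""
    path, seen = [start], {start}
    while pages[path[-1]]["links"]:
        nxt = pages[path[-1]]["links"][0]
        if nxt in seen:
            break
        path.append(nxt)
        seen.add(nxt)
    return path

# Hypothetical three-table database.
tables = {
    "papers": {1: {"title": "Navigating databases", "author_id": 7}},
    "authors": {7: {"name": "A. Researcher", "dept_id": 3}},
    "depts": {3: {"name": "Computer Science"}},
}
foreign_keys = {("papers", "author_id"): "authors", ("authors", "dept_id"): "depts"}
pages = build_pages(tables, foreign_keys)
print(trail(pages, "papers/1"))
```

Once rows look like linked pages, the same crawling and ranking ideas used on the Web can in principle be pointed at a database.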
Berners-Lee Talks Up Semantic Web By Thor Olavsrud Internet News.com (Sep 23)
Tim Berners-Lee spoke to the Royal Society in the UK about his vision for the semantic web. "It's like a great big database," he said.
"For instance, he explained, consider an event listing on the Web for a lecture. It would include data like the location, start time, end time, the speaker, a phone number to call for more information and so on. But the data is fairly static. It can be read by humans, but not by machines. However, metadata could be applied to those datapoints which identify to machines what they are. Then an interested party could click to attend the event, and whatever calendaring application that person uses could immediately schedule the event in the planner, denoting where it is, what time it will start and what time it will end, and who will be speaking. It could provide a map to get the person to that event, and supply information about the speaker. "
IBM’s WebFountain Launched–The Next Big Thing? by Barbara Quint Information Today Newsbreaks (Sep 22)
More positive comments about IBM's WebFountain -- "a Web-scale mining and discovery platform that extracts trends, patterns, and relationships from massive amounts of unstructured and semi-structured text."
Semantic Web: Out of the Theory Realm By Michael Singer Silicon Valley.com (Sept 12, 2003)
New Search Algorithm Hears 'People's Voice' By Mike Martin NewsFactor Network (Sept 16)
New Internet search algorithm called "Vox Populi" (Voice of the People) developed in Germany assigns relative weights to search words.
"Someone typing "free MP3 downloads" in Google might be taken to all MP3 download sites. In the Vox Populi algorithm, however, if "free" has a larger relative weight than "downloads" (based on statistics showing how many users searching for MP3 downloads are looking for free ones), the algorithm will take searchers to free download sites first. "
Hmmm - I'd like to set the weightings myself.
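The published description of Vox Populi is brief, so the following is only a guess at the general idea: derive a relative weight for each query word from how often searchers use it alongside a topic word, then order sites by the weights of the words they match. The query log, site tags, and function names are all made up for this sketch.

```python
from collections import Counter

def term_weights(query_log, topic):
    """Share of topic queries that contain each other word."""
    related = [q.lower().split() for q in query_log if topic in q.lower().split()]
    if not related:
        return {}
    counts = Counter(w for q in related for w in set(q) if w != topic)
    return {w: c / len(related) for w, c in counts.items()}

def rank_sites(sites, weights):
    """Order sites by the summed weight of the query words they are tagged with."""
    def score(site):
        return sum(weights.get(tag, 0.0) for tag in site["tags"])
    return sorted(sites, key=score, reverse=True)

# Hypothetical query log and sites.
query_log = ["free mp3 downloads", "free mp3", "mp3 downloads",
             "free mp3 songs", "mp3 player"]
weights = term_weights(query_log, "mp3")
sites = [
    {"name": "paid-store", "tags": {"downloads"}},
    {"name": "free-tunes", "tags": {"free", "downloads"}},
]
print([s["name"] for s in rank_sites(sites, weights)])
```

Letting the searcher override the weights, as wished for above, would just mean passing in a hand-edited `weights` dictionary instead of the computed one.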
IBM unveils new advanced search engine MCN International - Channel News Asia (Sept 18)
Describes WebFountain as a new search engine "capable of extracting minute data from among billions of Web pages."
"The system, run by a supercomputer that absorbs 25 million Web pages a day from the Internet, learns to recognise and put into context particular phrases and groups of words on command."
More information at http://www.almaden.ibm.com/WebFountain/
Reports on The Infonortics Search Engine conference April 2003
Information Overlook By Martin White. EContent July 2003 Issue - saw the theme as being between searching structured and unstructured data - an issue most relevant to enterprise search. How users search was discussed - in particular the dictum "You have 12 minutes before a user gives up". Recommended an IIR Evaluation Model in Information Research, an international electronic journal Issue 8-3.
Meeting report from the 2003 Infonortics Search Engine Meeting, Boston in Unstruct.org - a weblog about unstructured information.
Idée Inc. and Wonderfile Corporation announce the release of SimSearch, the first commercial implementation of visual search for a stock photography website. Press Release (Sep 9)
Wonderfile offers professional users "royalty-free" stock images in digital format for purchase. These are searchable online and available on CDs. Online search is by keyword and visual likeness.
The visual search software is Espion from Idee, a Toronto-based company. Wonderfile is a Masterfile company, also in Toronto.
At Wonderfile, find an image you like and use SimSearch to find others that are visually similar. For example, search on Venice and pick a canal scene. SimSearch locates mainly water or canal scenes from the collection.
Vivisimo press release explains why clustering is useful to searchers.
- can "discover" themes and explore more listings
- can focus on a folder and find more relevant results faster
- are drawn by folders to go past the first page.
Clustering of Search Results Increases Click-Through Rates Silicon Valley Biz Inc (Aug 19) PRNewswire
Yahoo Adds an RSS Reader to My Yahoo Research Buzz - Supposedly a place to put headline news from blogs. May have worked for Research Buzz, but doesn't for me. MyYahoo will have to do better.
Vivisimo Announces Release 4.0 of its Award-Winning Clustering Engine PR Newswire via Silicon Valley (Sept 4)
"Vivisimo's Clustering Engine automatically organizes search results into folders, without pre-processing the information. Release 4.0 enhances the functionality and features of the solution and contains fundamental breakthroughs in quality, enabling customers to increase their return-on-investment in enterprise search tools and improve end-user satisfaction by significantly reducing total cost of ownership and improving performance."
Supports metadata clustering (folders grouped around author, sources, set topics etc), and Show-in-clusters (for a particular result identify its cluster).
Public web site at www.vivisimo.com
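The metadata clustering and Show-in-clusters features can be pictured with a small sketch. Vivisimo's engine clusters on the text of the results without preset categories, which this does not attempt; the code below only groups results into folders by a metadata field, and the field names and sample results are assumptions.

```python
from collections import defaultdict

def cluster_by(results, field):
    """Group result titles into folders keyed by a metadata field."""
    folders = defaultdict(list)
    for result in results:
        folders[result.get(field, "unknown")].append(result["title"])
    return dict(folders)

def show_in_clusters(folders, title):
    """Return the names of the folders that contain a given result."""
    return [name for name, titles in folders.items() if title in titles]

# Hypothetical search results.
results = [
    {"title": "Clustering basics", "source": "journal"},
    {"title": "Search trends", "source": "news"},
    {"title": "Folder interfaces", "source": "journal"},
]
folders = cluster_by(results, "source")
print(folders)
print(show_in_clusters(folders, "Search trends"))
```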
Google is most popular but others may do it better by Lee Gomes. Wall Street Journal via SFGate.com. (Aug 18) - searches for God at Google and Teoma and prefers the answer from Teoma. In so doing, describes the fundamentals of link analysis.
Maybe Overture will do more for search than place ads. It has opened a new web site to feature the work of its research department - Overture Research.
"Through creativity, invention, and scientific contribution, Overture Research has the mission to position Overture as a pioneer in the next online revolution. Our goal is to develop novel algorithms and technology to empower users, consumers, businesses, advertisers and publishers worldwide to maximize the social and economic potential of the Internet."
Groxis Announces Web-Enabled Version of Award-Winning Visual Information Software PR Newswire (Aug 19) "Embedded Grokker(TM) Enables Search Engines, Enterprises and Other Organizations to Integrate Grokker Into a Web Page"
"With Embedded Grokker, the software is integrated into a Web site as a simple browser-based application. Embedded Grokker uses the core Grokker technology to turn thousands of pieces of information -- for example, search results -- into a simple, graphical map. These embedded maps are filterable, customizable and can be saved and shared. A visitor to a Web site can perform a search, reorganize the Grokker map, and then save it and mail it to a friend or colleague, who can reopen the map on the originating site."
No mention of particular public web sites that have adopted this yet.
Project searches for open-source niche by Stephanie Olsen, CNet News (Aug 18) Nutch is developing open source software for searching that will show how it determined the rankings.
"... the project is not-for-profit and aims to advance search by supplying a technology for experimentation. Academic researchers or developers will be able to download the software and adapt it without having to reinvent the wheel, Cutting said. Foreign governments could use Nutch to develop a noncommercial search site for citizens rather than licensing a proprietary, ad-supported technology, he said. Or corporate entities could build a for-profit business around the technology. "
This is more likely to be used for "private" purposes - an organization or a specialized service rather than the spammer-infested web.
Pandia recaps news regarding possible new search engines in More new search engine development (Aug 11) -- Kaltix, IBM's Web Fountain, and Nutch.
IBM developed a search engine for a record company that may have wider applications. Called Web Fountain, "The technology reads and understands text, and uses natural language to make correlations between words. Unlike traditional search, Web Fountain searches everything on the Web, including chat rooms, when set to that parameter."
IBM's Path From Invention To Income by Lisa DiCarlo. Forbes.com (Aug 7)
See Gary Price's comments and analysis Web Search - IBM (Aug 10)
Also - IBM Takes Search to New Heights by Barry Taft eWeek (Aug 11) - provides short description of Unstructured Information Management Architecture, which is the basis for Web Fountain.
Quigo has technology for sponsored searches that Overture wants. Quigo's technology can deliver more relevant ads based on its system that mixes semantic algorithms with human intelligence.
"For example, a web page featuring a travel article about Hawaii could offer advertising for hotels in Hawaii, airlines flying to Hawaii, unique tourist attractions in Hawaii and more. One advantage of AdSonar is in giving publishers the option of a human editorial setup for defining relevancy parameters and 'teaching' Quigo's machine learning algorithms which parts of each page should be targeted. The human editing process ensures that only the relevant parts of each document are targeted for ads, significantly improving the relevancy of the results." - from Press Release
Quigo offers the online publishing and ad serving industries a new contextually targeted advertising system Press release (Aug 13)
Overture picks Israeli start-up Quigo to lead search engine battle against Web giant Google By Galit Yemini Haaretz.com (Aug 14) --
Press release
Searching for the personal touch by Stephanie Olsen. CNet News (Aug 11) -- In general, the article reviews the aspirations of web search engines to enhance their services through personalization. In specific terms, the article puts the spotlight on Kaltix, a new start-up company that may have technology to speed up Google's PageRank computations and enable consideration of personal interest profiles.
Natural Language: National Library of Medicine offers COSMO for answering frequently asked questions in a natural language style. http://wwwns.nlm.nih.gov/ Uses NativeMinds software. See Gary Price - Natural Language Searching (July 28)