WSG Newsletter: Gone Googlin'
Issue: August 17, 2001
Note: Revised January 9, 2003 to remove invalid links or references.
Who doesn't use Google for web searching? Launched in
1998, Google now processes over 100 million queries a day. In June 2001 Google
had a stunning 13.4 million visitors. This is sure to climb further. Over the
last 6 months it has added further enhancements to make it an extremely
sophisticated search tool. This newsletter reviews the search features at
Google and shows ways to exploit them for best results.
The Biggest
Google has long had the greatest reach into the Web. It has fully indexed
about 700 million pages and from links on those pages knows enough about
another 700 million pages to give it an effective reach of 1.38 billion.
Included in this number are 26 million pages in pdf format, a
document format from Adobe that is used widely by companies, government, and
universities to publish everything from new research to user manuals. Google is
the only major search engine to index these pages.
Google has indexed images too. Introduced in June 2001, the image database
holds information on 250 million images on the Web.
Parlez vous?
The search interface can be
set to any one of 57 languages, including Russian in Cyrillic script, likely the
reason Google is very popular in Russia. There are some wacky language
interfaces too, such as bork, bork and Pig Latin, for people who find that
searching the Web lacks zip.
The search can be restricted to pages in a particular language, which is especially
useful for words that belong to many tongues. Laissez-faire: is that to
be in an English document or a French one? Google will offer a Translate link when
pages are in a foreign language; it can translate pages in Italian,
French, Spanish, German, and Portuguese automatically to English.
Language settings are done from the Preferences page. Preferences is
also the place to set up SafeSearch to block explicit sexual content.
The Basics
Google looks for all the words except prepositions, articles and
other very common words. The excluded words are always identified. Google
retrieves pages that have the search terms (preferably close to each other) and
ranks the pages using Google's PageRank system.
PageRank is a measure of the importance of the page. It is derived
from an analysis of who links to whom. A page with many links to it, and
especially links from authority sites such as major directories,
will have a higher PageRank. The reasoning is that such a page is more likely to
provide a good answer to the query. Google also considers what
the linking page says about the target page.
(Incidentally, PageRank gets its name from one of its creators, Larry Page,
co-founder of Google.)
This method of ranking by external validation makes Google fairly impervious
to spamming tricks embedded in metatags, titles, and dummy pages.
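The idea of ranking by who links to whom can be illustrated with a small sketch. This is not Google's actual implementation, only a simplified version of link analysis under stated assumptions: the four-page web, the damping value of 0.85, and the fixed iteration count are all hypothetical choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages from a dict mapping each page to the pages it links to.

    Simplified sketch of link analysis; damping and iterations are
    illustrative assumptions, not values taken from the newsletter.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "tutorial".
web = {
    "directory": ["tutorial", "blog"],
    "blog": ["tutorial"],
    "forum": ["tutorial"],
    "tutorial": ["directory"],
}
scores = pagerank(web)
best = max(scores, key=scores.get)
```

In this toy web the page with the most incoming links ends up with the highest score, and a page nobody links to ("forum") ends up near the bottom no matter what its own content claims.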
PageRank does work. For example, a search for
web searching brings up Greg Notess's Search
Showdown and Danny Sullivan's Search Engine Watch in the first two spots
and, further down the page, the web-search tutorial from the University of
California Berkeley Library. These are among the best sources on the Web.
Other search engines have incorporated link analysis in their ranking
algorithms but the results are rarely as sharp for simple searches. For our
query on web searching, AltaVista, Excite, HotBot, and Northern Light come
nowhere close.
On most searches Google will also identify a category in its
directory. For web searching it is Computers > Internet >
Searching > Help and Tutorials. Here one will find a longer list of
sites that have been handpicked and classified as tutorials in web searching.
The Google Directory has the same content as the
Open Directory Project, but
Google's PageRank has ranked the sites for popularity. This usually brings
the best to the top of the list.
Google can help with spelling and definitions. I had heard
that the hydroxyl radical can be used to clean wastewater but had forgotten the
complete phrase and the correct spelling. I took a stab at hydroxayl and
Google proposed hydroxyl. Google provides a link to definitions of the search
terms from the top blue bar. Searched for hydroxyl points to
definitions from four dictionaries. This is a good tool for checking meaning or
finding additional search terms.
Google lists the main page at the site first, and, if there is more than one
page at that site containing the words, it indents the second page. A link
labeled More results from, followed by the site name,
brings up all remaining pages.
Google is the only service to let users see the page as it was indexed,
or cached, as they say in the biz. This feature is invaluable when
a site is unavailable or has changed since Google indexed it. Use this anytime
you can't reach a site you know must still exist: search Google, find the
page, and look at the cached version.
Cached also comes in handy with pdf files found through Google.
Rather than opening the page in Adobe Acrobat Reader, click on cached to see a
plain text version.
Beside Cached is a link to Similar Pages. This is sometimes useful
for finding pages that are about the same topic as the target page. In the
search for hydroxyl the top hit was Hydroxyl Systems, a Canadian company
involved in wastewater treatment. Similar Pages in this case shows sites for
other wastewater companies, and, less usefully, a site on heli-fishing. Similar
Pages is still a very crude tool.
To narrow the search to hydroxyl radical, use Google's search
within results to enter the additional term, radical, in a new
search box. It's often easier than retyping and redoing.
Syntax
Syntax is the use of special words and characters to construct search queries.
Syntax gives the searcher more control. AltaVista and Northern Light excel at
this, but it is not Google's strong suit.
+ can be used to require words or numbers that would otherwise be
ignored. It doesn't seem to make much difference. Results for Internet 2
(a separate, higher-bandwidth network for research) are the same as for Internet +2.
- will exclude pages containing that word.
" " asks that the words be together: "web searching".
OR will look for pages with either one term OR another. It must
be entered in UPPER case, and it works better with single words than phrases.
It's handy for picking up alternate spellings and singular/plural forms,
or widening a search that returned only a few results. To illustrate --
homeschooling (child OR children) picks up singular and plural
for a total of 120,000 hits.
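The operators above can be combined mechanically. The following is a minimal sketch of a query-string builder; the function name build_query and its parameters are hypothetical and exist only for this illustration. It assembles text to type into the search box and does not contact Google.

```python
def build_query(words=(), phrase=None, required=(), excluded=(), any_of=()):
    """Assemble a search string from the operators described above.

    Hypothetical helper for illustration only; it just builds text.
    """
    parts = list(words)
    if phrase:
        # Quotation marks ask that the words appear together.
        parts.append('"' + phrase + '"')
    # + requires a word or number that would otherwise be ignored.
    parts.extend("+" + word for word in required)
    # - excludes pages containing the word.
    parts.extend("-" + word for word in excluded)
    if any_of:
        # OR must be in upper case; it works best with single words.
        parts.append("(" + " OR ".join(any_of) + ")")
    return " ".join(parts)


plural_query = build_query(words=["homeschooling"], any_of=["child", "children"])
number_query = build_query(words=["Internet"], required=["2"])
phrase_query = build_query(phrase="web searching", excluded=["tutorial"])
```

Here plural_query reproduces the homeschooling example and number_query reproduces the Internet +2 example; the excluded word in phrase_query is an arbitrary choice to show the minus sign.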
Drawbacks
Google is not perfect.
- It will not do exact phrase searching. Walk the talk will bring
up walk and talk and walk & talk and other
variants. Google simply looks for walk and talk one word apart; any word
will do. Using +the doesn't help.
- Google will not pick up variants in words. The searcher must think of
singular and plural and alternate forms.
- Google is not case sensitive. A Turkey is a turkey.
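The first and third drawbacks can be pictured with a short sketch. It is a guess at the observed behavior, not Google's code: ignored words in the query act as a slot that any word can fill, and case is discarded. The STOP_WORDS set and the function name loose_phrase_match are assumptions made for the example.

```python
import re

# Hypothetical list of ignored words; the real list is not given in the article.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def loose_phrase_match(phrase, text):
    """Return True if text contains the phrase, treating each ignored
    query word as a slot that any single word (or &) can fill."""
    query = phrase.lower().split()
    # Lower-casing both sides makes the match case insensitive.
    words = re.findall(r"[a-z0-9]+|&", text.lower())
    for start in range(len(words) - len(query) + 1):
        window = words[start:start + len(query)]
        if all(q in STOP_WORDS or q == w for q, w in zip(query, window)):
            return True
    return False


hit_and = loose_phrase_match("walk the talk", "They Walk and Talk every day")
hit_amp = loose_phrase_match("walk the talk", "Join our walk & talk group")
miss = loose_phrase_match("walk the talk", "walk the long talk")
```

Both of the unwanted variants match, while a text with two words between walk and talk does not, which is the "one word apart, any word will do" behavior described in the list.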
Advanced Search
Google now accommodates strategies for narrowing searches through searches
on title, url, site, and date (new in July 2001). These are most easily invoked
from the Advanced Search page.
Faced with 120,000 hits for homeschooling one might want to deal with only
those pages where homeschooling is in the title. Doing this drops
results to 11,500. Unfortunately it is not possible when using the form to
combine a search on title with a search on the words in the document.
Limiting a search to a specific site is the most useful trick of all.
It's not always easy to find the search function at a site, or, having
found it, to get good results. ZDNet, as an example, cannot handle multiple-word
queries. To search for reviews of Netscape 6.1 it is far better to use Google --
site:zdnet.com netscape +6.1 review. (Note the use of the +
sign.) This approach is also the easiest way to search Government of Canada
sites. Rather than guessing which government site might have the information on
consumer protection, limit the search to the domain gc.ca -- site:gc.ca
consumer protection.
Similarly one can limit searches to just the pdf format. We might look for
documents about wildlife gardening and be pleasantly surprised by the number
and the quality. See for yourself with this search -- filetype:pdf
wildlife gardening.
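Searches like these can also be turned into links for bookmarking or sharing. The sketch below builds such a link with Python's standard library. It assumes that the search page lives at http://www.google.com/search and takes the query in a parameter named q; neither detail comes from this newsletter, and the function name google_search_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed search address; not stated in the newsletter.
SEARCH_PAGE = "http://www.google.com/search"


def google_search_url(terms, site=None, filetype=None):
    """Build a search link restricted to a site and/or a file format."""
    parts = []
    if site:
        parts.append("site:" + site)
    if filetype:
        parts.append("filetype:" + filetype)
    parts.append(terms)
    # urlencode escapes spaces, colons and plus signs so the link is valid.
    return SEARCH_PAGE + "?" + urlencode({"q": " ".join(parts)})


canada_url = google_search_url("consumer protection", site="gc.ca")
review_url = google_search_url("netscape +6.1 review", site="zdnet.com")
pdf_url = google_search_url("wildlife gardening", filetype="pdf")
```

The plus sign in +6.1 is escaped as %2B in the link; left unescaped, it would be read as a space and the required-number marker would be lost.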
Regional
Google has country versions for Canada, the U.K., Germany, France, Italy,
Switzerland, Japan and Korea.
The main advantage in these is that they offer a search of pages from only that
country. Google Canada (http://www.google.ca) does this very well, picking up
Canadian sites in all domains, not just .ca for Canada. At present it is
the best search engine for Canadian content. Google Canada is available in
French and English.
Images
Google's new Image search is a dream to use. Google has indexed the
name of the file, image caption, and nearby text and uses dozens of other
factors in ranking results. Whatever they do, it works. Image search can
be accessed from the Advanced Search page or at
http://images.google.com.
A search for tree sparrow returns pictures of the American
and Eurasian varieties and a distribution map for North America.
One word of caution from Tara Calishain of ResearchBuzz: she recommends
turning SafeSearch ON (under Preferences) to block surprises, at least
those in English.
Some advanced search syntax can be used to refine a search, perhaps
too carefully. Filetype can be used to limit to one of the image formats, e.g.
filetype:gif or filetype:jpg. An intitle search will find only
images whose filenames match the keyword, e.g. intitle:sparrow.
Extras
There are some little-known extras as well, mainly useful to users in
the U.S.
Maps: enter a U.S. address and get a map with directions from
MapBlast.
Stock: enter a ticker symbol for a company on a U.S. exchange and get
quotes and a profile from Excite Money and Investing (Quicken based).
News: enter a hot topic and get the news. Jerusalem and Ireland are
two that are sure to have current news. News is not Google's strength.
This is merely an extra that might enhance a search. Yahoo News is much better.
Advertising Policy
Google has been commended for clearly labeling sites that pay to be found.
Google calls them sponsored links. Companies may advertise as a premium sponsor,
in which case the listing shows in a band at the top of the search results (e.g.
weber grill), or in the lower-cost AdWords program for a place in the column at
the right (e.g. soy products). Unlike several other services such as AltaVista,
Inktomi, and LookSmart, Google never mixes paid-for entries with general results.
Google Toolbar

People who are regular users of Google will want to install the Google
Toolbar. The toolbar works with Microsoft Internet Explorer 5.0 and above on Windows. These lucky
folks can enter searches directly to the toolbar, restrict a search to the
current site, see the terms highlighted on the page, or look at cached
snapshots. Learn more and download if you can from
http://toolbar.google.com/.
Usenet Newsgroups
No account of Google is complete without mention of Google Groups
(http://groups.google.com).
Hundreds of newsgroup aficionados were in deep mourning when Deja.com folded
last year and Google introduced a limited service. Today Google Groups has
become the best Usenet newsgroup center on the Web and has likely helped
reenergize the newsgroup community. Users can delve into archives back to 1995
using a full set of search options and also post messages directly to the
newsgroups without fussing with specialized newsreaders.
Conclusion
Google has soared so high that articles are appearing with titles like
Searching for Google's Successor (Wired, August 14,
2001). But Google isn't about to falter. In June, Salon interviewed Monika
Henzinger, the director of research (Google a go-go).
Henzinger spoke of new techniques to find
pages matching concepts rather than strictly the search terms. Google would
also be doing more with language and translation and possibly multimedia
search. We'll be googling for some time to come.