
WSG Newsletter: Gone Googlin'

Issue: August 17, 2001

Note: Revised January 9, 2003 to remove invalid links or references.

Who doesn’t use Google for web searching? Launched in 1998, Google now processes over 100 million queries a day. In June 2001 Google had a stunning 13.4 million visitors, a figure that is sure to climb further. Over the last six months it has added enhancements that make it an extremely sophisticated search tool. This newsletter reviews the search features at Google and shows ways to exploit them for best results.


The Biggest

Google has long had the greatest reach into the Web. It has fully indexed about 700 million pages and from links on those pages knows enough about another 700 million pages to give it an effective reach of 1.38 billion.

Included in this number are 26 million pages in pdf format – a document format from Adobe that is used widely by companies, government, and universities to publish everything from new research to user manuals. Google is the only major search engine to index these pages.

Google has indexed images too. Introduced in June 2001, the image database holds information on 250 million images on the Web.

Parlez vous?

The search interface can be set to any one of 57 languages, including Russian in Cyrillic script – likely the reason Google is very popular in Russia. There are some wacky language interfaces too, such as Bork, bork, bork! and Pig Latin – for people who find that searching the Web lacks zip.

The search can be restricted to pages in a particular language, which is especially useful for words that belong to many tongues. Laissez-faire: should it be found in an English document or a French one? Google offers a Translate link when a page is in a foreign language, and it can translate pages in Italian, French, Spanish, German, and Portuguese to English automatically.

Language settings are made on the Preferences page. Preferences is also the place to set up SafeSearch to block explicit sexual content.

The Basics

Google looks for all the words – except prepositions, articles and other very common words. The excluded words are always identified. Google retrieves pages that have the search terms (preferably close to each other) and ranks the pages using Google’s PageRank system.

PageRank is a measure of the importance of the page. It is derived from an analysis of who links to whom. A page with many links to it, and especially links from “authority” sites such as major directories, will have a higher PageRank. The reasoning is that such a page is more likely to provide a “good” answer for the query. Google also considers what the linking page says about the target page.

(Incidentally, PageRank gets its name from one of its creators, Larry Page, co-founder of Google.)
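The link analysis described above can be illustrated with a small sketch. It follows the simplified, publicly described idea of PageRank (each page passes a share of its score to the pages it links to) and is not Google's actual ranking code. The function name, the damping value of 0.85, and the four sample pages are assumptions made for illustration only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy link-analysis score. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        # every page gets a small base score regardless of links
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page splits its score evenly among the pages it links to
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page with no outgoing links spreads its score over all pages
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical mini-web: three pages link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this mini-web, page "c" ends up on top because it has the most incoming links, and "a" comes second because its single link comes from the highly ranked "c". That is the "links from authority sites count for more" effect in miniature.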

This method of ranking by external validation makes Google fairly impervious to spamming tricks embedded in metatags, titles, and dummy pages.

PageRank does work. For example, a search for web searching brings up Greg Notess’s Search Showdown and Danny Sullivan’s Search Engine Watch in the first two spots and, further down the page, the web-search tutorial from the University of California Berkeley Library. These are among the best sources on the Web.

Other search engines have incorporated link analysis in their ranking algorithms, but the results are rarely as sharp for simple searches. For our query on web searching, AltaVista, Excite, HotBot, and Northern Light come nowhere close.

On most searches Google will also identify a category in its directory. For web searching it is Computers > Internet > Searching > Help and Tutorials. Here one will find a longer list of sites that have been handpicked and classified as tutorials in web searching. The Google Directory has the same content as the Open Directory Project, but Google’s PageRank has ranked the sites for popularity. This usually brings the best to the top of the list.

Google can help with spelling and definitions. I had heard that the hydroxyl radical can be used to clean wastewater but had forgotten the complete phrase and the correct spelling. I took a stab at hydroxayl and Google proposed hydroxyl. Google provides a link to definitions of the search terms from the top blue bar. “Searched for hydroxyl” points to definitions from four dictionaries. This is a good tool for checking meaning or finding additional search terms.

Google lists the main page at the site first, and, if there is more than one page at that site containing the words, it indents the second page. A link for More results from … brings up all remaining pages.

Google is the only service to let users see the page as it was indexed – or cached as they say in the biz. This feature is invaluable when a site is unavailable or has changed since Google indexed it. Use this anytime you can’t reach a site you know must still exist. Search Google, find the page, look at the cached version.

Cached also comes in handy with pdf files found through Google. Rather than opening the page in Adobe Acrobat Reader, click on cached to see a plain text version.

Beside Cached is a link to Similar Pages. This is sometimes useful for finding pages that are about the same topic as the target page. In the search for hydroxyl the top hit was Hydroxyl Systems, a Canadian company involved in wastewater treatment. Similar Pages in this case shows sites for other wastewater companies and, less usefully, a site on heli-fishing. Similar Pages is still a very crude tool.

To narrow the search to hydroxyl radical, use Google’s “search within results” to enter the additional term, radical, in a new search box. It’s often easier than retyping and redoing.

Syntax

Syntax is the set of words and characters used to construct search queries. Syntax gives the searcher more control. AltaVista and Northern Light excel at this, but it is not Google’s strong suit.

+ can be used to require words or numbers that would otherwise be ignored. It doesn’t seem to make much difference. Results for Internet 2 (a separate, higher-bandwidth network for research) are the same as for Internet +2.

- will exclude pages containing the word that follows it.

“” asks that the words be together – “web searching”

OR will look for pages with either one term OR another. It must be entered in UPPER case, and it works better with single words than phrases. It’s handy for picking up alternate spellings and singular / plural forms, or widening a search that returned only a few results. To illustrate -- homeschooling (child OR children) picks up singular and plural – for a total of 120,000 hits.
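The operators above can be combined mechanically. The following is a minimal sketch that assembles a query string from its parts. The function name and its parameters are hypothetical, and it only builds the text to type into the search box. It does not contact Google.

```python
def build_query(words=(), phrases=(), required=(), excluded=(), any_of=()):
    """Assemble a query string using the +, -, "" and OR syntax."""
    parts = list(words)
    parts += [f'"{phrase}"' for phrase in phrases]   # "" keeps words together
    parts += [f"+{word}" for word in required]       # + forces a word to be used
    parts += [f"-{word}" for word in excluded]       # - excludes a word
    if any_of:
        # OR must be in upper case to be treated as an operator
        parts.append("(" + " OR ".join(any_of) + ")")
    return " ".join(parts)


print(build_query(words=["homeschooling"], any_of=["child", "children"]))
print(build_query(phrases=["web searching"], excluded=["tutorial"]))
```

The first call reproduces the homeschooling (child OR children) example. The second shows a phrase combined with an excluded word.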

Drawbacks

Google is not perfect.

  • It will not do exact phrase searching. “Walk the talk” will bring up “walk and talk” and “walk & talk” and other variants. Google simply looks for walk and talk one word apart – any word will do. Using +the doesn’t help.

  • Google will not pick up variants in words. The searcher must think of singular and plural and alternate forms.

  • Google is not case sensitive. A Turkey is a turkey.

Advanced Search

Google now accommodates strategies for narrowing searches through searches on title, url, site, and date (new in July 2001). These are most easily invoked from the Advanced Search page.

Faced with 120,000 hits for homeschooling one might want to deal with only those pages where homeschooling is in the title. Doing this drops results to 11,500. Unfortunately it is not possible when using the form to combine a search on title with a search on the words in the document.

Limiting a search to a specific site is the most useful trick of all. It’s not always easy to find the search function at a site or, having found it, to get good results. ZDNet, as an example, cannot handle multiple word queries. To search for reviews of Netscape 6.1 it is far better to use Google -- site:zdnet.com “netscape +6.1” review. (Note the use of the + sign.) This approach is also the easiest way to search Government of Canada sites. Rather than guessing which government site might have the information on consumer protection, limit the search to the domain gc.ca -- site:gc.ca “consumer protection”.

Similarly one can limit searches to just the pdf format. We might look for documents about wildlife gardening and be pleasantly surprised by the number and the quality. See for yourself with this search -- filetype:pdf “wildlife gardening”.
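For anyone who wants to save or share restricted searches like these as links, here is a small sketch. The newsletter does not give a search address or parameter name, so the use of http://www.google.com/search with a q parameter is an assumption. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed address and parameter name; neither is given in the newsletter.
SEARCH_URL = "http://www.google.com/search"


def restricted_query(terms, site=None, filetype=None):
    """Prefix the search terms with optional site: and filetype: restrictions."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    parts.append(terms)
    return " ".join(parts)


def search_url(query):
    """Encode the query so that quotes, colons and spaces survive in a link."""
    return SEARCH_URL + "?" + urlencode({"q": query})


print(restricted_query('"consumer protection"', site="gc.ca"))
print(search_url(restricted_query('"wildlife gardening"', filetype="pdf")))
```

Keeping the query text separate from the URL encoding means the same string can be pasted into the search box or turned into a bookmark.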

Regional

Google has country versions for Canada, the U.K., Germany, France, Italy, Switzerland, Japan and Korea. The main advantage in these is that they offer a search of pages from only that country. Google Canada (http://www.google.ca) does this very well – picking up Canadian sites in all domains – not just .ca for Canada. At present it is the best search engine for Canadian content. Google Canada is available in French and English.

Images

Google’s new Image search is a dream to use. Google has indexed the name of the file, image caption, and nearby text and uses “dozens of other factors” in ranking results. Whatever they do – it works. Images can be accessed from the Advanced Search page or at http://images.google.com. A search for “tree sparrow” returns pictures of the American and Eurasian varieties and a distribution map for North America.

One word of caution from Tara Calishain of ResearchBuzz - she recommends turning SafeSearch ON (under Preferences) to block surprises – at least those in English.

Some advanced search syntax can be used to refine a search – perhaps too carefully. Filetype can be used to limit to one of the image formats; e.g. filetype:gif or filetype:jpg. An intitle: search will find only images whose filenames match the keyword; e.g. intitle:sparrow.

Extras

There are some little known extras as well – mainly useful to users in the U.S.

Maps: enter a U.S. address and get a map with directions from MapBlast

Stock: enter a ticker symbol for a company on a U.S. Exchange and get quotes and profile from Excite Money and Investing (Quicken based).

News: enter a hot topic and get the news. Jerusalem and Ireland are two that are sure to have current news. News is not Google’s strength. This is merely an extra that might enhance a search. Yahoo News is much better.

Advertising Policy

Google has been commended for clearly labeling sites that pay to be found. Google calls them sponsored links. Companies may advertise as a premium sponsor, in which case the listing shows in a band at the top of the search results (e.g. weber grill), or in the lower cost AdWords program for a place in the column at the right (e.g. soy products). Unlike several other services, such as AltaVista, Inktomi, and LookSmart, Google never mixes paid-for entries with general results.

Google Toolbar

People who are regular users of Google will want to install the Google Toolbar. The toolbar works with Microsoft Internet Explorer 5.0 and above on Windows. These lucky folks can enter searches directly in the toolbar, restrict a search to the current site, see the terms highlighted on the page, or look at cached snapshots. Learn more and download if you can from http://toolbar.google.com/.

Usenet Newsgroups

No account of Google is complete without mention of Google Groups (http://groups.google.com). Hundreds of newsgroup aficionados were in deep mourning when Deja.com folded last year and Google introduced a limited service. Today Google Groups has become the best Usenet newsgroup center on the Web – and has likely helped reenergize the newsgroup community. Users can delve into archives back to 1995 using a full set of search options and also post messages directly to the newsgroups – no fussing with specialized newsreaders.


Conclusion

Google has soared so high that articles are appearing with titles like “Searching for Google’s Successor” (Wired, August 14, 2001). But Google isn’t about to falter. In June, Salon interviewed Monika Henzinger, Google’s director of research (“Google a go-go”). Henzinger spoke of new techniques to find pages matching concepts rather than strictly the search terms. Google would also be doing more with language and translation and possibly multimedia search. We’ll be googling for some time to come.

Articles

SearchDay #36, Speaking in Tongues at Google by Chris Sherman (June 25, 2001)
http://searchenginewatch.com/searchday/01/sd0625-google-lang.html

SearchDay #37, Google Polishes its Image by Chris Sherman (June 26, 2001)
http://searchenginewatch.com/searchday/01/sd0626-google-images.html

Google a go-go by Katharine Mieszkowski in Salon (June 21, 2001) -- Describes how Google works today and new developments in language, sound, image, and concept searching.

Search engines and editorial integrity by J.D. Lasica in Online Journalism Review (July 23, 2001) -- Comments on the formal complaint filed by Commercial Alert in the USA against several search engines for not disclosing paid-for listings. Commends Google for its advertising policies.

Searching for Google's Successor by Angel Gonzalez. Wired (August 14, 2001) -- Proposes Wisenut, Teoma, Lasoo, Vivisimo, and CURE as up-and-coming search engines.

Google Treats

Google Zeitgeist reports on patterns in search at Google. There is a weekly list of gaining queries and declining queries and a year-end review.
http://www.google.com/press/zeitgeist.html

Google Stripped is a bare-bones Google - just title of document on the results list - but it's fast. It's so stripped it doesn't even have a name!
http://www.google.com/ie

Advanced Searching at Google

The Advanced Search page at Google lays out all the search options. However, many searchers prefer to use Google's special field commands for running the search directly from the main search box.

A Title search is especially useful for cutting to the chase. Getting too many results when looking for lodgings in a popular town? Use a title search.

allintitle: looks for all words in the title, whereas intitle: requires only the word that immediately follows it to be in the title; the remaining words may appear anywhere on the page. For example, allintitle:stratford ontario lodging finds 4 pages, and intitle:stratford ontario lodging finds 62.

A URL search can be helpful when you can remember a fragment of a url or have a sense that the words might have been used in naming the specific page.

allinurl: looks for all words in the url, whereas inurl: requires only the word that immediately follows it to be in the url. This works on the Stratford search: allinurl:stratford ontario lodging finds 16 pages, and inurl:stratford ontario lodging finds 43.

Filetype can be used to search only a particular filetype or to exclude that filetype. This is most useful for limiting searches to the pdf filetype; e.g. filetype:pdf "wastewater management" or, if you don't want pdf files, "wastewater management" -filetype:pdf

Domain searching is done by using the command word site:. Identify the domain to be searched - gc.ca for Canadian government, loc.gov for Library of Congress, etc. Construct the search as site:<domain> <your search terms>; e.g. site:gc.ca "access to information"
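The gap between the allintitle: and intitle: counts is easier to see with a toy matcher. The sketch below is not how Google works internally. It assumes that allintitle: needs every word in the title, while intitle: needs only the word attached to it in the title and accepts the rest anywhere on the page. The function names and the three sample pages are invented for illustration.

```python
def words_of(text):
    """Lower-cased set of the words in a piece of text."""
    return set(text.lower().split())


def matches_allintitle(page, terms):
    """True when every term appears in the page title."""
    title = words_of(page["title"])
    return all(term.lower() in title for term in terms)


def matches_intitle(page, terms):
    """True when the first term is in the title and the rest appear anywhere."""
    title = words_of(page["title"])
    anywhere = title | words_of(page["text"])
    first, rest = terms[0], terms[1:]
    return first.lower() in title and all(term.lower() in anywhere for term in rest)


# Hypothetical pages, invented for this example.
pages = [
    {"title": "Stratford Ontario Lodging Guide", "text": "bed and breakfast listings"},
    {"title": "Visit Stratford", "text": "festival tickets and lodging in southwestern ontario"},
    {"title": "Ontario travel", "text": "lodging in stratford and elsewhere"},
]
terms = ["stratford", "ontario", "lodging"]

print([p["title"] for p in pages if matches_allintitle(p, terms)])
print([p["title"] for p in pages if matches_intitle(p, terms)])
```

Because every page that passes the stricter allintitle: test also passes the intitle: test, the second list can never be shorter than the first. This is consistent with the 4 versus 62 results reported above.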

Newsletter by Gwen Harris who admits to always checking Google first.


Copyright Gwen Harris
A service to subscribers of The Internet Guide.

