Archiving the Web

I have often wondered about this too – Why Aren’t We Doing More With Our Web Archives?, Kalev Leetaru, Forbes (Jan 13)

Of course there is the Internet Archive, but it deserves much more support. Why doesn’t it get it?

The article has many interesting figures and observations.

As of last October the Archive had preserved more than 510 billion distinct URLs (images, videos, style sheets, scripts, PDFs, Microsoft Office files, etc) from over 273 billion web pages gathered from 361 million websites and taking up more than 15 petabytes of storage. Much of this collection is available through the Archive’s public-facing Wayback Machine that allows you to plug in any URL and see all of the Archive’s snapshots capturing its evolution over the past 20 years.

Chrome Extension for Wayback Machine

If you frequently research older websites, this new extension for the Wayback Machine at the Internet Archive will be very useful.

Wayback Machine Chrome extension now available by Mark Graham, Internet Archive Blog (Jan 13)

The Wayback Machine Chrome browser extension helps make the web more reliable by detecting dead web pages and offering to replay archived versions of them. You can get it here.
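For anyone who wants to do the same kind of lookup outside the browser, here is a minimal sketch in Python. It assumes the Internet Archive's public availability endpoint (https://archive.org/wayback/available) and its JSON response shape; the post does not say how the extension itself works, so this is not a description of its internals. The function names are my own.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Internet Archive's public availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build the query URL asking for the closest snapshot of page_url."""
    params = {"url": page_url}
    if timestamp:
        # Optional target date, e.g. "20060101" (YYYYMMDD).
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response):
    """Return the archived URL from a decoded JSON response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Query the endpoint over the network and return a snapshot URL or None."""
    query = build_availability_url(page_url, timestamp)
    with urllib.request.urlopen(query, timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling lookup("example.com") should return a web.archive.org address if a snapshot exists, and None otherwise. The network call is kept separate from the parsing so the parsing can be checked offline.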

Tika for Deep and Dark

The topic of the deep and dark Web has surfaced again. Author Christian Mattmann describes a new system he and others have developed, called Tika, for identifying types of files and analysing them. The article notes a couple of current applications.

Searching deep and dark: Building a Google for the less visible parts of the web by Christian Mattmann, The Conversation (Jan 7)

Tika and related software packages are part of an open source software library available on DARPA’s Open Catalog to anyone – in law enforcement, the intelligence community or the public at large – who wants to shine a light into the deep and the dark.
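To give a feel for what "identifying types of files" can mean, here is a toy sketch of content-based detection in Python: it looks at the first bytes of a file and compares them with a few well-known signatures. This is not Tika's code or API, and real detectors are far more elaborate. The signature table and the function name are my own illustration.

```python
# Illustrative table of leading-byte signatures ("magic numbers").
# Assumption: a handful of common formats is enough to show the idea.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # also the container for .docx, .xlsx
]


def sniff_type(data):
    """Guess a MIME type from the leading bytes of a file's content."""
    for signature, mime in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime
    # Nothing matched: fall back to the generic binary type.
    return "application/octet-stream"


def sniff_file(path):
    """Read only the first 16 bytes of a file and guess its type."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))
```

The point of looking at content rather than the file name is that files found in the less visible parts of the web are often unlabeled or mislabeled.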

Peekier has privacy and previews

Tara Calishain (ResearchBuzz post) and Mary Ellen Bates (tweet) both mentioned Peekier – a new search engine that promises privacy and previews.

  • No data collected – “No personally identifiable information such your IP address, your browser’s user agent or unique IDs are stored or logged on our servers. Search queries―without any other information attached―are temporarily stored for caching, statistics and service improvement purposes. We do not store your search history.”
  • Provides previews – this is the peeking part – very enticing. Results were similar to Google’s – more so than to Bing’s. The advantage is the larger preview – Google’s snippet has become much too short.

Peekier screenshot

  • Also, Peekier offers very useful keyword suggestions for narrowing the results – somewhat of a topical exploration – and it may be this feature that I like the best.

Worth trying out. I’m going to.

Verbatim at Google

Tara Calishain has discovered that Google’s Verbatim search option (found under Tools > All results), which is expected to match each word exactly and return only pages containing those words, does not behave that way.

Google Has a Weird Definition of Verbatim, ResearchBuzz (Jan 2)

She wrote – “Google’s Help page explains the Verbatim option very simply: ‘Search for exact words or phrases.’ I had always taken that to mean that if you search for a particular set of keywords, a Verbatim search will search for all those keywords in the way you’ve expressed them.” That’s what I thought too – but Google takes liberties with the search query and will drop a term if it can’t be found. I am once again disappointed in Google, but not surprised.

Apps for smartphones and tablets

Can’t resist – 9 of the best smartphone apps of 2016 by Kit Eaton, New York Times via Seattle Times (Dec 16)

There are thousands (maybe millions) of new apps every year. Kit Eaton narrows the list to nine. Lots of fun. Those that struck me were NPR One (“one stop shop for radio”), Microsoft Pix for improving photos, and, for golfers, the game Super Stickman Golf 3.

To this list I add the CBC Radio app for access to real-time radio programming and archived programs, plus other features – indispensable. And the Globe and Mail app for mobile access to the news and instant updates (though I prefer the website for reading sections in depth).

Google Image Search Markup

Image search at Google will improve through the use of schema markup.

Google Image Search Now To Use Product Schema Markup, Barry Schwartz, Search Engine Roundtable (Dec 14)

Quoted from Google: “Add markup to your product pages so Google can provide detailed product information in rich Search results – including Image Search. Users can see price, availability, and review ratings right on Search results.”
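As a rough idea of what such markup looks like, here is a small Python sketch that builds a schema.org Product description in JSON-LD and wraps it in a script tag for a product page. The property names (offers, price, priceCurrency, availability, aggregateRating) come from the schema.org vocabulary; the function name and the sample values are hypothetical, and Google's own documentation remains the authority on which properties it requires.

```python
import json


def product_jsonld(name, image_url, price, currency, in_stock,
                   rating, review_count):
    """Return a JSON-LD script tag describing one product (a sketch)."""
    availability = "InStock" if in_stock else "OutOfStock"
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "image": image_url,
        # Price and availability live on a nested Offer.
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": "https://schema.org/" + availability,
        },
        # Review ratings live on a nested AggregateRating.
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "reviewCount": review_count,
        },
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical product, for illustration only.
print(product_jsonld("Example Kettle", "https://example.com/kettle.jpg",
                     19.99, "CAD", True, 4.4, 89))
```

The three nested pieces line up with what the Google quote promises users will see: price and availability from the offer, and review ratings from the aggregate rating.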

Canada.ca – another Phoenix?

Another mega-computer project from the Government of Canada – Canada.ca: “Initiative to merge 1,500 federal websites into one is behind schedule and over budget”. This is so stupid I can barely bring myself to read about it. When will they learn? Unified theory, single system, one portal – no – the human mind is not capable of taking it all in – or at least the minds of people who take on these projects and commission them are not capable. Surely there is a simpler way to improve access to the information – index the sites, slap on a search engine, and add an overview as a guide to the sites.

Federal government’s Canada.ca project ‘off the rails’, CBC News (Dec 13)

The Canada.ca initiative was launched in 2013 with the goal of making it easier for people to find and use government information online. A $1.54-million contract for a new content management system, where all government websites would be moved, was awarded to Adobe in 2015. …
The actual migration of the websites is up to the departments themselves and is to be done within existing budgets and staffing. Since 2015, eight of the largest departments have budgeted or spent more than $28 million on this project. …

According to the government, only 10,000 web pages have been moved to date. There are more than 17 million Government of Canada web pages in total.