Nine to Noon: 3 February 2011

March 2, 2011 – 12:50 pm

I resumed my Nine to Noon radio segments on Radio New Zealand. I’ll be on every other week, beginning 3 February 2011. MP3 and OGG available.

Below are my notes, made as I researched the topics for the 3 February 2011 show. We often depart from the notes, so they’re not a reliable substitute for what aired.

Nat Torkington will cover:
* is Google getting less useful?
* how do we keep something forever?

Links

Many commentators are talking about a decline in the quality of Google’s search results. It’s pretty important given we all use Google.

The BBC Domesday Project is a canary in the coal mine for the longevity of digital media, which the US National Archives says isn’t long (cf. the original Domesday Book). Hard drive failure rates. NZ’s National Digital Heritage Archive.

Search and Spam

Ignorance is now a human condition. What do we do? We Google for the answer. Even the phrase “Google for the answer” shows us how important searching the web has become: we have a new verb for it.

And when we say “search the web”, we really do mean “use Google”. There are only three English search sites with any market share: Google, Yahoo!, and Microsoft’s Bing. In the US, Google has 2/3 of the market, Yahoo! has 16%, and Bing a modest 12%. In some markets it’s even more marked: in the UK, Google is 90%. NZ’s numbers are even more polarised: 75% for Google.co.nz, 16% for google.com, 2% for Bing, and a smidge for Yahoo!. That’s a 91% market share for Google.

We don’t know everything on the net. We look at the net through the lens of Google. If we want to know how to fix spouting or make lamb korma or what that computer error message means or what the best hotel in Queenstown is, chances are that we start with Google.

But for queries like these, the hits in Google aren’t always that great. Some are directories: they don’t have the answer, they just take you to a site with the answer. You might ask what’s the problem with that. Well, why doesn’t Google just send you straight to the site with the answer? Some are copies of other sites: you don’t get the Wikipedia entry on spouting, you get someone else’s site containing a copy of that entry. And some are content farms: you don’t get anything useful, just a bunch of paragraphs that sound like they know something but only offer vague generalities like “lamb korma is an Indian dish of the type curry. Lamb is the main ingredient in lamb korma, and many people report it to be delicious.” Yes, but how do you make one?! You won’t find out there.

What’s all this crap doing in our search results? Making money for those sites. Google doesn’t pay them, you don’t pay them, but advertisers pay those sites because (being at or near the top of Google) they get a heap of traffic. Many people will have gone to the top result in Google, looked at the page and thought “hmm, don’t see my answer here” but seen an ad on that page that looks promising and clicked on *that*. Those clicks make money for the spam sites.

What’s Google doing about this? They can’t tackle it one search at a time: there are billions of people searching with Google and, well, fewer than billions of people working for Google. So they tweak their algorithms, their secret formula for deciding where in the results a site should be. They look for signs that content is copied, or useless, or a directory, and dial down any sites with those signs.
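
To make the “dial down” idea concrete, here’s a toy sketch of signal-based demotion. The signal names, weights, and example URLs are all invented for illustration; Google’s real ranking algorithms are secret and vastly more complicated.

```python
# Toy illustration of demoting results that show spammy signals.
# Signals, weights, and URLs are made up for illustration only.

def adjusted_score(base_score, signals):
    """Dial down a page's score for each spammy signal it shows."""
    penalties = {
        "copied_content": 0.5,   # looks like a copy of another site
        "thin_content": 0.6,     # vague generalities, no real answer
        "directory_only": 0.7,   # just links on to other sites
    }
    score = base_score
    for signal in signals:
        score *= penalties.get(signal, 1.0)
    return score

results = [
    ("recipe-site.example/lamb-korma", 0.90, []),
    ("content-farm.example/lamb-korma", 0.95, ["thin_content"]),
    ("scraper.example/wikipedia-copy", 0.92, ["copied_content"]),
]

# Re-rank: the genuinely useful page wins despite a lower raw score.
for url, base, signals in sorted(results, key=lambda r: adjusted_score(r[1], r[2]), reverse=True):
    print(f"{adjusted_score(base, signals):.2f}  {url}")
```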

The algorithms are Google’s magic: they’re what make Google’s search useful. Before Google, we had other sites that indexed the web. But they weren’t giving us answers as relevant as Google’s (and they didn’t hit on Google’s cash-cow advertising business model), so they aren’t with us today.

Preservation

A friend pointed me to an interesting Wikipedia article on a project that the BBC ran, the Domesday Project. Back before the web, in 1986, the wonderful folks at the BBC commemorated the compilation of the Domesday Book (the first UK census, of sorts) in 1086 (William the Conqueror wondering wtf he had just conquered).

They did their own survey: people wrote reminiscences or pieces about social issues, and there were maps and graphs and statistical data and even video. This was pretty impressive for computer stuff of the time. And what a time: this was before the web, when home computers were a hobbyist thing, before Microsoft Word ran on Windows, and I was still in short pants. There was no YouTube, there was no Google Maps, there was no Excel to crunch the numbers. In short, this was hard work.

They slapped it all on laserdiscs, they needed special hardware to make it work, and it was a magnificent accomplishment, no two ways about it.

Now, fast forward to today. Can we look at this magnificent accomplishment? No: the computers that ran it are dead, the laserdiscs are decaying, it’s all turned to crap. There are two computers in a computer history museum that can still run it, but for how long? For practical purposes, the information on those laserdiscs has vanished. And can we put it on the web? No, the copyright status of all those contributions from people is unknown. Aie.

So, to recap: 900 years after this paper book was compiled, it’s still readable and surviving. Within 25 years, the discs are crumbling, the hardware to read the discs is unreconstructable, and we can’t even put online what we *can* read in order to get help recovering it.

Surely we’ve fixed this? I mean, we live in the age of Google and Facebook and iPads and all that stuff. Um, no.

What about CDs and DVDs? The US National Archives say you can expect them to last 2-5 years even though ads talk about 10-25 years. If this isn’t chilling, I don’t know what is: they say “We recommend testing your media at least every two years to assure your records are still readable.”
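
One way to act on that “test your media” advice at home is to record checksums when you archive the files and re-verify them later. A minimal sketch follows; the manifest name and archive layout are placeholders, not a real preservation system.

```python
# Minimal sketch: record SHA-256 checksums for archived files, then
# re-verify them later (say, every couple of years) to catch decay.
# Paths and the manifest filename are placeholders.

import hashlib
import json
from pathlib import Path

def checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record(archive_dir, manifest="manifest.json"):
    """Write a manifest of checksums for every file in the archive."""
    sums = {str(p): checksum(p) for p in Path(archive_dir).rglob("*") if p.is_file()}
    Path(manifest).write_text(json.dumps(sums, indent=2))

def verify(manifest="manifest.json"):
    """Re-read every file and report any whose checksum has changed."""
    sums = json.loads(Path(manifest).read_text())
    for path, expected in sums.items():
        if not Path(path).exists() or checksum(path) != expected:
            print("DAMAGED OR MISSING:", path)
```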

Hard drives aren’t much better. If you buy off-the-shelf hard drives, you’re paying amazingly low prices. You can buy a terabyte hard drive (you could store 250 DVDs in that) for a hundred bucks or so. But it’s like The Warehouse: you got that bargain by compromising on quality. The failure rate of hard drives is scary: you can expect 3% or more to croak within a year.
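
Some back-of-envelope arithmetic on that 3% figure, assuming failures are independent and the rate stays constant (real drives don’t behave so neatly, so treat these as rough numbers):

```python
# Rough numbers for a 3% annual failure rate. Assumes independent
# failures and a constant rate, which real drives don't obey.

annual_failure_rate = 0.03

# Chance a single drive is still alive after n years.
for years in (1, 3, 5, 10):
    survival = (1 - annual_failure_rate) ** years
    print(f"after {years:2d} years: {survival:.0%} chance the drive still works")

# Expected failures per year across a small archive of 100 drives.
drives = 100
print(f"expected failures per year across {drives} drives: "
      f"{drives * annual_failure_rate:.0f}")
```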

So if we want to keep these treasures we’re making, whether you’re talking your digital photos or parliamentary email or the latest census, then you can’t just slap ’em on a hard drive and walk away. You can’t burn ’em to DVD and walk away. What do you do? You have to keep the information alive: you have lots of copies, and when one dies you replace it from one of the other copies.
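 
That “lots of copies, repair from the survivors” idea can be sketched in a few lines. The replica paths are placeholders and a real archive uses purpose-built systems, but the shape of the job looks something like this:

```python
# Sketch of "keep several copies and repair a damaged one from the rest".
# Replica directories are placeholders, not a real archive layout.

import hashlib
import shutil
from pathlib import Path

REPLICAS = [Path("/mnt/copy-a"), Path("/mnt/copy-b"), Path("/mnt/copy-c")]

def digest(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

def repair(filename):
    """Check every replica of a file; overwrite bad copies from a good one."""
    copies = [replica / filename for replica in REPLICAS]
    existing = [p for p in copies if p.exists()]
    if not existing:
        print("all copies lost:", filename)
        return
    # Majority vote on the checksum decides which copies count as "good".
    by_digest = {}
    for p in existing:
        by_digest.setdefault(digest(p), []).append(p)
    good_digest, good_copies = max(by_digest.items(), key=lambda kv: len(kv[1]))
    source = good_copies[0]
    for p in copies:
        if not p.exists() or digest(p) != good_digest:
            print("repairing", p, "from", source)
            shutil.copy2(source, p)
```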

That’s how the National Digital Heritage Archive, a project within the National Library, works. It’s the library’s solution to the problem of preserving digital books and the New Zealand web forever. There aren’t a lot of projects like this around the world, and we’re one of the few countries tackling it. Hopefully, in 25 years’ time, people won’t be grizzling that the web record of Nine to Noon’s New Technology slot is unreadable …

(I’m involved with the National Library, but not in this project.)
