On Wednesday, someone over at the Greens Idea-Thinking Collective had a rush of blood to the head. Their solution to this whole tech company versus media company drama? A publicly owned and independent search engine.
A statement from Greens spokesperson for media and communications Senator Sarah Hanson-Young called on the Morrison Government to investigate the establishment of a publicly owned search engine.
Let’s save everyone a great deal of bother. Here’s what building a non-profit Australian search engine would entail.
At its core, a search engine is just a key-value database lookup. The key is your set of search terms. The values it returns are the URLs of web pages that contain those search terms.
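To make that concrete, here is a minimal sketch of such a lookup, usually called an inverted index. The three pages and their URLs are made up for illustration, and a real engine does vastly more than split text on whitespace.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every term in the query, sorted."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)

# Hypothetical three-page "web", made up for illustration.
PAGES = {
    "https://example.com/birds": "the cardinal is a red bird",
    "https://example.com/church": "a cardinal is senior clergy",
    "https://example.com/trucks": "a red truck",
}
INDEX = build_index(PAGES)
```

Calling `search(INDEX, "red cardinal")` returns only the birds page, because it is the only one containing both words. Note that the results come back in plain alphabetical order, which is exactly the part a real search engine cannot get away with.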
The sort order of those results is … well … OK, it’s a very complicated ranking function. It takes into account some secret-sauce perceived relevance and quality ratings, and returns a different sort order for different search terms.
In sophisticated search engines, the ranking even takes into account your location, knowledge of your interests, what other people have been searching for, and much more.
You have to account for synonyms, where someone searching for “truck” probably also wants the results with “lorry”.
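As a rough sketch of what that means, each search term can be expanded into a group of acceptable words before matching. The `SYNONYMS` table here is a hand-written assumption for illustration only; a real engine would presumably derive such relationships from data rather than a two-entry dictionary.

```python
# Hypothetical synonym table; a real engine would learn these from data.
SYNONYMS = {
    "truck": {"lorry"},
    "lorry": {"truck"},
}

def expand_query(query):
    """Return, per query term, the set of words that should count as a match."""
    return [{term} | SYNONYMS.get(term, set()) for term in query.lower().split()]

def matches(text, query):
    """True if the text has at least one acceptable word for every query term."""
    words = set(text.lower().split())
    return all(words & group for group in expand_query(query))
```

With this in place, `matches("A red lorry for sale", "red truck")` is true even though the word "truck" never appears in the text.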
You also have to account for context. Is a search for “cardinal” about the bird, or a sports team, or the senior clergy of the Catholic Church, or a compass direction, or a mathematical concept, or the 1963 film directed by Otto Preminger, or the 2017-2020 TV series, or the watch retailer in Stanmore, Sydney?
Or the American indie pop duo founded by musicians Richard Davies and Eric Matthews in 1992?
So it’s “just” a database lookup, but a really, really complicated one.
Putting aside that complexity, and the fact that Google has a 22-year head start on understanding it, let’s look at the engineering.
‘First, download the internet …’
Gathering all the data for your database is straightforward enough: Use web crawlers to download the entire internet. Or at least the bits of it that are visible on the World Wide Web. Then index it.
Then re-do this for each website when it changes, which in the case of news websites is quite frequently.
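For the curious, a toy crawler might look something like the sketch below. It is an illustration only: the `fetch` argument and the `FAKE_WEB` dictionary are stand-ins for real HTTP requests, and it ignores everything that makes crawling hard in practice, such as robots.txt, rate limiting, and deciding when to re-crawl a page that has changed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page reached."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "https://example.com/": '<a href="/news">News</a> <a href="/about">About</a>',
    "https://example.com/news": '<a href="/">Home</a> <a href="/gone">Gone</a>',
    "https://example.com/about": "<p>No links here.</p>",
}
PAGES = crawl("https://example.com/", FAKE_WEB.get)
```

Here the crawl starts at the home page, follows the links it finds, skips the dead `/gone` link, and ends up with all three pages. Scaling the same loop from three pages to the whole web is where the trouble starts.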
How much storage is this going to need? A lot. And we can even estimate that.
It turns out that there’s a bit of set theory which tells us that the storage requirements for a key-value mapping are equivalent to the storage requirements for a value-key mapping for the same data set. (There will not be a question on this in the final exam.)
We already have a reverse search engine that’s equivalent to this value-key mapping, one that starts with URLs and returns the things that we might search for on web pages — which is all the things on the web pages. It’s called the World Wide Web.
So not only do you need to download the entire web for reference, you need the same amount of storage for the index.
Yes, your storage needs for your search engine index are roughly 1.0 World Wide Webs, to a first approximation.
That’s quite a bit of storage.
Now do a Google search for “cardinal”. “About 271,000,000 results (0.83 seconds),” it said for me just now. That’s fast. In fact, it’s so fast there could not have been any disk access involved.
Yes, you need to keep your 1.0 World Wide Webs of index data in RAM.
Actually, you need to keep several replicas in RAM to cope with failures.
That’s quite a bit of RAM.
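The back-of-the-envelope sums so far can be written down as a short sketch. Every input is an assumption: the function simply encodes the reasoning above (index roughly equal in size to the web, every replica of the index held in RAM), and the numbers in the example call are placeholders, not measurements of anything.

```python
import math

def infrastructure_estimate(web_size_tb, replicas, ram_per_server_tb):
    """Rough sizing under the article's first-approximation assumptions."""
    crawl_storage_tb = web_size_tb      # the downloaded copy of the web
    index_tb = 1.0 * web_size_tb        # index is roughly 1.0 World Wide Webs
    ram_tb = index_tb * replicas        # every replica of the index sits in RAM
    servers = math.ceil(ram_tb / ram_per_server_tb)
    return {
        "crawl_storage_tb": crawl_storage_tb,
        "index_tb": index_tb,
        "ram_tb": ram_tb,
        "servers": servers,
    }

# Placeholder inputs, chosen only to show the arithmetic.
ESTIMATE = infrastructure_estimate(web_size_tb=1000, replicas=3, ram_per_server_tb=2)
```

Whatever figure you plug in for the size of the web, the RAM requirement comes out as that figure multiplied by the number of replicas, which is the point.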
You could cut that down by only indexing part of the web, sure, but who would make the editorial decisions? And who would use it anyway?
Those replicas of the index need to be geographically dispersed for redundancy, which means you need a WAN fast enough to sling around copies of the entire World Wide Web to replicate them.
That’s quite a bit of network.
Adding it all up, that’s quite expensive.
Obviously, there will be ways to optimise this, but there will also need to be enough infrastructure to cope with the number of users. At least this gives us a rough idea of the scale of the infrastructure that’d be needed.
Which brings us back to Senator Hanson-Young’s modest proposal.
Who’s paying for this?
“We need an independent search engine that is run in the public interest, not for the profit of a corporate giant,” she wrote.
“It would mean Australians can search the internet with the peace of mind that their data is not being sold off to advertisers and corporations.”
In other words, Hanson-Young is proposing that we build all this with government money, and therefore government project management.
Even if it were outsourced to a private-sector vendor, it’d still be the government providing, you know, the governance.
How well do we think the Australian government would handle that, given their past performance? Remember the NBN?
One final point, relevant to the Greens’ worldview: How much energy do you think all this would burn?
Maybe the current Australian government could end up building a coal-fired search engine.
For mine, the most depressing aspect of all this is that such an outlandish idea made it all the way to a press release seemingly without being run past anyone with a clue.
Here is a political party’s official spokesperson for media and communications publicly calling for an inquiry into an idea which could have been shot down during a quick coffee meeting with almost anyone who knows how search engines actually work. Disappointing.
Anyway, senator, we’ve saved the government having to run an expensive inquiry process. Where should I send my invoice?
Stilgherrian would like to thank the participants in the discussions that informed this article, who must remain anonymous.