The next Google search engine will be Generative AI

by Rodolfo Rosini

The Old

One of the earliest [1] incarnations of finding stuff on the web was a directory service called Yahoo, which was a glorified, human-curated yellow pages directory [2]. In the beginning this was a good way for new users to navigate the web, but very quickly the growth of content became so steep that human curation was no longer the best way to experience the internet. [3]

In the meantime the first web crawlers (programs that read a webpage and index its text, which is something that parallelizes very well) started to appear, running either on Sun SPARC servers or on DEC Alpha. But the impact of the internet was mostly underestimated, and these companies used search engines as use cases to sell to corporates, instead of understanding that the web was *the* market. So their product was not designed for the customers who were actually using it. [4]

Linux started getting better [5], and computer clusters built from custom PCs running it became cheaper, so web crawlers could keep bigger and bigger indexes and cache more web content. This was the state of the art of search engines. The ~28th search engine to hit the market, Google, was much better than the others. I personally used it because it had the best archive for programming and technical stuff (all the others just returned garbage because they expected natural language and became confused when you searched for code with weird characters and operators). Google took off, more or less became a global monopolist, and here we still are.

Since then there have been some attempts to challenge Google, with bigger indexes (Cuil, whose archive, to their dismay, was mostly porn I believe), semantic search (True Knowledge and Powerset, the latter later integrated into Microsoft Bing), privacy (DuckDuckGo, which ironically is neither a search engine nor private) and more recently You.com (which I am currently using), which lets users rank results based on their preferences.

But really Google was the last search engine to enter the market, a market that has grown from a few million users to a few billion and that defines human culture.

The New

What I am trying to say is that search engines work the way they do not because it’s the best way to search, but because it was the best technology available in the late 1990s, and we are stuck in that paradigm. Maybe there is a better way, but change is hard.

Search engine design, both mobile and desktop, is stuck in a local maximum, but the content we consume has changed. A lot of it is in graph form (social networks), data streams (social feeds), video content (YouTube and TikTok), ecommerce, authoritative knowledge (Wikipedia), apps etc. I think a lot of it found that local centralization was more efficient.

I think Google cannot be challenged in their core territory. I mean it’s not my opinion but more a fact; it’s what has happened for 24 years.

[Machine Learning has entered the chat]

I think that the advancements in Generative AI present a point of disruption for multiple industries, but specifically they are a way to break Google’s hegemony on search engines.

Instead of using a big database and searching on it, we need to use that database as training data, and generate results with a neural network. 

EDIT: one of the reasons why this is a big deal is that trained models are really tiny compared to their training data. Stable Diffusion is ~2 gigabytes but its training data is 100 terabytes (with an estimated training cost of ~$600k for 256 A100s running for a total of 150k GPU-hours), and the consensus is that the size of the model could be reduced by a further one or two orders of magnitude, possibly more. So if you think that the internet is several petabytes, it’s not unreasonable that it could be reduced to a 100 gigabyte model, which is somewhat portable (a 10Gbit/sec fibre connection may soon not be out of the ordinary in wealthy urban areas). So one can see how one could have their own search engine installed locally and not have to rely on Google. The computing cost of running the model would be negligible, as the burden would be on training. That opens another can of worms: we do not have continuous training yet, so the model would have to be retrained and redownloaded in its entirety every time there is new information. There is a great deal of technology that needs to be invented before Google can be replaced, but as others have suggested there might be a hybrid solution in the meantime. To some extent Google already tries to generate some query results when asked certain questions, and over time they have tried to reduce the traffic to non-Alphabet properties.
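The back-of-the-envelope numbers in that paragraph can be checked with a few lines of Python. This is only a rough sketch using the figures quoted above. It assumes decimal units (1 GB = 10^9 bytes), reads the 150k hours as total GPU-hours across the cluster, and ignores protocol overhead on the download. The function names are made up for illustration.

```python
# Rough arithmetic on the figures quoted in the post (all values approximate).
GB = 10**9   # decimal gigabyte, an assumption
TB = 10**12
PB = 10**15


def compression_ratio(training_bytes, model_bytes):
    """How many times smaller the model is than its training data."""
    return training_bytes / model_bytes


def download_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and congestion."""
    return size_bytes * 8 / link_bits_per_second


def cost_per_gpu_hour(total_cost_usd, gpu_hours):
    """Implied price of one GPU-hour, assuming the hours are a cluster total."""
    return total_cost_usd / gpu_hours


# Stable Diffusion: ~2 GB model trained on ~100 TB of data.
ratio = compression_ratio(100 * TB, 2 * GB)

# Corpus size that a 100 GB model would correspond to at the same ratio.
implied_corpus_pb = 100 * GB * ratio / PB

# Time to fetch a 100 GB model over a 10 Gbit/s link.
seconds = download_seconds(100 * GB, 10 * 10**9)

# ~$600k for ~150k GPU-hours.
rate = cost_per_gpu_hour(600_000, 150_000)

print(f"model is {ratio:,.0f}x smaller than its training data")
print(f"a 100 GB model at that ratio covers about {implied_corpus_pb:.0f} PB")
print(f"download time at 10 Gbit/s: {seconds:.0f} s")
print(f"implied cost: ${rate:.2f} per GPU-hour")
```

At the Stable Diffusion ratio, a 100 gigabyte model corresponds to a corpus of a few petabytes, and the download takes on the order of a minute or two on such a link, which is what makes the "somewhat portable" claim plausible.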

Instead of searching for something, and then opening the first few results and scanning for the content we want, while fighting millions of popups, ads, and weight loss scams, one needs to be able to generate the answer they are looking for.

This would also change distribution, as installing the entire Google archive on your laptop is not feasible, but an ML model can definitely run there (again my money would be on Stability.AI, as this is something DeepMind would be prevented from pursuing [6]).

I have been thinking about this post for a while, and in the meantime one search engine has already launched (not quite what I am proposing, but it’s using language models to generate URLs: https://metaphor.systems/ ). Also Adept is working on an AI assistant which is very impressive, even if (IMHO) they have not nailed the use case yet. Lexica is a search engine for prompts which is also interesting (they start with art, but that doesn’t mean they will end up there).

This would completely bypass the distribution monopoly and advertising business of incumbents, as training a model is expensive (for now) but running it is not (the marginal cost of a Google search is negligible, but the cost of maintaining their infrastructure is very much not).

And I am not arguing that this should be only about text. We have the ability to generate multimedia content, so in the near term you could generate a synthetic video on how to replace a faucet or how to solve a calculus problem.

Generative AI is very exciting, and as we backport it to old industries we might find that the incumbent is no longer needed.

Notes:

[1] yes I know there was stuff before Yahoo but hypergrowth on the internet happened with the web and allowing commerce on it, and all the other engines (Lycos, InfoSeek etc) have been dead for a very long time

[2] no idea when (if?) they stopped producing them, but it was a printed local phone book of all the businesses in the area, organized by trade, and it was the main way to acquire customers in the telephone age. In the early ‘90s, as the cost of printing kept going down, these things were huge in terms of page count, but in 5 years or so they became ridiculously small until they disappeared

[3] Yahoo pivoted multiple times and it’s still around. If you want to experience what it looked like you can check yahoo.co.jp (does not work if you are from Europe, but you can image search it to see screenshots), which was licensed to SoftBank decades ago and barely updated since

[4] AltaVista was the most interesting example, as they briefly controlled the market but because in their mind the product was used to index corporate documents inside a private network, they never built any kind of spam protection and very very quickly the top results were just spam garbage and everyone moved to Google. Their demise was lightning fast 

[5] I first used Linux in 1995 and although infuriating at times, it was very promising. By 1998, when Google started, it was quite good even for enterprise production, and most definitely better than Windows for servers. Unix, meanwhile, was fragmented by different vendors making sure their products were as incompatible as possible, which doomed them as developers did not bother

[6] which IMO should be the primary reason why they should do it as a priority, but I guess one is incentivized to forget about the Innovator’s Dilemma when they are being measured by quarterly performance