The birth of the web and rise of full-text search

In 1993, full-text search was like mRNA vaccines—long researched, partially implemented, and waiting for critical mass to catapult it into mainstream society.

The perfect brew? The rise of storage capacity, the first search engine, and the web.

In 2021, the most difficult part of searching is deciding which of 1 million results we want to read—even then, we hardly ever bother to go past the first page of results.

The quality of search results is a real issue and the subject of ongoing research, but it is a different problem from finding information—a problem that's been around since at least 2000 BC when ancient civilizations in the Middle East were archiving information.

We don't need to go back quite that far to understand some of how full-text search came to be. The idea of searching text with a computer for information retrieval (IR) was first explored in the 1950s, "when IR as a research discipline was starting to emerge … with two important developments: how to index documents and how to retrieve them," according to "The History of Information Retrieval."

In the early 1960s, Gerard Salton developed the vector space model of information retrieval, which remains foundational to the field today.

Research in information retrieval in the 1970s and 1980s advanced work done in the 1960s with the notable addition of term frequency theories—how often a word appears in a document helps determine what that document is about.

These methods were developed and tested using relatively small document sets because limits on storage capacity and processor speed naturally curtailed how many documents were computerized. This was such a concern that international research groups formed the Text Retrieval Conference (TREC) to build larger text collections, which helped refine full-text theories against different types of content. This work would prove useful with the rise of the web.

The internet connects researchers—but few others

 Research was also progressing on another front. In 1973, the U.S. Defense Advanced Research Projects Agency (DARPA) began to research how to exchange data through networks, according to The Internet Society. 

The result was the Internet, a system of networks and communication protocols connecting endpoints—or infrastructure for data exchange.

 By the late 1980s, the Internet was used to connect a relatively small number of universities, researchers, and government research partners, such as Rand Corporation, IBM, and Hewlett Packard. One of the earliest Internet communication protocols was file transfer protocol (FTP), which "required a minimum of handshaking, and even more crucially was tolerant of temporary interruptions during long file transfer sessions," according to The Register.

Users accessed the Internet with a terminal that looked like a DOS command line—in a manual process that required use of an FTP client to connect to a remote FTP server. They would then ask for a list of files on the server, look through the file names, and download the files to see if they contained the information that they needed—over network connections that were thousands of times slower than what is commonly used today.  

In other words, users had to rummage around files on the endpoint, though some FTP administrators added a downloadable directory list of sorts, with file names and one-line descriptions, to make this rooting about slightly more efficient.

Keep in mind that these FTP sites had content on them curated by a select group of geeks—not just any academic knew what an FTP site was, let alone how to add content to one. Every FTP site reflected what its administrators thought was interesting or whatever occurred to them to add.

The process was not very different from asking for an interlibrary loan, where you would ask a librarian to order a book from another library. The requested book would come in a week or two, and it might or might not contain information useful to you.

Those lucky enough to have an Internet connection were giddy because they were able to get whatever was on these FTP sites immediately instead of waiting weeks for an interlibrary loan. Suddenly, the process of discovering if a document was relevant was exponentially faster.

Nevertheless, it was laborious—one had to look through the file lists, download the files that seemed suitable, then see if they contained what was needed. Rinse and repeat, as needed.

The first search engine

In 1989, Alan Emtage, a graduate student at McGill University, was working to find software downloads on FTP sites. In an effort to make this easier, he developed Archie, a software program that indexed file names on the FTP sites he connected to.

Archie, which is still available, indexes the downloadable files of FTP sites and allows users to search the file names for exact word matches. Indexing makes a list of where every word or phrase can be found in a set of text—making search faster and more efficient.

Archie has "a crawling phase, a retrieval phase where you pull the information in, and an indexing phase, where you build the data structures that allow the search, and then you have the ability to search," Emtage said in an interview with Digital Archeology.
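To make those phases concrete, here is a minimal Python sketch—not Archie's actual code—that builds an inverted index over hypothetical FTP file listings and answers exact word-match queries against the file names, the kind of lookup Archie offered:

```python
# A minimal sketch (not Archie's actual code) of an inverted index over
# hypothetical FTP file listings: each word in a file name maps to the
# sites and file names where it appears, so a search never rescans the lists.
from collections import defaultdict

# Hypothetical crawl results: site -> list of file names retrieved from it.
listings = {
    "ftp.example.edu": ["archie-client.tar.Z", "gnu-emacs-18.55.tar.Z"],
    "ftp.sample.org": ["emacs-manual.txt", "kermit.tar.Z"],
}

# Indexing phase: build the word -> locations map.
index = defaultdict(set)
for site, files in listings.items():
    for name in files:
        for word in name.lower().replace(".", " ").replace("-", " ").split():
            index[word].add((site, name))

# Search phase: exact word matching, in the spirit of early Archie queries.
def search(word):
    return sorted(index.get(word.lower(), set()))

print(search("emacs"))
# [('ftp.example.edu', 'gnu-emacs-18.55.tar.Z'), ('ftp.sample.org', 'emacs-manual.txt')]
```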

Archie was, by all accounts, difficult to use, but it laid the foundation for all subsequent search engines and, perhaps, the web. 

The rise of the web

History will long remember 1989—it's also the year that Tim Berners-Lee proposed the World Wide Web. The first website, which went live in 1991, described his vision: "The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents."

Berners-Lee developed the web to help scientists at CERN share information more easily. He also hoped to democratize the Internet and "meet the demand for automated information-sharing between scientists in universities and institutes around the world," according to CERN.

In 1993, CERN open sourced the underlying code for the web, and web sites began to proliferate. Suddenly, the problem of small document sets disappeared.

 A few search engines emerged between 1989 and 1993 to help web users find documents that matched their searches. The field was crowded with competitors by 1998, when Google launched.

 Coinciding with, and perhaps enabling, the growth of the web, storage capacity grew, with the number of bits of information packed into a square inch of hard drive surface increasing from 2000 bits in 1956 to 100 billion bits in 2005, according to Mark Sanderson and W. Bruce Croft in The History of Information Retrieval.

Because we had more, and more varied, documents to choose from, what we searched changed from the Archie days. Before the web, Internet users were a small population looking for a constrained set of documents. "There is no way to discover new resources, so resources have to be manually discovered," Emtage said. So expectations of search, and how much search was needed, were very different, and the universe of information was much smaller.

Search, storage and behaviors change

How we searched changed, too. The basic processes of modern search engines and Archie are not so different: they retrieve information, index it, and allow people to search. But the web allows for discovery, something earlier Internet users didn't have. "With the invention of the web, you have the ability to discover things that you didn’t previously know, because of hypertext links between websites," Emtage said.

 Those web links allowed users to journey across sites without having to use an engine at all. The web spawned the potential for instantaneous serendipity in information retrieval, a capability that had never existed before. Who hasn't gone down the rabbit hole of following link after link to suddenly find oneself in an alien corner of the web, reading about something wholly unrelated to what one first went looking for?

But serendipity can only take you so far. When you really want to use your research time efficiently, you want to focus in on a narrow range of things directly relevant to your immediate need. The down-the-rabbit-hole process of following links, while valuable in a broader sense, doesn't help you find or compare recipes for kung pao chicken, for example.

Between Archie and the modern search engine, several technological leaps had to happen:

  • A scale problem: We had to index actual content, not just file names, and do it at a scale that kept growing; by 2014, there were 1 billion websites.

  • A UI problem: We had to make it possible for largely untrained users to type in queries in natural language text, not in a rigid syntax of terms and Boolean operators, and engines had to advance to be able to divine what relevant results would best answer the questions implied by those natural language queries.

  • A quality problem: It's easy to make an engine that returns 10,000 results, but such an engine isn't going to be usable by people who work under limits of time and available attention. Search results must be ranked or otherwise limited in topical scope to be useful at all.  Indeed, search engine optimization (SEO) as a business sector arose purely to make sure that a commercial entity's desired answers ("buy product X") would show up in the first few links search engines presented to users in response to topical searches.

  • A language problem: How do you support search across every human language? The answer was Unicode, a standardized way to write down all the letters of all of the alphabets of all of the languages of the world.
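A tiny illustration of that last point, using arbitrary example terms rather than anything from a real index: because Unicode gives every character a code point and a standard byte encoding, one index can store and compare terms from any alphabet.

```python
# Arbitrary example terms in several scripts; Unicode reduces each one to
# code points and UTF-8 bytes that an index can store and compare uniformly.
terms = ["chicken", "poulet", "鶏肉", "دجاج"]

for term in terms:
    code_points = [hex(ord(ch)) for ch in term]
    print(term, code_points, term.encode("utf-8"))
```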

Web-based search engines index the full text of documents and employ many of the early information retrieval tactics developed in the latter half of the 20th century. They use term frequency, statistical analysis of word relationships, clustering, and different kinds of faceted search to help determine and return relevant results.
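As one illustration of the term-frequency idea, here is a minimal sketch of TF-IDF scoring over a made-up three-document collection. It is not the formula any particular engine uses, but it shows how documents that use a query's words more often, relative to how common those words are across the collection, rise to the top:

```python
# Minimal TF-IDF sketch over a made-up collection (illustrative only).
import math
from collections import Counter

docs = {
    "doc1": "kung pao chicken recipe with peanuts and chili",
    "doc2": "chicken soup recipe for cold winter days",
    "doc3": "history of the kung fu film genre",
}

tokenized = {d: text.lower().split() for d, text in docs.items()}

def tf_idf(term, doc):
    # Term frequency: how often the word appears in this document.
    tf = tokenized[doc].count(term) / len(tokenized[doc])
    # Inverse document frequency: rarer words carry more weight (smoothed).
    df = sum(1 for words in tokenized.values() if term in words)
    idf = math.log(len(docs) / (1 + df)) + 1
    return tf * idf

def rank(query):
    scores = {d: sum(tf_idf(t, d) for t in query.lower().split()) for d in docs}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("kung pao chicken"))
# doc1 ranks first: it contains all three query terms.
```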

In 1998, Google introduced PageRank, an algorithm that tackles the problem of "relevance determination" by using the pattern of web links between pages as a proxy for how important, and thus how relevant, a page is. While PageRank has since been superseded at Google by newer methods, its appearance changed the game in the commercial search space.
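The core intuition behind PageRank can be sketched in a few lines. This is a simplified power-iteration version over a tiny hypothetical link graph, not Google's production algorithm: a page's score depends on the scores of the pages linking to it.

```python
# Simplified PageRank by power iteration over a hypothetical link graph.
links = {  # page -> pages it links to (made-up example)
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the scores settle
    new_rank = {}
    for p in pages:
        # Each page passes its score, split evenly, along its outgoing links.
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: kv[1], reverse=True))
# "c" scores highest: three pages link to it, including a well-linked one.
```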

Today, search continues to improve, with machine learning-aided approaches representing the state of the art.

Other methods of search, especially for products, are based on metadata embedded in websites. Machine learning has also had great success using signals gathered from the moment a user lands on a site, comparing that user's actions to those of similar users. It learns and adapts as it goes, offering information that becomes more relevant the longer the user stays on the site, and it can incorporate user data from other sites it has access to.

Though information retrieval has been around for millennia, full-text search enables humanity to learn, discover and enjoy human knowledge in ways that were unimaginable even 30 years ago.

The future of search may well lie in images and in predicting what users are likely to need and presenting it to them before they have to ask – opening new frontiers in human information retrieval.
