The Needle and the Haystack: Exploring Search Models, Part 1

Search

Searching for the proverbial needle in the haystack occurs millions of times a day in the realm of cyberspace. Ever looked under the hood of a search engine — that apparently Rube Goldberg machine we turn to for everything from driving directions to locating that email in your mailbox?

A Rube Goldberg machine is a “…deliberately over-engineered machine that performs a very simple task in a very complex fashion…” (Wikipedia).

You will find more information on these machines using (again!) Internet search. However, search engines have no Rube Goldberg over-engineering in them. From the basic command-line search engine — like grep — to specialised search engines that lord over vast, evolving repositories, the search engine universe seems to be expanding like our own cosmic universe. Complexity is a necessary attribute of a search engine, and not solely because users’ information need — our quest for relevance — keeps on getting more and more complex.

In this two-part series, we illustrate the basic principles of search. By no means are we even pretending to delve into the back rooms of googol-scale (no typo here!) operations that bring with them their own orders of complexity. This is not a formal treatment of the subject either; for that, you would have to refer to a text book on information retrieval.

Jargon Buster
This jargon buster does not give formal definitions, and is just good enough as a starting point before delving deeper. You could impress your colleagues with this jargon for a couple of days, though.

Relevance

No discussion on search engines can be complete without the R-word, just as any discussion on the economy cannot continue for long without the word “infrastructure”. Relevance is simply the quality of meeting our information need. You may not be able to define a relevant result, but you can tell when you see one.

Corpus

A collection or corpus is the body of documents, or suitably organised content, that is indexed by a search engine. Well-characterised corpora are used to measure the precision and recall of a search engine. [Test collections or
corpora for testing Information Retrieval
.]

Precision

The precision of a search engine is the number of relevant documents retrieved, as a percentage of the total number of documents retrieved. Precision is quite apparent. You will find page after page of relevant documents (or links) in the search results when a search engine exhibits high precision.

Recall

The recall of a search engine is the number of relevant documents retrieved, as a percentage of the total number of relevant documents in the collection being searched. This measure reflects the real-life apprehension that all relevant documents will never be retrieved. This measure is not very apparent, and a high precision might mask a low recall.

Precision and recall work at cross purposes, and tweaking one affects the other. As to which of these two, precision or recall, is more important: this is more a matter of “fitness to purpose”, which is a matter of debate. Internet search engines or product finders might cheer for a high precision. There might be commercial interests involved, and holding on to the surfer a few seconds longer, with relevant results within the the first ten or twenty, makes all the difference.

Closed, captive collections on the other hand, have no such compulsions. Here people might put a high premium on recall, and may want to ensure that a relevant document does not escape retrieval. Recall appears to be the bigger challenge, going by the various approaches followed to improve search engine results. [This Wikipedia article explains precision and recall, and some approaches employed to improve the quality of search results.]

Discounted Cumulative Gain

Discounted Cumulative Gain (DCG) is a subjective method to compare search engines. Subject-matter experts are asked to rank results from the search engines that have indexed a common collection. The search results are scored in such a way that relevant items (as judged by the expert) appearing high in the result list score higher than highly relevant documents appearing lower down, which are penalised. The weighted scores of individual documents are accumulated for each result set, giving it a cumulative score that serves as a factor of comparison.

Stop words

Stop words are prepositions and other filler words (“in”, “it”, “of”, “on”, etc.) that add noise to the search query, and are best removed. These are words that are found in almost all documents, and searching with stop words left intact in the query can lead to a lower quality of search results.

Let’s explore this further with an analogy. Imagine you are an alien, with no knowledge of human life on earth, and are asked to identify humans (for further study) using attributes such as “exhibits motion”, “has limbs”, “reproduces”, etc. You would end up with a large result set, which would include almost all living things with limbs, that exhibit locomotion. On the other hand, if you were given attributes such as “walks upright”, “speaks on the phone while driving”, and “watches TV”, you will end up with a much better result set.

In this example, “exhibit motion” is analogous to a stop word for our search. As you can clearly see, the query which gives us the best results is the one that has the least number of attributes unique to a relevant result. Removing stop words helps achieve this.

Stemming

Stemming is a technique used to increase the recall of a search engine. Word forms (assuming English as the language) are mapped to the common root, and are equivalenced to mean the same search term. For instance the word “running” as a search term will also bring up documents that contain “run”, “ran”, or “runner”. On a related note, when a search engine is sensitive to synonyms, it treats as equivalent two words that mean the same thing; “Bangalore” and “Bengaluru”, for instance.

Term weighting

Term weighting implies that a search term is not given the same importance across all documents in the result set. It is a mechanism to rank search results. According to local weighting schemes, a search term is given a weighting in a document that is proportional to criteria such as its frequency in the document, or how early in the document the term appears.

The assumption is: if a document cites a certain (search) term over and over, or early enough, it has to be highly relevant with respect to our search term. If local weighting uses the document under consideration, then global weighting looks exclusively at the complementary set, or the rest of the documents in the result set. Global weighting scores a term in inverse proportion to the frequency of the term in the complementary set of documents. The net result is that documents containing more unique search terms across the result set get pushed up, and documents containing terms common across most documents in the result set get pushed down.

This weighting scheme goes by the name of tf-idf (term frequency-inverse document frequency). While this scheme might work well for closed, captive collections, it can fail miserably in the Internet search scenario. In the latter case, tf-idf used by itself can lead to a lot of spam in the search results (Do the words “keyword stuffing” ring a bell?)

The three models

The simplest search tool is the Linux/Unix command-line grep. Using grep as follows, in a directory containing files, will give you search-engine-like output:

grep -iH --context=1 -m2 <search term> <file names>

The -i stands for “ignore case”; the -H causes printing of filenames; --context=1 ensures that the matching line is preceded and followed by one line of contextual text; -m2 asks grep to print at most two matches per file. The files are your corpus or collection that you want to search.

Note some of the things that are happening here. grep trawls through your collection live, as it executes your search command. Such an approach will not work for large collections. Not only are we required to read thousands of files over and over for each query, we are also required to scan the files that do not have the query terms. Is there a smarter way to do this, thereby reducing disk and processor grind, and improving performance?

Secondly, you will note that files with the search terms are printed in alphabetical order, with no indication as to the relevance of the file to your search request. Of course, you can use the reasonable approach of counting search terms in the files. A -c parameter to grep prints file names with search term counts. If you are a medical student, then files containing high counts of terms such as “patient” or “symptoms” are likely to be more relevant to you.

But is that all there is to judging relevance of documents? Can we judge shades of relevance?

Boolean model

We raised the issue of the performance penalty in using grep as a search engine. Is there a way to read the collection just once, and record the presence or absence of words — our potential search terms? We are talking of a rudimentary index here. This index data structure gets more elaborate later. With a Boolean model, we can throw up our result set (the documents containing our search terms) relatively quickly. What we do not get is information as to which documents are more relevant to our search.

Such a model can be applied to small collections, much like you would use grep -l in a directory to locate certain files that are likely to be there. This model is rarely used in isolation, and is more likely to used as an upstream filtering step, to quickly get a result set before relevance ranking is done. For our purpose, this model gives us an excellent opportunity to illustrate how Boolean algebra comes to the aid of search implementation.

Create four files — a.txt, b.txt, c.txt, and d.txt, with the following sets of three words respectively: apple ball cat ; ball cat dog ; cat dog eel ; dog eel fox.

Passages from Shakespeare would have sounded more natural, but we are using words from kindergarten to illustrate our point better. We have six terms in all, in four files; a term-document incidence matrix might look as follows (we use the words “file” and “document” interchangeably):

Table 1: Term-document matrix
Term a.txt b.txt c.txt d.txt
apple 1 0 0 0
ball 1 1 0 0
cat 1 1 1 0
dog 0 1 1 1
eel 0 0 1 1
fox 0 0 0 1

To answer the query ball AND dog, do a bit-wise AND of the document vectors for ball and dog, thus:

1100 AND 0111 = 0100

The document denoted is b.txt.

For a query cat AND dog AND NOT ball, we take the document vectors for the terms cat, dog, and the complement of the vector for ball, to get:

1110 AND 0111 AND 0011 = 0010

This denotes the file c.txt.

Once we create a term-document incidence matrix as above, and make it persistent — see this as a rudimentary indexing step — we do not have to read the documents each time we search, thus giving our search process an efficiency boost.

You can get the same effect with grep. Note, however, that this is not the smartest way to use the tool, and we are using it only for the purposes of illustration. For a query cat AND dog AND NOT ball, run the following command, and note the output is c.txt.

grep -l cat [abcd].txt | xargs grep -l dog | xargs grep -lv ball

The commands can be pipelined in any order to get the result, just as the document vectors can be logically operated upon in any order.

Shades of relevance?

The Boolean model takes a black-and-white approach to search results, and while being fast, can work with small document collections that the searcher is well acquainted with. All documents of the result set are considered equal. As the size of the collection grows to tens of thousands of documents, and as search queries get more complex, and as ideas of relevance get more sophisticated, we need a document model that introduces shades of grey within the result set.

In the next part of this series, we will look at two other models that help rank search results using relevance measures, and try to understand why, in the interests of better search results, the content and the algorithm need to meet each other half-way.

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.
Open Source For You is powered by WordPress, which gladly sits on top of a CentOS-based LEMP stack.

Creative Commons License.