The Needle and the Haystack: Exploring Search Models, Part 2

Search

In the previous article, we demystified some search-related jargon, and learned how the humble Grep can be used to simulate a Boolean-model search engine. In this concluding article on the subject, let us explore two more search models that sort search results by relevance and allow for some custom tuning to suit the corpus. We also knock on the doors of content management without actually entering, just to show how the business of search becomes more fruitful when content is made to go that extra mile.

Let’s begin by understanding the vector space model. In this model, each document is treated as a projection in vector space. A brief explanation is in order. An arbitrary point in a three-dimensional space can be represented by three coordinates: x, y, and z. Here, our vector space has three dimensions, and any point is the result of its three component vectors (coordinates) projected along their three respective axes.

Now imagine a vector space with more than three dimensions — in fact, as many dimensions as there are unique terms in the collection that you want to search. Each unique term in the collection represents an axis or a dimension. Now, every document can be “visualised” as the result of all component vectors pointing along axes corresponding to unique terms in the document. This model treats every document as a bag of words, with no regard to its semantics or word order. For instance, the two documents Cats love mice and Mice love cats will have the same document vector.

The magnitude of a particular component vector (remember, each component vector represents a unique term now) for a document along that term’s axis can now be made to depend on a “weight” for that term, in that document. Term weight in a document is nothing but a measure of how important a particular term is to a document. Term weighting helps differentiate documents in vector space, making it easier to rank them by so-called relevance. Term weighting can get quite sophisticated, and we will dwell on it briefly later. The simplest term-weighting scheme is term frequency. An example should illustrate the concept. Let us modify our documents a little; for instance, they could be:

Cats love mice and I love cats.

……and:

Mice love cats.

Though both the documents have similar vectors in term space (ignoring “I”, and “and”), the first document has higher weights for cats and love, based on in-document term frequency, making it more relevant in a search for, say, cats. In other words, a term with a high weight in a document makes that document more relevant in a search with that term as a search keyword.

Using the model

We now have a collection with a set of vectors, one to each document. Vector algebra gives us the means to compute the “similarity” of two vectors, with the help of the dot product. A standard text on vector algebra will help you with more details on the vector dot product. But if you accept the maths on faith, then the following illustration should do for the moment. A dot product of two vectors A = [a1, a2, a3, …, an] and B = [b1, b2, b3, …, bn] is a scalar, and is given by the following equation:

A•B = |A||B| cos Θ = a1b1 + a2b2 + a3b3 + …... + anbn

…where Θ is the angle between the vectors. The smaller the angle, the closer the vectors are to each other. And so, we have:

cos Θ = (A•B)/(|A||B|)

Here, |A| is the magnitude of A given by:
\sqrt{a_1^2 + a_2^2 + ... + k_n^2}
…with the magnitude of |B| also being calculated in a corresponding manner.

For the purpose of our search, we need to ascertain how close our search term vector is to various documents in the collection, using the cosine similarity formula above. The cosine similarity is a normalised measure, making it independent of the size of the vectors — and consequently, the document sizes. The document (vector) closest to our search term vector ranks the highest in our search results.

Now that we have spoken enough about this model, let us take it for a test drive. Remember those four documents we used to illustrate the Boolean model? Let us modify them a little, so they appear as follows:

  • a.txt: apple, ball, cat
  • b.txt: Dogs love cats but cats love balls.
  • c.txt: Cats hate dogs and dogs love eels.
  • d.txt: dog, eel, fox

These example documents will not win any literary prizes, but will do famously to illustrate how this model works. Our term space (ignoring word forms) is as follows:

[ apple, ball, cat, dog, eel, fox, hate, love ]

Let us draw the term-document incidence matrix, too (Table 1). Note that we introduce term weighting for each document by giving due importance to a term’s frequency in a document.

In a search engine, ignoring word forms (example: running, runner) and mapping them to a root word (“run” in this case) is done by a process called stemming. This is essential to keep the vector space from burgeoning out of proportion, and also to make life easier for the search user — a search for “run” will also bring up documents with “running” and “runner”, and perhaps improve both precision and recall. We also ignored the words “and” and “but”. These are so called stop-words that are part of any document, and muddy up the vector space if allowed to pass through.

Now, if we were to search for the term dog, our search query vector would be as follows:

 [ 0 0 0 1 0 0 0 0 ]

…with a magnitude |Q| of 1.

Table 1: Term-document matrix
a.txt b.txt c.txt d.txt
apple 1 0 0 0
ball 1 1 0 0
cat 1 2 1 0
dog 0 1 2 1
eel 0 0 1 1
fox 0 0 0 1
hate 0 0 1 0
love 0 2 1 0
|D| 1.73 3.16 2.83 1.73

You might have guessed already that the search query vector is nothing but another document with solely our search terms, and all that the vector space model does is compare the closeness of documents in terms of their vectors. Let us then compute the cosine similarity of our search term vector with the document b.txt to illustrate the computation:

cos θ = (Db.txt•Q)/(|Db.txt||Q|) = (0*0 + 1*0 + 2*0 + 1*1 + 0*0 + 0*0 + 0*0 + 2*0) / (3.16*1) = 0.32

Table 2 shows the cosine values for all documents with respect to our search term vector.

Table 2: Cosine values
a.txt b.txt c.txt d.txt
cos θ 0 0.32 0.7 0.58

There are a couple of interesting things we can learn from Table 2. It should not surprise us that a.txt scores 0, making it the least relevant (in fact, it should not even make it to the search results); and c.txt scores the highest, making it the most relevant according to our model. But what does surprise us is that d.txt scores higher than b.txt. This shows the predisposition of this model towards shorter documents, term weights being equal.

Intuitively, you can appreciate that the term dog has a higher contribution towards the resultant document vector in d.txt than in b.txt. However, using idf (inverse document frequency) in addition to tf (term frequency) for term weighting can mitigate the situation.

The vector space model is powerful, but is very computation- and memory-intensive. All those vector inner products for a large collection take their toll on performance. On the other hand, comparing documents or finding similar ones (the familiar “more like this” or “find similar” hyperlink) is straightforward, given how the model works. Search term vectors can be arbitrarily long because the search terms are treated as yet another document.

Although initial search results might be disappointing, this model gives us the advantage of relevance feedback. What this means is, if we find a highly relevant document in the result set, we can use the “more like this” link to drill down deeper into a subset of the results. When this happens, the significant terms of the highly relevant document are used to reconstitute the search string, which is then resubmitted, giving us a revised and, hopefully, more relevant result set.

Probabilistic model

The probabilistic model uses an approach of estimating the probability of a search query term indexing a relevant document. (A term is said to index a document if that term appears as a key word in that document in the search engine’s database.) It is variously termed as “difficult” and “adhoc”. Working with a search engine implementing this model might also need some getting used to. However, it has a sound theoretical basis, which we will barely touch upon here. The so-called BM25 weighting scheme is most popular since it’s considered the most suitable.

Briefly, this is how it gets off the ground, as described here — technical details of the probabilistic model in a relatively palatable form. If we were to define some probabilities for a query term t, as shown below:

p    probability that t indexes a relevant document
q    probability that t indexes a non-relevant document.

……we would consequently get:

1-p    probability that t does not index a relevant document
1-q    probability that t does not index a non-relevant document

The term weight would then be given by the following equation:

 w(t) = log (p(1-q)/(q(1-p)))

To give real-world values to the term weight, it is expressed in terms of:

  • N — the document collection size,
  • R — the subset of relevant documents that are relevant for a search term t,
  • n — the number of documents indexed by term t,
  • r — the number of indexed documents that are relevant,
  • and some constants that are chosen for optimal results.

That is as far as we go here.

Like the vector space model, this model too calculates weights for documents that are indexed by the query terms. In the vector space model, the cosine similarity acts as a surrogate for this weight. But unlike for the vector space model, the document weights can keep changing based on user interaction. This model starts with a basic set of “relevant” documents in the result set, using in-document term counts or inverse document frequency (see term weighting in the Jargon Buster section from our previous article) to calculate document weights.

But the model expects the user to select the more relevant documents in this set, and resubmit the query, thereby allowing the model to reformulate the query, recalculate document weights, and revise the result set. In other words, the search process is iterative if it has to give increasingly meaningful results. For first-hand experience of this model at work, try this example of a search engine implementation using the probabilistic model with search-related queries.

To get a closer look at where the rubber meets the road, you can read how Lucene — a search engine from the Apache stables — ranks results. If you try out Lucene (or a similar search engine), you might get a glimpse into how the best of algorithms cannot avoid giving you counter-intuitive result rankings, or how sometimes, unlikely keywords can give you relevant results.

Is the quest for the perfect search destined to run aground in the sands of irrelevance? No — not if we can dress up the content.

Dressing up the content

The problem of getting consistent, usable search results has already been cracked by your public library. Imagine that you were searching for library books using key-words the way you search URLs on the Internet. There is no telling how long you would need to search before getting the relevant book; often, you may not even know which keywords to use.

But that is not how searching for books in a library has been happening for a long time, nor how you now search for MP3 songs on your personal music player. You search for books in a library using the library catalogue, and for music on your PMP using ID3 tags, which is nothing but metadata (data about data) describing the books and the music respectively.

While search as an activity had been happening in libraries for a long time, it emerged as a big-ticket application only with the advent of the Internet. With the novelty of this application, there also crept in the implicit assumption that Internet URLs are self-cataloguing. Although that is true to a certain extent when the “Title” and “Keyword” meta tags in HTML are populated correctly, we all know that in the race for search-engine rankings, the most sedate Web designers take extreme liberties with this cataloguing data, making it hard for it to serve any useful purpose. And we have realised that (key-) words that appear directly in the content provide poor and inconsistent self-cataloguing of the content.

Cataloguing or tagging content or documents using standardised schema is the time-tested approach to better, consistent search results in a captive, managed collection. Typical collections can pertain to medical literature, archived news reports, or scientific papers. Certain content just cannot be found without metadata — try searching for MP3 music without ID3 tags.

Content on the Internet is anything but “captive” or managed; besides, it keeps changing in unpredictable ways. Various approaches, such as folksonomy, social indexing, and tagging, are already in vogue to make Internet content search-friendly. If you are active in the blogosphere, you are already using these techniques.

The needle and the haystack

While searching in uncatalogued content can be a frustrating experience, the search operation becomes more rewarding when we pay as much attention to the content as we do to the search algorithms. In other words, you can find the needle in the haystack provided you thread (tag) the needle and arrange the hay in bales. Once you begin to control and manage your content repository so systematically, you have the beginnings of a Content Management System (CMS).

I hope that this, and the previous article, gave you a relevant perspective on search.

Feature image courtesy: Tall Chris. Reused under terms of CC-BY 2.0 License.

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.
Open Source For You is powered by WordPress, which gladly sits on top of a CentOS-based LEMP stack.

Creative Commons License.