In this months column, we feature a set of interview questions on algorithms, data structures, operating systems and computer architecture.
For the past few months, we have been discussing information retrieval, Natural Language Processing (NLP) and the algorithms associated with them. In this months column, we focus on an important aspect of NLP known as textual entailment.
What is textual entailment?
Let us understand more about this by first looking at the two following snippets of text:
(1) India has a number of nuclear power plants which are used to generate electrical power. The first nuclear power plant was started in Tarapur, Maharashtra. The latest one was commissioned at Kudankulam, Tamilnadu.
(2) Kudankulam in Tamilnadu has a power generation station, which generates electricity.
Now the NLP problem posed is to determine whether Snippet 2 can be inferred from Snippet 1. When human beings parse the text of Snippet 1 and Snippet 2, it is very easy for us to determine whether the latter can be inferred from the former. On the other hand, it is not easy for an automated NLP algorithm to reach this conclusion. This is the problem area that textual entailment techniques attempt to solve. Formally, given a text snippet T and a text snippet representing the hypothesis H, a textual entailment program could determine whether they formed a textual entailment pair. T and H form a textual entailment pair, if a human reading T and H would be able to infer H from T.
Consider the following example of two snippets:
(3) The Director of Public Health said, It is important to stress that this is not a confirmed case of Ebola.
(4) A case of Ebola was confirmed in Mumbai.
Now the question is whether Snippets 3 and 4 constitute a textual entailment pair? The answer of course is obvious to humansthat they do not form a textual entailment pair. But as we will see later, this is not a simple case for an automated textual entailment technique. Many techniques that use surface level string parsing or that are keyword-based, do wrongly mark this pair as textually entailed. Much of the complexity of automatic textual entailment techniques is needed to deal with and avoid such false positives.
An area closely related to textual entailment is paraphrasing. Two statements are paraphrases if they convey almost the same information. Consider the following snippets:
(5) Ayn Rand wrote the book Fountainhead.
(6) Fountainhead was written by Ayn Rand.
(7) Ayn Rand is the author of Fountainhead.
(8) Fountainhead is the work of Ayn Rand.
Now statements 5, 6, 7 and 8 are all paraphrases of each other. If two statements A and B are paraphrases, then they are mutually textually entailed. In other words, both (A, B) and (B, A) are textual entailment pairs. Statement A can be inferred from B and vice versa. But textual entailment does not automatically imply that they are paraphrases. Consider our previous example: Statement 1 implies Statement 2, but not the other way around.
Textual entailment programs can be used for: (a) recognising whether two pairs of natural language expressions form a textual entailment pair, (b) given a single natural language expression, generate all possible TE expressions for it, and (c) given a document or set of documents, extract all TE pairs. Similarly, paraphrase programs can be either recognisers, generators or extractors. Before we look into the techniques for TE and paraphrasing recognition/generation/extraction, let us look at the practical applications for TE and paraphrasing in NLP applications.
Applications of paraphrasing and textual entailment
So far, we have been considering textual statements as inputs to paraphrasing systems. But this may not always be the case. In fact, one of the earliest NLP applications of paraphrasing was in the field of automatic Question Answering (QA) systems. Consider a QA system in which the system is given a set of documents, and needs to find the answer to the posed question from among the documents. Given that the answer, if it is indeed present in the document, may be phrased differently from the way the question has been framed, it may be necessary to generate paraphrases of the question and check if any of the paraphrased questions can be answered using the content in the document collection. Let us consider a simple example in which the question being posed to the QA system is, Who is the author of the book Crime and Punishment? and one of the documents in the collection contains the passage, Some of Leo Tolstoys finest works include Crime and Punishment and Anna Karenina. In this case, by paraphrasing the question as, Whose work is Crime and Punishment? the QA system may be able to easily figure out the answer as Leo Tolstoy. Instead of paraphrasing the question, in some cases, the possible answers may also be paraphrased to check if any of the paraphrases can serve as an answer to the question.
Text summarisation is another area where paraphrasing techniques are used widely. Given a set of text documents to be summarised, one of the important functions of a summary extractor is to identify the most important sentences from the texts to be summarised. For example, let us consider that the task given is to create a summary from all news articles in the last one month which discuss the Ebola virus epidemic. Since these newspaper articles are discussing the same event, many of the documents will contain sentences that convey almost the same information. Hence, it is important to avoid selecting sentences that are paraphrases in the summary. Hence, paraphrasing techniques can be applied to a particular sentence to check if it is a paraphrase of any of the existing sentences in the summary, and if found to be so, it can be discarded. Similar to paraphrasing, TE can also be applied on a particular statement to check if it can be inferred from any of the existing sentences in the summary. And if so, then the statement can be discarded. Note that paraphrasing can also be used to achieve sentence compression, since it can help to generate a sentence which is shorter than the original sentence but conveys the same information.
Machine translation is another major area where paraphrasing and textual entailment techniques are applied. Paraphrasing, in particular, has been widely used in training machine translation systems by using a human generated translation reference and checking to see if the machine generated translation is a paraphrase of the human generated one. This is typically useful when popular literary works, which have been translated by humans, are used to train the machine translation systems. Let us consider a simple example, where the two statements are:
(a) Company A filed a case against Company B for copyright infringement.
(b) Company A accused Company B for copyright violation.
If our machine translation system has never encountered the phrase filed a case in the source language during its training samples, it could try finding a translation for the paraphrased source sentence (b), if it has encountered the word accused before in its training samples. Using paraphrasing allows a machine translation system to deal with words that it has not encountered before in its training samples.
Other areas where TE and paraphrasing techniques are useful include Information Extraction (IE) systems, which use automated techniques to extract information on specified topics from various sources. These are typically used for answering natural language queries in information retrieval systems. Consider, for example, a system that extracts information on all motor vehicle accidents from newspaper reports. Consider the following statement:
(a) The tourist bus collided with an oncoming car and crashed on to the median.
If this can be paraphrased as There was an accident involving a bus and car, then it is possible for the IE system to include this as a candidate news item in its search. Other areas in which TE and paraphrasing can be applied include automatic grading of students answers, and in providing simplified summaries of the expert documents in a form understandable to laymen.
We will continue our discussion on TE and paraphrasing next month, when we will look at the algorithms for them. Meanwhile, I would like to give readers the following assignment.
Given a set of questions based on world events, and access to the World Wide Web (and its search engines), can you come up with a program that generates answers to the questions based on search query results? One possible way is to do a keyword-based search on the search query results and return the document passage that bears the most similarity. What are the other possible techniques you can use? For instance, consider the following two questions:
(a) What is the major food item exported by India?
(b) What is the only city in India which, apart from being the capital of a state, is also located on the banks of a river and is on the seashore?
How do these two questions vary in terms of difficulty in finding answers through an automated system? How would your solution deal with question (b)?
By the way, I wanted to point our readers to an interesting article I read recently in the IEEE Spectrum magazine by Prof Jordan (of machine learning, EM algorithm and Bayesian networks fame) titled Delusions of Big Data and Other Huge Engineering Efforts which is available at:
http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts. He makes the important point that the inferences drawn from Big Data need to be validated for whether they are random patterns found by analysis (causality does not imply causation) or real root-causes explaining the data patterns. On being asked what research area he would target if he got a billion dollar research grant, he picked Natural Language Processing.
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, Happy New Year and happy programming!