In this month's column, we discuss automatic relation extraction from text using natural language processing techniques.
Continuing our discussion on natural language processing (NLP), let's look at how it can be used to extract information automatically from text. Given the exponential growth of information on the Web, extracting and organising knowledge manually is no longer feasible. Automated tools are needed to read the ever-growing text available on the Web, extract facts and information from it, and organise them in a manner amenable to answering questions from users.
Given that search engines are increasingly used for finding answers to questions (and not just general information), it is important that they go beyond merely returning pages of search result links. For instance, if a user types the query "Who is the current president of the United States?" into a search engine today, it is not user friendly to return thousands of search results. Instead, a single-line answer such as "Barack Obama is the current president of the United States" is the most user-friendly and acceptable result possible. If you try this query on Google's search engine, it returns the one-line answer "Barack Obama", with additional information on him in a small box and the standard search results displayed below. Google's ability to answer factual questions directly is powered by its Knowledge Graph (http://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html), a knowledge base curated from multiple data sources such as Freebase, Wikipedia and Google's search engine results. Just as Google uses Knowledge Graph to improve its ability to answer factual queries, Microsoft developed a knowledge base known as Satori to power its Bing search engine results.
A knowledge base consists of information, often gathered from unstructured sources, represented in a structured form so that it can be queried easily. A knowledge base system typically contains facts stored in the knowledge base, and an inference engine that can reason over these facts to derive new facts and to detect any inconsistencies among them. Knowledge Graph and Satori are knowledge bases that harvest world knowledge from the Web. Knowledge bases can be curated either manually or through automatic means. For Web-scale knowledge bases, manual curation is not feasible. Hence, an automatic means of extracting information from the Web is needed to populate them.
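To make the idea of facts plus an inference engine concrete, here is a minimal Python sketch. It is only an illustration: the relation names (born_in, located_in, birth_year) and the two inference rules are invented for this example and are not taken from Knowledge Graph, Satori or any real knowledge base.

```python
from collections import defaultdict

# Facts are (subject, relation, object) triples. The relation names are
# hypothetical and chosen only for this sketch.
FACTS = {
    ("Barack Obama", "born_in", "Honolulu"),
    ("Honolulu", "located_in", "Hawaii"),
    ("Hawaii", "located_in", "United States"),
    ("Barack Obama", "birth_year", "1961"),
}


def infer(facts):
    """Apply two toy rules repeatedly until no new fact can be derived.

    Rule 1: located_in is transitive.
    Rule 2: born_in(x, p) and located_in(p, q) imply born_in(x, q).
    """
    derived = set(facts)
    while True:
        new_facts = {
            (a, r1, d)
            for (a, r1, b) in derived
            for (c, r2, d) in derived
            if b == c and r2 == "located_in" and r1 in ("located_in", "born_in")
        }
        if new_facts <= derived:  # nothing new: a fixed point has been reached
            return derived
        derived |= new_facts


def find_conflicts(facts, single_valued=("birth_year",)):
    """Return (subject, relation) pairs that have more than one object
    for a relation that should hold a single value."""
    values = defaultdict(set)
    for subj, rel, obj in facts:
        if rel in single_valued:
            values[(subj, rel)].add(obj)
    return {key: objs for key, objs in values.items() if len(objs) > 1}


kb = infer(FACTS)
print(("Barack Obama", "born_in", "United States") in kb)
print(find_conflicts(kb | {("Barack Obama", "birth_year", "1962")}))
```

The first print checks a fact that was never stated directly but follows from the rules. The second adds a contradictory birth year to show how an inconsistency could be flagged. Production inference engines are far richer than this double loop.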
Let us now discuss how NLP techniques can be used to extract information automatically from text documents to populate knowledge bases. This task is popularly known as information extraction (IE). IE processes human language texts using NLP techniques, extracts the information present in unstructured or semi-structured machine-readable texts, and represents it in a structured manner. At its core, the task is about extracting entities and the relationships between them.
Information extraction can be specific to a particular domain, or it can be done in a domain-independent manner. Since unstructured information needs to be represented and organised in a structured manner, IE requires that a structure capable of supporting queries on the extracted information be defined first. This structure defines the types of entities and the relationships between them. For example, let us consider a domain-specific information extraction task. You have been given a set of medical research texts and are asked to extract information pertaining to various diseases. The entities of interest here could be the disease names, their associated symptoms, the drugs used to treat the different diseases, and so on. Some of the possible relations include "Disease X has symptom Y" and "Disease X is treated by Drug Z". These textual relationships are defined in the domain ontology by ontological relations such as TREATS and HAS_SYMPTOM. Given that these relations can be expressed in a natural language such as English in myriad different ways, any information extraction system needs to be able to analyse the syntactic constructs of the language and, if needed, infer the semantic meaning in order to extract the relations.
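The medical example can be sketched in a few lines of Python. This is a toy under stated assumptions: the ONTOLOGY and PATTERNS tables, the specific surface patterns and the sample sentences are all made up for illustration. Real extractors rely on syntactic and semantic analysis rather than a short list of regular expressions.

```python
import re

# Hypothetical mini-ontology: relation name -> (subject type, object type).
ONTOLOGY = {
    "TREATS": ("Drug", "Disease"),
    "HAS_SYMPTOM": ("Disease", "Symptom"),
}

# Hand-written surface patterns for each relation. The same relation can be
# phrased in many more ways, which is why pattern lists alone do not scale.
PATTERNS = {
    "TREATS": [
        re.compile(r"(?P<obj>\w+) is treated (?:by|with) (?P<subj>\w+)"),
        re.compile(r"(?P<subj>\w+) is used to treat (?P<obj>\w+)"),
    ],
    "HAS_SYMPTOM": [
        re.compile(r"(?P<subj>\w+) (?:has symptom|causes|presents with) (?P<obj>\w+)"),
    ],
}


def extract(sentence):
    """Return (subject, relation, object) triples matched in one sentence."""
    triples = []
    for relation, patterns in PATTERNS.items():
        for pattern in patterns:
            for match in pattern.finditer(sentence):
                triples.append((match.group("subj"), relation, match.group("obj")))
    return triples


for text in [
    "Malaria is treated with chloroquine.",
    "Chloroquine is used to treat malaria.",
    "Malaria causes fever.",
]:
    print(extract(text))
```

Note that the first two sentences express the same TREATS relation with the drug and disease in opposite order, which is why each pattern names its own subject and object groups.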
The structure used for representing the information extracted from unstructured text is known formally as an ontology. It is the formal naming of the various entities of interest in a domain, along with their associated relationships. There are two types of information extraction (IE) systems: closed IE and open IE. The former is associated with a fixed set of entities and relations already specified through an ontology. In our example, if the ontology is pre-specified and is used as the basis for extracting information from the unstructured text, then it is closed IE, since only relations specified in the ontology and entities belonging to the types/classes of the ontology would be extracted from the text. Closed IE systems are typically used in domain-specific IE, where the entity classes and relationships are well defined and known. Some well-known examples of closed IE systems are NELL and DBpedia. NELL is the Never Ending Language Learning system (http://rtw.ml.cmu.edu/rtw/), which uses machine learning techniques to continuously extract facts from the Web and represent them in a structured ontology. DBpedia is a structured representation of information extracted from unstructured Wikipedia texts (http://wiki.dbpedia.org/). Both of these popular systems are closed IE systems, since they extract only predefined relationships from unstructured texts.
On the other hand, open IE systems extract any relation that is present in unstructured text. Open IE systems do not need a pre-specified ontology to guide the extraction process. In our example of extracting medical relations, consider a sentence of the form: "Drug X has adverse side effects such as severe allergic cough and skin rash." Since closed IE systems extract only the relations specified in the ontology, the side-effect relation will not be extracted by a closed IE system whose ontology defines only TREATS and HAS_SYMPTOM. However, this is an important fact associated with Drug X, and it can be extracted only by an open IE system. Open IE systems do not have a fixed ontology or fixed entity types. All the possible relations present in the unstructured text are extracted, irrespective of the types of entities and relationships.
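The difference between the two approaches can be pictured as a filter over candidate triples. In the sketch below, the candidate list is written by hand to stand in for the output of an extractor, and ONTOLOGY_RELATIONS is a hypothetical closed ontology. Neither comes from a real system.

```python
# Candidate triples as an open IE system might emit them (hand-written here).
candidates = [
    ("Drug X", "treats", "Disease Y"),
    ("Disease Y", "has symptom", "fever"),
    ("Drug X", "has adverse side effects such as", "skin rash"),
]

# Relations known to the hypothetical closed ontology.
ONTOLOGY_RELATIONS = {"treats", "has symptom"}


def closed_ie(triples, allowed):
    """Keep only triples whose predicate is defined in the ontology."""
    return [t for t in triples if t[1] in allowed]


def new_relations(triples, allowed):
    """List predicates found in the text that the ontology does not cover."""
    return sorted({pred for _, pred, _ in triples if pred not in allowed})


print(closed_ie(candidates, ONTOLOGY_RELATIONS))
print(new_relations(candidates, ONTOLOGY_RELATIONS))
```

The closed filter drops the side-effect triple, while the second function surfaces it as a relation type the ontology does not yet know about.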
After extracting all possible relations, an ontology may be constructed to organise the extracted relations in a structured form. In cases where a predetermined ontology is provided, it can be extended if the open IE system has extracted new entity types or new relation types. Open IE systems are typically used in Web-scale information systems by search engines. Well-known examples of open IE systems include the Open IE 4.0 project (https://github.com/knowitall/openie), the TextRunner project (http://openie.allenai.org/) and Stanford OpenIE (http://nlp.stanford.edu/software/openie.html).
Open IE systems typically extract relation triplets of the form <subject, predicate, object> from unstructured text. For example, consider the following sentence from a text document: "Obama, president of the United States was born in Hawaii." Open IE systems extract two relations from it: <Obama, is (being), president of United States> and <Obama, born in, Hawaii>. In both these relations, "Obama" is the subject. "is" (to be) and "born in" are the predicates of the two relations. "President of the United States" and "Hawaii" are the objects of the two relations. While closed IE systems need the relation types to be specified in advance, open IE systems can discover relations such as "born in" automatically from the structure of the language.
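The triplet structure is easy to represent in code. The following Python sketch handles only the one sentence shape used in the example, through a single hard-coded regular expression. The Triple type, the pattern and the function name are assumptions made for illustration. Actual open IE systems work from part-of-speech tags or parse trees and cover far more constructions.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# One hard-coded shape: "<subject>, <description> was born in <place>."
# The comma after the description is optional, as in the example sentence.
APPOSITIVE_BORN_IN = re.compile(
    r"^(?P<subj>[^,]+),\s*(?P<desc>.+?),?\s+was born in (?P<place>[^.]+)\.?$"
)


def extract_triples(sentence):
    """Return the two triples for the supported sentence shape, else []."""
    match = APPOSITIVE_BORN_IN.match(sentence.strip())
    if not match:
        return []
    subject = match.group("subj").strip()
    return [
        Triple(subject, "is", match.group("desc").strip()),
        Triple(subject, "born in", match.group("place").strip()),
    ]


for triple in extract_triples(
    "Obama, president of the United States was born in Hawaii."
):
    print(triple)
```

Storing the result as named triples makes it straightforward to query later, for instance by collecting every triple whose subject is "Obama".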
Relation extractors for closed IE systems need to be provided with relation structures and entity types, whereas open IE systems do not need this information. This allows open IE systems to discover much larger sets of relations than closed IE systems. For example, closed IE systems such as NELL and DBpedia support only 500 and 940 relation types, respectively. On the other hand, open IE systems such as TextRunner can extract 100,000+ relations, and ReVerb (http://reverb.cs.washington.edu/) can extract 5,000,000+ relations.
Extracting relations from unstructured text can be done by means of: (a) supervised, (b) weakly supervised, and (c) distantly supervised techniques. The supervised approach to relation extraction requires an extensive set of labelled training examples of extractions. The weakly supervised approach, on the other hand, requires only a handful of seed patterns, which guide the extractor to iteratively identify further relation patterns. While weakly supervised techniques do not require extensive labelled data, they still need to be given seed patterns by a human. Distantly supervised techniques use the relationship facts available in external knowledge bases like DBpedia or Wikipedia to identify the initial seed patterns automatically, so that no human needs to supply them to the relation extractor. We will discuss each of these approaches in the next column.
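As a preview, the distantly supervised idea can be sketched roughly as follows. Everything here is an assumption made for illustration: KB_FACTS stands in for an external knowledge base, CORPUS is a four-sentence toy corpus, and the rule of keeping the text between two entity mentions as a pattern is a deliberate simplification.

```python
import re
from collections import Counter

# Facts assumed to already exist in an external knowledge base.
KB_FACTS = {
    ("Barack Obama", "born_in", "Hawaii"),
    ("Albert Einstein", "born_in", "Ulm"),
}

CORPUS = [
    "Barack Obama was born in Hawaii in 1961.",
    "Albert Einstein was born in Ulm, Germany.",
    "Barack Obama visited Hawaii last year.",
    "Marie Curie was born in Warsaw.",
]


def learn_patterns(facts, sentences):
    """For every sentence mentioning both entities of a known fact, count the
    text between the two mentions as a candidate pattern for that relation."""
    patterns = Counter()
    for subj, rel, obj in facts:
        for sent in sentences:
            start = sent.find(subj)
            if start == -1:
                continue
            end = sent.find(obj, start + len(subj))
            if end == -1:
                continue
            between = sent[start + len(subj):end].strip()
            if between:
                patterns[(rel, between)] += 1
    return patterns


def apply_patterns(patterns, sentences, min_count=2):
    """Use patterns seen at least min_count times to extract triples.
    The object is assumed to be a single capitalised word."""
    found = set()
    for (rel, text), count in patterns.items():
        if count < min_count:
            continue
        regex = re.compile(
            r"^(?P<subj>.+?) " + re.escape(text) + r" (?P<obj>[A-Z]\w*)"
        )
        for sent in sentences:
            match = regex.match(sent)
            if match:
                found.add((match.group("subj"), rel, match.group("obj")))
    return found


patterns = learn_patterns(KB_FACTS, CORPUS)
new_facts = apply_patterns(patterns, CORPUS) - KB_FACTS
print(patterns)
print(new_facts)
```

In this toy run, the sentence about visiting Hawaii also mentions both entities of a known born_in fact, so "visited" is wrongly proposed as a born_in pattern. The min_count threshold filters it out, and the surviving "was born in" pattern yields one fact that was not in the knowledge base, about Marie Curie. Noisy labels of this kind are a known weakness of distant supervision.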
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!