A text-mining system for knowledge discovery from biomedical documents
The life science industry is an emerging market in which application spaces, such as drug discovery and development in the pharmaceutical sector and clinical record management in health care, have become areas of significant recent interest. (1) Documents in the scientific literature play an important role in life science by serving as a potential source for underlying knowledge discovery. These documents are a rich repository of information on relationships among biomedical concepts such as genes, proteins, diseases, and a variety of other key topics.
Text mining is a technology that makes it possible to discover patterns and trends semiautomatically from huge collections of unstructured text. (2-6) It is based on technologies such as natural language processing, information retrieval, information extraction, and data mining. (7) Early papers in this area mentioned the possibility of knowledge discovery from the biomedical literature. Hearst, one of the founders of text mining, proposed a system for predicting the functions of unknown genes using biomedical documents. (2) Swanson also described the idea of discovering new knowledge from the biomedical literature. (4) Subsequently, considerable research has been done in the areas of biomedical concept extraction (named-entity extraction), relationship extraction, and network/pathway construction for protein-protein interaction. However, although text mining has proved a promising approach for knowledge discovery from text sources, certain specific problems are encountered when trying to apply it to the realm of life science.
First, existing approaches are incapable of handling the vast amount of textual domain-specific information available. Indeed, there is more data available than anyone could possibly read or digest. For example, MEDLINE (8) is a database of over 11 million citations (abstracts) of biomedical articles dating back to the 1960s. MEDLINE is widely used as a gold standard for text-mining systems in life science, and several text-mining applications using MEDLINE have been proposed. The MedMeSH Summarizer (9) extracts MeSH (Medical Subject Headings) terms (10) that can summarize the nature of a cluster of gene names obtained from DNA microarrays (also called DNA chips). MedMiner (11) is a system that filters information for the PubMed search engine. (12) Obviously, any approach that applies text-mining methods to such a large document collection must be highly scalable and robust.
Second, existing information extraction systems present extracted concepts and relationships only in a fixed form. Because these systems are noninteractive, it is difficult to apply mining processes iteratively to their results. With an interactive text-mining system, users are better able to discover hidden knowledge by combining mining functions in a trial-and-error approach.
To address these problems, we have developed a text-mining system called IBM TAKMI for Biomedical Documents (designated MedTAKMI hereafter), which is capable of mining the entire MEDLINE database in an interactive manner. The predecessor of this system, TAKMI (Text Analysis and Knowledge Mining), is a text-mining system for customer relationship management (CRM), which has been successfully used in call centers to mine customer support call logs. (13) The MedTAKMI system extends TAKMI to provide a useful set of tools for knowledge discovery from biomedical documents. Because MedTAKMI is designed to handle large document sets, it scales to the full set of MEDLINE citations.
The development of methods for extracting information on such biomedical concepts as genes, proteins, and diseases from text is an active area of research (14-19) and typically involves the following two primary subtasks:
1. Entity extraction--the recognition of gene, protein, and chemical names from biomedical text
2. Relation extraction--the extraction of relationships among these entities
Thus, architecturally, MedTAKMI consists of two main components, one for information extraction and one for entity/relationship mining. The MedTAKMI system performs entity extraction based on dictionary lookup. This approach is conceptually simple and can recognize entities very quickly. We have developed a large domain dictionary that contains two million biomedical entities. These entities and their associated category names are used as keywords in the MedTAKMI system so that users can search for documents that contain a keyword within a specific category, for example, a query on the keyword "p53" within the gene category.
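The dictionary-lookup idea and the category-restricted keyword query can be illustrated with a minimal sketch. It is not the MedTAKMI implementation: the tiny dictionary, the canonical forms, the greedy longest-match strategy, and the function names (extract_entities, build_index, search) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical miniature dictionary: lowercase surface form -> (canonical form, category).
# The real dictionary is described as holding about two million entities.
TERM_DICTIONARY = {
    "p53": ("TP53", "gene"),
    "tp53": ("TP53", "gene"),
    "tumor protein p53": ("TP53", "gene"),
    "breast cancer": ("breast neoplasms", "disease"),
}
MAX_TERM_TOKENS = 3  # longest dictionary entry, counted in tokens


def extract_entities(text):
    """Greedy longest-match lookup of dictionary terms in whitespace tokens."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so multiword terms win over their parts.
        for n in range(min(MAX_TERM_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in TERM_DICTIONARY:
                entities.append(TERM_DICTIONARY[candidate])
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return entities


def build_index(documents):
    """Map (category, canonical form) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for canonical, category in extract_entities(text):
            index[(category, canonical)].add(doc_id)
    return index


def search(index, keyword, category):
    """Return ids of documents containing the keyword within the given category."""
    entry = TERM_DICTIONARY.get(keyword.lower())
    if entry is None or entry[1] != category:
        return set()
    return set(index.get((category, entry[0]), set()))


documents = {
    1: "Mutations of tumor protein p53 are common in breast cancer.",
    2: "p53 regulates apoptosis.",
    3: "Insulin lowers blood glucose.",
}
index = build_index(documents)
gene_hits = search(index, "p53", "gene")
```

Because each lookup is a hash-table probe over a bounded window of tokens, the cost per document grows linearly with its length, which is consistent with the speed claimed for the dictionary approach.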
In a preprocessing stage, input documents are parsed by a shallow syntactic parser that extracts keywords (entities) with category labels, as well as any binary and ternary relationships that may exist among these entities. The MedTAKMI runtime engine then uses this information to provide mining functions to users. Categories are constructed from public ontological knowledge, for example, using the MeSH terms in MEDLINE or the resources provided by Gene Ontology. (3) User-defined resources may also be employed.
There has been extensive research in relation extraction, (20-35) wherein the goal is to extract relationships among biomedical entities (e.g., proteins and genes) from patterns such as "A inhibits B" and "A activates B," where A and B represent specific entities. Such relationships may be extracted by using one or more of the following kinds of information and methods:
* Surface string patterns (20)
* Syntactic information from shallow parsing (21,22) and full parsing (23-27)
* Templates and rules (28-30)
* Statistical information with machine learning (32-35)
In particular, the MedTAKMI system uses syntactic information from a shallow parser to extract binary (a noun and a verb) and ternary (two nouns and a verb) relationships. The relationships extracted from the document collection are aggregated and can be displayed in category viewers, as described later.
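As a rough illustration of binary and ternary relationship extraction, the following sketch operates on shallow-parser output represented as a list of (label, text) chunks. The chunk representation, the nearest-noun-phrase heuristic, and the function names are assumptions for illustration only; the sketch ignores passive voice, negation, and coordination, which a real system would need to handle.

```python
from collections import Counter


def extract_relations(chunks):
    """Extract (noun, verb) and (noun, verb, noun) tuples from one chunked sentence.

    chunks is a list of (label, text) pairs, with "NP" for noun phrases and
    "VG" for verb groups; other labels are skipped.
    """
    binary, ternary = [], []
    for i, (label, text) in enumerate(chunks):
        if label != "VG":
            continue
        # Nearest noun phrase before the verb group is taken as the subject.
        subject = next((t for lab, t in reversed(chunks[:i]) if lab == "NP"), None)
        # Nearest noun phrase after the verb group is taken as the object.
        obj = next((t for lab, t in chunks[i + 1:] if lab == "NP"), None)
        if subject is not None:
            binary.append((subject, text))
        if subject is not None and obj is not None:
            ternary.append((subject, text, obj))
    return binary, ternary


def aggregate(parsed_sentences):
    """Count how often each relationship occurs across a document collection."""
    binary_counts, ternary_counts = Counter(), Counter()
    for chunks in parsed_sentences:
        binary, ternary = extract_relations(chunks)
        binary_counts.update(binary)
        ternary_counts.update(ternary)
    return binary_counts, ternary_counts


# Hand-written chunk sequences standing in for parser output.
parsed_sentences = [
    [("NP", "MDM2"), ("VG", "inhibits"), ("NP", "p53")],
    [("NP", "MDM2"), ("VG", "inhibits"), ("NP", "p53"), ("PP", "in tumor cells")],
    [("NP", "p53"), ("VG", "activates"), ("NP", "p21")],
    [("NP", "apoptosis"), ("VG", "occurs")],
]
binary_counts, ternary_counts = aggregate(parsed_sentences)
```

The aggregated counts are the kind of summary a viewer could rank or filter, for example to list the most frequent interaction partners of a given protein.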
As previously noted, MedTAKMI is an extension of TAKMI, a text-mining system for CRM. (13) The main differences between these two systems are the following:
* The use of hierarchical categories: TAKMI only supports flat categories such as product names. For MedTAKMI we developed a hierarchical category viewer because most biomedical entities (e.g., genes and diseases) are defined hierarchically.
* The extraction of ternary relationships to capture protein-protein interaction by using deeper language analysis
* The introduction of support for domain-specific mining functions
* The development of a new system architecture and componentization structure
This paper is organized as follows: The next section describes the key features of MedTAKMI, including the system architecture and the information extraction process. We then introduce the searching and mining functionalities of MedTAKMI. The following section provides an example of the application of the MedTAKMI system to a specific user scenario. Finally, we summarize our work and describe directions for future research.
Features of the MedTAKMI mining system
The MedTAKMI architecture consists of two main components: a preprocessing information extraction stage and a runtime search/mining server, as shown in Figure 1. In this section we briefly discuss these main components and then describe methods for information extraction.
[FIGURE 1 OMITTED]
Information extraction process. Information extraction occurs as a preprocessing step and involves several subcomponents (see Figure 1, upper left). In Step 1, the term annotator finds words in the input text (i.e., the sentence with the label "Input" shown in Figure 2) using the term dictionary and identifies these words by their canonical form. (As described in more detail later, the term dictionary contains a pair of forms for each term: a surface form and a canonical form.) The canonical forms obtained are embedded in the text document as annotations in XML (eXtensible Markup Language). In Step 2, the annotated text is passed to a syntactic parser. The parser outputs segments of phrases labeled with their syntactic roles, for example, NP (noun phrase) or VG (verb group). The category annotator then assigns categories to the terms in these segments and phrases. The category dictionary consists of a set of canonical forms and their categories; each category also indicates the node label in the hierarchy of categories. The hierarchical categories are in turn imported from existing hierarchies, such as the MeSH terms in MEDLINE, or from user-designed resources. Syntactic relationships among these entities, for example, subject-verb (S-V) or subject-verb-object (S-V-O), are also extracted from the output of the parser. All extracted information is finally encoded into an index file that is used by the runtime part of the system.
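The term-annotation and category-annotation steps can be sketched as follows. This is only an illustration under stated assumptions: the XML element and attribute names (sentence, term, canonical, category), the two small dictionaries, and the category node labels are hypothetical, and the syntactic parsing step that sits between the two annotators is omitted.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical term dictionary: lowercase surface form -> canonical form.
TERM_DICTIONARY = {"p53": "TP53", "tumour": "tumor", "tumor": "tumor"}

# Hypothetical category dictionary: canonical form -> node label in a category hierarchy.
CATEGORY_DICTIONARY = {
    "TP53": "genes/tumor suppressor genes",
    "tumor": "diseases/neoplasms",
}


def annotate_terms(text):
    """Step 1 (sketch): wrap dictionary terms in XML carrying their canonical form."""
    surfaces = sorted(TERM_DICTIONARY, key=len, reverse=True)  # prefer longer terms
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(s) for s in surfaces) + r")\b", re.IGNORECASE
    )
    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))  # plain text between terms
        canonical = TERM_DICTIONARY[match.group(1).lower()]
        parts.append(
            "<term canonical=%s>%s</term>" % (quoteattr(canonical), escape(match.group(1)))
        )
        last = match.end()
    parts.append(escape(text[last:]))
    return "<sentence>" + "".join(parts) + "</sentence>"


def annotate_categories(xml_text):
    """Category annotation (sketch): attach a category label to each known term."""
    root = ET.fromstring(xml_text)
    for term in root.iter("term"):
        category = CATEGORY_DICTIONARY.get(term.get("canonical"))
        if category is not None:
            term.set("category", category)
    return ET.tostring(root, encoding="unicode")


term_xml = annotate_terms("P53 mutations drive tumour growth.")
categorized_xml = annotate_categories(term_xml)
```

Keeping the surface form as element text and the canonical form as an attribute preserves the original wording while letting later stages group spelling variants such as "tumour" and "tumor" under one entry.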