Using Semantic Content for Formulating and Assessing Hypotheses

Jennifer Golbeck
University of Maryland, College Park
A.V. Williams Building
College Park, Maryland
golbeck@cs.umd.edu

Introduction and Previous Research

"Undiscovered public knowledge" is an idea that has been around since the mid-1980s. It describes a common problem experienced by the full range of professional academics: there is simply too much information available to keep up to date with all of it. A neurologist may have time to stay current on new neurology papers, and maybe read some related to her topic of interest, but it is highly unlikely that she will know the work going on in genetics, nutrition, pharmacology, rheumatology, and so on. Because of this, it is not surprising that published work in two subfields of a discipline can address parts of the same problem without ever being connected by researchers.

With a mechanism for finding these undiscovered links, a researcher is in a unique position to formulate well formed and viable hypotheses simply by making the connections. It is exactly this type of tool that is used as the base for this project, and which I try to create here in a semantic context.

The foundation for this work is the Arrowsmith system. The strategies in Arrowsmith have been applied primarily to the medical literature, and there is an Arrowsmith interface to PubMed that allows users to experiment with the software [1].

In Arrowsmith, the user begins with a question concerning the connection between two entities (for example, a dietary substance "A", and a disease "C"). A researcher with a hypothesis about the connection between A and C may propose, for example, "A may help prevent C," or "A deficiency of A causes C." If there is previous work that directly discusses concepts A and C together, a conventional search will provide satisfactory results. However, if A and C are discussed in different literatures, different strategies are required to find connections.

If A influences some factor, B, not mentioned in the direct A-C literature, where B in turn influences C, the A-C implication cannot be discovered by a conventional search. Without knowing B, a conventional database search cannot determine whether such a B exists, even though the literature on A and the literature on C separately mention it. Arrowsmith provides a solution if these connections are reflected in title words, or in the conclusions section of the abstracts.

Figure 1. A Venn diagram that represents the sets of articles, or "literatures," A and C, that have no articles in common, but which are linked through intermediate literatures Bi. Such a structure may contain unnoticed useful information that can be inferred by combining pairs of intersections ABi and BiC. (replicated from [11])

Users of the system explicitly state their A and C concepts. Arrowsmith builds separate collections of papers for each topic, and then does a search to find any B concepts that bridge the A and C literatures. Those results are filtered for vague or semantically useless words, and the remaining terms are presented to the user with frequency values. It is then left to the domain expert to find terms that may suggest possible connections to be investigated further.
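The keyword-bridging procedure described above can be illustrated with a minimal Python sketch. It is not the Arrowsmith implementation: the function names, the tiny stop list, and the restriction to title words are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical stop list; a real filter for vague words would be far larger.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def title_words(title):
    """Lowercase a title and return its set of words, minus stop words."""
    words = {w.strip(".,;:()").lower() for w in title.split()}
    return {w for w in words if w and w not in STOPWORDS}


def bridge_terms(titles, a_term, c_term):
    """Return {B term: (count in literature A, count in literature C)}."""
    # Build a separate collection of titles for each topic.
    a_lit = [t for t in titles if a_term in title_words(t)]
    c_lit = [t for t in titles if c_term in title_words(t)]

    # Count how many titles in each literature mention each word.
    a_counts = Counter(w for t in a_lit for w in title_words(t))
    c_counts = Counter(w for t in c_lit for w in title_words(t))

    # Candidate B terms appear in both literatures; A and C themselves
    # are excluded (an assumption of this sketch).
    shared = (set(a_counts) & set(c_counts)) - {a_term, c_term}
    return {w: (a_counts[w], c_counts[w]) for w in shared}
```

Called with a list of titles and two lowercase terms, `bridge_terms` returns the shared words with their frequencies in each literature. Judging which of them suggest a real connection is still left to the domain expert.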

This heuristic has been very successful in producing interesting and previously unknown results in the medical literature. To date, the Arrowsmith group has published articles describing novel and testable hypotheses on seven different pairs of complementary literatures [5,6,7,10,13,14,15]. Several of these hypotheses have subsequently been supported by clinical tests. One of their more famous results is the linkage between magnesium and migraines [14]. Though at the time there were almost no papers that discussed magnesium and migraines together, Swanson identified eleven intermediate concepts that linked the two literatures. For example, magnesium can be used to prevent spreading depression in the cortex, and separately, spreading depression can be implicated in migraine attacks. Another link shows that magnesium deficiency in rats has been used as a model of epilepsy, and epilepsy has been medically associated with migraine. Since the original paper was published in 1988, twelve different medical research groups have demonstrated positive results for this link [11].

It is rather amazing that such robust hypotheses can be generated from searching merely for exact keyword matches in titles and partial abstracts. As of October 2002, however, this was the extent of the Arrowsmith search. There was no natural language processing or semantics being used in these literature bridging searches [9]. This research attempts to address that problem and demonstrate the richness of the results that can be obtained by integrating the concept of undiscovered public knowledge with semantic markup of articles.

Semantic Markup in Academic Literature

Semantic markup allows searches to match on semantic terms rather than keywords, to perform matchmaking between equivalent terms, and to cover a greater part of the document. Markup can also categorize content (results, background, and so on), so users can search for a term at different levels of importance.

A major part of effectively using the Arrowsmith system is the presence of a domain expert who can filter results to determine which intermediate terms would make effective linking hypotheses. Part of this process involves reviewing the intermediate papers to find relationships such as "a deficiency of A causes B" and "B co-occurs in patients with C". In the existing Arrowsmith system, there is no mechanism for automatically detecting the actual relationships between A, B, and C terms.

It is possible that with advanced NLP, some of these deeper relationships may be discoverable. With reasonably thorough markup in a language such as RDF, DAML+OIL, or OWL, these relationships will not only be discoverable, but can be explicitly represented. It then becomes easy for a user to search literature A for papers that match "A causes B" and then search literature C for papers that match "B co-occurs_with C." In this example, causes and co-occurs_with are semantically meaningful relationships between scientific concepts, and B is a variable semantic term.
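As a rough illustration of such a query, the sketch below stores each marked-up statement as a (subject, property, object) triple tagged with the paper it came from, and treats B as a variable. The data layout and the names `match` and `bridging_terms` are hypothetical. A real system would query RDF, DAML+OIL, or OWL markup rather than Python tuples.

```python
def match(statements, subject=None, prop=None, obj=None):
    """Return the (paper, triple) pairs that fit a pattern; None is a wildcard."""
    hits = []
    for paper, (s, p, o) in statements:
        if subject not in (None, s):
            continue
        if prop not in (None, p):
            continue
        if obj not in (None, o):
            continue
        hits.append((paper, (s, p, o)))
    return hits


def bridging_terms(statements, a, a_prop, c_prop, c):
    """Find every B such that 'a a_prop B' and 'B c_prop c' are both asserted."""
    # Objects of statements of the form: a --a_prop--> ?
    from_a = {o for _, (_, _, o) in match(statements, subject=a, prop=a_prop)}
    # Subjects of statements of the form: ? --c_prop--> c
    to_c = {s for _, (s, _, _) in match(statements, prop=c_prop, obj=c)}
    return from_a & to_c
```

For example, `bridging_terms(statements, "A", "causes", "co-occurs_with", "C")` returns only those terms B for which one paper asserts "A causes B" and another asserts "B co-occurs_with C".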

Semantics also offer another dramatic advantage over any technique, no matter how successful, that relies on article text. Using the text of articles, quite obviously, limits users to what is represented in the text. In fact, many of the most interesting and useful points in academic papers are represented in tables, charts, figures, and equations. Unless these are each described in excruciating detail in the text, there is no way to access that information with something like a natural language tool. Semantic markup, however, can represent complex data relationships, the contents of pictures and figures, and the mathematics of an equation, as easily as the textual contents can be represented. This provides users with the ability to search for information that is deeper but very meaningful.

Implementation

The Ontology

Several ontologies currently exist for describing publications (bibont from Yale, bibliography.o from ISI, Atlas from CMU, and CS1 from UMCP, to name a few). These generally include properties such as title, author, publisher, and date; however, none of them provides a way to describe results. For this research, I created an ontology with properties for describing results (http://www.cs.umd.edu/~golbeck/daml/resultont.rdf). The classes and properties contained are as follows:

The Algorithms

The basic algorithm here is very similar to the one used in the current Arrowsmith system. The Simple Semantic Match is basically identical to the algorithm described in [11], since it matches only individual terms. The more advanced match, which searches for direct semantic relationships, is a slight extension that takes advantage of the full semantic markup.

Algorithm 1: Simple Semantic Match

1. a. Collect all publications that contain term A. This will be literature A.
b. Collect all publications that contain term C. This will be literature C.
2. a. Build a list of all semantic concepts contained in the A literature. This will be the A list.
b. Build a list of all semantic concepts contained in the C literature. This will be the C list.
3. Find all concepts that are in the A list and in the C list. There is no advanced matchmaking here; matches are made only between identical URIs. The resulting list of matched concepts contains the "bridging" terms, and they make up the B list.
4. Return a list of the papers that connect A to B, and B to C.
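The four steps of the Simple Semantic Match can be sketched in Python as follows. This is only an illustration under stated assumptions: each paper is reduced to the set of concept URIs found in its markup, the function name and return shape are hypothetical, and A and C are removed from the B list (a choice the steps above do not spell out).

```python
def simple_semantic_match(papers, a, c):
    """papers maps a paper id to the set of concept URIs in its markup.

    Returns {B: {"A-B": [paper ids], "B-C": [paper ids]}}.
    """
    # Step 1: literature A and literature C.
    lit_a = {p for p, concepts in papers.items() if a in concepts}
    lit_c = {p for p, concepts in papers.items() if c in concepts}

    # Step 2: every concept mentioned anywhere in each literature.
    a_list = set().union(*(papers[p] for p in lit_a))
    c_list = set().union(*(papers[p] for p in lit_c))

    # Step 3: bridging terms are identical URIs present in both lists.
    # Dropping A and C themselves is an assumption of this sketch.
    b_list = (a_list & c_list) - {a, c}

    # Step 4: for each B, the papers that connect A to B and B to C.
    return {
        b: {
            "A-B": sorted(p for p in lit_a if b in papers[p]),
            "B-C": sorted(p for p in lit_c if b in papers[p]),
        }
        for b in b_list
    }
```

Because matching is plain set intersection on URI strings, two URIs that denote the same concept but are spelled differently will not be bridged, which matches the "no advanced matchmaking" restriction in step 3.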

Algorithm 2: Direct Semantic Relationships

This feature allows users to search for papers where A is directly related to B, and B is directly related to C. With semantics we can do this by specifying exact properties (e.g., A causes B, B causes C), or by leaving the relationship unspecified (e.g., A shares a property relationship with B, and B shares a property relationship with C). This is more specific because it allows the user to directly tie the two concepts together, where they otherwise may appear in the same paper but have a weak relationship. The algorithm is almost identical to the one described for a simple semantic match.

1. a. Collect all publications that contain term A. This will be literature A.
b. Collect all publications that contain term C. This will be literature C.
2. a. For each paper in literature A, collect any terms that have outgoing edges to term A or incoming edges from term A. Call this the A list.
b. For each paper in literature C, collect any terms that have outgoing edges to term C or incoming edges from term C. Call this the C list.
3. Find all concepts that are in the A list and in the C list. Matching is done the same way as described in the simple semantic match above. A B list is the result of this search.
4. Return a list of the papers that connect term A to term B, and term B to term C.
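A minimal Python sketch of the direct-relationship search follows. It assumes each paper's markup is available as a list of (subject, property, object) triples. The function names and the optional `a_prop` and `c_prop` filters are hypothetical, and an edge in either direction counts as a link, as in step 2.

```python
def neighbours(triples, term, prop=None):
    """Terms joined to `term` by an edge in either direction.

    If prop is given, only edges labelled with that property count.
    """
    found = set()
    for s, p, o in triples:
        if prop is not None and p != prop:
            continue
        if s == term and o != term:
            found.add(o)
        elif o == term and s != term:
            found.add(s)
    return found


def direct_relationship_match(papers, a, c, a_prop=None, c_prop=None):
    """papers maps a paper id to its list of (subject, property, object) triples."""
    # Step 1: literatures A and C are the papers whose markup mentions the term.
    def mentions(triples, term):
        return any(term in (s, o) for s, _, o in triples)

    lit_a = [p for p, t in papers.items() if mentions(t, a)]
    lit_c = [p for p, t in papers.items() if mentions(t, c)]

    # Step 2: per paper, the terms directly linked to A (or to C).
    a_links = {p: neighbours(papers[p], a, a_prop) for p in lit_a}
    c_links = {p: neighbours(papers[p], c, c_prop) for p in lit_c}
    a_list = set().union(*a_links.values())
    c_list = set().union(*c_links.values())

    # Step 3: B terms are identical URIs found in both lists.
    b_list = (a_list & c_list) - {a, c}

    # Step 4: the papers that tie A to B and B to C.
    return {
        b: {
            "A-B": sorted(p for p, n in a_links.items() if b in n),
            "B-C": sorted(p for p, n in c_links.items() if b in n),
        }
        for b in b_list
    }
```

Leaving `a_prop` and `c_prop` as `None` corresponds to the unspecified-relationship search. Passing a property URI restricts the bridge to that exact relationship.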

Results

The original Arrowsmith studies were able to produce dramatic results, in part, because their data was prepared for them. Databases of medical literature were widely available with easy-to-use search interfaces. For the same work in a semantic context, very little is available. Almost no literature has semantic markup, and what markup exists is restricted to basic metadata (title, author, date, and maybe keywords), not markup of the actual content.

As such, deriving results of the same scale as the Arrowsmith studies will involve a non-trivial investment in time and energy to build a large and diverse database of semantically marked up documents.

That time and manpower were not available for the first round of this project. Though the justification for a semantic implementation is strong, I wanted to show some results, trivial as they may be. To achieve this, a subset of Nature Science Updates from 1999-2001 was given limited semantic markup. This included the standard title, author, and category information, plus links to semantic representations of the main topics in the paper, and markup describing the main one or two results. Ideally, a more significant markup of each article would be available, but, again, time was the single limiting factor.

Among the articles, searches were able to produce a couple of interesting bridges. In the interface, the user was given a choice of instances to bridge. Selecting "Geomagnetism" and "Group Dynamics" produces a bridge of "Ants". The results then show which papers discuss Ants and Group Dynamics together, and which papers discuss Geomagnetism and Ants together. Another search that tries to bridge "Ants" and "Honeybees" produces two bridges: "Navigation" and "Group Dynamics".

Conclusions

Arrowsmith has been able to produce surprising results using very simple methods of data access. It is clear that their method of searching only on title keywords will inevitably miss some connections and have problems detecting direct relationships between terms. An extension of the project to the semantic web is a natural one, allowing the algorithmic method to reach new depths in the literature.

In this paper, I described an implementation of the Arrowsmith algorithm that used simple semantics instead of title keywords, and then extended it to rely on semantic markup to detect direct relationships between A, B, and C terms. The only factor that limited this application from finding new and dramatic results in the literature was the lack of marked up papers to search on.

To allow advanced search techniques like this to work, a body of literature must be created that has at least simple markup of the important results. If publishers, such as Nature, start requiring this markup from their authors, they will quickly build up repositories that will be very useful for knowledge researchers, and which will allow the published content to be more accessible to academics who need it.

References

[1] ARROWSMITH PubMed Interface: http://arrowsmith.psych.uic.edu/cgi-test/arrowsmith_uic/pubsmith.cgi

[2] Smalheiser, N.R. (2002) Informatics and hypothesis-driven research. EMBO Reports 3: 702.

[3] Smalheiser, N.R. (2001) Predicting emerging technologies with the aid of text-based data mining: a micro approach. Technovation 21: 689-693.

[4] Smalheiser, N.R. and Swanson, D.R. (1998) Calcium-independent phospholipase A2 and schizophrenia. Arch. Gen. Psychiat. 55: 752-753.

[5] Smalheiser, N.R. and Swanson, D.R. (1994) Assessing a gap in the biomedical literature: magnesium deficiency and neurologic disease. Neurosci. Res. Commun. 15: 1-9.

[6] Smalheiser, N.R. and Swanson, D.R. (1996) Indomethacin and Alzheimer's disease. Neurology 46: 583.

[7] Smalheiser, N.R. and Swanson, D.R. (1996) Linking estrogen to Alzheimer's disease: an informatics approach. Neurology 47: 809-810.

[8] Smalheiser, N.R. and Swanson, D.R. (1998) Using ARROWSMITH: a computer-assisted approach to formulating and assessing scientific hypotheses. Computer Methods and Programs in Biomedicine 57: 149-153.

[9] Smalheiser, N.R., Torvik, V.I., Weeber, M. and Swanson, D.R. (2002) The Arrowsmith Project: New Tools to assist Biomedical Discovery and Collaboration, presented at the Beckman Institute, University of Illinois at Urbana-Champaign, October 1, 2002.

[10] Swanson, D.R. (1986) Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect. Biol. Med. 30: 7-18.

[11] Swanson, D.R. and Smalheiser, N.R. (1997) An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence 91: 183-203.

[12] Swanson, D.R. and Smalheiser, N.R. (1999) Implicit text linkages between Medline records: using Arrowsmith as an aid to scientific discovery. Library Trends 48: 48-59.

[13] Swanson, D.R., Smalheiser, N.R. and Bookstein, A. (2001) Information discovery from complementary literatures: categorizing viruses as potential weapons. J. Am. Soc. Information Sci. Technol. 52: 797-812.

[14] Swanson, D.R. (1988) Migraine and magnesium: eleven neglected connections. Perspect. Biol. Med. 31: 526-557.

[15] Swanson, D.R. (1990) Somatomedin C and arginine: implicit connections between mutually-isolated literatures. Perspect. Biol. Med. 33: 157-186.

[16] Valdes-Perez, R.E. (1999) Principles of human-computer collaboration for knowledge discovery in science. Artificial Intelligence 107: 335-346.

Ontologies

http://www.mindswap.org/2002/ont/naturePapers.rdf

http://www.mindswap.org/2002/ont/paperResults.rdf

