Empowering end users to search collections of ever-increasing size with performance far exceeding plain free-text searching (as used in many Web search engines), and developing systems that not only find but also process information for action, require considerably more powerful - and complex - knowledge organization systems (KOS) than the classification schemes and thesauri that exist today. Such systems must serve the following functions, among others:
Improved user interaction with the KOS on both the conceptual and the term level for improved query formulation and subject browsing, and for more user learning about the domain.
Intelligent behind-the-scenes support for query expansion, both concept expansion and synonym expansion, within one language and across languages.
Intelligent support for human indexers and automated indexing/categorization systems.
Support for artificial intelligence and semantic Web applications.
All of these functions require semantic relations that are more expressive and nuanced than the few rudimentary categories and relationships found in traditional thesauri and classifications.
A typical scenario in information retrieval illustrates some of the shortcomings of current free-text search engines such as Google. A farmer is interested in finding out about rice and starts a search by entering the string 'rice'. The results returned in response to the query immediately indicate several problems. First, because the system performs the search based on the actual text string entered rather than on an interpretation of the meaning of the string, many irrelevant results are retrieved. This occurs because the query term itself is ambiguous (i.e. rice can refer to the grain, to the university in Houston, or to the name of an author, among others). Further, there are millions of results with no apparently meaningful arrangement. To find something of possible relevance, the user may need to click and scan page after page of the retrieved results. Finally, the user is stuck with the results that have been retrieved; to find other related resources, such as rice cultivation, the user must start from the beginning again and formulate a different query, despite the fact that the new query corresponds to concepts related to the original query. The problem becomes evident: The biggest challenge in information retrieval is concept identification in a specific domain of interest!
In contrast, in a semantics-driven information retrieval system, the system would recognize, i.e. "understand", that the string 'rice' was ambiguous; it would then request clarification from the user as to which of the possible meanings was intended. Only then, after the user disambiguated the term, would the system execute the search. The system would then retrieve only those resources that had been semantically marked up (through manual or automatic indexing) with the concept of rice, no matter what words or even languages are used in the resources to refer to rice. Moreover, because the system is semantically rich, it not only presents results that are based on understanding the user's request, it also offers related concepts the user might not have thought of initially. Based on a <hasPest> relation, the system could display such concepts as rice weevil and rice moth. Searching on these latter concepts could in turn lead to concepts on pesticides used on rice, and so on. The system could retrieve not only information directly pertinent to the user's query but also help the user explore and clarify the information need and find useful related information. In this scenario, a KOS has two functions: assisting the user with exploring the topic of the query, and supporting intelligent automatic indexing (metadata assignment) through statistical and syntactic-semantic analysis and "understanding" of text; both functions require a KOS with a rich and precisely defined semantic structure.
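The disambiguation and relation-following behaviour described in this scenario can be sketched with a toy concept network. All concept identifiers and data below (apart from the <hasPest> relation named in the text) are illustrative assumptions, not drawn from any real KOS:

```python
# Toy semantics-driven lookup: disambiguate a term, then suggest related
# concepts by following named relations such as <hasPest>.
# All data here is an illustrative assumption, not a real KOS.

# term -> candidate concept identifiers (homonymy)
LEXICON = {
    "rice": ["rice_(grain)", "Rice_University", "Rice_(author)"],
}

# (concept, relation) -> related concepts
RELATIONS = {
    ("rice_(grain)", "hasPest"): ["rice_weevil", "rice_moth"],
    ("rice_weevil", "controlledBy"): ["pesticide_X"],   # hypothetical
}

def disambiguate(term):
    """Return all concepts a term may refer to; the user picks one."""
    return LEXICON.get(term, [])

def related(concept):
    """Suggest concepts reachable through any named relation."""
    return {rel: targets
            for (c, rel), targets in RELATIONS.items() if c == concept}

candidates = disambiguate("rice")   # system asks: which meaning?
chosen = candidates[0]              # user picks the grain
suggestions = related(chosen)       # {'hasPest': ['rice_weevil', 'rice_moth']}
```

A search based on `chosen` rather than the raw string "rice" retrieves only resources indexed with that concept, regardless of the words or languages used in them.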
To accomplish these and other more sophisticated tasks, the new KOS must marry the conceptual structure of full-fledged ontologies - well-structured hierarchies of concepts connected through a rich network of detailed relations that support concept retrieval and reasoning - with the terminological richness of good thesauri. While existing KOSs do not provide the full set of precise concept relations needed for reasoning, existing KOSs, both large and small, represent much intellectual capital. This paper explores the question of how this intellectual capital can be put to use in constructing full-fledged KOSs.
Reengineering thesauri, classification schemes, etc., into ontologies means building on the information contained in them and refining that information as needed. Consider the relationships given in the ERIC thesaurus (ERIC = Educational Resources Information Center; http://www.ericfacility.net/extra/pub/thesbrowse.cfm) compared with those given in a hypothetical ontology, as shown in Table 1.
Table 1: Statements and rules of a hypothetical ontology versus the information given in the ERIC thesaurus (broader term (BT), related term (RT))

Statements:

ERIC thesaurus:
  reading instruction  BT instruction  RT reading  RT learning standards
  reading ability  BT ability  RT reading  RT perception

Hypothetical ontology:
  reading instruction  <isa> instruction  <hasDomain> reading  <governedBy> learning standards
  reading ability  <isa> ability  <hasDomain> reading  <supportedBy> perception

Rules (hypothetical ontology only):

Rule 1: Instruction in a domain should consider ability in that domain:
  X <shouldConsider> Y IF X <isa (type of)> instruction AND X <hasDomain> W AND Y <isa> ability AND Y <hasDomain> W
  yields: The designer of reading instruction should also consider reading ability.

Rule 2:
  X <shouldConsider> Z IF X <shouldConsider> Y AND Y <supportedBy> Z
  yields: The designer of reading instruction should also consider perception.
The inferences given rely on the detailed semantic relationships given in the ontology. But the ERIC thesaurus gives only some poorly defined broader term (BT) and related term (RT) relationships. These relationships are not differentiated enough to support inference.
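The two rules from Table 1 can be sketched as simple forward inference over subject-relation-object triples. The triples restate the hypothetical ontology statements from the table; the implementation itself is a minimal illustration, not a proposal for an inference engine:

```python
# Facts restated from the hypothetical ontology in Table 1,
# as (subject, relation, object) triples.
FACTS = {
    ("reading instruction", "isa", "instruction"),
    ("reading instruction", "hasDomain", "reading"),
    ("reading ability", "isa", "ability"),
    ("reading ability", "hasDomain", "reading"),
    ("reading ability", "supportedBy", "perception"),
}

def rule1(facts):
    """Rule 1: X shouldConsider Y IF X is instruction, Y is ability,
    and both have the same domain W."""
    instructions = {s for (s, r, o) in facts if r == "isa" and o == "instruction"}
    abilities = {s for (s, r, o) in facts if r == "isa" and o == "ability"}
    domain = {s: o for (s, r, o) in facts if r == "hasDomain"}
    return {(x, "shouldConsider", y)
            for x in instructions for y in abilities
            if domain.get(x) is not None and domain.get(x) == domain.get(y)}

def rule2(facts):
    """Rule 2: X shouldConsider Z IF X shouldConsider Y and Y supportedBy Z."""
    return {(x, "shouldConsider", z)
            for (x, r1, y) in facts if r1 == "shouldConsider"
            for (y2, r2, z) in facts if r2 == "supportedBy" and y2 == y}

facts = FACTS | rule1(FACTS)
facts |= rule2(facts)
```

Running both rules derives exactly the two inferences named in the text: the designer of reading instruction should consider reading ability, and, via Rule 2, perception. The ERIC BT/RT statements alone support neither derivation.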
For another example, consider the hypothetical ontological relationships, and the rules that could be formulated with them, in an example taken from the AGROVOC thesaurus (described in detail in section 2), shown in Table 2.
Table 2: AGROVOC relationships compared with more differentiated relationships of a hypothetical ontology (narrower term (NT), broader term (BT))

Undifferentiated hierarchical relationships in AGROVOC:
  milk NT cow milk
  cow NT cow milk
  Cheddar cheese BT cow milk

Differentiated relationships in the hypothetical ontology:
  milk <includesSpecific> cow milk
  cow <hasComponent> cow milk*
  Cheddar cheese <madeFrom> cow milk

Rules:
  Rule 1: Part X <mayContainSubstance> Substance Y IF Animal W <hasComponent> Part X AND Animal W <ingests> Substance Y
  Rule 2: Food Z <containsSubstance> Substance Y IF Food Z <madeFrom> Part X AND Part X <containsSubstance> Substance Y
* In the context of food and nutrition it makes eminent sense to consider milk and egg as parts of an animal, since their nutritional value and safety depend on the nature of the animal and the feed it ingests, just as do skeletal meat and organ meat. This is an example of careful definition of relationships.
From the statements and rules given in the ontology, a system could infer that Cheddar cheese <containsSubstance> milk fat and, if cows on a given farm are fed mercury-contaminated feed, that Cheddar cheese made from milk from these cows <mayContainSubstance> mercury. But the present AGROVOC thesaurus gives only narrower term/broader term (NT/BT) relationships without differentiation.
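The mercury inference can be sketched by applying the two rules from Table 2 to a small set of triples until no new facts appear. The triples for the contaminated-feed scenario, and the propagation of <mayContainSubstance> through <madeFrom> (which the text implies alongside Rule 2), are illustrative assumptions:

```python
# Triples for the Table 2 scenario; the mercury fact models the
# contaminated-feed case described in the text (assumed data).
FACTS = {
    ("cow", "hasComponent", "cow milk"),
    ("cow", "ingests", "mercury"),
    ("cow milk", "containsSubstance", "milk fat"),
    ("Cheddar cheese", "madeFrom", "cow milk"),
}

def apply_rules(facts):
    """Apply Rule 1 and Rule 2 from Table 2 to a fixpoint.  Rule 2 is
    also applied to <mayContainSubstance>, as the cheese/mercury
    example in the text implies."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        # Rule 1: Part X mayContainSubstance Y if Animal W hasComponent X
        # and W ingests Y.
        for (w, r, x) in derived:
            if r == "hasComponent":
                for (w2, r2, y) in derived:
                    if w2 == w and r2 == "ingests":
                        new.add((x, "mayContainSubstance", y))
        # Rule 2: substances propagate from Part X to Food Z via madeFrom.
        for (z, r, x) in derived:
            if r == "madeFrom":
                for (x2, r2, y) in derived:
                    if x2 == x and r2 in ("containsSubstance",
                                          "mayContainSubstance"):
                        new.add((z, r2, y))
        if not new <= derived:
            derived |= new
            changed = True
    return derived

facts = apply_rules(FACTS)
```

After the fixpoint, both inferences from the text are present: Cheddar cheese <containsSubstance> milk fat and Cheddar cheese <mayContainSubstance> mercury. Undifferentiated NT/BT links give the rules nothing to operate on.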
The limitations of existing KOS can be summarized as follows:
Lack of conceptual abstraction: thesauri and other traditional KOSs are collections of terms (generic or domain-specific), ordered in a polyhierarchic lattice structure or a monohierarchic tree structure and interlinked with some very broad and basic relationships. The distinction between a concept (meaning) and its lexicalizations (words) is not made consistently, if at all, in such a system, and as such it does not reflect the ways humans understand the world in terms of meaning and language.
Limited semantic coverage: most thesauri do not differentiate concepts into types (such as living organism, substance, or process) and have a very limited set of relationships between concepts, distinguishing only between hierarchical relationships, i.e. NT/BT, and associative relationships, i.e. RT. These very rudimentary relationships are not powerful enough to guide a user in meaningful information discovery on the Web or to support inference. They do not reflect the conceptual relationships that people know and that can be used by a system to suggest concepts for expanding the query or making it more specific. Examples:
The relation between cow and its part cow milk is expressed as NT rather than the more semantically expressive relation <hasComponent>, so a user who wants to expand the query hierarchically (search for all concepts narrower than cow as well) cannot distinguish between searching for all cow parts and searching for all varieties of cow;
The relation between mad cow disease and the animal it afflicts, cow, is expressed using RT instead of the more semantically precise relation <afflicts>, so the user cannot easily assemble a list of all cow diseases and search for recent occurrences;
Mad cow disease and one of its symptoms, anorexia, would also be related using RT rather than the more semantically expressive relation <hasSymptom>.
The concept relations provided by most thesauri force all relations into the two broad categories, hierarchical and associative. Too often the semantic relationships captured in this way are ambiguous and poorly defined. The generalization/specialization relations defined in most thesauri are not adequately developed to be of use for semantic description and discovery of Web resources. Thus there is a need for a richer and more powerful set of relationships.
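The first example above, distinguishing cow parts from cow varieties, can be made concrete with typed expansion over named relations. The concept names and the <includesSpecific> targets are illustrative assumptions:

```python
# With differentiated relations, hierarchical query expansion can follow
# one relation type and ignore the others -- something an undifferentiated
# NT link cannot express.  Data is an illustrative assumption.
TRIPLES = {
    ("cow", "hasComponent", "cow milk"),
    ("cow", "hasComponent", "cow liver"),
    ("cow", "includesSpecific", "Holstein"),
    ("cow", "includesSpecific", "Jersey"),
}

def expand(concept, relation):
    """Expand a query along exactly one named relation."""
    return sorted(o for (s, r, o) in TRIPLES if s == concept and r == relation)

parts = expand("cow", "hasComponent")          # all cow parts
varieties = expand("cow", "includesSpecific")  # all varieties of cow
```

With plain NT, all four targets would sit in one undifferentiated list under cow, and the user's intent could not be honored.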
Lack of consistency: since the relationships in thesauri lack precise semantics, they are applied inconsistently, both creating ambiguity in the interpretation of the relationships and resulting in an overall internal semantic structure that is irregular and unpredictable. Many of the NT/BT hierarchical relationships could, for example, be resolved to the non-hierarchical RT relationship, and vice versa.
Limited automated processing: traditionally thesauri were designed for indexing and query formulation by people and not for automated processing. The ambiguous semantics that characterizes many thesauri makes them unsuitable for automated processing.
To overcome these limitations and enable more powerful searching and intelligent information processing, especially as such capabilities can be made more widely available through the Web, traditional KOSs must be reengineered into KOSs that contain domain concepts linked through a rich network of well-defined relationships and a rich set of terms identifying these concepts. A concept can be represented by many different terms (words or phrases) in multiple languages. This paper refers to terms as lexicalizations of a concept. One term can identify several concepts (homonymy) and one concept can have multiple synonymous terms. A concept is conveyed by all its lexicalizations, the domain it occurs in, and by its relationships to other concepts. In addition, valid rules and constraints need to be specified to provide additional generalizations over sets of related concepts and to support inference. These systems must also be converted to machine-processable formats based on Web technologies such as XML, which tag the vocabularies in a standardized way.
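The concept/lexicalization split described here can be sketched as a minimal data model: a concept carries a unique identifier and any number of terms per language, and a reverse index from terms to concept identifiers captures homonymy. The identifier scheme and sample terms are illustrative assumptions:

```python
# Minimal sketch of the concept/lexicalization distinction: a concept
# has a unique identifier and terms in multiple languages; one term may
# point to several concepts (homonymy).  Names are assumptions.
from collections import defaultdict

class Concept:
    def __init__(self, cid):
        self.cid = cid                            # unique identifier
        self.lexicalizations = defaultdict(set)   # language -> terms

    def add_term(self, language, term):
        self.lexicalizations[language].add(term)

rice = Concept("c_0001")
rice.add_term("en", "rice")
rice.add_term("fr", "riz")
rice.add_term("es", "arroz")

# Reverse index: term -> concept ids.  A term mapping to more than one
# id signals a homonym that needs disambiguation.
index = defaultdict(set)
for concept in [rice]:
    for lang, terms in concept.lexicalizations.items():
        for term in terms:
            index[term].add(concept.cid)
```

Searching and interoperability then operate on the stable identifier `c_0001`, while any of the lexicalizations can serve for display or entry in the user's language.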
In contrast to traditional KOSs, ontologies provide conceptual abstraction and differentiated relationships. Ontologies specifically separate concepts from lexicalizations and thereby better reflect the structure of human understanding of a domain. In ontologies, the semantics are developed through ensuring that each concept within the domain is uniquely and precisely defined and by specifying elaborated relationships among the concepts. The relationships in an ontology are explicitly named and developed with specification of rules and constraints so that they reflect the context of the domain for which the knowledge is modeled.
Given their more precise and unambiguous semantics, ontologies allow further knowledge to be inferred from the knowledge explicitly represented in the ontology. The new (implicit) knowledge could be derived by applying generalization or transitivity rules, the level of applicability of which is limited in a poorly defined KOS like a traditional thesaurus. This added knowledge in the ontology makes it powerful when employed for intelligent information processing. Although there is a huge cost involved in moving from thesauri to ontologies, there is an expectation that the added power of consistency, precision, and completeness will be worth the investment even though reliable numbers on the return on investment (ROI) of ontology development are hard to come by.
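One concrete form of such derived knowledge is the transitive closure of a generalization relation: if <isa> is declared transitive, broader concepts never stated explicitly become derivable. The chain below is an illustrative assumption:

```python
# If <isa> is declared transitive, implicit generalizations can be
# derived by closure.  Example chain is an assumption for illustration.
ISA = {("Cheddar cheese", "cheese"),
       ("cheese", "dairy product"),
       ("dairy product", "food")}

def transitive_closure(pairs):
    """Derive all (a, c) such that a chain a -> ... -> c exists."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (b2, c) in list(closure):
                if b == b2 and (a, c) not in closure:
                    closure.add((a, c))
                    changed = True
    return closure

inferred = transitive_closure(ISA)
```

The pair (Cheddar cheese, food) is now derivable without ever being stated; a traditional thesaurus offers no declaration that would license this step mechanically.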
For emerging KOSs to satisfy user needs, they must improve both information organization and retrieval in a way that was not possible with traditional KOSs. The following potential benefits are expected from such systems:
Unique identifiers and formal semantics: the explicit definition of concepts and relations in an ontology allows a unique identifier to be assigned to each concept. As each concept and relation is explicitly defined as a unique entity, the ontology lends itself to semantic formalization.
Internal consistency: another benefit of explicit semantics is the achievement of internal structural consistency in the expression of knowledge due to the possibility of applying integrity constraints.
Interoperability: clear semantics enables interoperability among different KOSs since corresponding concepts within different KOSs would have the same unique identifier, irrespective of the actual lexicalizations used to express those concepts. Semantic interoperability promotes sharing and reuse of knowledge.
Greater information integration: interoperability among different KOSs makes it possible for machines to recognize and analyze intended meaning of terms from disparate vocabularies. This is possible by using structured meta-information and formal knowledge description such as agreed-upon metadata schemas, controlled domain vocabularies, and taxonomies. The ability to integrate terminologies from different sources maximizes the value of investment made in the ontology.
Inferencing capability: new KOSs have the potential for expressing knowledge beyond what is present in the structure of the system. Unlike traditional KOSs where both concepts and relations are underspecified and very few, if any, axiomatic rules exist, the facts (concepts and relations) and rules that can be derived from an ontology have the expressive capabilities that allow for reasoning.
Automated information processing: new KOSs create improved potential to discover relevant information from different sources by exploring patterns and filtering information using conceptual connections represented in the ontology. This enables question-answering from one or more databases or, using natural language processing (see next bullet), from text.
Natural language processing (NLP) support: offers the possibility of providing a direct reply to a search question that is expressed in natural language, using the enhanced relationships and semantics in an ontology, instead of only returning a list of relevant documents.
Search query understanding: using NLP and semantic processing, a system can understand a query posed in natural language, determine the concepts involved and, where useful, create a Boolean query.
Concept-based search: an ontology can provide context-aware search capabilities specific to the area of interest.
Integrated information search/browse support: emerging KOSs make feasible text mining on the Web (Web mining) through meaning-oriented access, as well as dynamic organization of information with the possibility of cross-domain links.
Search query expansion: the enhancement, extension, and disambiguation of user query terms become possible with the addition of enriched domain- and context-specific information.
To be an effective tool to facilitate information categorization, integration, and retrieval, ontologies should be multilingual, domain-specific, and cross-disciplinary at the same time. For maximum application potential they should be developed in a non-proprietary, application-independent, and machine-processable format to ensure interoperability among different systems.
Reengineering a thesaurus into an ontology entails refining thesaurus relationships, a laborious process. The steps in the process are:
1. Define the ontology structure
2. Fill in values from one or more legacy KOS to the extent possible
3. Edit manually using an ontology editor:
1. make existing information more precise
2. add new information
Step 1 is addressed in section 3 which gives an overall conceptual model at a high level of abstraction, and section 4, which begins the process of defining a set of relationship types for the food and agriculture domain by examining relationships in AGROVOC as to their relationship types.
Step 3 is the most laborious. We have plans to streamline this process by implementing intelligent conversion using a "rules as you go" approach. The idea is as follows: The KOS editor watches out for patterns; based on these patterns the editor formulates rules that can be applied immediately to all subsequent similar cases as illustrated in the following:
1. An editor has determined that
cow NT cow milk should become cow <hasComponent> cow milk.
2. She recognizes that this is an example of the general pattern
animal <hasComponent> milk (or, more generally, animal <hasComponent> body part).
3. Given this pattern, the system can automatically derive that
goat NT goat milk should become goat <hasComponent> goat milk,
since goat is an animal and goat milk ends with the word milk and thus can be seen to be a type of milk.
To automate this approach even more, we plan to build an inventory of patterns such as animal <hasComponent> body part, augmented by an ontology that specifies the concepts of type animal (cow, goat, sheep, horse, chicken, etc.) and the concepts of type body part (skeletal meat part, liver, bone, milk, egg, etc.). This information would be drawn from AGROVOC itself and other sources, such as Langual, UMLS, and even WordNet. The system can then detect the applicability of these patterns once it has seen at least one example transformed by an ontology editor. The ontology editors will add to the pattern inventory incrementally.
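The "rules as you go" step can be sketched as a type-based pattern match: once an editor's first refinement is recorded as a pattern over entity types, the system proposes the same refinement for pairs of matching type. The type assignments below stand in for the inventory drawn from AGROVOC and other sources and are illustrative assumptions:

```python
# "Rules as you go": after an editor refines cow NT cow milk to
# cow <hasComponent> cow milk, the recorded pattern applies to all
# similar pairs.  Type assignments and pattern are assumptions.
TYPES = {"cow": "animal", "goat": "animal", "sheep": "animal",
         "cow milk": "body part", "goat milk": "body part"}

# Learned from the editor's first refinement:
PATTERN = {("animal", "body part"): "hasComponent"}

def refine(broader, narrower):
    """Propose an ontology relation for a thesaurus NT pair when a
    learned pattern matches the entity types of both concepts."""
    key = (TYPES.get(broader), TYPES.get(narrower))
    relation = PATTERN.get(key)
    if relation:
        return (broader, relation, narrower)
    return None   # no pattern yet: leave the pair to the human editor

proposal = refine("goat", "goat milk")
# ('goat', 'hasComponent', 'goat milk'), derived without editor input
```

Pairs whose types match no recorded pattern fall back to manual refinement, so the automation never forces a choice on the editor.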
These patterns are a special type of constraint. Other constraints can be formulated and used to limit the options presented to the human editor as thesaurus relationships are refined. The bases for such constraints are the thesaurus relationships, on the one hand, and the entity types of the concepts involved, on the other. Table 3 shows some examples of constraints based on thesaurus relationships.
Table 3: Some relationship constraints

Thesaurus relationship NT/BT, possible ontology relationships:
  <hasMember>/<memberOf>
  <includesSpecific>/<isa>
  <hasComponent>/<componentOf>
  <spatiallyIncludes>/<spatiallyIncludedIn>

Thesaurus relationship RT, possible ontology relationships:
  <similarTo>
  <growsIn>/<environmentForGrowing>
  <treatmentFor>/<treatedWith>
  <hasMember>/<memberOf>
  etc.

Note that the RT relationship often transforms into relationships that are not symmetric. Note further that in a well-constructed thesaurus, an RT should not resolve into an <isa> relationship; in practice, however, RT has been applied to express this relationship. This can be taken as further evidence of the weak definition of relationships in many thesauri.
This inventory will constrain the available choices when manually refining a thesaurus relationship to a more specific ontology relationship. Of course, an authorized ontology editor can override such constraints and thereby update the relationship table. Whenever a relationship is added or refined, the inverse relationship is automatically added or refined as well.
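Both mechanisms, constraining the editor's choices and maintaining inverses, can be sketched together. The candidate lists restate Table 3; the inverse table and the storage of triples in a plain set are illustrative assumptions:

```python
# Constraint table (after Table 3) limiting refinement choices, plus
# automatic maintenance of inverse relations.  Storage format assumed.
CANDIDATES = {
    "NT": ["hasMember", "includesSpecific", "hasComponent",
           "spatiallyIncludes"],
    "RT": ["similarTo", "growsIn", "treatmentFor", "hasMember"],
}
INVERSES = {"hasMember": "memberOf",
            "includesSpecific": "isa",
            "hasComponent": "componentOf",
            "spatiallyIncludes": "spatiallyIncludedIn",
            "growsIn": "environmentForGrowing",
            "treatmentFor": "treatedWith",
            "similarTo": "similarTo"}   # symmetric

def choices(thesaurus_relation):
    """Options the editor is offered for a given thesaurus relation."""
    return CANDIDATES.get(thesaurus_relation, [])

def assert_relation(store, subject, relation, obj):
    """Record a refined relation and its inverse in one step."""
    store.add((subject, relation, obj))
    inverse = INVERSES.get(relation)
    if inverse:
        store.add((obj, inverse, subject))

store = set()
assert_relation(store, "cow", "hasComponent", "cow milk")
# store now also contains ('cow milk', 'componentOf', 'cow')
```

An editor override would simply extend `CANDIDATES` (and, if needed, `INVERSES`) before the next refinement, matching the behaviour described above.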