1. Introduction

The performance of information processing systems may be enhanced when they are supported by ontologies, that is, domain-specific terminologies containing rich and precise semantics. For example, in information retrieval, ontologies may be used for query expansion, for marking up documents at various levels of granularity, and for knowledge discovery. Research on (semi-)automatic ontology construction has been conducted using a variety of terminological resources, such as raw text (Hearst 1992, Maedche and Staab 2001, Kietz 2000, Navigli et al. 2003), dictionaries (Janniak 1999, Kietz 2000, Kang 2001) and thesauri (Soergel et al. 2004, Clark 2000, Wielinga 2001). Each of these sources has different characteristics, which require different approaches to term and relationship extraction. Raw text is unstructured and contains huge amounts of frequently updated information. Dictionaries are semi-structured resources that are infrequently updated; domain dictionaries, in particular, are suitable for extracting terms and their relationships (e.g., hyponyms, meronyms, and synonyms) as well as their definitions. Of the terminological resources considered, thesauri lend themselves most readily to ontology construction because their explicit semantic structure facilitates the natural language processing needed to extract terms and relationships. Our work is to develop and maintain the Agricultural Ontology Service, which will support the construction of an Agricultural Knowledge Portal. We therefore use the AGROVOC thesaurus as the resource for building an ontology in the domain of food and agriculture.

AGROVOC is a multilingual agricultural thesaurus developed and maintained by the Food and Agriculture Organization (FAO) of the United Nations. It is used at FAO for indexing and searching information resources within the agricultural domains. However, within AGROVOC, semantic relationships are poorly defined and inconsistently applied. For example, AGROVOC incorrectly^[1] uses NT (narrower term), approximately equivalent to 'superclass of' or 'hypernym of', in Milk NT Milk Fat, whereas a more specific, and correct, relationship would be 'containsSubstance'. RT (related term) is underspecified in AGROVOC and subsumes numerous relationships; for example, Mutton RT Sheep should be refined to a more specific relationship, such as 'madeFrom' (Soergel et al. 2004), to distinguish it from other uses of RT.

The question of reengineering AGROVOC into an ontology has recently been addressed in a few studies. Fisseha and Liang (2003) present some rough ideas for preparing AGROVOC for conversion into an ontology, such as converting BT/NT (broader/narrower term) to is-a and refining RT to more specific relationships. Soergel et al. (2004) propose the rules-as-you-go approach, in which rules for semantic refinement are identified as experts work on the thesaurus and notice patterns in the occurrence of semantic relationships between terms. Since the patterns and rules are identified through intellectual work, the refinements occur gradually and can deal with only a limited number of patterns. This paper enhances the feasibility of the rules-as-you-go approach by applying machine learning to extract the rules automatically. The learning technique is based on the OntoLearn method (Navigli et al. 2003), an automatic ontology learning system that was used for extracting terms and detecting semantic relationships from a tourism text corpus. OntoLearn uses inductive machine learning to extract semantic relationships between the head word and its modifier in compound nouns.
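To make the idea of a refinement rule concrete, the following is a minimal sketch, not the system described in this paper. It assumes that each term has been assigned a coarse semantic category and that a rule maps a thesaurus relationship plus a pair of categories to a refined relationship. The category labels, the CATEGORY and RULES tables, and the refine function are hypothetical; only the relationship names 'containsSubstance' and 'madeFrom' and the two example links come from the text above.

```python
from typing import NamedTuple, Optional


class Link(NamedTuple):
    """One thesaurus statement, e.g. Link("Mutton", "RT", "Sheep")."""
    source: str
    relation: str  # thesaurus relationship such as "BT", "NT" or "RT"
    target: str


# Hypothetical semantic categories for a few terms (illustrative only).
CATEGORY = {
    "Mutton": "animal product",
    "Sheep": "animal",
    "Milk": "animal product",
    "Milk Fat": "substance",
}

# Hypothetical refinement rules:
# (relationship, source category, target category) -> refined relationship.
RULES = {
    ("RT", "animal product", "animal"): "madeFrom",
    ("NT", "animal product", "substance"): "containsSubstance",
}


def refine(link: Link) -> Optional[str]:
    """Return the refined relationship for a link, or None if no rule applies."""
    key = (link.relation, CATEGORY.get(link.source), CATEGORY.get(link.target))
    return RULES.get(key)


if __name__ == "__main__":
    for link in (Link("Mutton", "RT", "Sheep"), Link("Milk", "NT", "Milk Fat")):
        print(link.source, link.relation, link.target, "->", refine(link))
```

In such a scheme, a link for which refine returns None would simply be left for an expert to examine, which is in the spirit of the gradual, pattern-by-pattern refinement described above.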

This paper presents a hybrid approach for (semi-)automatically detecting problematic relationships in AGROVOC, especially BT/NT and USE/UF relationships, and for suggesting more appropriate ones. RT relationships, which are usually underspecified, are refined by applying rules acquired from experts and from machine learning. The system consists of three main modules: Rule Acquisition, Detection and Suggestion, and Verification. The Rule Acquisition module trains the learner on rules specified by experts. The Detection and Suggestion module uses noun phrase analysis and WordNet alignment to detect incorrect relationships and to suggest more appropriate ones by applying the acquired rules. The Verification module is a tool for confirming the proposed relationships.
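As an illustration of how noun phrase analysis might flag a doubtful NT link, the sketch below applies a simple head-word heuristic. This is an assumption made for illustration and is not necessarily the analysis used by the Detection and Suggestion module: it treats the last word of an English compound as its head, considers "A NT B" plausible as is-a when B ends with A, and marks it as suspect when A appears only as the leading modifier of B. The function name classify_nt and its three labels are hypothetical, and WordNet alignment is not covered.

```python
def classify_nt(broader: str, narrower: str) -> str:
    """Classify the link 'broader NT narrower' with a head-word heuristic.

    Returns "plausible-is-a" if the narrower term ends with the broader term
    (the broader term is the head of the compound), "suspect" if the narrower
    term merely starts with it (the broader term is a modifier), and
    "unknown" otherwise.
    """
    b_tokens = broader.lower().split()
    n_tokens = narrower.lower().split()

    # The heuristic only applies when the narrower term is a longer compound.
    if not b_tokens or len(n_tokens) <= len(b_tokens):
        return "unknown"
    if n_tokens[-len(b_tokens):] == b_tokens:
        return "plausible-is-a"
    if n_tokens[:len(b_tokens)] == b_tokens:
        return "suspect"
    return "unknown"


if __name__ == "__main__":
    # "Fat NT Milk Fat" is a made-up contrast case, not taken from AGROVOC.
    for broader, narrower in (("Milk", "Milk Fat"), ("Fat", "Milk Fat")):
        print(broader, "NT", narrower, "->", classify_nt(broader, narrower))
```

Under this heuristic, Milk NT Milk Fat is marked as suspect because Milk modifies the head Fat, so the link would be passed on for a suggested replacement and then for verification.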

Section 2 describes the problems in AGROVOC. Section 3 gives an overview of the system for data cleaning and relationship refinement. Sections 4 and 5 describe the preparation of the rules and an algorithm for cleaning and refinement, respectively. Finally, Section 6 summarizes the experimental results and future work, and Section 7 gives brief conclusions.

^[1] Within a hierarchy based on partitivity, the use of NT would not necessarily be incorrect, e.g., Milk NT Milk Fat NT Milk Fat Globule, etc. However, the refined AGROVOC is expected to use BT/NT only for hierarchical, superclass/subclass relationships, and its conversion into an ontology requires that each relationship correspond to a unique sense.