1 Introduction

The management of large amounts of information and knowledge is of ever-increasing importance in today’s large organizations. With the ease of making information available online, especially in corporate intranets and knowledge bases, organizing information for later retrieval becomes an increasingly difficult task. Subject indexing is the act of describing a document in terms of its subject content, with the purpose of making references on a particular subject easy to retrieve. It consists of extracting the main concepts of a document, representing those concepts by keywords in the chosen language, and associating these keywords with the document. To keep this process unambiguous and standardized, keywords should be chosen from a controlled vocabulary.

The AGROVOC^[3] thesaurus, developed and maintained by the Food and Agriculture Organization^[4] (FAO) of the United Nations (UN), is a controlled vocabulary for the agricultural domain. The FAO manages a vast number of documents and other information resources related to agriculture. Professional librarians and indexers use the AGROVOC thesaurus as a controlled vocabulary to manually index all documents and resources managed by FAO’s information management system. They are allowed to assign as many labels as necessary to index a document. In the following, we refer to the automatic assignment of suitable keywords to documents as the multi-label and multi-class^[5] classification problem. Because this process is applied to resources in all official FAO languages, it is also a multilingual problem. The labour cost of professional indexers and the growth in available electronic resources have resulted in a backlog of resources that are not indexed. Automatic document indexing could be particularly useful in digital libraries such as the ones maintained at the FAO, since it would make more resources available through the system.

This paper presents an approach that uses binary support vector machines (SVM) for automatic subject indexing of full-text documents with multiple labels. An extensive test document set has been compiled from FAO’s large quantity of resources, on which multi-label and multilingual indexing have been evaluated. Motivated by our text clustering results with background knowledge (cf. [7]), we have further analyzed whether integrating domain-specific background knowledge, in the form of the multilingual AGROVOC thesaurus, improves performance. Based on the evaluation results, we argue that integrating background knowledge with SVMs is a promising approach towards (semi-)automatic, multilingual, multi-label subject indexing of documents.
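To illustrate how binary classifiers can serve a multi-label task, the following minimal sketch assumes the common decomposition into one binary problem per keyword; this section does not confirm that it is the exact setup used later in the paper. The function name `binary_targets` and the toy keywords are hypothetical and are not taken from AGROVOC or the FAO test set. The sketch only builds the +1/-1 training targets; a binary SVM would then be trained on each target vector.

```python
from typing import Dict, List, Set


def binary_targets(doc_labels: List[Set[str]]) -> Dict[str, List[int]]:
    """Turn per-document label sets into one +1/-1 target vector per label.

    Assumption: each keyword gets its own binary problem, in which a
    document is a positive example (+1) if it carries the keyword and a
    negative example (-1) otherwise.
    """
    vocabulary = sorted(set().union(*doc_labels))
    return {
        label: [1 if label in labels else -1 for labels in doc_labels]
        for label in vocabulary
    }


# Hypothetical toy corpus: three documents with illustrative keywords.
docs = [{"rice", "irrigation"}, {"rice"}, {"forestry", "irrigation"}]
targets = binary_targets(docs)

for label, vector in targets.items():
    print(label, vector)
```

At prediction time, under the same assumption, a document would receive every keyword whose binary classifier returns a positive decision, so the number of assigned labels is not fixed in advance.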

The paper is organized as follows: The next section introduces the reader to automatic text categorization, in particular support vector machines and the multi-label classification problem. Section 3 gives a brief introduction to ontologies and their representation. In Section 4, we explain in detail how the test document set was compiled and describe the evaluation settings, followed by a discussion of the results. We conclude by suggesting promising future directions for subject indexing of multilingual documents.

^[3] [http://www.fao.org/agrovoc].
^[4] [http://www.fao.org].
^[5] In the following we only use the term multi-label.