

5 Conclusion and outlook


Our results show that SVMs behave robustly across different languages. No significant performance differences between the languages were found in the multi-label case[8], which indicates that SVMs can be applied to classify documents in different languages. SVMs seem to be especially applicable to the complex case of assigning multiple labels to a document. The inferior results of multi-label indexing compared to the single-label case can be explained by the increased complexity of the task. Even among human classifiers, multi-label subject indexing is an inconsistent task: opinions vary from person to person, and there is no single correct assignment of labels to a document with regard to the type and number of labels chosen. Taking this phenomenon (also known as indexer-indexer inconsistency [4]) into consideration, the results can even be interpreted as equally good. This is a rather optimistic hypothesis, and since the two cases are not directly comparable, further research and evaluation are needed to confirm it.

These results, combined with the fact that the integration of background knowledge did not cause any significant performance loss (except in the case of total replacement of a document’s word-vector), lead us to an interesting direction for further research and evaluation. At the FAO (and most probably in many other environments), English resources heavily outnumber the resources available in other languages. As our results show, the quality of SVMs strongly correlates with the number of training examples used. A desirable scenario is therefore to train the classifier with documents in one language only (i.e. English) and then use it to classify documents in other languages. This can be achieved by replacing a document’s word-vector with only the concepts found in the multilingual domain-specific background knowledge. AGROVOC is available online in 5 different languages and has been translated into many others. A document’s word-vector thus becomes language independent, and the resulting classification should be the same.

With respect to the lower performance observed when a document’s word-vector is replaced with its domain-specific concepts only, future research should test the exhaustiveness of the AGROVOC ontology used here. On the other hand, AGROVOC is a rather generic thesaurus, used for the whole agricultural domain, whereas subsets of the documents used in this research are presumably more specific to certain domains. It would therefore be of particular interest to re-evaluate the settings used in this test set with a document set limited to a very specific domain and a suitable domain-specific ontology.
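
The concept-only replacement of a document’s word-vector discussed above can be illustrated with a minimal sketch. It is not the implementation used in our experiments: the thesaurus entries, the concept IDs and the function name `concept_vector` are hypothetical and do not reproduce real AGROVOC data, and matching is simplified to single, unstemmed words.

```python
import re
from collections import Counter

# Hypothetical toy thesaurus: (language, term) -> concept ID.
# Entries and IDs are invented for illustration; they are not AGROVOC data.
TOY_THESAURUS = {
    ("en", "rice"): "C1",
    ("en", "irrigation"): "C2",
    ("en", "soil"): "C3",
    ("es", "arroz"): "C1",
    ("es", "riego"): "C2",
    ("es", "suelo"): "C3",
}


def concept_vector(text, lang, thesaurus=TOY_THESAURUS):
    """Replace a document's word-vector by counts of thesaurus concepts only.

    Words without a thesaurus entry for the given language are dropped,
    so the result no longer depends on the language of the text.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(
        thesaurus[(lang, token)] for token in tokens if (lang, token) in thesaurus
    )


en_doc = "Irrigation of rice improves yield; rice needs wet soil."
es_doc = "El riego del arroz mejora el rendimiento; el arroz necesita suelo húmedo."

en_vec = concept_vector(en_doc, "en")
es_vec = concept_vector(es_doc, "es")

# Both documents map to the same concept counts.
print(en_vec == es_vec)
```

In this sketch, a classifier trained on vectors such as `en_vec` would receive the same kind of input for the Spanish document. The sketch also shows why exhaustiveness of the thesaurus matters: every word without an entry is lost to the classifier.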

Moreover, especially in multinational organizations and environments such as the FAO, more and more documents are themselves multilingual, containing parts written in different languages. The integration of background knowledge as described above has the potential to behave robustly on such documents, although this remains to be evaluated.

In conclusion, the results shown here are preliminary steps towards a promising option: using support vector machines for automatic subject indexing in a multilingual environment. Future research should explore other domains in order to confirm or refute the findings made here.

Acknowledgements. We express our gratitude to the FAO of the UN, Rome for the funding of our work. We especially thank all our colleagues there for their substantial contribution in requirements analysis and the compilation of the test document sets.


[8] This result could be confirmed with further test runs conducted on the document sets compiled for single-label classification.
