Multilinguality and Thesaurus
by Marisa Trigari, BDP, Italy
Indexing and searching by thesaurus, as a means of providing and gaining access to multiple resources in a multilingual context, is still a quality issue in the Internet world.

Nothing can be more frustrating than searching by subject when no indexing strategy is present: each search session has to face all the uncertainties of natural language (synonymy, polysemy, homonymy), combined with all the uncertainties of a full-text search (no relevance control on the retrieved occurrences). Moreover, even an indexing strategy is not enough in itself when it relies either on broad classification systems or on indexing in natural language: in the first case, navigating through wide virtual 'containers' is often a time-wasting operation; in the second case, synonyms, polysemes or homonyms limit and/or delay efficient and effective information retrieval. These problems multiply rapidly when a large number of documents and a multilingual context are involved.

On the contrary, a multilingual thesaurus:

1. Guarantees the effective control of the indexing language, covering each selected concept with a preferred term in each language and ensuring inter-language equivalence among these descriptors;

2. Provides a systematic display of the descriptors, making navigation through the terminology easier;

3. Allows indexing and searching by combining several descriptors ex post, in order to refine and personalise both the semantic description and the information retrieval. 

In this way a good balance between the number of retrieved documents and their relevance is assured. These have been the main reasons why the ETB project, too, accepted the recommendation of the Dublin Core Working Group on the advisability of including either a thesaurus or a classification system in the implementation of an information system.
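
The three properties listed above can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration only: the concept identifiers, the sample terms, the document records and the function names are hypothetical and are not taken from the ETB thesaurus or its software.

```python
# Hypothetical sample data: each concept has one preferred term
# (descriptor) in each language.
DESCRIPTORS = {
    "c1": {"en": "mathematics", "it": "matematica", "fr": "mathématiques"},
    "c2": {
        "en": "secondary education",
        "it": "istruzione secondaria",
        "fr": "enseignement secondaire",
    },
}

# Hypothetical non-descriptors (entry terms) that lead to a concept.
NON_DESCRIPTORS = {
    ("en", "maths"): "c1",
    ("en", "secondary school"): "c2",
}

# Hypothetical documents, indexed with concept identifiers rather than words.
DOCUMENTS = {
    "doc-a": {"c1"},
    "doc-b": {"c1", "c2"},
    "doc-c": {"c2"},
}


def resolve(term, lang):
    """Map a descriptor or non-descriptor in one language to a concept id."""
    key = term.strip().lower()
    if not key:
        return None
    for concept, labels in DESCRIPTORS.items():
        if labels.get(lang, "").lower() == key:
            return concept
    return NON_DESCRIPTORS.get((lang, key))


def translate(term, source_lang, target_lang):
    """Return the equivalent descriptor in another language, if any."""
    concept = resolve(term, source_lang)
    if concept is None:
        return None
    return DESCRIPTORS[concept].get(target_lang)


def search(terms, lang):
    """Combine several descriptors ex post: a document must carry all of them."""
    concepts = set()
    for term in terms:
        concept = resolve(term, lang)
        if concept is None:
            raise KeyError(f"unknown term: {term!r} ({lang})")
        concepts.add(concept)
    return sorted(doc for doc, index in DOCUMENTS.items() if concepts <= index)
```

Because the documents are indexed by concept, a search in Italian and a search in English reach the same records, and adding a second descriptor narrows the result set instead of widening it.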

Why a new thesaurus
The decision to establish a new thesaurus instead of using one of the existing multilingual thesauri in the field of education came from the consideration that none of them satisfactorily encompasses the scope envisaged by the ETB project, i.e. the content of educational multimedia materials. Although potentially useful for gathering terminology, they cover either the educational system or, where they do cover teaching content, only the university level. Consequently, an ETB thesaurus project, in five languages, possibly to be expanded to eleven languages, was set up. It was entrusted to WP5.

The premises
The general criteria and methodology for constructing multilingual thesauri have been established and progressively consolidated both through an ISO standard (ISO 5964/1995) and through a rich technical literature. The WP5 partners based their work on this ground as far as possible from the very beginning, considering on the one hand the importance of standardisation in an international setting and, on the other hand, the desirability of offering a logical and coherent searching tool to a target group involved in teaching/learning processes. Naturally, general criteria and standards were harmonised with the specific needs of the project.
Among the major criteria orienting WP5 decisions I would stress the following:

1. Equal status of all the linguistic areas, a very important issue in a European context. Accordingly, each language will have a complete display of the thesaurus. Even though English was chosen as the working language, no language is to be considered a priori a leading language; the possibility of feedback from one language to another in establishing inter-language equivalents will always be open;

2. European conceptual approach. The terminology is supposed to reflect a European rather than a national or local approach. That doesn't necessarily mean denying access to information from a local point of view: thanks to the typical thesaurus structure, it can be guaranteed to some extent through specific non-descriptors;

3. Importance of combining deductive and inductive approaches in establishing the terminology. While existing thesauri and/or glossaries can give a general outline both of the content of education and of educational methods/strategies, only a terminological analysis of available European educational repositories can provide direct and up-to-date evidence of the concepts circulating in the field;

4. Necessity of creating a user-friendly documentary tool, in order to encourage non-specialists to access relevant information.

Construction methodology and procedures

In accordance with these premises, a tree of the work to be done was established.


The first step was aimed at gaining background knowledge on the matter: a first analysis of some relevant European educational repositories, focusing on indexing languages and information retrieval strategies, was carried out. The document produced by WP2 on the existing school-oriented repositories added new information on the state of the art. Moreover, a wide range of terminological sources was identified.

The second step produced decisions on the scope, structure and level of specificity of the thesaurus; on the criteria for selecting terminological experts; and on the procedures to be followed.
As for the first item, considering

  • the European dimension,
  • a target group likely to be composed, particularly at the beginning, mostly of secondary school teachers and pupils,
  • the advisability of not exceeding, for the moment, about 1,000 descriptors,
the scope was identified as learning/teaching subjects, together with learning/teaching methods and processes. Terms would have a medium level of specificity, allowing some in-depth terms for favoured/relevant subjects. The structure would be the classical one, universally considered the best for user-friendly browsing: it will include both an alphabetical and a systematic display. Intra-language and inter-language equivalence relationships, as well as hierarchical and associative relationships, will be developed among the terms.
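
The difference between the alphabetical and the systematic display can be shown with a short Python sketch. It is only an illustration under stated assumptions: the sample terms and relationships are invented, the function names are hypothetical, the hierarchy is assumed to contain no cycles, and the BT/NT/RT tags (broader, narrower, related term) are the conventional thesaurus abbreviations rather than anything specified by the project.

```python
from collections import defaultdict

# Hypothetical sample relationships: (narrower term, broader term) pairs
# and symmetric associative pairs.
BROADER = [
    ("algebra", "mathematics"),
    ("geometry", "mathematics"),
    ("mathematics", "sciences"),
    ("physics", "sciences"),
]
RELATED = [("geometry", "technical drawing")]


def build(broader_pairs, related_pairs):
    """Build broader (BT), narrower (NT) and related (RT) lookups."""
    bt, nt, rt = defaultdict(set), defaultdict(set), defaultdict(set)
    for narrow, broad in broader_pairs:
        bt[narrow].add(broad)
        nt[broad].add(narrow)
    for a, b in related_pairs:
        rt[a].add(b)
        rt[b].add(a)
    terms = set(bt) | set(nt) | set(rt)
    return terms, bt, nt, rt


def alphabetical_display(terms, bt, nt, rt):
    """List every term in A-Z order, followed by its relationships."""
    lines = []
    for term in sorted(terms):
        lines.append(term)
        for tag, relation in (("BT", bt), ("NT", nt), ("RT", rt)):
            for other in sorted(relation.get(term, ())):
                lines.append(f"  {tag} {other}")
    return lines


def systematic_display(terms, bt, nt):
    """List terms as an indented tree, starting from the top terms."""
    lines = []

    def walk(term, depth):
        lines.append("  " * depth + term)
        for child in sorted(nt.get(term, ())):
            walk(child, depth + 1)

    for top in sorted(t for t in terms if not bt.get(t)):
        walk(top, 0)
    return lines


if __name__ == "__main__":
    terms, bt, nt, rt = build(BROADER, RELATED)
    print("\n".join(alphabetical_display(terms, bt, nt, rt)))
    print()
    print("\n".join(systematic_display(terms, bt, nt)))
```

Both displays are generated from the same relationship data, so a change in one relationship only has to be recorded once.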

As for the second point, the partners stated that terminological experts would be selected on the grounds of competent knowledge both of the sciences of education and of information sciences.

An analysis of the existing thesaurus management software led the partners to choose the Ortelius thesaurus management software, available free of charge, as a suitable starting point, even though this software has to be adapted and extended according to the project needs.

Finally, work procedures were established involving:

  • Close collaboration among the partners, both through periodic meetings and through extensive use of communication technologies;
  • Precise on-line documentation of the work carried out.

The product of the third step was a 'touchstone list' of 670 relevant terms, serving as a reference starting point for the terminological work. Educational experts drew up this list on the basis of existing multilingual thesauri in the field of education and of a statistical study of the indexes of a bibliographical corpus of 53,764 records in the same field. 70% of these terms already had inter-language equivalents, which were recorded together with their variants in different sources and with notes on language equivalence problems.
In parallel, BDP started to extend and refine the Ortelius software, in order to include not only the most common features, but also the separate and integrated management of different linguistic versions, wide expandability and poly-hierarchy.

The following activities (fourth step) mostly dealt with the quantitative/qualitative terminological analysis of 5 educational repositories and 3 databases (SOLIS (DE), Comenius 97-98 (IT) and Lingua 97-98 (IT)). Statistical data processing was used in the case of indexed corpora, whilst strategies of computational linguistics were applied to the full-text analysis of 2 Spanish repositories (relevance control, compound analysis, KWIC, etc.).
A wholly conceptual approach was preferred for 3 repositories in the English-language area. These sites, rich in documents but poorly indexed, were explored through random navigation and searching, as well as by checking how well the terminology already collected retrieved information.
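
Of the full-text techniques mentioned above, KWIC (keyword in context) is the easiest to illustrate. The following Python sketch is not the tool used in the project; the function name, the `width` parameter and the sample sentence are hypothetical, and tokenisation is reduced to a simple split on word characters.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    `width` is the number of words kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Hypothetical sample text, not taken from any analysed repository.
    sample = (
        "Teaching methods for language learning. "
        "Distance learning in secondary schools."
    )
    for left, word, right in kwic(sample, "learning", width=2):
        print(f"{left:>20} [{word}] {right}")
```

Reading a candidate term alongside its neighbouring words helps to decide whether its occurrences point to one concept or to several, which is the kind of relevance control a plain frequency count cannot give.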

A further phase, returning to textual sources, was prompted by two relevant, recently published documents, which could not be ignored:

  1. the Socrates European project Guidelines (2000), giving useful information on the most recent educational trends at a European level;
  2. the “EUN thesaurus”, built in the framework of the EUN Multimedia Project and specifically aimed at multimedia materials (~kluck/eunthes9/eunthes-ind.html).

Both were carefully “scanned” in order to gather relevant terminology. Moreover, the EUN Thesaurus, having the same scope and target, proved to be particularly useful for the ongoing checking not only of the ETB thesaurus descriptors, but also of the construction criteria and procedures (and vice versa). In addition, since HUB is a partner in both the EUN Multimedia Project and the ETB Project (WP5), the experience gained in the first enterprise was naturally incorporated into the second.

Finally, the terms collected with this variety of approaches were matched with the touchstone list. A great many of them could be related to the terms of the list; some could not. On the grounds of the positive and negative feedback coming both from terminological textual sources and from the field, several sets of terms gravitating around relevant concepts were drawn up (see Work-docTemplate.doc). The whole corpus was then submitted to:


  1. Weight analysis of the concepts to be represented
  2. Selection of the terms best representing these concepts. 

At the end of this process 3 lists emerged: the first consisted of terms related to decidedly 'heavy' concepts; the second of terms related to 'lighter' but still relevant concepts; the third of terms too irrelevant to be taken into account.
The term preferred in each set of the first and second level lists was the one which best matched the aforesaid criteria on scope and specificity. Some of the non-preferred terms were set aside to be considered as non-descriptors.
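
The split into three lists can be pictured as a simple thresholding of concept weights. The Python sketch below is purely illustrative: the article does not say how weights were computed or where the cut-off points lay, so the numeric weights, the two thresholds and the function name are all assumptions.

```python
def partition_by_weight(weights, heavy=0.6, light=0.2):
    """Split concepts into three lists according to their weight.

    `weights` maps a concept to a number; `heavy` and `light` are
    hypothetical thresholds chosen only for this example.
    """
    first_level, second_level, discarded = [], [], []
    for concept, weight in sorted(weights.items()):
        if weight >= heavy:
            first_level.append(concept)
        elif weight >= light:
            second_level.append(concept)
        else:
            discarded.append(concept)
    return first_level, second_level, discarded


if __name__ == "__main__":
    # Hypothetical weights, not project data.
    sample = {"mathematics": 0.8, "calligraphy": 0.3, "heraldry": 0.05}
    print(partition_by_weight(sample))
```

In practice the weighting described above drew on both textual sources and feedback from the field, so a single number per concept is a simplification.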

At this point a first attempt was made to arrange the core terminology (first level list) in a systematic display, identifying broad classification groups (macro-hierarchies) and semantic relationships within these groups (fifth step), in order to get a first general view of the internal balance of the terminology. At the same time, multilingual lists of language and country names were drafted, adopting the criterion of covering the European area in detail and the other regions in a more general way.

Immediately afterwards two meetings were arranged in order to:


  1. Check the construction criteria;
  2. Re-allot duties among the partners on the basis of the intermediate results;
  3. Plan the entry of the terminological experts into the WP5 activities;
  4. Check, with the WP10 partners, the technical choices involving both the management software and the browsing.

That being the state of the art in September 2000, two further steps are planned according to the stated criteria:

The WP5 partners, together with the terminological experts, will collaborate to structure, adjust, extend and balance the terminology (sixth step), particularly considering:

  • Necessity of defining each polysemic and/or ambiguous term according to the scope of the thesaurus;
  • Necessity of filling in gaps;
  • Desirability of expanding the terminology for 'priority subjects' in the European educational context;
  • Necessity of balancing a local with an international approach.

In this phase of the work, the terminological experts will play an important role: they will not simply translate, but will also provide constructive feedback on the terminology, signal gaps, propose non-descriptors as a bridge from the national to the international terms, and help to define terms in a way acceptable to the majority of the cultural-linguistic areas involved.

This extension and refinement of the terminology will go on in parallel with establishing hierarchies and entering the terms in the management software. The software will also contain the historical notes of the work, recording definitions, sources and any changes in decisions that may occur in controversial cases, as well as data concerning the future expansion of the thesaurus up to eleven languages (language equivalents, definitions etc.).
Detailed notation will be added to the classified thesaurus only when the thesaurus is almost complete, since any change in this field implies laborious alterations.

Reasoning by concepts will be crucial in this phase: for any 'old' term, the connection of the term with a well-defined concept has to be clear and established through the hierarchy and/or a scope note; each new proposal will arise from the necessity of completing a conceptual field in a suitable way, and will be accompanied by a definition, as many linguistic versions as possible, consideration of its links with the whole thesaurus, and documentation giving evidence of its relevance.
The design of an on-line evaluation system in support of future evaluation procedures will take place at this stage, when the ETB thesaurus is assuming its definitive form. 

This sixth step, the most complex in the whole process, will take at least nine months, ending with the 0.0 release of the ETB thesaurus (Alpha version - May 2001), ready for a round of testing and formally described so that it can be processed for browsing. The 0.0 release will also be accompanied by an introduction explaining the features of the system and clarifying how the thesaurus is to be used and updated.

During the three subsequent months (seventh step), the WP10 partners will evaluate the formal description with respect to the processing needed to produce the agreed user interface. In parallel, a selected group of users will test the effectiveness of the thesaurus by searching by subject in the system. This trial might reveal further gaps and necessitate the introduction of new descriptors and/or non-descriptors, and/or modifications in the system of relations.

Once corrections and additions have been made, the 1.0 release of the ETB thesaurus should finally be available in April 2002.

Author: Marisa Trigari
Web Editor: Riina Vuorikari
Published: Wednesday, 28 Feb 2001
Last changed: Tuesday, 30 Apr 2002
Keywords: thesaurus