Indexing and searching by thesaurus, as a means of providing access to
multiple resources in a multilingual context, is still a quality issue
in the Internet world.
Nothing can be more frustrating than searching by subject, where
no indexing strategy is present: each searching session has to face
all the uncertainties of natural language (synonymy, polysemy, homonymy),
combined with all the uncertainties of a full text search (no relevance
control on the retrieved occurrences). Moreover, it is easy to see
that even an indexing strategy is not enough in itself when it relies
either on broad classification systems or on indexing in natural
language: in the first case navigation in wide virtual 'containers'
is often a time-wasting operation; in the second case synonyms,
polysemes or homonyms limit and/or delay efficient and
effective information retrieval. These problems rapidly multiply
when a large number of documents and a multilingual context are
involved. A multilingual thesaurus, on the contrary:
Guarantees the effective control of the indexing language,
covering each selected concept with a preferred term in each language
and ensuring inter-language equivalence among these descriptors;
Provides a systematic display of the descriptors, making
navigation through the terminology easier;
Allows indexing and searching by combining several descriptors
ex post, in order to refine and personalise both the semantic
description and the information retrieval.
In this way a good balance between the number of retrieved documents and
their relevance is assured. These have been the main reasons for accepting,
for the ETB project too, the recommendation of the Dublin Core Working
Group concerning the advisability of including either a thesaurus
or a classification system in the implementation of an information
system.
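As a minimal sketch of the properties attributed above to a multilingual thesaurus, the following Python fragment models a concept with one preferred term per language, optional non-descriptors, and a search that combines several descriptors after indexing. The class names, the sample terms and the document identifiers are illustrative assumptions; they are not taken from the ETB thesaurus or from its software.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept with exactly one preferred term (descriptor) per language."""
    concept_id: str
    preferred: dict                                   # language code -> descriptor
    non_descriptors: dict = field(default_factory=dict)  # language code -> entry terms


class Thesaurus:
    def __init__(self, concepts):
        self.concepts = {c.concept_id: c for c in concepts}
        # Entry vocabulary: every term in every language leads to its concept.
        self._entry = {}
        for c in concepts:
            for lang, term in c.preferred.items():
                self._entry[(lang, term.lower())] = c.concept_id
            for lang, terms in c.non_descriptors.items():
                for term in terms:
                    self._entry[(lang, term.lower())] = c.concept_id

    def resolve(self, lang, term):
        """Return the concept id for a descriptor or non-descriptor, or None."""
        return self._entry.get((lang, term.lower()))

    def translate(self, term, source, target):
        """Map a term in one language to the descriptor in another language."""
        concept_id = self.resolve(source, term)
        if concept_id is None:
            return None
        return self.concepts[concept_id].preferred.get(target)


def search(index, thesaurus, lang, terms):
    """Return the documents indexed with ALL the given terms (combined ex post)."""
    result = None
    for term in terms:
        docs = index.get(thesaurus.resolve(lang, term), set())
        result = docs if result is None else result & docs
    return result or set()


# Hypothetical sample data, for illustration only.
thesaurus = Thesaurus([
    Concept("c1", {"en": "mathematics", "it": "matematica"}, {"en": ["maths"]}),
    Concept("c2", {"en": "secondary education", "it": "istruzione secondaria"}),
])
index = {"c1": {"doc1", "doc2"}, "c2": {"doc2", "doc3"}}

print(thesaurus.translate("maths", "en", "it"))
print(search(index, thesaurus, "it", ["matematica", "istruzione secondaria"]))
```

Because documents are indexed by concept rather than by word, a query in any language, or through a non-descriptor, reaches the same set of documents.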
Why a new thesaurus?
The decision to establish
a new thesaurus instead of using one of the existing multilingual
thesauri in the field of education came from the consideration that
none of them encompasses the scope envisaged by the ETB project,
i.e. the content of educational multimedia materials, in a satisfactory
way. Although potentially useful for gathering terminology, they
cover either the educational system or even teaching contents, but
only at a university level. Consequently, an ETB thesaurus project,
in five languages, possibly to be expanded to eleven languages,
was set up. It was entrusted to WP5.
The general criteria and methodology for constructing
multilingual thesauri have been established and progressively
consolidated both through an ISO standard (ISO 5964:1985) and a rich technical
literature. The WP5 partners based their work on this foundation
as far as possible from the very beginning, considering on the one
hand the importance of standardisation in an international setting
and, on the other hand, the advantage of offering a logical
and coherent searching tool to a target group involved in teaching/learning
processes. Naturally, general criteria and standards were harmonised
with the specific needs of the project.
Among the major criteria orienting WP5 decisions I would stress:
Equal status of all the linguistic areas, a very important issue
in a European context. Accordingly, each language will have a
complete display of the thesaurus. Even if English was chosen
as a working language, no language is to be considered a priori
a leading language; the possibility of some feed-back from one
language to another in establishing inter-language equivalents
will always be open;
European conceptual approach. The terminology is meant to reflect
a European rather than a national or local approach. This does not
necessarily mean denying access to information from a local point
of view: thanks to the typical thesaurus structure, such access can be
guaranteed to some extent through specific non-descriptors;
Importance of combining deductive and inductive approaches in
establishing the terminology. While existing thesauri and/or glossaries
can give a general outline both of the content of education and
of educational methods/strategies, only a terminological analysis
of available European educational repositories can provide direct
and up-to-date evidence of the concepts circulating in the field;
Necessity of creating a user-friendly documentary tool, in order to
encourage access by non-specialists to relevant information.
Construction methodology and procedures
Starting from these premises, a tree of the work to be done was established.
The first step was aimed at gaining background knowledge on the matter:
a first analysis of some relevant European educational repositories,
focusing on indexing languages and information retrieval strategies,
was carried out. The document produced by WP2 on the existing school-oriented
repositories added new information on the state of the art.
Moreover, a wide range of terminological sources was identified.
The second step produced decisions on the scope, structure, and level
of specificity of the thesaurus; on the criteria for selecting terminological
experts; and on the procedures to be followed.
As for the first item, considering that the target group is likely
to be composed, particularly at the beginning, mostly of secondary
school teachers and pupils, and that it was decided not to go, for
the moment, beyond about 1,000 descriptors, the scope was identified
in learning/teaching subjects, as well as in learning/teaching
methods and processes. Terms would have a medium level of specificity,
allowing some more specific terms for favourite/relevant subjects. The
structure would be the classical one, universally considered the best,
so as to allow friendly browsing: this structure will include both
an alphabetical and a systematic display. Intra- and inter-language
relationships, both hierarchical and associative, will be developed
among the terms.
As for the second point, the partners stated that terminological experts
would be selected on the grounds of competent knowledge both of
the sciences of education and of information sciences.
An analysis of the existing thesaurus management software led the partners
to choose the Ortelius thesaurus management software, available
free of charge, as a suitable starting point, even though this software has to
be adapted and extended according to the project needs.
Work procedures were established involving:
collaboration among the partners, both through periodical meetings
and the extensive use of communication technologies;
on-line documentation (see http://intranet.eun.org/bscw/bscw.cgi/0/1066976)
of the work carried out.
A 'touchstone list' of 670 relevant terms, serving as a reference starting
point for the terminological work, was the product of the third step.
Educational experts drew up this list on the basis of existing multilingual
thesauri in the field of education and of a statistical study of
the indexes of a bibliographical corpus of 53,764 records in the
same field. 70% of these terms already had inter-language
equivalents, which were registered together with their variants
in different sources with notations on language equivalence problems.
In parallel, BDP started to implement and refine the Ortelius software,
in order to include not only the most common features, but also
the separate and integrated management of different linguistic versions,
wide expandability and poly-hierarchy.
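To illustrate what poly-hierarchy means in practice, here is a small Python sketch in which a term may have more than one broader term. It is not the Ortelius implementation, whose internals are not described here; the class, its methods and the sample terms are assumptions made for the example.

```python
from collections import defaultdict


class Hierarchy:
    """Broader/narrower links where a term may have several broader terms."""

    def __init__(self):
        self.broader = defaultdict(set)   # term -> its broader terms
        self.narrower = defaultdict(set)  # term -> its narrower terms

    def ancestors(self, term):
        """Return all broader terms reachable from term, through every parent."""
        seen = set()
        stack = list(self.broader.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.broader.get(current, ()))
        return seen

    def add_link(self, narrower_term, broader_term):
        """Record a hierarchical link, rejecting one that would create a cycle."""
        if narrower_term == broader_term or narrower_term in self.ancestors(broader_term):
            raise ValueError(f"cycle: {narrower_term!r} -> {broader_term!r}")
        self.broader[narrower_term].add(broader_term)
        self.narrower[broader_term].add(narrower_term)


# Hypothetical terms: one term placed under two different broader terms.
h = Hierarchy()
h.add_link("educational software", "teaching materials")
h.add_link("educational software", "software")
h.add_link("teaching materials", "educational resources")

print(sorted(h.ancestors("educational software")))
```

The cycle check matters more in a poly-hierarchy than in a strict tree, since a term can be reached along several paths.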
The following activities (fourth step) mostly dealt with the quantitative/qualitative
terminological analysis of 5 educational repositories (http://sauce.pntic.mec.es/~alglobal,
and 3 databases (SOLIS (DE), Comenius 97-98 (IT), Lingua 97-98
(IT)). Statistical data processing was used in the case of indexed
corpora, whilst computational linguistics techniques were applied
to the full-text analysis of 2 Spanish repositories (relevance control,
compound analysis, KWIC, etc.).
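For readers unfamiliar with KWIC (keyword in context), the technique lists every occurrence of a term together with the words around it, so that its actual meaning in the corpus can be judged. The sketch below is a generic illustration in Python; the function name, the context width and the sample sentence are assumptions, and it does not reproduce the tools actually used in the project.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per occurrence of keyword, with left and right context."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the keyword column is aligned across lines.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Hypothetical sample text.
sample = ("Evaluation of pupils is part of teaching. "
          "Self-evaluation by pupils is encouraged.")
lines = kwic(sample, "evaluation")
for line in lines:
    print(line)
```

Reading the aligned contexts side by side makes it easier to tell whether one word form covers one concept or several.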
A purely conceptual approach was preferred for 3 repositories in
the English language area: http://www.svtc.org.uk,
These sites, rich in documents but poorly indexed, were explored
through random navigation and searching, as well as by testing
the already collected terminology in retrieving information.
A further phase, returning to textual sources, was prompted by two relevant,
recently published documents which could not be ignored:
the Socrates European project Guidelines (2000), giving useful information
on the most recent educational trends at a European level;
the “EUN thesaurus”, built in the framework of the EUN Multimedia
Project and specifically aimed at multimedia materials.
Both were carefully “scanned” in order to gather relevant terminology.
Moreover, the EUN Thesaurus, having the same scope and target, proved
to be particularly useful for checking, as the work progressed, not only
the ETB thesaurus descriptors, but also the construction criteria and
procedures (and vice versa). In addition, since HUB is
a partner both in the EUN Multimedia Project and
in the ETB Project (WP5), the experience gained in the first
enterprise was naturally incorporated into the second one.
The terms collected with this variety of approaches were matched
with the touchstone list. A great many of them could be related
to the terms of the list; some could not. On the grounds of the positive
and negative feedback coming both from terminological textual sources
and from the field, several sets of terms gravitating around relevant
concepts were drawn up (see Work-docTemplate.doc at: http://intranet.eun.org/bscw/bscw.cgi/0/1080741).
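The matching of collected terms against the touchstone list was an expert, concept-based task. Purely as an illustration of a first mechanical pass that could precede such a review, the Python sketch below separates collected terms into those related to a list entry and those that are not, using exact matching after normalisation and, as a stand-in for human judgement, the string similarity offered by the standard `difflib` module. The function names, the cutoff value and the sample terms are assumptions.

```python
import difflib


def normalise(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def match_terms(collected, touchstone, cutoff=0.8):
    """Split collected terms into those related to the list and those not."""
    reference = {normalise(t): t for t in touchstone}
    related, unrelated = {}, []
    for term in collected:
        key = normalise(term)
        if key in reference:
            related[term] = reference[key]
            continue
        # Fall back to the closest list entry above the similarity cutoff.
        close = difflib.get_close_matches(key, list(reference), n=1, cutoff=cutoff)
        if close:
            related[term] = reference[close[0]]
        else:
            unrelated.append(term)
    return related, unrelated


# Hypothetical sample terms.
touchstone = ["distance learning", "curriculum", "assessment"]
collected = ["Distance  Learning", "curricula", "hypertext"]
related, unrelated = match_terms(collected, touchstone)
print(related)
print(unrelated)
```

Surface similarity cannot detect synonyms or distinguish homonyms, so the unmatched terms and the fuzzy matches would still need conceptual review.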
The whole corpus was then submitted to:
an analysis of the concepts to be represented;
a selection of the terms best representing these concepts.
At the end of this process three lists emerged: the first one consisted
of terms related to decidedly 'heavy' concepts; the second
one of terms related to 'lighter', but still relevant concepts; the third
one of terms too irrelevant to be taken into account.
The term preferred in each set of the first and second level lists
was the one which best matched the aforesaid criteria on scope and
specificity. Some of the non-preferred terms were set aside to
be considered as non-descriptors.
At this point a first attempt was made to arrange the core terminology
(first level list) in a systematic display, identifying broad classification
groups (macro-hierarchies) and semantic relationships within these
groups (fifth step) in order to get a first general view of the
internal balance in the terminology. Simultaneously, multilingual lists
of language and country names were drafted, adopting the criterion
of covering the European area in detail and the other regions in a
more general way.
Afterwards, two meetings were arranged in order to:
the construction criteria;
duties among the partners on the basis of the intermediate results;
the entry of the terminological experts into the WP5 activities
the technical choices involving both the management software and
the browsing with the WP10 partners.
This being the state of the art in September 2000, two further steps
are planned according to the stated criteria:
WP5 partners, together with the terminological experts, will collaborate
to structure, adjust, implement and balance the terminology (sixth
step), particularly considering the need to:
define each polysemous and/or ambiguous term according to
the scope of the thesaurus;
fill in gaps;
expand the terminology for 'priority subjects' in the European context;
balance a local with an international approach.
In this phase of the work, the terminological experts will play an
important role: they will not simply translate, but also provide
constructive feedback on the terminology, signal gaps, propose
non-descriptors as a bridge from the national to the international
terms, and help to define terms in a way acceptable to the majority
of the cultural-linguistic areas involved.
The implementation and refinement of the terminology will go on in parallel
with establishing hierarchies and entering the terms in the management
software. The software will also contain the historical notes of
the work, recording definitions, sources and any changes in decisions
that may occur in controversial cases, as well as data concerning
the future expansion of the thesaurus up to eleven languages (language
equivalents, definitions etc.).
Detailed notation to the classified thesaurus will be added only
when the thesaurus is almost complete, since any change in this
field implies laborious alterations.
Working by concepts in this phase will be crucial: for any 'old' term the
connection of the term with a well-defined concept has to be clear
and established through the hierarchy and/or a scope note;
each new proposal will arise from the need to complete a
conceptual field in a suitable way and will be accompanied by a
definition, as many linguistic versions as possible, a consideration
of its links with the whole thesaurus, and documentation giving evidence
of its relevance.
The design of an on-line evaluation system in support of future
evaluation procedures will take place at this stage, when the ETB
thesaurus is assuming its definitive form.
The sixth step, the most complex in the whole process, will take at
least nine months, ending with the 0.0 release of the ETB thesaurus
(Alpha version - May 2001), ready for a testing tour and formally
described in order to be processed for the browsing. The 0.0 release
will also be accompanied by an introduction explaining the features
of the system and clarifying how the thesaurus is to be used.
In the three subsequent months (seventh step), WP10 partners will evaluate
the formal description with respect to the processing needed to
produce the agreed user interface. From a different point of view,
a selected group of users will test the effectiveness of the thesaurus
through searching by subject in the system. This trial might reveal
further gaps and necessitate the introduction of new descriptors
and/or non-descriptors, and/or modifications in the system of relations.
Once the necessary changes and additions have been made, the 1.0 release
of the ETB thesaurus should finally be available in April 2002.
See, among others: F. W. Lancaster, Thesaurus construction and use,
Paris, Unesco, 1985 (PGI-85/WS/11) and Vocabulary control
for information retrieval, Arlington, Virginia, IRP, 1986; G. Van
Slype, Les langages d'indexation : conception, construction
et utilisation dans les systèmes documentaires, Paris,
Les Éditions d'Organisation, 1987; M. Trigari, Come costruire un
thesaurus, Modena, Panini, 1992; J. Aitchison, D. Bawden,
A. Gilchrist, Thesaurus construction and use : a practical
manual, 3rd ed., London, Aslib, 1997.