 Heterogeneous resources, and how to deal with them?

An introduction to transfer components and how they help in accessing very heterogeneous resources. The paper was written by a research group from the Social Science Information Centre, Bonn, as part of the work related to the ETB project.

ETB Working Paper: Transfer Components for Accessing Different Layers of Data Quality and of Repository Types
Nowadays, users of information services are faced with highly decentralised, heterogeneous document sources with differing approaches to content analysis. Semantic heterogeneity occurs, for example, when resources that use different systems for content description are searched through a single query system. This semantic heterogeneity is much harder to deal with than technological heterogeneity. Standardization efforts such as the Dublin Core Metadata Initiative (DC) are a useful precondition for comprehensive search processes, but they assume a hierarchical model of cooperation that is accepted by all players.

Because of the diverse interests of the partners involved, such a strict model can hardly be realised. Projects should therefore expect even stronger differences in document creation, indexing and distribution, with increasingly anarchic tendencies (cf. Krause 1996, Krause/Marx 2000). To solve this problem, or at least to moderate it, we suggest a system consisting of several transformation modules between different document description languages. These special agents need to map between different thesauri and classifications.

The mapping between different terminologies can be done by using intellectual, statistical and/or neural network transfer modules. Intellectual transfers use cross-concordances between different classification schemes or thesauri. Section 2 describes the creation, storage and handling of such transfers.
Statistical transfer modules can be used to supplement or replace cross-concordances. They allow a statistical crosswalk between two different thesauri or even between a thesaurus and the terms of automatically indexed documents. The algorithm is based on the analysis of co-occurrence of terms within two sets of comparable documents. The main principles of this approach are discussed in section 3.
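
The co-occurrence principle can be illustrated with a minimal sketch. It assumes a parallel corpus in which every document carries terms from both vocabularies, and it uses a plain conditional probability as the association weight. The function name `build_transfer`, the threshold `min_weight` and the toy terms are hypothetical and are not taken from the working paper, which may use a different association measure.

```python
from collections import Counter, defaultdict


def build_transfer(parallel_docs, min_weight=0.6):
    """Derive a weighted term mapping from vocabulary A to vocabulary B.

    parallel_docs: iterable of (terms_a, terms_b) pairs, one pair per
    document, holding the terms assigned under each vocabulary.
    The weight of a -> b is the share of documents indexed with a that
    are also indexed with b (an assumed measure, chosen for simplicity).
    """
    count_a = Counter()               # documents per A-term
    pair_count = defaultdict(Counter)  # co-occurrences per (A-term, B-term)
    for terms_a, terms_b in parallel_docs:
        for a in set(terms_a):
            count_a[a] += 1
            for b in set(terms_b):
                pair_count[a][b] += 1

    transfer = {}
    for a, partners in pair_count.items():
        # Keep only the B-terms whose weight reaches the threshold.
        kept = {b: n / count_a[a] for b, n in partners.items()
                if n / count_a[a] >= min_weight}
        if kept:
            transfer[a] = kept
    return transfer


# Toy parallel corpus (invented terms, for illustration only).
corpus = [
    ({"unemployment"}, {"joblessness", "labour market"}),
    ({"unemployment", "youth"}, {"joblessness", "young people"}),
    ({"youth"}, {"young people"}),
    ({"unemployment"}, {"labour market"}),
]
transfer = build_transfer(corpus)
```

In this toy corpus, "youth" maps only to "young people", while "unemployment" maps to both "joblessness" and "labour market" with equal weight. The same scheme would apply if one side consisted of automatically extracted free-text terms instead of thesaurus terms.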

A fundamental problem of co-occurrence analysis is to find documents of similar content, i.e. to build up a parallel corpus. Because this cannot be done in all domains, a corpus has to be simulated in some of them. This simulation is explained in sections 4 and 5.

The traditional form of vagueness treatment in Information Retrieval refers to the comparison between query terms and content analysis terms, whereby the document level is regarded as the uniform modelling basis. In contrast, the heterogeneity modules mentioned above are used within the so-called two-step model (cf. Krause 2000), which was developed at the Social Science Information Centre (IZ) in the context of the projects ELVIRA, CARMEN, ViBSoz, and ETB (cf. section 1.2).

It is based on the assumption that heterogeneous document sets should first be interlinked through transfer modules (vagueness modelling at the document level) before they are integrated into the superordinate process of vagueness treatment between documents and query (the traditional Information Retrieval problem).

If, for example, three heterogeneous document sets A, B and C have to be integrated, transfer modules between A and B and between B and C each treat, bilaterally, the vagueness between the different content analysis methods (cf. Figure 1 in the report I). The hope behind this form of vagueness treatment, which differs considerably from the procedure traditionally used in Information Retrieval, is to achieve greater flexibility and target accuracy of the overall procedure by separating the vagueness problem into parts. Different forms of vagueness do not flow into one another uncontrolled, but can be treated close to their cause (e.g. the differences between two thesauri). This appears more plausible in cognitive terms and permits the combination of a wide range of modules for the treatment of vagueness. This combination of modules becomes quite effective when applied to the retrieval of heterogeneous data sets.
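
The separation into two steps can be sketched as follows. This is only an illustration under stated assumptions: the transfer tables, the function names and the simple term-overlap matching are hypothetical, and chaining the A to B and B to C modules to reach collection C is one possible reading of the three-collection example, not a description of the actual systems.

```python
# Hypothetical bilateral transfer modules (for instance cross-concordances
# or statistically derived mappings); the entries are invented.
TRANSFER_A_TO_B = {"unemployment": ["joblessness"]}
TRANSFER_B_TO_C = {"joblessness": ["out of work"]}


def apply_transfer(terms, table):
    """Step 1: translate terms into the vocabulary of another collection."""
    translated = set()
    for term in terms:
        translated.update(table.get(term, []))
    return translated


def match(query_terms, collection):
    """Step 2: conventional retrieval inside one collection.

    Plain term overlap stands in for whatever query/document vagueness
    treatment the retrieval system applies (an assumption of this sketch).
    """
    return sorted(doc_id for doc_id, terms in collection.items()
                  if query_terms & terms)


def two_step_search(query_a, coll_a, coll_b, coll_c):
    """Run one query, phrased in vocabulary A, against all three sets."""
    query_a = set(query_a)
    query_b = apply_transfer(query_a, TRANSFER_A_TO_B)
    query_c = apply_transfer(query_b, TRANSFER_B_TO_C)
    return {
        "A": match(query_a, coll_a),
        "B": match(query_b, coll_b),
        "C": match(query_c, coll_c),
    }


# Toy collections: document id -> indexing terms in the local vocabulary.
coll_a = {"a1": {"unemployment", "policy"}, "a2": {"education"}}
coll_b = {"b1": {"joblessness"}, "b2": {"schools"}}
coll_c = {"c1": {"out of work", "benefits"}}

results = two_step_search(["unemployment"], coll_a, coll_b, coll_c)
```

Because each transfer table only concerns one pair of vocabularies, a table can be replaced (for example, a cross-concordance by a statistical mapping) without touching the matching step or the other tables.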
Section 6 describes the technical realisation of the two-step method in concrete information retrieval systems.
This report is an edited and shortened version of IZ Working Paper No. 23 on the Treatment of Semantic Heterogeneity in Information Retrieval, where additional chapters on neural networks and on the extraction of metadata from Internet documents can be found.

Author: Michael Kluck, Jürgen Krause, Jutta Marx
Web Editor: Riina Vuorikari
Published: Thursday, 13 Dec 2001
Last changed: Thursday, 13 Dec 2001
Keywords: standardisation
