This paper introduces transfer components and shows how they can help in accessing highly heterogeneous resources. It was written by a research group from the Social Science
Information Centre, Bonn, as part of the work related to the ETB project.
ETB Working Paper: Transfer Components for Accessing Different Layers of
Data Quality and of Repository Types
Nowadays, users of information services are faced with highly decentralised,
heterogeneous document sources that use differing forms of content analysis. Semantic
heterogeneity occurs, for example, when resources using different systems for content
description are searched through a single query system. This semantic heterogeneity
is much harder to deal with than its technological counterpart. Standardisation
efforts such as the Dublin Core Metadata Initiative (DC) are a useful precondition
for comprehensive search processes, but they assume a hierarchical model
of cooperation accepted by all players.
Because of the diverse interests of the partners involved, such a strict model
can hardly be realised. Projects should instead expect even stronger differences
in document creation, indexing and distribution, with increasingly anarchic
tendencies (cf. Krause 1996, Krause/Marx 2000). To solve this problem,
or at least to moderate it, we suggest a system consisting of a set
of transformation modules between different document description languages.
These special agents need to map between different thesauri and classifications.
The mapping between different terminologies can be done by using intellectual,
statistical and/or neural network transfer modules. Intellectual transfers
use cross-concordances between different classification schemes or thesauri.
Section 2 describes the creation, storage and handling of such transfers.
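To illustrate the idea, a cross-concordance can be thought of as a lookup table that relates terms of one thesaurus to terms of another, annotated with a relation type. The following sketch uses invented terms and relations; it is not taken from any actual thesaurus or concordance used in the ETB project.

```python
# Illustrative sketch of an intellectually created cross-concordance.
# All terms and relation types below are hypothetical examples.
cross_concordance = {
    "unemployment": [("joblessness", "exact")],
    "labour market": [("employment market", "exact"),
                      ("labour market policy", "narrower")],
    "education": [("schooling", "broader")],
}

def transfer(term, concordance):
    """Map a query term from thesaurus A to the terms of thesaurus B."""
    return [target for target, _relation in concordance.get(term, [])]

print(transfer("labour market", cross_concordance))
```

A real transfer module would of course also record the direction of the mapping and handle one-to-many and many-to-one relations, but the table-lookup structure stays the same.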
Statistical transfer modules can be used to supplement or replace cross-concordances.
They allow a statistical crosswalk between two different thesauri or even
between a thesaurus and the terms of automatically indexed documents. The
algorithm is based on the analysis of co-occurrence of terms within two
sets of comparable documents. The main principles of this approach are
discussed in section 3.
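The core of the statistical approach can be sketched as follows: given a parallel corpus in which each document is indexed once with vocabulary A and once with vocabulary B, co-occurrence counts yield candidate mappings. The toy corpus and the conditional-probability weighting below are illustrative assumptions, not the exact algorithm described in section 3.

```python
# Hedged sketch of a co-occurrence based crosswalk between two
# indexing vocabularies A and B over a (toy) parallel corpus.
from collections import defaultdict

# Each entry: (terms from vocabulary A, terms from vocabulary B)
# assigned to the same document. The documents are invented.
parallel_corpus = [
    ({"migration"}, {"population movement"}),
    ({"migration", "labour"}, {"population movement", "employment"}),
    ({"labour"}, {"employment"}),
]

cooc = defaultdict(lambda: defaultdict(int))
freq_a = defaultdict(int)
for terms_a, terms_b in parallel_corpus:
    for a in terms_a:
        freq_a[a] += 1
        for b in terms_b:
            cooc[a][b] += 1

def crosswalk(term_a, threshold=0.5):
    """Return B-terms whose conditional co-occurrence P(b|a) meets the threshold."""
    return sorted(b for b, n in cooc[term_a].items()
                  if n / freq_a[term_a] >= threshold)

print(crosswalk("migration"))
```

With real data, the raw conditional probability would typically be replaced by a more robust association measure, but the principle of deriving term relations from shared document assignments is the same.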
A fundamental problem of co-occurrence analysis is to find documents
of similar content, i.e. to build up a parallel corpus. Because such a corpus cannot
be built in all domains, it has to be simulated in some. This simulation
is explained in sections 4 and 5.
The traditional form of vagueness treatment in Information Retrieval
refers to the comparison between query terms and content analysis terms,
whereby the document level is regarded as the uniform modelling basis.
In contrast, the heterogeneity modules mentioned above are used within
the so-called two-step model (cf. Krause 2000), which was developed
at the Social Science Information Centre (IZ) in the context of the projects
ELVIRA, CARMEN, ViBSoz and ETB (cf. section 1.2).
It is based on the assumption that heterogeneous document sets should
first be interlinked through transfer modules (vagueness modelling at the
document level) before they are integrated into the superordinate process
of vagueness treatment between documents and query (the traditional Information
Retrieval approach).
If, for example, three heterogeneous document sets A, B and C have to be integrated,
transfer modules between A - B and B - C each treat bilaterally the vagueness
between the different content analysis methods (cf. Figure 1 in the report).
The hope behind this form of vagueness treatment, which differs considerably
from the procedure traditionally used in Information Retrieval, is that
separating the vagueness problem yields greater flexibility and target
accuracy in the overall procedure. Different forms of vagueness do not
flow uncontrolled into one another, but can be treated close to their cause
(e.g. the differences between two thesauri). This appears more plausible
in cognitive terms and permits the combination of a wide range of modules
for the treatment of vagueness. This combination of modules becomes quite effective
when applied to the retrieval of heterogeneous data sets.
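The bilateral chaining of transfer modules can be sketched as a simple composition of term mappings: a query formulated in vocabulary A is first transferred to B and from there to C. The mappings below are invented for illustration and do not reproduce any concordance from the projects named above.

```python
# Sketch of composing bilateral transfer modules A->B and B->C, so
# that a query in vocabulary A can reach documents indexed with C.
# Both mapping tables are hypothetical examples.
transfer_a_to_b = {"unemployment": ["joblessness"]}
transfer_b_to_c = {"joblessness": ["out of work", "jobseeker"]}

def compose(term, first, second):
    """Chain two bilateral term mappings: A -> B -> C."""
    result = []
    for intermediate in first.get(term, []):
        result.extend(second.get(intermediate, []))
    return result

print(compose("unemployment", transfer_a_to_b, transfer_b_to_c))
```

One design consequence of this composition is that only n-1 bilateral modules are needed to connect n document sets through a chain, rather than a module for every pair.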
Section 6 describes the technical realisation of the two-step method
in concrete information retrieval systems.
This report is an edited and shortened version derived from the IZ working
paper No. 23 on the Treatment of Semantic Heterogeneity in Information
Retrieval, where additional chapters on neural networks and on the extraction
of metadata from Internet documents can be found.
Michael Kluck, Jürgen Krause, Jutta Marx
Thursday, 13 Dec 2001