Information extraction from the web using a search engine

Ideally, for a given term v we are interested in a synonym of v within T. As
the GTAA only consists of about 5,000 terms, it is not likely that a synonym is
present for v. We therefore focus on finding the narrowest broader term t for v.
For example, if we are interested in a mapping for the term albatross, the terms
bird and animal are indeed broader terms, but are too broad. Seabird, however,
would be the narrowest broader term for albatross.
The algorithm presented is to be used as an assistant. For a given query v, we
therefore present multiple candidate terms with hyperlinks to the GTAA. Even if
t is not identified by the assistant, t can easily be found if the method
returns terms that are semantically close (e.g. by the RT relation) and hence
link to t. Instead of navigating through a thesaurus with 5,000 terms, the user
is now presented with only a handful of alternatives. Hence, the user can select
a term at a single glance.
As no suitable structured information is available for this task, we again use
a pattern-based method to determine a mapping from v to the thesaurus. Using
the Yahoo! API for our experiments, we are allowed to perform 5,000 automated
queries a day. Approaches such as those discussed in [Cimiano & Staab, 2004;
Cilibrasi & Vitanyi, 2007] have a query complexity of the order of the number
of terms in T per query term v. With about 5,000 terms in T, mapping a single
term v in this way would consume roughly a full day's query allowance. We
therefore aim at an approach that is more efficient in the number of queries
per term.
As a first approximation, we start by determining the most relevant categories
(Table 5.1) for v. We use the computed categories in the next steps, where we
present three alternative approaches for mapping v to T.
5.1.3 Determining Categories
A commonly used paradigm in natural language processing is that the semantics of
a term can be determined by its context [Manning & Schütze, 1999]. We use this
assumption to first determine the subcategory (and thus the main category) for
the term v. For each subcategory r, we compute a score s_v(r).
We collect the 100 snippets for the query "v" and we scan them for terms
in T. Each term in T that occurs in the snippets contributes to the scores of
its subcategories [Fleischman & Hovy, 2002]. Hence, if the term
bioscooppersoneel (see Table 5.1) occurs in the snippets found with v, this
occurrence contributes to the scores of the subcategories 1D05.03, 1D12.01 and
1D13.02.
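The scanning step above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the whole-word matching rule and the example snippets are assumptions, and matching of singular forms is left out. Only the term bioscooppersoneel and its three subcategory codes are taken from the text.

```python
import re
from collections import Counter


def count_term_occurrences(snippets, thesaurus_terms):
    """Count how often each thesaurus term occurs in the snippets.

    Matching is case-insensitive and on whole words only (an assumption of
    this sketch), so a short term is not counted inside a longer word.
    """
    counts = Counter()
    for snippet in snippets:
        text = snippet.lower()
        for term in thesaurus_terms:
            pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
            counts[term] += len(re.findall(pattern, text))
    # Keep only the terms that were actually found.
    return Counter({term: n for term, n in counts.items() if n > 0})


def touched_subcategories(counts, subcategories_of):
    """Add each term's occurrence count to every subcategory it belongs to."""
    totals = Counter()
    for term, n in counts.items():
        for subcategory in subcategories_of.get(term, ()):
            totals[subcategory] += n
    return totals


# Hypothetical snippets; the term-to-subcategory mapping follows the text.
snippets = [
    "Het bioscooppersoneel staakt vandaag.",
    "Bioscooppersoneel en publiek.",
]
subcategories_of = {"bioscooppersoneel": ["1D05.03", "1D12.01", "1D13.02"]}

counts = count_term_occurrences(snippets, subcategories_of)
print(touched_subcategories(counts, subcategories_of))
```

Each occurrence of a term is credited to all of its subcategories, which mirrors the bioscooppersoneel example: one term contributes to three subcategory scores at once.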
As infrequent terms are more discriminative than frequent ones, the occurrences
of the terms in T are weighted inversely by their estimated total frequency on
the web. Words such as haar (either hair or her in English) that appear
frequently in Dutch texts get a lower score than infrequent terms such as
1 mei-vieringen (May 1 celebrations).
The score s_v(r) for subcategory r is given by a tf.idf-based weighting scheme
[Manning & Schütze, 1999]:

\[
s_v(r) = \sum_{t \in r} \mathrm{oc}(t) \cdot \log \frac{C}{f(t)}, \tag{5.1}
\]

where oc(t) denotes the number of times term t (or its singular form) occurs,
f(t) gives the number of search engine hits for the query "t", and
C =
∑
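As a rough illustration of how the score s_v(r) in (5.1) could be computed, the sketch below takes oc(t), f(t) and the constant C as inputs. All names and numbers are hypothetical: C is simply passed in as a parameter, the hit counts are invented for the example, and the subcategory labels "A" and "B" are placeholders rather than GTAA codes.

```python
import math


def subcategory_scores(oc, hits, subcategories_of, C):
    """Compute s_v(r) = sum over t in r of oc(t) * log(C / f(t)).

    oc               -- occurrences of each term t in the snippets for v
    hits             -- f(t), the search engine hit count for each term
    subcategories_of -- the subcategories each term belongs to
    C                -- the normalising constant, supplied by the caller
    """
    scores = {}
    for term, count in oc.items():
        f_t = hits.get(term, 0)
        if count == 0 or f_t <= 0:
            continue  # no occurrences, or no usable hit count for this term
        weight = count * math.log(C / f_t)
        for subcategory in subcategories_of.get(term, ()):
            scores[subcategory] = scores.get(subcategory, 0.0) + weight
    return scores


# Invented figures: a frequent word seen three times, a rare term seen once.
oc = {"haar": 3, "1 mei-vieringen": 1}
hits = {"haar": 1_000_000, "1 mei-vieringen": 100}
subcategories_of = {"haar": ["A"], "1 mei-vieringen": ["B"]}

scores = subcategory_scores(oc, hits, subcategories_of, C=10_000_000)
print(scores)
```

With these invented figures, the single occurrence of the rare term contributes more to its subcategory than three occurrences of the frequent word contribute to theirs, which is the effect the logarithmic weighting is meant to produce.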
