Information extraction from the web using a search engine Citation for published version (apa)

Yüklə 0,9 Mb.

Pdf görüntüsü

səhifə	30/57
tarix	09.02.2022
ölçüsü	0,9 Mb.
	#52298

1 ... 26 27 28 29 30 31 32 33 ... 57

U2.
shares no words with any other text.
To compute the subset P
0
of P, we pairwise compare the fingerprints of the text
fragments. We assume that if the fingerprints of two text fragments share at least
k words, then they relate to the same text. We note that different lyrics versions
of a song will have similar fingerprints, irrespective of whether repeating parts are
included explicitly or not. For our experiments, we chose k = 3, while m = 5.
We construct a graph, where each node in the graph corresponds to an extracted
text fragment. There is an edge between two nodes if the fingerprints of the cor-
responding text fragments share at least k words. Now, the connected components
of the graph determine the clusters of the extracted text fragments. Two extracted
text fragments are in the same cluster if there is a path in the graph from one to the
other. We only retain the text fragments in the largest cluster.
Since a fingerprint consists of an ordered list of m words, we can determine
whether two fingerprints share at least k words in at most 2m word comparisons.
Hence, constructing the graph and determining the largest connected component
requires O(n
2
m) word comparisons.
Occasionally, the extracted text fragments are so diverse that there is no clear
winning subset. In that case the previous step is redone with the alternative regular
expression, or, if that does not work, the document retrieval step is redone using a
broader query.
5.2.4 Aligning Multiple Lyrics
If the lyrics retrieval and selection have successfully completed, then we have ex-
tracted a number of similar text fragments, that are all expected to be lyrics versions
of the intended song. Even if this is the case, then there may still be a large variety

100
that life’s a bore so full of the superficial
that life support of all of the superficial
A thousand lights had made me colder
A thousand lies have made me colder
Hear him with the women just around midnight
Hear him whip the women just around midnight
Burned out dealer to the teachers pet
Burnouts deal it to the teacher’s pet
Cause waiting at the answer to his questions is a definite blow
Persuade him that the answer to his questions is a definite no!
Figure 5.2. A list of pairs of transcriptions. Each pair gives two different tran-
scriptions of the same part of a song that we encountered in the retrieved lyrics.
in the extracted text fragments. Varieties occur as a result of mishearings, typo’s
and the use of abbreviations such as
repeat chorus.
In Figure 5.2 we give a number of examples of transcriptions we encountered.
We next want to align the extracted text fragments to easily visualize the dif-
ferences and to come up with a most probable version of the lyrics. This version is
constructed using the lyrics identified on the web. The final version thus does not
need to occur as such on the web.
Aligning multiple sequences is known to be an NP-hard problem [L. Wang &
Jiang, 2004], for many sensible choices of the objective function such as the sum-
of-pairs objective function. For a given alignment of n sequences the sum-of-pairs
objective function simply sums up the score of all sequence pairs in the alignment.
Several approximation algorithms have been proposed in the literature, e.g. [L.
Wang & Gusfield, 1997; L. Wang, Jiang, & Lawler, 1996].
We choose the following approach. We first select a reference sequence and
optimally align each of the other sequences with this reference sequence. Next we
combine all these individual alignments into a single alignment of all sequences.
As reference sequence we simply choose a sequence of maximum length, as we
expect this sequence to give a complete transcription of the intended song. Shorter
sequences may not include repeating parts explicitly or miss the beginning or end
of the song.
Aligning a Pair of Lyrics
We align a pair of lyrics on the word level. To realize this, we opt for a dynamic
programming approach where we align a pair of strings S
1
and S
2
in a 2 × l matrix

Yüklə 0,9 Mb.

Dostları ilə paylaş:

1 ... 26 27 28 29 30 31 32 33 ... 57