[Figure: accuracy plotted against the Laplace smoothing factor, with one curve per threshold (100%, 75%, 50%, 25%).]
Figure 3. Effect of Laplace smoothing factor with AV-ENG and the GI lexicon.
[Figure: accuracy (0 to 100%) plotted against the Laplace smoothing factor (0.0001 to 10000), with one curve per threshold (100%, 75%, 50%, 25%).]
Figure 4. Effect of Laplace smoothing factor with AV-CA and the GI lexicon.
Figure 5 plots the performance with varying smoothing factors using the smallest 
corpus, TASA. The performance is quite sensitive to the choice of smoothing factor. Our 
baseline value of 0.01 turns out to be a poor choice for the TASA corpus. The optimal 
value is about 0.001. This suggests that, when using SO-PMI with a small corpus, it 
would be wise to use cross-validation to optimize the value of the Laplace smoothing 
factor. 
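As a concrete illustration of that suggestion, the following sketch selects the smoothing factor by evaluating a grid of candidate values (the same range as in Figures 3 to 5) on held-out labelled words. The hit-count interface `hits`, the paradigm queries, and the word labels are hypothetical placeholders rather than the implementation used in the experiments; the smoothing constant is simply added to every hit count before the log ratio is taken.

import math

# Minimal sketch (not the paper's code) of tuning the Laplace smoothing
# factor by held-out validation. `hits(query, near=None)` is an assumed
# corpus interface returning hit counts for a query, optionally restricted
# to a neighbourhood of another query.

def so_pmi(word, pos_query, neg_query, hits, smoothing):
    # Laplace-smoothed log-odds estimate of semantic orientation: the
    # smoothing constant is added to every hit count, which keeps both the
    # numerator and the denominator away from zero.
    num = (hits(word, near=pos_query) + smoothing) * (hits(neg_query) + smoothing)
    den = (hits(word, near=neg_query) + smoothing) * (hits(pos_query) + smoothing)
    return math.log2(num / den)

def pick_smoothing(labelled_words, pos_query, neg_query, hits,
                   grid=(0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000)):
    # Evaluate each candidate on held-out labelled words (e.g. a fold of the
    # GI lexicon) and keep the most accurate one; k-fold cross-validation
    # would simply average this over several folds.
    best, best_acc = None, -1.0
    for s in grid:
        correct = sum(
            (so_pmi(w, pos_query, neg_query, hits, s) >= 0) == (label == "positive")
            for w, label in labelled_words
        )
        acc = correct / len(labelled_words)
        if acc > best_acc:
            best, best_acc = s, acc
    return best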


[Figure: accuracy (0 to 80%) plotted against the Laplace smoothing factor (0.0001 to 10000), with one curve per threshold (100%, 75%, 50%, 25%).]
Figure 5. Effect of Laplace smoothing factor with TASA and the GI lexicon.
These three figures show that the optimal smoothing factor increases as the size of the 
corpus increases, as expected. The figures also show that the impact of the smoothing 
factor decreases as the corpus size increases. There is less need for smoothing when a 
large quantity of data is available. The baseline smoothing factor of 0.01 was chosen to 
avoid division by zero, not to provide resistance to noise. The benefit from optimizing the 
smoothing factor for noise resistance is small for large corpora.
5.4. Varying the Neighbourhood Size 
The AltaVista NEAR operator restricts search to a fixed neighbourhood of ten words, but 
we can vary the neighbourhood size with the TASA corpus, since we have a local copy of 
the corpus. Figure 6 shows accuracy as a function of the neighbourhood size, as we vary 
the size from 2 to 1000 words, using TASA and the GI lexicon. 
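To make the setup concrete, here is a minimal sketch of how such co-occurrence counts might be gathered when the corpus is stored locally, as TASA is, with the AltaVista NEAR operator replaced by an explicit window. The function, the window convention (k tokens on either side of the target), and the data structures are illustrative assumptions rather than the implementation used in the experiments.

import bisect
from collections import defaultdict

def neighbourhood_counts(tokens, targets, paradigm, k):
    # For each target word, count occurrences of paradigm words within a
    # window of k tokens on either side of the target.
    targets, paradigm = set(targets), set(paradigm)
    # Positions of paradigm words, in increasing order, so the number
    # falling inside any window can be found by binary search.
    positions = [i for i, t in enumerate(tokens) if t in paradigm]
    counts = defaultdict(int)
    for i, t in enumerate(tokens):
        if t in targets:
            lo = bisect.bisect_left(positions, i - k)
            hi = bisect.bisect_right(positions, i + k)
            counts[t] += hi - lo
    return counts

# Running this for a range of k (here, 2 to 1000) and feeding the counts
# into SO-PMI yields curves of the kind summarised in Figure 6.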
The advantage of a small neighbourhood is that words that occur closer to each other 
are more likely to be semantically related. The disadvantage is that, for any pair of words, 
there will usually be more occurrences of the pair within a large neighbourhood than 
within a small neighbourhood, so a larger neighbourhood will tend to have higher 
statistical reliability. An optimal neighbourhood size will balance these conflicting 
effects. A larger corpus should yield better statistical reliability than a smaller corpus, so 
the optimal neighbourhood size will be smaller with a larger corpus. The optimal 
neighbourhood size will also be determined by the frequency of the words in the test set. 
Rare words will favour a larger neighbourhood size than frequent words. 


Figure 6 shows that, for the TASA corpus and the GI lexicon, it seems best to have a 
neighbourhood size of at least 100 words. The TASA corpus is relatively small, so it is 
not surprising that a large neighbourhood size is best. The baseline neighbourhood size of 
10 words is clearly suboptimal for TASA. 
[Figure: accuracy (0 to 80%) plotted against neighbourhood size (1 to 1000 words).]
Figure 6. Effect of neighbourhood size with TASA and the GI lexicon.
