[Figure: accuracy plotted against the Laplace smoothing factor, with one curve per threshold (100%, 75%, 50%, 25%).]
Figure 3. Effect of Laplace smoothing factor with AV-ENG and the GI lexicon.
[Figure: accuracy (0 to 100%) plotted against the Laplace smoothing factor (0.0001 to 10000), with one curve per threshold (100%, 75%, 50%, 25%).]
Figure 4. Effect of Laplace smoothing factor with AV-CA and the GI lexicon.
Figure 5 plots the performance with varying smoothing factors using the smallest 
corpus, TASA. The performance is quite sensitive to the choice of smoothing factor. Our 
baseline value of 0.01 turns out to be a poor choice for the TASA corpus. The optimal 
value is about 0.001. This suggests that, when using SO-PMI with a small corpus, it 
would be wise to use cross-validation to optimize the value of the Laplace smoothing 
factor. 
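As a concrete illustration of that suggestion, the following sketch selects the smoothing factor by evaluating a grid of candidate values (the same range as in Figures 3 to 5) on held-out labelled words. The hit-count interface `hits`, the paradigm queries, and the word labels are hypothetical placeholders rather than the implementation used in the experiments; the smoothing constant is simply added to every hit count before the log ratio is taken.

import math

# Minimal sketch (not the paper's code) of tuning the Laplace smoothing
# factor by held-out validation. `hits(query, near=None)` is an assumed
# corpus interface returning hit counts for a query, optionally restricted
# to a neighbourhood of another query.

def so_pmi(word, pos_query, neg_query, hits, smoothing):
    # Laplace-smoothed log-odds estimate of semantic orientation: the
    # smoothing constant is added to every hit count, which keeps both the
    # numerator and the denominator away from zero.
    num = (hits(word, near=pos_query) + smoothing) * (hits(neg_query) + smoothing)
    den = (hits(word, near=neg_query) + smoothing) * (hits(pos_query) + smoothing)
    return math.log2(num / den)

def pick_smoothing(labelled_words, pos_query, neg_query, hits,
                   grid=(0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000)):
    # Evaluate each candidate on held-out labelled words (e.g. a fold of the
    # GI lexicon) and keep the most accurate one; k-fold cross-validation
    # would simply average this over several folds.
    best, best_acc = None, -1.0
    for s in grid:
        correct = sum(
            (so_pmi(w, pos_query, neg_query, hits, s) >= 0) == (label == "positive")
            for w, label in labelled_words
        )
        acc = correct / len(labelled_words)
        if acc > best_acc:
            best, best_acc = s, acc
    return best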


[Figure: accuracy (0 to 80%) plotted against the Laplace smoothing factor (0.0001 to 10000), with one curve per threshold (100%, 75%, 50%, 25%).]
Figure 5. Effect of Laplace smoothing factor with TASA and the GI lexicon.
These three figures show that the optimal smoothing factor increases as the size of the 
corpus increases, as expected. The figures also show that the impact of the smoothing 
factor decreases as the corpus size increases. There is less need for smoothing when a 
large quantity of data is available. The baseline smoothing factor of 0.01 was chosen to 
avoid division by zero, not to provide resistance to noise. The benefit from optimizing the 
smoothing factor for noise resistance is small for large corpora.
5.4. Varying the Neighbourhood Size 
The AltaVista NEAR operator restricts search to a fixed neighbourhood of ten words, but 
we can vary the neighbourhood size with the TASA corpus, since we have a local copy of 
the corpus. Figure 6 shows accuracy as a function of the neighbourhood size, as we vary 
the size from 2 to 1000 words, using TASA and the GI lexicon. 
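To make the setup concrete, here is a minimal sketch of how such co-occurrence counts might be gathered when the corpus is stored locally, as TASA is, with the AltaVista NEAR operator replaced by an explicit window. The function, the window convention (k tokens on either side of the target), and the data structures are illustrative assumptions rather than the implementation used in the experiments.

import bisect
from collections import defaultdict

def neighbourhood_counts(tokens, targets, paradigm, k):
    # For each target word, count occurrences of paradigm words within a
    # window of k tokens on either side of the target.
    targets, paradigm = set(targets), set(paradigm)
    # Positions of paradigm words, in increasing order, so the number
    # falling inside any window can be found by binary search.
    positions = [i for i, t in enumerate(tokens) if t in paradigm]
    counts = defaultdict(int)
    for i, t in enumerate(tokens):
        if t in targets:
            lo = bisect.bisect_left(positions, i - k)
            hi = bisect.bisect_right(positions, i + k)
            counts[t] += hi - lo
    return counts

# Running this for a range of k (here, 2 to 1000) and feeding the counts
# into SO-PMI yields curves of the kind summarised in Figure 6.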
The advantage of a small neighbourhood is that words that occur closer to each other 
are more likely to be semantically related. The disadvantage is that, for any pair of words, 
there will usually be more occurrences of the pair within a large neighbourhood than 
within a small neighbourhood, so a larger neighbourhood will tend to have higher 
statistical reliability. An optimal neighbourhood size will balance these conflicting 
effects. A larger corpus should yield better statistical reliability than a smaller corpus, so 
the optimal neighbourhood size will be smaller with a larger corpus. The optimal 
neighbourhood size will also be determined by the frequency of the words in the test set. 
Rare words will favour a larger neighbourhood size than frequent words. 


Figure 6 shows that, for the TASA corpus and the GI lexicon, it seems best to have a 
neighbourhood size of at least 100 words. The TASA corpus is relatively small, so it is 
not surprising that a large neighbourhood size is best. The baseline neighbourhood size of 
10 words is clearly suboptimal for TASA. 
[Figure: accuracy (0 to 80%) plotted against neighbourhood size (1 to 1000 words).]
Figure 6. Effect of neighbourhood size with TASA and the GI lexicon.
