The HM lexicon is the list of 1,336 labeled adjectives that was created by 
Hatzivassiloglou and McKeown [1997]. The GI lexicon is a list of 3,596 labeled words 
extracted from the General Inquirer lexicon [Stone et al. 1966]. The AV-ENG corpus is 
the set of English web pages indexed by the AltaVista search engine. The AV-CA corpus 
is the set of English web pages in the Canadian domain that are indexed by AltaVista. 
The TASA corpus is a set of short English documents gathered from a variety of sources 
by Touchstone Applied Science Associates. 
The HM lexicon consists of 1,336 adjectives, 657 positive and 679 negative 
[Hatzivassiloglou and McKeown 1997]. We described this lexicon earlier, in Sections 1 
and 4.1. We use the HM lexicon to allow comparison between the approach of 
Hatzivassiloglou and McKeown [1997] and the SO-A algorithms described here. 
Since the HM lexicon is limited to adjectives, most of the following experiments use 
a second lexicon, the GI lexicon, which consists of 3,596 adjectives, adverbs, nouns, and 
verbs, 1,614 positive and 1,982 negative [Stone et al. 1966]. The General Inquirer lexicon 
is available at http://www.wjh.harvard.edu/~inquirer/. The lexicon was developed by 
Philip Stone and his colleagues, beginning in the 1960’s, and continues to grow. It has 
been designed as a tool for content analysis, a technique used by social scientists, 
political scientists, and psychologists for objectively identifying specified characteristics 
of messages [Stone et al. 1966].
The full General Inquirer lexicon has 182 categories of word tags and 11,788 words. 
The words tagged “Positiv” (1,915 words) and “Negativ” (2,291 words) have 
(respectively) positive and negative semantic orientations. Table 3 lists some examples. 
Table 3. Examples of “Positiv” and “Negativ” words. 

Positiv                  Negativ
abide      absolve       abandon       abhor
ability    absorbent     abandonment   abject
able       absorption    abate         abnormal
abound     abundance     abdicate      abolish

Words with multiple senses may have multiple entries in the lexicon. The list of 3,596 
words (1,614 positive and 1,982 negative) used in the subsequent experiments was 
generated by reducing multiple-entry words to single entries. Some words with multiple 
senses were tagged as both “Positiv” and “Negativ”. For example, “mind” in the sense of 
“intellect” is positive, but “mind” in the sense of “beware” is negative. These ambiguous 
words were not included in our set of 3,596 words. We also excluded the fourteen 
paradigm words (good/bad, nice/nasty, etc.). 
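The reduction just described can be sketched as follows. This is a minimal illustration, not the procedure actually used: the input format (a list of word/tag pairs) and the function name reduce_lexicon are assumptions, and the set of paradigm words is left incomplete because only four of the fourteen are named here.

```python
from collections import defaultdict

# Only four of the fourteen paradigm words are named in the text
# ("good/bad, nice/nasty, etc."), so this set is deliberately incomplete.
PARADIGM_WORDS = {"good", "bad", "nice", "nasty"}


def reduce_lexicon(entries, excluded=PARADIGM_WORDS):
    """Collapse (word, tag) entries to a single label per word.

    entries: iterable of (word, tag) pairs, tag being "Positiv" or "Negativ"
             (a hypothetical input format chosen for this sketch).
    Returns a dict mapping word -> "positive" or "negative".
    """
    tags_by_word = defaultdict(set)
    for word, tag in entries:
        tags_by_word[word.lower()].add(tag)

    lexicon = {}
    for word, tags in tags_by_word.items():
        if word in excluded:
            continue  # paradigm words are left out of the test set
        if len(tags) != 1:
            continue  # tagged both Positiv and Negativ: ambiguous, dropped
        lexicon[word] = "positive" if tags == {"Positiv"} else "negative"
    return lexicon


sample = [
    ("abide", "Positiv"),
    ("abide", "Positiv"),   # second sense, same orientation
    ("mind", "Positiv"),    # "intellect"
    ("mind", "Negativ"),    # "beware"
    ("abhor", "Negativ"),
    ("good", "Positiv"),    # paradigm word
]
reduced = reduce_lexicon(sample)
print(reduced)
```

On this toy input, "abide" collapses to one positive entry, while "mind" and "good" are dropped.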
Of the words in the HM lexicon, 47.7% also appear in the GI lexicon (324 positive, 
313 negative). The agreement between the two lexicons on the orientation of these shared 
words is 98.3% (6 terms are positive in HM but negative in GI; 5 terms are negative in 
HM but positive in GI). 
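The two percentages follow directly from the counts given above, as the short check below illustrates (the variable names are ours).

```python
hm_size = 1336                       # adjectives in the HM lexicon
shared_positive, shared_negative = 324, 313
disagreements = 6 + 5                # opposite labels in HM and GI

shared = shared_positive + shared_negative
overlap_pct = 100 * shared / hm_size
agreement_pct = 100 * (shared - disagreements) / shared

print(f"{shared} shared words, {overlap_pct:.1f}% of HM, "
      f"{agreement_pct:.1f}% agreement")
# prints "637 shared words, 47.7% of HM, 98.3% agreement"
```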
The AltaVista search engine is available at http://www.altavista.com/. Based on 
estimates in the popular press and our own tests with various queries, we estimate that the 
AltaVista index contained approximately 350 million English web pages at the time our 
experiments were carried out. This corresponds to roughly one hundred billion words. 
We call this the AV-ENG corpus. The set of web pages indexed by AltaVista is 
constantly changing, but there is enough stability that our experiments were reliably 
repeatable over the course of several months. 
In order to examine the effect of corpus size on learning, we used AV-CA, a subset of 
the AV-ENG corpus. The AV-CA corpus was produced by adding “AND host:.ca” to 
every query to AltaVista, which restricts the search results to the web pages with “ca” in 
the host domain name. This consists mainly of hosts that end in “ca” (the Canadian 
domain), but it also includes a few hosts with “ca” in other parts of the domain name 
(such as “http://www.ca.com/”). The AV-CA corpus contains approximately 7 million 
web pages (roughly two billion words), about 2% of the size of the AV-ENG corpus. 
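As a rough illustration of the restriction, the sketch below appends the clause quoted above to a query and applies one possible reading of the host filter. The function names are hypothetical, and the matching rule (some dot-separated part of the host name equals “ca”) is only an assumption that is consistent with the example given; the exact behaviour of AltaVista’s host: operator is not specified here.

```python
def restrict_to_ca(query: str) -> str:
    """Append the host restriction quoted in the text to a query string."""
    return f"{query} AND host:.ca"


def host_has_ca_label(host: str) -> bool:
    """Assumed rule: some dot-separated label of the host name equals 'ca'."""
    return "ca" in host.lower().split(".")


print(restrict_to_ca("excellent"))
for host in ["www.example.ca", "www.ca.com", "www.example.com"]:
    print(host, host_has_ca_label(host))
```

Under this assumed rule, a host ending in “.ca” and the host “www.ca.com” both match, which is the behaviour the paragraph describes.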
Our experiments with SO-LSA are based on the online demonstration of LSA, 
available at http://lsa.colorado.edu/. This demonstration allows a choice of several 
different corpora. We chose the largest corpus, the TASA-ALL corpus, which we call 
simply TASA. In the online LSA demonstration, TASA is called the “General Reading 
up to 1st year college (300 factors)” topic space. The corpus contains a wide variety of 
short documents, taken from novels, newspaper articles, and other sources. It was 
collected by Touchstone Applied Science Associates, to develop The Educator’s Word 
Frequency Guide. The TASA corpus contains approximately 10 million words, about 
0.5% of the size of the AV-CA corpus.
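The relative corpus sizes quoted in this section can be checked with a few lines of arithmetic. The figures are the approximate ones stated in the text; the dictionary names are ours.

```python
# Approximate sizes as stated in the text.
corpus_words = {
    "AV-ENG": 100_000_000_000,   # roughly one hundred billion words
    "AV-CA": 2_000_000_000,      # roughly two billion words
    "TASA": 10_000_000,          # approximately 10 million words
}
corpus_pages = {"AV-ENG": 350_000_000, "AV-CA": 7_000_000}

av_ca_vs_av_eng = 100 * corpus_words["AV-CA"] / corpus_words["AV-ENG"]
tasa_vs_av_ca = 100 * corpus_words["TASA"] / corpus_words["AV-CA"]
words_per_page = {
    name: corpus_words[name] / pages for name, pages in corpus_pages.items()
}

print(f"AV-CA is {av_ca_vs_av_eng:.1f}% of AV-ENG")
print(f"TASA is {tasa_vs_av_ca:.1f}% of AV-CA")
print({name: round(value) for name, value in words_per_page.items()})
```

Both AltaVista estimates imply roughly the same average page length (a little under 300 words per page), so the page and word counts are mutually consistent.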
The TASA corpus is not indexed by AltaVista. For SO-PMI, the following 
experimental results were generated by emulating AltaVista on a local copy of the TASA 
corpus. We used a simple Perl script to calculate the hits() function for TASA, as a 
surrogate for sending queries to AltaVista. 
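The Perl script itself is not reproduced here. As a rough idea of what such a surrogate might look like, the following Python sketch counts matching documents in a local corpus. The tokenizer, the function signature, and the proximity rule (two words within a fixed window of each other, in either order, with a default window of ten words) are assumptions made for this illustration, not details taken from this section.

```python
import re


def tokenize(text):
    """Lowercase a document and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def hits(documents, word, near=None, window=10):
    """Count the documents that match a query.

    With only `word`, a document matches if it contains the word.
    With `near`, it matches only if `word` and `near` occur within
    `window` tokens of each other, in either order (assumed rule).
    """
    count = 0
    for doc in documents:
        tokens = tokenize(doc)
        positions = [i for i, tok in enumerate(tokens) if tok == word]
        if not positions:
            continue
        if near is None:
            count += 1
            continue
        near_positions = [i for i, tok in enumerate(tokens) if tok == near]
        if any(i != j and abs(i - j) <= window
               for i in positions for j in near_positions):
            count += 1
    return count


docs = [
    "The service was excellent and the staff were friendly.",
    "A poor result, though the idea itself was excellent.",
    "Nothing notable happened today.",
]
print(hits(docs, "excellent"))               # 2
print(hits(docs, "excellent", near="poor"))  # 1
```

Counting documents rather than occurrences mirrors what a search engine reports as a hit count; a corpus the size of TASA is small enough that a linear scan of this kind is practical.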
5.2. SO-PMI Baseline 
Table 4 shows the accuracy of SO-PMI in its baseline configuration, as described in 
Section 3.1. These results are for all three corpora, tested with the HM lexicon. In this 
table, the strength (absolute value) of the semantic orientation was used as a measure of 
confidence that the word will be correctly classified. Test words were sorted in 
descending order of the absolute value of their semantic orientation and the top ranked 
words (the highest confidence words) were then classified. For example, the second row 
in Table 4 shows the accuracy when the top 75% (with highest confidence) were 
classified and the last 25% (with lowest confidence) were ignored.
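The ranking procedure can be sketched as below. The function name and input format are hypothetical, and treating a semantic orientation of exactly zero as negative is an assumption of this sketch.

```python
def accuracy_at_coverage(scored_words, fraction):
    """Accuracy over the most confident `fraction` of the test words.

    scored_words: list of (so_value, true_label) pairs, with true_label
                  +1 for positive and -1 for negative words.
    fraction:     share of the test set to classify, e.g. 0.75.
    """
    # Confidence is the absolute value of the semantic orientation.
    ranked = sorted(scored_words, key=lambda item: abs(item[0]), reverse=True)
    keep = round(fraction * len(ranked))
    if keep == 0:
        raise ValueError("fraction selects no test words")
    kept = ranked[:keep]
    # A word is classified positive when its orientation is above zero;
    # a value of exactly zero counts as negative (assumption).
    correct = sum(1 for so, label in kept if (so > 0) == (label > 0))
    return correct / keep


sample = [(4.2, +1), (-3.1, -1), (2.0, -1), (-0.4, +1)]
for fraction in (1.0, 0.75, 0.5):
    print(fraction, accuracy_at_coverage(sample, fraction))
```

On this toy sample the two most confident words are both classified correctly, and accuracy falls as the less confident words are added. With the 1,336 HM adjectives, the fractions 75%, 50%, and 25% select 1,002, 668, and 334 words.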
Table 4. The accuracy of SO-PMI with the HM lexicon and the three corpora. 

Percent of full    Size of      Accuracy with    Accuracy with    Accuracy with
test set           test set     AV-ENG           AV-CA            TASA
100%               1336         87.13%           80.31%           61.83%
75%                1002         94.41%           85.93%           64.17%
50%                668          97.60%           91.32%           46.56%
25%                334          98.20%           92.81%           70.96%
Approx. num. of words in corpus 1 × 10^11        2 × 10^9         1 × 10^7

The performance of SO-PMI in Table 4 can be compared to the performance of the 
HM algorithm in Table 2 (Section 4.1), since both use the HM lexicon, but there are 
some differences in the evaluation, since the HM algorithm is supervised but SO-PMI is 
unsupervised. Because the HM algorithm is supervised, part of the HM lexicon must be 
set aside for training, so the algorithm cannot be evaluated on the whole lexicon. Aside 
from this caveat, it appears that the performance of the HM algorithm is roughly 
comparable to the performance of SO-PMI with the AV-CA corpus, which is about one 
hundred times larger than the corpus used by Hatzivassiloglou and McKeown [1997] 
(2 × 10^9 words versus 2 × 10^7 words). This suggests that the HM algorithm makes more 
efficient use of corpora than SO-PMI, but the advantage of SO-PMI is that it can easily 
be scaled up to very large corpora, where it can achieve significantly higher accuracy. 
The results of these experiments are shown in more detail in Figure 1. The percentage 
of the full test set (labeled threshold in the figure) varies from 5% to 100% in increments 
of 5%. Three curves are plotted, one for each of the three corpora. The figure shows that 
a smaller corpus not only results in lower accuracy, but also results in less stability. With 
the larger corpora, the curves are relatively smooth; with the smallest corpus, the curve 
looks quite noisy.
[Figure 1: plot not reproduced. Accuracy (%) versus threshold (% of the full test set), 
one curve per corpus; both axes run from 0 to 100.]
