Pwords = a set of words with positive semantic orientation 
Nwords = a set of words with negative semantic orientation 
A(word1, word2) = a measure of association between word1 and word2

(4)   $\text{SO-A}(word) = \sum_{pword \in Pwords} \text{A}(word, pword) \;-\; \sum_{nword \in Nwords} \text{A}(word, nword)$
Pwords = {good, nice, excellent, positive, fortunate, correct, superior}
Nwords = {bad, nasty, poor, negative, unfortunate, wrong, inferior}.
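As a concrete sketch (not code from the paper), equation (4) can be written in a few lines of Python, leaving the association measure A as a parameter so that PMI or LSA can be plugged in later:

    # Minimal sketch of equation (4); `assoc` stands in for the association
    # measure A (e.g., PMI in Section 3.1 or LSA in Section 3.2).
    PWORDS = ["good", "nice", "excellent", "positive", "fortunate", "correct", "superior"]
    NWORDS = ["bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"]

    def so_a(word, assoc):
        """Association of `word` with the positive paradigm words minus
        its association with the negative paradigm words."""
        return (sum(assoc(word, pword) for pword in PWORDS)
                - sum(assoc(word, nword) for nword in NWORDS))

A word is then treated as having a positive semantic orientation when so_a(word, assoc) is positive and a negative orientation when it is negative.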


strategy. This paper examines SO-PMI (Semantic Orientation from Pointwise Mutual 
Information) and SO-LSA (Semantic Orientation from Latent Semantic Analysis). 
3.1. Semantic Orientation from PMI 
PMI-IR [Turney 2001] uses Pointwise Mutual Information (PMI) to calculate the strength 
of the semantic association between words [Church and Hanks 1989]. Word co-occurrence statistics are obtained using Information Retrieval (IR). PMI-IR has been
empirically evaluated using 80 synonym test questions from the Test of English as a 
Foreign Language (TOEFL), obtaining a score of 74% [Turney 2001], comparable to that 
produced by direct thesaurus search [Littman 2001].
The Pointwise Mutual Information (PMI) between two words, word1 and word2, is defined as follows [Church and Hanks 1989]:
(7)   $\text{PMI}(word_1, word_2) = \log_2\!\left(\dfrac{p(word_1\ \&\ word_2)}{p(word_1)\,p(word_2)}\right)$
Here, p(word1 & word2) is the probability that word1 and word2 co-occur. If the words are statistically independent, the probability that they co-occur is given by the product p(word1) p(word2). The ratio between p(word1 & word2) and p(word1) p(word2) is a measure of the degree of statistical dependence between the words. The log of the ratio corresponds to a form of correlation, which is positive when the words tend to co-occur and negative when the presence of one word makes it likely that the other word is absent.
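For instance, with purely illustrative probabilities (not estimates from any corpus): if p(word1) = 0.01, p(word2) = 0.02, and p(word1 & word2) = 0.0008, the words co-occur four times more often than independence would predict, so PMI(word1, word2) = log2(0.0008 / 0.0002) = 2; if instead p(word1 & word2) = 0.00005, the ratio is 0.25 and the PMI is -2.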
PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) 
and noting the number of hits (matching documents). The following experiments use the 
AltaVista Advanced Search engine (http://www.altavista.com/sites/search/adv), which indexes approximately 350 million web pages
(counting only those pages that are in English). Given a (conservative) estimate of 300 
words per web page, this represents a corpus of at least one hundred billion words. 
AltaVista was chosen over other search engines because it has a NEAR operator. The 
AltaVista NEAR operator constrains the search to documents that contain the words 
within ten words of one another, in either order. Previous work has shown that NEAR 
performs better than AND when measuring the strength of semantic association between 
words [Turney 2001]. We experimentally compare NEAR and AND in Section 5.4.
SO-PMI is an instance of SO-A. From equation (4), we have: 
(8)   $\text{SO-PMI}(word) = \sum_{pword \in Pwords} \text{PMI}(word, pword) \;-\; \sum_{nword \in Nwords} \text{PMI}(word, nword)$
Let hits(query) be the number of hits returned by the search engine, given the query, query. We calculate PMI(word1, word2) from equation (7) as follows:
(9)   $\text{PMI}(word_1, word_2) = \log_2\!\left(\dfrac{\tfrac{1}{N}\,\text{hits}(word_1\ \text{NEAR}\ word_2)}{\tfrac{1}{N}\,\text{hits}(word_1)\cdot\tfrac{1}{N}\,\text{hits}(word_2)}\right)$
Here, N is the total number of documents indexed by the search engine. Combining 
equations (8) and (9), we have: 
(10)   $\text{SO-PMI}(word) = \log_2\!\left(\dfrac{\prod_{pword \in Pwords}\text{hits}(word\ \text{NEAR}\ pword)\;\cdot\;\prod_{nword \in Nwords}\text{hits}(nword)}{\prod_{pword \in Pwords}\text{hits}(pword)\;\cdot\;\prod_{nword \in Nwords}\text{hits}(word\ \text{NEAR}\ nword)}\right)$
Note that N, the total number of documents, drops out of the final equation. Equation (10) 
is a log-odds ratio [Agresti 1996]. 
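To see why N drops out, note that each PMI term in equation (8), when estimated by equation (9), simplifies to $\log_2\!\left(\dfrac{N\cdot\text{hits}(word\ \text{NEAR}\ pword)}{\text{hits}(word)\,\text{hits}(pword)}\right)$, and likewise for each nword. Because Pwords and Nwords contain the same number of words (seven each), the factors of N and of hits(word) appear equally often in the positive and negative sums, so they cancel when the sums are subtracted, leaving equation (10).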
Calculating the semantic orientation of a word via equation (10) requires twenty-eight 
queries to AltaVista (assuming there are fourteen paradigm words). Since the two 
products in (10) that do not contain word are constant for all words, they only need to be 
calculated once. Ignoring these two constant products, the experiments required only 
fourteen queries per word.
To avoid division by zero, 0.01 was added to the number of hits. This is a form of 
Laplace smoothing. We examine the effect of varying this parameter in Section 5.3. 
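As a minimal sketch of how equation (10) and the 0.01 smoothing fit together, the following Python assumes a hits(query) function that returns the number of matching documents for a query; that function, and the plain-string form of the NEAR queries, are placeholders for illustration, not part of the paper's implementation.

    import math

    PWORDS = ["good", "nice", "excellent", "positive", "fortunate", "correct", "superior"]
    NWORDS = ["bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"]
    SMOOTH = 0.01  # added to every hit count to avoid division by zero

    def so_pmi(word, hits):
        """Equation (10). `hits(query)` must return the number of
        documents matching `query`."""
        near_p = math.prod(hits(f"{word} NEAR {p}") + SMOOTH for p in PWORDS)
        near_n = math.prod(hits(f"{word} NEAR {n}") + SMOOTH for n in NWORDS)
        # These two products do not depend on `word`; in practice they are
        # computed once and cached, so each new word costs fourteen queries.
        alone_p = math.prod(hits(p) + SMOOTH for p in PWORDS)
        alone_n = math.prod(hits(n) + SMOOTH for n in NWORDS)
        return math.log2((near_p * alone_n) / (alone_p * near_n))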
Pointwise Mutual Information is only one of many possible measures of word 
association. Several others are surveyed in Manning and Schütze [1999]. Dunning [1993] 
suggests the use of likelihood ratios as an improvement over PMI. To calculate likelihood 
ratios for the association of two words, X and Y, we need to know four numbers:
(11)   k(X Y) = the frequency that X occurs within a given neighbourhood of Y
(12)   k(~X Y) = the frequency that Y occurs in a neighbourhood without X
(13)   k(X ~Y) = the frequency that X occurs in a neighbourhood without Y
(14)   k(~X ~Y) = the frequency that neither X nor Y occur in a neighbourhood.
If the neighbourhood size is ten words, then we can use hits(X NEAR Y) to estimate 
k(X Y) and hits(X) – hits(X NEAR Y) to estimate k(X ~Y), but note that these are only 
rough estimates, since hits(X NEAR Y) is the number of documents that contain X near Y,
not the number of neighbourhoods that contain X and Y. Some preliminary experiments 
suggest that this distinction is important, since alternatives to PMI (such as likelihood 
ratios [Dunning 1993] and the Z-score [Smadja 1993]) appear to perform worse than PMI 
when used with search engine hit counts. 
However, if we do not restrict our attention to measures of word association that are 
compatible with search engine hit counts, there are many possibilities. In the next 
subsection, we look at one of them, Latent Semantic Analysis.
3.2. Semantic Orientation from LSA 
SO-LSA applies Latent Semantic Analysis (LSA) to calculate the strength of the 
semantic association between words [Landauer and Dumais 1997]. LSA uses the Singular 
Value Decomposition (SVD) to analyze the statistical relationships among words in a 
corpus.
The first step is to use the text to construct a matrix X, in which the row vectors 
represent words and the column vectors represent chunks of text (e.g., sentences, 
paragraphs, documents). Each cell represents the weight of the corresponding word in the 
corresponding chunk of text. The weight is typically the tf-idf score (Term Frequency 
times Inverse Document Frequency) for the word in the chunk. (tf-idf is a standard tool in 
information retrieval [van Rijsbergen 1979].)
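As a small sketch of this first step (the tokenization, chunking, and exact tf-idf variant here are assumptions, not the paper's configuration), the word-by-chunk matrix X can be assembled as follows:

    import math
    from collections import Counter

    def build_matrix(chunks):
        """Build the word-by-chunk matrix X: rows are words, columns are
        chunks of text, cells are tf-idf weights. `chunks` is a list of
        token lists (e.g., tokenized paragraphs or documents)."""
        vocab = sorted({w for chunk in chunks for w in chunk})
        df = Counter(w for chunk in chunks for w in set(chunk))  # document frequency
        n = len(chunks)
        x = []
        for w in vocab:
            idf = math.log(n / df[w])
            x.append([chunk.count(w) * idf for chunk in chunks])
        return vocab, x  # x[i][j] = weight of vocab[i] in chunks[j]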
The next step is to apply singular value decomposition [Golub and Van Loan 1996] to 
