"""
Read lines from the Prepositional Phrase Attachment Corpus.

The PP Attachment Corpus contains several files having the format:

    sentence_id verb noun1 preposition noun2 attachment

For example:

    42960 gives authority to administration V
    46742 gives inventors of microchip N

The PP attachment is to the verb phrase (V) or noun phrase (N), i.e.:

    (VP gives (NP authority) (PP to administration))
    (VP gives (NP inventors (PP of microchip)))

The corpus contains the following files:

    training:   training set
    devset:     development test set, used for algorithm development.
    test:       test set, used to report results
    bitstrings: word classes derived from Mutual Information Clustering
                for the Wall Street Journal.

Ratnaparkhi, Adwait (1994). A Maximum Entropy Model for Prepositional
Phrase Attachment. Proceedings of the ARPA Human Language Technology
Conference. [http://www.cis.upenn.edu/~adwait/papers/hlt94.ps]

The PP Attachment Corpus is distributed with NLTK with the permission
of the author.
"""

from nltk_lite.corpora import get_basedir
from nltk_lite import tokenize
from nltk_lite.tag import string2tags, string2words
import os

items = ['training', 'devset', 'test']

item_name = {
    'training': 'training set',
    'devset': 'development test set',
    'test': 'test set'
}


    for t in raw(files):
        yield {
            'sent': t[0],
            'verb': t[1],
            'noun1': t[2],
            'prep': t[3],
            'noun2': t[4],
            'attachment': t[5]
        }

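# A minimal sketch, not part of the original reader: FIELDS, parse_line and
# bracketing are hypothetical names introduced only for illustration. They
# show how one corpus line maps onto the dictionary fields above, and how
# the V/N attachment label selects between the two bracketings given in the
# module docstring.
FIELDS = ('sent', 'verb', 'noun1', 'prep', 'noun2', 'attachment')

def parse_line(line):
    """Split one whitespace-delimited corpus line into a field dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError('expected %d fields, got %d'
                         % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))

def bracketing(record):
    """Render the verb phrase implied by the record's attachment label."""
    pp = '(PP %(prep)s %(noun2)s)' % record
    if record['attachment'] == 'V':
        # The PP modifies the verb, so it is a sibling of the object NP.
        return '(VP %s (NP %s) %s)' % (record['verb'], record['noun1'], pp)
    if record['attachment'] == 'N':
        # The PP modifies the noun, so it sits inside the object NP.
        return '(VP %s (NP %s %s))' % (record['verb'], record['noun1'], pp)
    raise ValueError('unknown attachment label: %r' % record['attachment'])

# Usage, with the first example line from the docstring:
#   bracketing(parse_line('42960 gives authority to administration V'))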

if __name__ == '__main__':
    demo()