forked from denizyuret/nlpcourse
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathChangeLog
173 lines (134 loc) · 7.25 KB
/
ChangeLog
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
2014-04-24 <[email protected]>
* generation: A good way to approach the sentence to lambda
expression mapping problem is to come up with a generative model
of lambda expressions first (i.e. a probabilistic grammar, a
probability distribution). Analogous to building an LM for target
language in MT. Once you have a machine that knows how to
generate good target expressions, then the input can be seen as
something that biases that machine slightly one way or the other
to guide it to a specific output. This should be easier with a
well tuned generator than a bad one, i.e. much less information
(in number of bits) need to be added to the former to get the
right answer. (Of course this can be used as an argument for all
generative models and discriminative modelers would probably
disagree.) In any case instead of wrongly calling the first
lectures "mapping strings to probabilities", devote the first part
of the course to models of generating strings, trees, graphs
etc. which would be grammars (cfg, regexp), probabilistic grammars
(ngram, hmm) etc.
On the geo dataset, try to come up with a probability distribution
that maximizes the likelihood of unseen lambda expressions.
2014-04-13 Deniz Yuret <[email protected]>
* Zettlemoyer-Collins-2005: https://bitbucket.org/yoavartzi: found
code and data for geoquery, jobs, navigation etc experiments of
zettlemoyer et al. Is it worth the effort to run / reimplement a
string to string learning system? Need a PCCG parser, need
GENLEX, parse typed lambda calculus, implement stochastic gradient
descent. Probably a week to replicate. What language?
Work in this area seems to learn various sub-paths of the graph:
Sentence -> Lexicon -> Grammar -> LogicalForm -> Answer
In particular ZC05 has sentence-logicalform pairs and infers
lexicon and grammar (which is practically the same thing in CCG).
Two pieces of data in ZC05 are Geo880 and Jobs640. Geo880 exists
in download: geoquery, wasp, spf has different formats. SPF has
the format (typed lambda calculus) used in ZC05 under
geoquery/experiments/data. Jobs640 is in mooney in a different
query format.
Probably need Clark and Curran 2003 for PCCG learning, Steedman
1996 and 2000 for more on CCG. Maybe a fast C implementation of
PCCG. First look at future work to avoid duplicating stupid
ideas. But most importantly this is not the right type of
supervision and NLIDB is not an interesting problem... But think
of the challenges for different word types and syntax. Why are
some things in the initial lexicon? Can they not be learnt? How
many different types of CCG categories are there really? Do we
have the correct CCG parses and gold categories for these
sentences? We'll have to deal with each word type even when
dealing with the right supervision...
2014-02-20 Deniz Yuret <[email protected]>
* nlp.pdf: Add Siskind's Abigail system, Deb Roy and Luc Steels
systems.
2014-01-19 Deniz Yuret <[email protected]>
* Questions:
Can we compute P(A) or P(AB) for words A, B directly looking at
the LM? (AB means A and B appear in the same context, not side by
side)
Why is one algorithm n^3 the other n^5 for dep parsing?
what kind of pgm are hmm vs crf, pcfg vs crf etc? directional vs
undirectional, why conditional vs joint? isnt it what is given
and what is asked? same relation as nbayes vs logreg? why cant
we have arbitrary features in joint models, if they are eq to
joint maxent? just interp problem?
2013-08-23 Deniz Yuret <[email protected]>
* loss: Comparing loss for nbayes, logistic, perceptron, svm, 0-1
loss. Probably assume two class? What is the x and y axes for
the loss graphs? Start discriminative lecture by asking what does
nbayes and loglinear optimize?
* liblinear: includes both logistic regression and svm. The paper
has a nice description of both. Multiclass says one-vs-the-rest.
Standard references (like Daume's http://ciml.info/) present
perceptron etc. for two classes. Should look into generalizing
from two class to multiclass for perceptron, logistic, svm etc.
They also have standard datasets for many problems at (rcv1,
news20): http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
Got 83.15 with svm and 83.65 with logistic 5-fold xval after
optimizing c. (c must have a different meaning than our
loglinear).
Ignoring counts (representing each word as 0/1) gives 86.25 and 86
with logistic and svm! Should try the same with perceptron and
nbayes...
* imdb-srilm: classifying with a trigram kneser ney language model
gave 0.8375 and 0.8025 for fold 0 and 1. Is this significant?
Make significance testing part of the first assignment.
* todo: write averaged perceptron (done, but did not make much
difference), reimplement loglinear in C (no need?), implement SVM
with optimizer (no need?) and/or use libsvm, liblinear etc. (done)
implement srilm classifier (done).
Two questions for the discriminative lecture: two-class vs
multi-class expressions; comparison of loss and regularization for
perceptron, loglinear, svm.
http://en.wikipedia.org/wiki/Logistic_regression#As_a_.22log-linear.22_model
Explains two-class vs multi-class.
Shows equivalence between loglinear and logistic, also between
logistic as single layer perceptron (with squashed s activation)!
From: http://inst.eecs.berkeley.edu/~cs188/sp12/slides/cs188%20lecture%2021%20--%20perceptron%206PP.pdf
Binary = multiclass where the negative class has weight zero
Multiclass: has different weight vectors for each class (including
bias) ... alternative representation.
A quora page comparing svm implementations:
http://www.quora.com/Support-Vector-Machines/What-SVM-package-is-best-for-large-scale-linear-classification
* ciml: Check out Daume's book for perceptron notes; shuffling
helps, averaging can be done cheaply:
http://ciml.info/dl/v0_8/ciml-v0_8-ch03.pdf
Add Domingos1997 and McCallum1998 to the nbayes reading list.
* experiments/imdb-perceptron/perceptron: How about this as the
first project: start with SLM, teach SRILM, after the generative
document classification lecture give homework comparing
unigram-nbayes with srilm-nbayes. SVM and logistic give better
results using the 0/1 feature representation! Change ngram order,
smoothing, vocab size for srilm? Here is some data on ngram
order (need stderr):
ngram accuracy
1 .8025
2 .8350
3 .8375
4 .8450
5 .8450
In the project they must have stderr, they must try all obvious
variations (e.g. regularization constant, vocab size, ngram order
etc.) Maybe feature engineering and feature selection.
2013-08-16 Deniz Yuret <[email protected]>
* topics: do we cover statistical significance?
definitely ignore summarization and generation?
gotta research semantic parsing and related learning (ccg,
language and vision, etc.)
morphology: disamb is a tagging problem, how about analysis?
morphochallenge? xavier's (or stolcke's) fst induction?
2013-08-15 Deniz Yuret <[email protected]>
* url: need url field in acl.bst. Check out
http://nodonn.tipido.net/bibstyle.php
Or use \usepackage{url} or \usepackage{hyperref}, then insert a
\url{http://...} entry inside a note or howpublished (for misc).
* annote: need annotation field in acl.bst. Check out
annotation.bst, plain-annote.bst, chicagoa, chicago-annote,
apacann etc. Done.