BEGIN:VCALENDAR
CALSCALE:GREGORIAN
PRODID:-//NL//Seminar Calendar//EN
VERSION:2.0
X-WR-CALNAME:NL
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
X-LIC-LOCATION:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DESCRIPTION: Abstract: People from different cultures and backgrounds tend to make different decisions when faced with the same set of choices. Cultural background influences people's decisions in social interactions. Computational agents that are intended to simulate human behavior or engage in interpersonal interactions with humans, such as negotiation, need decision-making models that are sensitive to culture. In this talk, we show how agents can learn to behave like people from specific cultures in the context of a negotiation game. Bio: Elnaz Nouri is a PhD student in the Natural Language group at USC's Institute for Creative Technologies (ICT).
DTEND;TZID=America/Los_Angeles:20140606T160000
DTSTART;TZID=America/Los_Angeles:20140606T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:Cultural Negotiating Agents
UID:20140606T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140711T160000
DTSTART;TZID=America/Los_Angeles:20140711T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:TBD
UID:20140711T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140620T160000
DTSTART;TZID=America/Los_Angeles:20140620T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:TBD
UID:20140620T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140703T160000
DTSTART;TZID=America/Los_Angeles:20140703T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:[Intern talk] TBD
UID:20140703T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140725T160000
DTSTART;TZID=America/Los_Angeles:20140725T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:TBD
UID:20140725T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Abstract: In NLP, we rely on annotated data to train models. This implicitly assumes that the annotations represent the truth. However, this basic assumption can be violated in two ways: either because the annotators exhibit a certain bias (consciously or subconsciously), or because there simply is not one single truth. In this talk, I will present approaches to deal with both problems. In the case of biased annotators, we can collect multiple annotations and use an unsupervised item-response model to infer the underlying truth and the reliability of the individual annotators. We present a software package, MACE (Multi-Annotator Competence Estimation), with considerable improvements over standard baselines both in terms of predicted label accuracy and estimates of trustworthiness, even under adversarial conditions. Additionally, we can trade precision for recall, achieving even higher performance by focusing on the instances our model is most confident in. In the second case, where not a single truth exists, we can collect information about easily confused categories and incorporate this knowledge into the training process. We use small samples of doubly annotated POS data for Twitter to estimate annotation reliability and show how those metrics of likely inter-annotator agreement can be implemented in the loss functions of a structured perceptron. We find that these cost-sensitive algorithms perform better across annotation projects and, more surprisingly, even on data annotated according to the same guidelines. Finally, we show that these models perform better on the downstream task of chunking. Bio: Dirk Hovy is a postdoc in the Center for Language Technology at the University of Copenhagen, working with Anders Søgaard on improving analysis of low-resource languages. Their recent paper on POS tagging with inter-annotator agreement won the best paper award at EACL 2014. Dirk received his PhD from the University of Southern California (USC), where he was working at the Information Sciences Institute (ISI) on unsupervised relation extraction. He has a background in socio-linguistics and worked on unsupervised and semi-supervised models for relation extraction, temporal links, and WSD, as well as annotator assessment. He is interested in the "human" aspects of NLP, i.e., the individual bias people have when producing or annotating language, and how it affects NLP applications. His other interests include cooking, CrossFit, and medieval art and literature.
DTEND;TZID=America/Los_Angeles:20140616T163000
DTSTART;TZID=America/Los_Angeles:20140616T153000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:Two ways to deal with annotation bias
UID:20140616T153000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140630T160000
DTSTART;TZID=America/Los_Angeles:20140630T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:[Intern talk] TBD
UID:20140630T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140808T160000
DTSTART;TZID=America/Los_Angeles:20140808T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:[Intern talk] TBD
UID:20140808T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140731T120000
DTSTART;TZID=America/Los_Angeles:20140731T110000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:TBD
UID:20140731T110000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140815T160000
DTSTART;TZID=America/Los_Angeles:20140815T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:[Intern final talk] TBD
UID:20140815T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Performing machine translation with monolingual data instead of parallel data is an interesting problem. Because of the lack of parallel data for many language pairs, solving the problem would enable interesting new use cases. On the road towards this, we look at similar but easier problems. In the past, improvements on simple substitution ciphers (1:1) were made; even word substitution ciphers with large vocabularies were solved, for example by a beam-search approach. This talk concentrates on the more complicated cipher class of homophonic substitution ciphers (1:m), like the famous Z408 of the Zodiac killer or the second page of the Beale cipher. We present a method based on beam search. Covered aspects are an improved heuristic, the order in which the beam search should explore the search space, pruning, and the impact of the cipher length and cipher alphabet size on the deciphering accuracy. Bio: Julian Schamper studies computer science at RWTH Aachen University. He did his bachelor thesis in the field of deciphering foreign languages and works as a student research assistant in Prof. Hermann Ney's Human Language Technology and Pattern Recognition Group.
DTEND;TZID=America/Los_Angeles:20140611T160000
DTSTART;TZID=America/Los_Angeles:20140611T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:Solving Homophonic Substitution Ciphers [Intern talk]
UID:20140611T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140822T160000
DTSTART;TZID=America/Los_Angeles:20140822T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:[Intern final talk] TBD
UID:20140822T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140829T160000
DTSTART;TZID=America/Los_Angeles:20140829T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:[Intern final talk] TBD
UID:20140829T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140905T160000
DTSTART;TZID=America/Los_Angeles:20140905T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:[Intern final talk] TBD
UID:20140905T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBD
DTEND;TZID=America/Los_Angeles:20140926T160000
DTSTART;TZID=America/Los_Angeles:20140926T150000
LOCATION:11th Floor Large Conference Room [1135]
SUMMARY:Semantic Parsing at Google
UID:20140926T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, I will present my current work on language understanding in the project Mission Rehearsal Exercise (MRE). One of the challenges in a dialogue system is to provide a robust understanding/parsing component. We applied both a Finite State Model and a Statistical Learning Model to the parsing of separate sentences of dialogue utterances. Their performances are evaluated and compared on a new blind set. We hope to combine them to produce a better solution for this specific application.
DTEND;TZID=America/Los_Angeles:20030404T160000
DTSTART;TZID=America/Los_Angeles:20030404T150000
LOCATION:11 Large
SUMMARY:Natural Language Understanding in MRE
UID:20030404T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Traditional statistical MT systems mostly work on the word and phrase level. For different language pairs, the performance of such systems varies from some 15% to 35%. These systems suffer from problems such as sparse data, with huge vocabulary sizes leading to less reliable probability estimates. In our current research, we aim to come up with a better MT system by looking inside the words. In almost every language, a root (stem) can have many different forms (inflectional, derivational, etc.). If we can identify the roots, the size of the vocabulary will be quite small, and we can have better probability estimates, reducing the sparse data problem and potentially leading to higher accuracy. We are trying to come up with a model that induces morphology automatically from a bilingual corpus and achieves this improvement.
DTEND;TZID=America/Los_Angeles:20030425T160000
DTSTART;TZID=America/Los_Angeles:20030425T150000
LOCATION:11 Large
SUMMARY:Statistical MT with Bilingual Morphology
UID:20030425T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030801T160000
DTSTART;TZID=America/Los_Angeles:20030801T150000
LOCATION:11 Large
SUMMARY:Toward deciphering the 2-dimensional ancient Luwian script by discovering its writing order
UID:20030801T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030815T160000
DTSTART;TZID=America/Los_Angeles:20030815T150000
LOCATION:11 Large
SUMMARY:On Her Masters Research
UID:20030815T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030822T160000
DTSTART;TZID=America/Los_Angeles:20030822T150000
LOCATION:11 Large
SUMMARY:Information Extraction, IR and QA
UID:20030822T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030827T160000
DTSTART;TZID=America/Los_Angeles:20030827T150000
LOCATION:11 Large
SUMMARY:Syntax for Statistical MT
UID:20030827T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030829T160000
DTSTART;TZID=America/Los_Angeles:20030829T150000
LOCATION:11 Large
SUMMARY:Deepening Representations
UID:20030829T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Previous research has indicated that when a polysemous word appears two or more times in a discourse, it is extremely likely that they will all share the same sense (Gale et al. 92). However, those results were based on a coarse-grained distinction between senses (e.g., "sentence" in the sense of a 'prison sentence' vs. a 'grammatical sentence'). I conducted an analysis of multiple senses within two sense-tagged corpora, Semcor and DSO. These corpora used WordNet for their sense inventory. I found significantly more occurrences of multiple senses per discourse than reported in (Gale et al. 92) (33% instead of 4%). I also found classes of ambiguous words in which as many as 45% of the senses in the class co-occur within a document. I will discuss the implications of these results for the task of word-sense tagging and for the way in which senses should be represented.
DTEND;TZID=America/Los_Angeles:20031219T163000
DTSTART;TZID=America/Los_Angeles:20031219T150000
LOCATION:11 Large
SUMMARY:More than One Sense Per Discourse
UID:20031219T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will give a status report on my work on information extraction during the last 10 months. The motivation of this work is to learn extraction patterns automatically using a seed template and a web search engine. My approach is to generate linguistic patterns and surface patterns and combine them to compensate for the respective weaknesses of the two pattern types. On the DUC01-test-disasters (67 documents) and DUC01-training-disasters (54 documents) sets, I got a 0.34/0.26 f-measure respectively. In this talk, I will also give a status report on the ReAD project (with Dr. Chin-Yew Lin).
DTEND;TZID=America/Los_Angeles:20030207T160000
DTSTART;TZID=America/Los_Angeles:20030207T150000
LOCATION:11 Large
SUMMARY:Automatic Pattern Learning for Information Extraction using Web Data
UID:20030207T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The large corpora of written text that are available to the language community have largely been utilized for language understanding; they have somewhat been ignored in the context of language generation. Recent developments in stochastic generation have allowed such systems to shift the burden from hand-crafted databases (lexicons, grammars, ontologies) to the knowledge implicitly found in written text. However, when building a dialogue system, generation is largely interactive, very different from the written structure of most corpora. In this talk, I will discuss my recent work on applying a stochastic generator, HALogen, and its newswire language model to a dialogue system, TRIPS. I'll describe the difficulties in mapping the TRIPS semantic form into HALogen's representation, the critical differences between newswire and dialogue, and the possibility of using HALogen and a large newswire model as a domain-independent generator.
DTEND;TZID=America/Los_Angeles:20030221T160000
DTSTART;TZID=America/Los_Angeles:20030221T150000
LOCATION:11 Large
SUMMARY:Statistical Language Generation in a Dialogue System
UID:20030221T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We introduce two probabilistic models that can be used to identify elementary discourse units and build sentence-level discourse parse trees. The models use syntactic and lexical features. A discourse parsing algorithm that implements these models derives discourse parse trees with an error reduction of 18.8% over a state-of-the-art decision-based discourse parser. A set of empirical evaluations shows that our discourse parsing model is sophisticated enough to yield discourse trees at an accuracy level that matches near-human levels of performance.
DTEND;TZID=America/Los_Angeles:20030228T160000
DTSTART;TZID=America/Los_Angeles:20030228T150000
LOCATION:11 Large
SUMMARY:Sentence Level Discourse Parsing using Syntactic and Lexical Information
UID:20030228T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Since its inception more than 30 years ago, electronic mail (email) has developed into a powerful communication medium with applications that extend well beyond simple asynchronous message exchange between individuals. Automated tools to support the use of email in individual, organizational and social contexts have received increasing attention in recent years. Among the tasks that are now supported are filtering (e.g., spam detection), aggregation (e.g., mailing list digests), workflow management (e.g., help desk routing), and reuse (e.g., retrospective search). We are interested in how today's email will be used in the future -- some will certainly be preserved (indeed, some MUST be preserved!), and those records will serve as powerful evidence of how we lived our lives and organized our societies. The challenges of managing many types of electronic record collections are receiving increasing attention, but we are not aware of any work yet on supporting access to electronic mail archives. That will be the focus of this talk. We will introduce the Open Archival Information Systems (OAIS) model, and then focus on two key processes: ingestion and access. Our focus in ingestion is on support for review and redaction, which we believe will be key enablers to acquisition and near-term access. For access, we will address both browsing based on provenance (original order) and user-guided reorganization based on search and visualization. Along the way, we will identify potentially productive opportunities to apply natural language processing technologies such as topic segmentation, link detection, and summarization. We will then describe two test collections, and demonstrate a system that we have developed to explore user-guided reorganization through visualization for one of those collections. We will conclude the talk by sketching out a research agenda. At that point, we will expect suggestions and comments from the audience. Knowing this audience, it is unlikely that we will need to wait that long :-).
DTEND;TZID=America/Los_Angeles:20030124T160000
DTSTART;TZID=America/Los_Angeles:20030124T150000
LOCATION:11 Large
SUMMARY:Access to Archival Collections of Electronic Mail
UID:20030124T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will give a status report on my current thesis work on noun phrase translation. The motivation of this work is to break up the machine translation problem into smaller, more manageable units. The treatment of noun phrase translation as a subtask of machine translation is both linguistically and empirically motivated. My approach is to generate an n-best list of candidate translations with a statistical machine translation system and rerank the candidates with additional features. For about 90% of all noun phrases we can find an acceptable translation in the 100-best list, while an acceptable translation comes out on the very top for only about 60% of the noun phrases. I will discuss a variety of linguistic and empirical features that (may) help to move the acceptable translations higher in the list. I will also present results on modeling issues such as phrase-based translation and compound splitting. This talk is also intended as a fishing expedition for feature suggestions by the audience.
DTEND;TZID=America/Los_Angeles:20030131T160000
DTSTART;TZID=America/Los_Angeles:20030131T150000
LOCATION:11 Large
SUMMARY:Noun Phrase Translation
UID:20030131T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030718T160000
DTSTART;TZID=America/Los_Angeles:20030718T150000
LOCATION:11 Large
SUMMARY:A Maryland Yankee in King Eduard's Court: Some Remarks on a Year in Paradise
UID:20030718T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030725T160000
DTSTART;TZID=America/Los_Angeles:20030725T150000
LOCATION:11 Large
SUMMARY:Super-Carmel for Trees
UID:20030725T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030729T160000
DTSTART;TZID=America/Los_Angeles:20030729T150000
LOCATION:11 Small
SUMMARY:A Model of Word Movement for Machine Translation
UID:20030729T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Bilingual term lists have proven to be a useful basis for dictionary-based Cross-Language Information Retrieval (CLIR), but there is ample anecdotal evidence that differences in vocabulary coverage can have a substantial impact on retrieval effectiveness. This issue has recently been explored using ablation studies in which progressively smaller term lists were synthesized using sampling techniques. The ablation techniques used in those studies have not, however, been validated using real term lists. In this talk I will report the results of what we believe is the first large coverage study using naturally occurring term lists. Thirty-five bilingual term lists were obtained from a variety of sources, each with English as one of the two paired languages. From these, we created 35 English-to-English term lists by taking each term that was present in the English side of the list as its own translation. When used with an English information retrieval test collection, this allowed us to measure the reduction in retrieval effectiveness that could be attributed to deficiencies in the coverage of English terms. Eight types of untranslatable terms were identified in a collection of news stories, of which named entities were found to have the greatest impact on retrieval effectiveness. Differences in named entity coverage were found to produce large differences in retrieval effectiveness for term lists of similar sizes. Controlling for named entity effects yielded a clear relationship between retrieval effectiveness and the size of the translatable English vocabulary. The functional dependence that we observed is consistent with one previously applied ablation technique and inconsistent with another. Our results indicate that the outcome of a widely cited landmark study of query expansion effects for CLIR was likely affected by a flawed ablation model. We conclude our talk with a suggestion for further work on that topic, and a simple prescription for avoiding such problems in the future.
DTEND;TZID=America/Los_Angeles:20030612T120000
DTSTART;TZID=America/Los_Angeles:20030612T110000
LOCATION:11 Large
SUMMARY:Measuring the Effect of Dictionary Coverage on Cross-Language Retrieval
UID:20030612T110000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030627T160000
DTSTART;TZID=America/Los_Angeles:20030627T150000
LOCATION:10 Large
SUMMARY:Offline Strategies for Online Question Answering: Answering Questions Before They Are Asked and Maximum Entropy Models for FrameNet Classification
UID:20030627T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Our contextual inquiry into the practices of oral historians unearthed a curious incongruity. While oral historians consider interview recordings a central historical artifact, these recordings sit unused after a written transcript is produced. We hypothesized that this is largely because books are more usable than recordings. Therefore, we created Books with Voices: bar-code augmented paper transcripts enabling fast, random access to digital video interviews on a PDA. We present quantitative results of an evaluation of this tangible interface with 13 participants. They found this lightweight, structured access to original recordings to offer substantial benefits with minimal overhead. Oral historians found a level of emotion in the video not available in the printed transcript. The video also helped readers clarify the text and observe nonverbal cues. http://guir.berkeley.edu/oral-history/
DTEND;TZID=America/Los_Angeles:20030307T160000
DTSTART;TZID=America/Los_Angeles:20030307T150000
LOCATION:11 Large
SUMMARY:Books with Voices: Paper Transcripts as a Tangible Interface to Oral Histories
UID:20030307T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: One of the key challenges in retrieval is what to do when a query term needs to be replaced with more than one term. This problem arises in applications such as cross-language information retrieval and thesaurus expansion. One solution is to use structured query methods, which treat all the possible replacements as if they were one query term by computing a joint document frequency and a joint term frequency. This presentation will review prior work on structured query techniques and then introduce three new variants that aim to improve computational efficiency and to leverage estimates of replacement probabilities to improve retrieval effectiveness. The methods have now been tested in cross-language retrieval and OCR-degraded text retrieval applications in which replacement probability estimates could be obtained. In both applications, the new structured query methods showed statistically significant improvements in retrieval effectiveness over previously known structured query methods.
DTEND;TZID=America/Los_Angeles:20030314T160000
DTSTART;TZID=America/Los_Angeles:20030314T150000
LOCATION:11 Large
SUMMARY:Improving the Efficiency and Effectiveness of Structured Query Methods
UID:20030314T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Term weighting methods have been shown to give significant increases in information retrieval performance. Term weights are typically calculated using frequency counts across the whole retrieval collection, the frequency of each term within individual documents, and compensation for varying document length. The presence of pronominal references in documents effectively reduces the within-document term frequency of associated words, with a consequent effect on term weights and information retrieval behaviour. This presentation will describe an experimental investigation into the impact on information retrieval performance of broad-coverage automatic pronoun resolution. Results using a standard information retrieval test collection indicate that calculating term weights using a pronoun-resolved version of the document test collection can improve both fixed-cutoff and average retrieval precision.
DTEND;TZID=America/Los_Angeles:20030321T160000
DTSTART;TZID=America/Los_Angeles:20030321T150000
LOCATION:11 Large
SUMMARY:An Investigation of the Application of Broad Coverage Automatic Pronoun Resolution in Information Retrieval
UID:20030321T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We present an approach to automatically extracting paraphrase templates from document/abstract pairs. This methodology relies on word-based alignments created by off-the-shelf software. Our paraphrases are evaluated by human evaluators for precision and automatically for applicability. We find that 77% of the extracted paraphrases are judged to be always correct and that the generalized templates of 60% are judged to be applicable most of the time and 87% are judged to be applicable sometimes.
DTEND;TZID=America/Los_Angeles:20030502T160000
DTSTART;TZID=America/Los_Angeles:20030502T150000
LOCATION:11 Large
SUMMARY:Acquiring Paraphrase Templates from Document/Abstract Pairs
UID:20030502T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: For ten days in March, nine research teams worked together to build Cebuano language resources and systems for a "dry run" of the TIDES Surprise Language experiment. Cebuano is spoken widely in the southern Philippines, but there had previously been little work on computational linguistics for that language. As we prepare for the actual Surprise Language experiment this June, we will use this talk to look back on what worked, what didn't, and what lessons there are to be learned from our experience in March. Come prepared to share the excitement, offer your ideas, and understand why we have tried to ask Ed to cancel all vacations during the month of June (just kidding...).
DTEND;TZID=America/Los_Angeles:20030509T160000
DTSTART;TZID=America/Los_Angeles:20030509T150000
LOCATION:11 Large
SUMMARY:Coping with Surprise: The Case of Cebuano
UID:20030509T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conduct an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
DTEND;TZID=America/Los_Angeles:20030516T160000
DTSTART;TZID=America/Los_Angeles:20030516T150000
LOCATION:11 Large
SUMMARY:Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics
UID:20030516T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20030520T160000
DTSTART;TZID=America/Los_Angeles:20030520T150000
LOCATION:11 Large
SUMMARY:Discourse Segmentation of Multi-Party Conversation
UID:20030520T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: 1) A serious bottleneck in the development of trainable text summarization systems is the shortage of training data. Constructing such data is a very tedious task, especially because there are in general many different correct ways to summarize a text. Fortunately, we can utilize the Internet as a source of suitable training data. In this paper, we present a summarization system that uses the web as the source of training data. The procedure involves structuring the articles downloaded from various websites, building adequate corpora of (summary, text) and (extract, text) pairs, training on positive and negative data, and automatically learning to perform the task of extraction-based summarization systems. 2) Headlines are useful for users who only need information on the main topics of a story. We present a headline summarization system that is built at ISI for this purpose and is a top performer for DUC2003's task 1, generating very short summaries (10 words or less).
DTEND;TZID=America/Los_Angeles:20030523T160000
DTSTART;TZID=America/Los_Angeles:20030523T150000
LOCATION:11 Large
SUMMARY:A Web-Trained Extraction Summarization System and Headline Summarization at ISI
UID:20030523T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: An Overview of Question Answering Challenge (Jun'ichi Fukumoto and Tsuneaki Kato): In this talk, we will present an overview of Question Answering Challenge (QAC), which is the question answering task of the NTCIR Workshop. QAC-1 (the first evaluation of QAC) was carried out at NTCIR Workshop 3 in October 2002, and QAC-2 will be at NTCIR Workshop 4 in December 2003. In the QAC, systems to be evaluated are expected to return exact answers consisting of a noun or noun compound denoting, for example, the names of persons, organizations, or various artifacts, or numerical expressions such as money, size, or date. These basically range over the Named Entity (NE) elements of MUC and IREX but are not limited to them. QAC consists of three kinds of subtasks: Task 1, where the systems are allowed to return five ranked possible answers; Task 2, where the systems are required to return a complete list of answers; and Task 3, where the systems are required to answer a series of questions that have anaphora and zero-anaphora. We will present the results of QAC-1, and the vision and prospects of QAC-2. NTCIR -- the Way Ahead (Noriko Kando): Dr. Noriko Kando is the leader of the NTCIR (Test Collections and Evaluation of IR, Text Summarization, Q&A, etc.) project and an associate professor at the National Institute of Informatics (NII). She got her Ph.D. in 1995 from Keio University. Her research interests include evaluation of information retrieval systems, technologies to "Make Information Usable for Users", cross-lingual information retrieval, and analysis of text structure, genre, citation & link. She is a member of the editorial boards of the International Journal on Information Processing and Management, ACM Transactions on Asian Language Information Processing, etc. Jun'ichi Fukumoto and Tsuneaki Kato are task organizers of QAC. Dr. Jun'ichi Fukumoto is an associate professor at Ritsumeikan University. He got his Ph.D. in 1999 from the University of Manchester Institute of Science and Technology.
His research interests include Q&A, automatic summarization, and dialogue processing. Dr. Tsuneaki Kato is an associate professor at the University of Tokyo. He got his Dr. of Engineering in 1995 from Tokyo Institute of Technology. His research interests include multimodal dialogue processing, multimodal presentation generation, and domain-independent question answering. He is a member of the editorial committee of the Transactions on Information and Systems of the Institute of Electronics, Information and Communication Engineers.
DTEND;TZID=America/Los_Angeles:20031117T120000
DTSTART;TZID=America/Los_Angeles:20031117T103000
LOCATION:4th Floor
SUMMARY:An Overview of the QA Challenge + NTCIR -- The Way Ahead
UID:20031117T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, I will introduce some of the technologies which we have developed in the project on an English reading assistant system called English Reading Wizard. The technologies include a method for mining translations from the web (non-parallel corpora), a method for word translation disambiguation based on bootstrapping, which is called Bilingual Bootstrapping, and a general method of bootstrapping, which is called Collaborative Bootstrapping. First, I will introduce the main features of English Reading Wizard. Next, I will introduce each of the methods. The translation mining method is based on a naïve Bayesian ensemble and the EM algorithm. Bilingual Bootstrapping uses the asymmetric translation relationship between words in the two languages in translation and can construct reliable classifiers for word translation disambiguation. Collaborative Bootstrapping contains the co-training algorithm as its special case, and it uses the strategy of uncertainty reduction in training the two classifiers. Bio: Hang Li is a researcher at the Natural Language Computing Group of Microsoft Research in Beijing, China. He is also an adjunct professor at Xian Jiaotong University. Hang Li obtained a B.S. in Electrical Engineering from Kyoto University (Japan) in 1988 and an M.S. in Computer Science from Kyoto University in 1990. He earned his Ph.D. in Computer Science from the University of Tokyo in 1998. From 1990 to 2001, Hang Li worked at the Research Laboratories of NEC Corporation in Kawasaki, Japan. He joined Microsoft Research in 2001. His research interests include statistical learning, natural language processing, data mining, and information retrieval. Hang Li's web site: http://research.microsoft.com/users/hangli/
DTEND;TZID=America/Los_Angeles:20031125T120000
DTSTART;TZID=America/Los_Angeles:20031125T103000
LOCATION:11th Floor Large
SUMMARY:Using Bilingual Data to Mine and Rank Translations
UID:20031125T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20031002T170000
DTSTART;TZID=America/Los_Angeles:20031002T160000
LOCATION:11 Large
SUMMARY:TBA
UID:20031002T160000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I would like to talk about some of the things I did during the last year. I will discuss and demonstrate CuSTaRD, a cross-lingual information retrieval, organization, summarization, and visualization system that was built for the Surprise Language exercise. I will focus in more detail on iNeATS, the interactive multi-document summarization part of CuSTaRD. The other project I plan to present is eArchivarius, a system for accessing collections of electronic mail.
DTEND;TZID=America/Los_Angeles:20031003T160000
DTSTART;TZID=America/Los_Angeles:20031003T150000
LOCATION:11 Large
SUMMARY:A Year in Paradise
UID:20031003T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: (This is a practice run for a talk I will give a few times over the next weeks when interviewing for job positions.) I will review the state of the art in statistical machine translation (SMT), present my dissertation work, and sketch out the research challenges of syntactically structured statistical machine translation. The currently best methods in SMT build on the translation of phrases (any sequences of words) instead of single words. Phrase translation pairs are automatically learned from parallel corpora. While SMT systems generate translation output that often conveys a lot of the meaning of the original text, it is frequently ungrammatical and incoherent. The research challenge at this point is to introduce syntactic knowledge to the state of the art in order to improve translation quality. My approach breaks up the translation process along linguistic lines. I will present my thesis work on noun phrase translation and ideas about clause structure.
DTEND;TZID=America/Los_Angeles:20031010T160000
DTSTART;TZID=America/Los_Angeles:20031010T150000
LOCATION:11 Large
SUMMARY:Advances in Statistical MT: Phrases, Noun Phrases and Beyond
UID:20031010T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The annual Computational Linguistics Open House will be held at USC's Information Sciences Institute from 3:00-4:30pm in the 11th floor Conference Room. Researchers from ISI, including Eduard Hovy, Daniel Marcu, and Kevin Knight, will present overviews of their latest research. We will also hear about the research activities of Dani Byrd of the Linguistics Department, Shri Narayanan's group in EE, and David Traum and Andrew Gordon of USC's Institute for Creative Technologies.
DTEND;TZID=America/Los_Angeles:20031017T163000
DTSTART;TZID=America/Los_Angeles:20031017T150000
LOCATION:11 Large
SUMMARY:Introduction to CL Research
UID:20031017T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Probabilistic parsing methods have in recent years transformed our ability to robustly find correct parses for open domain sentences. Much of this work has been within a common architecture of heuristic search for good parses in lexicalized probabilistic context-free grammars, with many layers of back-off to avoid problems of sparse data. In this talk, I will outline some different ideas that we have been pursuing. I will connect stochastic parsing with finding shortest paths in hypergraphs, and show how this approach naturally provides a chart parser for arbitrary probabilistic context-free grammars (finding shortest paths in a hypergraph is easy; the central problem of parsing is that the hypergraph has to be constructed on the fly). From this viewpoint, a natural approach is to use the A* algorithm to cut down the work in finding the best parse. On unlexicalized grammars, this can reduce the parsing work done dramatically, by at least 97%. This approach is competitive with methods standardly used in statistical parsers, while ensuring optimality, unlike most heuristic approaches to best-first parsing. Finally, I will present a novel modular generative model in which semantic (lexical dependency) and syntactic structures are scored separately. This factored model is conceptually simple, linguistically interesting, admits exact inference with an extremely effective A* algorithm, and provides straightforward opportunities for separately improving the component models. In particular, I will mention some of the work we have done focusing on the PCFG component to produce a very high accuracy unlexicalized grammar. This is joint work with Dan Klein. About the Speaker: Christopher Manning is an Assistant Professor of Computer Science and Linguistics at Stanford University. He received his Ph.D.
from Stanford University in 1995, and served on the faculty of the Computational Linguistics Program at Carnegie Mellon University (1994-1996) and the University of Sydney Linguistics Department (1996-1999) before returning to Stanford. His research interests include probabilistic models of language, natural language parsing, constraint-based linguistic theories, syntactic typology, information extraction and text mining, and computational lexicography. He is the author of three books, including Foundations of Statistical Natural Language Processing (MIT Press, 1999, with Hinrich Schuetze). Chris' schedule is available in <a href="manning.ps">Postscript</a> or <a href="manning.pdf">PDF</a> format.
DTEND;TZID=America/Los_Angeles:20031027T110000
DTSTART;TZID=America/Los_Angeles:20031027T100000
LOCATION:11 Large
SUMMARY:Natural Language Parsing: Graphs, the A* Algorithm, and Modularity
UID:20031027T100000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We will present the results of the 2003 Johns Hopkins University Summer Workshop on "Syntax for Statistical Machine Translation". We will describe a large effort to extend a high-performing phrase-based MT system as baseline by adding new features representing syntactic knowledge that deal with specific problems of the underlying baseline. We investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions test if a certain constituent occurs in the source and the target language parse tree. More sophisticated features will be derived from an alignment model where whole sub-trees in source and target can be aligned node by node. We present results on the Chinese-English large data track of the recent TIDES MT evaluations. This is joint work with the other workshop team members: Daniel Gildea, Anoop Sarkar, Sanjeev Khudanpur, Kenji Yamada, Libin Shen, Shankar Kumar, David Smith, Viran Jain, Katherine Eng, Jin Zhen and Dragomir Radev. See <a href="http://www.clsp.jhu.edu/ws03/groups/translate/">http://www.clsp.jhu.edu/ws03/groups/translate/</a> for more.
DTEND;TZID=America/Los_Angeles:20030903T160000
DTSTART;TZID=America/Los_Angeles:20030903T150000
LOCATION:11 Large
SUMMARY:JHU MT Workshop
UID:20030903T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: A major hurdle in building automated information retrieval systems for Hindi text is the lack of a uniform encoding for text representation. Standards do exist, but no one seems interested. Every web content publisher seems to have their own encoding system, making information extraction a nightmare. We explore an unsupervised approach to convert any given "unknown" encoding to UTF-8, by treating it as a decipherment problem. We also study how a small amount of supervision can improve decoding accuracy.
DTEND;TZID=America/Los_Angeles:20030905T160000
DTSTART;TZID=America/Los_Angeles:20030905T150000
LOCATION:11 Large
SUMMARY:Deciphering Hindi Scripts
UID:20030905T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, I look at how the notion of discourse coherence can be modeled computationally. I begin with the following idea: if you take a text and shuffle its sentences into a random order, that text will no longer make sense. In other words, the text will be "incoherent". Our task is to learn how to reassemble a shuffled text into an order that humans would consider to be coherent. I discuss practical and theoretical motivations for the task, evaluations of our model, increases in performance achieved over the summer, and directions for future research. This work was done in collaboration with Kevin Knight, Daniel Marcu, Jonathan Graehl and Nick Mote.
DTEND;TZID=America/Los_Angeles:20030912T160000
DTSTART;TZID=America/Los_Angeles:20030912T143000
LOCATION:11 Large
SUMMARY:Discourse Coherence for Ordering Information
UID:20030912T143000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I present my summer project - writing rule-based software for simplifying texts. Task definition and motivations will be discussed, as well as human and automatic evaluation, the latter using a question answering system. This is joint work with Daniel Marcu and Kevin Knight.
DTEND;TZID=America/Los_Angeles:20030915T160000
DTSTART;TZID=America/Los_Angeles:20030915T143000
LOCATION:11 Large
SUMMARY:Analyzing Sentences into Facts: Simple is Beautiful
UID:20030915T143000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The presentation will give an overview of the SMT activities at the Language Technologies Institute, Carnegie Mellon University, in large vocabulary text translation tasks, esp. Chinese-English and Arabic-English, as well as in limited domain speech-to-speech translation tasks. The CMU SMT system is, like most modern statistical MT systems, based on phrase translation. Several approaches have been developed to extract the phrase pairs from parallel corpora, and current research investigates different scoring approaches for these translation pairs. Details of the decoder, esp. on hypothesis recombination, pruning, and efficient n-best list generation, will be given. Recently, the SMT system has been extended to use partial translations generated from example-based and grammar-based translation systems, thereby performing multi-engine machine translation. Bio: Stephan Vogel is a researcher at the Language Technologies Institute, Carnegie Mellon University, where he heads the statistical machine translation team. He received a Diploma in Physics from Philipps University Marburg, Germany, and a Master of Philosophy from the University of Cambridge, England. After working for a number of years on the history of science, he turned to computer science, especially natural language processing. Before coming to CMU, he worked for several years at the Technical University of Aachen on statistical machine translation, and also in the Interactive Systems Lab at the University of Karlsruhe.
DTEND;TZID=America/Los_Angeles:20040402T160000
DTSTART;TZID=America/Los_Angeles:20040402T150000
LOCATION:11 Large
SUMMARY:The CMU Statistical Machine Translation System
UID:20040402T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: An interesting (disturbing?) new trend is beginning to manifest itself in NLP, one that is focused on performance and hence very attractive in the context of inter-system competitive evaluations such as TREC and DUC, but one that does not provide much insight about language or NLP methods to the researcher interested in these topics. This addition of a new paradigm to NLP has implications for all of us.
DTEND;TZID=America/Los_Angeles:20040409T163000
DTSTART;TZID=America/Los_Angeles:20040409T150000
LOCATION:11 Large
SUMMARY:Three (and a half?) Trends: The Future of NLP
UID:20040409T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Automated essay scoring was initially motivated by its potential cost savings for large-scale writing assessments. However, as automated essay scoring became more widely available and accepted, teachers and assessment experts realized that the potential of the technology could go way beyond just essay scoring. Over the past five years or so, there has been rapid development and commercial deployment of automated essay evaluation for both large-scale assessment and classroom instruction. A number of factors contribute to an essay score, including varying sentence structure, grammatical correctness, appropriate word choice, errors in spelling and punctuation, use of transitional words/phrases, and organization and development. Instructional software capabilities exist that provide essay scores and evaluations of student essay writing in all of these domains. The foundation of automated essay evaluation software is rooted in NLP research. This talk will walk through the development of CriterionSM, e-rater, and Critique writing analysis tools, automated essay evaluation software developed at Educational Testing Service - from NLP research through deployment as a business. (Preview of an HLT/NAACL-2004 Invited Speaker Presentation) Jill Burstein, Educational Testing Service, Princeton, NJ
DTEND;TZID=America/Los_Angeles:20040413T163000
DTSTART;TZID=America/Los_Angeles:20040413T150000
LOCATION:4 Large
SUMMARY:Automated Essay Evaluation: From NLP research through deployment as a business
UID:20040413T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Although we live in a predominantly statistical world, there are still many language processing applications that long for accurate representations of text meaning. Even applications that found partial solutions in statistical modeling, including information retrieval, machine translation, or automatic summarization, are likely to get a significant boost from deeper text understanding. In this talk, I will present an innovative method for automatic extraction of conceptual graphs as a means to represent text meaning. The method relies on a novel adaptation of graph-based ranking algorithms - traditionally (and successfully) used in citation analysis, Web page ranking, and social networks. I will show how such algorithms can be adapted to semantic networks, resulting in an efficient unsupervised method for resolving the semantic ambiguity of all words in open text, and identifying relations between entities in the text. I will also outline a number of applications that are enabled by this representation, including keyphrase extraction, domain classification, and extractive summarization. BIO: Rada Mihalcea is an Assistant Professor of Computer Science at the University of North Texas. Her research interests are in lexical semantics, minimally supervised natural language learning, and multilingual natural language processing. She is currently involved in a number of research projects, including word sense disambiguation, shallow semantic parsing, (non-traditional) methods for building annotated corpora with volunteer contributions over the Web, word alignment for language pairs with scarce resources, and graph-based ranking algorithms for language processing. Her research is supported by NSF and the state of Texas.
DTEND;TZID=America/Los_Angeles:20040416T120000
DTSTART;TZID=America/Los_Angeles:20040416T103000
LOCATION:11 Large
SUMMARY:Graph-based Ranking Algorithms for Language Processing
UID:20040416T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I'll describe our entry into the DUC 2004 automatic document summarization competition. We competed only in the single document, headline generation task. Our system is based on a novel kernel dubbed the tree position kernel, combined with two other well-known kernels. Our system performs well on white-box evaluations, but does very poorly in the overall DUC evaluation. C'est la vie.
DTEND;TZID=America/Los_Angeles:20040423T160000
DTSTART;TZID=America/Los_Angeles:20040423T150000
LOCATION:10 Large
SUMMARY:A Tree-Position Kernel for Document Compression
UID:20040423T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
DTEND;TZID=America/Los_Angeles:20040428T170000
DTSTART;TZID=America/Los_Angeles:20040428T150000
LOCATION:11 Large
SUMMARY:Practice Talks for HLT/NAACL
UID:20040428T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Summarization requires one to identify the internal structure of information and to bring that to the surface both operationally and organizationally. How does one put this theory to practice and build real summarization systems? How do the systems built based on this idea perform?
DTEND;TZID=America/Los_Angeles:20040430T163000
DTSTART;TZID=America/Los_Angeles:20040430T150000
LOCATION:11 Large
SUMMARY:Automating the Building of Summarization Systems
UID:20040430T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Information retrieval using word senses is emerging as a good research challenge in semantic information retrieval. In this presentation, I am going to propose a new method for using word senses in information retrieval: the root sense tagging method. This method assigns coarse-grained word senses defined in WordNet to query terms and document terms in an unsupervised way, using co-occurrence information constructed automatically. The sense tagger is crude, but performs consistent disambiguation by considering only the single most informative word as evidence to disambiguate the target word. We also allow multiple-sense assignment to alleviate the problem caused by incorrect disambiguation. Experimental results on a large-scale TREC collection show that the proposed approach to improving retrieval effectiveness is successful, while most of the previous work failed to improve performance even on small text collections. The proposed method also shows promising results when combined with pseudo relevance feedback and state-of-the-art retrieval functions such as BM25.
DTEND;TZID=America/Los_Angeles:20040806T163000
DTSTART;TZID=America/Los_Angeles:20040806T150000
LOCATION:11 Large
SUMMARY:Information Retrieval using Word Senses: Root Sense Tagging Approach
UID:20040806T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Justin Busch: Weight and Semantic Class Issues in Japanese Noun Phrase Ordering. Many current designs for automatic parsers learn probabilities for the relative frequencies of parts-of-speech and syntactic rules, and this has proven to be generally reliable. In spite of the ubiquity of probabilistic techniques for parsing, however, little attention has been given to the linguistic significance of the probabilistic data and what it might say about human performance. Hawkins proposes a general theory of grammaticalization based on the minimization of syntactic domains. Given that a sentence of any language will contain at least one noun phrase, one verb, and possibly additional noun phrases and prepositional phrases, "minimize domains" suggests that these phrases will order themselves according to whichever pattern requires the least effort to recognize the higher syntactic structure of the sentence. These effects are directly measurable through corpus statistics, and can be interpreted as potential heuristics for probabilistic parsers. In this study, we examine Japanese data from the Kyoto Treebank and test Hawkins' predictions for noun phrase ordering by noun phrase weight as well as by generic semantic types. The discussion will focus primarily on how accurately Hawkins' predictions are reflected in the corpus statistics, and will conclude with observations about how they might be applied to the decision mechanisms of probabilistic parsers. -------------------------------------------------------------------------- Hai Huang: TBA -------------------------------------------------------------------------- Jens Stephan: Evaluation and Visualization of a Dialogue System. Evaluations have become a necessary standard in almost any type of research. However, there are many areas where there is no common agreement on how to evaluate, which is the case for the complex problem of evaluating dialogue systems.
The evaluation of the multi-party multi-modal dialogue system MRE(1) provides a good example of what questions are important for such an evaluation, how to actually do the evaluation, and finally how to make specific problems of the system visible and use the evaluation results to improve the system's performance. After a brief introduction of the MRE domain and architecture, I will break the task down into a set of general evaluation questions. From there I will explain what kinds of metrics and visualizations are suited to answer those questions and what kind of data is needed, as well as how that data was obtained. Along the road, examples of actual system problems and performances will be presented. The topics of data formatting and visualization will receive some special attention by introducing the MRE Evaluation Toolkit as well as the corpus it operates on. -------------------------------------------------------------------------- Chen-kang Yang: Using the Omega Ontology to Determine Selectional Restrictions for Word Sense Disambiguation. Word sense disambiguation is fundamental for language processing. Though purely statistical methods are effective for this task, they neglect the syntactic and semantic aspects. In this study, we adopt a hybrid approach by applying an unsupervised machine learning method to learn verbs' selectional restrictions on their subjects/objects. The system then uses these learned selectional restrictions for word sense disambiguation of the subjects/objects. Instead of words, the training data contains ontological taxonomy hierarchies that are retrieved from the Omega ontology. Unlike other similar systems, we are able to automatically find the best match among classes from different levels of the ontology. This provides us more flexibility and is closer to human intuition. Our system performs better than other similar systems, though it still needs cooperating methods for better results.
DTEND;TZID=America/Los_Angeles:20040809T163000
DTSTART;TZID=America/Los_Angeles:20040809T150000
LOCATION:11 Large
SUMMARY:CL Student Presentations
UID:20040809T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The last decade has seen a plethora of papers in NLP devoted to Machine Learning algorithms. However, most of these papers have devoted their effort exclusively to improving system performance on the accuracy axis. Most of the sophisticated NLP algorithms are extremely slow and do not scale up easily when applied to large amounts of data. I will talk about the importance of randomized algorithms and their potential in speeding up some NLP algorithms. This talk will be a survey of some recent advances in Theoretical Computer Science/Math seen from an NLP point of view. I am not going to present any results, but I am hoping that this talk will clarify my thinking process, get feedback from people, and help me collaborate with others.
DTEND;TZID=America/Los_Angeles:20040813T163000
DTSTART;TZID=America/Los_Angeles:20040813T150000
LOCATION:11 Large
SUMMARY:Randomized Algorithms and Their Application to NLP
UID:20040813T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Broad-coverage repositories of semantic relations between verbs could benefit many NLP tasks. We present a semi-automatic method for extracting fine-grained semantic relations between verbs. We detect similarity, strength, antonymy, enablement, and temporal happens-before relations between pairs of strongly associated verbs using lexico-syntactic patterns over the Web. On a set of 29,165 strongly associated verb pairs, our extraction algorithm yielded 65.5% accuracy. We provide the resource, called VerbOcean, for download at http://semantics.isi.edu/ocean/. We will also discuss current work on disambiguating the verbs in the network as well as refining the semantic relations using path analysis.
DTEND;TZID=America/Los_Angeles:20040816T153000
DTSTART;TZID=America/Los_Angeles:20040816T140000
LOCATION:11 Large
SUMMARY:VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations
UID:20040816T140000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: ISI's Tactical Language Project is a system designed to teach Americans how to speak Arabic through a video game environment. We've taken an FPS engine (Unreal 2003) and redid the graphics so it looks like you're in a typical Lebanese village. We took away the guns, added speech recognition, and set the player in the middle of it all. The theory is that if you learn well in a classroom, you'll perform well in a classroom, but if you learn well in a pseudo-naturalistic environment, you'll perform better in real life. In a pedagogical context, speech recognition is hard: we're trying to recover signal from noisy language-learner speech--with all of its mispronunciations, disfluencies, and grammatical errors. Language understanding is hopeless unless you have a good approximation of what kinds of mistakes learners make, and you can build a system to anticipate them. Suppose an English language learner says "Water". Is he asking you for water? Is he telling you there's a puddle in front of you? Is he saying his name is "Walter", but with horrible pronunciation? There's a lot of ambiguity involved. In order to disambiguate, we need to look at the speech signal itself, the utterance's context, the learner's past language performance, details about the learner's native language as it relates to English, etc., etc... Only then can we hope to guess what the learner is actually trying to say. And then, of course, once we've made a good guess at the learner's speech intentions, what do we do about it? How do we correct him? How do we balance the consideration of inherent qualities of learner motivation, language errors, learning objectives, and possibly low-confidence speech recognition as we generate good pedagogical feedback? This is NLP (primarily statistical) with a bit of pedagogy theory and linguistic (SLA and phonology) theory sprinkled in.
DTEND;TZID=America/Los_Angeles:20041210T163000
DTSTART;TZID=America/Los_Angeles:20041210T150000
LOCATION:11 Large
SUMMARY:Developing a Language Model for Second Language Learner Speech
UID:20041210T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This talk will address the problem of assessing the correctness of MT output at the word level. I will give an overview of word confidence measures for SMT. Different variants of word posterior probabilities that can be directly used as confidence measures will be presented. Their connection with the Bayes decision rule and the underlying error measure will be shown. An experimental comparison of different word confidence measures will be presented on a translation task consisting of technical manuals. Additionally, I will show how word confidence measures can be applied in an interactive SMT system. This system predicts translations, taking into account the parts of the sentence that have already been accepted or typed by the user. Through the use of confidence measures, the performance of the prediction engine can be improved. About the Speaker: Nicola Ueffing is a graduate research assistant in the group for "Human Language Technology and Pattern Recognition" (Lehrstuhl fuer Informatik VI) at RWTH Aachen University. She received her diploma in mathematics from RWTH Aachen University in 2000. Her research topic is statistical machine translation, focusing on confidence measures for SMT. In 2003, she was a member of the team working on "Confidence Estimation for SMT" at the CLSP workshop at JHU.
DTEND;TZID=America/Los_Angeles:20041217T163000
DTSTART;TZID=America/Los_Angeles:20041217T150000
LOCATION:11 Large
SUMMARY:Word-Level Confidence Measures for SMT
UID:20041217T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: We propose a theory that gives formal semantics to word-level alignments defined over parallel corpora. We use our theory to introduce a linear algorithm that can be used to derive from word-aligned, parallel corpora the minimal set of syntactically motivated transformation rules that explain human translation data. (joint work with Michel Galley, Kevin Knight, and Daniel Marcu)
DTEND;TZID=America/Los_Angeles:20040206T160000
DTSTART;TZID=America/Los_Angeles:20040206T150000
LOCATION:11 Large
SUMMARY:What's in a Translation Rule?
UID:20040206T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will be presenting some recent results of mine regarding the possibility of automatic evaluation in summarization. I will discuss both my own findings as well as those of people here and at Columbia, and attempt to explain in a principled fashion why there are disparate opinions on the plausibility of performing automatic evaluation in this task. I will discuss my (perhaps pessimistic) views on the plausibility of doing any sort of evaluation of summarization, automatic or otherwise. The results and experimental setups developed in connection with summarization will be extended to machine translation. I will review possible reasons why metrics such as BLEU have experienced significantly more success in machine translation than in summarization. I will also connect the evaluation criteria developed in the context of summarization to machine translation, and discuss the automation of these methods. In short: I'll talk about why I've been doing so much data elicitation recently. This will be a highly informal seminar and participation is highly encouraged.
DTEND;TZID=America/Los_Angeles:20040220T160000
DTSTART;TZID=America/Los_Angeles:20040220T150000
LOCATION:4 Large
SUMMARY:Some Results in Automatic Evaluation for Summarization and MT
UID:20040220T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Leading Question-Answering systems employ a variety of means to boost the accuracy of their answers. Such methods include redundancy (getting the same answer from multiple documents/sources), deeper parsing of questions and texts (hence improving the accuracy of confidence measures), inferencing (proving the answer from information in texts plus background knowledge) and sanity-checking (verifying that answers are consistent with known facts). To our knowledge, however, no QA system deliberately asks additional questions in order to derive constraints on the answers to the original questions. We present in this talk the method of QA-by-Dossier-with-Constraints (QDC). This is an extension of the simpler method of QA-by-Dossier, in which definitional questions ("Who/what is X") are addressed by asking a set of questions about anticipated properties of X. In QDC, the collection of Dossier candidate answers, along with possibly other answers to questions asked expressly for this purpose, is required to satisfy a set of naturally arising constraints. For example, for a "Who is X" question, the system will ask about birth, accomplishment, and death dates, which, if they exist, must occur in that order, and must also obey other constraints such as lifespan. Temporal, spatial, and kinship relationships seem to be particularly amenable to this treatment, but it would seem that almost any "factoid" question can benefit from QDC. We will discuss the setting up and application of constraint networks, and talk about how (and whether) to develop the constraint sets automatically. We will demonstrate several applications of QDC, and present one evaluation in which the F-measure for a set of questions improved with QDC from .39 to .69.
DTEND;TZID=America/Los_Angeles:20040116T150000
DTSTART;TZID=America/Los_Angeles:20040116T140000
LOCATION:11 Large
SUMMARY:Using Constraints to Improve Question-Answering Accuracy
UID:20040116T140000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: PropBank: the next stage of Treebank: Natural-language engineers the world over are coming to a consensus that a degree of semantic knowledge is a necessary addition to purely structural representations of language. This talk describes the PropBank project at Penn, which provides a complete shallow semantic parse of the Treebank II corpus. Inducing a Chronology of the Pali Canon: Works such as Kroch (1989), Taylor (1994) and Han (2000) have demonstrated that syntactic change can be described mathematically as the competition between innovating and archaic formations. This paper demonstrates how this same mathematical description can be turned around to predict the date of a historical text. The Middle Indic period showed dramatic change in the morphological system, such as the collapse of the past-tense verbal system. Whereas Sanskrit had three competing formations, each with multiple possible morphological realizations, Pali (a Middle Indo-Aryan language) had only a single formation, based mostly on the sigmatic aorist, although many archaic nonsigmatic aorists are also attested. The proportions of the archaic and innovative forms can be easily calculated for each text in the Pali Canon, and these proportions can be used to assign an approximate date to each text. The accuracy of the method can be assessed qualitatively by comparing the derived chronology to chronologies based on various non-linguistic criteria, or quantitatively by comparing the derived chronology to a known dating scheme. For the latter it is necessary to turn to a different dataset, such as the one describing the rise of do-support in Early Modern English, as described in Ellegard (1953) and Kroch (1989). Bio: Paul Kingsbury graduated summa cum laude in linguistics from Ohio State University in 1993 with a thesis on "Some sources for L-words in Sanskrit". He subsequently entered the University of Pennsylvania to study historical linguistics and Sanskrit, but (like most historical students) was diverted to computational issues. He joined the PropBank project in 2000 and soon thereafter engineered a major rethinking of the methods and goals of the project, in order to make the annotation linguistically meaningful. He completed his doctorate in 2002 with a thesis entitled "The Chronology of the Pali Canon: the case of the aorist".
DTEND;TZID=America/Los_Angeles:20040130T163000
DTSTART;TZID=America/Los_Angeles:20040130T150000
LOCATION:11 Large
SUMMARY:PropBank: the next stage of Treebank <b>and</b><br>Inducing a Chronology of the Pali Canon
UID:20040130T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will present work that extends the standard hidden Markov model to a version that can emit multiple symbols in a single time step. Using this model, we are able to automatically create phrase-to-phrase mappings in an alignment process. I've applied this model to the task of creating alignments between documents and their human-written abstracts, yielding an overall alignment F-score of 0.548, a significant improvement on the best results to date of 0.363. These results are published in an EMNLP paper this year, but the talk will be an extended version of the talk I will give there (namely, I will discuss the mechanics of the extended HMM in more detail in this seminar).
DTEND;TZID=America/Los_Angeles:20040702T150000
DTSTART;TZID=America/Los_Angeles:20040702T133000
LOCATION:11 Large
SUMMARY:A Phrase-Based HMM Approach to Document/Abstract Alignment
UID:20040702T133000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I'll give a survey of trees and grammars, at least the parts that seem most relevant to ongoing work at ISI. This will be a theory talk. I'll start with context-free grammars, which were developed in the 1950s, and cover other tree-generating systems. I'll also talk about tree-transforming systems.
DTEND;TZID=America/Los_Angeles:20040709T163000
DTSTART;TZID=America/Los_Angeles:20040709T150000
LOCATION:11 Large
SUMMARY:Survey of Trees and Grammars
UID:20040709T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
DTEND;TZID=America/Los_Angeles:20040716T163000
DTSTART;TZID=America/Los_Angeles:20040716T150000
LOCATION:11 Large
SUMMARY:Practice Talks for ACL (+workshops)
UID:20040716T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: My presentation will give an overview of recent activities on Chinese-English SMT carried out at ITC-irst (Trento, Italy). After an overview of the complete architecture of our system, I will focus on progress made in Chinese word segmentation, phrase-based modeling and decoding, log-linear modeling and minimum error training, and language model adaptation. Experimental results will be provided in terms of BLEU and NIST scores on two translation tasks: basic traveling expressions and news reports, respectively adopted by the C-STAR consortium and for the 2002 and 2003 NIST MT evaluation campaigns. Bio: Marcello Federico has been a permanent researcher at ITC-irst since 1991. During 1998-2003, he led the "Multilingual natural speech technologies" (MUNST) research line at ITC-irst. Since 2004, he has been head of the "Cross-language information processing" (Hermes) research line. His interests include automatic speech recognition, statistical language modeling, information retrieval, and machine translation.
DTEND;TZID=America/Los_Angeles:20040617T163000
DTSTART;TZID=America/Los_Angeles:20040617T150000
LOCATION:4th Floor
SUMMARY:Statistical Machine Translation at ITC-irst
UID:20040617T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This talk will be about automatic speech-to-speech translation. In our system, a doctor speaks one language, the patient speaks another language, and the machine translates their utterances from one language to the other. The talk will be followed by a demo of our system. One approach we have been successful with is phrase classification, i.e., classifying a noisy speech-recognized utterance into one of many meaning categories. Phrase classification is computationally cheap and can provide high quality translations for in-domain utterances almost instantaneously. Speed is important for speech translation, where processing delay is a great concern. In this talk, different aspects of building a classification-based speech translator are discussed. Following an overview of automatic speech-to-speech translation and its challenges, a comparison of different classification methods is presented and data collection techniques for that application are introduced.
DTEND;TZID=America/Los_Angeles:20040621T160000
DTSTART;TZID=America/Los_Angeles:20040621T150000
LOCATION:11 Large
SUMMARY:Speech-to-Speech Translation: A Phrase Classification Approach
UID:20040621T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Tree-based probability models of translation have been proposed to take advantage of parse trees on one, both, or neither side of a parallel corpus. I will present comparative results for these three approaches on the task of word alignment on Chinese-English and French-English data, as well as some analysis of what is going on behind the numbers.
DTEND;TZID=America/Los_Angeles:20040625T160000
DTSTART;TZID=America/Los_Angeles:20040625T150000
LOCATION:11 Large
SUMMARY:Syntactic Supervision and Tree-Based Alignment
UID:20040625T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
DTEND;TZID=America/Los_Angeles:20040312T163000
DTSTART;TZID=America/Los_Angeles:20040312T150000
LOCATION:11 Large
SUMMARY:About My Thesis Proposal
UID:20040312T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The Scamseek project aims to build a surveillance tool for identifying financial scams on the Internet by performing document classification of Internet pages. There are three principal types of documents of concern: those that give financial advice by unregistered advisors, unlawful investment schemes, and share ramping. The first phase of the project has been completed and a working system, known as ScamAlert, installed at the Australian Securities and Investments Commission (ASIC). The independent audit of the performance of the system proved satisfactory, with a precision of .75, recall of .43, and F=.54, along with identification of 4 scams misclassified by the client. Significant improvement in recall is foreshadowed in the 2nd phase of the project. The results are satisfying in the context of the structure of the data, where the density of scam documents is about 1.8% of the total corpus. The good performance of the operational system is ascribed to the combination of using a strong linguistic model of language (Systemic Functional Linguistics) to define the scam documents in parallel with a rich statistical analysis of the structure of non-scam documents and scam look-alikes. A large amount of the experimental program has concentrated on understanding and exploiting the interaction between the linguistically described aspects of the documents and the statistical properties. Each type of data has been used to inform and modify the usage of the other. The operational aspects of the project have proven to be as challenging as the research objectives. The project has a budget of $2.2M over 15 months. It has been managed so as to create a balance in resources between the needs of both the research objectives and the engineering objectives. Software development has concentrated on three aspects: firstly, producing an environment for the strong directive management of computational linguistics experiments; secondly, creating tools to support the linguists' manual analysis; and thirdly, applying best-practice software engineering principles to ensure a clean automated rollout of the production system for ASIC. The contributing partners in the Scamseek project are The Capital Markets Co-operative Research Centre (CMCRC), ASIC, the University of Sydney and Macquarie University.
DTEND;TZID=America/Los_Angeles:20040325T120000
DTSTART;TZID=America/Los_Angeles:20040325T103000
LOCATION:11 Large
SUMMARY:ScamSeek: Capturing Financial Scams at the Coalface by Language Technology
UID:20040325T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This talk will survey results of several recent projects we have been undertaking in automated text categorization based upon the style, rather than the topic, of the documents. I will describe a general text-categorization framework using machine learning, along with general principles for choosing stylistically relevant sets of features for learning effective classification models. Applications of these methods include determining author gender and text genre in published books and articles, authorship attribution of email messages, and analysis of language use in different scientific fields. In many cases, the models that are learned also give some insight into the respective styles being distinguished, which I will also discuss. Shlomo Argamon is an associate professor at the Illinois Institute of Technology in Chicago.
DTEND;TZID=America/Los_Angeles:20040326T150000
DTSTART;TZID=America/Los_Angeles:20040326T133000
LOCATION:11 Large
SUMMARY:On Writing, Our Selves: Explorations in Stylistic Text Categorization
UID:20040326T133000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: <b>Natural Language Understanding: A fast and accurate Statistical Learning Approach for Dialogue Systems</b> Natural Language Understanding (NLU) is an essential module of a good dialogue system. To achieve satisfactory performance levels, real-time dialogue systems need the NLU module to be both fast and accurate. Finite State Model (FSM) based systems are fast and accurate but lack robustness and flexibility. Statistical Learning Model (SLM) based systems are robust and flexible but lack accuracy and are often slow. In this talk, I am going to present an SLM-based NLU approach for dialogue utterances that is both accurate and fast. The system has high accuracy and produces frames in real time. <b>A Community of Words: Understanding Social Relationships from E-mail</b> A corpus of e-mail messages presents a number of challenges for NLP techniques, with its nearly unconstrained structure and vocabulary, mistyped words and ungrammatical sentences, and extensive contextual information that is never explicitly stated. Yet, the intrinsically social nature of such communication provides an opportunity to study not just a bag of words, but also the relationships, competencies, and activities behind them. This talk presents work with Eduard Hovy as part of the MKIDS project.
DTEND;TZID=America/Los_Angeles:20040521T163000
DTSTART;TZID=America/Los_Angeles:20040521T150000
LOCATION:11 Large
SUMMARY:Statistical Learning for Dialogue System <b>and</b> A Community of Words
UID:20040521T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In recent years a standard model in statistical machine translation has emerged, which is based on the translation of sequences of words (so-called "phrases") at a time. I will describe this model and how to train and decode with it, but the focus of this talk will be how to address the challenges of advancing and moving beyond the model: my thesis work on noun phrase translation, making use of syntax, and better modeling, such as discriminative training. Bio: Philipp Koehn is the author of papers on natural language processing, machine translation, and machine learning. He received his PhD from the University of Southern California in 2003 (advisor: Kevin Knight), and is currently employed as a postdoc at the Massachusetts Institute of Technology, working with Michael Collins. He has worked at AT&T Laboratories on text-to-speech systems, and at WhizBang! Labs on text categorization.
DTEND;TZID=America/Los_Angeles:20040524T170000
DTSTART;TZID=America/Los_Angeles:20040524T160000
LOCATION:11 Large
SUMMARY:Challenges in Statistical Machine Translation
UID:20040524T160000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The ABC (Assess by Computer) system has been developed and used in the School of Computer Science at the University of Manchester for formative and (principally) summative assessment at undergraduate and postgraduate level. We believe that fully automatic marking of constructed answers - especially free-text answers - is not a sensible aim. Instead - drawing on parallels in the history of machine translation - we take a "human-computer collaborative" approach, in which the system does what it can to support the efficiency and consistency of the human marker, who keeps the final judgement. Our current work focuses on what are generally referred to as "short text answers" as contrasted with "essays". However, we prefer to contrast "factual" with "discursive" answers, and speculate that the former may be amenable to simple statistical techniques, while the latter require more sophisticated natural language analysis. I will show some examples of real exam data and the techniques we are using and developing to handle them.
DTEND;TZID=America/Los_Angeles:20041105T163000
DTSTART;TZID=America/Los_Angeles:20041105T150000
LOCATION:11 Large
SUMMARY:A Human-Computer Collaborative Approach to Computer Aided Assessment
UID:20041105T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Textual data is everywhere, in email and scientific papers, in online newspapers and e-commerce sites. The Web contains more than 200 terabytes of text, not even counting the contents of dynamic textual databases. This enormous source of knowledge is seriously underexploited. Textual documents on the Web are very hard to model computationally: they are mostly unstructured, time-dependent, collectively authored, multilingual, and of uneven importance. Traditional grammar-based techniques don't scale up to address such problems. Novel representations and analytical tools are needed. I will discuss several current projects at Michigan related to text mining from a variety of genres. Depending on the amount of time, I will talk about (a) lexical centrality for multidocument summarization, (b) syntax-based sentence alignment, (c) graph-based classification, (d) lexical models of Web growth, and (e) mining protein interactions from scientific papers. As it turns out, the right representations, when complemented with traditional NLP and IR techniques, turn many of these into instances of better-studied problems in areas such as social networks, statistical mechanics, sequence analysis, and computational phylogenetics. About the Speaker: Dragomir R. Radev is Assistant Professor of Information, Electrical Engineering and Computer Science, and Linguistics at the University of Michigan, Ann Arbor. He leads the CLAIR (Computational Linguistics And Information Retrieval) group, which currently includes 12 undergraduate and graduate students. Dragomir holds a Ph.D. in Computer Science from Columbia University. Before joining Michigan, he was a Research Staff Member at IBM's TJ Watson Research Center in Hawthorne, NY. He is the author of more than 45 papers on information retrieval, text summarization, graph models of the Web, question answering, machine translation, text generation, and information extraction. Dr. Radev's current research on probabilistic and link-based methods for exploiting very large textual repositories, representing and acquiring knowledge of genome regulation, and semantic entity and relation extraction from Web-scale text document collections is supported by NSF and NIH. Dragomir serves on the HLT-NAACL advisory committee, was recently reelected as treasurer of NAACL, is a member of the editorial boards of JAIR and Information Retrieval, and is a four-time finalist at the ACM international programming finals (as contestant in 1993 and as coach in 1995-1997). Dragomir received a graduate teaching award at Columbia and, recently, the University of Michigan award for Outstanding Research Mentorship (UROP).
DTEND;TZID=America/Los_Angeles:20041112T163000
DTSTART;TZID=America/Los_Angeles:20041112T150000
LOCATION:11 Large
SUMMARY:Words, links, and patterns: novel representations for Web-scale text mining
UID:20041112T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: In this talk, I'll present the investigation I have been carrying out at ISI recently under Daniel Marcu's supervision. Following the noisy-channel framework, we propose a statistical model for learning the argument structures of verbs automatically. We show that we are able to learn both lexicalized and generalized structures and achieve good results, relying only on basic NLP tools like a POS tagger and a named-entity recognizer. We also present a comparison of the structures we learn with those predicted in PropBank.
DTEND;TZID=America/Los_Angeles:20041115T163000
DTSTART;TZID=America/Los_Angeles:20041115T150000
LOCATION:8th floor multipurpose room (#849) -- NOT the conference room
SUMMARY:Unsupervised learning of verb argument structures
UID:20041115T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: As DARPA's TIDES (Translingual Information Detection, Extraction, and Summarization) program comes to an end, I will give a summary of what we have learned from TIDES in summarization and a brief overview of our current effort in developing automatic evaluation methods that go beyond surface n-gram matching. Topics to be covered: (1) Summary of DUCs 2001 - 2004; (2) Automatic Evaluations in Summarization and MT; (3) Basic Elements - New Efforts in Summarization at ISI.
DTEND;TZID=America/Los_Angeles:20041119T163000
DTSTART;TZID=America/Los_Angeles:20041119T150000
LOCATION:11 Large
SUMMARY:After TIDES, What's Left? - Finding Basic Elements
UID:20041119T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: As part of an effort to encode the commonsense knowledge we need in natural language understanding, I have been looking at several very common words and their uses in diverse corpora, and asking what we have to know to understand each word in its context. In this talk, I will describe investigations of the uses of two words -- the adverb "now" and the preposition "like". One might think that "now" simply expresses a temporal property of an event. But in fact, in almost every instance, it is used to point up a contrast -- "This is true now. Something else was true then." It is thus more of a relation than a property. I will describe several categories of such relations. Another question of interest about "now" is "How long a period is the word 'now' describing in its various uses?": "I'm typing an abstract now" vs. "We travel by automobile now." I suggest some categories of knowledge that need to be encoded to answer this question. When we successfully understand "A is like B", we have figured out some property that A and B have in common. How can we find that property computationally? In the data I looked at, in 80% of the instances the property is explicit in the nearby text, and I will talk about how we can identify it. For the remainder, I examine the knowledge we would need in order to infer the common property.
DTEND;TZID=America/Los_Angeles:20041022T163000
DTSTART;TZID=America/Los_Angeles:20041022T150000
LOCATION:11 Large
SUMMARY:Like Now: Two Explorations in Deep Lexical Semantics
UID:20041022T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This summer we held a three-month workshop on syntax-driven machine translation, in which we learned syntactic transformations automatically from Chinese/English translated corpora and applied them to translate new text. We'll give a progress report!
DTEND;TZID=America/Los_Angeles:20040910T163000
DTSTART;TZID=America/Los_Angeles:20040910T150000
LOCATION:11 Large
SUMMARY:About Syntax Fest 2004 (Part I)
UID:20040910T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: This summer we held a three-month workshop on syntax-driven machine translation, in which we learned syntactic transformations automatically from Chinese/English translated corpora and applied them to translate new text. We'll give a progress report!
DTEND;TZID=America/Los_Angeles:20040917T163000
DTSTART;TZID=America/Los_Angeles:20040917T150000
LOCATION:11 Large
SUMMARY:About Syntax Fest 2004 (Part II)
UID:20040917T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I will present some preliminary results on the problem of domain adaptation in maximum entropy models, specifically in the case when there is a large amount of "out of domain" data and only a very small amount of "in domain" data. The model and algorithms I present are based on the technique of conditional Expectation Maximization (CEM) and allow for relatively fast optimization of these models. Preliminary results on some tasks are quite promising.
DTEND;TZID=America/Los_Angeles:20040924T163000
DTSTART;TZID=America/Los_Angeles:20040924T150000
LOCATION:11 Large
SUMMARY:Domain Adaptation in Maximum Entropy Models
UID:20040924T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
DTEND;TZID=America/Los_Angeles:20050408T163000
DTSTART;TZID=America/Los_Angeles:20050408T150000
LOCATION:11 Large
SUMMARY:Search Engines for HLT Applications
UID:20050408T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: I am going to be talking about work I have been doing over the past 6-9 months. This includes randomized algorithms and their application to two NLP problems: noun clustering and noun-pair clustering. I will also comment on my experience of working with very large amounts of real natural language text (this includes processing and working with data available from the web; this corpus is not the standard newspaper text that we are so used to in the NLP community). This talk will also cover a large part of my thesis work.
DTEND;TZID=America/Los_Angeles:20050422T163000
DTSTART;TZID=America/Los_Angeles:20050422T150000
LOCATION:11 Large
SUMMARY:Working with Large Corpora: High-Speed Clustering and its Applications
UID:20050422T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Test collections for information retrieval tasks have traditionally assumed that what we are searching for are documents (e.g., Web pages, news stories, or academic documents). Most information that is generated is, however, not originally generated as part of a document, but rather as what we might refer to as "conversational media" (e.g., email, speech, or instant messaging). In this talk, I'll describe the creation of two test collections for conversational media: an email collection being created in the TREC Enterprise Search track and a spoken word test collection for the Cross-Language Evaluation Forum (CLEF). I'll spend most of the talk describing the details of the CLEF test collection, illustrating the issues with some of the results that we have obtained from our experiments with that collection. I'll conclude with a few remarks about the implications of what we are learning for DARPA's new GALE program. This is joint work with Charles University, the IBM TJ Watson Research Center, the Johns Hopkins University, the Survivors of the Shoah Visual History Foundation, and the University of West Bohemia. About the speaker: Douglas Oard is an Associate Professor at the University of Maryland, College Park, with a joint appointment in the College of Information Studies and the Institute for Advanced Computer Studies. He holds a Ph.D. in Electrical Engineering from the University of Maryland, and his research interests center around the use of emerging technologies to support information seeking by end users. In 2002 and 2003, Doug spent a year in paradise here at USC-ISI. His recent work has focused on interactive techniques for cross-language information retrieval and on searching conversational text and speech. Additional information is available at http://www.glue.umd.edu/~oard/.
DTEND;TZID=America/Los_Angeles:20050805T163000
DTSTART;TZID=America/Los_Angeles:20050805T150000
LOCATION:11 Large
SUMMARY:The CLEF Cross-Language Speech Retrieval Test Collection
UID:20050805T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: The Prague Dependency Treebank project is aimed at a linguistically complex, multi-tier annotation of relatively large amounts of naturally occurring sentences of natural language. There are four tiers at present: the basic token tier (level 0), and the morphological, surface-syntactic, and semantic (called "tectogrammatics") tiers. The syntactic and tectogrammatic tiers are based on a richly labelled dependency representation principle. So far, the project has produced three corpora: the Czech-language-only Prague Dependency Treebank, the Prague Czech-English Dependency Treebank and the Prague Arabic Dependency Treebank. In the talk, the principles of the Prague Dependency Treebank linguistic annotation scheme will be presented. Some technical details will also be discussed, as well as some of the tools developed both for the manual annotation itself and for corpus-based NLP of Czech, English and Arabic.
DTEND;TZID=America/Los_Angeles:20050805T120000
DTSTART;TZID=America/Los_Angeles:20050805T103000
LOCATION:11 Large
SUMMARY:The Family of Prague Dependency Treebanks
UID:20050805T103000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: 3:30pm Mark Hopkins (UCLA)
Tree Sequence Automata: A Unifying Framework for Tree Relation Formalisms
There exist a wide variety of competing formalisms for representing a language of ordered tree pairs. These include (bottom-up and top-down) tree transducers, synchronous tree-substitution grammars (STSGs), synchronous tree-adjoining grammars (STAGs), and inversion transduction grammars (ITGs). Since these formalisms have all developed independently of one another, it is difficult to compare their respective representational power. This work seeks to make this task simpler by viewing these formalisms as instances of a general unifying formalism, which we call tree sequence automata (TSA). By casting these different formalisms in a single framework, we can compare them directly by studying the specific subclass of TSA that they fall into.
4:00pm Jason Riesa (Johns Hopkins)
A case study in building a cost-effective speech-to-speech machine translation system with sparse resources: English - Iraqi Arabic
The Arabic spoken dialect of Iraq is a language deprived of the vast resources that researchers enjoy when working with its written counterpart, Modern Standard Arabic (MSA). The Iraqi Arabic lexicon and grammar are also sufficiently distinct that the use of existing tools or corpora for MSA yields little or no positive effect on machine translation output quality. One can see that building a machine translation system normally dependent on a large parallel corpus is a particularly difficult task when given just a 37,000-line translated parallel text based on transcribed speech. This talk will explore the constraints involved in working with this type of data, how we endeavored to mitigate such problems as a non-standard orthography and a highly inflected grammar, and propose a cost-effective way of dealing with such projects in the future.
4:30pm Preslav Nakov (UC Berkeley)
Multilingual Word Alignment
Recently there has been a growing number of available multilingual parallel texts. One such source is the European Union, which publishes its official documents in the official languages of all member states (sometimes also in the languages of the candidates). Another source is the United Nations. These corpora are a great source of training data for machine translation between new language pairs. But they also offer the opportunity to obtain better pairwise word alignments by looking at multiple languages in parallel. In this talk I will present my research as a summer intern at ISI on getting better French (Fr) to English (En) word alignments using an additional language (Xx). First, I will introduce two heuristics which start with pairwise alignments between Fr-Xx, En-Xx and Fr-En and then combine them probabilistically (in a linear model) or graph-theoretically (by looking at in- and out-degrees for each word). Then I will present two Model 1 inspired alignment models: (a) from "Fr and Xx" to En; and (b) from Fr to "En and Xx".
DTEND;TZID=America/Los_Angeles:20050824T170000
DTSTART;TZID=America/Los_Angeles:20050824T153000
LOCATION:11 Large
SUMMARY:Summer Student Presentations
UID:20050824T153000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: 3:00pm Victoria Fossum (Michigan)
Exploring the Continuum between Phrase-based and Syntax-based Machine Translation
State-of-the-art statistical machine translation systems use lexical phrases as the basic unit of translation. Phrase-based systems can capture those aspects of translation that are sensitive to local context. Syntax-based systems, on the other hand, make use of linguistically motivated syntactic structure, can capture long-distance dependencies and reorderings, and offer greater generalization in translation rules. However, their performance lags behind that of phrase-based systems. Hierarchical phrase-based translation, introduced by [Chiang 05], provides an elegant framework for exploring the continuum between phrase-based and syntax-based translation. This system combines the "formal machinery" of syntax-based systems without any "linguistic commitment" to a particular syntactic structure [Chiang 05]. I will present results from my re-implementation of Chiang's hierarchical phrase-based system, and (if time permits) compare those results with the following systems on Chinese-English translation: ISI's phrase-based system, and ISI's syntax-based system. Between now and December 2005, I plan to incrementally explore the space between phrase-based and syntax-based systems by augmenting these hierarchical phrase-based rules with richer syntactic annotation.
3:30pm Liang Huang (Penn) and Hao Zhang (Rochester)
Efficient Integration of n-gram Language Models with Syntax-based Decoding
We first give an overview of the ISI syntax-based MT system, which is based on tree-to-string (xRs) translation rules. The biggest problem at this stage is the inefficiency of the integration of n-gram models. Without n-gram models, the xRs translation rules can be easily binarized with respect to the foreign language to ensure cubic-time decoding. With n-gram models, however, binarization without considering both languages will lead to exponential complexity. Inspired by Inversion Transduction Grammar (ITG) (Wu, 97), we will focus on the so-called ITG-binarizable rules, which account for over 99% of the whole rule set. A simple linear-time algorithm will be presented to do the binarization. Decoding with ITG-like rules is of low polynomial complexity in both time and space. We will discuss experimental results on both efficiency and accuracy of decoding with the new binarization. If time permits, we will also present the "hook trick" (inspired by (Eisner and Satta, 99)) to further reduce the polynomial complexity of the decoding process.
DTEND;TZID=America/Los_Angeles:20050826T163000
DTSTART;TZID=America/Los_Angeles:20050826T150000
LOCATION:11 Large
SUMMARY:Summer Student Presentations
UID:20050826T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: Ranked lists of output trees from syntactic statistical NLP applications frequently contain multiple repeated entries. This redundancy leads to misrepresentation of tree weight and reduced information for debugging and tuning purposes. It is chiefly due to nondeterminism in the weighted automata that produce the results. I will introduce an algorithm that determinizes such automata while preserving proper weights, returning the sum of the weights of all multiply derived trees. I will also report results of the application of the algorithm to machine translation and Data-Oriented Parsing.
DTEND;TZID=America/Los_Angeles:20051216T163000
DTSTART;TZID=America/Los_Angeles:20051216T150000
LOCATION:11 Large
SUMMARY:A Better N-Best List - Practical Determinization of Weighted Finite Tree Automata
UID:20051216T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:
DTEND;TZID=America/Los_Angeles:20050211T163000
DTSTART;TZID=America/Los_Angeles:20050211T150000
LOCATION:11 Large
SUMMARY:Unsupervised Word Sense Disambiguation Using Wordnet Relatives
UID:20050211T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: (Note that this is a MONDAY!)
DTEND;TZID=America/Los_Angeles:20050214T163000
DTSTART;TZID=America/Los_Angeles:20050214T150000
LOCATION:11 Large
SUMMARY:Collecting Broad-Coverage Knowledge Bases from Volunteers
UID:20050214T150000@NL
URL:http://www.isi.edu/natural-language/nl-seminar
END:VEVENT
BEGIN:VEVENT
DESCRIPTION: TBA
DTEND;TZID=America/Los_Angeles:20050218T163000
DTSTART;TZID=America/Los_Angeles:20050218T150000
LOCATION:11 Large
SUMMARY:TBA