<html>
<head>
<title>HKCanCor 香港粵語語料庫</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta name="revised" content="Luke Kang Kwong, 20/10/2007, 24/10/2014" />
<style type='text/css'>
body {background:#FFFFF0; }
</style>
</head>
<body>
<h1 class='text-center'>Hong Kong Cantonese Corpus (HKCanCor)
<br><small>香港粵語語料庫</small></h1>
<p class='text-center'>
<strong>Prof. Luke Kang Kwong</strong> (<a href='mailto:[email protected]'>[email protected]</a>)
</p>
<hr>
<h2>Introduction 簡介</h2>
<p>The Hong Kong Cantonese Corpus (HKCanCor) was compiled from transcribed conversations recorded between March 1997 and August 1998. The annotated corpus contains about 230,000 Chinese words. It comprises recordings of spontaneous speech (51 texts) and radio programmes (42 texts), each involving 2 to 4 speakers, along with 1 monologue text. The texts were word-segmented and annotated with part-of-speech tags and Cantonese pronunciation, using the romanisation scheme of the Linguistic Society of Hong Kong (LSHK).</p>
<p>香港粵語語料庫收錄了在1997年3月至1998年8月期間修錄整理的粵語談話內容。本語料庫修錄了93段2至4人的對話(其中51段是在交談時錄音所得﹐另外42段由電台節目剪輯而成)和1段個人獨白﹐修錄詞語合計約有230,000個。語料以詞語作單位切分﹐並標上發音及詞性。發音部份使用了香港語言學學會(LSHK)的粵拼標準。</p>
<h3>Downloads 下載</h3>
<ul>
<li><a href="data/format_v.pdf">Corpus Specification 語料庫格式</a></li>
<li><a href="data/README">README</a></li>
<li><a href="data/LICENSE">LICENSE (CC BY)</a></li>
<li><u>Monologue 口述</u><br>
<a href="sample/m.mp3">Recording 錄音 (mp3)</a> /
<a href="sample/m_v.txt">Tagged 標注 (big5-hkscs)</a> /
<a href="sample/m_h.txt">Non-tagged 無標注 (big5-hkscs)</a>
</li>
<li><u>Dialogue 1 對話1</u><br>
<a href="sample/d1.mp3">Recording 錄音 (mp3)</a> /
<a href="sample/d1_v.txt">Tagged 標注 (big5-hkscs)</a> /
<a href="sample/d1_h.txt">Non-tagged 無標注 (big5-hkscs)</a>
</li>
<li><u>Dialogue 2 對話2</u><br>
<a href="sample/d2.mp3">Recording 錄音 (mp3)</a> /
<a href="sample/d2_v.txt">Tagged 標注 (big5-hkscs)</a> /
<a href="sample/d2_h.txt">Non-tagged 無標注 (big5-hkscs)</a>
</li>
<li><u>Radio 1 收音機1</u><br>
<a href="sample/r1.mp3">Recording 錄音 (mp3)</a> /
<a href="sample/r1_v.txt">Tagged 標注 (big5-hkscs)</a> /
<a href="sample/r1_h.txt">Non-tagged 無標注 (big5-hkscs)</a>
</li>
<li><u>Radio 2 收音機2</u><br>
<a href="sample/r2.mp3">Recording 錄音 (mp3)</a> /
<a href="sample/r2_v.txt">Tagged 標注 (big5-hkscs)</a> /
<a href="sample/r2_h.txt">Non-tagged 無標注 (big5-hkscs)</a>
</li>
<li><u>Full Text 完整文件</u><br>
<a href="data/hkcancor-big5hkscs.zip">BIG5-HKSCS (zipped, transcriptions only, no sound)</a>
<br><a href="data/hkcancor-utf8.zip">UTF8 (zipped, transcriptions only, no sound)</a>
</li>
</ul>
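<p>The sample transcriptions listed above are encoded in Big5-HKSCS, so they generally need to be decoded explicitly before further processing. The following is a minimal Python sketch, assuming that Python's standard <code>big5hkscs</code> codec covers the characters used in these files. The function name <code>convert_to_utf8</code> and the output file name are hypothetical and not part of the corpus distribution.</p>
<pre>
```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="big5hkscs"):
    """Decode a Big5-HKSCS text file and write it back out as UTF-8.

    Returns the decoded text so the caller can inspect it directly.
    """
    # Read the raw file, decoding it with the (assumed) source encoding.
    text = Path(src).read_text(encoding=src_encoding)
    # Write the same text back out as UTF-8.
    Path(dst).write_text(text, encoding="utf-8")
    return text
```
</pre>
<p>For example, <code>convert_to_utf8("m_v.txt", "m_v.utf8.txt")</code> would re-encode a downloaded copy of the tagged monologue sample. The UTF8 archive listed under Full Text avoids this step altogether.</p>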
<h3>References</h3>
<ul>
<li>K. K. Luke and May L. Y. Wong (2015).
<a href='data/LukeWong_Hong-Kong-Cantonese-Corpus.pdf'>The
Hong Kong Cantonese Corpus: Design and Uses</a>.
<i>Journal of Chinese Linguistics</i> (to appear).</li>
</ul>
<h3>Links</h3>
<ul>
<li><a href='https://github.com/pycantonese/pycantonese'>PyCantonese:
Working with Cantonese corpus data using Python</a> by Jackson L. Lee</li>
</ul>
<h3>Tagset</h3>
<p>See the paper listed under References for a full description of the tagset.</p>
<table border>
<tr>
<th>No.</th>
<th>Tag</th>
<th>POS (in Chinese)</th>
<th>POS (in English)</th>
</tr>
<tr>
<td>1</td>
<td>Ag</td>
<td>形语素</td>
<td>Adjective Morpheme</td>
</tr>
<tr>
<td>2</td>
<td>a</td>
<td>形容词</td>
<td>Adjective</td>
</tr>
<tr>
<td>3</td>
<td>ad</td>
<td>副形词</td>
<td>Adjective as Adverbial</td>
</tr>
<tr>
<td>4</td>
<td>an</td>
<td>名形词</td>
<td>Adjective with Nominal Function</td>
</tr>
<tr>
<td>5</td>
<td>Bg</td>
<td>区别语素</td>
<td>Non-predicate Adjective Morpheme</td>
</tr>
<tr>
<td>6</td>
<td>b</td>
<td>区别词</td>
<td>Non-predicate Adjective</td>
</tr>
<tr>
<td>7</td>
<td>c</td>
<td>连词</td>
<td>Conjunction</td>
</tr>
<tr>
<td>8</td>
<td>Dg</td>
<td>副语素</td>
<td>Adverb Morpheme</td>
</tr>
<tr>
<td>9</td>
<td>d</td>
<td>副词</td>
<td>Adverb</td>
</tr>
<tr>
<td>10</td>
<td>e</td>
<td>叹词</td>
<td>Interjection</td>
</tr>
<tr>
<td>11</td>
<td>f</td>
<td>方位词</td>
<td>Directional Locality</td>
</tr>
<tr>
<td>12</td>
<td>g</td>
<td>语素</td>
<td>Morpheme</td>
</tr>
<tr>
<td>13</td>
<td>h</td>
<td>前接成分</td>
<td>Prefix</td>
</tr>
<tr>
<td>14</td>
<td>i</td>
<td>成语</td>
<td>Idiom</td>
</tr>
<tr>
<td>15</td>
<td>j</td>
<td>简略语</td>
<td>Abbreviation</td>
</tr>
<tr>
<td>16</td>
<td>k</td>
<td>后接成分</td>
<td>Suffix</td>
</tr>
<tr>
<td>17</td>
<td>l</td>
<td>习用语</td>
<td>Fixed Expression</td>
</tr>
<tr>
<td>18</td>
<td>Mg</td>
<td>数语素</td>
<td>Numeric Morpheme</td>
</tr>
<tr>
<td>19</td>
<td>m</td>
<td>数词</td>
<td>Numeral</td>
</tr>
<tr>
<td>20</td>
<td>Ng</td>
<td>名语素</td>
<td>Noun Morpheme</td>
</tr>
<tr>
<td>21</td>
<td>n</td>
<td>名词</td>
<td>Common Noun</td>
</tr>
<tr>
<td>22</td>
<td>nr</td>
<td>人名</td>
<td>Personal Name</td>
</tr>
<tr>
<td>23</td>
<td>ns</td>
<td>地名</td>
<td>Place Name</td>
</tr>
<tr>
<td>24</td>
<td>nt</td>
<td>机构团体</td>
<td>Organisation Name</td>
</tr>
<tr>
<td>25</td>
<td>nx</td>
<td>外文字符</td>
<td>Nominal Character String</td>
</tr>
<tr>
<td>26</td>
<td>nz</td>
<td>其它专名</td>
<td>Other Proper Noun</td>
</tr>
<tr>
<td>27</td>
<td>o</td>
<td>拟声词</td>
<td>Onomatopoeia</td>
</tr>
<tr>
<td>28</td>
<td>p</td>
<td>介词</td>
<td>Preposition</td>
</tr>
<tr>
<td>29</td>
<td>Qg</td>
<td>量语素</td>
<td>Classifier Morpheme</td>
</tr>
<tr>
<td>30</td>
<td>q</td>
<td>量词</td>
<td>Classifier</td>
</tr>
<tr>
<td>31</td>
<td>Rg</td>
<td>代语素</td>
<td>Pronoun Morpheme</td>
</tr>
<tr>
<td>32</td>
<td>r</td>
<td>代词</td>
<td>Pronoun</td>
</tr>
<tr>
<td>33</td>
<td>s</td>
<td>处所词</td>
<td>Space Word</td>
</tr>
<tr>
<td>34</td>
<td>Tg</td>
<td>时间语素</td>
<td>Time Word Morpheme</td>
</tr>
<tr>
<td>35</td>
<td>t</td>
<td>时间词</td>
<td>Time Word</td>
</tr>
<tr>
<td>36</td>
<td>Ug</td>
<td>助语素</td>
<td>Auxiliary Morpheme</td>
</tr>
<tr>
<td>37</td>
<td>u</td>
<td>助词</td>
<td>Auxiliary</td>
</tr>
<tr>
<td>38</td>
<td>Vg</td>
<td>动语素</td>
<td>Verb Morpheme</td>
</tr>
<tr>
<td>39</td>
<td>v</td>
<td>动词</td>
<td>Verb</td>
</tr>
<tr>
<td>40</td>
<td>vd</td>
<td>副动词</td>
<td>Verb as Adverbial</td>
</tr>
<tr>
<td>41</td>
<td>vn</td>
<td>名动词</td>
<td>Verb with Nominal Function</td>
</tr>
<tr>
<td>42</td>
<td>w</td>
<td>标点符号</td>
<td>Punctuation</td>
</tr>
<tr>
<td>43</td>
<td>x</td>
<td>非语素字</td>
<td>Unclassified Item</td>
</tr>
<tr>
<td>44</td>
<td>Yg</td>
<td>语气语素</td>
<td>Modal Particle Morpheme</td>
</tr>
<tr>
<td>45</td>
<td>y</td>
<td>语气词</td>
<td>Modal Particle</td>
</tr>
<tr>
<td>46</td>
<td>z</td>
<td>状态词</td>
<td>Descriptive</td>
</tr>
</table>
</body>
</html>