add text classification models #24
@@ -0,0 +1 @@
.DS_Store
@@ -0,0 +1,2 @@
*.pptx
**Comment:** Why do pptx and pdf files need to be ignored?

**Reply:** The pptx and pdf files are temporary files produced in my own directory while drawing the figures, so they are ignored.

**Reply:** I plan to move all temporary files into a dedicated temp directory.

**Comment:** Please do not commit your own temporary files to the repository. This .gitignore can be removed.
@@ -1 +1,119 @@
TBD
# Text Classification

Text classification is a common task in machine learning: given a piece of text, the goal is to predict the category it belongs to. In this example, we train binary-classification DNN and CNN models on the labeled IMDB corpus to perform simple text classification.
**Comment:** Add a sentence or two here introducing what kind of dataset IMDB is. That would be friendlier to users who are unfamiliar with this task.

**Reply:** IMDB is introduced later in the "Experimental Data" section. To avoid confusing readers at this point, I plan to drop "IMDB" here and just say "a labeled corpus".
## Experimental Data

The experiments in this example are run on the IMDB dataset. The IMDB dataset contains 50,000 movie reviews from the IMDb (Internet Movie Database) website, each labeled as a positive or negative review. The dataset is split into train and test parts of 25,000 reviews each, with a roughly 1:1 ratio of positive to negative samples. Samples are given as raw English text.
**Comment:** The "b" is lowercase; the full name is Internet Movie Database.
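The dataset ships with PaddlePaddle's dataset package. A minimal sketch of peeking at the data, assuming the reader can be iterated directly, as the script below does with `paddle.dataset.imdb.test(word_dict)`:

```python
import paddle.v2 as paddle

word_dict = paddle.dataset.imdb.word_dict()  # maps each word to an integer id
print len(word_dict)                         # vocabulary size

# each sample is a (word-id sequence, label) pair with label 0 or 1
for ids, label in paddle.dataset.imdb.train(word_dict):
    print ids[:10], label
    break
```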
## DNN Model

#### The structure of the DNN model is shown in the figure below:
**Comment:** Line 7 is
<p align="center">
<img src="images/dnn_net.png" width = "90%" align="center"/><br/>
Figure 1. DNN model for text classification
</p>

**Reply:** done
#### As shown, the model consists of the following parts:

- **Embedding layer**: IMDB samples consist of raw English words. To make training easier, each word must first be mapped to a fixed-dimension vector through an embedding.
**Comment:** Terminology should be consistent: the last two layers are named in Chinese, while embedding and max pooling are in English. Use 词向量层 (word embedding layer) and 最大池化层 (max pooling layer) respectively. Same throughout the document.
- **Max pooling layer**: max pooling is performed over the time dimension. Pooling removes the differences in length (word count) between samples and extracts the maximum value at each position of the word vectors (see the small numeric sketch after this list). After pooling, each sample is reduced to a single fixed-dimension vector.
**Comment:** "extracts the maximum value at each position of the word vectors" — "each position" needs an explanation here; novice users will be confused.
- **Fully connected hidden layers**: the vector produced by max pooling is fed through two hidden layers, with full connections between the layers.
**Comment:** Line 22 says the vector is fed into a "DNN model" (this can be dropped or reworded): the whole network is already a DNN, so saying it is fed into a DNN model again may make novice users think it is recursive.
- **Output layer**: the number of neurons in the output layer equals the number of classes; in a binary classification problem, for example, the output layer has 2 neurons. The Softmax activation guarantees that the neuron outputs sum to 1, so the output of the i-th neuron can be read as the predicted probability that the sample belongs to class i.
**Comment:** The Softmax activation normalizes the outputs of the output-layer neurons into a probability distribution that sums to 1.

**Reply:** done
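To make the pooling and Softmax bullets concrete, a small numeric sketch (illustration only, using numpy rather than PaddlePaddle):

```python
import numpy as np

# a 3-word sample with 4-dimensional word vectors, one row per word
seq = np.array([[0.1, 0.5, 0.2, 0.9],
                [0.7, 0.3, 0.8, 0.1],
                [0.2, 0.6, 0.4, 0.5]])

# max pooling over time: take the largest value at each of the 4 vector
# positions, collapsing a sequence of any length into one 4-d vector
pooled = seq.max(axis=0)                        # array([0.7, 0.6, 0.8, 0.9])

# softmax turns arbitrary output scores into probabilities summing to 1
scores = np.array([1.0, -1.0])
probs = np.exp(scores) / np.exp(scores).sum()   # approx. [0.88, 0.12]
```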
#### The code implementing this DNN structure in Paddle is as follows:
```python
import paddle.v2 as paddle


def fc_net(input_dim, class_dim=2, emb_dim=256):
    # input layers
    data = paddle.layer.data(
        "word", paddle.data_type.integer_value_sequence(input_dim))
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))

    # embedding layer
    emb = paddle.layer.embedding(input=data, size=emb_dim)
    # max pooling
    seq_pool = paddle.layer.pooling(
        input=emb, pooling_type=paddle.pooling.Max())

    # two hidden layers
    hd1 = paddle.layer.fc(
        input=seq_pool,
        size=128,
        act=paddle.activation.Tanh(),
        param_attr=paddle.attr.Param(initial_std=0.01))
    hd2 = paddle.layer.fc(
        input=hd1,
        size=32,
        act=paddle.activation.Tanh(),
        param_attr=paddle.attr.Param(initial_std=0.01))

    # output layer
    output = paddle.layer.fc(
        input=hd2,
        size=class_dim,
        act=paddle.activation.Softmax(),
        param_attr=paddle.attr.Param(initial_std=0.1))

    cost = paddle.layer.classification_cost(input=output, label=lbl)

    return cost, output
```
By default this DNN model performs binary classification on the input corpus (`class_dim=2`), the embedding dimension defaults to 256 (`emb_dim=256`), and both hidden layers use the Tanh activation (`act=paddle.activation.Tanh()`).
**Comment:** In line 67 the second pair of parentheses is half-width (English) while the rest of the document uses full-width (Chinese) parentheses; please make them consistent.
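A usage sketch (the vocabulary size 10000 is made up for illustration):

```python
# build the cost and prediction layers for a 10,000-word vocabulary
cost, prediction = fc_net(input_dim=10000, class_dim=2, emb_dim=256)
```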
Note that the input to this model is a sequence of integers rather than raw English words. In practice, for convenience, words are mapped to integer ids beforehand according to their frequency rank, i.e. each word is replaced by an integer. This step is generally done outside the DNN model.
**Comment:** "each word is replaced by an integer" → add "that is, by the word's index in the dictionary".

**Reply:** done
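A minimal sketch of this frequency-ordered id mapping (hypothetical helper; in this example the mapping is actually provided by `paddle.dataset.imdb.word_dict()`):

```python
import collections

def build_word_dict(corpus):
    # count word frequencies over the whole corpus
    counter = collections.Counter(
        word for sentence in corpus for word in sentence.split())
    # frequent words get small ids
    return {w: i for i, (w, _) in enumerate(counter.most_common())}

word_dict = build_word_dict(["the cat sat on the mat", "the dog barked"])
print word_dict["the"]   # 0: the most frequent word gets the smallest id
```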
## CNN Model
**Comment:** Lines 71 and 73
#### The structure of the CNN model is shown in the figure below:
<p align="center">
<img src="images/cnn_net.png" width = "90%" align="center"/><br/>
Figure 2. CNN model for text classification
</p>

#### As shown, the model consists of the following parts:
- **Embedding layer**: plays the same role as the embedding in the DNN, mapping English words to fixed-dimension vectors. As shown in Figure 2, each word vector produced by the embedding is treated as a row vector, and the row vectors of all the words in a sample are stacked into a matrix. For example, with embedding_size=5, the sentence "The cat sat on the red mat" contains 7 words, so the resulting matrix is 7×5.
- **Convolution layer**: convolution for text classification runs along the time dimension: the kernel width equals the width of the embedding matrix, and the kernel slides along the height of the matrix. If the kernel height is h, the matrix height is N, and the stride is 1, the convolution yields a feature map that is a vector of height N+1-h (see the worked example after this list). Several kernels of different heights can be used at the same time, producing multiple feature maps.
**Comment:** "feature map" can also be written with a Chinese term.
- **Max pooling layer**: max pooling is applied to each feature map separately. Since a feature map is already a vector, max pooling here simply selects the largest element of each vector. The selected maxima are then concatenated into a new vector, whose dimension clearly equals the number of feature maps, i.e. the number of convolution kernels.
- **Fully connected and output layer**: the max pooling result is passed through a fully connected layer to the output. As in the DNN model, the number of output neurons equals the number of classes, and the outputs sum to 1.
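A worked example of the shapes (numpy sketch, independent of the PaddlePaddle code below): one kernel of height h=3 slides over the 7×5 embedding matrix from Figure 2 and is then max-pooled:

```python
import numpy as np

N, emb_size, h = 7, 5, 3                  # 7 words, width 5, kernel height 3
matrix = np.random.randn(N, emb_size)     # stacked word vectors
kernel = np.random.randn(h, emb_size)     # kernel width = embedding width

# slide the kernel down the matrix with stride 1: N + 1 - h = 5 positions
feature_map = np.array(
    [np.sum(matrix[i:i + h] * kernel) for i in range(N + 1 - h)])
assert feature_map.shape == (5,)          # one value per kernel position

pooled = feature_map.max()                # max pooling leaves one scalar
# with 256 kernels, concatenating the pooled scalars gives a 256-d vector
```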
#### The code implementing this CNN structure in Paddle is as follows:
```python
import paddle.v2 as paddle


def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
    # input layers
    data = paddle.layer.data(
        "word", paddle.data_type.integer_value_sequence(input_dim))
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))

    # embedding layer
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    # convolution layers with max pooling
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)

    # fc and output layer
    output = paddle.layer.fc(
        input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())

    cost = paddle.layer.classification_cost(input=output, label=lbl)

    return cost, output
```
The input data type of this CNN network is the same as that of the DNN introduced above. `paddle.networks.sequence_conv_pool` is a prepackaged Paddle module for text sequence convolution with pooling. Its `context_len` parameter sets how many words the kernel covers at a time, i.e. the kernel height in Figure 2, and `hidden_size` sets the number of kernels of that size. The structure defined above therefore uses 128 kernels of size 3 and 128 kernels of size 4; their outputs are max-pooled and concatenated into a 256-dimensional vector, which passes through a fully connected layer to produce the final prediction.
**Comment:** Paddle → PaddlePaddle. Every occurrence of "Paddle" in the prose should be changed to "PaddlePaddle".
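For instance, a sketch of widening the model with a third branch of size-5 kernels (hypothetical variation, not in this PR) would just add another call and feed it to the output layer:

```python
# hypothetical extra branch: 128 kernels covering 5 words at a time
conv_5 = paddle.networks.sequence_conv_pool(
    input=emb, context_len=5, hidden_size=hid_dim)
output = paddle.layer.fc(
    input=[conv_3, conv_4, conv_5], size=class_dim,
    act=paddle.activation.Softmax())
```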
@@ -0,0 +1,131 @@
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys
import paddle.v2 as paddle
import gzip


def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
    # input layers
    data = paddle.layer.data(
        "word", paddle.data_type.integer_value_sequence(input_dim))
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))
    # embedding layer
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    # convolution layers with max pooling
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)

    # fc and output layer
    output = paddle.layer.fc(
        input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())

    cost = paddle.layer.classification_cost(input=output, label=lbl)
**Comment:** Could you help add more than one evaluator to this configuration? For example, besides an evaluator that calculates the error rate, is it possible to add a precision-recall evaluator? I hope to test a configuration with more than one evaluator. Thanks for your work.

**Reply:** done, added auc evaluator
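Assuming the v2 API exposes the evaluators mentioned here (names and signatures are my guess, not verified against this PR), attaching them after the cost layer might look roughly like:

```python
# sketch: extra evaluators reported alongside the cost during train/test
paddle.evaluator.auc(input=output, label=lbl)
paddle.evaluator.classification_error(input=output, label=lbl)
```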
    return cost, output


def train_cnn_model(num_pass):
    # load word dictionary
    print 'load dictionary...'
    word_dict = paddle.dataset.imdb.word_dict()

    dict_dim = len(word_dict)
    class_dim = 2
    # define data reader
    train_reader = paddle.batch(
        paddle.reader.shuffle(
            lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
        batch_size=100)
    test_reader = paddle.batch(
        lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
    # network config
    [cost, _] = convolution_net(dict_dim, class_dim=class_dim)
    # create parameters
    parameters = paddle.parameters.create(cost)
    # create optimizer
    adam_optimizer = paddle.optimizer.Adam(
        learning_rate=2e-3,
        regularization=paddle.optimizer.L2Regularization(rate=8e-4),
        model_average=paddle.optimizer.ModelAverage(average_window=0.5))

    # create trainer
    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=adam_optimizer)

    # Define end batch and end pass event handler
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
                print "\nPass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)
            else:
                sys.stdout.write('.')
                sys.stdout.flush()
        if isinstance(event, paddle.event.EndPass):
            result = trainer.test(reader=test_reader, feeding=feeding)
            print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
            with gzip.open("cnn_params.tar.gz", 'w') as f:
**Comment:** It is better to save models according to the pass number; otherwise new models will overwrite the old ones, making model selection impossible.
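A minimal sketch of the reviewer's suggestion (hypothetical filename pattern):

```python
# save one snapshot per pass so earlier models are kept for selection
with gzip.open("cnn_params_pass_%05d.tar.gz" % event.pass_id, 'w') as f:
    parameters.to_tar(f)
```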
                parameters.to_tar(f)

    # begin training network
    feeding = {'word': 0, 'label': 1}
    trainer.train(
        reader=train_reader,
        event_handler=event_handler,
        feeding=feeding,
        num_passes=num_pass)

    print("Training finished.")


def cnn_infer():
    print("Begin to predict...")

    word_dict = paddle.dataset.imdb.word_dict()
    dict_dim = len(word_dict)
    class_dim = 2

    [_, output] = convolution_net(dict_dim, class_dim=class_dim)
    parameters = paddle.parameters.Parameters.from_tar(
        gzip.open("cnn_params.tar.gz"))

    infer_data = []
    infer_label_data = []
    infer_data_num = 100
**Comment:** The constraint on the number of test samples can be removed.
    for item in paddle.dataset.imdb.test(word_dict):
        infer_data.append([item[0]])
        infer_label_data.append(item[1])
**Comment:** The meaning would be more unambiguous if renamed
        if len(infer_data) == infer_data_num:
            break
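A sketch combining the two comments above (drop the 100-sample cap and use a clearer name; `infer_labels` is my placeholder):

```python
infer_data = []
infer_labels = []   # clearer name for the ground-truth labels
for ids, label in paddle.dataset.imdb.test(word_dict):
    infer_data.append([ids])
    infer_labels.append(label)   # no early break: use the whole test set
```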
    predictions = paddle.infer(
        output_layer=output,
        parameters=parameters,
        input=infer_data,
        field=['value'])
    for i, prob in enumerate(predictions):
        print prob, infer_label_data[i]


if __name__ == "__main__":
    paddle.init(use_gpu=False, trainer_count=10)
    train_cnn_model(num_pass=10)
    cnn_infer()
**Comment:** Let's delete this file for now. Ignoring this macOS file is a bit odd.
**Comment:** This file still hasn't been deleted?