add text classification models #24

Merged · 33 commits · May 17, 2017

Changes from 18 commits

Commits:
100796f - add 'text_classification_cnn.py' and 'text_classification_dnn.py' (JiayiFeng, May 2, 2017)
dfc100a - Merge pull request #1 from Canpio/dev (JiayiFeng, May 2, 2017)
a651dc4 - remove 'import sqlite3' (JiayiFeng, May 2, 2017)
31ef20f - Merge branch 'develop' of https://github.com/PaddlePaddle/models into… (JiayiFeng, May 3, 2017)
046d0a1 - update readme file (JiayiFeng, May 4, 2017)
868d4cb - update readme file (JiayiFeng, May 4, 2017)
dd827a2 - update README.md and add dnn_net.png (JiayiFeng, May 4, 2017)
20c2ec9 - update README.md (JiayiFeng, May 4, 2017)
77abc3c - change dnn_net.png (JiayiFeng, May 4, 2017)
0619165 - update dnn_net.png (JiayiFeng, May 4, 2017)
da0eb4e - add .gitignore (JiayiFeng, May 4, 2017)
5a4ad99 - update README.md (JiayiFeng, May 5, 2017)
0a39b52 - update README.md (JiayiFeng, May 5, 2017)
1b01f6a - update README.md (JiayiFeng, May 5, 2017)
78531d8 - update README.md (JiayiFeng, May 5, 2017)
55dc1a3 - add cnn_net.png and update README.md (JiayiFeng, May 5, 2017)
ac67d5a - add commits in cnn net (JiayiFeng, May 5, 2017)
947e70d - add comments in cnn net (JiayiFeng, May 5, 2017)
763615e - finish README.md (JiayiFeng, May 9, 2017)
304df71 - finish README.md (JiayiFeng, May 9, 2017)
225979c - fix problems finded in review (JiayiFeng, May 9, 2017)
3a712e0 - change README.md (JiayiFeng, May 9, 2017)
f18b572 - add and test auc evaluator (JiayiFeng, May 10, 2017)
ae6a883 - Merge pull request #2 from Canpio/dev (JiayiFeng, May 10, 2017)
18a4bd9 - remove copyright statement (JiayiFeng, May 10, 2017)
344a9a4 - add section of 'self-define data reader' into README.md (JiayiFeng, May 12, 2017)
5a59fcd - update README.md (JiayiFeng, May 12, 2017)
55a88ad - update README.md and fine-tune network hyper-params to avoid over-fit… (JiayiFeng, May 12, 2017)
b0071ff - update README.md (JiayiFeng, May 15, 2017)
631ffdd - remove .gitignore (JiayiFeng, May 16, 2017)
e6cfcf1 - update README.md and renew dnn_net.png (JiayiFeng, May 16, 2017)
f46dcb1 - read->red (JiayiFeng, May 17, 2017)
6ff3565 - read->red (JiayiFeng, May 17, 2017)
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.DS_Store
Collaborator: Let's delete this file for now. Ignoring this Mac-specific file here seems a bit odd.

Contributor: Hasn't this file been deleted yet?

2 changes: 2 additions & 0 deletions text_classification/.gitignore
@@ -0,0 +1,2 @@
*.pptx
*.pdf
Contributor: Why do pptx and pdf files need to be ignored?

Collaborator (author): The pptx and pdf files are temporary files produced in my own directory while making the figures, so they are ignored.

Collaborator (author): I plan to put all temporary files into a dedicated temp directory.

Contributor: Please do not commit your personal temporary files to the repository. This .gitignore can be removed.

120 changes: 119 additions & 1 deletion text_classification/README.md
@@ -1 +1,119 @@
TBD
# Text Classification
Text classification is a common machine learning task whose goal is to decide, from the content of a piece of text, which category that text belongs to. In this example we train binary-classification DNN and CNN models on a labeled IMDB corpus to perform simple text classification.
Collaborator: Use one or two sentences here to introduce what kind of dataset IMDB is. That would be friendlier to readers unfamiliar with this task.

Collaborator (author): The IMDB introduction appears later, in the "Experimental Data" section. To avoid confusing readers at this point, I plan to drop "IMDB" here and just say "a labeled corpus".


## Experimental Data
The experiments in this example run on the IMDB dataset. IMDB contains 50,000 movie reviews from the IMDb (Internet Movie Database) website, each labeled as a positive or negative review. The dataset is split into train and test halves of 25,000 reviews each, with a roughly 1:1 ratio of positive to negative samples. Samples are given as raw English text.
Contributor:
  1. "The IMDB dataset contains ... from IMDb" (is the b lowercase?)
  2. A link to the dataset could be added.

Collaborator (author): The b is lowercase; the full name is Internet Movie Database.


## DNN Model

#### The DNN model structure is shown in the figure below:
Contributor: Line 7 is ##; line 9 should be ###, and likewise below. Also, "The DNN model structure is shown in the figure below" is not suitable as a heading; same below.


<p align="center">
<img src="images/dnn_net.png" width = "90%" align="center"/><br/>
Figure 1. DNN text classification model
Contributor: This image shows a visible border; please re-upload a version with a fully white background. (screenshot attached)

Collaborator (author): done

</p>

#### As the figure shows, the model consists of the following parts:

- **Embedding layer**: IMDB samples consist of raw English words. To make training convenient, each word must first be converted into a fixed-dimension vector through an embedding.
Contributor: Terminology should be consistent. The last two layer names are in Chinese, so "embedding" and "max pooling" should also be in Chinese: 词向量层 (word-embedding layer) and 最大池化层 (max-pooling layer). Throughout the document.


- **Max pooling**: Max pooling is applied over the time dimension. Pooling removes the differences in word count between samples and extracts, for each index position of the word vectors, the maximum value seen across the sequence (a short sketch follows this list). After pooling, each sample becomes a single fixed-dimension vector.
Contributor: In "extracts the maximum value at each index position of the word vectors", "each index position" needs an explanation; novice readers will be confused.


- **Fully connected hidden layers**: The vector produced by max pooling is fed into a DNN model with two hidden layers; the layers are fully connected to each other.
Contributor: Line 22 says the vector is fed into a "DNN model" (this can be dropped, or another word used). The whole network is itself a DNN model, so saying it is fed into a DNN model may make novice readers think there is recursion.



- **Output layer**: The number of output neurons equals the number of classes; for binary classification, the output layer has 2 neurons. The Softmax activation ensures the outputs of the output neurons sum to 1, so the output of the i-th neuron can be read as the predicted probability that the sample belongs to class i.
Collaborator: The Softmax activation normalizes the output-layer neurons' outputs into a probability distribution summing to 1.

Collaborator (author): done
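
As a quick illustration of the pooling and Softmax steps above, here is a minimal NumPy sketch (not part of this PR; the shapes are invented for the example):

```python
import numpy as np

# Toy "embedded sentence": 7 words, each a 5-dimensional word vector.
emb = np.random.randn(7, 5)

# Max pooling over time: for each of the 5 vector index positions,
# take the maximum across all 7 words, yielding one fixed-size vector.
pooled = emb.max(axis=0)  # shape: (5,)

# Softmax: normalize raw scores into a probability distribution.
scores = np.array([1.2, -0.3])  # e.g. raw outputs of the 2 output neurons
probs = np.exp(scores) / np.exp(scores).sum()
assert abs(probs.sum() - 1.0) < 1e-9
```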


#### The code implementing this DNN structure in Paddle is as follows:
```python
import paddle.v2 as paddle


def fc_net(input_dim, class_dim=2, emb_dim=256):
    # input layers
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_dim))

    # embedding layer
    emb = paddle.layer.embedding(input=data, size=emb_dim)
    # max pooling
    seq_pool = paddle.layer.pooling(
        input=emb, pooling_type=paddle.pooling.Max())

    # two hidden layers
    hd1 = paddle.layer.fc(
        input=seq_pool,
        size=128,
        act=paddle.activation.Tanh(),
        param_attr=paddle.attr.Param(initial_std=0.01))
    hd2 = paddle.layer.fc(
        input=hd1,
        size=32,
        act=paddle.activation.Tanh(),
        param_attr=paddle.attr.Param(initial_std=0.01))

    # output layer
    output = paddle.layer.fc(
        input=hd2,
        size=class_dim,
        act=paddle.activation.Softmax(),
        param_attr=paddle.attr.Param(initial_std=0.1))

    cost = paddle.layer.classification_cost(input=output, label=lbl)

    return cost, output
```
By default, this DNN model performs binary classification on the input corpus (`class_dim=2`), the embedding dimension defaults to 256 (`emb_dim=256`), and both hidden layers use the Tanh activation function (`act=paddle.activation.Tanh()`).
Contributor: The second parenthesis on line 67 is a half-width (English) parenthesis while the rest of the text uses full-width Chinese parentheses; please make them consistent.
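
For example, the network above could be instantiated like this (a hypothetical usage sketch; `dict_dim` stands for the vocabulary size):

```python
dict_dim = 10000  # vocabulary size, normally len(word_dict)
cost, prediction = fc_net(input_dim=dict_dim, class_dim=2, emb_dim=256)
```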


Note that this model's input is a sequence of integers rather than the raw sequence of English words. In practice, for convenience, the words are usually id-ized beforehand in order of word frequency, i.e., each word is replaced by an integer. This step is generally done outside the DNN model (a minimal sketch follows the comments below).
Collaborator: "i.e., each word is replaced by an integer" --> "i.e., each word is replaced by an integer, namely the word's index in the dictionary,"

Collaborator (author): done
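
A minimal sketch of that id-ization step (plain Python on a toy whitespace-tokenized corpus; not part of this PR):

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]

# Count word frequencies and assign ids in descending-frequency order.
freq = Counter(word for line in corpus for word in line.split())
word_dict = {w: i for i, (w, _) in enumerate(freq.most_common())}

# A sentence becomes the integer sequence the model actually consumes.
ids = [word_dict[w] for w in "the cat sat".split()]  # e.g. [0, 2, 1]
```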


## CNN Model
Contributor: Lines 71 and 73 need a space after ##, otherwise the headings cannot render.


#### The CNN model structure is shown in the figure below:

<p align="center">
<img src="images/cnn_net.png" width = "90%" align="center"/><br/>
Figure 2. CNN text classification model
</p>

#### As the figure shows, the model consists of the following parts:

- **Embedding layer**: Same as the embedding in the DNN: it maps English words to fixed-dimension vectors. As shown in Figure 2, each word's embedding is treated as a row vector, and the row vectors of all words in a sample are stacked into a matrix. For example, with embedding_size=5, the sentence "The cat sat on the red mat" contains 7 words, so the resulting matrix has dimensions 7*5.

- **Convolution layer**: In text classification, convolution runs over the time dimension: the kernel width equals the width of the embedding matrix, and the kernel slides along the matrix's height. If the kernel height is h, the matrix height is N, and the convolution stride is 1, the resulting feature map is a vector of height N+1-h. Multiple kernels of different heights can be used at the same time, producing multiple feature maps (a worked shape example follows this list).
Contributor: "feature map" could also be expressed with a Chinese term.


- **Max pooling**: Max pooling is applied to each feature map separately. Since each feature map is already a vector, max pooling here simply selects the largest element of each vector. The maxima are then concatenated into a new vector whose dimension obviously equals the number of feature maps, i.e., the number of convolution kernels.

- **Fully connected and output layer**: The max-pooling result is passed through a fully connected layer to the output. As in the DNN model, the number of output neurons equals the number of classes, and the outputs sum to 1.
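
To make the convolution arithmetic above concrete, here is a toy NumPy sketch (assumptions invented for the example: N=7 words, embedding width 5, kernel height h=3, stride 1):

```python
import numpy as np

N, width, h = 7, 5, 3  # sentence length, embedding size, kernel height
sentence = np.random.randn(N, width)
kernel = np.random.randn(h, width)

# Slide the kernel along the height (time) dimension with stride 1.
feature_map = np.array(
    [(sentence[i:i + h] * kernel).sum() for i in range(N - h + 1)])
assert feature_map.shape == (N + 1 - h,)  # here: (5,)
```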

#### The code implementing this CNN structure in Paddle is as follows:

```python
import paddle.v2 as paddle


def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
    # input layers
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))

    # embedding layer
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    # convolution layers with max pooling
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)

    # fc and output layer
    output = paddle.layer.fc(
        input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())

    cost = paddle.layer.classification_cost(input=output, label=lbl)

    return cost, output
```

This CNN network takes the same type of input data as the DNN described earlier. `paddle.networks.sequence_conv_pool` is Paddle's prepackaged text-sequence convolution module with pooling. Its `context_len` parameter specifies the length of text a kernel covers at a time, i.e., the kernel height in Figure 2, and `hidden_size` specifies the number of kernels of that type. The structure defined above thus uses 128 kernels of size 3 and 128 kernels of size 4; after max pooling and concatenation, their outputs form a 256-dimensional vector, which a fully connected layer maps to the final prediction.
Contributor: Paddle -> PaddlePaddle: every occurrence of "Paddle" in the prose should be changed to "PaddlePaddle".
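
Usage mirrors the DNN model (a hypothetical sketch; `dict_dim` again stands for the vocabulary size):

```python
dict_dim = 10000  # normally len(word_dict)
cost, prediction = convolution_net(input_dim=dict_dim, class_dim=2)
# prediction is the 2-way Softmax output; cost is what the trainer optimizes.
```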

Binary file added text_classification/images/cnn_net.png
Binary file added text_classification/images/dnn_net.png
131 changes: 131 additions & 0 deletions text_classification/text_classification_cnn.py
@@ -0,0 +1,131 @@
# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys
import paddle.v2 as paddle
import gzip


def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
    # input layers
    data = paddle.layer.data("word",
                             paddle.data_type.integer_value_sequence(input_dim))
    lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))

    # embedding layer
    emb = paddle.layer.embedding(input=data, size=emb_dim)

    # convolution layers with max pooling
    conv_3 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=3, hidden_size=hid_dim)
    conv_4 = paddle.networks.sequence_conv_pool(
        input=emb, context_len=4, hidden_size=hid_dim)

    # fc and output layer
    output = paddle.layer.fc(
        input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())

    cost = paddle.layer.classification_cost(input=output, label=lbl)
Collaborator: Could you add more than one evaluator to this configuration? For example, besides the evaluator that calculates the error rate, would it be possible to add a precision-recall evaluator? I would like to test a configuration with more than one evaluator. Thanks for your work.

Collaborator (author): done, added auc evaluator
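
For reference, the added evaluator presumably looks something like the line below (a sketch assuming the paddle.v2 evaluator API; the actual follow-up commit may differ):

    # sketch: also track AUC during training, alongside the classification cost
    paddle.evaluator.auc(input=output, label=lbl)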


    return cost, output


def train_cnn_model(num_pass):
    # load word dictionary
    print 'load dictionary...'
    word_dict = paddle.dataset.imdb.word_dict()

    dict_dim = len(word_dict)
    class_dim = 2
    # define data reader
    train_reader = paddle.batch(
        paddle.reader.shuffle(
            lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
        batch_size=100)
    test_reader = paddle.batch(
        lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)

    # network config
    [cost, _] = convolution_net(dict_dim, class_dim=class_dim)
    # create parameters
    parameters = paddle.parameters.create(cost)
    # create optimizer
    adam_optimizer = paddle.optimizer.Adam(
        learning_rate=2e-3,
        regularization=paddle.optimizer.L2Regularization(rate=8e-4),
        model_average=paddle.optimizer.ModelAverage(average_window=0.5))

    # create trainer
    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=adam_optimizer)

    # define end-batch and end-pass event handler
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
                print "\nPass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)
            else:
                sys.stdout.write('.')
                sys.stdout.flush()
        if isinstance(event, paddle.event.EndPass):
            result = trainer.test(reader=test_reader, feeding=feeding)
            print "\nTest with Pass %d, %s" % (event.pass_id, result.metrics)
            with gzip.open("cnn_params.tar.gz", 'w') as f:
Collaborator (@lcy-seso, May 4, 2017): It would be better to save models according to the pass number; otherwise new models overwrite the old ones, making model selection impossible.
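
A minimal sketch of that suggestion (hypothetical; not what this diff does):

            # e.g. keep one archive per pass instead of overwriting one file:
            #     with gzip.open("cnn_params_pass_%05d.tar.gz" % event.pass_id,
            #                    'w') as f:
            #         parameters.to_tar(f)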

                parameters.to_tar(f)

    # begin training network
    feeding = {'word': 0, 'label': 1}
    trainer.train(
        reader=train_reader,
        event_handler=event_handler,
        feeding=feeding,
        num_passes=num_pass)

    print("Training finished.")


def cnn_infer():
    print("Begin to predict...")

    word_dict = paddle.dataset.imdb.word_dict()
    dict_dim = len(word_dict)
    class_dim = 2

    [_, output] = convolution_net(dict_dim, class_dim=class_dim)
    parameters = paddle.parameters.Parameters.from_tar(
        gzip.open("cnn_params.tar.gz"))

    infer_data = []
    infer_label_data = []
    infer_data_num = 100
Collaborator: The constraint on the number of test samples can be removed.

    for item in paddle.dataset.imdb.test(word_dict):
        infer_data.append([item[0]])
        infer_label_data.append(item[1])
Collaborator: The meaning would be less ambiguous if infer_label_data were renamed to infer_data_label.

        if len(infer_data) == infer_data_num:
            break

    predictions = paddle.infer(
        output_layer=output,
        parameters=parameters,
        input=infer_data,
        field=['value'])
    for i, prob in enumerate(predictions):
        print prob, infer_label_data[i]


if __name__ == "__main__":
    paddle.init(use_gpu=False, trainer_count=10)
    train_cnn_model(num_pass=10)
    cnn_infer()