ltr case done. #31
Please make sure the travis-ci checks pass before code review.
ltr/lambdaRank.py
Outdated
# two hidden layers
hd1 = paddle.layer.fc(
    name="/hidden_1",
Please drop the `name` argument from the layers; an official example demo should set a good practice. Same for the layers below, please remove it.
Done.
@qingqing01 The dataset is somewhat different from the original one; this change is acceptable.
ltr/lambdaRank.py
Outdated
    size=1,
    act=paddle.activation.Linear(),
    param_attr=paddle.attr.Param(initial_std=0.01, name="output"))
cost = paddle.layer.lambda_cost(input=output,
This uses lambda_cost, which differs from the configuration in https://github.com/lcy-seso/paddle_confs_v1/blob/master/ltr/listwise_ltr.conf. @lcy-seso, is that OK?
The interface has changed: in the referenced example each item maps to one score, whereas the new interface uses a sequence-based structure.
ltr/lambdaRank.py
Outdated
train_reader = paddle.batch(
    paddle.reader.shuffle(fill_default_train, buf_size=1000), batch_size=1000)
test_reader = paddle.batch(
    paddle.reader.shuffle(fill_default_test, buf_size=1000), batch_size=1000)
The test set does not need paddle.reader.shuffle.
Fixed.
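For reference, the point about not shuffling the test set can be sketched without PaddlePaddle. The `shuffle` and `batch` functions below are simplified stand-ins written for this illustration, not Paddle's implementations, and `toy_reader` is hypothetical data. Shuffling wraps only the training reader, so evaluation batches keep a stable order.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Simplified stand-in for a shuffling reader decorator (not Paddle's code)."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                # Shuffle and flush one full buffer at a time.
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the last, partially filled buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item

    return shuffled_reader


def batch(reader, batch_size):
    """Group consecutive samples into lists of at most batch_size items."""

    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            yield current

    return batched_reader


def toy_reader():
    # Hypothetical data: ten integer "samples".
    for i in range(10):
        yield i


# Training: shuffle, then batch. Evaluation: batch only, so order is stable.
train_reader = batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)
test_reader = batch(toy_reader, batch_size=3)
```

Iterating `test_reader()` always yields the samples in their original order, while `train_reader()` yields the same samples in a permuted order.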
ltr/ranknet.py
Outdated
# ranknet is the classic pairwise learning to rank algorithm
# http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf

def half_ranknet(name_prefix, input_dim):
Same issue as in the configuration above.
Fixed.
ltr/ranknet.py
Outdated
if isinstance(event, paddle.event.EndIteration):
    if event.batch_id % 100 == 0:
        print "Pass %d Batch %d Cost %.9f" % (
            event.pass_id, event.batch_id, event.cost)
Only the cost is reported, so it is unclear how to evaluate the model.
OK. Due to time constraints an NDCG evaluator has not been implemented; I will add an AUC evaluator.
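The travis-ci checks keep failing, so this cannot be merged. The commit did not go through pre-submit, so the code was not formatted.
@dzhwinter Did you forget to push? I only see one commit and no updated code.
Done. The comment got filtered out, sorry.
The evaluator still has not been added: adding the AUC layer triggers an "auc index out of range" error. More importantly, NDCG@k and ERR@k are the basic metrics for ranking quality. I plan to add an NDCG evaluator soon.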
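For reference, NDCG@k for a single query can be sketched in plain Python. This assumes the common formulation with gain 2^rel - 1 and a log2 position discount; the `ndcg` function in the PR's metrics.py is not shown in this thread and may differ. The names `dcg_at_k` and `ndcg_at_k` are chosen for this illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, relevances, k):
    """NDCG@k for one query: rank documents by predicted score, then normalize."""
    # Indices of documents sorted by descending predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    # The ideal ranking sorts documents by their true relevance.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked, k) / ideal_dcg
```

A model that orders documents exactly by relevance scores 1.0; any inversion among the top k lowers the value.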
I suggest splitting train and infer into two steps and adding the infer process to the documentation.
ltr/README.md
Outdated

## RankNet Ranking Model

[RankNet](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf) is a classic Pairwise learning-to-rank method and a typical feed-forward neural network ranking model. The i-th document in the document set S is denoted `Ui` and its feature vector `xi`. For a given document pair `<Ui, Uj>`, RankNet maps a single document feature vector x to `f(x)`, giving `si=f(xi), sj=f(xj)`. The probability that `Ui` is more relevant than Uj is denoted Pij, then
Math symbols such as `Ui`, Uj are formatted inconsistently: some have backticks and some do not. Please follow the formula convention: $U_i$
Fixed.
ltr/README.md
Outdated

$$C_{i,j}=-\bar{P_{i,j}}logP_{i,j}-(1-\bar{P_{i,j}})log(1-P_{i,j})$$

where representing the true probability
Should this be: "where $\bar{P_{i,j}}$ represents the true probability:"?
The intended meaning is "where $\bar{P_{i,j}}$ represents the true probability, written as". Fixed.
ltr/README.md
Outdated

At the same time, the gradient information for document Ui during ranking optimization is

\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\frac{1}{1+e^{\sigma (s_{i}-s_{j})}}
Fixed.
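For reference, the pairwise cost and its gradient can be checked numerically. The sketch below assumes the sigmoid form P_ij = 1 / (1 + e^{-sigma (s_i - s_j)}) from the RankNet paper, which is not part of the quoted diff. Under that assumption the analytic derivative carries a leading factor sigma, so it coincides with the formula in the diff when sigma = 1. Function names are chosen for this illustration.

```python
import math


def ranknet_cost(s_i, s_j, S_ij, sigma=1.0):
    """Pairwise cross-entropy C_ij for scores s_i, s_j and label S_ij in {+1, -1}."""
    p_bar = 0.5 * (1.0 + S_ij)                        # target probability
    p = 1.0 / (1.0 + math.exp(-sigma * (s_i - s_j)))  # assumed form of P_ij
    return -p_bar * math.log(p) - (1.0 - p_bar) * math.log(1.0 - p)


def ranknet_lambda(s_i, s_j, S_ij, sigma=1.0):
    """Analytic dC/ds_i; with sigma = 1 this matches the formula in the diff."""
    return sigma * (0.5 * (1.0 - S_ij)
                    - 1.0 / (1.0 + math.exp(sigma * (s_i - s_j))))


def numeric_lambda(s_i, s_j, S_ij, sigma=1.0, eps=1e-6):
    """Central finite difference of the cost with respect to s_i."""
    return (ranknet_cost(s_i + eps, s_j, S_ij, sigma)
            - ranknet_cost(s_i - eps, s_j, S_ij, sigma)) / (2.0 * eps)
```

Comparing `ranknet_lambda` with `numeric_lambda` for a few score pairs is a quick way to confirm that the sign and the terms of the documented gradient are consistent with the cost.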
ltr/README.md
Outdated
Figure 3. RankNet network architecture
</p>

- Fully connected layer: every node in the previous layer is connected to the next layer. This example also uses paddle.layer.fc; note that the fully connected layer feeding the RankCost layer outputs a 1x1 layer structure
Uses paddle.layer.fc ... the dimension of the fully connected layer is 1.
Fixed.
ltr/README.md
Outdated
</p>

- Fully connected layer: every node in the previous layer is connected to the next layer. This example also uses paddle.layer.fc; note that the fully connected layer feeding the RankCost layer outputs a 1x1 layer structure
- RankCost layer: the RankCost layer is the core of the RankNet ranking network. It measures whether docA is more relevant than docB, produces a prediction, and compares it with the label. Cross entropy is used as the loss function and is optimized with gradient descent. For details see [RankNet](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf)[4]
The line needs a period at the end.
Fixed.
ltr/lambdaRank.py
Outdated
train_reader = paddle.batch(
    paddle.reader.shuffle(fill_default_train, buf_size=100), batch_size=32)
test_reader = paddle.batch(
    paddle.reader.shuffle(fill_default_test, buf_size=100), batch_size=32)
The test set does not need shuffle.
Will fix.
Done.
ltr/ranknet.py
Outdated
    paddle.reader.shuffle(paddle.dataset.mq2007.train, buf_size=100),
    batch_size=100)
test_reader = paddle.batch(
    paddle.reader.buffered(paddle.dataset.mq2007.test, size=100),
Why does train_reader not use paddle.reader.buffered while test does?
Because test does not need shuffling, and buffered has a lower barrier for users than a lambda-function interface.
buffered has a specific purpose; it was not designed to avoid a "lambda-function interface". Many examples in the book do not use lambda functions either. If a lambda is really required, then paddle.dataset.mq2007.test is not written correctly.
Oh, will fix it.
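For reference, the reader-creator pattern behind this comment can be sketched in plain Python: a reader is a zero-argument callable that returns an iterator, and a reader creator is a function that returns such a reader. When the dataset is exposed as a reader creator, callers need neither a lambda nor functools.partial. The names `make_pairwise_reader` and `take`, and the in-memory `toy_samples`, are hypothetical and not part of the PR or of PaddlePaddle.

```python
def make_pairwise_reader(samples):
    """Reader creator: returns a zero-argument reader over `samples`.

    `samples` is a hypothetical in-memory list of (label, left_doc, right_doc).
    """

    def reader():
        for label, left_doc, right_doc in samples:
            yield label, left_doc, right_doc

    return reader


def take(reader, n):
    """Toy decorator: a new reader that yields only the first n samples."""

    def limited_reader():
        for index, sample in enumerate(reader()):
            if index >= n:
                break
            yield sample

    return limited_reader


toy_samples = [(1, [0.1, 0.2], [0.0, 0.3]), (-1, [0.4, 0.1], [0.5, 0.9])]
train = make_pairwise_reader(toy_samples)  # no lambda or functools.partial needed
first_only = take(train, 1)
```

Because `train` is a callable rather than a generator object, it can be iterated once per pass, and decorators can wrap it without knowing how the data is produced.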
ltr/ranknet.py
Outdated
    batch_size=100)
test_reader = paddle.batch(
    paddle.reader.buffered(paddle.dataset.mq2007.test, size=100),
    batch_size=100)
Is batch_size=100 an arbitrary choice? It is 100 here but 32 for lambdaRank above.
Because lambdaRank works on lists, it has far fewer samples than RankNet. Otherwise the training progress would not be visible.
import functools
import paddle.v2 as paddle
import numpy as np
from metrics import ndcg
I don't see where ndcg is used.
ndcg cannot be passed in as a function during training.
It is the baseline metric function for ranking; from Python it cannot be passed into the training process.
So it is not used? Please explain the purpose of the functions in metrics.py in the documentation.
Fixed.
Thanks for the recommendation!
ltr/README.md
Outdated

The structure above uses the same model architecture as the preceding figure. Similar to RankNet, it uses a fully connected layer with `hidden_size=10` and a fully connected layer with `hidden_size=1`. In this example, input_dim is the dimension of the dense_vector feature of a **single document**, and label takes the values 1 and -1. Each input sample has the structure label, \<docA, docB\>. Taking docA as an example, the input_dim-dimensional document features are transformed to 10 dimensions and then to 1 dimension, and are finally fed into the LambdaCost layer. Note that label and data here use the **dense_vector_sequence** format, which represents a **sequence** of document scores or document features.

Running `python lambdaRank.py` saves the model of each pass and evaluates it on the test data.
python lambdaRank.py
Put important commands on their own line so they stand out.
Fixed.
The documentation is well written.
ltr/README.md
Outdated

- Pointwise method

The Pointwise method solves the ranking problem by approximating it as a regression problem. Each input sample is a score-document pair: the relevance score of each query-document pair is treated as a real-valued or ordinal score, so that a single query-document pair serves as a sample point (hence the name Pointwise) for training the ranking model. At prediction time, the model outputs the relevance score of the query-document pair for the given input
Make "score-document" bold.
Fixed.
ltr/README.md
Outdated

- Pairwise method

The Pairwise method solves the ranking problem by approximating it as a classification problem. Each input sample is a label-document pair. For the result documents of one query, any two documents are combined into a document pair that serves as an input sample. In other words, a binary classifier is learned: for an input document pair AB (hence the name Pairwise), it outputs the label +1 or -1 depending on whether A is more relevant than B. Classifying all document pairs yields a set of partial-order relations, from which the ranking of the whole document set is constructed. The principle is that, for a given document set S, reducing the number of inverted document pairs reduces ranking errors and thereby improves the ranking result.
Make "label-document pair" bold.
Fixed.
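For reference, building Pairwise samples from the documents of one query can be sketched as follows. The function `make_pairs` and the toy data are hypothetical, and skipping pairs with equal relevance is an assumption of this sketch rather than something stated in the README.

```python
from itertools import combinations


def make_pairs(docs):
    """Turn one query's (relevance, features) list into (label, doc_a, doc_b) samples.

    The label is +1 when doc_a is more relevant than doc_b and -1 when it is
    less relevant. Ties are skipped, which is an assumption of this sketch.
    """
    pairs = []
    for (rel_a, feat_a), (rel_b, feat_b) in combinations(docs, 2):
        if rel_a == rel_b:
            continue
        label = 1 if rel_a > rel_b else -1
        pairs.append((label, feat_a, feat_b))
    return pairs


# Hypothetical query with three documents and graded relevance 2, 0, 1.
query_docs = [(2, [0.9, 0.1]), (0, [0.2, 0.7]), (1, [0.5, 0.5])]
pairs = make_pairs(query_docs)
```

A query with n documents yields at most n(n-1)/2 pairs, which is consistent with the later remark in this review that the Pairwise model sees far more samples than the Listwise one.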
ltr/README.md
Outdated

- Listwise method

The Listwise method optimizes the ranked list directly; each input sample is a document permutation. A suitable metric function is constructed to measure the gap between the current document ranking and the optimal ranking, and the ranking model is obtained by optimizing that metric function. Because many metric functions are non-continuous, optimization is difficult.
Make "document permutation" bold.
Fixed.
ltr/README.md
Outdated

For the RankNet model, enter on the command line:

`python ranknet.py`
Mind the formatting:
python ranknet.py
Fixed.
ltr/README.md
Outdated
- Fully connected layer: every node in the previous layer is connected to the next layer. This example also uses paddle.layer.fc; note that the fully connected layer feeding the RankCost layer has dimension 1.
- RankCost layer: the RankCost layer is the core of the RankNet ranking network. It measures whether docA is more relevant than docB, produces a prediction, and compares it with the label. Cross entropy is used as the loss function and is optimized with gradient descent. For details see [RankNet](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf)[4].

Because the network structure in Pairwise is left-right symmetric, only half of the network needs to be defined; the other half shares the network parameters. The example code that defines the network structure of the RankNet ranking model in PaddlePaddle is as follows:
"The other half shares the network parameters": please explain how parameters are shared in PaddlePaddle.
Fixed.
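For reference, the idea of two symmetric branches sharing one set of parameters can be illustrated without any framework. This is a conceptual sketch with a linear scorer and hypothetical weights; how the PR expresses sharing in PaddlePaddle is not visible in the quoted diff, so nothing here should be read as Paddle's API.

```python
import math


def make_scorer(weights):
    """Return a linear scoring function that closes over one weight list."""

    def score(features):
        return sum(w * x for w, x in zip(weights, features))

    return score


def pair_probability(score, doc_a, doc_b):
    """P(doc_a ranks above doc_b), using the same scorer for both branches."""
    return 1.0 / (1.0 + math.exp(-(score(doc_a) - score(doc_b))))


shared_weights = [0.5, -0.25]        # hypothetical parameters
score = make_scorer(shared_weights)  # both "halves" call this one function

p_ab = pair_probability(score, [1.0, 0.0], [0.0, 1.0])
p_ba = pair_probability(score, [0.0, 1.0], [1.0, 0.0])
```

Because both branches read the same weights, swapping the two documents gives the complementary probability, and any update to the weights affects both halves at once.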
ltr/README.md
Outdated

To run it, users only need to execute the command:

`python lambdaRank.py`
Use code formatting.
Fixed.
ltr/README.md
Outdated

`python ranknet.py`

This automatically downloads the data, trains the RankNet model, saves the model of each pass, and evaluates it on the test data.
Please explain what this script covers: training and prediction? What does the prediction result look like?
Added the prediction part.
ltr/ranknet.py
Outdated
elif infer_query_id != query_id:
    break
infer_data.append(feature_vector)
predicitons = paddle.infer(
Inference only returns predicitons, with no explanation or printed output, so it is unclear what predicitons is.
Added more helper information.
ltr/lambdaRank.py
Outdated
if __name__ == '__main__':
    paddle.init(use_gpu=False, trainer_count=4)
    train_lambdaRank(100)
    lambdaRank_infer(pass_id=2)
OK.
ltr/README.md
Outdated
    docA_feature_vector : np.array, shape=(1, feature_dimension)
    """
    yield label, docA_feature_vector, docB_feature_vector
```
Please follow the reader design format: https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/reader#data-reader-interface
With a direct yield like this, training would have to use a lambda. The code below needs the same change.
Fixed.
Also, please provide training results for both methods.
Some of the issues raised have not been addressed yet. Also, @luotao1 could you take another look?
Some small modifications.
ltr/lambdaRank.py
Outdated
import numpy as np
import functools

#lambdaRank is listwise learning to rank algorithm
LambdaRank is a listwise learning to rank algorithm.
ltr/lambdaRank.py
Outdated

# two hidden layers
hd1 = paddle.layer.fc(
    name="/hidden_1",
@qingqing01 The dataset is somewhat different from the original one; this change is acceptable.
ltr/README.md
Outdated
TBD
## Learning to Rank (LearningToRank)

Learning to rank[1] is a machine learning approach for building ranking models and plays an important role in machine learning scenarios such as information retrieval, natural language processing, and data mining. Its main goal is, given a set of documents, to produce for any query a document ranking that reflects relevance. In this example, an annotated corpus is used to train two classic ranking models, RankNet[4] and LambdaRank[6]; each yields a ranking model that returns a relevance-ordered list of documents for any query.
The markdown markup for the references is not quite right here; please use the following style:
[1]
Fixed.
ltr/README.md
Outdated

It can be optimized with standard gradient descent. For details see [RankNet](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf)

At the same time, the gradient information for document Ui during ranking optimization is
Ui -->
Done.
ltr/README.md
Outdated

$$\bar{P_{i,j}}=\frac{1}{2}(1+S_{i,j})$$

where Sij = {+1,-1} is the label of the pair formed by Ui and Uj, i.e. whether Ui is more relevant than Uj.
Please convert the symbols on this line to LaTeX formulas.
- Sij = {+1,-1} --> $S_{ij} = {+1,-1}$
- Ui --> $U_i$
- Uj --> $U_j$
pairwise_train_dataset = functools.partial(paddle.dataset.mq2007.train, format="pairwise")
for label, left_doc, right_doc in pairwise_train_dataset():
    ...
```
Are lines 63-64 extra blank lines?
Deleted.
ltr/README.md
Outdated
## Model Overview

For ranking models, this example provides RankNet, a PairWise model, and LambdaRank, a ListWise model, which represent the two classes of learning methods. PointWise ranking models reduce to a regression problem and are not discussed further.

Please remove these extra blank lines; a single blank line is enough.
Done.
ltr/README.md
Outdated
$$\lambda _{i,j}=\frac{\partial C}{\partial s_{i}} = \frac{1}{2}(1-S_{i,j})-\frac{1}{1+e^{\sigma (s_{i}-s_{j})}}$$

It represents the amount of increase or decrease in the current round of ranking optimization.

Remove the extra blank lines at lines 99-100.
ltr/README.md
Outdated
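
## Summary

LearningToRank is a widely used machine learning method that is closely tied to business scenarios. Methods for building ranking models are generally divided into PointWise, PairWise, and ListWise methods. Using the LETOR mq2007 data, this example describes RankNet, the classic PairWise method, and LambdaRank, a ListWise method, shows how to build the corresponding ranking model structures with the Paddle framework, and provides a sample of a custom data type. Paddle offers flexible programming interfaces and runs seamlessly on a single machine with a single GPU as well as on multi-machine distributed multi-GPU setups, so it can implement LearningToRank tasks.
The spelling of PairWise is inconsistent in the article. I suggest replacing it throughout with a single form; either Pairwise or PairWise is fine, but keep it consistent.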
ltr/README.md
Outdated

## References

[1] https://en.wikipedia.org/wiki/Learning_to_rank
Change the references to a markdown list and drop the square brackets.
1.
LGTM.
LGTM.
bilstm topology added to text_classification model (Senta)
lambdaRank has a bug, here may be more clear to view
Fix #4