PyDataProvider2 appears to lose data when min_pool_size, integer_sequence, and dense_sequence are used together #653

Closed
lcy-seso opened this issue Nov 29, 2016 · 7 comments

@lcy-seso
Contributor

With PyDataProvider2, when min_pool_size is set smaller than the total number of training/test samples, samples are dropped during both training and testing.

reyoung added a commit to reyoung/Paddle that referenced this issue Dec 7, 2016
* But not reproduce the problem.
gangliao added a commit that referenced this issue Dec 7, 2016
@lcy-seso
Contributor Author

lcy-seso commented Dec 7, 2016

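  1. The input has 3 slots:
dense_vector(60, seq_type=SequenceType.SEQUENCE),
dense_vector(1, seq_type=SequenceType.SEQUENCE),
dense_vector(1, seq_type=SequenceType.SEQUENCE)
  2. The test set contains 3104 samples in total, and the provider is configured as follows:
@provider(use_seq=True,
          pool_size=100,
          min_pool_size=100,
          init_hook=on_init,
          should_shuffle=True)
  3. Running Paddle with job = test, the number of samples tested differs from run to run. With nothing changed, four runs of the test script tested 161, 163, 342, and 100 samples.

  4. Increasing pool_size beyond the total number of test samples gives the same behavior. All test samples are tested only when min_pool_size is increased beyond the total number of test samples.

  5. MEM_DATA_IN_PASS was not set during these tests, so the problem is probably unrelated to memory caching.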
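
For reference, the behavior this report expects from a shuffling pool can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Paddle's actual implementation: `pooled_shuffle` and its `seed` argument are hypothetical names, and the meaning given to `pool_size` and `min_pool_size` here is only one plausible reading of those parameters. The point it illustrates is the invariant that every sample should come out exactly once, whatever the pool settings are.

```python
import random


def pooled_shuffle(samples, pool_size, min_pool_size, seed=0):
    """Yield every sample exactly once, shuffling inside a bounded pool.

    Hypothetical helper for illustration only; it does not mirror
    Paddle's internal code.
    """
    rng = random.Random(seed)
    pool = []
    for sample in samples:
        pool.append(sample)
        if len(pool) >= pool_size:
            rng.shuffle(pool)
            # Emit samples until only min_pool_size remain buffered.
            while len(pool) > min_pool_size:
                yield pool.pop()
    # Input exhausted: flush the remainder so that nothing is dropped.
    rng.shuffle(pool)
    while pool:
        yield pool.pop()


# Same sizes as in the report: 3104 samples, both pool settings at 100.
out = list(pooled_shuffle(range(3104), pool_size=100, min_pool_size=100))
```

A check such as `len(out) == 3104` on the consumer side is the kind of test that would expose the dropped samples described above.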

@lcy-seso
Contributor Author

lcy-seso commented Dec 7, 2016

Some additional test results.

  1. The behavior above is unrelated to cache=CacheType.CACHE_PASS_IN_MEM.
  2. The failing configuration is:
dense_vector(60, seq_type=SequenceType.SEQUENCE),
dense_vector(1, seq_type=SequenceType.SEQUENCE),
dense_vector(1, seq_type=SequenceType.SEQUENCE)

When yielding:

yield vec, [[label]], [[seq_type]]

Here label and seq_type are ints (could this be the problem?).

  3. The last two slots can be replaced with integer_value, with the yield changed accordingly,
    giving the following definition:
dense_vector(60, seq_type=SequenceType.SEQUENCE),
integer_value(1),
integer_value(1)

When yielding:

yield vec, label, seq_type

With this change the error disappears.

  4. In training, with integer_value used as the label and min_pool_size set smaller than the total number of training samples, no samples were dropped.
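
The two sample layouts above differ only in how the last two slots are shaped. The sketch below shows, in plain Python, one way to make the sequence-shaped layout explicit by coercing each timestep to a list of floats before yielding. `as_dense_sequence` is a hypothetical helper, not part of PyDataProvider2, and this thread does not establish whether passing ints to a dense slot is the actual cause of the dropped samples.

```python
def as_dense_sequence(steps, dim):
    """Return `steps` as a list of timesteps, each a list of `dim` floats.

    Hypothetical helper for illustration; raises ValueError when a
    timestep does not have exactly `dim` values.
    """
    sequence = []
    for step in steps:
        if len(step) != dim:
            raise ValueError(
                "expected %d values per timestep, got %d" % (dim, len(step))
            )
        sequence.append([float(x) for x in step])
    return sequence


label = 3      # int label, as in the report
seq_type = 1   # int sequence type, as in the report

# Sequence layout: one timestep holding one float per slot.
label_seq = as_dense_sequence([[label]], dim=1)
seq_type_seq = as_dense_sequence([[seq_type]], dim=1)
```

With the integer_value layout no wrapping is needed, and `label` and `seq_type` are yielded as plain ints.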

@lcy-seso
Contributor Author

lcy-seso commented Dec 7, 2016

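A string_slot would be of some value at prediction time in the following two cases, and I hope it can be supported later.
- (1) The dataprovider processes the raw data and discards invalid records;

  • The raw data could then be passed in through a string slot and the results read off directly, without recording which samples were discarded and then post-processing to join the predictions back to the raw input. This is convenient when evaluating small datasets.

- (2) The label itself is a string;

  • With string slot support, testing could avoid mapping strings to ids and then mapping them back again.

A string slot never takes part in the actual computation; it only adds some convenience during prediction.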
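
Case (1) can be sketched in plain Python as follows. This is only an illustration of the idea, assuming whitespace-separated numeric records: `parse_records` and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in for a model rather than any Paddle API. Because the raw line travels alongside the features, each prediction stays paired with its input even though invalid records are skipped.

```python
def parse_records(lines):
    """Yield (raw_line, features) for valid lines and skip invalid ones."""
    for line in lines:
        try:
            features = [float(x) for x in line.split()]
        except ValueError:
            continue  # invalid record: dropped, as a dataprovider might do
        if not features:
            continue  # empty line: also dropped
        yield line, features


def fake_predict(features):
    """Placeholder for a model; simply sums the features."""
    return sum(features)


raw_lines = ["1 2 3", "bad line", "4 5", ""]

# Each result keeps the original string next to its prediction,
# so no bookkeeping of discarded samples is needed afterwards.
results = [(raw, fake_predict(feats)) for raw, feats in parse_records(raw_lines)]
```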

@reyoung
Collaborator

reyoung commented Dec 7, 2016

The dataprovider processes the raw data and discards invalid records;

This is supported; please look up the check parameter.

@lcy-seso
Contributor Author

lcy-seso commented Dec 7, 2016

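Yes~ The idea was to use this together with a string slot. A string slot is indeed something we can do without; having it would just add a little convenience.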

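@reyoung reyoung changed the title from "PyDataProvider2 drops samples in both training and testing when min_pool_size is smaller than the total number of samples." to "PyDataProvider2 appears to lose data when min_pool_size, integer_sequence, and dense_sequence are used together" Dec 7, 2016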
@luotao1
Contributor

luotao1 commented Dec 7, 2016

@lcy-seso
Contributor Author

lcy-seso commented Dec 7, 2016

PyDataProvider2 probably does not have it yet. It really has no practical use; it just makes evaluation a little more convenient.

zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this issue Sep 25, 2019
* fix_windows

* Final update 1.3 (PaddlePaddle#653)

* thorough clean

* delete_DS_Store

* update_1.3
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this issue Sep 25, 2019
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this issue Sep 25, 2019
Meiyim pushed a commit to Meiyim/Paddle that referenced this issue May 21, 2021
yaozhixin pushed a commit to graphcore/Paddle-fork that referenced this issue Apr 28, 2022
heavengate pushed a commit to heavengate/Paddle that referenced this issue Aug 24, 2022

4 participants