
Add data format #872

Closed
wants to merge 20 commits

Conversation

beckett1124 (Contributor)

No description provided.

@@ -0,0 +1,29 @@
## Requirements
Collaborator:

This needs a top-level title written with `#`, e.g. `# 常用数据集` (Common Datasets).

Contributor Author:

done

@@ -0,0 +1,29 @@
## Requirements

Paddle currently ships many demos, and each demo downloads its data from the original website at run time and performs complex preprocessing, which takes a lot of time.
Collaborator:

The "therefore" following this line should read:

"Therefore we will preprocess some datasets in advance and host them on Paddle's servers."

Contributor Author:

done


Paddle currently ships many demos, and each demo downloads its data from the original website at run time and performs complex preprocessing, which takes a lot of time.

Therefore we need a data wrapper interface that imports data sources (e.g. `import paddle.data.amazon.review.GetJSON`) to cut the time spent fetching training data; if you are used to processing the raw data yourself, we still provide a raw-data interface to meet that need.
Collaborator:

The cause that this paragraph's "therefore" refers to should be:

"Also, so that everyone experimenting with Paddle can directly access these preprocessed datasets, we provide a Python library."

Contributor Author:

done


The purpose of the data wrapper interface is to provide data. Both raw and preprocessed data are imported into each model for training via `import`. Since the preprocessed data of some models is still very large, and sometimes you only want to train a relatively small network and have no need for the full dataset, automatically configuring the data size would better fit different needs. A first sketch of the interface:
* A switch controls the data source
* When importing the data interface, carry a switch (e.g. `src_from = True` means the preprocessed source; otherwise, the raw source)
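As a sketch, the switch-based design described above might look like the following (the function name and both URLs are purely illustrative, not from the PR):

```python
# Illustrative only: a boolean switch picking the data source, as proposed above.
PREPROCESSED_URL = "http://paddle.example.com/data/amazon/review.json"  # hypothetical
RAW_URL = "http://example.com/amazon/raw/"                              # hypothetical

def data_source(src_from=True):
    """Return the URL to download from: preprocessed if src_from, else raw."""
    return PREPROCESSED_URL if src_from else RAW_URL

print(data_source(False))  # http://example.com/amazon/raw/
```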
Collaborator:

Preprocessing sometimes changes the data format or other important properties, so the raw data and the preprocessed data can differ a lot, and the way they are accessed may differ just as much; a boolean flag is probably not enough to cover those differences. Besides, some datasets need no preprocessing at all. I suggest using different Python packages for raw and preprocessed data, e.g. `paddle.data.amazon.product_review` is our preprocessed data, and `paddle.data.amazon.product_review.raw` is the raw data.

Contributor Author:

The bool flag only distinguishes the data source; the follow-up processing lives in each package. done


```python
# Load the dataset into /tmp/data
mnist = input_data.read_data_sets("/tmp/data", one_hot=True)
```

Collaborator:

This example is missing the import statement, so it doesn't show how TensorFlow organizes the directories of its data package.

Contributor Author:

done

```python
Xte,Yte = mnist.test.next_batch(200)
```

Collaborator:

There are a few extra blank lines, which makes Paddle's format check fail.
Please format with the pre-commit script.

@wangkuiyi (Collaborator) left a comment:

I have some thoughts on the design of the data packages; I wrote them in the comments.

The data packages will be strongly related to our new Paddle API. I suggest discussing with @jacquesqiao @reyoung how to make this most convenient for users.

Thanks @beckett1124


amazon = input_data.load_dataset(
'Amazon',
'/Users/baidu/git/test_package/data',
Collaborator:

What does this path mean? Is it where the downloaded data gets written?

I'm wondering whether there is a way to integrate more tightly with the Paddle API. For example, in Qiao Longfei's and Yu Yang's new Python API design there seems to be a function, apparently called feed_data: you just pass a data instance to it to train or test the model.

If so, could each data package provide a fetch function that returns one data instance, so our training program could be written as:

for iter in range(1,100):
    trainer.feed(paddle.data.amazon.product_review.training.fetch())

and that's it?

This can be discussed with @reyoung @jacquesqiao.
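The fetch-and-feed control flow suggested here can be mocked with plain Python to show the intended loop (both the trainer and fetch below are stand-ins, not real Paddle APIs):

```python
import itertools

# Stand-in for paddle.data.<dataset>.training.fetch():
# each call returns one data instance.
_instances = itertools.cycle([{"text": "good product", "label": 1},
                              {"text": "bad product", "label": 0}])

def fetch():
    return next(_instances)

class Trainer(object):
    """Minimal stand-in: feed() would run one training step in real Paddle."""
    def __init__(self):
        self.steps = 0
    def feed(self, instance):
        self.steps += 1

trainer = Trainer()
for it in range(1, 100):
    trainer.feed(fetch())
print(trainer.steps)  # 99, since range(1, 100) has 99 iterations
```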

amazon = input_data.load_dataset(
'Amazon',
'/Users/baidu/git/test_package/data',
data_unneed=False,
Collaborator:

What does data_unneed mean?

'Amazon',
'/Users/baidu/git/test_package/data',
data_unneed=False,
src_flag=False)
Collaborator:

Does src_flag mean using the original data? The usual English for 原始 is "raw".

'/Users/baidu/git/test_package/data',
data_unneed=False,
src_flag=False)
batch = amazon.train.shrink_txt('train',10)
Collaborator:

What does shrink_txt mean?

Contributor Author:

It splits the txt data file: shrink_txt('train', 10) means cutting the train file down to 10% for training.
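A minimal sketch of that splitting behavior (the actual shrink_txt implementation is not part of this diff; this only illustrates "keep the first 10% of the lines"):

```python
import os
import tempfile

def shrink_txt(path, percent):
    """Return the first `percent` percent of a text file's lines (illustrative)."""
    with open(path) as f:
        lines = f.readlines()
    keep = max(1, len(lines) * percent // 100)
    return lines[:keep]

# A 100-line file shrunk to 10% keeps 10 lines.
path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(path, "w") as f:
    f.writelines("line %d\n" % i for i in range(100))
print(len(shrink_txt(path, 10)))  # 10
```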

Contributor Author:

I discussed this part with Yu Yang and Qiao Longfei; they will do the subsequent file splitting.


A usage demo:

```python
from paddle.data import input_data
```
Collaborator:

I originally thought we would have a series of different packages:

paddle.data.amazon.product_review.raw  # raw data, JSON format
paddle.data.amazon.product_review.training # preprocessed, JSON format
paddle.data.amazon.product_review.testing  # preprocessed, JSON format
paddle.data.switchboard.training.wav     # preprocessed audio clips
paddle.data.switchboard.training.labels  # the text label of each audio clip
paddle.data.switchboard.testing.wav
paddle.data.switchboard.testing.labels
paddle.data.youtube.video       # some of YouTube's public videos
paddle.data.youtube.subtitles  # the subtitles of each video

This design is freer and can cover training data in all kinds of formats. It is also why I worried earlier that a flag like src_flag might make our design insufficiently flexible.

if not os.path.exists(file_path):
    temp_file_name,_ = download_with_urlretrieve(source_url)
    temp_file_path = os.getcwd()
    os.rename(temp_file_name,src_file)
Member:

Some argument lists have spaces after the commas and some don't.

Contributor Author:

OK, I'll fix that.

@@ -0,0 +1,100 @@
#/usr/bin/env python
Member:

It seems python/CMakeLists.txt needs a change, something like:
file(GLOB UTILS_PY_FILES . ./paddle/data/*.py)

Contributor Author:

That's not needed. I asked Liao Gang; only setup.py.in has to change. Tested locally and it loads.

@jacquesqiao (Member) left a comment:

A few small issues

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Collaborator:

add docstring

Add a file-level docstring for the whole file.

Contributor Author:

done

import numpy as np
from six.moves import urllib
import stat

Collaborator:

Add `__all__`.

Contributor Author:

I don't think `__all__` is needed here.

source_name = "cifar"
file_source = "cifar-10-batches-py"
#Set the download dir for cifar.
data_home = set_data_path(source_name)
Collaborator:

data_home should be a parameter of fetch, with the default "~/paddle_data_directory". That is:

def fetch(directory=None):
    if directory is None:
        directory = "~/paddle_data_directory"

There's no need for something as complicated as set_data_path.
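A runnable version of this suggestion (the directory name comes from the comment; creating it with os.makedirs is an assumed addition, so the default lives under the user's home and avoids permission problems with system paths):

```python
import os

def fetch(directory=None):
    # Default to a per-user directory; no root/system permissions required.
    if directory is None:
        directory = os.path.expanduser(
            os.path.join("~", "paddle_data_directory"))
    if not os.path.isdir(directory):
        os.makedirs(directory)
    return directory
```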

Contributor Author:

OK, done

Contributor Author:

But we need to take file permissions into account.

Contributor Author:

done

Returns:
path to downloaded file.
"""
num_images_train = 50000
Collaborator:

Many parameters here are unused, e.g. num_images_train?

Contributor Author:

Let's remove them all.

#Set the download dir for cifar.
data_home = set_data_path(source_name)
filepath = data_download(data_home,source_url)
"""
Collaborator:

Either keep no comment at all, or delete the code outright.

Never use comments to disable code in a version control system.


return download_dir


def move_files(source_dire, target_dire):
Collaborator:

This function isn't necessary; just call shutil directly.
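Using the standard library directly, as suggested — shutil.move also handles moves across filesystems:

```python
import os
import shutil
import tempfile

src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
open(os.path.join(src_dir, "data.txt"), "w").close()

# One call replaces a hand-written move_files(source_dire, target_dire) helper.
shutil.move(os.path.join(src_dir, "data.txt"),
            os.path.join(dst_dir, "data.txt"))
print(os.path.exists(os.path.join(dst_dir, "data.txt")))  # True
```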

Contributor Author:

OK, done

else:
    tarfile.open(name=file_path, mode="r:gz").extractall(download_dir)
    print("Data has been already downloaded and unpacked!")
    return download_dir
Collaborator:

If convenient, return the path of the extracted files.
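Returning the extracted paths could look like this sketch (the helper name extract_all is assumed, not from the PR):

```python
import os
import tarfile
import tempfile

def extract_all(archive_path, download_dir):
    """Extract a .tar.gz archive and return the paths of the extracted files."""
    with tarfile.open(name=archive_path, mode="r:gz") as tar:
        tar.extractall(download_dir)
        return [os.path.join(download_dir, m.name) for m in tar.getmembers()]

# Round trip: pack one file, extract it, and get its new path back.
work = tempfile.mkdtemp()
with open(os.path.join(work, "batch_0"), "w") as f:
    f.write("x")
archive = os.path.join(work, "data.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(os.path.join(work, "batch_0"), arcname="batch_0")
paths = extract_all(archive, os.path.join(work, "out"))
print(paths[0])  # ends with 'out/batch_0'
```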

Contributor Author:

done

g_file.close()
print("Data has been already downloaded and unpacked!")
os.remove(file_path)
return download_dir
Collaborator:

Return the paths of these files.

Contributor Author:

done

data_url = urlparse.urljoin(source_url,file)
file_path = os.path.join(download_dir,file)
untar_path = os.path.join(download_dir,file.replace(".gz",""))
if not os.path.exists(file_path):
Collaborator:

A plain gzip file doesn't need to be decompressed, because Python can read it directly.
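This works because Python's gzip module reads .gz files in place — no separate extraction step:

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "labels.gz")
with gzip.open(path, "wb") as f:   # write a small compressed file
    f.write(b"0 1 2 3")

with gzip.open(path, "rb") as f:   # read it back without un-gzipping to disk
    data = f.read()
print(data)  # b'0 1 2 3'
```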

Contributor Author:

done

@@ -0,0 +1,168 @@
#/usr/bin/env python
Collaborator:

There is too much copy-pasted code here. Please clean it up.

Under the paddle.data package we could write a submodule, say download_http; or put these files into a single .py, say fetch.py, exposing functions such as fetch_mnist.
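A sketch of the proposed consolidation (the helper body is assumed; only the MNIST base URL and file names are taken from the surrounding diff, and this sketch uses Python 3's urllib.request where the PR uses six.moves):

```python
# fetch.py -- one shared download helper instead of per-dataset copies.
import os
import urllib.request

def download_if_missing(url, file_path):
    """Download url to file_path unless it already exists; return the path."""
    if not os.path.exists(file_path):
        urllib.request.urlretrieve(url, file_path)
    return file_path

def fetch_mnist(directory):
    """Each dataset-specific fetch_* only lists its files and reuses the helper."""
    base = "http://yann.lecun.com/exdb/mnist/"
    names = ["train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz",
             "t10k-images-idx3-ubyte.gz", "t10k-labels-idx1-ubyte.gz"]
    return [download_if_missing(base + n, os.path.join(directory, n))
            for n in names]
```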

Contributor Author:

done

BASE_URL = 'http://yann.lecun.com/exdb/mnist/%s-ubyte.gz'


class Categories(object):
Collaborator:

The Amazon dataset is called Categories because it really has many categories; mnist probably has only a single dataset, so there's no need to call it Categories.

:param directory:
:return:
"""
if directory is None:
Collaborator:

The mnist dataset downloads those four files in one go; there is no notion of a category.



class Categories(object):
M1m = "ml-1m"
Collaborator:

M1m => 'ML1M'

ML is short for MovieLens; 1M refers to the size of this dataset.

__all__ = ['fetch', 'Categories', 'preprocess']


def calculate_md5(fn):
Collaborator:

This function can be extracted into a shared file.
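Extracted into a shared helper, calculate_md5 might look like this (chunked reads so large dataset files are not loaded into memory at once):

```python
import hashlib

def calculate_md5(fn, chunk_size=1 << 20):
    """Return the hex MD5 digest of file fn, reading it in 1 MB chunks."""
    md5 = hashlib.md5()
    with open(fn, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```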



class Categories(object):
Conll05test = "conll05st-tests"
Collaborator:

If there is only one corpus, this Categories object isn't needed.


if directory is None:
directory = os.path.expanduser(
os.path.join('~', 'paddle_data', 'amazon'))
Collaborator:

Why is this path amazon?



class Categories(object):
AclImdb = "aclImdb_v1"
Collaborator:

With only one dataset, Categories isn't needed.

@@ -0,0 +1,5 @@
"""
The :mod:`paddle.datasets` module includes utilities to load datasets,
Collaborator:

paddle.datasets ==> paddle.data ?

label_file = os.path.join(directory, filename + label)
else:
image_file = os.path.join(directory, 't10' + image)
label_file = os.path.join(directory, 't10' + label)
Contributor:

t10 -> t10k
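For context, MNIST's test-set files are prefixed t10k (10,000 test images), so the corrected branch would build paths roughly like this (function and parameter names are illustrative):

```python
import os

def mnist_paths(directory, train,
                image="-images-idx3-ubyte.gz", label="-labels-idx1-ubyte.gz"):
    """Build image/label paths; the test prefix is 't10k', not 't10'."""
    prefix = "train" if train else "t10k"
    return (os.path.join(directory, prefix + image),
            os.path.join(directory, prefix + label))

print(mnist_paths("/tmp/mnist", train=False)[0])
```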

@CLAassistant commented Jul 22, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ reyoung
❌ baidu
❌ beckett1124


baidu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
* add contribute faq

* refine with comments
@paddle-bot-old closed this Aug 26, 2022
@paddle-bot commented Aug 26, 2022

Sorry to inform you that through our discussion, your PR fails to meet the merging standard (Reference: Paddle Custom Operator Design Doc). You can submit a new one. Thank you for your contribution.
