
Add data format #872

Closed
wants to merge 20 commits

Conversation

beckett1124 (Contributor)

No description provided.

@@ -0,0 +1,29 @@
## Requirements
Collaborator:

This needs a top-level title written with `#`, e.g. `# 常用数据集` (Common Datasets).

Contributor Author:

done

@@ -0,0 +1,29 @@
## Requirements

Paddle currently ships many demos, and each demo downloads its data from the original website at run time and performs complex preprocessing, which takes a lot of time.
Collaborator:

The "therefore" following this line should read:

"Therefore we will preprocess some datasets in advance and host them on Paddle's servers."

Contributor Author:

done


Paddle currently ships many demos, and each demo downloads its data from the original website at run time and performs complex preprocessing, which takes a lot of time.

Therefore we need a data wrapper interface that imports data sources (e.g. `import paddle.data.amazon.review.GetJSON`) to cut the time spent fetching training data; if you are used to processing the raw data yourself, we still provide a raw-data interface to meet that need.
Collaborator:

The cause that this paragraph's "therefore" refers to should be:

"Also, so that everyone experimenting with Paddle can directly access these preprocessed datasets, we provide a Python library."

Contributor Author:

done


The purpose of the data wrapper interface is to provide data. Both raw and preprocessed data are imported into each model for training via `import`. Since the preprocessed data of some models is still very large, and sometimes you only want to train a relatively small network and have no need for the full dataset, automatically configuring the data size would better fit different needs. A first sketch of the interface:
* A switch controls the data source
* When importing the data interface, carry a switch (e.g. `src_from = True` means the preprocessed source; otherwise, the raw source)
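As a sketch, the switch-based design described above might look like the following (the function name and both URLs are purely illustrative, not from the PR):

```python
# Illustrative only: a boolean switch picking the data source, as proposed above.
PREPROCESSED_URL = "http://paddle.example.com/data/amazon/review.json"  # hypothetical
RAW_URL = "http://example.com/amazon/raw/"                              # hypothetical

def data_source(src_from=True):
    """Return the URL to download from: preprocessed if src_from, else raw."""
    return PREPROCESSED_URL if src_from else RAW_URL

print(data_source(False))  # http://example.com/amazon/raw/
```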
Collaborator:

Preprocessing sometimes changes the data format or other important properties, so the raw data and the preprocessed data can differ a lot, and the way they are accessed may differ just as much; a boolean flag is probably not enough to cover those differences. Besides, some datasets need no preprocessing at all. I suggest using different Python packages for raw and preprocessed data, e.g. `paddle.data.amazon.product_review` is our preprocessed data, and `paddle.data.amazon.product_review.raw` is the raw data.

Contributor Author:

The bool flag only distinguishes the data source; the follow-up processing lives in each package. done


```python
# Load the dataset into /tmp/data
mnist = input_data.read_data_sets("/tmp/data", one_hot=True)
```

Collaborator:

This example is missing the import statement, so it doesn't show how TensorFlow organizes the directories of its data package.

Contributor Author:

done

```python
Xte,Yte = mnist.test.next_batch(200)
```

Collaborator:

There are a few extra blank lines, which makes Paddle's format check fail.
Please format with the pre-commit script.

@wangkuiyi (Collaborator) left a comment:

I have some thoughts on the design of the data packages; I wrote them in the comments.

The data packages will be strongly related to our new Paddle API. I suggest discussing with @jacquesqiao @reyoung how to make this most convenient for users.

Thanks @beckett1124


amazon = input_data.load_dataset(
'Amazon',
'/Users/baidu/git/test_package/data',
Collaborator:

What does this path mean? Is it where the downloaded data gets written?

I'm wondering whether there is a way to integrate more tightly with the Paddle API. For example, in Qiao Longfei's and Yu Yang's new Python API design there seems to be a function, apparently called feed_data: you just pass a data instance to it to train or test the model.

If so, could each data package provide a fetch function that returns one data instance, so our training program could be written as:

for iter in range(1,100):
    trainer.feed(paddle.data.amazon.product_review.training.fetch())

and that's it?

This can be discussed with @reyoung @jacquesqiao.
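The fetch-and-feed control flow suggested here can be mocked with plain Python to show the intended loop (both the trainer and fetch below are stand-ins, not real Paddle APIs):

```python
import itertools

# Stand-in for paddle.data.<dataset>.training.fetch():
# each call returns one data instance.
_instances = itertools.cycle([{"text": "good product", "label": 1},
                              {"text": "bad product", "label": 0}])

def fetch():
    return next(_instances)

class Trainer(object):
    """Minimal stand-in: feed() would run one training step in real Paddle."""
    def __init__(self):
        self.steps = 0
    def feed(self, instance):
        self.steps += 1

trainer = Trainer()
for it in range(1, 100):
    trainer.feed(fetch())
print(trainer.steps)  # 99, since range(1, 100) has 99 iterations
```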

amazon = input_data.load_dataset(
'Amazon',
'/Users/baidu/git/test_package/data',
data_unneed=False,
Collaborator:

What does data_unneed mean?

'Amazon',
'/Users/baidu/git/test_package/data',
data_unneed=False,
src_flag=False)
Collaborator:

Does src_flag mean using the original data? The usual English for 原始 is "raw".

'/Users/baidu/git/test_package/data',
data_unneed=False,
src_flag=False)
batch = amazon.train.shrink_txt('train',10)
Collaborator:

What does shrink_txt mean?

Contributor Author:

It splits the txt data file: shrink_txt('train', 10) means cutting the train file down to 10% for training.
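A minimal sketch of that splitting behavior (the actual shrink_txt implementation is not part of this diff; this only illustrates "keep the first 10% of the lines"):

```python
import os
import tempfile

def shrink_txt(path, percent):
    """Return the first `percent` percent of a text file's lines (illustrative)."""
    with open(path) as f:
        lines = f.readlines()
    keep = max(1, len(lines) * percent // 100)
    return lines[:keep]

# A 100-line file shrunk to 10% keeps 10 lines.
path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(path, "w") as f:
    f.writelines("line %d\n" % i for i in range(100))
print(len(shrink_txt(path, 10)))  # 10
```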

Contributor Author:

I discussed this part with Yu Yang and Qiao Longfei; they will do the subsequent file splitting.


A usage demo:

```python
from paddle.data import input_data
```
Collaborator:

I originally thought we would have a series of different packages:

paddle.data.amazon.product_review.raw  # raw data, JSON format
paddle.data.amazon.product_review.training # preprocessed, JSON format
paddle.data.amazon.product_review.testing  # preprocessed, JSON format
paddle.data.switchboard.training.wav     # preprocessed audio clips
paddle.data.switchboard.training.labels  # the text label of each audio clip
paddle.data.switchboard.testing.wav
paddle.data.switchboard.testing.labels
paddle.data.youtube.video       # some of YouTube's public videos
paddle.data.youtube.subtitles  # the subtitles of each video

This design is freer and can cover training data in all kinds of formats. It is also why I worried earlier that a flag like src_flag might make our design insufficiently flexible.

if not os.path.exists(file_path):
    temp_file_name,_ = download_with_urlretrieve(source_url)
    temp_file_path = os.getcwd()
    os.rename(temp_file_name,src_file)
Member:

Some argument lists have spaces after the commas and some don't.

Contributor Author:

OK, I'll fix that.

@@ -0,0 +1,100 @@
#/usr/bin/env python
Member:

It seems python/CMakeLists.txt needs a change, something like:
file(GLOB UTILS_PY_FILES . ./paddle/data/*.py)

Contributor Author:

That's not needed. I asked Liao Gang; only setup.py.in has to change. Tested locally and it loads.

@jacquesqiao (Member) left a comment:

A few small issues

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Collaborator:

add docstring

Add a file-level docstring for the whole file.

Contributor Author:

done

import numpy as np
from six.moves import urllib
import stat

Collaborator:

Add `__all__`.

Contributor Author:

I don't think `__all__` is needed here.

source_name = "cifar"
file_source = "cifar-10-batches-py"
#Set the download dir for cifar.
data_home = set_data_path(source_name)
Collaborator:

data_home should be a parameter of fetch, with the default "~/paddle_data_directory". That is:

def fetch(directory=None):
    if directory is None:
        directory = "~/paddle_data_directory"

There's no need for something as complicated as set_data_path.
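A runnable version of this suggestion (the directory name comes from the comment; creating it with os.makedirs is an assumed addition, so the default lives under the user's home and avoids permission problems with system paths):

```python
import os

def fetch(directory=None):
    # Default to a per-user directory; no root/system permissions required.
    if directory is None:
        directory = os.path.expanduser(
            os.path.join("~", "paddle_data_directory"))
    if not os.path.isdir(directory):
        os.makedirs(directory)
    return directory
```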

Contributor Author:

OK, done

Contributor Author:

But we need to take file permissions into account.

Contributor Author:

done

Returns:
path to downloaded file.
"""
num_images_train = 50000
Collaborator:

Many parameters here are unused, e.g. num_images_train?

Contributor Author:

Let's remove them all.

#Set the download dir for cifar.
data_home = set_data_path(source_name)
filepath = data_download(data_home,source_url)
"""
Collaborator:

Either keep no comment at all, or delete the code outright.

Never use comments to disable code in a version control system.


return download_dir


def move_files(source_dire, target_dire):
Collaborator:

This function isn't necessary; just call shutil directly.
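Using the standard library directly, as suggested — shutil.move also handles moves across filesystems:

```python
import os
import shutil
import tempfile

src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
open(os.path.join(src_dir, "data.txt"), "w").close()

# One call replaces a hand-written move_files(source_dire, target_dire) helper.
shutil.move(os.path.join(src_dir, "data.txt"),
            os.path.join(dst_dir, "data.txt"))
print(os.path.exists(os.path.join(dst_dir, "data.txt")))  # True
```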

Contributor Author:

OK, done

else:
    tarfile.open(name=file_path, mode="r:gz").extractall(download_dir)
    print("Data has been already downloaded and unpacked!")
    return download_dir
Collaborator:

If convenient, return the path of the extracted files.
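Returning the extracted paths could look like this sketch (the helper name extract_all is assumed, not from the PR):

```python
import os
import tarfile
import tempfile

def extract_all(archive_path, download_dir):
    """Extract a .tar.gz archive and return the paths of the extracted files."""
    with tarfile.open(name=archive_path, mode="r:gz") as tar:
        tar.extractall(download_dir)
        return [os.path.join(download_dir, m.name) for m in tar.getmembers()]

# Round trip: pack one file, extract it, and get its new path back.
work = tempfile.mkdtemp()
with open(os.path.join(work, "batch_0"), "w") as f:
    f.write("x")
archive = os.path.join(work, "data.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(os.path.join(work, "batch_0"), arcname="batch_0")
paths = extract_all(archive, os.path.join(work, "out"))
print(paths[0])  # ends with 'out/batch_0'
```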

Contributor Author:

done

g_file.close()
print("Data has been already downloaded and unpacked!")
os.remove(file_path)
return download_dir
Collaborator:

Return the paths of these files.

Contributor Author:

done

data_url = urlparse.urljoin(source_url,file)
file_path = os.path.join(download_dir,file)
untar_path = os.path.join(download_dir,file.replace(".gz",""))
if not os.path.exists(file_path):
Collaborator:

A plain gzip file doesn't need to be decompressed, because Python can read it directly.
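This works because Python's gzip module reads .gz files in place — no separate extraction step:

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "labels.gz")
with gzip.open(path, "wb") as f:   # write a small compressed file
    f.write(b"0 1 2 3")

with gzip.open(path, "rb") as f:   # read it back without un-gzipping to disk
    data = f.read()
print(data)  # b'0 1 2 3'
```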

Contributor Author:

done

@@ -0,0 +1,168 @@
#/usr/bin/env python
Collaborator:

There is too much copy-pasted code here. Please clean it up.

Under the paddle.data package we could write a submodule, say download_http; or put these files into a single .py, say fetch.py, exposing functions such as fetch_mnist.
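A sketch of the proposed consolidation (the helper body is assumed; only the MNIST base URL and file names are taken from the surrounding diff, and this sketch uses Python 3's urllib.request where the PR uses six.moves):

```python
# fetch.py -- one shared download helper instead of per-dataset copies.
import os
import urllib.request

def download_if_missing(url, file_path):
    """Download url to file_path unless it already exists; return the path."""
    if not os.path.exists(file_path):
        urllib.request.urlretrieve(url, file_path)
    return file_path

def fetch_mnist(directory):
    """Each dataset-specific fetch_* only lists its files and reuses the helper."""
    base = "http://yann.lecun.com/exdb/mnist/"
    names = ["train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz",
             "t10k-images-idx3-ubyte.gz", "t10k-labels-idx1-ubyte.gz"]
    return [download_if_missing(base + n, os.path.join(directory, n))
            for n in names]
```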

Contributor Author:

done

BASE_URL = 'http://yann.lecun.com/exdb/mnist/%s-ubyte.gz'


class Categories(object):
Collaborator:

The Amazon dataset is called Categories because it really has many categories; mnist probably has only a single dataset, so there's no need to call it Categories.

:param directory:
:return:
"""
if directory is None:
Collaborator:

The mnist dataset downloads those four files in one go; there is no notion of a category.



class Categories(object):
M1m = "ml-1m"
Collaborator:

M1m => 'ML1M'

ML is short for MovieLens; 1M refers to the size of this dataset.

__all__ = ['fetch', 'Categories', 'preprocess']


def calculate_md5(fn):
Collaborator:

This function can be extracted into a shared file.
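Extracted into a shared helper, calculate_md5 might look like this (chunked reads so large dataset files are not loaded into memory at once):

```python
import hashlib

def calculate_md5(fn, chunk_size=1 << 20):
    """Return the hex MD5 digest of file fn, reading it in 1 MB chunks."""
    md5 = hashlib.md5()
    with open(fn, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```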



class Categories(object):
Conll05test = "conll05st-tests"
Collaborator:

If there is only one corpus, this Categories object isn't needed.


if directory is None:
directory = os.path.expanduser(
os.path.join('~', 'paddle_data', 'amazon'))
Collaborator:

Why is this path amazon?



class Categories(object):
AclImdb = "aclImdb_v1"
Collaborator:

With only one dataset, Categories isn't needed.

@@ -0,0 +1,5 @@
"""
The :mod:`paddle.datasets` module includes utilities to load datasets,
Collaborator:

paddle.datasets ==> paddle.data ?

label_file = os.path.join(directory, filename + label)
else:
image_file = os.path.join(directory, 't10' + image)
label_file = os.path.join(directory, 't10' + label)
Contributor:

t10 -> t10k
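For context, MNIST's test-set files are prefixed t10k (10,000 test images), so the corrected branch would build paths roughly like this (function and parameter names are illustrative):

```python
import os

def mnist_paths(directory, train,
                image="-images-idx3-ubyte.gz", label="-labels-idx1-ubyte.gz"):
    """Build image/label paths; the test prefix is 't10k', not 't10'."""
    prefix = "train" if train else "t10k"
    return (os.path.join(directory, prefix + image),
            os.path.join(directory, prefix + label))

print(mnist_paths("/tmp/mnist", train=False)[0])
```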

@CLAassistant commented Jul 22, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ reyoung
❌ baidu
❌ beckett1124


baidu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
* add contribute faq

* refine with comments
@paddle-bot-old closed this Aug 26, 2022
@paddle-bot commented Aug 26, 2022

Sorry to inform you that through our discussion, your PR fails to meet the merging standard (Reference: Paddle Custom Operator Design Doc). You can submit a new one. Thank you for your contribution.
