-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data format #872
Add data format #872
Conversation
python/paddle/data/DATA.md
Outdated
@@ -0,0 +1,29 @@ | |||
## 需求 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
需要用 # 写一个标题,比如 # 常用数据集
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/DATA.md
Outdated
@@ -0,0 +1,29 @@ | |||
## 需求 | |||
|
|||
Paddle目前提供了很多demo,且各demo运行时需要从原生网站下载其数据,并进行复杂的预处理过程,整个过程会耗费大量时间。 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这一行内容的“所以”应该是
所以我们会预处理好一些数据,放在Paddle的服务器上
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/DATA.md
Outdated
|
||
Paddle目前提供了很多demo,且各demo运行时需要从原生网站下载其数据,并进行复杂的预处理过程,整个过程会耗费大量时间。 | ||
|
||
所以我们需要数据封装接口,采用import数据源的方式(如\:import paddle.data.amazon.review.GetJSON)来简化获取训练所需数据的时间;但是如果你习惯自己处理原生数据,我们依然提供原生数据接口来满足你的需求。 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这一段的“所以”对应的原因,应该是
同时为了方便大家用Paddle做实验的时候,可以直接访问这些预处理好的数据,我们提供一套Python库。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/DATA.md
Outdated
|
||
数据封装接口的目的是提供数据。不论是原生数据,还是预处理数据都通过import方式导入各模型进行训练;考虑到某些模型的预处理后的数据量依然很大,或有时就仅仅想训练相对较小的网络模型,没必要考虑全量数据,自动配置数据量大小必然更符合不同需求。整个接口初步设想如下: | ||
* 开关来控制数据来源 | ||
* 导入数据接口时,带有开关(如:src\_from = True,来自预处理源;否则,来自原生数据源) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
有时候预处理会改变数据的格式或者其他重要特性,以至于原始数据和预处理后的数据之间差别很大,访问方式也可能很大,一个boolean flag恐怕不足以覆盖这样的差别。另外,有些数据集不需要预处理。我建议原始数据和预处理后的数据可以用不同的Python package,比如: paddle.data.amazon.product_review. 是我们预处理后的数据,paddle.data.amazon.product_review.raw 是原始数据。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool flage只是区别数据来源,后续的处理在自个的package里面。done
python/paddle/data/DATA.md
Outdated
|
||
```python | ||
#导入数据,放在/tmp/data目录下 | ||
mnist=input_data.read_data_sets("/tmp/data",one_hot=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个例子里漏了import语句,所以没有体现出来Tensorflow的data package 的目录组织方式。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/DATA.md
Outdated
Xte,Yte = mnist.test.next_batch(200) | ||
|
||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
多了几个空行,导致paddle的格式检查不过。
请使用pre-commit脚本进行格式化。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我对data packages的设计有一些考虑。写在comments里了。
Data packages应该会和我们的新Paddle API有很强的相关性,建议和 @jacquesqiao @reyoung 一起讨论一下。看看如何让用户用起来最方便。
多谢 @beckett1124 !
python/paddle/data/DATA.md
Outdated
|
||
amazon = input_data.load_dataset( | ||
'Amazon', | ||
'/Users/baidu/git/test_package/data', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个路径是什么意思?是不是说把数据下载下来,写到这个目录里?
我在想,是不是可以有更紧密地和Paddle API结合的方式?比如我看到乔龙飞和于洋的新Python API的设计里,好像是有一个函数,貌似叫 feed_data 的 —— 只要把一个data instance 传给这个参数,就可以用来训练模型或者测试模型了。
如果是这样,是不是每个data package可以提供一个fetch函数,获取一个data instance,这样我们的训练程序就可以写成:
for iter in range(1,100):
trainer.feed(paddle.data.amazon.product_review.training.fetch())
就完了?
这个可以和 @reyoung @jacquesqiao 商量一下。
python/paddle/data/DATA.md
Outdated
amazon = input_data.load_dataset( | ||
'Amazon', | ||
'/Users/baidu/git/test_package/data', | ||
data_unneed=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_unneed 是什么意思?
python/paddle/data/DATA.md
Outdated
'Amazon', | ||
'/Users/baidu/git/test_package/data', | ||
data_unneed=False, | ||
src_flag=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src_flag 是说用原始数据吗?原始的翻译是 raw
python/paddle/data/DATA.md
Outdated
'/Users/baidu/git/test_package/data', | ||
data_unneed=False, | ||
src_flag=False) | ||
batch = amazon.train.shrink_txt('train',10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shirnk_txt 是什么意思?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
切分txt的数据文件,shirnk_txt(‘train’,10)是指将train的文件切分至10%,用于训练
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这块跟于洋和乔龙飞商量了,他们来做后续文件的切分。
python/paddle/data/DATA.md
Outdated
|
||
使用demo如下所示\: | ||
```python | ||
from paddle.data import input_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我原以为我们会有一系列不容的packages:
paddle.data.amazon.product_review.raw # 原始数据,json格式
paddle.data.amazon.product_review.training # 处理过的,json格式
paddle.data.amazon.product_review.testing # 处理过的,json格式
paddle.data.switchboard.training.wav # 处理过的,音频片段
paddle.data.switchboard.training.labels # 处理过的,每个拼音片段对应的文字label
paddle.data.switchboard.testing.wav
paddle.data.switchboard.testing.labels
paddle.data.youtube.video # Youtube 公开的一些视频的
paddle.data.youtube.subtitles # 每个视频对应的字幕
这样设计比较自由,可以覆盖各种格式的训练数据。这也是我之前担心 src_flag 这样的flag可能导致我们的设计不够灵活的原因。
python/paddle/data/cifar_10.py
Outdated
if not os.path.exists(file_path): | ||
temp_file_name,_ = download_with_urlretrieve(source_url) | ||
temp_file_path = os.getcwd() | ||
os.rename(temp_file_name,src_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
参数列表有的有空格,有的没空格
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
嗯 这个我修改一下
python/paddle/data/cifar_10.py
Outdated
@@ -0,0 +1,100 @@ | |||
#/usr/bin/env python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
似乎要修改下 python/CMakeLists.txt 类似
file(GLOB UTILS_PY_FILES . ./paddle/data/*.py)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个不需要,我问过廖纲,只需要修改setup.py.in 就可以。本地测试了,可以load上
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
几个小问题
python/paddle/data/amazon.py
Outdated
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add docstring
添加整体文件的注释。docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/amazon.py
Outdated
import numpy as np | ||
from six.moves import urllib | ||
import stat | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
添加 __all__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里不需要添加all吧
python/paddle/data/cifar10.py
Outdated
source_name = "cifar" | ||
file_source = "cifar-10-batches-py" | ||
#Set the download dir for cifar. | ||
data_home = set_data_path(source_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_home是fetch的一个参数。默认是"~/paddle_data_directory"
即:
def fetch(directory=None):
if directory is None:
directory = "~/paddle_data_directory"
没必要用set_data_path这么复杂的东西
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好,done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
但是我们要考虑到权限的问题呀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/cifar10.py
Outdated
Returns: | ||
path to downloaded file. | ||
""" | ||
num_images_train = 50000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里好多参数没有用?比如num_images_train
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
都去掉吧
python/paddle/data/cifar10.py
Outdated
#Set the download dir for cifar. | ||
data_home = set_data_path(source_name) | ||
filepath = data_download(data_home,source_url) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
要么不要注释,要么删掉。
永远不要再版本管理系统里使用注释来去掉代码。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
嗯
python/paddle/data/amazon.py
Outdated
return download_dir | ||
|
||
|
||
def move_files(source_dire, target_dire): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个函数没必要有吧。直接调用shutils就好了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好,done
python/paddle/data/cifar_10.py
Outdated
else: | ||
tarfile.open(name=file_path, mode="r:gz").extractall(download_dir) | ||
print("Data has been already downloaded and unpacked!") | ||
return download_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
方便的话返回解压缩的路径吧
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/mnist.py
Outdated
g_file.close() | ||
print("Data has been already downloaded and unpacked!") | ||
os.remove(file_path) | ||
return download_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
返回这几个文件的路径吧
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/mnist.py
Outdated
data_url = urlparse.urljoin(source_url,file) | ||
file_path = os.path.join(download_dir,file) | ||
untar_path = os.path.join(download_dir,file.replace(".gz","")) | ||
if not os.path.exists(file_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
单纯的gzip文件不需要解压缩。。因为Python可以直接按照读取。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/paddle/data/recommendation.py
Outdated
@@ -0,0 +1,168 @@ | |||
#/usr/bin/env python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
太多复制粘贴的代码了。请清理一下。
现在我们在paddle.data这个package下,可以写一个子module,比如叫download_http等等。或者,这几个文件都放到一个.py里面,叫fetch.py,里面暴露 fetch_mnist之类的函数
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Refine amazon_product_reviews.py
Feature/data api
python/paddle/data/mnist.py
Outdated
BASE_URL = 'http://yann.lecun.com/exdb/mnist/%s-ubyte.gz' | ||
|
||
|
||
class Categories(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazon那个dataset叫做Categories因为就是有很多分类,而mnist这个可能只有一个数据集,没有必要叫做Categories了。
:param directory: | ||
:return: | ||
""" | ||
if directory is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mnist的数据集,一次就要下载那四个文件下来。不存在category的概念
|
||
|
||
class Categories(object): | ||
M1m = "ml-1m" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
M1m => 'ML1M'
ML 是MovieLens的缩写, 1M是指这个数据大小有1M
__all__ = ['fetch', 'Categories', 'preprocess'] | ||
|
||
|
||
def calculate_md5(fn): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个函数可以extract到一个公共文件里面。
|
||
|
||
class Categories(object): | ||
Conll05test = "conll05st-tests" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这如果只有一个语料的话,就没有必要有这个categories对象了。
python/paddle/data/semantic.py
Outdated
|
||
if directory is None: | ||
directory = os.path.expanduser( | ||
os.path.join('~', 'paddle_data', 'amazon')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个地址为什么是amazon?
|
||
|
||
class Categories(object): | ||
AclImdb = "aclImdb_v1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
一个的就不用Categories了。
@@ -0,0 +1,5 @@ | |||
""" | |||
The :mod:`paddle.datasets` module includes utilities to load datasets, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paddle.datasets
==> paddle.data
?
label_file = os.path.join(directory, filename + label) | ||
else: | ||
image_file = os.path.join(directory, 't10' + image) | ||
label_file = os.path.join(directory, 't10' + label) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t10 -> t10k
baidu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
* add contribute faq * refine with comments
很抱歉,经过我们的反复讨论,你的PR暂未达到合入标准,请阅读飞桨原生算子开发规范,你可以重新提交新的PR,我们先将此PR关闭,感谢你的贡献。 |
No description provided.