Remove dependency on nltk for paddle __init__. #27388
test=develop
Thanks for your contribution!
✅ This PR's description meets the template requirements!
@XiaoguangHu01 @jzhang533 @swtkiwi, could you please help review this?
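It looks like only nltk.download is used.
We could download these two datasets manually and upload them to https://dataset.bj.bcebos.com (the data could even be preprocessed into the final format offline), and then the code would only need to download and load the data.
Would the user's environment then not need nltk at all?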
The dataset has already been uploaded, and the code currently tries https://dataset.bj.bcebos.com first when downloading: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/dataset/sentiment.py#L45
nltk is also used in get_word_dict():

    def get_word_dict():
        """
        Sorted the words by the frequency of words which occur in sample
        :return:
            words_freq_sorted
        """
        words_freq_sorted = list()
        word_freq_dict = collections.defaultdict(int)
        download_data_if_not_yet()
        for category in movie_reviews.categories():
            for field in movie_reviews.fileids(category):
                for words in movie_reviews.words(field):
                    word_freq_dict[words] += 1
        words_sort_list = list(six.iteritems(word_freq_dict))
        words_sort_list.sort(key=cmp_to_key(lambda a, b: b[1] - a[1]))
        for index, word in enumerate(words_sort_list):
            words_freq_sorted.append((word[0], index))
        return words_freq_sorted
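- Downloading the data does not seem to need a back-off strategy. Even if it did, http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz would be a more suitable fallback source.
- get_word_dict just builds the vocabulary by counting words, right? Implementing that ourselves is not hard (certainly much simpler than implementing an RNN).
- Pulling in nltk for the sake of a 3 MB movie_reviews dataset is really not worth it.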
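To illustrate the second point, here is a minimal sketch of the same frequency-ranked vocabulary built with only the standard library. It is not the code from this PR. The function name `build_word_dict` is hypothetical, and the sketch assumes the archive has already been extracted into a directory with one subdirectory per category (`neg`, `pos`) of whitespace-tokenized text files. It also breaks frequency ties alphabetically, and it splits on whitespace, so its output may differ slightly from what nltk's corpus reader produces.

```python
import collections
import os


def build_word_dict(data_dir, categories=("neg", "pos")):
    """Return [(word, rank), ...], with rank 0 for the most frequent word.

    Assumes data_dir/<category>/*.txt holds whitespace-tokenized reviews
    (an assumed layout, not something confirmed by this PR).
    """
    word_freq = collections.Counter()
    for category in categories:
        category_dir = os.path.join(data_dir, category)
        for name in sorted(os.listdir(category_dir)):
            path = os.path.join(category_dir, name)
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # Split on whitespace and count every token.
                    word_freq.update(line.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(word_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, index) for index, (word, _) in enumerate(ranked)]
```

Calling `build_word_dict("/path/to/txt_sentoken")` (path hypothetical) would return the same kind of `(word, index)` list as `get_word_dict`, without importing nltk.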
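I discussed this with Zeyu. Let's simply delete this dataset, for the following reasons:
- A preliminary survey found no tutorials or projects that depend on the interface this dataset provides;
- The related functionality is planned to move to paddlenlp;
- The example tutorials for sentiment analysis tasks do not depend on this dataset;
- Removing nltk cleanly improves the installation experience.
TODO:
Please delete the related code and the corresponding documentation in FluidDoc.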
A PR deleting the related documentation has been submitted in FluidDoc: PaddlePaddle/docs#2664
lgtm
LGTM
LGTM
PR types
Others
PR changes
Others
Describe
Remove dependency on nltk for paddle __init__.