Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove dependency on nltk for paddle __init__. #27388

Merged
merged 2 commits into from
Sep 24, 2020

Conversation

guoshengCS
Copy link
Contributor

@guoshengCS guoshengCS commented Sep 17, 2020

PR types

Others

PR changes

Others

Describe

Remove dependency on nltk for paddle init.

@paddle-bot-old
Copy link

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot-old
Copy link

paddle-bot-old bot commented Sep 17, 2020

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@guoshengCS
Copy link
Contributor Author

guoshengCS commented Sep 18, 2020

2020-09-17 19:48:22 ****************
2020-09-17 19:48:22 0. You must have one RD (XiaoguangHu01 or lanxianghit) and one TPM (saxon-zh or jzhang533 or swtkiwi or Heeenrrry or TCChenlong) approval for the api change for the management reason of API interface.
2020-09-17 19:48:22 
2020-09-17 19:48:22 There are 1 approved errors.
2020-09-17 19:48:22 ****************
2020-09-17 19:48:22 API Difference is: 
2020-09-17 19:48:22 - paddle.dataset.sentiment.train (ArgSpec(args=[], varargs=None, keywords=None, defaults=None), ('document', '5337749ac9c45d13f4ef6e992b2f0ff8'))
2020-09-17 19:48:22 ?   

还请 @XiaoguangHu01 @jzhang533 @swtkiwi 帮忙review

Copy link
Contributor

@jzhang533 jzhang533 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

看起来只用到了nltk.download

可以把这两个数据集手动下载下来,上传到https://dataset.bj.bcebos.com上。(甚至可以线下先处理好数据格式),然后代码里下载及加载数据即可。
用户环境里完全不需要有nltk吗?

@guoshengCS
Copy link
Contributor Author

当前数据集已经上传,目前会优先从 https://dataset.bj.bcebos.com 下载 https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/dataset/sentiment.py#L45

还用到了movie_review接口,这个有nltk的数据集设计在里面 https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/dataset/sentiment.py#L80 相对麻烦些

def get_word_dict():
    """
    Sorted the words by the frequency of words which occur in sample
    :return:
        words_freq_sorted
    """
    words_freq_sorted = list()
    word_freq_dict = collections.defaultdict(int)
    download_data_if_not_yet()

    for category in movie_reviews.categories():
        for field in movie_reviews.fileids(category):
            for words in movie_reviews.words(field):
                word_freq_dict[words] += 1
    words_sort_list = list(six.iteritems(word_freq_dict))
    words_sort_list.sort(key=cmp_to_key(lambda a, b: b[1] - a[1]))
    for index, word in enumerate(words_sort_list):
        words_freq_sorted.append((word[0], index))
    return words_freq_sorted

Copy link
Contributor

@jzhang533 jzhang533 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@jzhang533 jzhang533 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

跟泽裕商量了一下。这个数据集直接delete好了,基于以下原因:

  • 初步调研看到没有教程或者项目依赖本数据集提供的接口;
  • 相关的功能有计划迁移至paddlenlp;
  • 情感分析类的任务的示例教程不依赖这个数据集
  • 干净的去掉nltk,可以优化安装体验

TODO:
麻烦删除相关的code和fluiddoc中的文档。

@guoshengCS
Copy link
Contributor Author

相关文档已在FluidDoc中提交删除PR PaddlePaddle/docs#2664

Copy link
Contributor

@jzhang533 jzhang533 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link

@saxon-zh saxon-zh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@guoshengCS guoshengCS merged commit 081fb2f into PaddlePaddle:develop Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants