[Add] Introduce xdoctest into the PaddlePaddle framework workflow #540

Closed

megemini
Contributor

PR types

New features

PR changes

Docs

Describe

[used AI Studio]

China Software Open Source Innovation Competition: PaddlePaddle Framework Task Challenge

@SigureMo @Ligoml

Please review. Thanks!

paddle-bot bot commented May 21, 2023

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

@megemini
Contributor Author

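Also:

  1. I am not yet familiar with how Paddle's CI pipeline is configured, especially on Baidu's iPipe platform.
  2. From my analysis so far, converting the existing sample code to the google style seems to require manual work for the most part. Is there a better way?

I'd appreciate your guidance. Thanks!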
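
On the second question, part of the conversion could plausibly be automated. The sketch below is a hypothetical helper, not part of Paddle or xdoctest: it uses Python's `ast` module to find where each top-level statement starts, prefixes those lines with `>>> `, and prefixes continuation lines with `... `. It assumes the legacy sample parses on its own, it does not handle multi-line string literals specially, and it cannot produce the expected output lines, which would still have to be added by running the code or by hand.

```python
import ast


def add_prompts(source):
    """Prefix legacy sample code with doctest-style prompts (hypothetical helper)."""
    tree = ast.parse(source)
    # Map every line covered by a top-level statement to True if it is the
    # statement's first line and False if it is a continuation line.
    is_first = {}
    for node in tree.body:
        decorators = getattr(node, "decorator_list", [])
        first = min([node.lineno] + [d.lineno for d in decorators])
        for number in range(first, node.end_lineno + 1):
            is_first[number] = number == first

    converted = []
    for number, line in enumerate(source.splitlines(), start=1):
        if number in is_first:
            prompt = ">>> " if is_first[number] else "... "
            converted.append((prompt + line).rstrip())
        elif line.strip():
            # A comment between statements becomes its own prompt line.
            converted.append(">>> " + line.strip())
        else:
            # Blank lines between statements are kept as blank lines.
            converted.append("")
    return "\n".join(converted)


legacy = (
    "import math\n"
    "\n"
    "# area of a circle\n"
    "def area(r):\n"
    "    return math.pi * r ** 2\n"
    "\n"
    "print(area(1.0))\n"
)
print(add_prompts(legacy))
```

A conversion like this would only cover the mechanical part; whether the resulting examples pass still has to be checked by the doctest runner.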

Member

SigureMo left a comment

Great RFC! A few details need some adjustment, though.

|Author | megemini (柳顺) |
|Submitted | 2023-05-21 |
|Version | V1.0 |
|Required PaddlePaddle version | paddlepaddle>2.4 |
Member

Development should be done on the develop branch.


### 2.1 Documentation

Update the document in the Paddle contribution guide: [Developing the API Python Side](https://www.paddlepaddle.org.cn/documentation/docs/zh/dev_guides/api_contributing_guides/new_python_api_cn.html#api-python), so that it governs subsequent code development.
Member

The API Documentation Writing Guidelines should be updated in sync as well.


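Update the document in the Paddle contribution guide: [Developing the API Python Side](https://www.paddlepaddle.org.cn/documentation/docs/zh/dev_guides/api_contributing_guides/new_python_api_cn.html#api-python), so that it governs subsequent code development.

Add writing requirements for `Example` sample code: it must follow the `google` style of `xdoctest`, that is, code in an `Example` section must start with `>>>`. The existing `code-block` directive is kept so that generation of the Chinese documentation is not affected.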
Member

Code in an Example section must start with >>>. The existing code-block directive is kept.

Nice approach. One thing needs confirming, though: is the form that keeps code-block compatible with xdoctest?

Contributor Author

I have only skimmed xdoctest's parser.py source, but our current .. code-block:: python should be treated as TEXT by xdoctest, so it has no effect.

A simple example verifies this:

def test(a):
    """this is docstring...

    Examples:

        .. code-block:: python

            this is a test...

            >>> a = 3
            >>> print(a)
            3

    """
    pass

The test passes:

$ xdoctest --style=google test_simple.py 

=====================================
_  _ ___  ____ ____ ___ ____ ____ ___
 \/  |  \ |  | |     |  |___ [__   |
_/\_ |__/ |__| |___  |  |___ ___]  |

=====================================

Start doctest_module('test_simple.py')
Listing tests
gathering tests
running 1 test(s)
====== <exec> ======
* DOCTEST : test_simple.py::test:0, line 5 <- wrt source file
DOCTEST SOURCE
6 >>> a = 3
7 >>> print(a)
  3
DOCTEST STDOUT/STDERR
3
DOCTEST RESULT
* SUCCESS: test_simple.py::test:0
====== </exec> ======
============
=== 1 passed in 0.09 seconds ===
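
The same behavior can be checked without installing xdoctest. As a rough stand-in, the sketch below uses the standard library's `doctest` parser (not xdoctest's own `parser.py`, whose internals may differ) on the docstring from the example above. It shows that the `.. code-block:: python` line and the free text are skipped and only the `>>>` lines become examples.

```python
import doctest

DOCSTRING = """this is docstring...

    Examples:

        .. code-block:: python

            this is a test...

            >>> a = 3
            >>> print(a)
            3

"""

parser = doctest.DocTestParser()
# Only lines starting with the >>> prompt are parsed as examples; everything
# else, including the code-block directive, is plain text to the parser.
examples = parser.get_examples(DOCSTRING)
for example in examples:
    print(repr(example.source), repr(example.want))

# Run the parsed examples and compare actual output with the expected output.
test = parser.get_doctest(DOCSTRING, {}, "test_simple", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results)
```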


The tools for the Paddle code CI pipeline live in the [Paddle/tools/](https://github.com/PaddlePaddle/Paddle/tree/develop/tools) directory.

Currently, python sample code is checked mainly by [Paddle/tools/codestyle/docstring_checker.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/codestyle/docstring_checker.py).
Member

Shouldn't this be paddle/tools/sampcd_processor.py? As for docstring_checker, it is a tool that is not actually in effect; see PaddlePaddle/Paddle#47821.

Contributor Author

Got it. I'll look into this in detail and revise it.

print("Sample code check is successful!")
```

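This approach has a number of problems. For example, it cannot verify that the code's output matches the result shown in the example, and it cannot handle sample code that is expected to raise an error.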
Member

it cannot handle sample code that is expected to raise an error

What does this refer to? At present, sample code that raises an error should already fail in CI.

Contributor Author

xdoctest can capture the Error output and check it:

def test(a):
    """this is docstring...

    Examples:

        .. code-block:: python

            this is a test...

            >>> raise ValueError
            Traceback (most recent call last):
              File "<stdin>", line 1, in <module>
            ValueError

    """
    pass

Running xdoctest:

$ xdoctest --style=google test_error.py 

=====================================
_  _ ___  ____ ____ ___ ____ ____ ___
 \/  |  \ |  | |     |  |___ [__   |
_/\_ |__/ |__| |___  |  |___ ___]  |

=====================================

Start doctest_module('test_error.py')
Listing tests
gathering tests
running 1 test(s)
====== <exec> ======
* DOCTEST : test_error.py::test:0, line 5 <- wrt source file
DOCTEST SOURCE
6 >>> raise ValueError
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  ValueError
DOCTEST STDOUT/STDERR
DOCTEST RESULT
* SUCCESS: test_error.py::test:0
====== </exec> ======
============
=== 1 passed in 0.09 seconds ===

Member

Got it.
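
The two gaps named in the RFC text, mismatched output and examples that are meant to raise, can be illustrated with a small sketch. It uses the standard library's `doctest` as a stand-in for xdoctest (an assumption made so the snippet runs without extra dependencies; xdoctest's matching rules are not identical), and the `check` helper is hypothetical.

```python
import doctest


def check(docstring, name="example"):
    """Run the >>> examples in a docstring and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(docstring, {}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Send failure reports to a no-op writer so only the counts are shown.
    results = runner.run(test, out=lambda text: None)
    return results.failed, results.attempted


EXPECTED_ERROR = """
    >>> raise ValueError
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError
"""

WRONG_OUTPUT = """
    >>> print(1 + 1)
    3
"""

print(check(EXPECTED_ERROR))
print(check(WRONG_OUTPUT))
```

The first call should report no failures because the raised exception matches the expected traceback, while the second should report one failure because the printed value differs from the expected output. A checker that only runs the sample and looks at the exit status would treat the first case as an error and let the second pass.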


Currently, Paddle's python code lives mainly in the [Paddle/python/paddle/](https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle) directory.

It contains `2334` python files and `341` sample code snippets (commit `8acbf10bd51026c0a41423c2826b7cc886ad1e74`).
Member

341 sample code snippets

Where does this count come from? Are there really only 341 sample code snippets?

Contributor Author

I made a quick modification to docs/ci_scripts/chinese_samplecode_processor.py to do the counting:

import math
import os
import pickle
import shutil
import subprocess
import multiprocessing
import sys

import glob

def remove_desc_code(srcls, filename):
    if filename == 'fluid_cn/one_hot_cn.rst':
        srcls.pop(13)
        srcls.pop(28)
        srcls.pop(44)
    if filename == 'layers_cn/one_hot_cn.rst':
        srcls.pop(15)
        srcls.pop(30)
        srcls.pop(46)
    if filename == 'profiler_cn/profiler_cn.rst':
        srcls.pop(41)
    if filename == 'layers_cn/natural_exp_decay_cn.rst':
        srcls.pop(13)
    if filename == 'layers_cn/transpose_cn.rst':
        srcls.pop(20)
    if filename == 'layers_cn/array_length_cn.rst':
        srcls.pop(36)
    if filename == 'layers_cn/inverse_time_decay_cn.rst':
        srcls.pop(13)
    if filename == 'layers_cn/stack_cn.rst':
        srcls.pop(12)
        srcls.pop(33)
    if filename == 'layers_cn/sums_cn.rst':
        srcls.pop(11)
    if filename == 'layers_cn/sum_cn.rst':
        for i in range(len(srcls) - 1, 61, -1):
            srcls.pop(i)
    if filename == 'layers_cn/softmax_cn.rst':
        srcls.pop(30)
        srcls.pop(57)
    if filename == 'layers_cn/array_write_cn.rst':
        srcls.pop(37)
    if filename == 'layers_cn/lod_append_cn.rst':
        srcls.pop(11)
    if filename == 'layers_cn/reorder_lod_tensor_by_rank_cn.rst':
        srcls.pop(25)
    if filename == 'layers_cn/round_cn.rst':
        srcls.pop(10)
    if filename == 'layers_cn/squeeze_cn.rst':
        srcls.pop(11)
        srcls.pop(19)
        srcls.pop(27)
    if filename == 'layers_cn/unsqueeze_cn.rst':
        srcls.pop(11)
    if filename == 'layers_cn/array_read_cn.rst':
        srcls.pop(51)
    if filename == 'layers_cn/scatter_cn.rst':
        srcls.pop(9)
    if filename == 'layers_cn/topk_cn.rst':
        srcls.pop(11)
    if filename == 'optimizer_cn/ModelAverage_cn.rst':
        srcls.pop(15)
    return srcls


def check_indent(code_line):
    indent = ""
    for c in code_line:
        if c == '\t':
            indent += '    '
        elif c == ' ':
            indent += ' '
        if c != ' ' and c != '\t':
            break
    return indent


def find_all(src_str, substr):
    indices = []
    get_one = src_str.find(substr)
    while get_one != -1:
        indices.append(get_one)
        get_one = src_str.find(substr, get_one + 1)
    return indices


def extract_sample_code(srcfile, status_all):
    content = ""

    filename = srcfile.name
    srcc = srcfile.read()
    srcfile.seek(0, 0)
    srcls = srcfile.readlines()
    srcls = remove_desc_code(
        srcls, filename
    )  # remove description info for samplecode
    status = []
    sample_code_begins = find_all(srcc, " code-block:: python")
    if len(sample_code_begins) == 0:
        status.append(-1)

    else:
        for i in range(0, len(srcls)):
            if srcls[i].find(".. code-block:: python") != -1:
                content = ""
                start = i

                blank_line = 1
                while srcls[start + blank_line].strip() == '':
                    blank_line += 1

                startindent = ""
                # remove indent error
                if srcls[start + blank_line].find("from") != -1:
                    startindent += srcls[start + blank_line][
                        : srcls[start + blank_line].find("from")
                    ]
                elif srcls[start + blank_line].find("import") != -1:
                    startindent += srcls[start + blank_line][
                        : srcls[start + blank_line].find("import")
                    ]
                else:
                    startindent += check_indent(srcls[start + blank_line])
                content += srcls[start + blank_line][len(startindent) :]
                for j in range(start + blank_line + 1, len(srcls)):
                    # planish a blank line
                    if (
                        not srcls[j].startswith(startindent)
                        and srcls[j] != '\n'
                    ):
                        break
                    if srcls[j].find(" code-block:: python") != -1:
                        break
                    content += srcls[j].replace(startindent, "", 1)
                status.append(run_sample_code(content, filename))

    status_all[filename] = status
    return status_all, content

def run_sample_code(content, filename):
    return 0

def test(file):
    temp = []
    src = open(file, 'r')
    status_all = {}
    _, content = extract_sample_code(src, status_all)
    temp.append(status_all)
    src.close()
    return temp, content

if __name__ == '__main__':
    with open('codes.txt', 'w') as f_codes:
        codes = []
        count = 0
        count_codes = 0
        for root, dirs, files in os.walk('/home/shun/Documents/Projects/paddle_xdoctest/Paddle-develop/python/paddle'):
            # print("current directory:", root)
            # print("subdirectories:", dirs)
            # print("files:", files)
            for f in files:
                if f.endswith('.py'):
                    count += 1
                    filename = os.path.join(root, f)
                    _, _codes = test(filename)
                    if _codes:
                        count_codes += 1

                        f_codes.write('-'*30 + str(count_codes))
                        f_codes.write('\n')
                        f_codes.write(filename + '\t' + '-'*30)
                        f_codes.write('\n')
                        f_codes.write(_codes)                        
                        f_codes.write('\n')
                        
    print('total...', count)
    print('total code...', count_codes)

This is all it extracted. It seems a bit low to me too, but the python file count looked right, so I didn't dig any deeper, haha...
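
One possible reason for the low figure: as written, the script above appears to increment `count_codes` once per file that yields any sample code, and `extract_sample_code` returns only the last block it extracted, so `341` may be a count of files rather than of snippets. A minimal sketch of a per-block count follows. `count_code_blocks` is a hypothetical helper, and the directory in the usage line is a placeholder, not a path from the comment above.

```python
import os

DIRECTIVE = ".. code-block:: python"


def count_code_blocks(root):
    """Return (python_files, files_with_blocks, total_blocks) under root."""
    python_files = files_with_blocks = total_blocks = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".py"):
                continue
            python_files += 1
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                # Count every directive, not just whether the file has one.
                blocks = handle.read().count(DIRECTIVE)
            if blocks:
                files_with_blocks += 1
                total_blocks += blocks
    return python_files, files_with_blocks, total_blocks


if __name__ == "__main__":
    # Placeholder path; point it at a local checkout of python/paddle.
    print(count_code_blocks("python/paddle"))
```

Comparing the second and third numbers would show whether the per-file counting explains the gap.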

Member

You could try the script paddle/tools/sampcd_processor.py in the Paddle repository.


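3. Final wrap-up phase: switch the pipeline over to the Paddle code repository; the code check in Paddle docs can then be removed.
- Feature update for the Chinese and English [API documentation](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/index_cn.html#api): code examples with the `>>>` prompt can be copied, including code and comments but not output.
- Hand over code checking (optional): move all code checking from Paddle docs to the CI pipeline of the Paddle code repository.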
Member

As mentioned earlier, Paddle and docs both include code checking, so some of the wording here needs to be revised.

Member

Hand over code checking (optional)

I think "optional" can be dropped. Checking with two tools at the same time only adds maintenance cost, so the original code check can be removed at this stage.

Member

code examples with the >>> prompt can be copied, including code and comments but not output.

How is code copying handled in the early and middle phases? Does the code that users see and copy in those phases include >>> and comments?

Contributor Author

I didn't spell this part out in detail.

The docs are currently built with sphinx, right? Are the templates under templates_path = ["/templates"]?

I have never actually built documentation with sphinx, so I am not sure what the code that users see and copy looks like in the early and middle phases. I listed this as a separate feature precisely so that it can be tracked.
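
The copy behavior described in the RFC text (code and comments, no output) could be sketched as below. This only illustrates the intended transformation in Python; `strip_prompts` is a hypothetical name, and how the sphinx-built site would actually implement copying is not settled in this thread.

```python
def strip_prompts(example):
    """Keep code and comments from a >>> example; drop prompts and output."""
    copied = []
    in_code = False
    for line in example.splitlines():
        stripped = line.lstrip()
        if stripped == ">>>" or stripped.startswith(">>> "):
            copied.append(stripped[4:])
            in_code = True
        elif in_code and (stripped == "..." or stripped.startswith("... ")):
            # Continuation lines only count directly after a prompt line.
            copied.append(stripped[4:])
        else:
            # Expected output, free text, or a blank line: not copied.
            in_code = False
    return "\n".join(copied)


example = """
    >>> # add two numbers
    >>> def add(a, b):
    ...     return a + b
    >>> print(add(1, 2))
    3
"""
print(strip_prompts(example))
```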

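- A subsequent line that does not start with `>>>` is treated as output; the line before it must start with `>>>`.
- A blank line marks the start of a new code segment.

However, since `xdoctest` itself currently has no such mandatory format check, this design item is optional.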
Member

Could .. code-block:: and its indentation be removed at this stage?

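Contributor Author

Sure. If we confirm that .. code-block:: is no longer needed, the sample code extraction in both Paddle code and Paddle docs has to be changed accordingly.

In that case, I suggest tracking it as a separate feature.

Still, one thing needs confirming here: xdoctest is "compatible" with the current sample code, meaning it skips it automatically. Do we need to enforce a check on this format later on? That is why I listed 2.3 No longer compatible with the old format (optional) as optional.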

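Member

Still, one thing needs confirming here: xdoctest is "compatible" with the current sample code, meaning it skips it automatically. Do we need to enforce a check on this format later on? That is why I listed 2.3 No longer compatible with the old format (optional) as optional.

Without a check, some developers' examples would be skipped because they use the old format, and then errors in that code would go undetected. That is not really acceptable, so I do recommend having such a check.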

Contributor Author

Agreed! :)
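
A check of the kind agreed on here could look roughly like the following sketch. It is not Paddle's or xdoctest's implementation: `find_unprompted_blocks` is a hypothetical helper that reports every `.. code-block:: python` directive in a docstring whose body contains no `>>>` line, that is, blocks that a `>>>`-based runner would silently skip.

```python
DIRECTIVE = ".. code-block:: python"


def find_unprompted_blocks(docstring):
    """Return 1-based line numbers of code-block directives without >>> lines."""
    lines = docstring.splitlines()
    flagged = []
    for index, line in enumerate(lines):
        if line.strip() != DIRECTIVE:
            continue
        directive_indent = len(line) - len(line.lstrip())
        has_prompt = False
        # The block body is every following line that is blank or indented
        # deeper than the directive itself.
        for body_line in lines[index + 1:]:
            if not body_line.strip():
                continue
            indent = len(body_line) - len(body_line.lstrip())
            if indent <= directive_indent:
                break
            if body_line.lstrip().startswith(">>>"):
                has_prompt = True
                break
        if not has_prompt:
            flagged.append(index + 1)
    return flagged


OLD_STYLE = """Examples:
    .. code-block:: python

        import paddle
        x = paddle.to_tensor([1.0])
"""

NEW_STYLE = """Examples:
    .. code-block:: python

        >>> a = 3
        >>> print(a)
        3
"""

print(find_unprompted_blocks(OLD_STYLE))
print(find_unprompted_blocks(NEW_STYLE))
```

In CI, a non-empty result could be turned into a failure so that old-format examples are reported instead of skipped.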

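- Affects the CI pipelines of Paddle code and Paddle docs
- Affects how sample code for python APIs is currently written
- Affects the page display of the `Developing the API Python Side` document
- Affects the display and copying of sample code in the Chinese and English API documentation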
Member

Could you describe these by category, following https://github.com/PaddlePaddle/community/blob/master/rfcs/design_template.md#%E4%B8%83%E5%BD%B1%E5%93%8D%E9%9D%A2 ? It would help to expand a little on how large each impact is and whether it is controllable.


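In addition, there is no special handling for examples whose output consistency cannot be verified (random distributions) or that require a special environment (such as a GPU or file storage).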


Member

It would also be good to mention whether Paddle's existing code-checking tool works by extracting docstrings at runtime or by static code analysis, and how xdoctest extracts them.

Note that one advantage of runtime extraction is that even docstrings defined in C++ code can be extracted correctly, which is hard to do with static code analysis. This is worth confirming.

Contributor Author

I don't quite follow this. Do you mean using xdoctest to extract examples from c++?

Member

SigureMo May 21, 2023

For example,

https://github.com/PaddlePaddle/Paddle/blob/83a12b1110677d98b92c1734cdcc3a31e480ac67/paddle/fluid/pybind/cuda_streams_py.cc#L130-L137

is an API exposed through pybind11, and its generated documentation is at

https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/device/cuda/Stream_cn.html#stream

Can the existing sample code checker check the sample code of this API? Can xdoctest? The two need to be compared.

Contributor Author

Yes, xdoctest can do dynamic parsing:

    analysis (str, default='auto'):
        if 'static', only static analysis is used to parse call
        definitions. If 'auto', uses dynamic analysis for compiled python
        extensions, but static analysis elsewhere, if 'dynamic', then
        dynamic analysis is used to parse all calldefs.
def parse_dynamic_calldefs(modpath_or_module):
...
    if getattr(module, '__doc__'):
        calldefs['__doc__'] = static.CallDefNode(
            callname='__doc__',
            docstr=module.__doc__,
            lineno=0,
            doclineno=1,
            doclineno_end=1,
            args=None
        )
...

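paddle.device.cuda.Stream.__doc__ can be extracted normally as far as I can see, but exactly how xdoctest handles it needs attention during the actual implementation. I'll split it out as a separate feature to track. :)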
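
The advantage of runtime extraction can be seen with any compiled module. The sketch below imports a module and reads docstrings through `inspect.getdoc`, which works even when there is no Python source to parse. It uses the standard library's `math` module as a stand-in for a pybind11 extension such as the one behind `paddle.device.cuda.Stream`, and `runtime_docstrings` is a hypothetical helper, not xdoctest's `parse_dynamic_calldefs`.

```python
import importlib
import inspect


def runtime_docstrings(module_name):
    """Import a module and collect the docstrings of its public callables."""
    module = importlib.import_module(module_name)
    docs = {}
    for name, obj in vars(module).items():
        if name.startswith("_") or not callable(obj):
            continue
        doc = inspect.getdoc(obj)
        if doc:
            docs[name] = doc
    return docs


# In CPython, math is implemented in C, so there is no .py source for a
# static parser to read; the docstrings are only reachable after import.
docs = runtime_docstrings("math")
print(len(docs), "docstrings found")
print(docs["sqrt"])
```

Whether the docstrings collected this way contain runnable `>>>` examples, and how xdoctest's `auto` analysis mode treats them, is the part that still needs to be verified against Paddle itself.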
