CI: speedup docstring check consecutive runs #57826

dontgoto · 2024-03-13T00:30:04Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.

This PR brings the runtime of the docstring check CI down to 2-3 minutes from about 20 minutes.

Currently, code_checks.sh docstring makes multiple calls to validate_docstrings.py because various error types have per-function exceptions. Each run of validate_docstrings takes about 2-3 minutes, leading to the current 20-minute runtime just for the docstring checks.

The runtime for consecutive calls is brought down to that of a single call by changing the argument parsing of validate_docstrings, adding -- to separate parameter groups that were previously passed in separate calls. Additionally, a cache is added to reuse the parsing results between runs. The cache size should pose no problem: the Python instance run by `code_checks.sh docstring` only reserves about 95MB of memory on my machine.
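The caching idea described above can be sketched roughly as follows. This is a minimal illustration rather than the code from this PR: `validate_one`, `run_check`, the function names, and the fake results are all hypothetical, and `functools.lru_cache` is assumed as the caching mechanism.

```python
import functools

# Counts how often the (pretend) expensive validation really runs.
CALLS = {"count": 0}


@functools.lru_cache(maxsize=5000)  # bounded so memory use stays predictable
def validate_one(func_name: str) -> tuple[tuple[str, str], ...]:
    """Hypothetical stand-in for parsing and validating one docstring.

    Returns (error_code, message) pairs as a tuple so the result is hashable.
    """
    CALLS["count"] += 1
    # Made-up results, purely for illustration.
    fake_errors = {
        "pandas.Series.foo": (("PR01", "Parameters not documented"),),
        "pandas.Series.bar": (("SA01", "See Also section not found"),),
    }
    return fake_errors.get(func_name, ())


def run_check(func_names: list[str], error_code: str) -> list[str]:
    """Return the names of functions whose docstring reports `error_code`."""
    return [
        name
        for name in func_names
        if any(code == error_code for code, _ in validate_one(name))
    ]


names = ["pandas.Series.foo", "pandas.Series.bar"]
pr01 = run_check(names, "PR01")  # first pass fills the cache
sa01 = run_check(names, "SA01")  # second pass reuses the cached results
```

With a cache like this, only the first check pays the validation cost per function; later checks for other error codes read the stored results.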

MarcoGorelli · 2024-03-13T11:38:40Z

This PR brings the runtime of the docstring check CI down to 2-3 minutes from about 20 minutes.

Wow - my hat is raised above my head

@datapythonista fancy taking a look?

datapythonista · 2024-03-13T15:52:57Z

Thanks @dontgoto for making the CI much faster.

While I'm open to getting this merged, I'm a bit unsure about using this approach. In an ideal world we'd like to call validate_docstrings just once for all errors and for all files. This is clearly not the case today, but hopefully we'll eventually get there, and making things significantly more complex for something hopefully lasting only a few months may not be worth it.

Also, if we want to implement this I think I'd prefer another API where we can simply specify which files and errors to ignore together. Not sure exactly the best way to do this, but something like this should be clearer and simpler IMHO: ./validate_docstring.py --ignore pandas/core.py,PR03 --ignore pandas/frame.py,EX01,EX02
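A rough sketch of how such an `--ignore` option could be parsed is shown below. It only illustrates the proposed command line shape; the helper `parse_ignore` is hypothetical, and nothing here is taken from the actual script.

```python
import argparse
from collections import defaultdict


def parse_ignore(values: list[str]) -> dict[str, set[str]]:
    """Turn ["target,ERR1,ERR2", ...] into {target: {"ERR1", "ERR2"}}."""
    ignored: dict[str, set[str]] = defaultdict(set)
    for value in values:
        target, *codes = value.split(",")
        if not codes:
            raise ValueError(f"--ignore needs at least one error code: {value!r}")
        ignored[target].update(codes)
    return dict(ignored)


parser = argparse.ArgumentParser()
# action="append" collects every occurrence of --ignore into one list.
parser.add_argument("--ignore", action="append", default=[])

args = parser.parse_args(
    ["--ignore", "pandas/core.py,PR03", "--ignore", "pandas/frame.py,EX01,EX02"]
)
ignored = parse_ignore(args.ignore)
```

Repeating the flag keeps each target and its error codes together on the command line, which is the readability benefit argued for above.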

As I said, I'm not opposed to merging this PR, but I think it's making the validator significantly more complex to understand, which is probably worth it for the speedup now, but thinking longer term not so much.

What do you think @dontgoto ?

dontgoto · 2024-03-13T16:09:43Z

I think I can change the command line args and their preprocessing quite easily to match your --ignore idea, but in the end still run everything sequentially. Your parameter variant is indeed easier to understand.

I agree that making the rest of the code match this kind of structure and ditching the repeated sequential calls would be preferable, but maybe a bit much of a time investment.

I might look into that in the future though, I have an idea for another PR that might make the docstring checks "commit-hookably" fast, but I still need to test it out.

datapythonista · 2024-03-13T16:27:12Z

I think implementing what I said is not trivial, but I think just one validate call per function would be enough if we do that, so I think it should be as fast as your implementation here, or maybe even a bit faster.

dontgoto · 2024-03-14T09:55:03Z

I pushed a first version of the --ignore parameter. In my opinion, the behavior of the CLI parameters is now easier to understand, but the parsing logic in the script got more complex.

Let me know what you think about the changes to the .py. I am ambivalent, both versions have their own pain points.

If we go ahead with the current version, I'll tidy it up and add tests for the parsing. Merge conflicts for docstring exceptions removed from main in the meantime are to be expected.

dontgoto · 2024-03-14T10:02:33Z

think just one validate call per function would be enough if we do that, so I think it should be as fast as your implementation here, or maybe even a bit faster.

Regarding performance, everything outside the cached validate function is basically free. Calls of main finish in milliseconds or less once the validate document cache is filled in the first run. So no need to minimize calls of main just from the performance angle.

datapythonista · 2024-03-14T17:30:48Z

Thanks for the work on this @dontgoto.

What I had in mind is different to what you implemented. And I think it should be simpler, and reasonably fast.

Regardless of the command line API we implement, we would end up with a list of function names and which errors we need to ignore for them. We have this stored in a variable, and then we call the validator normally: https://github.com/pandas-dev/pandas/blob/main/scripts/validate_docstrings.py#L317

At this point, we have in the result the errors that have been found in the function. If the error is in the list of errors to ignore, we can just remove that from the result. This way we don't need a cache, we don't need too much extra complexity, and we call the parsing and the validation just once.
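The filtering step suggested here might look something like the sketch below. The data shapes are assumptions made for illustration: `results` maps a function name to its (error_code, message) pairs from a single validation pass, and `ignore` maps a function name to the codes to drop. `drop_ignored` is a hypothetical helper, not a function from the script.

```python
def drop_ignored(
    results: dict[str, list[tuple[str, str]]],
    ignore: dict[str, set[str]],
) -> dict[str, list[tuple[str, str]]]:
    """Remove (code, message) pairs whose code is ignored for that function."""
    return {
        func: [
            (code, msg)
            for code, msg in errors
            if code not in ignore.get(func, set())
        ]
        for func, errors in results.items()
    }


# Made-up output of one validation pass over two functions.
results = {
    "pandas.Series.foo": [("PR01", "missing params"), ("SA01", "no See Also")],
    "pandas.Series.bar": [("SA01", "no See Also")],
}
ignore = {"pandas.Series.foo": {"PR01"}}

remaining = drop_ignored(results, ignore)
```

Because the validation runs once and the filter is a plain dictionary lookup, no cache is needed in this variant.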

What do you think?

dontgoto · 2024-03-16T14:00:55Z

I agree. The version I previously pushed was a half measure, just adding a variant of the CMD parameter required for this solution, but shoehorning it into the old logic. I refactored everything to use the new parameter, simplifying the parsing. I am satisfied with this solution, let me know whether you agree.

For the new CMD parameter for_error_ignore_functions I find a mapping of error: list[funcs_to_ignore] to be the best fit when considering the maintenance of the exception lists in the code_checks.sh. Just the initial formatting changes there are not nice.
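To make the mapping concrete, here is a small sketch of an error-to-functions structure and how it could be inverted for per-function lookups. The function names and the helper `invert` are hypothetical, and the actual format accepted by the script may differ.

```python
from collections import defaultdict

# Hypothetical exception lists, keyed by error code as described above.
ignore_functions_by_error: dict[str, list[str]] = {
    "PR01": ["pandas.Series.foo", "pandas.DataFrame.bar"],
    "SA01": ["pandas.Series.foo"],
}


def invert(mapping: dict[str, list[str]]) -> dict[str, set[str]]:
    """Convert {error: [functions]} into {function: {errors}}."""
    by_function: dict[str, set[str]] = defaultdict(set)
    for code, funcs in mapping.items():
        for func in funcs:
            by_function[func].add(code)
    return dict(by_function)


ignore_errors_by_function = invert(ignore_functions_by_error)
```

Keying by error code keeps each exception list in one place for maintenance, while the inverted form is convenient when checking a single function's results.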

Open for a different name for the parameter though.

datapythonista

Thank you for the work here @dontgoto, really great job. I think this is way clearer, and also thanks for fixing the typos in the file.

I added a couple of comments that I think should make the code clearer, both in the script and in code_checks.sh, but in general your changes look great.

datapythonista · 2024-03-16T22:48:33Z

Looks great, thanks a lot for the work here @dontgoto

@jordan-d-murphy @tqa236 do you mind having a look here and sharing any feedback on this change? Thanks!

jordan-d-murphy · 2024-03-17T00:14:47Z

This is such a refreshing PR! Love to see this. I am in full support of this new approach. I added one suggestion - but leave it up to you if you think it's valuable to include or not.

My main thoughts on this are:

I love this new approach. I appreciate all the work that was done on this. Would love to see this merged in.
Once this is merged in, I can close the following issues, which I opened based on the previous approach we were using in code_checks.sh / validate_docstrings.py:

DOC: fix GL08 errors in docstrings
DOC: fix PR01 errors in docstrings
DOC: fix PR07 errors in docstrings
DOC: fix SA01 errors in docstrings
DOC: fix RT03 errors in docstrings
DOC: fix PR02 errors in docstrings

After closing the above issues, I can open a new issue to address fixing the docstrings that follows this new approach
And finally, there seems to be one cryptic failing CI check, ASAN / UBSAN - would like to see this resolved and all green on the CI, but as the logs got deleted, it's hard to tell if this is related to this PR or some outside issue.

dontgoto · 2024-03-17T01:56:51Z

Thanks!

And finally, there seems to be one cryptic failing CI check, ASAN / UBSAN - would like to see this resolved and all green on the CI, but as the logs got deleted, it's hard to tell if this is related to this PR or some outside issue.

It seems that this test is currently failing on this and many other PRs; unit tests are failing on main as well.

I'd be happy to see this merged since resolving the conflicts with ongoing doc fix PRs is a hassle. Let me know if there is anything else blocking this.

Thanks again for all the great feedback and the welcoming atmosphere :)

jordan-d-murphy · 2024-03-17T02:45:20Z

Awesome! Lgtm 🙂

tqa236 · 2024-03-17T08:38:12Z

Hello, this is a great improvement! LGTM too.

datapythonista · 2024-03-17T22:02:16Z

Amazing job @dontgoto.

I see the code_checks job takes 20 minutes instead of 40 after this, and I think this will help a lot with the efforts to fix errors in docstrings.

dontgoto added 7 commits March 13, 2024 00:56

support running multiple docstring checks in one call

439395a

add test

6afa708

group docstring calls in code_checks

03fe545

fix after merge, add cache size limit

c01a78e

merge

0cdb574

merge

841793f

prepare PR

0ed525a

dontgoto requested a review from mroeschke as a code owner March 13, 2024 00:30

dontgoto added 4 commits March 13, 2024 01:42

revert CLI formatting

22c8dce

Merge branch 'main' into improve_docstring_check_performance

8194d25

Merge remote-tracking branch 'upstream/main' into improve_docstring_check_performance

5d3bd40

Merge remote-tracking branch 'origin/improve_docstring_check_performance' into improve_docstring_check_performance

1e27de8

datapythonista added Docs CI Continuous Integration labels Mar 13, 2024

dontgoto added 2 commits March 14, 2024 10:15

introduce new shell parameter

99f0cee

fix merge errors

06cebc7

Merge branch 'main' into improve_docstring_check_performance

bf5da55

dontgoto added 3 commits March 16, 2024 14:54

use for_error_ignore_functions everywhere

69545dc

Merge remote-tracking branch 'origin/improve_docstring_check_performance' into improve_docstring_check_performance

e9a3a8c

Merge remote-tracking branch 'upstream/main' into improve_docstring_check_performance

12aac1b

dontgoto added 2 commits March 16, 2024 15:12

cleanup

96f269b

remove whitespace changes

6140bb5

datapythonista reviewed Mar 16, 2024

scripts/validate_docstrings.py (outdated)

ci/code_checks.sh (outdated)

scripts/validate_docstrings.py (outdated)

scripts/validate_docstrings.py (outdated)

dontgoto added 4 commits March 16, 2024 19:25

Merge branch 'main' into improve_docstring_check_performance

1038b45

integrate review comments

013514c

Merge remote-tracking branch 'origin/improve_docstring_check_performance' into improve_docstring_check_performance

5bf704d

remove cmd formatting code

a172c67

jordan-d-murphy reviewed Mar 17, 2024

ci/code_checks.sh

dontgoto added 2 commits March 17, 2024 02:37

add comment

3e315c7

merge main and fix conflicts in code_checks.sh

337db47

fix line cont char

a958037

dontgoto mentioned this pull request Mar 17, 2024

CI: Docstring validation is slow #57578

Open

datapythonista merged commit 89898a6 into pandas-dev:main Mar 17, 2024
46 of 47 checks passed

jordan-d-murphy mentioned this pull request Mar 18, 2024

CI: Better error control in the validation of docstrings #57879

Merged

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024

CI: speedup docstring check consecutive runs (pandas-dev#57826)

e1a3d39
