Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: speedup docstring check consecutive runs #57826

Merged

Conversation

dontgoto
Copy link
Contributor

@dontgoto dontgoto commented Mar 13, 2024

This PR brings the runtime of the docstring check CI down to 2-3 minutes from about 20 minutes.

Currently, check_code.sh docstring does multiple calls to validate_docstrings.py due to various error types that have function exceptions. Each run of validate_docstringstakes about 2-3 minutes, leading to the current 20 minutes runtime just for the docstring checks.

The runtime for consecutive calls is brought down to that of a single call by changing the argument parsing of validate_docstrings, adding -- to separate parameters that previously were separate function calls. Additionally, a cache is added to reuse the parsing results between runs. The cache size should pose no problem, the Python instance ran by `check_code.sh docstring' only reserves about 95MB of memory on my machine.

@dontgoto dontgoto requested a review from mroeschke as a code owner March 13, 2024 00:30
@MarcoGorelli
Copy link
Member

This PR brings the runtime of the docstring check CI down to 2-3 minutes from about 20 minutes.

Wow - my hat is raised above my head

@datapythonista fancy taking a look?

@datapythonista datapythonista added Docs CI Continuous Integration labels Mar 13, 2024
@datapythonista
Copy link
Member

Thanks @dontgoto for making the CI much faster.

While I'm open to getting this merged, I'm a bit unsure about using this approach. In an ideal world we'd like to call validate_docstrings just once for all errors and for all files. This is clearly not the case today, but hopefully we'll eventually get there, and making things significantly more complex for something hopefully lasting only few months may not be worth.

Also, if we want to implement this I think I'd prefer another API where we can simply specify which files and errors to ignore together. Not sure exactly the best way to do this, but something like this should be clearer and simpler IMHO: ./validate_docstring.py --ignore pandas/core.py,PR03 --ignore pandas/frame.py,EX01,EX02

As I said, I'm not opposed to merge this PR, but I think it's making the validator significantly more complex to understand, which is probably worth for the speedup now, but thinking longer term not so much.

What do you think @dontgoto ?

@dontgoto
Copy link
Contributor Author

I think I can change the command line args and their preprocessing quite easily to match your --ignore idea, but in the end still run everything sequentially. Your parameter variant is indeed easier to understand.

I agree that making the rest of the code match this kind of structure and ditching the repeated sequential calls would be preferable, but maybe a bit much of a time investment.

I might look into that in the future though, I have an idea for another PR that might make the docstring checks "commit-hookably" fast, but I still need to test it out.

@datapythonista
Copy link
Member

I think implementing what I said is not trivial, but I think just one validate call per function would be enough if we do that, so I think it should be as fast as your implementation here, or maybe even a bit faster.

@dontgoto
Copy link
Contributor Author

I pushed a first version of the --ignore parameter. In my opinion, the behavior of the CLI parameters is now easier to understand, but the parsing logic in the script got more complex.

Let me know what you think about the changes to the .py. I am ambivalent, both versions have their own pain points.

If we go ahead with the current version, I'll tidy it up and add tests for the parsing. Merge conflicts for docstring exceptions removed from main in the meantime are to be expected.

@dontgoto
Copy link
Contributor Author

think just one validate call per function would be enough if we do that, so I think it should be as fast as your implementation here, or maybe even a bit faster.

Regarding performance, everything outside the cached validate function is basically free. Calls of main finish in milliseconds or less once the validate document cache is filled in the first run. So no need to minimize calls of main just from the performance angle.

@datapythonista
Copy link
Member

Thanks for the work on this @dontgoto.

What I had in mind is different to what you implemented. And I think it should be simpler, and reasonably fast.

Regardless of the command line API we implement, we would end up with a list of function names and which errors we need to ignore for them. We have this stored in a variable, and then we call the validator normally: https://github.com/pandas-dev/pandas/blob/main/scripts/validate_docstrings.py#L317

At this point, we have in the result the errors that have been found in the function. If the error is in the list of errors to ignore, we can just remove that from the result. This way we don't need cache, we don't need to much extra complexity, and we call the parsing and the validation just once.

What do you think?

@dontgoto
Copy link
Contributor Author

I agree. The version I previously pushed was a half measure, just adding a variant of the CMD parameter required for this solution, but shoehorning it into the old logic. I refactored everything to use the new parameter, simplifying the parsing. I am satisfied with this solution, let me know whether you agree.

For the new CMD parameter for_error_ignore_functions I find a mapping of error: list[funcs_to_ignore] to be the best fit when considering the maintenance of the exception lists in the code_checks.sh. Just the initial formatting changes there are not nice.

Open for a different name for the parameter though.

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the work here @dontgoto, really great job. I think this is way clearer, and also thanks for fixing the typos in the file.

I added a couple of comments that I think should make the code clearer, both in the script and in code_checks.sh, but in general your changes look great.

scripts/validate_docstrings.py Outdated Show resolved Hide resolved
ci/code_checks.sh Outdated Show resolved Hide resolved
scripts/validate_docstrings.py Outdated Show resolved Hide resolved
scripts/validate_docstrings.py Outdated Show resolved Hide resolved
@datapythonista
Copy link
Member

Looks great, thanks a lot for the work here @dontgoto

@jordan-d-murphy @tqa236 do you mind having a look here and sharing any feedback on this change? Thanks!

@jordan-d-murphy
Copy link
Contributor

jordan-d-murphy commented Mar 17, 2024

This is such a refreshing PR! Love to see this. I am in full support of this new approach. I added one suggestion - but leave it up to you if you think it's valuable to include or not.

My main thoughts on this are:

  1. I love this new approach. I appreciate all the work that was done on this. Would love to see this merged in.

  2. Once this is merged in, I can close the following Issues which I opened based on the previous approach we were using in check_code.sh / validate_docstrings.py

DOC: fix GL08 errors in docstrings
DOC: fix PR01 errors in docstrings
DOC: fix PR07 errors in docstrings
DOC: fix SA01 errors in docstrings
DOC: fix RT03 errors in docstrings
DOC: fix PR02 errors in docstrings

  1. After closing the above issues, I can open a new issue to address fixing the docstrings that follows this new approach

  2. And finally, there seems to be one cryptic failing CI check, ASAN / UBSAN - would like to see this resolved and all green on the CI, but as the logs got deleted, it's hard to tell if this is related to this PR or some outside issue.

@dontgoto
Copy link
Contributor Author

dontgoto commented Mar 17, 2024

Thanks!

  • And finally, there seems to be one cryptic failing CI check, ASAN / UBSAN - would like to see this resolved and all green on the CI, but as the logs got deleted, it's hard to tell if this is related to this PR or some outside issue.

It seems that this test is currently failing on this and many other PRs, unit tests are failing on main as well.

I'd be happy to see this merged since resolving the conflicts with ongoing doc fix PRs is a hassle. Let me know if there is anything else blocking this.

Thanks again for all the great feedback and the welcoming atmosphere :)

@jordan-d-murphy
Copy link
Contributor

Awesome! Lgtm 🙂

@tqa236
Copy link
Contributor

tqa236 commented Mar 17, 2024

Hello, this is a great improvement! LGTM too.

@datapythonista datapythonista merged commit 89898a6 into pandas-dev:main Mar 17, 2024
46 of 47 checks passed
@datapythonista
Copy link
Member

Amazing job @dontgoto.

I see the code_checks job takes 20 minutes instead of 40 after this, and I think this will help a lot with the efforts to fix errors in docstrings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration Docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants