-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes to validate_docstring script to be able to check all docstrings at once #22408
Changes to validate_docstring script to be able to check all docstrings at once #22408
Conversation
Codecov Report
@@ Coverage Diff @@
## master #22408 +/- ##
==========================================
+ Coverage 92.18% 92.19% +<.01%
==========================================
Files 169 169
Lines 50830 50873 +43
==========================================
+ Hits 46860 46904 +44
+ Misses 3970 3969 -1
Continue to review full report at Codecov.
|
@@ -490,25 +490,25 @@ def _import_path(self, klass=None, func=None): | |||
return base_path | |||
|
|||
def test_good_class(self): | |||
assert validate_one(self._import_path( # noqa: F821 | |||
klass='GoodDocStrings')) == 0 | |||
assert len(validate_one(self._import_path( # noqa: F821 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For PEP8 can just use implicit truthiness of empty list, so just not ...['errors']
instead of len(...['errors']) == 0
. Applicable to other tests below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general that is true, however in this case I think it is good to test it is actually a zero-length list and not something else Falsey?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on both comments, I'll change as @WillAyd suggests, and will also assert that the result is a list.
@@ -546,7 +546,6 @@ def test_bad_generic_functions(self, func): | |||
marks=pytest.mark.xfail) | |||
]) | |||
def test_bad_examples(self, capsys, klass, func, msgs): | |||
validate_one(self._import_path(klass=klass, func=func)) # noqa:F821 | |||
err = capsys.readouterr().err | |||
res = validate_one(self._import_path(klass=klass, func=func)) # noqa:F821 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency let's use result
instead of res
for msg in msgs: | ||
assert msg in err | ||
assert msg in ''.join(res['errors']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to join as a string here or can we just check for inclusion in the list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not wrong, the messages we're checking are no the whole error message, but part of it. So, this is equivalent to something like:
assert 'error in summary' in ''.join(['there is a error in summary for this doc', 'some other error'])
I think it makes more sense to use the join
, as the error message will be clearer if this fails. Compared to checking all in a comprehension and them doing an any
(which I guess will just say False
is not True
).
scripts/validate_docstrings.py
Outdated
for line in f: | ||
line = line.strip() | ||
if len(line) == len(previous_line): | ||
if set(line) == set('-'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to consolidate this and below conditional into one for readability and less copy/paste?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I somehow agree with what you say, but I think in this case there is no way (at least I couldn't find it) to merge both use cases without increasing the amount of code and its complexity.
scripts/validate_docstrings.py
Outdated
pass | ||
else: | ||
continue | ||
def get_api_items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather complex function - can we add a docstring to explain what this is doing?
return fname | ||
try: | ||
fname = inspect.getsourcefile(self.code_obj) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what exactly? the inspect.getsourcefile()
returns the source file where a Python object is implemented. The except TypeError
is explain in the comment below.
else: | ||
return validate_one(function) | ||
doc_info = validate_one(func_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason why we write different output for one vs all simply console readability? Wondering if we shouldn't align the output regardless of how this is invoked (maybe a separate PR depending on complexity)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, but I think the lack of consistency is necessary in this case.
If the script is called for one object (a user validating the docstring they are changing), the "pretty" output in the console is what makes sense to return.
If the script is called for all objects (we as maintainers analyzing the status), a JSON with the information to be analyzed with pandas makes more sense than any console formatting.
Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I'm on the fence about it myself. Not something that needs to be addressed here per se but food for thought
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give a quick summary of the changes? (it seems more a bit of refactor than new functionality? Or are there user-visible changes?)
@@ -490,25 +490,25 @@ def _import_path(self, klass=None, func=None): | |||
return base_path | |||
|
|||
def test_good_class(self): | |||
assert validate_one(self._import_path( # noqa: F821 | |||
klass='GoodDocStrings')) == 0 | |||
assert len(validate_one(self._import_path( # noqa: F821 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general that is true, however in this case I think it is good to test it is actually a zero-length list and not something else Falsey?
Thanks for the great feedback. I addressed most of the points, and commented on the others. @jorisvandenbossche agree some more info about the PR had to be provided. I edited the description to include what's the goal, and what else has been changed. |
scripts/validate_docstrings.py
Outdated
pass | ||
else: | ||
continue | ||
def get_api_items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the docstring it helps a lot! Is there a way to make a minimal test for this? Don't want to go overboard with tests here but I am worried that if this were to somehow break in the future it could allow subtle failures in subsequent tests to go through unnoticed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, but I'd address it in a separate PR
OK, that sounds good to me! |
@WillAyd let me know if there is any blocker to merge this |
@datapythonista sorry I think the size of the diff is misleading; I didn't realize I was asking you to change things that were orthogonal to your change in the first place! Is this addressing something critical that we need to get in quickly? On one side I suppose it's OK because this isn't going to break anything critical, but on the other hand I do think our test coverage here is pretty weak and think it would be easier to address sooner than later |
Not critical, but it'll take me few days to have the time to write those tests. And as the PR modifies the whole script, it is blocking (or will generate conflicts) with any other change to it. #22423 is already a case, and I'd like to start creating issues for tasks in #20298, as they will save us a lot of time when reviewing the docstring updates. But it can surely wait a couple of weeks if you think it's worth adding the tests here. |
Thanks @datapythonista - I'd prefer tests up front. Hate to be a blocker on this, but realistically the longer we wait for test coverage the harder it becomes to ever get there |
Hello @datapythonista! Thanks for updating the PR.
Comment last updated on September 28, 2018 at 15:36 Hours UTC |
@WillAyd I added the requested tests, let me know if this can be merge now |
@jorisvandenbossche can you take a look at this? it's the PR that makes the docstring validation script generate the list with the errors in all docstrings. Once this is merged I'll start creating individual tickets for all the points, so we start validating all possible errors in the docstrings (and with some luck we can add it to the CI soon). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry missed the last ping
for msg in msgs: | ||
assert msg in err | ||
assert msg in ' '.join(result['errors']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit but what does this add? Can't we just do the normal inclusion check against the list instead of joining into a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
msg in result['errors']
is different than this, for example 'a' in 'foo bar'
is True
, but a in ['foo', 'bar']
is False
.
I think the .join()
is simpler than another loop, is it what you would do, or you were thinking on the previous case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No but I thought msg
would explicitly match one of the errors
; will take a look more deeply on next review if that assumption is incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert msg in ' '.join(result['errors']) | ||
|
||
|
||
class TestApiItems(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also minor but since this test isn't used for any testing (i.e. not collected by the runner) can we remove the Test
prefix?
(3, 'random.seed'), | ||
(4, 'random.randint')]) | ||
def test_item_name(self, idx, name): | ||
res = list(validate_docstrings.get_api_items(self.api_doc)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use result
instead of res
?
continue | ||
def get_api_items(api_doc_fd): | ||
""" | ||
Parse api.rst file from the documentation, and extract all the functions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the first sentence one line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change this? Maybe missed on prior review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've got the one liner summary above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake - thanks for clarifying
|
||
if not errs: | ||
sys.stderr.write('Docstring for "{}" correct. :)\n'.format(func_name)) | ||
def validate_all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming this to be the main function of the change I was thinking the test cases would cover this but don't see that - oversight on my end or something we can test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is just boilerplate code, I don't think it adds any value to test that, as we would be mocking everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm if we mocked the return of get_api_items
to just return say one docstring could we assert that this function behaves the same (or similar - minor differences perhaps expected) to validate_one
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really sure if that would be useful, in my experience those kind of test just break when the function being mocked changes, and don't add that much value.
But anyway, I'd surely prefer to spend the scarce time I have to make pandas better in something else than finding ways to test a function that loops over the elements of a function that is tested, and calls another function that is tested for every item, in a script that validates docstrings. :)
Let's move forward if you don't mind, there are plenty of validations that I want to add to the docstrings, that will save a lot of time of the reviewing of doc PRs, and this is a blocker to implement those.
@WillAyd addressed or commented your review, can you take another look please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - just some more comments / detail
for msg in msgs: | ||
assert msg in err | ||
assert msg in ' '.join(result['errors']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
continue | ||
def get_api_items(api_doc_fd): | ||
""" | ||
Parse api.rst file from the documentation, and extract all the functions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change this? Maybe missed on prior review
self.doc = NumpyDocString(self.clean_doc) | ||
|
||
def __len__(self): | ||
return len(self.raw_doc) | ||
|
||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity what's with the addition of the staticmethod
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
staticmethod
makes it explicit that this is a function inside of a class, and not a normal method that accesses the class attributes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the calls within the code base are using an instance of the class (save the example in the docstring). With that being the case would prefer to remove the decorator unless there's actually a use case for it; otherwise just adds fluff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a problem at all to make it a regular method, I leave often static methods as regular methods, as it makes almost no difference.
But I don't understand your comment about using an instance of the class. Do you mean that it's not something that we usually use?
The way I see it is that methods of classes in general access the attributes and other methods of the class. As they are encapsulate a set of data and code that somehow works together. Functions in the other hand are somehow independent. In few cases, there are functions that while they do not access anything else from the class, they make more sense inside the class than outside. This is the case here, as _load_obj
is only used by other methods of the class, and living outside at the same level of the class Docstring
, IMO makes it look more important in the code than what really is, as conceptually it makes more sense at the same level of the other methods.
Assuming that as a method of the class is a better decision, having it as an staticmethod makes very quick to understand that it's a function inside of a class, and not a method that can access other methods or attributes. I personally find this information very valuable when reading code. Normal methods can access anything in the class, adding an extra complexity in the way that you can't see them as pure input->output units, but that can instead use somehow global variables. A static method doesn't have this complexity, and I know I can expect it to use only whatever parameter it receives, return a result, and not modify anything else. So, it gives a lot of information without having to read its code. The cost is just having the text @staticmethod
instead of self
. I think the few characters of difference is a very good price for the amount of information it provides. And I don't think that not using them often in the pandas code itself is a good reason for not using it here. May be we should start using them more there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on explicitly using staticmethod, if it is actually one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that I don't see the benefit of just adding things like this without any accompanying test regardless of how trivial the change may seem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit lost on what's the connection between the staticmethod and the testing here.
The benefit of this PR is that makes the code more organized and better. And it implements a way to extract detailed information of the errors in the docstrings. So I can quickly generate the list of functions that fail a specific validation, fix the problem in all them, and start checking in the CI those validations. So the errors don't happen again, and I don't need to waste time on manually reviewing them in all the docstrings PRs.
A huge benefit for me. :) So would be great to have this merged. Is there any blocker, or can I merge on green?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've blocked long enough so if others are fine with it majority rules :-)
Just generally trying to minimize PRs and keep them directed towards one clear goal at a time, hence the pushback on this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hehe, I'm trying to maximize PRs, so hopefully we'll end up in a good equilibrium. ;) After this is merged I'll start creating issues for the points in #20298, so we have better validation of the docstrings.
@jorisvandenbossche happy to merge this?
obj = getattr(obj, part) | ||
return obj | ||
|
||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question for staticmethod
return None | ||
|
||
@property | ||
def type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a property? Only see one usage in the code base which makes me thing the benefit of adding to the class is minimal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think conceptually this is more an attribute that is being calculated, than a method that performs an action, and that's what I use properties for. I think it helps read the code being explicit on that.
|
||
if not errs: | ||
sys.stderr.write('Docstring for "{}" correct. :)\n'.format(func_name)) | ||
def validate_all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm if we mocked the return of get_api_items
to just return say one docstring could we assert that this function behaves the same (or similar - minor differences perhaps expected) to validate_one
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last comment from me then I'll leave to others for review :-)
self.doc = NumpyDocString(self.clean_doc) | ||
|
||
def __len__(self): | ||
return len(self.raw_doc) | ||
|
||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the calls within the code base are using an instance of the class (save the example in the docstring). With that being the case would prefer to remove the decorator unless there's actually a use case for it; otherwise just adds fluff
Merging this, so we can finally unblock #20298. If there is any other comment I'll take care of it in a separate PR. |
Sorry forgot about this one. Thanks for moving this forward! |
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR includes two main changes:
scripts/validate_docstrings.py
script without parameters, it returns a JSON with the validation of all docstrings.The previous behavior when calling the script with parameters was to generate a csv with basic information about each docstring (not the full validation, but basic things like whether sections existed, or the file where the function is defined).
The main change is that
validate_one()
now returns the results as Python objects, instead of printing directly to the console. Few other changes to make the code cleaner are implemented (mainly moving_load_obj
and_original_callable
inside theDocstring
class.