Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to validate_docstring script to be able to check all docstrings at once #22408

Merged
merged 13 commits into from
Oct 13, 2018
Merged

Changes to validate_docstring script to be able to check all docstrings at once #22408

merged 13 commits into from
Oct 13, 2018

Conversation

datapythonista
Copy link
Member

@datapythonista datapythonista commented Aug 18, 2018

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This PR includes two main changes:

  1. When calling the scripts/validate_docstrings.py script without parameters, it returns a JSON with the validation of all docstrings.
  2. Refactoring the structure of the code, mainly to be able to implement 1.

The previous behavior when calling the script with parameters was to generate a csv with basic information about each docstring (not the full validation, but basic things like whether sections existed, or the file where the function is defined).

The main change is that validate_one() now returns the results as Python objects, instead of printing directly to the console. Few other changes to make the code cleaner are implemented (mainly moving _load_obj and _original_callable inside the Docstring class.

@codecov
Copy link

codecov bot commented Aug 18, 2018

Codecov Report

Merging #22408 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22408      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50830    50873      +43     
==========================================
+ Hits        46860    46904      +44     
+ Misses       3970     3969       -1
Flag Coverage Δ
#multiple 90.61% <ø> (+0.01%) ⬆️
#single 42.32% <ø> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/dtypes.py 95.56% <0%> (-0.56%) ⬇️
pandas/core/internals/blocks.py 93.48% <0%> (-0.37%) ⬇️
pandas/core/reshape/merge.py 93.89% <0%> (-0.26%) ⬇️
pandas/core/ops.py 97.19% <0%> (-0.19%) ⬇️
pandas/core/arrays/categorical.py 95.62% <0%> (-0.13%) ⬇️
pandas/io/pytables.py 92.44% <0%> (-0.05%) ⬇️
pandas/core/generic.py 96.65% <0%> (-0.02%) ⬇️
pandas/core/strings.py 98.63% <0%> (ø) ⬆️
pandas/core/dtypes/base.py 100% <0%> (ø) ⬆️
pandas/core/tools/datetimes.py 85.22% <0%> (ø) ⬆️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f771ef6...db782fd. Read the comment docs.

@@ -490,25 +490,25 @@ def _import_path(self, klass=None, func=None):
return base_path

def test_good_class(self):
assert validate_one(self._import_path( # noqa: F821
klass='GoodDocStrings')) == 0
assert len(validate_one(self._import_path( # noqa: F821
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PEP8 can just use implicit truthiness of empty list, so just not ...['errors'] instead of len(...['errors']) == 0. Applicable to other tests below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general that is true, however in this case I think it is good to test it is actually a zero-length list and not something else Falsey?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on both comments, I'll change as @WillAyd suggests, and will also assert that the result is a list.

@@ -546,7 +546,6 @@ def test_bad_generic_functions(self, func):
marks=pytest.mark.xfail)
])
def test_bad_examples(self, capsys, klass, func, msgs):
validate_one(self._import_path(klass=klass, func=func)) # noqa:F821
err = capsys.readouterr().err
res = validate_one(self._import_path(klass=klass, func=func)) # noqa:F821
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency let's use result instead of res

for msg in msgs:
assert msg in err
assert msg in ''.join(res['errors'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to join as a string here or can we just check for inclusion in the list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not wrong, the messages we're checking are no the whole error message, but part of it. So, this is equivalent to something like:

assert 'error in summary' in ''.join(['there is a error in summary for this doc', 'some other error'])

I think it makes more sense to use the join, as the error message will be clearer if this fails. Compared to checking all in a comprehension and them doing an any (which I guess will just say False is not True).

for line in f:
line = line.strip()
if len(line) == len(previous_line):
if set(line) == set('-'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to consolidate this and below conditional into one for readability and less copy/paste?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I somehow agree with what you say, but I think in this case there is no way (at least I couldn't find it) to merge both use cases without increasing the amount of code and its complexity.

pass
else:
continue
def get_api_items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather complex function - can we add a docstring to explain what this is doing?

return fname
try:
fname = inspect.getsourcefile(self.code_obj)
except TypeError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what exactly? the inspect.getsourcefile() returns the source file where a Python object is implemented. The except TypeError is explain in the comment below.

else:
return validate_one(function)
doc_info = validate_one(func_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason why we write different output for one vs all simply console readability? Wondering if we shouldn't align the output regardless of how this is invoked (maybe a separate PR depending on complexity)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, but I think the lack of consistency is necessary in this case.

If the script is called for one object (a user validating the docstring they are changing), the "pretty" output in the console is what makes sense to return.

If the script is called for all objects (we as maintainers analyzing the status), a JSON with the information to be analyzed with pandas makes more sense than any console formatting.

Does it make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I'm on the fence about it myself. Not something that needs to be addressed here per se but food for thought

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give a quick summary of the changes? (it seems more a bit of refactor than new functionality? Or are there user-visible changes?)

@@ -490,25 +490,25 @@ def _import_path(self, klass=None, func=None):
return base_path

def test_good_class(self):
assert validate_one(self._import_path( # noqa: F821
klass='GoodDocStrings')) == 0
assert len(validate_one(self._import_path( # noqa: F821
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general that is true, however in this case I think it is good to test it is actually a zero-length list and not something else Falsey?

@datapythonista
Copy link
Member Author

Thanks for the great feedback. I addressed most of the points, and commented on the others.

@jorisvandenbossche agree some more info about the PR had to be provided. I edited the description to include what's the goal, and what else has been changed.

pass
else:
continue
def get_api_items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the docstring it helps a lot! Is there a way to make a minimal test for this? Don't want to go overboard with tests here but I am worried that if this were to somehow break in the future it could allow subtle failures in subsequent tests to go through unnoticed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, but I'd address it in a separate PR

@jorisvandenbossche
Copy link
Member

I edited the description to include what's the goal, and what else has been changed.

OK, that sounds good to me!

@datapythonista
Copy link
Member Author

@WillAyd let me know if there is any blocker to merge this

@jreback jreback added this to the 0.24.0 milestone Aug 22, 2018
@WillAyd
Copy link
Member

WillAyd commented Aug 22, 2018

@datapythonista sorry I think the size of the diff is misleading; I didn't realize I was asking you to change things that were orthogonal to your change in the first place!

Is this addressing something critical that we need to get in quickly? On one side I suppose it's OK because this isn't going to break anything critical, but on the other hand I do think our test coverage here is pretty weak and think it would be easier to address sooner than later

@datapythonista
Copy link
Member Author

Not critical, but it'll take me few days to have the time to write those tests. And as the PR modifies the whole script, it is blocking (or will generate conflicts) with any other change to it.

#22423 is already a case, and I'd like to start creating issues for tasks in #20298, as they will save us a lot of time when reviewing the docstring updates.

But it can surely wait a couple of weeks if you think it's worth adding the tests here.

@WillAyd
Copy link
Member

WillAyd commented Aug 22, 2018

Thanks @datapythonista - I'd prefer tests up front. Hate to be a blocker on this, but realistically the longer we wait for test coverage the harder it becomes to ever get there

@pep8speaks
Copy link

pep8speaks commented Sep 28, 2018

Hello @datapythonista! Thanks for updating the PR.

Comment last updated on September 28, 2018 at 15:36 Hours UTC

@datapythonista
Copy link
Member Author

@WillAyd I added the requested tests, let me know if this can be merge now

@datapythonista
Copy link
Member Author

@jorisvandenbossche can you take a look at this? it's the PR that makes the docstring validation script generate the list with the errors in all docstrings.

Once this is merged I'll start creating individual tickets for all the points, so we start validating all possible errors in the docstrings (and with some luck we can add it to the CI soon).

@jreback
Copy link
Contributor

jreback commented Oct 7, 2018

cc @jorisvandenbossche @WillAyd

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry missed the last ping

for msg in msgs:
assert msg in err
assert msg in ' '.join(result['errors'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but what does this add? Can't we just do the normal inclusion check against the list instead of joining into a string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msg in result['errors'] is different than this, for example 'a' in 'foo bar' is True, but a in ['foo', 'bar'] is False.

I think the .join() is simpler than another loop, is it what you would do, or you were thinking on the previous case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No but I thought msg would explicitly match one of the errors; will take a look more deeply on next review if that assumption is incorrect

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugging this locally still don't think the string concatenation is necessary?

image

assert msg in ' '.join(result['errors'])


class TestApiItems(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also minor but since this test isn't used for any testing (i.e. not collected by the runner) can we remove the Test prefix?

(3, 'random.seed'),
(4, 'random.randint')])
def test_item_name(self, idx, name):
res = list(validate_docstrings.get_api_items(self.api_doc))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use result instead of res?

continue
def get_api_items(api_doc_fd):
"""
Parse api.rst file from the documentation, and extract all the functions,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the first sentence one line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this? Maybe missed on prior review

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got the one liner summary above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake - thanks for clarifying


if not errs:
sys.stderr.write('Docstring for "{}" correct. :)\n'.format(func_name))
def validate_all():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this to be the main function of the change I was thinking the test cases would cover this but don't see that - oversight on my end or something we can test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is just boilerplate code, I don't think it adds any value to test that, as we would be mocking everything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm if we mocked the return of get_api_items to just return say one docstring could we assert that this function behaves the same (or similar - minor differences perhaps expected) to validate_one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure if that would be useful, in my experience those kind of test just break when the function being mocked changes, and don't add that much value.

But anyway, I'd surely prefer to spend the scarce time I have to make pandas better in something else than finding ways to test a function that loops over the elements of a function that is tested, and calls another function that is tested for every item, in a script that validates docstrings. :)

Let's move forward if you don't mind, there are plenty of validations that I want to add to the docstrings, that will save a lot of time of the reviewing of doc PRs, and this is a blocker to implement those.

@datapythonista
Copy link
Member Author

@WillAyd addressed or commented your review, can you take another look please?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - just some more comments / detail

for msg in msgs:
assert msg in err
assert msg in ' '.join(result['errors'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugging this locally still don't think the string concatenation is necessary?

image

continue
def get_api_items(api_doc_fd):
"""
Parse api.rst file from the documentation, and extract all the functions,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this? Maybe missed on prior review

self.doc = NumpyDocString(self.clean_doc)

def __len__(self):
return len(self.raw_doc)

@staticmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity what's with the addition of the staticmethod here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

staticmethod makes it explicit that this is a function inside of a class, and not a normal method that accesses the class attributes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the calls within the code base are using an instance of the class (save the example in the docstring). With that being the case would prefer to remove the decorator unless there's actually a use case for it; otherwise just adds fluff

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem at all to make it a regular method, I leave often static methods as regular methods, as it makes almost no difference.

But I don't understand your comment about using an instance of the class. Do you mean that it's not something that we usually use?

The way I see it is that methods of classes in general access the attributes and other methods of the class. As they are encapsulate a set of data and code that somehow works together. Functions in the other hand are somehow independent. In few cases, there are functions that while they do not access anything else from the class, they make more sense inside the class than outside. This is the case here, as _load_obj is only used by other methods of the class, and living outside at the same level of the class Docstring, IMO makes it look more important in the code than what really is, as conceptually it makes more sense at the same level of the other methods.

Assuming that as a method of the class is a better decision, having it as an staticmethod makes very quick to understand that it's a function inside of a class, and not a method that can access other methods or attributes. I personally find this information very valuable when reading code. Normal methods can access anything in the class, adding an extra complexity in the way that you can't see them as pure input->output units, but that can instead use somehow global variables. A static method doesn't have this complexity, and I know I can expect it to use only whatever parameter it receives, return a result, and not modify anything else. So, it gives a lot of information without having to read its code. The cost is just having the text @staticmethod instead of self. I think the few characters of difference is a very good price for the amount of information it provides. And I don't think that not using them often in the pandas code itself is a good reason for not using it here. May be we should start using them more there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on explicitly using staticmethod, if it is actually one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that I don't see the benefit of just adding things like this without any accompanying test regardless of how trivial the change may seem

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit lost on what's the connection between the staticmethod and the testing here.

The benefit of this PR is that makes the code more organized and better. And it implements a way to extract detailed information of the errors in the docstrings. So I can quickly generate the list of functions that fail a specific validation, fix the problem in all them, and start checking in the CI those validations. So the errors don't happen again, and I don't need to waste time on manually reviewing them in all the docstrings PRs.

A huge benefit for me. :) So would be great to have this merged. Is there any blocker, or can I merge on green?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've blocked long enough so if others are fine with it majority rules :-)

Just generally trying to minimize PRs and keep them directed towards one clear goal at a time, hence the pushback on this one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hehe, I'm trying to maximize PRs, so hopefully we'll end up in a good equilibrium. ;) After this is merged I'll start creating issues for the points in #20298, so we have better validation of the docstrings.

@jorisvandenbossche happy to merge this?

obj = getattr(obj, part)
return obj

@staticmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question for staticmethod

return None

@property
def type(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a property? Only see one usage in the code base which makes me thing the benefit of adding to the class is minimal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think conceptually this is more an attribute that is being calculated, than a method that performs an action, and that's what I use properties for. I think it helps read the code being explicit on that.


if not errs:
sys.stderr.write('Docstring for "{}" correct. :)\n'.format(func_name))
def validate_all():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm if we mocked the return of get_api_items to just return say one docstring could we assert that this function behaves the same (or similar - minor differences perhaps expected) to validate_one?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment from me then I'll leave to others for review :-)

self.doc = NumpyDocString(self.clean_doc)

def __len__(self):
return len(self.raw_doc)

@staticmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the calls within the code base are using an instance of the class (save the example in the docstring). With that being the case would prefer to remove the decorator unless there's actually a use case for it; otherwise just adds fluff

@datapythonista
Copy link
Member Author

Merging this, so we can finally unblock #20298. If there is any other comment I'll take care of it in a separate PR.

@datapythonista datapythonista merged commit 85dc171 into pandas-dev:master Oct 13, 2018
@jorisvandenbossche
Copy link
Member

Sorry forgot about this one. Thanks for moving this forward!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants