Changes to validate_docstring script to be able to check all docstrings at once #22408

datapythonista · 2018-08-18T13:28:48Z

closes #xxxx
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This PR includes two main changes:

When calling the scripts/validate_docstrings.py script without parameters, it returns a JSON with the validation of all docstrings.
Refactoring the structure of the code, mainly to be able to implement 1.

The previous behavior when calling the script with parameters was to generate a csv with basic information about each docstring (not the full validation, but basic things like whether sections existed, or the file where the function is defined).

The main change is that validate_one() now returns the results as Python objects, instead of printing directly to the console. Few other changes to make the code cleaner are implemented (mainly moving _load_obj and _original_callable inside the Docstring class.

…d for all docstrings

codecov · 2018-08-18T14:17:09Z

Codecov Report

Merging #22408 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #22408      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50830    50873      +43     
==========================================
+ Hits        46860    46904      +44     
+ Misses       3970     3969       -1

Flag	Coverage Δ
#multiple	`90.61% <ø> (+0.01%)`	⬆️
#single	`42.32% <ø> (-0.05%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/dtypes.py	`95.56% <0%> (-0.56%)`	⬇️
pandas/core/internals/blocks.py	`93.48% <0%> (-0.37%)`	⬇️
pandas/core/reshape/merge.py	`93.89% <0%> (-0.26%)`	⬇️
pandas/core/ops.py	`97.19% <0%> (-0.19%)`	⬇️
pandas/core/arrays/categorical.py	`95.62% <0%> (-0.13%)`	⬇️
pandas/io/pytables.py	`92.44% <0%> (-0.05%)`	⬇️
pandas/core/generic.py	`96.65% <0%> (-0.02%)`	⬇️
pandas/core/strings.py	`98.63% <0%> (ø)`	⬆️
pandas/core/dtypes/base.py	`100% <0%> (ø)`	⬆️
pandas/core/tools/datetimes.py	`85.22% <0%> (ø)`	⬆️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f771ef6...db782fd. Read the comment docs.

WillAyd · 2018-08-19T19:52:29Z

pandas/tests/scripts/test_validate_docstrings.py

@@ -490,25 +490,25 @@ def _import_path(self, klass=None, func=None):
        return base_path

    def test_good_class(self):
-        assert validate_one(self._import_path(  # noqa: F821
-            klass='GoodDocStrings')) == 0
+        assert len(validate_one(self._import_path(  # noqa: F821


For PEP8 can just use implicit truthiness of empty list, so just not ...['errors'] instead of len(...['errors']) == 0. Applicable to other tests below

In general that is true, however in this case I think it is good to test it is actually a zero-length list and not something else Falsey?

based on both comments, I'll change as @WillAyd suggests, and will also assert that the result is a list.

WillAyd · 2018-08-19T19:53:16Z

pandas/tests/scripts/test_validate_docstrings.py

@@ -546,7 +546,6 @@ def test_bad_generic_functions(self, func):
                     marks=pytest.mark.xfail)
    ])
    def test_bad_examples(self, capsys, klass, func, msgs):
-        validate_one(self._import_path(klass=klass, func=func))  # noqa:F821
-        err = capsys.readouterr().err
+        res = validate_one(self._import_path(klass=klass, func=func))  # noqa:F821


For consistency let's use result instead of res

WillAyd · 2018-08-19T19:53:57Z

pandas/tests/scripts/test_validate_docstrings.py

        for msg in msgs:
-            assert msg in err
+            assert msg in ''.join(res['errors'])


Do we need to join as a string here or can we just check for inclusion in the list?

If I'm not wrong, the messages we're checking are no the whole error message, but part of it. So, this is equivalent to something like:

assert 'error in summary' in ''.join(['there is a error in summary for this doc', 'some other error'])

I think it makes more sense to use the join, as the error message will be clearer if this fails. Compared to checking all in a comprehension and them doing an any (which I guess will just say False is not True).

WillAyd · 2018-08-19T19:55:46Z

scripts/validate_docstrings.py

+        for line in f:
+            line = line.strip()
+            if len(line) == len(previous_line):
+                if set(line) == set('-'):


Is there a way to consolidate this and below conditional into one for readability and less copy/paste?

I somehow agree with what you say, but I think in this case there is no way (at least I couldn't find it) to merge both use cases without increasing the amount of code and its complexity.

WillAyd · 2018-08-19T19:56:39Z

scripts/validate_docstrings.py

-            pass
-        else:
-            continue
+def get_api_items():


Rather complex function - can we add a docstring to explain what this is doing?

WillAyd · 2018-08-19T19:57:57Z

scripts/validate_docstrings.py

-            return fname
+        try:
+            fname = inspect.getsourcefile(self.code_obj)
+        except TypeError:


What does this do?

what exactly? the inspect.getsourcefile() returns the source file where a Python object is implemented. The except TypeError is explain in the comment below.

WillAyd · 2018-08-19T20:01:55Z

scripts/validate_docstrings.py

    else:
-        return validate_one(function)
+        doc_info = validate_one(func_name)


Is the reason why we write different output for one vs all simply console readability? Wondering if we shouldn't align the output regardless of how this is invoked (maybe a separate PR depending on complexity)

I see your point, but I think the lack of consistency is necessary in this case.

If the script is called for one object (a user validating the docstring they are changing), the "pretty" output in the console is what makes sense to return.

If the script is called for all objects (we as maintainers analyzing the status), a JSON with the information to be analyzed with pandas makes more sense than any console formatting.

Does it make sense?

Yea I'm on the fence about it myself. Not something that needs to be addressed here per se but food for thought

jorisvandenbossche

Can you give a quick summary of the changes? (it seems more a bit of refactor than new functionality? Or are there user-visible changes?)

jorisvandenbossche · 2018-08-20T06:38:39Z

pandas/tests/scripts/test_validate_docstrings.py

@@ -490,25 +490,25 @@ def _import_path(self, klass=None, func=None):
        return base_path

    def test_good_class(self):
-        assert validate_one(self._import_path(  # noqa: F821
-            klass='GoodDocStrings')) == 0
+        assert len(validate_one(self._import_path(  # noqa: F821


In general that is true, however in this case I think it is good to test it is actually a zero-length list and not something else Falsey?

…trings

datapythonista · 2018-08-20T12:57:51Z

Thanks for the great feedback. I addressed most of the points, and commented on the others.

@jorisvandenbossche agree some more info about the PR had to be provided. I edited the description to include what's the goal, and what else has been changed.

WillAyd · 2018-08-20T17:27:35Z

scripts/validate_docstrings.py

-            pass
-        else:
-            continue
+def get_api_items():


Thanks for the docstring it helps a lot! Is there a way to make a minimal test for this? Don't want to go overboard with tests here but I am worried that if this were to somehow break in the future it could allow subtle failures in subsequent tests to go through unnoticed

Makes sense to me, but I'd address it in a separate PR

jorisvandenbossche · 2018-08-20T18:29:01Z

I edited the description to include what's the goal, and what else has been changed.

OK, that sounds good to me!

…trings

datapythonista · 2018-08-22T10:17:41Z

@WillAyd let me know if there is any blocker to merge this

WillAyd · 2018-08-22T13:20:29Z

@datapythonista sorry I think the size of the diff is misleading; I didn't realize I was asking you to change things that were orthogonal to your change in the first place!

Is this addressing something critical that we need to get in quickly? On one side I suppose it's OK because this isn't going to break anything critical, but on the other hand I do think our test coverage here is pretty weak and think it would be easier to address sooner than later

datapythonista · 2018-08-22T14:00:13Z

Not critical, but it'll take me few days to have the time to write those tests. And as the PR modifies the whole script, it is blocking (or will generate conflicts) with any other change to it.

#22423 is already a case, and I'd like to start creating issues for tasks in #20298, as they will save us a lot of time when reviewing the docstring updates.

But it can surely wait a couple of weeks if you think it's worth adding the tests here.

WillAyd · 2018-08-22T14:02:13Z

Thanks @datapythonista - I'd prefer tests up front. Hate to be a blocker on this, but realistically the longer we wait for test coverage the harder it becomes to ever get there

…trings

pep8speaks · 2018-09-28T15:21:39Z

Hello @datapythonista! Thanks for updating the PR.

There are no PEP8 issues in the file scripts/tests/test_validate_docstrings.py !
There are no PEP8 issues in the file scripts/validate_docstrings.py !

Comment last updated on September 28, 2018 at 15:36 Hours UTC

…andas into validate_all_docstrings

…trings

datapythonista · 2018-09-29T23:34:40Z

@WillAyd I added the requested tests, let me know if this can be merge now

datapythonista · 2018-10-01T21:00:47Z

@jorisvandenbossche can you take a look at this? it's the PR that makes the docstring validation script generate the list with the errors in all docstrings.

Once this is merged I'll start creating individual tickets for all the points, so we start validating all possible errors in the docstrings (and with some luck we can add it to the CI soon).

jreback · 2018-10-07T23:11:44Z

cc @jorisvandenbossche @WillAyd

WillAyd

Sorry missed the last ping

WillAyd · 2018-10-08T03:28:50Z

scripts/tests/test_validate_docstrings.py

        for msg in msgs:
-            assert msg in err
+            assert msg in ' '.join(result['errors'])


Minor nit but what does this add? Can't we just do the normal inclusion check against the list instead of joining into a string?

msg in result['errors'] is different than this, for example 'a' in 'foo bar' is True, but a in ['foo', 'bar'] is False.

I think the .join() is simpler than another loop, is it what you would do, or you were thinking on the previous case?

No but I thought msg would explicitly match one of the errors; will take a look more deeply on next review if that assumption is incorrect

Debugging this locally still don't think the string concatenation is necessary?

WillAyd · 2018-10-08T03:30:04Z

scripts/tests/test_validate_docstrings.py

+            assert msg in ' '.join(result['errors'])
+
+
+class TestApiItems(object):


Also minor but since this test isn't used for any testing (i.e. not collected by the runner) can we remove the Test prefix?

WillAyd · 2018-10-08T03:32:22Z

scripts/tests/test_validate_docstrings.py

+                                          (3, 'random.seed'),
+                                          (4, 'random.randint')])
+    def test_item_name(self, idx, name):
+        res = list(validate_docstrings.get_api_items(self.api_doc))


Can you use result instead of res?

WillAyd · 2018-10-08T03:33:56Z

scripts/validate_docstrings.py

-            continue
+def get_api_items(api_doc_fd):
+    """
+    Parse api.rst file from the documentation, and extract all the functions,


Should we make the first sentence one line?

Can you change this? Maybe missed on prior review

You've got the one liner summary above.

My mistake - thanks for clarifying

WillAyd · 2018-10-08T03:46:28Z

scripts/validate_docstrings.py


-    if not errs:
-        sys.stderr.write('Docstring for "{}" correct. :)\n'.format(func_name))
+def validate_all():


Assuming this to be the main function of the change I was thinking the test cases would cover this but don't see that - oversight on my end or something we can test?

This function is just boilerplate code, I don't think it adds any value to test that, as we would be mocking everything.

Hmm if we mocked the return of get_api_items to just return say one docstring could we assert that this function behaves the same (or similar - minor differences perhaps expected) to validate_one?

Not really sure if that would be useful, in my experience those kind of test just break when the function being mocked changes, and don't add that much value.

But anyway, I'd surely prefer to spend the scarce time I have to make pandas better in something else than finding ways to test a function that loops over the elements of a function that is tested, and calls another function that is tested for every item, in a script that validates docstrings. :)

Let's move forward if you don't mind, there are plenty of validations that I want to add to the docstrings, that will save a lot of time of the reviewing of doc PRs, and this is a blocker to implement those.

datapythonista · 2018-10-08T04:02:43Z

@WillAyd addressed or commented your review, can you take another look please?

WillAyd

Thanks - just some more comments / detail

WillAyd · 2018-10-08T04:18:33Z

scripts/tests/test_validate_docstrings.py

        for msg in msgs:
-            assert msg in err
+            assert msg in ' '.join(result['errors'])


Debugging this locally still don't think the string concatenation is necessary?

WillAyd · 2018-10-08T04:18:57Z

scripts/validate_docstrings.py

-            continue
+def get_api_items(api_doc_fd):
+    """
+    Parse api.rst file from the documentation, and extract all the functions,


Can you change this? Maybe missed on prior review

WillAyd · 2018-10-08T04:20:57Z

scripts/validate_docstrings.py

        self.doc = NumpyDocString(self.clean_doc)

    def __len__(self):
        return len(self.raw_doc)

+    @staticmethod


Out of curiosity what's with the addition of the staticmethod here?

staticmethod makes it explicit that this is a function inside of a class, and not a normal method that accesses the class attributes

All of the calls within the code base are using an instance of the class (save the example in the docstring). With that being the case would prefer to remove the decorator unless there's actually a use case for it; otherwise just adds fluff

Not a problem at all to make it a regular method, I leave often static methods as regular methods, as it makes almost no difference.

But I don't understand your comment about using an instance of the class. Do you mean that it's not something that we usually use?

The way I see it is that methods of classes in general access the attributes and other methods of the class. As they are encapsulate a set of data and code that somehow works together. Functions in the other hand are somehow independent. In few cases, there are functions that while they do not access anything else from the class, they make more sense inside the class than outside. This is the case here, as _load_obj is only used by other methods of the class, and living outside at the same level of the class Docstring, IMO makes it look more important in the code than what really is, as conceptually it makes more sense at the same level of the other methods.

Assuming that as a method of the class is a better decision, having it as an staticmethod makes very quick to understand that it's a function inside of a class, and not a method that can access other methods or attributes. I personally find this information very valuable when reading code. Normal methods can access anything in the class, adding an extra complexity in the way that you can't see them as pure input->output units, but that can instead use somehow global variables. A static method doesn't have this complexity, and I know I can expect it to use only whatever parameter it receives, return a result, and not modify anything else. So, it gives a lot of information without having to read its code. The cost is just having the text @staticmethod instead of self. I think the few characters of difference is a very good price for the amount of information it provides. And I don't think that not using them often in the pandas code itself is a good reason for not using it here. May be we should start using them more there.

+1 on explicitly using staticmethod, if it is actually one

My point is that I don't see the benefit of just adding things like this without any accompanying test regardless of how trivial the change may seem

I'm a bit lost on what's the connection between the staticmethod and the testing here.

The benefit of this PR is that makes the code more organized and better. And it implements a way to extract detailed information of the errors in the docstrings. So I can quickly generate the list of functions that fail a specific validation, fix the problem in all them, and start checking in the CI those validations. So the errors don't happen again, and I don't need to waste time on manually reviewing them in all the docstrings PRs.

A huge benefit for me. :) So would be great to have this merged. Is there any blocker, or can I merge on green?

I've blocked long enough so if others are fine with it majority rules :-)

Just generally trying to minimize PRs and keep them directed towards one clear goal at a time, hence the pushback on this one

hehe, I'm trying to maximize PRs, so hopefully we'll end up in a good equilibrium. ;) After this is merged I'll start creating issues for the points in #20298, so we have better validation of the docstrings.

@jorisvandenbossche happy to merge this?

WillAyd · 2018-10-08T04:21:08Z

scripts/validate_docstrings.py

+            obj = getattr(obj, part)
+        return obj
+
+    @staticmethod


Same question for staticmethod

WillAyd · 2018-10-08T04:22:54Z

scripts/validate_docstrings.py

+                return None
+
+    @property
+    def type(self):


Does this need to be a property? Only see one usage in the code base which makes me thing the benefit of adding to the class is minimal

I think conceptually this is more an attribute that is being calculated, than a method that performs an action, and that's what I use properties for. I think it helps read the code being explicit on that.

WillAyd · 2018-10-08T04:24:47Z

scripts/validate_docstrings.py


-    if not errs:
-        sys.stderr.write('Docstring for "{}" correct. :)\n'.format(func_name))
+def validate_all():


Hmm if we mocked the return of get_api_items to just return say one docstring could we assert that this function behaves the same (or similar - minor differences perhaps expected) to validate_one?

WillAyd

One last comment from me then I'll leave to others for review :-)

WillAyd · 2018-10-09T16:58:33Z

scripts/validate_docstrings.py

        self.doc = NumpyDocString(self.clean_doc)

    def __len__(self):
        return len(self.raw_doc)

+    @staticmethod


All of the calls within the code base are using an instance of the class (save the example in the docstring). With that being the case would prefer to remove the decorator unless there's actually a use case for it; otherwise just adds fluff

datapythonista · 2018-10-13T18:20:36Z

Merging this, so we can finally unblock #20298. If there is any other comment I'll take care of it in a separate PR.

jorisvandenbossche · 2018-10-13T18:43:10Z

Sorry forgot about this one. Thanks for moving this forward!

…gs at once (pandas-dev#22408)

datapythonista added 3 commits August 18, 2018 12:16

Main refactoring of validate_docstrings scripts, so it can be execute…

b31fc10

…d for all docstrings

Fixing errors with introspection

5d2815f

Adapting tests to the new format

5bb7b9b

datapythonista added the Docs label Aug 18, 2018

datapythonista requested review from WillAyd and jorisvandenbossche August 18, 2018 13:28

datapythonista mentioned this pull request Aug 18, 2018

Add tests for docstring Validation Script + py27 compat #20061

Merged

4 tasks

WillAyd requested changes Aug 19, 2018

View reviewed changes

jorisvandenbossche reviewed Aug 20, 2018

View reviewed changes

datapythonista added 2 commits August 20, 2018 13:09

Merge remote-tracking branch 'upstream/master' into validate_all_docs…

86327d4

…trings

Addressing comments from the code review

188f4a3

WillAyd reviewed Aug 20, 2018

View reviewed changes

Merge remote-tracking branch 'upstream/master' into validate_all_docs…

0d25298

…trings

jreback added this to the 0.24.0 milestone Aug 22, 2018

datapythonista added 2 commits September 1, 2018 11:20

Merge remote-tracking branch 'upstream/master' into validate_all_docs…

577308e

…trings

Merging from master

156b997

datapythonista added 4 commits September 28, 2018 16:36

Fixing pep8

3a5e7fd

Merge branch 'validate_all_docstrings' of github.com:datapythonista/p…

1a8aee9

…andas into validate_all_docstrings

Merge remote-tracking branch 'upstream/master' into validate_all_docs…

aa4807a

…trings

Adding tests to the function that parses api.rst

1d2c5c0

WillAyd requested changes Oct 8, 2018

View reviewed changes

Addressing comments from review

db782fd

WillAyd requested changes Oct 8, 2018

View reviewed changes

WillAyd requested changes Oct 9, 2018

View reviewed changes

datapythonista merged commit 85dc171 into pandas-dev:master Oct 13, 2018

datapythonista mentioned this pull request Nov 4, 2018

Make validate_docstrings.py ready for the CI #23481

Closed

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

Changes to validate_docstring script to be able to check all docstrin…

51e8433

…gs at once (pandas-dev#22408)

xhochy mentioned this pull request Feb 1, 2019

ARROW-3903: [Python] Random array generator for Arrow conversion and Parquet testing apache/arrow#3301

Closed

		assert msg in ' '.join(result['errors'])


		class TestApiItems(object):

Changes to validate_docstring script to be able to check all docstrings at once #22408

Changes to validate_docstring script to be able to check all docstrings at once #22408

Conversation

datapythonista commented Aug 18, 2018 • edited Loading

codecov bot commented Aug 18, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

datapythonista commented Aug 20, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 20, 2018

datapythonista commented Aug 22, 2018

WillAyd commented Aug 22, 2018

datapythonista commented Aug 22, 2018

WillAyd commented Aug 22, 2018

pep8speaks commented Sep 28, 2018 • edited Loading

Comment last updated on September 28, 2018 at 15:36 Hours UTC

datapythonista commented Sep 29, 2018

datapythonista commented Oct 1, 2018

jreback commented Oct 7, 2018

WillAyd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

datapythonista commented Oct 8, 2018

WillAyd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WillAyd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

datapythonista commented Oct 13, 2018

jorisvandenbossche commented Oct 13, 2018

datapythonista commented Aug 18, 2018 •

edited

Loading

codecov bot commented Aug 18, 2018 •

edited

Loading

pep8speaks commented Sep 28, 2018 •

edited

Loading