Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-7746] More typing fixes #11038

Closed
wants to merge 7 commits into from
Closed

Conversation

chadrik
Copy link
Contributor

@chadrik chadrik commented Mar 4, 2020

Another round of fixes.

After this and the filesystem PR are merged, there will be only 32 errors remaining. I'd like to hand the remainder of the errors over to @robertwb and @udim to resolve, since many of them require some specific Beam knowledge to solve. I will of course be happy to help (we could connect on slack via the Beam channel?).

Then we can put the PythonLint job into effect!


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@chadrik chadrik force-pushed the python-typing-misc2 branch 6 times, most recently from 944c908 to dcdf5bc Compare March 5, 2020 06:51
@chadrik
Copy link
Contributor Author

chadrik commented Mar 5, 2020

Run Python PreCommit

@chadrik
Copy link
Contributor Author

chadrik commented Mar 6, 2020

Run Portable_Python PreCommit

@chadrik chadrik force-pushed the python-typing-misc2 branch 2 times, most recently from b709911 to 8a40a6e Compare March 6, 2020 23:47
@chadrik
Copy link
Contributor Author

chadrik commented Mar 7, 2020

Run Python PreCommit

@chadrik
Copy link
Contributor Author

chadrik commented Mar 7, 2020

R: @udim
R: @robertwb

I broke this down into commits to give some context to the more noteworthy changes.

@chadrik chadrik force-pushed the python-typing-misc2 branch 7 times, most recently from a7f02d3 to 950a508 Compare March 10, 2020 03:23
@chadrik
Copy link
Contributor Author

chadrik commented Mar 10, 2020

Run Python PreCommit

3 similar comments
@chadrik
Copy link
Contributor Author

chadrik commented Mar 10, 2020

Run Python PreCommit

@chadrik
Copy link
Contributor Author

chadrik commented Mar 10, 2020

Run Python PreCommit

@chadrik
Copy link
Contributor Author

chadrik commented Mar 10, 2020

Run Python PreCommit

@chadrik
Copy link
Contributor Author

chadrik commented Mar 11, 2020

Tests passing!

Have a look when you get a chance. I'm hoping this one is pretty straight-forward.

@chadrik
Copy link
Contributor Author

chadrik commented Mar 11, 2020

Run Python PreCommit

@chadrik
Copy link
Contributor Author

chadrik commented Mar 12, 2020

Arg. Github showed this as failed even after I refreshed, and then as soon as I did Run Python PreCommit, it updated to show that it had succeeded briefly and then started the job again, which now failed. Every time I run it, it seems to fail with a different error. I don't think they are related to the changes here.

@chadrik
Copy link
Contributor Author

chadrik commented Mar 12, 2020

Run Python PreCommit

1 similar comment
@chadrik
Copy link
Contributor Author

chadrik commented Mar 12, 2020

Run Python PreCommit

@chadrik
Copy link
Contributor Author

chadrik commented Mar 12, 2020

Ok, this is stabilized again. Would love a review.

@udim
Copy link
Member

udim commented Mar 16, 2020

Looking at this today

@@ -244,7 +272,7 @@ def is_bounded(self):
return True


class RangeTracker(object):
class RangeTracker(Generic[PositionT]):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@udim For the base RangeTracker if we only consider the methods implemented by the class the "position" type could be Any, but based on the docs it seemed that any meaningful position would be comparable/sortable. Let me know if you think we should move the Position restriction down to a subclass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. Positions might be opaque byte strings, where splitting happens externally. This might be important for cross-language transforms.

@lukecwik, @robertwb, @chamikaramj , do you have opinion on RangeTracker position types? Should they all support the Position protocol (defined above)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thin Any is right here. Usually they'll be primitive, sortable types, but that's not a requirement.

Comment on lines +61 to +69
class PortableObject(Protocol):
def to_runner_api(self, __context):
# type: (PipelineContext) -> Any
pass

class _PipelineContextMap(object):
@classmethod
def from_runner_api(cls, __proto, __context):
# type: (Any, PipelineContext) -> Any
pass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we go with an adjective here instead of a noun: Portable?

@@ -171,7 +172,7 @@ def get_responses():
self._worker_thread_pool.shutdown()
# get_responses may be blocked on responses.get(), but we need to return
# control to its caller.
self._responses.put(no_more_work)
self._responses.put(no_more_work) # type: ignore[arg-type]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably replace this with the new Sentinel pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea. There are ~20 places in the code that assign object() as value.
One of them is even called READER_THREAD_IS_DONE_SENTINEL. :)

self._done = True
self._requests.put(self._DONE)
self._requests.put(self._DONE) # type: ignore[arg-type]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another opportunity for the Sentinel pattern

Comment on lines +25 to +30
class Sentinel(enum.Enum):
"""
A type-safe sentinel class
"""

sentinel = object()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're interested in how I settled on this design, I documented my experience creating a type-safe sentinel pattern on stackoverflow, here: https://stackoverflow.com/questions/57959664/handling-conditional-logic-sentinel-value-with-mypy/60605919#60605919

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG.
Sentinel is the type.
SPLIT_POINTS_UNKNOWN is the unique value.
Inheriting from Enum (vs calling Enum()) simplifies pickling (not sure necessary, but doesn't hurt).

Copy link
Member

@udim udim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review is taking me forever. I'm at 18 out of 52 files reviewed, but releasing 16 comments I've written so far.

sdks/python/apache_beam/metrics/metric.py Show resolved Hide resolved
sdks/python/mypy.ini Show resolved Hide resolved
sdks/python/apache_beam/io/iobase.py Show resolved Hide resolved
sdks/python/apache_beam/io/iobase.py Show resolved Hide resolved
sdks/python/apache_beam/coders/coders.py Show resolved Hide resolved
Comment on lines +25 to +30
class Sentinel(enum.Enum):
"""
A type-safe sentinel class
"""

sentinel = object()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG.
Sentinel is the type.
SPLIT_POINTS_UNKNOWN is the unique value.
Inheriting from Enum (vs calling Enum()) simplifies pickling (not sure necessary, but doesn't hurt).

@@ -171,7 +172,7 @@ def get_responses():
self._worker_thread_pool.shutdown()
# get_responses may be blocked on responses.get(), but we need to return
# control to its caller.
self._responses.put(no_more_work)
self._responses.put(no_more_work) # type: ignore[arg-type]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea. There are ~20 places in the code that assign object() as value.
One of them is even called READER_THREAD_IS_DONE_SENTINEL. :)

@@ -353,6 +359,8 @@ def release(self, instruction_id):
self.cached_bundle_processors[descriptor_id].append(processor)

def shutdown(self):
# type: (...) -> None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be () -> None right?
Same for the next 2 hints below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct. looks like I got a little sloppy. will fix in my next push.

@@ -1300,12 +1300,13 @@ def to_runner_api_parameter(self, context):
common_urns.requirements.REQUIRES_STATEFUL_PROCESSING.urn)
from apache_beam.runners.common import DoFnSignature
sig = DoFnSignature(self.fn)
is_splittable = sig.is_splittable_dofn()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if checking get_restriction_coder() return type instead of is_splittable_dofn() is future proof.

Also, I don't understand the change, from a mypy correctness perspective.

Copy link
Contributor Author

@chadrik chadrik Mar 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if checking get_restriction_coder() return type instead of is_splittable_dofn() is future proof.

get_restriction_coder() calls is_splittable_dofn() and returns None if it's not splittable. So I interpreted a None result from this method to mean "is not splittable".

  def get_restriction_coder(self):
    # type: () -> Optional[TupleCoder]

    """Get coder for a restriction when processing an SDF. """
    if self.is_splittable_dofn():
      return TupleCoder([
          (self.get_restriction_provider().restriction_coder()),
          (self.get_watermark_estimator_provider().estimator_state_coder())
      ])
    else:
      return None

I don't understand the change, from a mypy correctness perspective.

Here's the problem:

    if is_splittable:
      restriction_coder = sig.get_restriction_coder()  #  returns Optional[TupleCoder]
      restriction_coder_id = context.coders.get_id(restriction_coder)  # does not accept Optional!
    else:
      restriction_coder_id = None

With my changes, we naturally drop the optionality before passing the value to context.coders.get_id(). We also avoid a redundant call to is_splittable_dofn(), FWIW.

I see two options:

  1. keep my changes and update the documentation of get_restriction_coder() to clarify that None result indicates "is not splittable"
  2. revert my changes and add assert restriction_coder is None before the call to context.coders.get_id()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why, but ParDoPayload below accepts both splittable and restriction_coder_id keywords (theoretically splittable might be True while restriction_coder_id is None), so I think it's safer to do this:

is_splittable = sig.is_splittable_dofn()
restriction_coder = sig.get_restriction_coder()
if restriction_coder:
  restriction_coder_id = context.coders.get_id(
      restriction_coder)  # type: typing.Optional[str]
else:
  restriction_coder_id = None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an error to say is_splittable_dofn is True without returning a restriction coder as well and vice versa.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an error to say is_splittable_dofn is True without returning a restriction coder as well and vice versa.

This seems to validate my earlier assessment that a None result from this get_restriction_coder means "is not splittable", and therefore that my proposed change is valid.

@chadrik
Copy link
Contributor Author

chadrik commented Mar 18, 2020 via email

@chadrik
Copy link
Contributor Author

chadrik commented May 4, 2020

Hi everyone, I have some availability to finish this PR off now. I'm going to rebase it soon. @udim do you have the time to help me get this through review?

@chadrik
Copy link
Contributor Author

chadrik commented May 4, 2020

I've rebased onto master. We jumped from 32 errors to 260+. We're going to need to make a concerted effort to get these typing MRs through, and beat the merge conflict fatigue. Can we make it happen?

@robertwb
Copy link
Contributor

robertwb commented May 5, 2020 via email

@udim
Copy link
Member

udim commented May 5, 2020

Hi everyone, I have some availability to finish this PR off now. I'm going to rebase it soon. @udim do you have the time to help me get this through review?

Yeah, I'll make another pass later today

@chadrik chadrik force-pushed the python-typing-misc2 branch from fc9883b to f528b58 Compare May 5, 2020 22:07
@robertwb
Copy link
Contributor

robertwb commented May 6, 2020

I created #11620 to prevent regression. This way we can tackle things module by module until we have good coverage, without having to worry (as much) about huge PRs and regressions. (We should figure out how to coordinate the work of fixing up the modules to avoid overlap.)

@chadrik chadrik force-pushed the python-typing-misc2 branch from be54ef2 to d4ecb74 Compare May 7, 2020 01:19
@chadrik
Copy link
Contributor Author

chadrik commented May 7, 2020

Rebased on top of #11620, fixed some more issues, and removed some gating from mypy.ini, so mypy checks are all green.

@chadrik
Copy link
Contributor Author

chadrik commented May 7, 2020

Run Python PreCommit

@robertwb
Copy link
Contributor

OK, this seems to have stalled again. When I go to look at this, I have trouble keeping track of what's reviewed and what's not.

Chad, could you split this up into several PRs, one per subdirectory (plus maybe one for all the the packages with only a small number of affected files)? I think that'll be faster for reviewing pieces and getting them in as we can. (It seems the cross-package interaction will be relatively minimal.)

@stale
Copy link

stale bot commented Jul 12, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@stale stale bot added the stale label Jul 12, 2020
@stale
Copy link

stale bot commented Jul 20, 2020

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Jul 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants