-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7746] More typing fixes #11038
Conversation
944c908
to
dcdf5bc
Compare
Run Python PreCommit |
Run Portable_Python PreCommit |
b709911
to
8a40a6e
Compare
Run Python PreCommit |
a7f02d3
to
950a508
Compare
Run Python PreCommit |
3 similar comments
Run Python PreCommit |
Run Python PreCommit |
Run Python PreCommit |
Tests passing! Have a look when you get a chance. I'm hoping this one is pretty straight-forward. |
Run Python PreCommit |
Arg. Github showed this as failed even after I refreshed, and then as soon as I did |
Run Python PreCommit |
1 similar comment
Run Python PreCommit |
Ok, this is stabilized again. Would love a review. |
Looking at this today |
@@ -244,7 +272,7 @@ def is_bounded(self): | |||
return True | |||
|
|||
|
|||
class RangeTracker(object): | |||
class RangeTracker(Generic[PositionT]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@udim For the base RangeTracker
if we only consider the methods implemented by the class the "position" type could be Any
, but based on the docs it seemed that any meaningful position would be comparable/sortable. Let me know if you think we should move the Position
restriction down to a subclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. Positions might be opaque byte strings, where splitting happens externally. This might be important for cross-language transforms.
@lukecwik, @robertwb, @chamikaramj , do you have opinion on RangeTracker position types? Should they all support the Position protocol (defined above)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thin Any is right here. Usually they'll be primitive, sortable types, but that's not a requirement.
class PortableObject(Protocol): | ||
def to_runner_api(self, __context): | ||
# type: (PipelineContext) -> Any | ||
pass | ||
|
||
class _PipelineContextMap(object): | ||
@classmethod | ||
def from_runner_api(cls, __proto, __context): | ||
# type: (Any, PipelineContext) -> Any | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we go with an adjective here instead of a noun: Portable
?
@@ -171,7 +172,7 @@ def get_responses(): | |||
self._worker_thread_pool.shutdown() | |||
# get_responses may be blocked on responses.get(), but we need to return | |||
# control to its caller. | |||
self._responses.put(no_more_work) | |||
self._responses.put(no_more_work) # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should probably replace this with the new Sentinel
pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea. There are ~20 places in the code that assign object()
as value.
One of them is even called READER_THREAD_IS_DONE_SENTINEL
. :)
self._done = True | ||
self._requests.put(self._DONE) | ||
self._requests.put(self._DONE) # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another opportunity for the Sentinel
pattern
class Sentinel(enum.Enum): | ||
""" | ||
A type-safe sentinel class | ||
""" | ||
|
||
sentinel = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're interested in how I settled on this design, I documented my experience creating a type-safe sentinel pattern on stackoverflow, here: https://stackoverflow.com/questions/57959664/handling-conditional-logic-sentinel-value-with-mypy/60605919#60605919
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG.
Sentinel is the type.
SPLIT_POINTS_UNKNOWN is the unique value.
Inheriting from Enum (vs calling Enum()) simplifies pickling (not sure necessary, but doesn't hurt).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This review is taking me forever. I'm at 18 out of 52 files reviewed, but releasing 16 comments I've written so far.
class Sentinel(enum.Enum): | ||
""" | ||
A type-safe sentinel class | ||
""" | ||
|
||
sentinel = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG.
Sentinel is the type.
SPLIT_POINTS_UNKNOWN is the unique value.
Inheriting from Enum (vs calling Enum()) simplifies pickling (not sure necessary, but doesn't hurt).
@@ -171,7 +172,7 @@ def get_responses(): | |||
self._worker_thread_pool.shutdown() | |||
# get_responses may be blocked on responses.get(), but we need to return | |||
# control to its caller. | |||
self._responses.put(no_more_work) | |||
self._responses.put(no_more_work) # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea. There are ~20 places in the code that assign object()
as value.
One of them is even called READER_THREAD_IS_DONE_SENTINEL
. :)
@@ -353,6 +359,8 @@ def release(self, instruction_id): | |||
self.cached_bundle_processors[descriptor_id].append(processor) | |||
|
|||
def shutdown(self): | |||
# type: (...) -> None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be () -> None
right?
Same for the next 2 hints below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct. looks like I got a little sloppy. will fix in my next push.
@@ -1300,12 +1300,13 @@ def to_runner_api_parameter(self, context): | |||
common_urns.requirements.REQUIRES_STATEFUL_PROCESSING.urn) | |||
from apache_beam.runners.common import DoFnSignature | |||
sig = DoFnSignature(self.fn) | |||
is_splittable = sig.is_splittable_dofn() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if checking get_restriction_coder() return type instead of is_splittable_dofn() is future proof.
Also, I don't understand the change, from a mypy correctness perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if checking get_restriction_coder() return type instead of is_splittable_dofn() is future proof.
get_restriction_coder()
calls is_splittable_dofn()
and returns None
if it's not splittable. So I interpreted a None
result from this method to mean "is not splittable".
def get_restriction_coder(self):
# type: () -> Optional[TupleCoder]
"""Get coder for a restriction when processing an SDF. """
if self.is_splittable_dofn():
return TupleCoder([
(self.get_restriction_provider().restriction_coder()),
(self.get_watermark_estimator_provider().estimator_state_coder())
])
else:
return None
I don't understand the change, from a mypy correctness perspective.
Here's the problem:
if is_splittable:
restriction_coder = sig.get_restriction_coder() # returns Optional[TupleCoder]
restriction_coder_id = context.coders.get_id(restriction_coder) # does not accept Optional!
else:
restriction_coder_id = None
With my changes, we naturally drop the optionality before passing the value to context.coders.get_id()
. We also avoid a redundant call to is_splittable_dofn()
, FWIW.
I see two options:
- keep my changes and update the documentation of
get_restriction_coder()
to clarify thatNone
result indicates "is not splittable" - revert my changes and add
assert restriction_coder is None
before the call tocontext.coders.get_id()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why, but ParDoPayload
below accepts both splittable and restriction_coder_id keywords (theoretically splittable might be True while restriction_coder_id is None), so I think it's safer to do this:
is_splittable = sig.is_splittable_dofn()
restriction_coder = sig.get_restriction_coder()
if restriction_coder:
restriction_coder_id = context.coders.get_id(
restriction_coder) # type: typing.Optional[str]
else:
restriction_coder_id = None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is an error to say is_splittable_dofn is True without returning a restriction coder as well and vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is an error to say is_splittable_dofn is True without returning a restriction coder as well and vice versa.
This seems to validate my earlier assessment that a None result from this get_restriction_coder
means "is not splittable", and therefore that my proposed change is valid.
Great comments and questions. I’m in the middle of rolling out our COVID
plan at work so it may take me a bit to get you proper answers but I’ll
start chipping away at it as soon as I can.
…On Tue, Mar 17, 2020 at 7:30 PM Udi Meiri ***@***.***> wrote:
***@***.**** commented on this pull request.
This review is taking me forever. I'm at 18 out of 52 files reviewed, but
releasing 16 comments I've written so far.
------------------------------
In sdks/python/apache_beam/runners/worker/operations.py
<#11038 (comment)>:
> @@ -569,7 +580,7 @@ def __init__(self,
self.tagged_receivers = None # type: Optional[_TaggedReceivers]
# A mapping of timer tags to the input "PCollections" they come in on.
self.timer_inputs = timer_inputs or {}
- self.input_info = None # type: Optional[Tuple[str, str, coders.WindowedValueCoder, MutableMapping[str, str]]]
So the MutableMapping hint here was a mistake?
------------------------------
In sdks/python/apache_beam/metrics/metric.py
<#11038 (comment)>:
> @@ -56,7 +55,7 @@ class Metrics(object):
@staticmethod
def get_namespace(namespace):
# type: (Union[Type, str]) -> str
- if inspect.isclass(namespace):
+ if isinstance(namespace, type):
Hopefully no one uses old-style classes any more (types.ClassType).
------------------------------
In sdks/python/mypy.ini
<#11038 (comment)>:
> color_output = true
-# uncomment this to see how close we are to being complete
+# required setting for dmypy:
+follow_imports = error
Does this mean having to supply all imported modules on the mypy command
line?
------------------------------
In sdks/python/apache_beam/io/iobase.py
<#11038 (comment)>:
> +
+class Position(Protocol):
+ def __lt__(self, other):
+ pass
+
+ def __le__(self, other):
+ pass
+
+ def __gt__(self, other):
+ pass
+
+ def __ge__(self, other):
+ pass
+
+
+PositionT = TypeVar('PositionT', bound='Position')
I don't understand the usage of PositionT. Isn't Position already a type?
It seems that you could replace all uses of PositionT with Position and
it'd work the same.
------------------------------
In sdks/python/apache_beam/io/iobase.py
<#11038 (comment)>:
> @@ -95,8 +115,11 @@
#
# Type for start and stop positions are specific to the bounded source and must
# be consistent throughout.
-SourceBundle = namedtuple(
- 'SourceBundle', 'weight source start_position stop_position')
+SourceBundle = NamedTuple(
+ 'SourceBundle',
+ [('weight', Optional[float]), ('source', 'BoundedSource'),
It seems that weight is non-Optional.
------------------------------
In sdks/python/apache_beam/io/range_trackers.py
<#11038 (comment)>:
> @@ -42,7 +48,7 @@
_LOGGER = logging.getLogger(__name__)
-class OffsetRangeTracker(iobase.RangeTracker):
+class OffsetRangeTracker(iobase.RangeTracker[int]):
"""A 'RangeTracker' for non-negative positions of type 'long'."""
s/long/int/
------------------------------
In sdks/python/apache_beam/io/range_trackers.py
<#11038 (comment)>:
> self._start_position = start_position
self._stop_position = stop_position
self._lock = threading.Lock()
- self._last_claim = self.UNSTARTED
+ # the return on investment for properly typing this is low. cast it.
Did you mean that the ROI is high, not low?
------------------------------
In sdks/python/apache_beam/io/range_trackers.py
<#11038 (comment)>:
> def fraction_to_position(self, fraction, start, end):
+ # type: (float, Optional[iobase.PositionT], Optional[iobase.PositionT]) -> Optional[iobase.PositionT]
The return value seems to be non-Optional.
------------------------------
In sdks/python/apache_beam/io/iobase.py
<#11038 (comment)>:
> @@ -244,7 +272,7 @@ def is_bounded(self):
return True
-class RangeTracker(object):
+class RangeTracker(Generic[PositionT]):
I'm not sure. Positions might be opaque byte strings, where splitting
happens externally. This might be important for cross-language transforms.
@lukecwik <https://github.com/lukecwik>, @robertwb
<https://github.com/robertwb>, @chamikaramj
<https://github.com/chamikaramj> , do you have opinion on RangeTracker
position types? Should they all support the Position protocol (defined
above)?
------------------------------
In sdks/python/apache_beam/io/restriction_trackers.py
<#11038 (comment)>:
> @@ -52,6 +55,7 @@ def __hash__(self):
return hash((type(self), self.start, self.stop))
def split(self, desired_num_offsets_per_split, min_num_offsets_per_split=1):
+ # type: (...) -> Iterator[OffsetRange]
Input looks like (int, int). Any reason to leave it empty?
------------------------------
In sdks/python/apache_beam/ml/gcp/naturallanguageml.py
<#11038 (comment)>:
> @@ -74,7 +75,9 @@ def __init__(
def to_dict(document):
# type: (Document) -> Mapping[str, Optional[str]]
if document.from_gcs:
- dict_repr = {'gcs_content_uri': document.content}
+ dict_repr = {
+ 'gcs_content_uri': document.content
+ } # type: Dict[str, Optional[str]]
document.content is not Optional
------------------------------
In sdks/python/apache_beam/coders/coders.py
<#11038 (comment)>:
> @@ -387,8 +387,11 @@ def register_structured_urn(urn, cls):
"""Register a coder that's completely defined by its urn and its
component(s), if any, which are passed to construct the instance.
"""
- cls.to_runner_api_parameter = (
- lambda self, unused_context: (urn, None, self._get_component_coders()))
+ setattr(
Could you explain (in a comment perhaps) why using setattr here is
necessary?
------------------------------
In sdks/python/apache_beam/utils/sentinel.py
<#11038 (comment)>:
> +class Sentinel(enum.Enum):
+ """
+ A type-safe sentinel class
+ """
+
+ sentinel = object()
SG.
Sentinel is the type.
SPLIT_POINTS_UNKNOWN is the unique value.
Inheriting from Enum (vs calling Enum()) simplifies pickling (not sure
necessary, but doesn't hurt).
------------------------------
In sdks/python/apache_beam/runners/worker/sdk_worker.py
<#11038 (comment)>:
> @@ -171,7 +172,7 @@ def get_responses():
self._worker_thread_pool.shutdown()
# get_responses may be blocked on responses.get(), but we need to return
# control to its caller.
- self._responses.put(no_more_work)
+ self._responses.put(no_more_work) # type: ignore[arg-type]
Sounds like a good idea. There are ~20 places in the code that assign
object() as value.
One of them is even called READER_THREAD_IS_DONE_SENTINEL. :)
------------------------------
In sdks/python/apache_beam/runners/worker/sdk_worker.py
<#11038 (comment)>:
> @@ -353,6 +359,8 @@ def release(self, instruction_id):
self.cached_bundle_processors[descriptor_id].append(processor)
def shutdown(self):
+ # type: (...) -> None
This can be () -> None right?
Same for the next 2 hints below.
------------------------------
In sdks/python/apache_beam/transforms/core.py
<#11038 (comment)>:
> @@ -1300,12 +1300,13 @@ def to_runner_api_parameter(self, context):
common_urns.requirements.REQUIRES_STATEFUL_PROCESSING.urn)
from apache_beam.runners.common import DoFnSignature
sig = DoFnSignature(self.fn)
- is_splittable = sig.is_splittable_dofn()
Not sure if checking get_restriction_coder() return type instead of
is_splittable_dofn() is future proof.
Also, I don't understand the change, from a mypy correctness perspective.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#11038 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAPOEY43E2LYCWV3TS7KD3RIAW23ANCNFSM4LAZDSSQ>
.
|
Hi everyone, I have some availability to finish this PR off now. I'm going to rebase it soon. @udim do you have the time to help me get this through review? |
I've rebased onto master. We jumped from 32 errors to 260+. We're going to need to make a concerted effort to get these typing MRs through, and beat the merge conflict fatigue. Can we make it happen? |
Yes, let's make it happen! I'll help out as well.
…On Mon, May 4, 2020 at 11:47 AM Chad Dombrova ***@***.***> wrote:
I've rebased onto master. We jumped from 32 errors to 260+. We're going to need to make a concerted effort to get these typing MRs through, and beat the merge conflict fatigue. Can we make it happen?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
Yeah, I'll make another pass later today |
fc9883b
to
f528b58
Compare
I created #11620 to prevent regression. This way we can tackle things module by module until we have good coverage, without having to worry (as much) about huge PRs and regressions. (We should figure out how to coordinate the work of fixing up the modules to avoid overlap.) |
Should be safe, but isolating for more scrutiny
be54ef2
to
d4ecb74
Compare
Rebased on top of #11620, fixed some more issues, and removed some gating from mypy.ini, so mypy checks are all green. |
Run Python PreCommit |
OK, this seems to have stalled again. When I go to look at this, I have trouble keeping track of what's reviewed and what's not. Chad, could you split this up into several PRs, one per subdirectory (plus maybe one for all the the packages with only a small number of affected files)? I think that'll be faster for reviewing pieces and getting them in as we can. (It seems the cross-package interaction will be relatively minimal.) |
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Another round of fixes.
After this and the filesystem PR are merged, there will be only 32 errors remaining. I'd like to hand the remainder of the errors over to @robertwb and @udim to resolve, since many of them require some specific Beam knowledge to solve. I will of course be happy to help (we could connect on slack via the Beam channel?).
Then we can put the PythonLint job into effect!
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.