Move away from ClientMiddleware and ClientAuthHandler in pydeephaven #5489
Conversation
I believe this explains the SEGFAULT we see. This is the Cython (.pyx) implementation of flight_client.authenticate: https://github.com/apache/arrow/blob/6a28035c2b49b432dc63f5ee7524d76b4ed2d762/python/pyarrow/_flight.pyx#L1440
The call to Authenticate on line 1461 lands in grpc_client.cc line 860.
Many calls in the same file use the auth_handler_ member that is being assigned on line 862 above, e.g., DoPut.
SetToken is in the same file, line 86.
None of this has any locking or protection against write/read races. In particular, a second call to Python's flight_client.authenticate will invoke operator= on auth_handler_ while other calls may still be reading it.
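For illustration, a minimal, schematic Python sketch of the race pattern described above. This mirrors the shape of the C++ code only; RacyClient, Handler, and their methods are hypothetical stand-ins, not the actual pyarrow/gRPC implementation:

import threading

# Hypothetical stand-in for the C++ FlightClient internals: authenticate()
# reassigns the handler member (the operator= in grpc_client.cc) while
# do_put() reads it, with no lock guarding either side.
class RacyClient:
    def __init__(self):
        self.auth_handler = None  # written by authenticate(), read by do_put()

    def authenticate(self, handler):
        # A second or later call overwrites the member while readers may be active.
        self.auth_handler = handler

    def do_put(self):
        # Unsynchronized read of the same member.
        handler = self.auth_handler
        return handler.get_token() if handler is not None else None

class Handler:
    def __init__(self, token: str):
        self.token = token

    def get_token(self) -> str:
        return self.token

client = RacyClient()
writer = threading.Thread(
    target=lambda: [client.authenticate(Handler(f"token-{i}")) for i in range(100_000)])
reader = threading.Thread(
    target=lambda: [client.do_put() for _ in range(100_000)])
writer.start(); reader.start()
writer.join(); reader.join()
# Python's GIL makes this benign here; the equivalent unguarded write/read
# in the C++ layer is a data race and can SEGFAULT.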
@@ -26,6 +26,8 @@ def tearDownClass(cls) -> None:
        os.remove(BaseTestCase.csv_file)

    def setUp(self) -> None:
        # For netty server and psk, change auth_token to what the server printed.
        # self.session = Session(port = 8080, auth_type = 'io.deephaven.authentication.psk.PskAuthenticationHandler', auth_token = 'safw7c4nzegp')
I don't know what this is, but is this a secret and/or stale token that is being checked into the codebase? Just double-checking
psk is a different authentication method, and netty is a different gRPC backend engine the server can run with. I needed to confirm that the bug would also manifest on a netty server with psk authentication, so I had to figure out what the F*** to change to get that working. That comment is me making sure that if I ever need to do it again, I don't have to research it again.
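For reference, the commented-out line above expanded into a standalone sketch. The token value is whatever the server printed at startup; the one below is just the stale example token from the comment:

from pydeephaven import Session

# Connect to a Deephaven server running the netty backend with PSK auth.
# Replace auth_token with the value the server printed at startup.
session = Session(
    port=8080,
    auth_type='io.deephaven.authentication.psk.PskAuthenticationHandler',
    auth_token='safw7c4nzegp')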
console_pb2.BindTableToVariableRequest(console_id=self.console_id,
                                       table_id=table.ticket,
                                       variable_name=variable_name),
metadata=self.session.grpc_metadata)
self.session.update_metadata(call.initial_metadata())
Style comment: maybe no action for now, but...
It's a little unfortunate that we have to remember to call self.session.update_metadata(call.initial_metadata()) after every one of our gRPC calls. It might be nice to wrap that in a helper method, or a method that forwards arguments to gRPC and then does the metadata update for us. I think that's what I did in C++.
Anyway, the suggestion for now is to at least think about how we could make this nicer and a little more foolproof the next time we refactor this code.
I did it; added wrappers.
py/client/pydeephaven/session.py (Outdated)
skew = random()
# Backoff schedule for retries after consecutive failures to refresh auth token
self._refresh_backoff = [skew + 0.1, skew + 1, skew + 10]
removing trailing spaces on a newline
py/client/pydeephaven/session.py (Outdated)
return self._input_table_service

@property
def plugin_object_service(self) -> PluginObjService:
Done.
py/client/pydeephaven/session.py (Outdated)
def update_metadata(self, metadata: Iterable[Tuple[str, Union[str, bytes]]]):
    for header_tuple in metadata:
        if header_tuple[0] == "authorization":
            v = header_tuple[1]
            self._auth_header_value = v if isinstance(v, bytes) else v.encode('ascii')
            break

def wrap_rpc(self, stub_call, *args, **kwargs):
    if 'metadata' in kwargs:
        raise DHError('Internal error: "metadata" in kwargs not supported in wrap_rpc.')
    kwargs["metadata"] = self.grpc_metadata
    # We use a future to get a chance to process initial metadata before the call
    # is completed
    future = stub_call.future(*args, **kwargs)
    self.update_metadata(future.initial_metadata())
    # Now block until we get the result (or an exception)
    return future.result()

def wrap_bidi_rpc(self, stub_call, *args, **kwargs):
    if 'metadata' in kwargs:
        raise DHError('Internal error: "metadata" in kwargs not supported in wrap_bidi_rpc.')
    kwargs["metadata"] = self.grpc_metadata
    response = stub_call(*args, **kwargs)
    self.update_metadata(response.initial_metadata())
    return response
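For context, a sketch of how a call site changes with these wrappers, based on the bind-table snippet earlier in this thread. The some_stub name is illustrative, not necessarily the real stub attribute:

# Before (from the snippet above): manual metadata plumbing at every call site.
call = some_stub.BindTableToVariable.future(
    console_pb2.BindTableToVariableRequest(console_id=self.console_id,
                                           table_id=table.ticket,
                                           variable_name=variable_name),
    metadata=self.session.grpc_metadata)
self.session.update_metadata(call.initial_metadata())
response = call.result()

# After: wrap_rpc injects the metadata and harvests initial_metadata for us.
response = self.session.wrap_rpc(
    some_stub.BindTableToVariable,
    console_pb2.BindTableToVariableRequest(console_id=self.console_id,
                                           table_id=table.ticket,
                                           variable_name=variable_name))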
Missing type hints and pydocs
Done
py/client/pydeephaven/session.py (Outdated)
def get_token(self):
    return self._token

def trace(who):
- If this is not intended to be public, it should start with _.
- If this is intended to be public, it needs pydocs.
- Missing type hints.
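A minimal sketch of that convention applied to the snippet above; the docstring wording is illustrative only:

import logging

def _trace(who: str) -> None:
    """Private debugging helper: log which caller reached this point."""
    logging.debug("trace: %s", who)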
Done
82f1049
Using a pydeephaven.Session from multiple threads and doing concurrent server operations generates an abort inside gRPC with a message similar to GRPC_CALL_ERROR_TOO_MANY_OPERATIONS.
The program py/client/examples/mt_session.py added in this PR reproduces the issue very quickly if py/client/session.py is modified to refresh the authentication token every 3 seconds; a crash is typically observed within 3 minutes.
A second failure mode is a segmentation fault inside a Cython (.pyx) call to get_token (gdb stack traces attached).
A third failure mode looks like a server disconnection that makes all the threads raise exceptions.
While investigating the issue, we found that the crashes always happen around upcalls from C++ gRPC or C++ pyarrow code to Python callbacks, in the form of ClientMiddleware (interceptors we use to read and write headers for auth tokens) and ClientAuthHandler.
This PR moves the Python client away from ClientMiddleware and ClientAuthHandler and instead manages authentication the way the current C++ client does: by explicitly setting headers in RPC stub calls and explicitly reading headers from stub call returns. This eliminates all upcalls from C++ to Python, which prevents the issues and should also be more efficient.
This PR also adds more defensive coding around concurrent use (read/modify/write) of session state, in particular the auth token, and adds retries and warning logs around auth refresh failures instead of raising exceptions right away. Of note: during the investigation it was established that raising an exception from ClientMiddleware or ClientAuthHandler code (which is called from C++) aborts the program right away (as in failure mode 1, GRPC_CALL_ERROR_TOO_MANY_OPERATIONS).
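A sketch of the retry/backoff behavior described above, built around the _refresh_backoff schedule shown earlier in this review. The _do_refresh_token name is a placeholder for whatever RPC actually refreshes the token, not the real pydeephaven method:

import logging
import time
from random import random

logger = logging.getLogger(__name__)

class SessionSketch:
    def __init__(self):
        skew = random()
        # Backoff schedule for retries after consecutive failures to refresh auth token
        self._refresh_backoff = [skew + 0.1, skew + 1, skew + 10]

    def _do_refresh_token(self):
        # Placeholder for the real token-refresh RPC.
        raise NotImplementedError

    def _refresh_token_with_retries(self):
        for attempt, delay in enumerate(self._refresh_backoff, start=1):
            try:
                self._do_refresh_token()
                return
            except Exception as e:
                # Warn and retry instead of raising immediately; an exception
                # propagating out of a C++ upcall aborts the whole process.
                logger.warning("auth refresh attempt %d failed (%s); retrying in %.1fs",
                               attempt, e, delay)
                time.sleep(delay)
        # Out of retries: let the caller see the failure.
        self._do_refresh_token()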
With the changes in this PR, the mt_session.py reproducer was run for 3 hours without failures.
pydh_test_mt_failure_1.gdb.txt
pydh_test_mt_failure_2.gdb.txt
pydh_test_mt_failure_3.output.txt