-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Kerberos ticket refresh #16680
Fix Kerberos ticket refresh #16680
Conversation
Is it still draft? |
draft only because I want to remember to remove the stress test commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
...sts-launcher/src/main/resources/docker/presto-product-tests/common/hadoop-kerberos/krb5.conf
Outdated
Show resolved
Hide resolved
2c01e00
to
e708bab
Compare
Added a better test, it fails without the fix and passes only with the fix applied. |
...duct-tests/src/main/java/io/trino/tests/product/hive/TestHiveConnectorKerberosSmokeTest.java
Outdated
Show resolved
Hide resolved
1b37cdd
to
3c55768
Compare
forwardable = true | ||
allow_weak_crypto = true | ||
# low ticket_lifetime to make sure refresh code in Trino is tested | ||
ticket_lifetime = 30s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I had to remove renew_lifetime
from here to avoid hitting #11251 (comment).
We can add it back if we have trinodb/docker-images#162 but I don't see any reason to have it since our code doesn't care whether tickets are renewable or not since it always performs a "relogin" instead of renewal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to self, trinodb/docker-images#162 is moving forward so this comment can soon be addressed.
3c55768
to
50057b7
Compare
@electrum @Praveen2112 this is ready for another round of review, I've simplified the test and fixed the failures due to having |
...roduct-tests-launcher/src/main/java/io/trino/tests/product/launcher/suite/suites/Suite2.java
Outdated
Show resolved
Hide resolved
...duct-tests/src/main/java/io/trino/tests/product/hive/TestHiveConnectorKerberosSmokeTest.java
Show resolved
Hide resolved
onTrino().executeQuery("SET SESSION scale_writers = false"); | ||
onTrino().executeQuery("SET SESSION task_scale_writers_enabled = false"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this session be reflected across all the usage of onTrino
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When onTrino
is closed it closes the connection it was using thus resetting session properties.
I'll need to verify whether the session is shared even across test classes.
cc: @findepi if he knows the answer. I also see lots of other tests use SET SESSION directly without resetting or any form of isolation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do we close the connection once the a test is completed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually yes, anything we get from TestContext (e.g. testContext().getDependency(QueryExecutor.class, config)
) gets closed after each test. This is done by tempto, see https://github.com/trinodb/tempto/blob/a2f7ef3b914db2aeb3480cd845929b1ab26103ef/tempto-core/src/main/java/io/trino/tempto/internal/context/GuiceTestContext.java#L135
Thanks @kokosing for explaining how this magic works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When
onTrino
is closed it closes the connection it was using thus resetting session properties.
yes
I'll need to verify whether the session is shared even across test classes.
product tests are single-threaded
(i think the test class context holds the query executor, so they likely wouldn't be shared anyway)
The Hadoop UGI class handles ticket refresh only if the Subject is not provided externally. For external Subject UGI expects the refresh will be handled by the creator of the Subject which in our case we did not do. Because of this before this change any Trino query which ran longer than the ticket_lifetime failed with errors like GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]. In Hadoop code the UGI instance also gets re-used in some places (e.g. DFSClient) which means we cannot just create a new UGI with refreshed credentials and return that since other parts of code will keep using the old UGI with expired credentials. So the fix is to create a new UGI, extract the credentials from it and update the existing UGI's credentials with them so that all users of the existing UGI also observe the new valid credentials.
50057b7
to
214bca5
Compare
AC @Praveen2112. PTAL again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the commit message mention re-login
as we don't exactly do a refresh
but an explicit login
refresh != renewal. I kept refresh because existing code uses I do agree however that relogin makes more sense. In a follow-up I can rename the methods as well to use |
Yeah |
|
||
// 2x of ticket_lifetime as configured in hadoop-kerberos krb5.conf, sufficient to cause at-least 1 ticket expiry | ||
SECONDS.sleep(60L); | ||
cancelQueryIfRunning(sql); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to cancel the query? Why not simply wait?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It runs very very long. See some reasons to not wait in #16680 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In short because we can't very reliably control how long a query runs. If query finishes quickly then refresh won't get tested, if it runs too long it adds time on CI + possibility of timeout.
The Hadoop UGI class handles ticket refresh only if the Subject is not provided externally. For external Subject UGI expects the refresh will be handled by the creator of the Subject which in our case we did not do.
Because of this before this change any Trino query which ran longer than the ticket_lifetime failed with errors like
In Hadoop code the UGI instance also gets re-used in some places (e.g. DFSClient) which means we cannot just create a new UGI with refreshed credentials and return that since other parts of code will keep using the old UGI with expired credentials. So the fix is to create a new UGI, extract the credentials from it and update the existing UGI's credentials with them so that all users of the existing UGI also observe the new valid credentials.
Release notes
(x) Release notes are required, with the following suggested text: