Fix Kerberos ticket refresh #16680

Merged (2 commits) on Mar 28, 2023

Member

@hashhar hashhar commented Mar 23, 2023

The Hadoop UGI (UserGroupInformation) class handles ticket refresh only if the Subject is not provided externally. For an externally provided Subject, UGI expects the creator of the Subject to handle the refresh, which we did not do.

Because of this, before this change any Trino query that ran longer than the ticket_lifetime failed with errors like:

GSS initiate failed [Caused by GSSException: No valid credentials
provided (Mechanism level: Failed to find any Kerberos tgt)].

Hadoop code also re-uses the UGI instance in some places (e.g. DFSClient), so we cannot simply create a new UGI with refreshed credentials and return it: other parts of the code would keep using the old UGI with its expired credentials. The fix is to create a new UGI, extract its credentials, and update the existing UGI's credentials with them, so that all users of the existing UGI also observe the new, valid credentials.
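
As a rough illustration of the in-place update described above (not the actual change in this PR), the sketch below uses only `javax.security.auth.Subject` from the JDK. `FakeTicket`, `login`, `refreshInPlace`, and `latestExpiry` are hypothetical names introduced here; real code would deal with Kerberos tickets and a Hadoop UGI rather than these stand-ins.

```java
import java.util.HashSet;
import java.util.Set;
import javax.security.auth.Subject;

final class SubjectRefreshSketch {
    /** Hypothetical stand-in for a Kerberos ticket (assumption, not a Hadoop or Trino type). */
    static final class FakeTicket {
        final String principal;
        final long endTimeMillis;

        FakeTicket(String principal, long endTimeMillis) {
            this.principal = principal;
            this.endTimeMillis = endTimeMillis;
        }
    }

    /** Simulates a fresh login that yields a brand-new Subject holding a new ticket. */
    static Subject login(String principal, long endTimeMillis) {
        Subject subject = new Subject();
        subject.getPrivateCredentials().add(new FakeTicket(principal, endTimeMillis));
        return subject;
    }

    /**
     * Replaces the credentials of the existing Subject with those of a freshly
     * logged-in one. The existing Subject object is kept, so every component
     * that cached a reference to it observes the new credentials.
     */
    static void refreshInPlace(Subject existing, Subject fresh) {
        // The returned set is backed by the Subject, so mutating it mutates the Subject.
        Set<Object> target = existing.getPrivateCredentials();
        // Snapshot the new credentials before touching the live set.
        Set<Object> replacement = new HashSet<>(fresh.getPrivateCredentials());
        synchronized (target) {
            target.clear();
            target.addAll(replacement);
        }
    }

    /** Returns the latest expiry among the fake tickets held by the Subject. */
    static long latestExpiry(Subject subject) {
        long latest = Long.MIN_VALUE;
        for (FakeTicket ticket : subject.getPrivateCredentials(FakeTicket.class)) {
            latest = Math.max(latest, ticket.endTimeMillis);
        }
        return latest;
    }

    public static void main(String[] args) {
        Subject shared = login("trino@EXAMPLE.COM", 30_000L);
        Subject heldByClient = shared; // e.g. a cached client that keeps the original reference

        refreshInPlace(shared, login("trino@EXAMPLE.COM", 60_000L));

        System.out.println(latestExpiry(heldByClient)); // prints 60000
    }
}
```

Returning the result of `login(...)` directly would leave `heldByClient` pointing at the expired ticket, which is the failure mode described above.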

Release notes

(x) Release notes are required, with the following suggested text:

# Hive
* Fix possible query failure with a Kerberized Hive connector when the query executes longer than the Kerberos ticket lifetime. ({issue}`16680`)

@hashhar hashhar added the tests:all Run all tests label Mar 23, 2023
@cla-bot cla-bot bot added the cla-signed label Mar 23, 2023
@hashhar hashhar requested review from Praveen2112 and kokosing March 23, 2023 09:29
@kokosing
Member

Is it still draft?

@kokosing kokosing requested a review from electrum March 23, 2023 09:35
@hashhar
Member Author

hashhar commented Mar 23, 2023

It is a draft only because I want to remember to remove the stress test commit.

@hashhar hashhar marked this pull request as ready for review March 23, 2023 09:36
Member

@kokosing kokosing left a comment

looks good

@hashhar hashhar force-pushed the hashhar/kerberos-renewal-fix branch from 2c01e00 to e708bab Compare March 23, 2023 10:15
@hashhar
Member Author

hashhar commented Mar 24, 2023

Added a better test; it fails without the fix and passes with the fix applied.

@hashhar hashhar force-pushed the hashhar/kerberos-renewal-fix branch from 1b37cdd to 3c55768 Compare March 26, 2023 08:55
forwardable = true
allow_weak_crypto = true
# low ticket_lifetime to make sure refresh code in Trino is tested
ticket_lifetime = 30s
Member Author

Note that I had to remove renew_lifetime from here to avoid hitting #11251 (comment).

We can add it back once trinodb/docker-images#162 is available, but I don't see a reason to: our code doesn't care whether tickets are renewable, because it always performs a "relogin" instead of a renewal.

Member Author

@hashhar hashhar Sep 7, 2023

Reminder to self: trinodb/docker-images#162 is moving forward, so this comment can soon be addressed.

@hashhar hashhar requested a review from Praveen2112 March 26, 2023 08:57
@hashhar hashhar force-pushed the hashhar/kerberos-renewal-fix branch from 3c55768 to 50057b7 Compare March 26, 2023 08:58
@hashhar
Member Author

hashhar commented Mar 27, 2023

@electrum @Praveen2112 this is ready for another round of review. I've simplified the test and fixed the failures caused by having renew_lifetime in the krb5.conf.

Comment on lines +64 to +65
onTrino().executeQuery("SET SESSION scale_writers = false");
onTrino().executeQuery("SET SESSION task_scale_writers_enabled = false");
Member

Will these session properties be reflected across all other uses of onTrino?

Member Author

When onTrino is closed it closes the connection it was using thus resetting session properties.

I'll need to verify whether the session is shared even across test classes.

cc: @findepi if he knows the answer. I also see lots of other tests use SET SESSION directly without resetting or any form of isolation.

Member

So do we close the connection once a test is completed?

Member Author

Actually yes, anything we get from TestContext (e.g. testContext().getDependency(QueryExecutor.class, config)) gets closed after each test. This is done by tempto, see https://github.com/trinodb/tempto/blob/a2f7ef3b914db2aeb3480cd845929b1ab26103ef/tempto-core/src/main/java/io/trino/tempto/internal/context/GuiceTestContext.java#L135

Thanks @kokosing for explaining how this magic works.

Member

> When onTrino is closed it closes the connection it was using thus resetting session properties.

yes

> I'll need to verify whether the session is shared even across test classes.

product tests are single-threaded

(I think the test class context holds the query executor, so they likely wouldn't be shared anyway)
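
To make the lifecycle discussed in this thread concrete, here is a toy model of a per-test context that closes everything it hands out. It does not use tempto; `PerTestContext` and `FakeQueryExecutor` are hypothetical classes, and the assumption that session properties live only as long as the executor's connection is taken from the comments above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PerTestContextSketch {
    /** Hypothetical executor whose session properties are scoped to its connection. */
    static final class FakeQueryExecutor implements AutoCloseable {
        private final Map<String, String> sessionProperties = new HashMap<>();
        private boolean closed;

        void setSession(String name, String value) {
            if (closed) {
                throw new IllegalStateException("executor is closed");
            }
            sessionProperties.put(name, value);
        }

        Map<String, String> sessionProperties() {
            return new HashMap<>(sessionProperties);
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            // Closing the connection discards its session state.
            sessionProperties.clear();
            closed = true;
        }
    }

    /** Hypothetical per-test context that closes every dependency it created. */
    static final class PerTestContext implements AutoCloseable {
        private final List<FakeQueryExecutor> handedOut = new ArrayList<>();

        FakeQueryExecutor queryExecutor() {
            FakeQueryExecutor executor = new FakeQueryExecutor();
            handedOut.add(executor);
            return executor;
        }

        @Override
        public void close() {
            handedOut.forEach(FakeQueryExecutor::close);
            handedOut.clear();
        }
    }

    public static void main(String[] args) {
        PerTestContext firstTest = new PerTestContext();
        FakeQueryExecutor first = firstTest.queryExecutor();
        first.setSession("scale_writers", "false");
        firstTest.close(); // the framework would do this after the test finishes

        PerTestContext secondTest = new PerTestContext();
        FakeQueryExecutor second = secondTest.queryExecutor();
        System.out.println(first.isClosed());                     // prints true
        System.out.println(second.sessionProperties().isEmpty()); // prints true
        secondTest.close();
    }
}
```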

@hashhar hashhar force-pushed the hashhar/kerberos-renewal-fix branch from 50057b7 to 214bca5 Compare March 27, 2023 10:09
@hashhar
Member Author

hashhar commented Mar 27, 2023

AC @Praveen2112. PTAL again.

Member

@Praveen2112 Praveen2112 left a comment

Should the commit message mention re-login, since we don't exactly do a refresh but an explicit login?

@hashhar
Member Author

hashhar commented Mar 27, 2023

refresh != renewal. I kept "refresh" because existing code already uses that term, for example in getNextRefreshTime.

I do agree however that relogin makes more sense. In a follow-up I can rename the methods as well to use relogin to clarify what the code actually does. Would that be fine?
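
For readers unfamiliar with the distinction, a minimal sketch of how a refresh time could drive a relogin is shown below. The 80% window, the class name, and both method names are assumptions made for illustration; they are not taken from this PR. The point is that the decision depends only on the ticket's start and end times, not on whether the ticket is renewable.

```java
final class RefreshTimeSketch {
    // Assumed fraction of the ticket lifetime after which a relogin is attempted.
    private static final int REFRESH_WINDOW_PERCENT = 80;

    /** Returns the time at which a relogin should be attempted for a ticket. */
    static long nextRefreshTimeMillis(long startMillis, long endMillis) {
        if (endMillis < startMillis) {
            throw new IllegalArgumentException("ticket ends before it starts");
        }
        return startMillis + (endMillis - startMillis) * REFRESH_WINDOW_PERCENT / 100;
    }

    /** A relogin obtains a brand-new ticket, so renewability is never consulted. */
    static boolean needsRelogin(long nowMillis, long startMillis, long endMillis) {
        return nowMillis >= nextRefreshTimeMillis(startMillis, endMillis);
    }

    public static void main(String[] args) {
        // ticket_lifetime = 30s, as in the test krb5.conf
        System.out.println(nextRefreshTimeMillis(0L, 30_000L)); // prints 24000
        System.out.println(needsRelogin(10_000L, 0L, 30_000L)); // prints false
        System.out.println(needsRelogin(25_000L, 0L, 30_000L)); // prints true
    }
}
```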

@Praveen2112
Member

Yeah


// 2x of ticket_lifetime as configured in hadoop-kerberos krb5.conf, sufficient to cause at-least 1 ticket expiry
SECONDS.sleep(60L);
cancelQueryIfRunning(sql);
Member

do we need to cancel the query? Why not simply wait?

Member Author

It runs for a very long time. See some reasons not to wait in #16680 (comment)

Member Author

In short, because we can't reliably control how long a query runs. If the query finishes quickly, the refresh won't get tested; if it runs too long, it adds CI time and risks a timeout.
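
The "wait past expiry, then cancel" pattern can be sketched with plain `java.util.concurrent`, as below. This is not the product test from this PR: `runPastExpiry` and `Outcome` are hypothetical names, and the query is modelled as an arbitrary `Callable` rather than a Trino statement.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class LongRunningQueryCheck {
    enum Outcome { CANCELLED_AFTER_WAIT, FINISHED_EARLY }

    /**
     * Runs the query in the background and waits for waitMillis, which should
     * exceed the ticket lifetime. A query that fails inside the window raises
     * an exception; a query still running afterwards is cancelled.
     */
    static Outcome runPastExpiry(Callable<Void> query, long waitMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Void> future = executor.submit(query);
            try {
                future.get(waitMillis, TimeUnit.MILLISECONDS);
                // Finished before the wait elapsed: the refresh path may not have run.
                return Outcome.FINISHED_EARLY;
            } catch (TimeoutException stillRunning) {
                // Survived the whole window without failing, so stop it to save CI time.
                future.cancel(true);
                return Outcome.CANCELLED_AFTER_WAIT;
            } catch (ExecutionException e) {
                throw new IllegalStateException("query failed before the wait elapsed", e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a query that runs far longer than the wait.
        Callable<Void> slowQuery = () -> {
            Thread.sleep(10_000L);
            return null;
        };
        System.out.println(runPastExpiry(slowQuery, 200L)); // prints CANCELLED_AFTER_WAIT
        System.out.println(runPastExpiry(() -> null, 200L)); // prints FINISHED_EARLY
    }
}
```

Treating `FINISHED_EARLY` as a separate outcome makes the first risk mentioned above visible: a test can then flag runs in which the query ended before any ticket could expire.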

@hashhar hashhar merged commit f3c721d into trinodb:master Mar 28, 2023
@hashhar hashhar deleted the hashhar/kerberos-renewal-fix branch March 28, 2023 05:33
@hashhar hashhar mentioned this pull request Mar 28, 2023
@github-actions github-actions bot added this to the 411 milestone Mar 28, 2023
4 participants