Explicitly set input dir in job conf instead of FileInputFormat.setInputPath which makes an IO call #16640

Merged (5 commits) on Mar 29, 2023

akshayrai
Contributor

Description

Fixes #16639

Release notes

(x) This is not user-visible or docs only and no release notes are required.

@cla-bot cla-bot bot added the cla-signed label Mar 20, 2023
@github-actions github-actions bot added hive Hive connector tests:hive labels Mar 20, 2023
@Praveen2112
Member

Can we add a test case that fails without this fix?

@@ -529,7 +529,7 @@ private ListenableFuture<Void> loadPartition(HivePartitionMetadata partition)
}

JobConf jobConf = toJobConf(configuration);
- FileInputFormat.setInputPaths(jobConf, path);
+ hdfsEnvironment.doAs(hdfsContext.getIdentity(), () -> FileInputFormat.setInputPaths(jobConf, path));
Member

Why is this needed? It seems to only modify JobConf, not do any I/O.

Contributor Author

That's what I thought too. But internally it calls the NameNode to fetch the working directory and also sets mapreduce.job.working.dir in the jobConf. That call started to fail:

Caused by: java.lang.RuntimeException: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "<hostname>/<ip>"; destination host is: "<namenode-hostname>":<port>;
	at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:665)
	at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:452)
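
To illustrate the mechanism behind this stack trace, here is a minimal self-contained sketch. It does not use the real Hadoop classes: `FakeJobConf`, `setInputPathViaHelper`, `remoteCalls`, and the placeholder directory are all hypothetical stand-ins, and the path resolution is deliberately simplified. It only shows how a setter that resolves against a lazily fetched working directory ends up making a remote call, while writing the property directly does not.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a job configuration; NOT the real Hadoop JobConf.
final class FakeJobConf {
    static final String WORKING_DIR = "mapreduce.job.working.dir";
    static final String INPUT_DIR = "mapreduce.input.fileinputformat.inputdir";

    private final Map<String, String> props = new HashMap<>();
    int remoteCalls; // counts simulated NameNode round trips

    void set(String key, String value) {
        props.put(key, value);
    }

    String get(String key) {
        return props.get(key);
    }

    // Lazy lookup: if the working directory is unset, ask the (simulated)
    // file system and cache the answer in the configuration.
    String getWorkingDirectory() {
        String dir = props.get(WORKING_DIR);
        if (dir == null) {
            remoteCalls++; // stand-in for the call that failed with the Kerberos error
            dir = "/user/placeholder"; // assumed value, for illustration only
            props.put(WORKING_DIR, dir);
        }
        return dir;
    }

    // Simplified analogue of the helper: the working directory is always
    // fetched first, even when the input path is already absolute.
    void setInputPathViaHelper(String path) {
        String workingDir = getWorkingDirectory();
        String resolved = path.startsWith("/") ? path : workingDir + "/" + path;
        set(INPUT_DIR, resolved);
    }

    public static void main(String[] args) {
        FakeJobConf viaHelper = new FakeJobConf();
        viaHelper.setInputPathViaHelper("/warehouse/t/p=1");
        System.out.println("helper remote calls: " + viaHelper.remoteCalls);

        FakeJobConf direct = new FakeJobConf();
        direct.set(INPUT_DIR, "/warehouse/t/p=1");
        System.out.println("direct remote calls: " + direct.remoteCalls);
    }
}
```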

Member

Ah, I see. We never set the working directory, so we could replace both usages of this method:

jobConf.set(FileInputFormat.INPUT_DIR, StringUtils.escapeString(path.toString()));

Member

There is another usage inside createHiveSymlinkSplits() which also needs to be fixed.

Contributor Author

That should also work and may be better. Thanks for the feedback. Let me test and update the PR.

One other thing, FileInputFormat.INPUT_DIR isn't defined in the version of hadoop-apache (3.2.0-18) we use. I'll expose the parameter ("mapreduce.input.fileinputformat.inputdir") in this class and update both places.
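
A rough sketch of that shape, using only the JDK so it runs standalone. The property key is the one quoted above; the constant name `FILE_INPUT_FORMAT_INPUT_DIR`, the class `InputDirSketch`, the `Map` used in place of the job conf, and the local `escapeCommas` helper are assumptions for illustration. The helper is meant to approximate the escaping step (backslash-prefixing commas and backslashes) and is not the library method itself.

```java
import java.util.HashMap;
import java.util.Map;

final class InputDirSketch {
    // Key quoted in the comment above, declared locally because the constant
    // is not available in the Hadoop version in use. The name is hypothetical.
    static final String FILE_INPUT_FORMAT_INPUT_DIR = "mapreduce.input.fileinputformat.inputdir";

    // Assumed escaping behaviour: prefix ',' and '\' with a backslash so a
    // path containing a comma is not later split into several input paths.
    static String escapeCommas(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == ',' || c == '\\') {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Writes the input directory straight into the configuration map,
    // without resolving it against a working directory.
    static void setInputDir(Map<String, String> conf, String path) {
        conf.put(FILE_INPUT_FORMAT_INPUT_DIR, escapeCommas(path));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setInputDir(conf, "/warehouse/t/p=a,b");
        System.out.println(conf.get(FILE_INPUT_FORMAT_INPUT_DIR));
    }
}
```

Both call sites could then go through one such helper, so the key and the escaping live in a single place.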

Contributor Author

@electrum I have updated the PR based on the above feedback.

CI failed on a previous commit due to a checkstyle violation, but that has been fixed in the latest commit.

@akshayrai akshayrai changed the title from "wrap FileInputFormat.setInputPath call inside hdfsEnvironment.doAs" to "Explicitly set input dir in job conf instead of FileInputFormat.setInputPath which makes an IO call" Mar 28, 2023
@electrum electrum merged commit daa80f6 into trinodb:master Mar 29, 2023
@github-actions github-actions bot added this to the 411 milestone Mar 29, 2023
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

Kerberos error when using custom input format via UseFileSplitsFromInputFormat
3 participants