Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate Hive DirectoryLister to TrinoFileSystem #17323

Merged
merged 5 commits into from
May 16, 2023

Conversation

alexjo2144
Copy link
Member

@alexjo2144 alexjo2144 commented May 2, 2023

Description

Refactoring some of the Hadoop FileSystem uses to instead use TrinoFileSystem.

Additional context and related issues

Removing the hierarchical listing API from the DirectoryLister was a necessary requirement for the migration, since TrinoFileSystem does not have a directory based listing API, only a recursive one.

There are a couple functional changes that I think are unavoidable:
- Reading from Hive tables pointing to a location that does not exist produces no rows rather than failing
- hive.ignore-absent-partitions now applies both to partitions whose directories do not exist, but also ones whose directories exist but are empty
Edit: There are no longer any behavior changes

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label May 2, 2023
@github-actions github-actions bot added hive Hive connector tests:hive labels May 2, 2023
@alexjo2144 alexjo2144 force-pushed the hive/migrate-split-gen-to-trinofs branch 3 times, most recently from b5c3a63 to 7b9f1ba Compare May 4, 2023 15:27
@alexjo2144
Copy link
Member Author

Note to self, DirectoryListingCacheKey is unused now. Just didn't want to interrupt the build for that update.

@alexjo2144
Copy link
Member Author

io.trino.tests.product.hive.TestAvroSymlinkInputFormat.testSymlinkTableWithNestedDirectory is the last failure I don't understand. It passes when I run it individually but fails when run as the full set like in suite-5 -g 'configured_features, storage_formats, hdfs_impersonation'

@electrum
Copy link
Member

electrum commented May 5, 2023

Why do we need "Support Locations without authorities on all schemes"?

electrum
electrum previously approved these changes May 5, 2023

return COMPLETED_FUTURE;
}

private List<TrinoFileStatus> listBucketFiles(Path path, FileSystem fs, String partitionName)
private List<TrinoFileStatus> listBucketFiles(Location location, TrinoFileSystem fs, String partitionName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: put the file system argument first

@electrum electrum self-requested a review May 5, 2023 03:25
@electrum electrum dismissed their stale review May 5, 2023 03:26

Not sure about first commit

@alexjo2144
Copy link
Member Author

Why do we need "Support Locations without authorities on all schemes"?

The product tests hit a few cases where we were listing locations like hdfs:/a/table/path without an authority/host present. This seems to line up with the Wikipedia entry for URIs which says the :// prefix notes that an authority is present, otherwise the scheme is directly followed by the path.

@alexjo2144 alexjo2144 force-pushed the hive/migrate-split-gen-to-trinofs branch 4 times, most recently from 432e959 to 9b9abc4 Compare May 11, 2023 15:46
@alexjo2144 alexjo2144 requested a review from electrum May 11, 2023 15:48
@alexjo2144 alexjo2144 force-pushed the hive/migrate-split-gen-to-trinofs branch from 9b9abc4 to 102fcfb Compare May 12, 2023 19:58
@alexjo2144
Copy link
Member Author

alexjo2144 commented May 12, 2023

Just a fixed checkstyle error

@electrum
Copy link
Member

Not sure if this is related or flaky:

Error:  io.trino.hdfs.TestFileSystemCache.testFileSystemCacheConcurrency  Time elapsed: 0.825 s  <<< FAILURE!
java.lang.AssertionError: Cache size is non zero expected [0] but found [1]
	at org.testng.Assert.fail(Assert.java:94)
	at org.testng.Assert.failNotEquals(Assert.java:513)
	at org.testng.Assert.assertEqualsImpl(Assert.java:135)
	at org.testng.Assert.assertEquals(Assert.java:116)
	at org.testng.Assert.assertEquals(Assert.java:284)
	at io.trino.hdfs.TestFileSystemCache.testFileSystemCacheConcurrency(TestFileSystemCache.java:118)

@alexjo2144 alexjo2144 force-pushed the hive/migrate-split-gen-to-trinofs branch from 102fcfb to d616d90 Compare May 15, 2023 17:29
@alexjo2144 alexjo2144 force-pushed the hive/migrate-split-gen-to-trinofs branch from d616d90 to 0d47cd3 Compare May 15, 2023 18:05
@alexjo2144
Copy link
Member Author

Comments applied, had to resolve some conflicts from #17432 as well

@alexjo2144
Copy link
Member Author

And looks like TestFileSystemCache was just a flake: #17158

@electrum
Copy link
Member

electrum commented May 16, 2023

TestFileSystemCache failed again. Retrying it now. It seems suspicious, but this PR doesn't seem to touch anything that could affect it, so maybe we were just unlucky twice.

@yohengyang
Copy link

yohengyang commented Jun 30, 2023

public Optional<Boolean> directoryExists(Location location)
throws IOException
{
stats.getDirectoryExistsCalls().newCall();
Path directory = hadoopPath(location);
FileSystem fileSystem = environment.getFileSystem(context, directory);
return environment.doAs(context.getIdentity(), () -> {
if (!hierarchical(fileSystem, location)) {
return Optional.empty();
}
try (TimeStat.BlockTimer ignored = stats.getDirectoryExistsCalls().time()) {
FileStatus fileStatus = fileSystem.getFileStatus(directory);
return Optional.of(fileStatus.isDirectory());
}
catch (FileNotFoundException e) {
return Optional.of(false);
}
catch (IOException e) {
stats.getListFilesCalls().recordException(e);
throw e;
}
});

@alexjo2144 @electrum Is it necessary to use the hierarchical method in HdfsFileSystem to determine whether the file system is a Hierarchical file system or an Object store file system? Because when using HdfsFileSystem, it is certain that it is not an Object store file system.

@alexjo2144
Copy link
Member Author

Is it necessary to use the hierarchical method in HdfsFileSystem

It is for now, because we haven't introduced native TrinoFileSystem implementations for all file systems yet. HdfsFileSystem is still used for s3/ Azure FSs, GCS, etc. Once it is only used to connect to HDFS we will be able to take it out.

@alexjo2144 alexjo2144 deleted the hive/migrate-split-gen-to-trinofs branch June 30, 2023 15:53
@yohengyang
Copy link

HdfsFileSystem is still used for s3/ Azure FSs, GCS, etc. Once it is only used to connect to HDFS we will be able to take it out.

Tks for replying, got it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

3 participants