ARROW-17306: [C++] Provide an optimized GetFileInfoGenerator specialization for LocalFileSystem #13796
Force-pushed the branch to fix some build issues on macOS and Windows. Changelog can be found here: https://github.com/apache/arrow/compare/b6714018c4384cd00f0b9a4e92d804671d1d381b..3f2b141dea8ded351d41d687245917821295277a?diff=unified
I believe the AppVeyor failures have nothing to do with my patches. Also, the "Dev / Source Merge Script" jobs seem to fail due to some issues related to the 10.0.0 freeze?
Thanks for contributing this. I think the addition of an async file generator will be a nice asset for scanning, especially for datasets with lots of files. So far I have only taken a high-level pass through the implementation.
Did you find the async generator utilities useful? I have gotten some feedback that the merged async generator was a little confusing, and I was considering moving away from it in favor of something like a nestable async task group (see https://issues.apache.org/jira/browse/ARROW-16072). I can try to draft an example of what this might look like on Monday. However, if you're content with the current implementation, we should proceed with it as it is and leave the merged generator question for the future.
I think it might be good to add a stress test now that this is parallel. Maybe create 10 directories with 10 directories with 10k files each so that we can get some testing of both nested parallelism and DiscoveryImplIterator::kBatchSize.
I will try to look through this more closely on Monday.
/// How many partitions should be processed in parallel. May not be supported by all
/// implementations of `GetFileSystemGenerator`.
util::optional<int32_t> partitions_readahead;
Hmm... this makes sense, but we haven't used util::optional in public interfaces that I know of. Typically we do something like int32_t partitions_readahead = kDefaultPartitionsReadahead. The downside is that we usually end up having to repeat the default in Python. Curious if @pitrou has any opinion here.
Also, I don't think "partitions" is the correct terminology to be using here. The FileSelector is to be understood in the context of "filesystems", which is a more generic abstraction than partitions. Perhaps directory_readahead?
Are there any practical reasons why you avoid optionals in public APIs? Does Arrow try to maintain strict ABI stability across multiple releases? If not, then I guess it should be completely fine.
Regarding directory_readahead: I agree, the term partitions does not belong to this level of abstraction. I'll change it to directory_readahead, then.
We don't have any ABI guarantees. But when there is a well-known default value, it doesn't make sense to pass an optional, IMHO.
However, I also don't understand why this is an attribute of FileSelector. This sounds more like an implementation-specific knob that should probably be in LocalFileSystemOptions.
(I'm also curious why this needs to be exposed at all)
> We don't have any ABI guarantees. But when there is a well-known default value, it doesn't make sense to pass an optional, IMHO.
Maybe you are right: this can be given a reasonable default value, so there is no need to make it optional.
> However, I also don't understand why this is an attribute of FileSelector. This sounds more like an implementation-specific knob that should probably be in LocalFileSystemOptions.
Well, I don't have a strong opinion on this one. I don't see anything wrong with it being a part of FileSelector (which is designed precisely to specify details of file selection algorithms). Though, since this option is only applicable to LocalFileSystem, maybe it makes sense to put it in LocalFileSystemOptions.
> (I'm also curious why this needs to be exposed at all)
To be able to fine-tune the behavior if the default doesn't work well in a particular case, or if better performance can be achieved with another value. For example, filesystems vary in their capabilities for parallel IO; XFS is the only FS I know of that is capable of truly async IO.
Agreed that the test failures are unrelated. The AppVeyor issues appear to be addressed already by #13795.
/// in serial manner via `MakeConcatenatedGenerator` under the hood.
class AsyncStatSelector {
 public:
  using FileInfoGeneratorProducer = PushGenerator<FileInfoGenerator>::Producer;
Hmm... I'm curious; from an implementation POV, wouldn't it be simpler to have a PushGenerator<FileInfoVector>?
(that's what the S3 implementation of this does, for example)
Surely this can be done, but I'll need to play with the code a little bit to figure out how this will work out.
I suggest we move forward with the current approach and, in case the code can be reshaped in a more optimal way, just provide the fix as a follow-up.
The reason I'm suggesting this is that it seems like the natural way to implement GetFileInfoGenerator, and it would also make the code easier to read and maintain. If you have some time to experiment, it would be good to give it a try IMHO. Unless you have other PRs pending that depend on this feature, merging this PR soon is not critical.
Well, there aren't any yet, but I plan to post some more PRs soon :)
Force-pushed the branch to address review comments. Diff can be found here: https://github.com/apache/arrow/compare/3f2b141dea8ded351d41d687245917821295277a..136ae80540f851219c8d9a920470d4863a428d30 Changelog:
Thanks for the update, I posted a couple comments.
Also, can you add some localfs-specific tests for this? Ideally you would stress both with and without directory readahead...
cpp/src/arrow/filesystem/localfs.h
Outdated
static constexpr uint32_t kDefaultDirectoryReadahead = 1u;
static constexpr uint32_t kDefaultBatchSize = 1000u;
Style nit: let's avoid gratuitous use of unsigned integers. Can make these int or int32_t.
Just curious: why are uint32_t:s bad?
They are not bad per se, but the general tendency should be to use them only when necessary (for example, you want to perform unsigned computations, or you are reading data provided by a third-party API), as otherwise you inevitably end up mixing signed and unsigned, which is generally annoying.
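To illustrate the kind of annoyance meant here, a minimal sketch follows. The helper names `MixedLess` and `SignedLess` are hypothetical and not part of Arrow. It assumes a typical platform where `int` is 32 bits wide, so that comparing an `int32_t` with a `uint32_t` converts the signed operand to unsigned first; the cast spells out that implicit conversion so the sketch compiles without sign-compare warnings.

```cpp
#include <cstdint>

// Hypothetical helper: mimics what `value < limit` does when the operands are
// int32_t and uint32_t. The signed operand is converted to unsigned, so -1
// becomes a very large number and the comparison gives a surprising answer.
inline bool MixedLess(int32_t value, uint32_t limit) {
  return static_cast<uint32_t>(value) < limit;
}

// Hypothetical helper: the same comparison with both operands signed behaves
// the way a reader would expect.
inline bool SignedLess(int32_t value, int32_t limit) { return value < limit; }
```

With these definitions, `MixedLess(-1, 1u)` is false while `SignedLess(-1, 1)` is true, which is one reason to keep option fields such as a readahead count signed unless unsigned arithmetic is really needed.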
cpp/src/arrow/filesystem/localfs.h
Outdated
/// a single FileInfoVector chunk by the `GetFileSystemGenerator` impl, which
/// is the result of `stat`:ing individual dirents, obtained by the call to
/// `internal::ListDir`.
uint32_t batch_size = kDefaultBatchSize;
Rename this to file_info_batch_size for clarity?
Will do.
cpp/src/arrow/filesystem/localfs.cc
Outdated
if (!result.ok()) {
  auto status = result.status();
  if (selector_.allow_not_found && status.IsIOError()) {
    auto exists = FileExists(dir_fn_);
I think here you can simply propagate the error, which will simplify the code below.
Suggested change:
- auto exists = FileExists(dir_fn_);
+ ARROW_ASSIGN_OR_RAISE(bool exists, FileExists(dir_fn));
Yes, you are right, I'll fix that.
cpp/src/arrow/filesystem/localfs.cc
Outdated
if (exists.ok() && !*exists) {
  return Status::OK();
} else {
  return exists.ok() ? arrow::Status::UnknownError(
If exists is ok you should simply propagate the not-found error, not create another one.
This would probably make the final code look like this:
if (!result.ok()) {
  if (selector_.allow_not_found && status.IsIOError()) {
    ARROW_ASSIGN_OR_RAISE(bool exists, FileExists(dir_fn));
    if (!exists) {
      return Status::OK();
    }
  }
  return status;
}
(incidentally, this is a similar snippet to the one in StatSelector())
cpp/src/arrow/filesystem/localfs.cc
Outdated
FileInfoVector yield_vec;
std::swap(yield_vec, current_chunk_);
return yield_vec;
Why not simply:
Suggested change:
- FileInfoVector yield_vec;
- std::swap(yield_vec, current_chunk_);
- return yield_vec;
+ return std::move(current_chunk_);
Thanks for the suggestion, missed that one.
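For readers comparing the two spellings, here is a small self-contained sketch of the "hand out the accumulated chunk and start a fresh one" step. It is not the PR's code: `ChunkBuffer` is a hypothetical class, `FileInfoVector` is replaced by a vector of strings, and `std::exchange` is used as one more alternative that makes the reset of the member explicit.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for Arrow's FileInfoVector (assumption: a plain vector of paths).
using FileInfoVector = std::vector<std::string>;

// Hypothetical accumulator that mirrors the role of `current_chunk_`.
class ChunkBuffer {
 public:
  void Add(std::string path) { current_chunk_.push_back(std::move(path)); }

  // Returns everything accumulated so far and leaves the member empty.
  FileInfoVector Yield() {
    return std::exchange(current_chunk_, FileInfoVector{});
  }

  std::size_t pending() const { return current_chunk_.size(); }

 private:
  FileInfoVector current_chunk_;
};
```

After `Add("a"); Add("b");`, a call to `Yield()` returns both entries and `pending()` reports zero, so the next batch starts from an empty buffer.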
FileInfoGeneratorProducer file_gen_producer,
uint32_t batch_size) {
  ARROW_RETURN_IF(file_gen_producer.is_closed(),
                  arrow::Status::Cancelled("Discovery cancelled"));
When does this occur?
Well, this is more of a general failsafe than something motivated by a real possibility. Ideally it should not happen, since there's no way to interact with the producer directly from the user's point of view. But it's always better to be safe than sorry. :)
Sure, I'll add some stress tests for the feature to cover both batching and processing parallelism options.
Force-pushed from ac2954d to affcca7.
Force-pushed the branch to address review comments from @pitrou and @westonpace. The diff can be found here: https://github.com/apache/arrow/compare/136ae80540f851219c8d9a920470d4863a428d30..ac2954d39b5de052b3c05563d290f00c23b51b34 (I also pushed a fixup to amend the commit message). Changelog:
Thanks for the update @ManManson ! I have another bunch of comments below.
#include "arrow/util/make_unique.h"
#include "arrow/util/string_view.h"

#include "parquet/arrow/writer.h"
Can we make the benchmark not use Parquet? I think we want to minimize coupling here.
Just create a dummy filesystem structure and run your benchmark over that.
Makes sense, thanks.
cpp/src/arrow/filesystem/localfs.h
Outdated
/// Whether OpenInputStream and OpenInputFile return a mmap'ed file,
/// or a regular one.
bool use_mmap = false;

/// Options related to `GetFileSystemGenerator` interface.
Suggested change:
- /// Options related to `GetFileSystemGenerator` interface.
+ /// Options related to `GetFileInfoGenerator` interface.
cpp/src/arrow/filesystem/localfs.h
Outdated
/// Options related to `GetFileSystemGenerator` interface.

/// How many directories should be processed in parallel
/// by the `GetFileSystemGenerator` impl.
Suggested change:
- /// by the `GetFileSystemGenerator` impl.
+ /// by the `GetFileInfoGenerator` impl.
cpp/src/arrow/filesystem/localfs.h
Outdated
/// by the `GetFileSystemGenerator` impl.
int32_t directory_readahead = kDefaultDirectoryReadahead;
/// Specifies how much entries shall be aggregated into
/// a single FileInfoVector chunk by the `GetFileSystemGenerator` impl, which
Suggested change:
- /// a single FileInfoVector chunk by the `GetFileSystemGenerator` impl, which
+ /// a single FileInfoVector chunk by the `GetFileInfoGenerator` impl, which
cpp/src/arrow/filesystem/localfs.cc
Outdated
FileInfoVector yield_vec;
std::swap(yield_vec, current_chunk_);
Can probably be shortened:
Suggested change:
- FileInfoVector yield_vec;
- std::swap(yield_vec, current_chunk_);
+ auto yield_vec = std::move(current_chunk_);
Good catch!
cpp/src/arrow/filesystem/localfs.cc
Outdated
std::move(dir_fn), nesting_depth, std::move(selector),
file_gen_producer, file_info_batch_size)),
io::default_io_context().executor()));
gen = MakeTransferredGenerator(std::move(gen), arrow::internal::GetCpuThreadPool());
Hmm... why are we transferring to the CPU thread pool? That doesn't seem necessary.
cc @westonpace
But, as far as I understand, if the generator is not transferred back, every continuation attached to every future produced by this generator will also run on the IO thread pool?
This is undesirable, because we would want to do arbitrary (possibly compute-intensive) work on each chunk delivered by GetFileInfoGenerator().
But you may also want to do IO on the GetFileInfoGenerator results, in which case you'll transfer back and forth between executors?
Hmmm, looks reasonable. What we get from GetFileInfoGenerator() is only a raw file handle, so we would want to process it somehow later, e.g. read the file, inspect the metadata, etc. Given that, the generator output is going to lead to more IO operations either way, in most cases.
I think MakeTransferredGenerator() is not necessary, indeed.
I'll leave a note about that in DoDiscovery(), so that future readers don't have to guess.
Strange thing: after removing the call to MakeTransferredGenerator(), the benchmark hangs. Debugging that...
In the meantime, do you know of any quirks with background generators that might be related to this problem?
@westonpace would be the best person to answer that question.
Okay, it seems that I need to transfer from the BackgroundGenerator anyway. There are a number of related issues about deadlocking background generators, e.g. https://issues.apache.org/jira/browse/ARROW-13109 and https://issues.apache.org/jira/browse/ARROW-13110, which added a corresponding note to MakeBackgroundGenerator:
> You MUST transfer away from this background generator. Otherwise there could be a race condition if a callback on the background thread deletes the last consumer reference to the background generator. You can transfer onto the same executor as the background thread, it is only necessary to create a new thread task, not to switch executors.
It is legal, though, to specify the same executor for a background generator, in order to start a new task without switching to another executor.
So MakeTransferredGenerator should stay, but we can specify the same io_executor as the target.
Sorry I wasn't much help, I've been trying to wrap up my work on a better async task scheduler. I will give this PR a thorough look tomorrow. What you are describing sounds correct. Creating a new thread task doesn't seem ideal but it should work ok. I agree that, in the ideal case, no transfer should be necessary and we should just remain on the I/O thread.
cpp/src/arrow/filesystem/localfs.cc
Outdated
/// and automatically calls `Close()` on it once the ref-count for the
/// state reaches zero (which is equivalent to finishing the file discovery
/// process).
class AutoClosingProducer {
PushGenerator is thread-safe, but unfortunately AutoClosingProducer is not.
It seems you can vastly simplify this and make it thread-safe by letting C++ do the work:
struct DiscoveryState {
  FileInfoGeneratorProducer producer;
  ~DiscoveryState() {
    producer->Close();
  }
};
... then pass the same std::shared_ptr<DiscoveryState> to every DiscoveryImplIterator:
static Result<AsyncGenerator<FileInfoGenerator>> DiscoverDirectoryFiles(
    FileSelector selector, LocalFileSystemOptions fs_opts) {
  PushGenerator<FileInfoGenerator> file_gen;
  ARROW_ASSIGN_OR_RAISE(
      auto base_dir, arrow::internal::PlatformFilename::FromString(selector.base_dir));
  auto discovery_state = std::make_shared<DiscoveryState>(std::move(file_gen.producer()));
  ARROW_RETURN_NOT_OK(DoDiscovery(std::move(base_dir), 0, std::move(selector),
                                  std::move(discovery_state),
                                  fs_opts.file_info_batch_size));
  return file_gen;
}
Good point, thanks for catching it!
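The idea behind the suggested `DiscoveryState` can be tried out in isolation. The sketch below is not Arrow code: `FakeProducer`, `FakeIterator` and `CloseCountAfterIterators` are hypothetical stand-ins that only count how often `Close()` runs. It relies on the fact that the `std::shared_ptr` control block counts references atomically, so the destructor (and therefore `Close()`) runs exactly once, when the last holder releases the state.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the real producer; it only records Close() calls.
struct FakeProducer {
  int* close_count;
  void Close() { ++*close_count; }
};

// Shared state whose destructor closes the producer, as in the suggestion above.
struct DiscoveryState {
  explicit DiscoveryState(FakeProducer p) : producer(p) {}
  ~DiscoveryState() { producer.Close(); }
  DiscoveryState(const DiscoveryState&) = delete;
  DiscoveryState& operator=(const DiscoveryState&) = delete;

  FakeProducer producer;
};

// Hypothetical stand-in for an iterator that keeps the shared state alive.
struct FakeIterator {
  std::shared_ptr<DiscoveryState> state;
};

// Creates `num_iterators` (at least one) holders of a single state and returns
// how many times Close() was called once all of them have been destroyed.
int CloseCountAfterIterators(std::size_t num_iterators) {
  int close_count = 0;
  {
    auto state = std::make_shared<DiscoveryState>(FakeProducer{&close_count});
    std::vector<FakeIterator> iterators(num_iterators, FakeIterator{state});
    state.reset();  // from here on only the iterators hold references
    if (close_count != 0) return -1;  // closed too early
  }
  return close_count;
}
```

For any positive number of iterators the function returns 1: the producer stays open while at least one iterator is alive and is closed once afterwards.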
Boost::filesystem
Boost::system)
Looks like the Boost dependency isn't needed?
You are right, this should not be needed.
}
BENCHMARK_REGISTER_F(LocalFSFixture, AsyncFileDiscovery)
    ->ArgNames({"directory_readahead", "file_info_batch_size"})
    ->ArgsProduct({{1, 2, 4, 8, 16}, {1, 10, 100, 1000}})
Can we perhaps cut down on the number of generated benchmarks? For example:
Suggested change:
- ->ArgsProduct({{1, 2, 4, 8, 16}, {1, 10, 100, 1000}})
+ ->ArgsProduct({{1, 4, 16}, {100, 1000}})
(1 for file_info_batch_size seems so obviously pessimal that it needn't be tested, what do you think?)
Makes sense.
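To make the size of the reduction concrete: each benchmark instance corresponds to one pair from the cartesian product of the two argument lists. The helper below, `ArgCombinations`, is a hypothetical illustration and not part of Google Benchmark; it only enumerates the pairs and makes no claim about the order in which the library runs them.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: enumerates every (directory_readahead,
// file_info_batch_size) pair, i.e. the cartesian product of the two lists.
std::vector<std::pair<int64_t, int64_t>> ArgCombinations(
    const std::vector<int64_t>& readaheads,
    const std::vector<int64_t>& batch_sizes) {
  std::vector<std::pair<int64_t, int64_t>> combos;
  combos.reserve(readaheads.size() * batch_sizes.size());
  for (int64_t readahead : readaheads) {
    for (int64_t batch_size : batch_sizes) {
      combos.emplace_back(readahead, batch_size);
    }
  }
  return combos;
}
```

The original lists `{1, 2, 4, 8, 16}` and `{1, 10, 100, 1000}` give 5 x 4 = 20 combinations, while the suggested `{1, 4, 16}` and `{100, 1000}` give 3 x 2 = 6.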
Force-pushed the branch to address review comments from @pitrou. Changelog:
Looking at the benchmark only, a couple comments.
  return Status::OK();
});
ASSERT_FINISHES_OK(visit_fut);
st.SetItemsProcessed(total_file_count);
SetItemsProcessed is per benchmark, not per iteration, so to get correct numbers when iterations > 1, you should instead do something like:
size_t total_file_count = 0;
for (auto _ : st) {
  ...
}
st.SetItemsProcessed(total_file_count);
You are right, I will fix that.
const size_t nesting_depth_ = 2;
const size_t num_dirs_ = 10;
const size_t num_files_ = 10000;
This creates on the order of 1 million files total, and makes the benchmark rather slow to set up and run.
The reason for creating so many files was to adequately test large batch/dir_readahead combinations, which I think show sensible results only if there are enough files to stress the target code.
But if you think we can reduce the overall number of files, I don't have any strong objections.
num_files_ = 1000 looks more reasonable here, but YMMV :-)
Yeah, there's a bit of a tension between producing interesting numbers and making the benchmark rather expensive to run (even on a fast machine... I'm afraid macOS or Windows might be dismal here :-)).
I agree, Windows and macOS users will almost surely suffer in this case...
So, to keep benchmarking fast, let num_files_ be 1k, then. :)
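The "order of 1 million files" figure can be checked with a little arithmetic. The sketch below assumes a hypothetical layout in which every directory above the leaf level contains `num_dirs` subdirectories and only the leaf directories contain files; the benchmark's real layout may differ in detail, so the result is an estimate. `LeafFileCount` is an illustrative name, not a function from the PR.

```cpp
#include <cstdint>

// Hypothetical estimate: number of files when only leaf directories hold files.
// leaf directories = num_dirs ^ nesting_depth, each with num_files files.
inline int64_t LeafFileCount(int nesting_depth, int64_t num_dirs,
                             int64_t num_files) {
  int64_t leaf_dirs = 1;
  for (int level = 0; level < nesting_depth; ++level) {
    leaf_dirs *= num_dirs;
  }
  return leaf_dirs * num_files;
}
```

With `nesting_depth_ = 2` and `num_dirs_ = 10`, `num_files_ = 10000` gives 10 x 10 x 10000 = 1,000,000 files under this assumption, and `num_files_ = 1000` brings that down to 100,000.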
I'm running this benchmark locally on Ubuntu 20.04, 24-thread CPU, ext4 filesystem on a fast SSD.
These numbers seem to support a default readahead of 16 and a default batch size of 1000.
@pitrou Your numbers are in line with what I get on my machine. I agree we can tune the defaults to be |
Force-pushed the branch to address some follow-up comments. Changelog:
Ok, this looks good to me. Thanks for adding the benchmark. I'm curious if you've had a chance to benchmark any real world use cases (e.g. dataset discovery or something)?
The transfer seems the best we can do for now.
cpp/src/arrow/filesystem/localfs.h
Outdated
/// Specifies how much entries shall be aggregated into
/// a single FileInfoVector chunk by the `GetFileInfoGenerator` impl, which
/// is the result of `stat`:ing individual dirents, obtained by the call to
/// `internal::ListDir`.
This is going to be a rather heavy comment for your average Arrow user. Can we perhaps just give a brief description of the reasons you might want to change this parameter?
I'll be improving those docstrings. I think it's nice that they are explanatory.
I'll take a last look here.
This header makes use of `int8_t`, which is defined in `<cstdint>` system header.
Signed-off-by: Pavel Solodovnikov <[email protected]>

…h `directory_readahead` option
Introduce a `directory_readahead` option to `LocalFileSystemOptions` to adjust how much directory readahead should happen for `GetFileInfoGenerator()` (i.e. how many directories should be processed in parallel). Defaults to 16, judged by the benchmarking results (the benchmark itself will be a few patches later).
These changes will be used in a later patch to add a separate optimized specialization for `GetFileInfoGenerator(selector)` for `LocalFileSystem` class.
Signed-off-by: Pavel Solodovnikov <[email protected]>

…h `file_info_batch_size` option
Introduce a `file_info_batch_size` option to `LocalFileSystemOptions` to adjust how much internal batching should happen inside `GetFileInfoGenerator` implementation (e.g. how many elements should be yielded in a single batch by FileInfoGenerator). Defaults to 1k elements, based on benchmarking results.
Signed-off-by: Pavel Solodovnikov <[email protected]>

…sion of `GetFileInfoGenerator`
Introduce a helper class `AsyncStatSelector`, which contains an optimized specialization for `GetFileInfoGenerator` in the `LocalFileSystem` class.
There are two variants of async discovery functions supported:
1. `DiscoverDirectoryFiles`, which parallelizes traversal of individual directories so that each directory results are yielded as a separate `FileInfoGenerator` via an underlying `DiscoveryImplIterator`, which delivers items in chunks (default size is `kDefaultFileInfoBatchSize == 1K` items).
2. `DiscoverDirectoriesFlattened`, which forwards execution to the `DiscoverDirectoryFiles`, with the difference that the results from individual sub-directory iterators are merged into the single FileInfoGenerator stream.
The implementation makes use of additional attributes in `LocalFileSystemOptions`, such as `directory_readahead`, which can be used to tune algorithm behavior and adjust how many directories can be processed in parallel. This option is disabled by default, so that individual directories are processed in serial manner via `MakeConcatenatedGenerator` under the hood.
Also, internal batching can be configured by `LocalFileSystemOptions::file_info_batch_size`, which specifies how many `FileInfo`:s should be batched into a single `FileInfoVector` to yield in a single `DirectoryImplIterator::Next()` invocation.
Each `DirectoryImplIterator` maintains a reference to a shared `DiscoveryState` struct, which:
1. Ensures that the producer is alive while there is at least one iterator still running.
2. Producer is properly closed when the discovery process is finished (i.e. ref-count for shared `DiscoveryState` reaches 0).
Tests: unit(release)
Signed-off-by: Pavel Solodovnikov <[email protected]>
Co-Authored-by: Igor Seliverstov <[email protected]>

…tFileGenerator`
This patch adds a simple benchmark for testing `LocalFileSystem::GetFileInfoGenerator()` performance.
The test function is executed for each combination (cartesian product) of input arguments tuple (directory_readahead, file_info_batch_size) to test both internal parallelism and batching.
Test arguments are represented by the range `{{1, 4, 16}, {100, 1000}}`. I.e. directory readahead is tested for values 1 through 16, mult. factor 4; batch size is tested for values 100 through 1000, mult. factor 10.
Signed-off-by: Pavel Solodovnikov <[email protected]>
I pushed some minor changes, but most of all added a test to stress the GetFileInfoGenerator implementation with several parameter values.
/// EXPERIMENTAL: The maximum number of directories processed in parallel
/// by `GetFileInfoGenerator`.
int32_t directory_readahead = kDefaultDirectoryReadahead;

/// EXPERIMENTAL: The maximum number of entries aggregated into each
/// FileInfoVector chunk by `GetFileInfoGenerator`.
///
/// Since each FileInfo entry needs a separate `stat` system call, a
/// directory with a very large number of files may take a lot of time to
/// process entirely. By generating a FileInfoVector after this chunk
/// size is reached, we ensure FileInfo entries can start being consumed
/// from the FileInfoGenerator with less initial latency.
int32_t file_info_batch_size = kDefaultFileInfoBatchSize;
@ManManson @westonpace Are the docstrings ok to you?
LGTM, thanks.
That helps a lot, thank you.
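As a rough illustration of what `file_info_batch_size` controls, the sketch below splits a directory listing into chunks of at most `batch_size` entries. It is a simplified, synchronous stand-in, not the PR's implementation: `SplitIntoChunks` is a hypothetical function, `FileInfoVector` is replaced by a vector of strings, and no `stat` calls or generators are involved.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for Arrow's FileInfoVector (assumption: a plain vector of paths).
using FileInfoVector = std::vector<std::string>;

// Hypothetical helper: groups entries into chunks of at most `batch_size`
// elements, in order. A batch size of 0 is treated as 1 to guarantee progress.
std::vector<FileInfoVector> SplitIntoChunks(
    const std::vector<std::string>& entries, std::size_t batch_size) {
  batch_size = std::max<std::size_t>(batch_size, 1);
  std::vector<FileInfoVector> chunks;
  for (std::size_t begin = 0; begin < entries.size(); begin += batch_size) {
    const std::size_t end = std::min(begin + batch_size, entries.size());
    chunks.emplace_back(entries.begin() + static_cast<std::ptrdiff_t>(begin),
                        entries.begin() + static_cast<std::ptrdiff_t>(end));
  }
  return chunks;
}
```

A directory with 2500 entries and a batch size of 1000 yields three chunks of 1000, 1000 and 500 entries, so a consumer could start working on the first 1000 entries before the rest of the directory has been processed.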
Benchmark runs are scheduled for baseline = bc52f9f and contender = a1c3d57. a1c3d57 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have high level of regressions.
…lization for `LocalFileSystem` (apache#13796)
Introduce a specialization of `GetFileInfoGenerator` in the `LocalFileSystem` class.
This implementation tries to improve performance by hiding latencies at two levels:
1. Child directories can be readahead so that listing directories entries from disk can be achieved in parallel with other work;
2. Directory entries can be `stat`'ed and yielded in chunks so that the `FileInfoGenerator` consumer can start receiving entries before a large directory is fully processed.
Both mechanisms can be tuned using dedicated parameters in `LocalFileSystemOptions`.
Signed-off-by: Pavel Solodovnikov <[email protected]>
Co-Authored-by: Igor Seliverstov <[email protected]>
Lead-authored-by: Pavel Solodovnikov <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>