Refactor read_metadata in fastparquet engine #8092

Merged 60 commits on Oct 19, 2021
Changes from 56 commits
8b23c6e
initial pyarrow-dataset experimental refactor
rjzamora Aug 18, 2021
a0e0fec
initial refactor of both arrow engines
rjzamora Aug 19, 2021
697e4f2
move private methods to correct location
rjzamora Aug 19, 2021
08ea5ad
save state
rjzamora Aug 19, 2021
87845b9
ready to convert _metadata-free case to a dask graph
rjzamora Aug 20, 2021
a44bc39
parallel collect parts
rjzamora Aug 20, 2021
5086583
remove stale/unused code and move legacy-specific code into legacy class
rjzamora Aug 20, 2021
a9c36ba
Merge remote-tracking branch 'upstream/main' into read-metadata-refactor
rjzamora Aug 20, 2021
b978de5
ensure correct hive-partition order
rjzamora Aug 20, 2021
6c819ea
add files_per_metadata_task to docstring
rjzamora Aug 20, 2021
54db532
change default files_per_metadata_task to 32
rjzamora Aug 20, 2021
4b0b4d3
adding basic test coverage
rjzamora Aug 20, 2021
c51e970
Merge branch 'main' of https://github.com/dask/dask into read-metadat…
jrbourbeau Aug 24, 2021
52f6c30
minor fix
rjzamora Aug 25, 2021
1110330
change name to metadata_task_size
rjzamora Aug 25, 2021
e34c48b
port _determine_pf_parts to _collect_dataset_info
rjzamora Aug 25, 2021
0cd35b4
ported _generate_dd_meta to _create_dd_meta
rjzamora Aug 25, 2021
1354cf7
Move _construct_parts into _construct_collection_plan
rjzamora Aug 25, 2021
b4c8a45
remove unused (stale) code
rjzamora Aug 25, 2021
d9e48c5
minor paths tweak
rjzamora Aug 25, 2021
bb57a2a
resolve bug for partitioned data with gather_statistics=False
rjzamora Aug 25, 2021
322af53
improve pyarrow 'fast-path'
rjzamora Aug 25, 2021
7e8b8bd
enable faster partitioned reads with pyarrow
rjzamora Aug 25, 2021
9d88830
fix pyarrow partitioning behaviour
rjzamora Aug 26, 2021
18f316e
basic parallel fastparquet metadata support
rjzamora Aug 26, 2021
b4ca448
remove stale code
rjzamora Aug 26, 2021
ef59025
remove sync compute
rjzamora Aug 26, 2021
a468872
update arrow changes
rjzamora Aug 31, 2021
ef3bfe9
add arrow_filesystem option for 'pyarrow-dataset' engine
rjzamora Sep 8, 2021
355739e
avoid default caching in fs.open
rjzamora Sep 14, 2021
884cd38
roll back arrow_filesystem= option - non-default caching provides goo…
rjzamora Sep 14, 2021
abc561c
remove arrow_filesystem test
rjzamora Sep 14, 2021
ae3fc23
address missing pandas metadata issue
rjzamora Sep 14, 2021
7a9086b
Merge remote-tracking branch 'upstream/main' into read-metadata-refactor
rjzamora Sep 27, 2021
44bf0f5
Merge branch 'read-metadata-refactor' into read-metadata-refactor-fas…
rjzamora Sep 27, 2021
5fe0fe0
remove stale comment
rjzamora Sep 27, 2021
6f7fab5
re-order check for 'simple' partitioning plan
rjzamora Sep 28, 2021
ff82935
update arrow.py
rjzamora Sep 28, 2021
9b940f2
code review part 1
rjzamora Sep 29, 2021
a18223c
roll back partition_method name change and fix metadata_task_size test
rjzamora Sep 29, 2021
b2474d2
avoid regex
rjzamora Sep 29, 2021
05b544c
move metadata_task_size to config
rjzamora Sep 29, 2021
68f6490
docstring fix
rjzamora Sep 29, 2021
ec8139b
update dask/dask/dask-schema.yaml
rjzamora Sep 29, 2021
b39588c
Merge branch 'read-metadata-refactor' into read-metadata-refactor-fas…
rjzamora Sep 29, 2021
284736c
improve defaults
rjzamora Sep 29, 2021
6cab46c
Merge branch 'read-metadata-refactor' into read-metadata-refactor-fas…
rjzamora Sep 29, 2021
5ff9d14
bugfix in _set_metadata_task_size
rjzamora Sep 29, 2021
c79906c
Merge branch 'read-metadata-refactor' into read-metadata-refactor-fas…
rjzamora Sep 29, 2021
6ea87b3
avoid parallel metadata collection when the file count is less than m…
rjzamora Sep 29, 2021
9a444fe
test_metadata_task_size fix
rjzamora Sep 29, 2021
4587494
Merge remote-tracking branch 'upstream/main' into read-metadata-refac…
rjzamora Oct 1, 2021
2231875
Merge branch 'read-metadata-refactor-fastparquet' of https://github.c…
rjzamora Oct 1, 2021
47b07b9
address 8201
rjzamora Oct 4, 2021
c7401de
remove debug compute call in test
rjzamora Oct 4, 2021
6deed66
roll back base_path name change to reduce diff
rjzamora Oct 4, 2021
859c107
fix undefined piece mistake
rjzamora Oct 14, 2021
37d59e5
remove _common_metadata_exists
rjzamora Oct 14, 2021
d09c76c
update to OSError
rjzamora Oct 14, 2021
b9edc41
Merge remote-tracking branch 'upstream/main' into read-metadata-refac…
rjzamora Oct 14, 2021