DiskManager and TempFiles getting created several times per query #1690

Closed
alamb opened this issue Jan 27, 2022 · 1 comment · Fixed by #1700
Labels
datafusion Changes in the datafusion crate

Comments

alamb commented Jan 27, 2022

Describe the bug
If you run a query in DataFusion against parquet files, it will create several unnecessary temporary files.

IOx also hits the same thing (with the same root cause): https://github.com/influxdata/influxdb_iox/issues/3507#issuecomment-1023679575

Several places (non-obviously) create a DiskManager instance today. The one that hits the parquet use case above is the creation of the pruning predicate, which requires an ExecutionContext: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_optimizer/pruning.rs#L132

This has two problems:

  1. It is unneeded overhead (the disk manager is not used).
  2. The overhead is larger than it needs to be (it creates a temp file).

I propose a two-pronged solution (I will open two PRs):

  1. Create temp files on demand in the DiskManager (so we are at least not doing IO unless needed)
  2. Remove unnecessary creation of ExecutionContext
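
As a rough illustration of the first item, here is a minimal sketch of deferring the temp directory and file creation until a temp file is actually requested. It uses only the standard library. `LazyDiskManager`, `create_tmp_file`, `is_initialized`, and the directory naming scheme are all hypothetical names chosen for this example; this is not DataFusion's actual DiskManager API.

```rust
use std::fs::{self, File};
use std::io;
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Gives each manager in this process a distinct scratch directory name.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Hypothetical stand-in for a disk manager (not DataFusion's type).
pub struct LazyDiskManager {
    /// Parent directory under which the scratch directory will be created.
    base: PathBuf,
    /// Scratch directory plus a file counter; `None` until first use.
    state: Mutex<Option<(PathBuf, u64)>>,
}

impl LazyDiskManager {
    /// Constructing the manager performs no IO.
    pub fn new(base: PathBuf) -> Self {
        Self { base, state: Mutex::new(None) }
    }

    /// True once the scratch directory has been created.
    pub fn is_initialized(&self) -> bool {
        self.state.lock().unwrap().is_some()
    }

    /// Creates the scratch directory on first call, then a new file in it.
    pub fn create_tmp_file(&self) -> io::Result<(PathBuf, File)> {
        let mut guard = self.state.lock().unwrap();
        if guard.is_none() {
            let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
            let name = format!("lazy-dm-{}-{}", std::process::id(), id);
            let dir = self.base.join(name);
            fs::create_dir_all(&dir)?;
            *guard = Some((dir, 0));
        }
        let (dir, counter) = guard.as_mut().unwrap();
        let path = dir.join(format!("tmp-{}", *counter));
        *counter += 1;
        let file = File::create(&path)?;
        Ok((path, file))
    }
}
```

With this shape, a query that constructs the manager but never spills does no filesystem work at all; only the first call to `create_tmp_file` touches the disk.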

I think the second item will be a slightly larger project, as the ExecutionContext gets passed to create_physical_expr.

That said, I think the main sources of the problem are related to create_physical_expr, which only uses the context to look up variables, if necessary.
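
One way to picture the second item: if expression planning only needs variable lookup, it can accept a narrow lookup interface instead of a whole context. The sketch below is a toy under that assumption. `Expr`, `VarLookup`, and `resolve_expr` are hypothetical names for illustration only, and do not reflect the real signature of create_physical_expr.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified expression type (not DataFusion's `Expr`).
pub enum Expr {
    Literal(i64),
    Variable(String),
}

/// Hypothetical narrow interface: the only capability planning needs here.
pub trait VarLookup {
    fn get_var(&self, name: &str) -> Option<i64>;
}

/// A plain map is enough to satisfy the interface; no context required.
impl VarLookup for HashMap<String, i64> {
    fn get_var(&self, name: &str) -> Option<i64> {
        self.get(name).copied()
    }
}

/// Resolves an expression using only variable lookup, so callers never
/// need to build a full context (and, transitively, a disk manager).
pub fn resolve_expr(expr: &Expr, vars: &dyn VarLookup) -> Result<i64, String> {
    match expr {
        Expr::Literal(v) => Ok(*v),
        Expr::Variable(name) => vars
            .get_var(name)
            .ok_or_else(|| format!("unknown variable: {}", name)),
    }
}
```

The design point is that the caller supplies only what the function reads, which keeps heavyweight state out of code paths such as pruning predicate creation.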

alamb added the bug (Something isn't working) and datafusion (Changes in the datafusion crate) labels Jan 27, 2022

alamb commented Jan 28, 2022

cc @yjshen I plan to work on these items today
