
Rename concurrency to target_partitions #706

Merged · 4 commits · Aug 25, 2021

Conversation

@andygrove (Member) commented Jul 11, 2021

Which issue does this PR close?

Closes #685.

Rationale for this change

We originally used the concurrency config setting to determine how many threads to launch in certain parts of the code base, but we now use Tokio to manage task concurrency. The concurrency setting is now only used to determine the number of partitions to use when repartitioning the plan. We should therefore rename this config setting.

What changes are included in this PR?

Introduce a new with_default_partitions method and leave with_concurrency in place for now so that we don't break the API.
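A minimal sketch of this compatibility pattern, assuming hypothetical method bodies (the real ExecutionConfig has many more fields, and the #[deprecated] attribute is illustrative rather than confirmed to be in the PR):

```rust
/// Illustrative subset of ExecutionConfig; the real struct has many more fields.
pub struct ExecutionConfig {
    /// Number of partitions to use when repartitioning a plan
    default_partitions: usize,
}

impl ExecutionConfig {
    /// New, more accurately named setter.
    pub fn with_default_partitions(mut self, n: usize) -> Self {
        self.default_partitions = n;
        self
    }

    /// Old name kept so existing callers keep compiling; delegates to the
    /// new setter. (Marking it deprecated is an assumption for this sketch.)
    #[deprecated(note = "use with_default_partitions instead")]
    pub fn with_concurrency(self, n: usize) -> Self {
        self.with_default_partitions(n)
    }
}
```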

Are there any user-facing changes?

Yes

  • Benchmark CLI arguments have changed
  • If users were accessing the concurrency field in ExecutionContext, they will now need to use default_partitions instead. Users should not be accessing these attributes directly, though.

@andygrove andygrove requested a review from Dandandan July 11, 2021 14:48
@github-actions bot added the ballista and datafusion (Changes in the datafusion crate) labels Jul 11, 2021
@jorgecarleitao (Member) left a comment:

👍

```diff
      table_name: impl Into<String>,
  ) -> Result<Self> {
-     let provider = Arc::new(ParquetTable::try_new(path, max_concurrency)?);
+     let provider = Arc::new(ParquetTable::try_new(path, max_partitions)?);
```
Contributor left a comment:

One concern: when increasing the partitions, we also increase the maximum number of threads used while reading Parquet. I think this should be decoupled.

Contributor left a comment:

I agree it should be decoupled, but I also don't think this PR makes the coupling any worse -- perhaps we could file a follow-on ticket?

Contributor left a comment:

Filed #924 to track

@Dandandan (Contributor) commented Jul 11, 2021

I think this is mostly good.

One concern I have is that the current config also sets the maximum number of threads used when reading Parquet files.

For Ballista, I also think it makes sense to configure the number of threads Tokio uses. When running multiple executors per machine/node, you'll likely want to limit the number of (Tokio) threads to avoid higher memory usage. This could be set when creating/configuring the Tokio Runtime.

@andygrove (Member, Author):

> One concern I have is that the current config also sets the maximum number of threads used when reading Parquet files.

Is this still true, though? I know we were creating threads at one point in time, but we are using Tokio/async now, so we are not creating threads ourselves. Increasing the partition count will increase the number of async tasks that we run in the thread pool, but it won't increase the number of threads.

@Dandandan (Contributor):

> > One concern I have is that the current config also sets the maximum number of threads used when reading Parquet files.
>
> Is this still true, though? I know we were creating threads at one point in time, but we are using Tokio/async now, so we are not creating threads ourselves. Increasing the partition count will increase the number of async tasks that we run in the thread pool, but it won't increase the number of threads.

We now run the tasks with spawn_blocking, which still creates a number of extra threads to execute them on; Tokio is configured to create a maximum of 512(!) blocking threads by default. Based on max_concurrency we still split the files into multiple parallel readers, so increasing this value will considerably increase the number of extra threads (and allocated data) we use, as far as I can see.
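For illustration, a minimal sketch of the coupling being described, with a hypothetical synchronous read_partition standing in for the real Parquet reader:

```rust
use tokio::task;

/// Hypothetical stand-in for a synchronous (blocking) Parquet read.
fn read_partition(path: &str) {
    println!("reading {path}");
}

/// One spawn_blocking task per partition: raising the partition count
/// draws more threads from Tokio's blocking pool (up to 512 by default).
async fn read_all(paths: Vec<String>) {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|path| task::spawn_blocking(move || read_partition(&path)))
        .collect();
    for handle in handles {
        handle.await.expect("reader task panicked");
    }
}
```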

@houqp (Member) commented Jul 11, 2021

What's the ideal design going forward? If the end goal is to create one async task for each partition and let the Tokio thread pool manage the parallelism (threads), then I think the proper change to introduce is getting rid of spawn_blocking and making read_files fully async.
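A rough sketch of that end state, assuming a hypothetical async-capable reader (read_partition_async), which did not exist at the time:

```rust
use futures::stream::{self, StreamExt};

/// Hypothetical fully async Parquet read for one partition.
async fn read_partition_async(path: &str) {
    println!("reading {path}");
}

/// One async task per partition; concurrency is bounded by the limit
/// without consuming extra blocking-pool threads.
async fn read_files(paths: Vec<String>, target_partitions: usize) {
    stream::iter(paths)
        .for_each_concurrent(target_partitions, |path| async move {
            read_partition_async(&path).await;
        })
        .await;
}
```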

@andygrove (Member, Author):

@Dandandan Ah, ok. I had not understood fully what was happening there. I think the goal is to have one async task per partition and to remove the spawn_blocking, as @houqp suggested.

@jorgecarleitao (Member):

> What's the ideal design going forward? If the end goal is to create one async task for each partition and let the Tokio thread pool manage the parallelism (threads), then I think the proper change to introduce is getting rid of spawn_blocking and making read_files fully async.

All our readers are blocking; if we do not run them on spawn_blocking, won't they block Tokio's runtime?
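For illustration, the hazard being raised here, with a hypothetical blocking reader (read_parquet_sync):

```rust
/// Hypothetical blocking reader.
fn read_parquet_sync(path: &str) -> Vec<u8> {
    std::fs::read(path).expect("read failed")
}

async fn scan_partition(path: String) {
    // Without spawn_blocking, this synchronous read occupies one of
    // Tokio's worker threads for its entire duration, so other tasks
    // scheduled on that thread cannot make progress until it finishes.
    let _bytes = read_parquet_sync(&path);
}
```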

@andygrove (Member, Author):

Maybe we do need two separate configs after all, as per @Dandandan's PR. Maybe with concurrency renamed to something more specific to readers.

@Dandandan (Contributor):

> Maybe we do need two separate configs after all, as per @Dandandan's PR. Maybe with concurrency renamed to something more specific to readers.

I believe that's correct. It might not be a problem when reading from one source, but if multiple sources come in, one task that waits on results from one source prevents other tasks from being started, which might limit both CPU and IO utilization.

@Dandandan (Contributor):

> Maybe we do need two separate configs after all, as per @Dandandan's PR. Maybe with concurrency renamed to something more specific to readers.

For now, something specific to readers seems fine to me.

For the future, I think we also need something for setting the maximum number of threads when running multiple executors / DataFusion processes on one node (using the builder). By default this uses the number of CPU cores, but if you run multiple Ballista workers on one node it may be better to lower it.

https://docs.rs/tokio/1.8.1/tokio/runtime/struct.Builder.html
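A minimal sketch of capping both thread pools via the linked Builder API; the numbers are illustrative:

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    let runtime = Builder::new_multi_thread()
        .worker_threads(4)        // defaults to the number of CPU cores
        .max_blocking_threads(32) // defaults to 512
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // run the executor / DataFusion work here
    });
    Ok(())
}
```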

@andygrove andygrove marked this pull request as draft July 12, 2021 13:49
@alamb (Contributor) commented Jul 12, 2021

> What's the ideal design going forward? If the end goal is to create one async task for each partition and let the Tokio thread pool manage the parallelism (threads), then I think the proper change to introduce is getting rid of spawn_blocking and making read_files fully async.

I agree with @houqp that the ideal design is to make the parquet reader fully async (which is blocked on getting an actually async parquet reader)

@alamb (Contributor) left a comment:

What about naming the config option target_partitions instead of default_partitions, which I think better describes what it is used for?

I think default_partitions is already a more specific name than concurrency, so I would be happy with that as well.

@alamb (Contributor) commented Aug 20, 2021

Shall we try and push this one along?

@andygrove andygrove marked this pull request as ready for review August 21, 2021 18:40
@alamb added and removed the api change (Changes the API exposed to users of the crate) label Aug 22, 2021
@alamb changed the title from "Rename concurrency to default_partitions" to "Rename concurrency to target_partitions" Aug 22, 2021
@alamb added the api change (Changes the API exposed to users of the crate) label Aug 22, 2021
@alamb (Contributor) left a comment:

I think this is a good change. @houqp what do you think?

```diff
@@ -665,13 +665,13 @@ pub struct ExecutionConfig {
      /// virtual tables for displaying schema information
      information_schema: bool,
      /// Should DataFusion repartition data using the join keys to execute joins in parallel
-     /// using the provided `concurrency` level
+     /// using the provided `default_partitions` level
```
Contributor left a comment:

I think some of these references to default_partitions should probably be updated to target_partitions to match the new field name.

@andygrove (Member, Author) replied:

Done


@Dandandan (Contributor):

👍

@houqp (Member) left a comment:

LGTM!

@alamb (Contributor) commented Aug 25, 2021

I rebased this PR against master, and if CI is all clean I plan to merge it in 🎉

@alamb alamb merged commit 405171c into apache:master Aug 25, 2021
@andygrove andygrove deleted the rename-concurrency branch February 6, 2022 17:41
Labels: api change (Changes the API exposed to users of the crate) · datafusion (Changes in the datafusion crate)

5 participants