Problem: The default value of `task_writer_count` is 1, which makes inserts very slow and doesn't use the cluster resources effectively.

Possible solutions:

1. Increase the default value to 4, 8, 16, or 32. The problem with this solution is that it can produce many small files for a small amount of data, which hurts subsequent read performance.
2. Adaptively scale the number of local writers using `physicalWrittenBytes` and the current buffer size, similar to how we already do global scaling with `scale_writers`. Here we could enforce a minimum file size constraint via `minWriterSize`, so a user could avoid small files and keep subsequent reads fast (see the sketch below).
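A minimal sketch of how that local scaling decision might work, assuming hypothetical names (`LocalWriterScaler`, `updateWriterCount`) and a made-up "buffer at least half full" trigger; it is not the actual Trino implementation. The rule mirrors global `scale_writers`: only add another local writer once the buffer feeding the writers is backed up and the current writers have, on average, each written at least `writerMinSize` bytes:

```java
// Hypothetical sketch only: class, method, and threshold are illustrative, not Trino code.
class LocalWriterScaler
{
    private final long writerMinSizeBytes;  // e.g. 16MB, analogous to writerMinSize
    private final int maxWriterCount;       // upper bound, e.g. max_task_writer_count
    private int currentWriterCount = 1;     // start with a single local writer

    LocalWriterScaler(long writerMinSizeBytes, int maxWriterCount)
    {
        this.writerMinSizeBytes = writerMinSizeBytes;
        this.maxWriterCount = maxWriterCount;
    }

    // Called periodically with the bytes physically written so far and the fill
    // level of the local buffer feeding the writers.
    int updateWriterCount(long physicalWrittenBytes, long bufferedBytes, long bufferCapacityBytes)
    {
        // Writers are saturated: data is arriving faster than they can drain it.
        boolean bufferBackedUp = bufferedBytes >= bufferCapacityBytes / 2;
        // On average each current writer has already written at least writerMinSize,
        // so adding one more writer keeps average file size at or above the minimum.
        boolean writersLargeEnough = physicalWrittenBytes >= writerMinSizeBytes * currentWriterCount;

        if (bufferBackedUp && writersLargeEnough && currentWriterCount < maxWriterCount) {
            currentWriterCount++;
        }
        return currentWriterCount;
    }
}
```

With a rule like this, a small insert never scales past one writer and still produces a single reasonably sized file, while a large insert can ramp up toward `max_task_writer_count`.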
Insert performance with different values of task writers:
1. Single node with no local scaling. Table inserted: tpcds sf300 lineitem
   - `task_writer_count = 4` => 36:15 mins
   - `task_writer_count = 8` => 18:27 mins
   - `task_writer_count = 16` => 9:22 mins
   - `task_writer_count = 32` => 5:21 mins
2. Single node with local writer scaling. Table inserted: tpcds sf300 lineitem, `writerMinSize`: 16MB
   - `max_task_writer_count = 8` => 23:06 mins
   - `max_task_writer_count = 16` => 13:14 mins
   - `max_task_writer_count = 32` => 8:48 mins

Read performance (with small files):
1. Small amount of data inserted over a long time with `task_writer_count = 2` config:
   - Query:
   - Result:
2. With `task_writer_count = 32` config (same as above except):
   - Result:
So, increasing the number of small files can have a huge impact on read performance: almost 2x slower based on the above experiment.
Summary
If we increase the default value of `task_writer_count`, read performance will suffer when a user inserts a small amount of data over a long period of time, because this produces a huge number of small files. For instance, someone might insert 100MB of data every 15 mins with a default of 32 task writers. To solve this, one could run the optimize command (which is expensive) at some frequency, or we could go with the local scaling approach, which is a bit more complex but maintains a minimum file size.

cc @dain @sopel39 @raunaqmorarka @electrum
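As a rough back-of-the-envelope illustration of that scenario (assuming every insert fans out to all 32 writers): a 100MB insert split across 32 writers produces files of roughly 3MB each, and at one insert every 15 minutes that is 96 × 32 ≈ 3,000 small files per table per day.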