importccl: add oversample option to configure oversampling #27341
Conversation
On clusters with many nodes and smallish disks, doing an IMPORT that is within an order of magnitude of the free space of the disks can lead to disk fullness. This happens because the sampling algorithm (by design) has a relatively high standard deviation in its error rate. We currently target split points at a few hundred megs, but the standard deviation on sampling means that a single node could easily be a few of those away from the target mean, resulting in overscheduling data to a node during shuffle.

Introduce an oversample option that can be set to some higher number. This reduces the standard deviation of the error, resulting in each node receiving a more similar portion of the data, but does not have a major impact on the rest of the performance.

Release note (sql change): Add an `oversample` WITH option to IMPORT to decrease variance in data distribution during processing.
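As a usage sketch (the `oversample` option name comes from the release note above; the table definition, file URL, and value syntax are illustrative assumptions, not taken from this PR):

```sql
-- Hypothetical IMPORT with a higher oversampling factor to reduce
-- variance in how rows are distributed across nodes during processing.
-- Schema, file location, and the option's value syntax are assumptions.
IMPORT TABLE users (id INT PRIMARY KEY, name STRING)
    CSV DATA ('nodelocal:///users.csv')
    WITH oversample = '10';
```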
Reviewed 3 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
What is the downside to oversampling? Should we increase the default? In general I'd say a reduction in variance here is worth a small performance hit.
Increasing oversampling means more rows are returned to the gateway node during the sampling phase. This means the gateway node has slightly higher disk usage, and since more samples are being transmitted, each of the workers also takes a bit more time to complete this phase. The risk is that we oversample so much that the temp space on the gateway can't fit all the samples on disk and runs out of space. The current oversampling default is 3. Would be fine to increase to...5? 10? 100? We haven't measured the disk-space increase that comes from raising oversampling, so it's hard to come up with specifics, but I think increasing to something in the 5-10 range would be fine.
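For rough intuition (a back-of-envelope sketch, assuming split points behave like empirical quantiles of the sampled keys, which this PR doesn't itself claim): the standard error of a sample quantile scales as 1/√n, so collecting k× more samples shrinks the expected split-point error by roughly 1/√k.

$$
\mathrm{SE}(\hat{q}_p) \;\approx\; \frac{\sqrt{p(1-p)}}{f(q_p)\,\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\mathrm{SE}(kn)}{\mathrm{SE}(n)} \;=\; \frac{1}{\sqrt{k}}
$$

where $f$ is the density of the key distribution at the $p$-th quantile $q_p$. By that estimate, raising the default from 3 to 10 would cut the expected error to about √(3/10) ≈ 55% of its current size, in exchange for proportionally more sample traffic to the gateway.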
It also bloats the job record.
How does it bloat the job record? It doesn't change the number of split points.
bors r+
27341: importccl: add oversample option to configure oversampling r=mjibson a=mjibson

On clusters with many nodes and smallish disks, doing an IMPORT that is within an order of magnitude of the free space of the disks can lead to disk fullness. This happens because the sampling algorithm (by design) has a relatively high standard deviation in its error rate. We currently target split points at a few hundred megs, but the standard deviation on sampling means that a single node could easily be a few of those away from the target mean, resulting in overscheduling data to a node during shuffle. Introduce an oversample option that can be set to some higher number. This reduces the standard deviation of the error, resulting in each node receiving a more similar portion of the data, but does not have a major impact on the rest of the performance.

Release note (sql change): Add an `oversample` WITH option to IMPORT to decrease variance in data distribution during processing.

27345: importccl: verify number of columns during IMPORT PGDUMP r=mjibson a=mjibson

Also make error messages more consistent between PGDUMP and PGCOPY.

Release note (bug fix): Correctly verify number of COPY columns during IMPORT PGDUMP.

27438: mkrelease: statically link windows release binaries r=mberhault a=benesch

This got lost in 38899a8. Static linking is necessary to bundle MinGW-only libraries into the Windows binary. The binary is otherwise only executable from within a MinGW environment.

Fix #27435.

Release note: None

---

At this point I deserve to win an award for most broken refactor.

Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
Build succeeded