importccl: add oversample option to configure oversampling #27341
Conversation
On clusters with many nodes and smallish disks, doing an IMPORT that is within an order of magnitude of the free space of the disks can lead to disk fullness. This happens because the sampling algorithm (by design) has a relatively high standard deviation in its error rate. We currently target split points at a few hundred megs, but the standard deviation on sampling means that a single node could easily be a few of those away from the target mean, resulting in overscheduling data to a node during shuffle.

Introduce an oversample option that can be set to some higher number. This reduces the standard deviation of the error, resulting in each node receiving a more similar portion of the data, but does not have a major impact on the rest of the performance.

Release note (sql change): Add an `oversample` WITH option to IMPORT to decrease variance in data distribution during processing.
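As a usage sketch (the `oversample` option name comes from the release note above; the table definition, file URL, and value syntax are illustrative assumptions, not taken from this PR):

```sql
-- Hypothetical IMPORT with a higher oversampling factor to reduce
-- variance in how rows are distributed across nodes during processing.
-- Schema, file location, and the option's value syntax are assumptions.
IMPORT TABLE users (id INT PRIMARY KEY, name STRING)
    CSV DATA ('nodelocal:///users.csv')
    WITH oversample = '10';
```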
Reviewed 3 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
What is the downside to oversampling? Should we increase the default? In general I'd say a reduction in variance here is worth a small performance hit.
Increasing oversampling means more rows are returned to the gateway node during the sampling phase. This means the gateway node has slightly higher disk usage, and since more samples are being transmitted, each of the workers also takes a bit more time to complete this phase. The risk is that we oversample so much that the temp space on the gateway can't fit all the samples on disk and runs out of space. The current oversampling default is 3. Would be fine to increase to...5? 10? 100? We haven't measured the disk-space increase that comes from raising oversampling, so it's hard to come up with specifics, but I think increasing to something in the 5-10 range would be fine.
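For rough intuition (a back-of-envelope sketch, assuming split points behave like empirical quantiles of the sampled keys, which this PR doesn't itself claim): the standard error of a sample quantile scales as 1/√n, so collecting k× more samples shrinks the expected split-point error by roughly 1/√k.

$$
\mathrm{SE}(\hat{q}_p) \;\approx\; \frac{\sqrt{p(1-p)}}{f(q_p)\,\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\mathrm{SE}(kn)}{\mathrm{SE}(n)} \;=\; \frac{1}{\sqrt{k}}
$$

where $f$ is the density of the key distribution at the $p$-th quantile $q_p$. By that estimate, raising the default from 3 to 10 would cut the expected error to about √(3/10) ≈ 55% of its current size, in exchange for proportionally more sample traffic to the gateway.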
It also bloats the job record.
How does it bloat the job record? It doesn't change the number of split points.
bors r+
27341: importccl: add oversample option to configure oversampling r=mjibson a=mjibson

On clusters with many nodes and smallish disks, doing an IMPORT that is within an order of magnitude of the free space of the disks can lead to disk fullness. This happens because the sampling algorithm (by design) has a relatively high standard deviation in its error rate. We currently target split points at a few hundred megs, but the standard deviation on sampling means that a single node could easily be a few of those away from the target mean, resulting in overscheduling data to a node during shuffle. Introduce an oversample option that can be set to some higher number. This reduces the standard deviation of the error, resulting in each node receiving a more similar portion of the data, but does not have a major impact on the rest of the performance.

Release note (sql change): Add an `oversample` WITH option to IMPORT to decrease variance in data distribution during processing.

27345: importccl: verify number of columns during IMPORT PGDUMP r=mjibson a=mjibson

Also make error messages more consistent between PGDUMP and PGCOPY.

Release note (bug fix): Correctly verify number of COPY columns during IMPORT PGDUMP.

27438: mkrelease: statically link windows release binaries r=mberhault a=benesch

This got lost in 38899a8. Static linking is necessary to bundle MinGW-only libraries into the Windows binary. The binary is otherwise only executable from within a MinGW environment.

Fix #27435.

Release note: None

---

At this point I deserve to win an award for most broken refactor.

Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
Build succeeded