importccl: add oversample option to configure oversampling #27341

Merged
merged 1 commit into from
Jul 12, 2018

maddyblue
Contributor

On clusters with many nodes and smallish disks, doing an IMPORT that is
within an order of magnitude of the free space of the disks can lead to
disk fullness. This happens because the sampling algorithm (by design)
has a relatively high standard deviation in its error rate. We currently
target split points at a few hundred megs, but the standard deviation
on sampling means that a single node could easily be a few of those
away from the target mean, resulting in overscheduling data to a node
during shuffle.

Introduce an oversample option that can be set to a higher number. This
reduces the standard deviation of the error, so each node receives a more
similar portion of the data, without a major impact on the rest of the
import's performance.

Release note (sql change): Add an `oversample` WITH option to IMPORT to
decrease variance in data distribution during processing.
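
To illustrate why a higher oversample value evens out the data each node receives, here is a minimal Python simulation. It is a sketch under an assumed model, not the importccl implementation: rows in key order are sampled independently, with about `oversample` samples expected per target-sized span, and every `oversample`-th sample becomes a split point. The names `partition_sizes` and `relative_stddev`, and all parameter values, are hypothetical.

```python
import random
import statistics


def partition_sizes(num_rows, target_rows_per_split, oversample, seed=0):
    """Return partition sizes (in rows) from a simplified sampling model.

    Assumed model, not the real importccl code: each row is sampled
    independently so that about `oversample` rows are sampled per
    target-sized span, and every `oversample`-th sample is a split point.
    """
    rng = random.Random(seed)
    p = oversample / target_rows_per_split
    # Row indexes stand in for sorted keys.
    sampled = [i for i in range(num_rows) if rng.random() < p]
    # Keep every `oversample`-th sampled row as a split point.
    splits = sampled[oversample - 1::oversample]
    sizes = []
    prev = 0
    for s in splits:
        sizes.append(s - prev)
        prev = s
    return sizes


def relative_stddev(sizes, target):
    """Standard deviation of partition size as a fraction of the target."""
    return statistics.pstdev(sizes) / target


if __name__ == "__main__":
    target = 1_000
    for oversample in (3, 10, 30, 100):
        sizes = partition_sizes(200_000, target, oversample)
        print(oversample, round(relative_stddev(sizes, target), 3))
```

In this model the spread of partition sizes around the target shrinks roughly as 1/sqrt(oversample), while the number of sampled rows grows linearly with it.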

@maddyblue maddyblue requested review from dt and a team July 10, 2018 17:45
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@rjnn rjnn left a comment

Reviewed 3 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@bdarnell
Contributor

What is the downside to oversampling? Should we increase the default? In general I'd say a reduction in variance here is worth a small performance hit.

@maddyblue
Contributor Author

Increasing oversampling means more rows are returned to the gateway node during the sampling phase. The gateway node therefore has slightly higher disk usage, and since more samples are being transmitted, each worker also takes a bit longer to complete this phase. The risk is that we oversample so much that the temp space on the gateway can't fit all the samples on disk and it runs out of space. The current oversampling default is 3. Would it be fine to increase to...5? 10? 100? I think something in the 5-10 range would be fine, but we haven't measured the disk space increase that results from more oversampling, so it's hard to come up with specifics.

@dt
Member

dt commented Jul 10, 2018

it also bloats the job record.

@maddyblue
Contributor Author

How does it bloat the job record? It doesn't change the number of split points.

@maddyblue
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Jul 12, 2018
27341: importccl: add oversample option to configure oversampling r=mjibson a=mjibson

On clusters with many nodes and smallish disks, doing an IMPORT that is
within an order of magnitude of the free space of the disks can lead to
disk fullness. This happens because the sampling algorithm (by design)
has a relatively high standard deviation in its error rate. We currently
target split points at a few hundred megs, but the standard deviation
on sampling means that a single node could easily be a few of those
away from the target mean, resulting in overscheduling data to a node
during shuffle.

Introduce an oversample option that can be set to a higher number. This
reduces the standard deviation of the error, so each node receives a more
similar portion of the data, without a major impact on the rest of the
import's performance.

Release note (sql change): Add an `oversample` WITH option to IMPORT to
decrease variance in data distribution during processing.

27345: importccl: verify number of columns during IMPORT PGDUMP r=mjibson a=mjibson

Also make error messages more consistent between PGDUMP and PGCOPY.

Release note (bug fix): Correctly verify number of COPY columns during
IMPORT PGDUMP.

27438: mkrelease: statically link windows release binaries r=mberhault a=benesch

This got lost in 38899a8. Static linking is necessary to bundle MinGW-only
libraries into the Windows binary. The binary is otherwise only
executable from within a MinGW environment.

Fix #27435.

Release note: None

---

At this point I deserve to win an award for most broken refactor.

Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
@craig
Contributor

craig bot commented Jul 12, 2018

Build succeeded

@craig craig bot merged commit 6bf632d into cockroachdb:master Jul 12, 2018
@maddyblue maddyblue deleted the import-oversample branch July 12, 2018 18:21