
Enables spillable/unspillable state for RapidsBuffer and allows buffer sharing #7572

Merged

Conversation

@abellina (Collaborator) commented Jan 24, 2023

Signed-off-by: Alessandro Bellina [email protected]

Closes #6864
Closes #6561

This PR allows a RapidsBuffer to become "non-spillable" for the period during which a ColumnarBatch is obtained from it (i.e., while the ref count of the device memory buffer is > 1). It also enables sharing a RapidsBuffer between different RapidsBufferHandle instances: when a DeviceMemoryBuffer is added to the spill framework, we detect whether we are already tracking that buffer (defined as the window between the creation of the RapidsDeviceMemoryBuffer and the first time free is called on it).
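A minimal sketch of the spillable/unspillable state machine described above (the class and method names here are illustrative, not the actual plugin API):

```scala
// Hypothetical sketch of the spillable/unspillable state described above.
// Names (TrackedBuffer, acquireBatch, etc.) are illustrative only.
class TrackedBuffer(val id: Long) {
  private var refCount: Int = 1          // device buffer's ref count
  private var spillable: Boolean = true  // eligible for spilling?

  // Handing out a ColumnarBatch bumps the ref count; while anyone else
  // holds a reference (refCount > 1) the buffer must not be spilled.
  def acquireBatch(): Unit = synchronized {
    refCount += 1
    spillable = false
  }

  // When the batch is closed, the buffer becomes spillable again once
  // the spill framework holds the only remaining reference.
  def releaseBatch(): Unit = synchronized {
    refCount -= 1
    if (refCount == 1) spillable = true
  }

  def isSpillable: Boolean = synchronized { spillable }
}
```

The key design point is that spillability is derived from reference counting rather than tracked independently, so a batch held by an operator can never be spilled out from under it.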

This is a draft PR since I am still gathering performance numbers and need to add more unit tests. This change introduces various race conditions; because of them, we may put this into 23.04 instead.

These are tests I need to run and document (some of them I have done already):

  • Performance testing with NDS @ 3TB
  • Stress testing with shared RapidsBuffers at 3TB
  • Stress testing with query95 3TB, 10TB, 30TB
  • Add more unit tests to exercise potential race conditions
  • Verify with nsys that extra calls to Table.columnViewsFromPacked are as cheap as we think

@revans2 (Collaborator) left a comment:

I did a first pass through this and it looks good, but the code is complicated enough I need to dig in much deeper.

@abellina (Collaborator, Author) commented:

build

@abellina abellina changed the title Enables RapidsBuffer sharing in the spill framework Enables spillable/unspillable state for RapidsBuffer and allow buffer sharing Jan 24, 2023
@abellina commented:

build

@revans2 (Collaborator) left a comment:

Mostly just some nits. Feeling better about this change.

@abellina commented:

I am currently stress testing NDS q72 at 3TB by setting the GPU allocation to 30% of the original and watching it spill, especially broadcast spill (shared RapidsBuffers). I am debugging an inc-after-close seen here.

@abellina commented:

> I am debugging an inc after close seen here

The INC AFTER CLOSE issue is not related to my change. Some of the gather code has race conditions when an exception is thrown (OOM in this case) and can cause this. It is more of a nuisance than anything, as the executor is going down anyway. Filed this to look at separately: #7581

@abellina commented:

In my stress testing at 3TB I found a deadlock caused by lock inversion, introduced by a new commit I added in response to the race-condition feedback (8737302). I am debugging it.
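For context, lock-inversion deadlocks of this kind arise when two threads take the same pair of locks in opposite orders; the standard fix is a single global acquisition order. A generic sketch, unrelated to the specific commit above:

```scala
// Generic illustration of lock inversion: thread A locks `store` then
// `buffer`, thread B locks `buffer` then `store`, and each waits on the
// other forever. The usual fix is one global acquisition order, applied
// on every code path:
def withBothLocks[T](store: AnyRef, buffer: AnyRef)(body: => T): T = {
  // Always lock the store before any buffer, never the reverse.
  store.synchronized {
    buffer.synchronized {
      body
    }
  }
}
```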

@abellina abellina changed the base branch from branch-23.02 to branch-23.04 January 30, 2023 19:38

abellina commented Feb 2, 2023

Performance-wise, overall NDS results at 3TB with 5 repetitions, baseline (a 23.02 snapshot from last week) vs. this change:

Name = benchmark
Means = 533737.6, 533441.6
Time diff = 296.0
Speedup = 1.0005548873578665
T-Test (test statistic, p value, df) = 0.09851820421116382, 0.9239445975461231, 8.0
T-Test Confidence Interval = -6632.437535329434, 7224.437535329434

q9 was faster (6%) and q72 was slower (2.2%). I am going to focus more testing on query72 to see whether this is caused by calling unpack or by something else; I will also test against a 23.04 baseline jar.

Name = query9
Means = 9498.6, 8937.0
Time diff = 561.6000000000004
Speedup = 1.0628398791540785
T-Test (test statistic, p value, df) = 2.5673438616859325, 0.03326472469818815, 8.0
T-Test Confidence Interval = 57.16740662534335, 1066.0325933746574
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query72
Means = 118548.6, 121185.4
Time diff = -2636.7999999999884
Speedup = 0.9782416033614612
T-Test (test statistic, p value, df) = -2.7090393198155214, 0.02669776978463036, 8.0
T-Test Confidence Interval = -4881.312162957456, -392.2878370425201
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
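For reference, the speedup and t statistic in the report above come from a standard two-sample comparison. A sketch of the arithmetic, assuming a pooled-variance t-test (which matches df = 8 for 5 runs per side); this is an assumed reconstruction, not the actual benchmark tool:

```scala
// Sketch of how the report's numbers could be derived (assumed, not the
// actual benchmark tool): speedup and a pooled two-sample t statistic.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

def tStatistic(a: Seq[Double], b: Seq[Double]): Double = {
  val (ma, mb) = (mean(a), mean(b))
  // sum of squared deviations from the mean
  def ss(xs: Seq[Double], m: Double) = xs.map(x => math.pow(x - m, 2)).sum
  val df = a.size + b.size - 2                 // 5 + 5 - 2 = 8
  val pooledVar = (ss(a, ma) + ss(b, mb)) / df
  (ma - mb) / math.sqrt(pooledVar * (1.0 / a.size + 1.0 / b.size))
}

// Speedup is simply baselineMean / newMean, e.g. 9498.6 / 8937.0 ≈ 1.063
```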


abellina commented Feb 6, 2023

For stress testing I used query95 at 3/10/30 TB, and query72 at 3 TB with a reduced footprint. My main goal was to find deadlocks, if any, and to cause spill. I am not seeing such issues.


abellina commented Feb 6, 2023

build


abellina commented Feb 7, 2023

build

revans2 previously approved these changes Feb 7, 2023

abellina commented Feb 7, 2023

build

// total amount spilled in this invocation
var totalSpilled: Long = 0

if (store.currentSpillableSize > targetTotalSize) {
A reviewer (Member) commented:

Not necessarily to fix for this PR, but one issue with this old logic is that it will not free anything if targetTotalSize is larger than the current spillable size. That seems pretty bad. For example, freeing just 256MB of memory might allow a 1GB allocation to succeed. Honestly not sure why we would care what the current amount of spillable memory is, we should just blindly ask the store to spill until we can allocate, and the store either will spill or it won't. We shouldn't give up until the spill store is exhausted of spillable memory or we succeed in the allocation, whichever comes first.
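The reviewer's suggested policy could look roughly like this (a hypothetical sketch with made-up trait and method names, not the plugin's actual API):

```scala
// Hypothetical sketch of the suggested policy: instead of checking
// currentSpillableSize against a target up front, keep spilling until
// the allocation succeeds or the store has nothing left to spill.
trait DeviceBuffer
trait SpillStore {
  def tryAllocate(bytes: Long): Option[DeviceBuffer]
  def spillNext(): Long // bytes spilled; 0 when nothing spillable remains
}

def allocateWithSpill(store: SpillStore, bytes: Long): Option[DeviceBuffer] = {
  while (true) {
    store.tryAllocate(bytes) match {
      case Some(buf) => return Some(buf) // allocation succeeded
      case None =>
        // Give up only once the store is exhausted of spillable memory.
        if (store.spillNext() == 0) return None
    }
  }
  None // unreachable
}
```

This way, spilling even a small amount (say 256MB) can still unblock a larger allocation, which the size-comparison check above would have skipped.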

@abellina (Collaborator, Author) replied:
Filed this to track #7706

revans2 previously approved these changes Feb 8, 2023
jlowe previously approved these changes Feb 8, 2023
@abellina abellina dismissed stale reviews from jlowe and revans2 via 6cca74d February 8, 2023 16:55

abellina commented Feb 8, 2023

@revans2 @jlowe I had introduced a bug in RapidsGdsStoreSuite. Fixed it with 6cca74d.


abellina commented Feb 8, 2023

build

Labels: performance (a performance-related task/issue), reliability (features to improve reliability, or bugs that severely impact the reliability of the plugin)