[FEA] Should we synchronize and then spill with the ASYNC allocator #6769

abellina · 2022-10-12T21:24:01Z

As mentioned in #6768, I am noticing that synchronizing on OOM can help us handle allocation failures that would otherwise be fatal. Additionally, some quick local prototyping suggests there may be a performance gain here as well.

Specifically, if we first call Cuda.deviceSynchronize rather than spilling right away, and fall back to spilling only when we know we have already synchronized, we save time on a quick query I tried in our performance cluster. I ran a query that spills constantly: it took 265 seconds with this change versus 304 seconds without it.
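To make the proposed ordering concrete, here is a minimal simulation of "synchronize first, spill only if we have already synchronized". It is a sketch, not the plugin's code: `OomRetryHandler`, `FakePool`, and `allocate_with_retry` are hypothetical names, `synchronize` stands in for Cuda.deviceSynchronize, and the pool model (frees that only become reusable after a synchronize) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OomRetryHandler:
    """Chooses between synchronizing and spilling on each allocation failure."""
    synchronize: Callable[[], None]   # stand-in for Cuda.deviceSynchronize
    spill: Callable[[int], int]       # hypothetical spill hook; returns bytes freed
    synced: bool = False              # True once we have synchronized for this request
    actions: List[str] = field(default_factory=list)

    def on_alloc_failure(self, nbytes: int) -> bool:
        """Return True if the caller should retry the allocation."""
        if not self.synced:
            # First failure: synchronize so pending frees can be reclaimed.
            self.synchronize()
            self.synced = True
            self.actions.append("sync")
            return True
        # Already synchronized and still failing: fall back to spilling.
        self.actions.append("spill")
        return self.spill(nbytes) > 0

    def on_alloc_success(self) -> None:
        # Assumption: a successful allocation resets the state.
        self.synced = False


class FakePool:
    """Toy pool: `pending_free` bytes count as used until a synchronize."""

    def __init__(self, capacity: int, used: int, pending_free: int, spillable: int):
        self.capacity = capacity
        self.used = used
        self.pending_free = pending_free
        self.spillable = spillable

    def try_alloc(self, nbytes: int) -> bool:
        if self.used + nbytes <= self.capacity:
            self.used += nbytes
            return True
        return False

    def synchronize(self) -> None:
        self.used -= self.pending_free
        self.pending_free = 0

    def spill(self, nbytes: int) -> int:
        freed = min(nbytes, self.spillable)
        self.spillable -= freed
        self.used -= freed
        return freed


def allocate_with_retry(pool: FakePool, handler: OomRetryHandler, nbytes: int) -> bool:
    """Retry the allocation until it succeeds or the handler gives up."""
    while True:
        if pool.try_alloc(nbytes):
            handler.on_alloc_success()
            return True
        if not handler.on_alloc_failure(nbytes):
            return False
```

In this model, a request that fails only because frees are still pending succeeds after a single synchronize and never spills, which is the case where the time saving would come from. A request that still fails after the synchronize goes to the spill path as before.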

That said, the query also ran out of memory on a second trial. I think the reason is that we are now really able to pack the GPU: I see the async pool getting closer to its maximum size (40GB in this case). So we have less spare ("fudge") memory for tasks that run above their ~1/concurrentGpuTasks chunk of memory.
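The headroom argument can be sketched with some simple arithmetic. The 40GB pool size is from the run above; the concurrentGpuTasks value and the in-use figures below are made-up illustration values, not measurements, and the function names are hypothetical.

```python
def per_task_share_gb(pool_gb: float, concurrent_gpu_tasks: int) -> float:
    """Nominal ~1/concurrentGpuTasks share of the pool for each task."""
    return pool_gb / concurrent_gpu_tasks


def can_absorb_overshoot(pool_gb: float, in_use_gb: float, overshoot_gb: float) -> bool:
    """True if the unused part of the pool covers a task exceeding its share."""
    return pool_gb - in_use_gb >= overshoot_gb


POOL_GB = 40.0               # async pool maximum size from the run above
CONCURRENT_GPU_TASKS = 4     # assumed value, for illustration only

share = per_task_share_gb(POOL_GB, CONCURRENT_GPU_TASKS)

# Hypothetical: a loosely packed pool (30GB in use) can absorb a 3GB overshoot,
# while a tightly packed pool (38GB in use) cannot.
loosely_packed_ok = can_absorb_overshoot(POOL_GB, 30.0, 3.0)
tightly_packed_ok = can_absorb_overshoot(POOL_GB, 38.0, 3.0)
```

The closer the pool gets to its maximum size, the smaller the overshoot any single task can make before an allocation fails.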


abellina added the "feature request", "? - Needs Triage", "performance", and "reliability" labels Oct 12, 2022

abellina mentioned this issue Oct 12, 2022: [TASK] Run without fatal OOMs #6746 (Closed, 10 tasks)

mattahrens removed the "? - Needs Triage" label Oct 18, 2022

mattahrens assigned abellina Oct 19, 2022
