Spill framework refactor for better performance and extensibility #11747
base: branch-25.02
Conversation
Signed-off-by: Alessandro Bellina <[email protected]>
@@ -44,22 +44,20 @@ class RapidsSerializerManager (conf: SparkConf) {

private lazy val compressionCodec: CompressionCodec = TrampolineUtil.createCodec(conf)
I'd like to remove this class, or make it much simpler.
val amountSpilled = store.spill(allocSize)
retryState.resetIfNeeded(retryCount, amountSpilled > 0)
logInfo(s"Device allocation of $allocSize bytes failed. " +
  s"Device store spilled $amountSpilled bytes. $attemptMsg " +
attemptMsg already has the trailing space.
s"Device store spilled $amountSpilled bytes. $attemptMsg " + | |
s"Device store spilled $amountSpilled bytes. $attemptMsg" + |
*
* Every store handle supports a `materialize` method that isn't part of the interface.
* The reason is that to materialize certain objects, you may need some data (for example,
* spark schema descriptors). `materialize` incRefCounts the object if it's resident in the
* spark schema descriptors). `materialize` incRefCounts the object if it's resident in the
* Spark schema descriptors). `materialize` incRefCounts the object if it's resident in the
*
* We hold the handle lock when we are spilling (performing IO). That means that no other consumer
* can access this spillable device handle while it is being spilled, including a second thread
* trying to spill and asking each of the handles wether they are spillable or not, as that requires
* trying to spill and asking each of the handles wether they are spillable or not, as that requires
* trying to spill and asking each of the handles whether they are spillable or not, as that requires
* on the device.
*/
trait DeviceSpillableHandle[T <: AutoCloseable] extends SpillableHandle {
  var dev: Option[T]
We don't want to invite handle clients to access this directly.
var dev: Option[T]
protected var dev: Option[T]
For access to these variables: they come in really handy in the spill framework suite, without having to mock things or make those tests too complicated. I'd like to try using private[spill] and move this and the suite to a spill package. Thoughts?
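Roughly what I have in mind (package and suite names are placeholders, just a sketch):

// src/main: the handle lives in a dedicated spill package
package com.nvidia.spark.rapids.spill

trait DeviceSpillableHandle[T <: AutoCloseable] {
  // visible within the spill package (so the suite can assert on it),
  // but hidden from general handle clients
  private[spill] var dev: Option[T]
}

// src/test: SpillFrameworkSuite would sit in the same package, so a test can
// assert on handle.dev directly without mocks or extra accessors.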
private[spill] works for me.
* on the host.
*/
trait HostSpillableHandle[T <: AutoCloseable] extends SpillableHandle {
  var host: Option[T]
var host: Option[T]
protected var host: Option[T]
builder.copyNext(n, Cuda.DEFAULT_STREAM)
// we are calling chunked packer on `bb` again each time, we need
// to synchronize before we ask for the next chunk
We may want to change the method name to something like copyNextAsync so it's clear this is an asynchronous operation. Otherwise it wasn't immediately clear to me why we need a sync after this.
that's a nice catch, I need to make sure this is not async, because it currently asks for a bounce buffer in the disk case. So let me fix this.
* obtained during the iterator-like 'next(DeviceMemoryBuffer)'
*/
class ChunkedPacker(table: Table,
    bounceBufferSize: Long)
It might be cleaner if we can make ChunkedPacker a true iterator. It feels a little odd that it needs to be told about the size and then it's the caller's responsibility to pass buffers of exactly that size to the next method later, especially when we have methods that are handed a ChunkedPacker instance. How are they supposed to know what the correct size is for that instance?
Instead we could create a bounce buffer pool interface that looks something like this:
trait DeviceBounceBufferPool {
// size of every buffer in the pool
val bufferSize: Long
// Returns the next available buffer in the pool,
// blocking until one is available if necessary.
// Closing the resulting buffer will release it back to the pool.
def nextBuffer(): DeviceMemoryBuffer
}
Then ChunkedPacker could be constructed with a device buffer pool (potentially a pool with just a single buffer in the degenerate case). This now allows it to be a true iterator: next() will return a device buffer from the pool containing the next chunk of contents, and when the caller closes the buffer, it will be released back to the pool for reuse.
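Roughly something like this (a sketch only; the makeChunkedPack / next calls are assumptions about the cudf chunked-pack API, not a drop-in implementation):

import ai.rapids.cudf.{DeviceMemoryBuffer, Table}

// Assumes the DeviceBounceBufferPool trait proposed above.
class ChunkedPacker(table: Table, pool: DeviceBounceBufferPool)
    extends Iterator[DeviceMemoryBuffer] with AutoCloseable {

  private val chunkedPack = table.makeChunkedPack(pool.bufferSize)

  override def hasNext: Boolean = chunkedPack.hasNext

  override def next(): DeviceMemoryBuffer = {
    // blocks until a pool buffer is free; closing it hands it back to the pool
    val buf = pool.nextBuffer()
    chunkedPack.next(buf) // fill `buf` with the next packed chunk
    // note: the last chunk may not fill the buffer, so a real version would
    // also need to surface the number of valid bytes (e.g. via a sliced view)
    buf
  }

  override def close(): Unit = chunkedPack.close()
}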
}
}

def getHostBuffer: Option[HostMemoryBuffer] = synchronized {
def getHostBuffer: Option[HostMemoryBuffer] = synchronized {
private def getHostBuffer: Option[HostMemoryBuffer] = synchronized {
Although I'm not sure this method is worth being separate given it's only called in one place and that place blindly assumes the result is not None.
I made this private[spill] by mistake before; making it just private.
class SpillableColumnarBatchHandle private (
    override val sizeInBytes: Long,
    override var dev: Option[ColumnarBatch],
    var host: Option[SpillableHostBufferHandle] = None)
var host: Option[SpillableHostBufferHandle] = None)
private var host: Option[SpillableHostBufferHandle] = None)
didn't finish, but things keep changing under me so I thought I would publish what I have so far
true
}

shouldRetry
nit: why is this shouldRetry change in there? I assume it was for debugging at some point.
@@ -1110,7 +1110,7 @@ class CudfSpillableHostConcatResult(
val hmb: HostMemoryBuffer) extends SpillableHostConcatResult {

override def toBatch: ColumnarBatch = {
closeOnExcept(buffer.getHostBuffer()) { hostBuf =>
closeOnExcept(buffer.getHostBuffer) { hostBuf =>
nit: why change this?
@@ -307,7 +307,6 @@ class LazySpillableColumnarBatchImpl(
spill = Some(SpillableColumnarBatch(cached.get,
  SpillPriorities.ACTIVE_ON_DECK_PRIORITY))
} finally {
// Putting data in a SpillableColumnarBatch takes ownership of it.
Is this no longer true?
I was debugging something here, and I must have forgotten to undo this change. coming up
but to answer your question, the spill framework takes ownership, always
Sorry for the rapid movement @revans2. I'll pause for a bit, and come back to address the ChunkedPacker comments and other comments I get.
likewise leaving partial review
0L
} else {
synchronized {
if (disk.isEmpty) {
In the else case (disk not empty), it seems we just do nothing (assume it was already spilled) but still return sizeInBytes. Is it possible this leads to double accounting of how much we've spilled?
}
}
disk = Some(diskHandleBuilder.build)
disk
Is this return value being used anywhere? Can we just omit this line?
dev.getOrElse {
  throw new IllegalStateException(
    s"$this is spillable but it doesn't have a dev buffer!")
}.getRefCount == 1
This logic is basically identical between device and host. Should we consider making some shared templated implementation to dedup this?
* @tparam T an auto closeable subclass. `host` tracks an instance of this object,
* on the host.
*/
trait HostSpillableHandle[T <: AutoCloseable] extends SpillableHandle {
Should DeviceSpillableHandle and HostSpillableHandle be consolidated into something like MaterializeableSpillableHandle? They can be templated on the buffer type and otherwise look about identical. Made a similar comment below where maybe we can consolidate some repeated code.
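For example, something along these lines (placeholder names; assumes the PR's SpillableHandle trait), just to illustrate the shape:

// Placeholder sketch: one generic trait tracks the buffer regardless of tier,
// so the device and host variants only differ in the type they pick for T.
trait MaterializableSpillableHandle[T <: AutoCloseable] extends SpillableHandle {
  protected var buf: Option[T]

  // how to read the ref count for T; lets the shared check below replace the
  // nearly identical device/host versions
  protected def refCountOf(b: T): Int

  protected def onlyReferenceLeft: Boolean = synchronized {
    buf.exists(b => refCountOf(b) == 1)
  }

  protected def releaseBuf(): Unit = synchronized {
    buf.foreach(_.close())
    buf = None
  }
}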
I still have a lot more to go through, but I thought I would at least get some of my comments in.
*
* @param blockMap mapping of block ID to array of buffers for the block
*/
private case class ShuffleInfo(
nit: there is a single value in this class. Would it be better to just alias ShuffleInfo to the ConcurrentHashMap...
* Set a new spill priority.
*/
override def setSpillPriority(priority: Long): Unit = {
  // TODO: handle.setSpillPriority(priority)
Or remove this entirely.
// closing my reference
handle.close()
}
// TODO this is causing problems so we need to look into this
Should we just remove this then? It feels like that should be a separate issue as this is already a large enough PR. (like 10161)
Okay, sorry, it looks like these comments were here already. This is just copied from previous code.
}
// TODO this is causing problems so we need to look into this
again this PR is big enough. I think we can do this in a separate PR.
Okay, sorry, it looks like these comments were here already. This is just copied from previous code.
@@ -245,18 +338,29 @@ object SpillableColumnarBatch {
*/
def apply(batch: ColumnarBatch,
    priority: Long): SpillableColumnarBatch = {
  Cuda.DEFAULT_STREAM.sync()
why?
RapidsBufferCatalog.addBatch(batch, initialSpillPriority)
}
}
Cuda.DEFAULT_STREAM.sync()
again why?
catalog: RapidsBufferCatalog): RapidsBufferHandle = {
withResource(batch) { batch =>
catalog.addBatch(batch, initialSpillPriority)
val handle = SpillableHostColumnarBatchHandle(batch)
Why no stream sync if it is needed for the other APIs?
val handle = withResource(buffer) { _ =>
  RapidsBufferCatalog.addBuffer(buffer, meta, priority)
}
Cuda.DEFAULT_STREAM.sync()
why?
Still have a lot more to look at, but I am making progress
buffOffset += blockRange.rangeSize()
}
needsCleanup = false
} catch {
case ioe: IOException =>
case ex: Throwable =>
Why do we need to catch Errors?
class ShuffleBufferCatalogSuite extends AnyFunSuite with MockitoSugar {
  // TODO: AB: more tests please
Is this going to be done?
}, hostCols.head.getRowCount.toInt), dataTypes)
}

// TODO: AB: add tests that span multiple byte buffers for host->disk, and
Is this going to be done?
Still not done, but I need to switch to some other things so I am going to comment on what I have looked at so far.
*
* CUDA/Host synchronization:
*
* We assume all device backed handles are completely materialized on the device (before adding
Why do we need this? Please add that to the comments so that it is clear what is happening.
* with extra locking is the `SpillableHostStore`, to maintain a `totalSize` number that is
* used to figure out cheaply when it is full.
*
* Handles hold a lock to protect the user against when it is either in the middle of
This is a bit confusing. So essentially the lock prevents race conditions when the object is in the middle of spilling or being closed?
* - If sizeInBytes is 0, the object is tracked by the stores so it can be
*   removed on shutdown, or by handle.close, but 0-byte handles are not spillable.
*/
val sizeInBytes: Long
If this is approximate we should name it as such, so there is no confusion about how it can be used.
Also this is a val, but if something spills wouldn't it change to 0 from whatever it was before? Shouldn't it be a def?
* or directly against the handle.
* @return sizeInBytes if spilled, 0 for any other reason (not spillable, closed)
*/
def spill: Long
nit: I thought the convention in Scala was that a method with no parameters should have parens if it did work (like side effects). Should this be def spill(): Long then?
/**
* Method called to spill this handle. It can be triggered from the spill store,
* or directly against the handle.
* @return sizeInBytes if spilled, 0 for any other reason (not spillable, closed)
Just to clarify, do we want them to return the result of calling sizeInBytes, especially if it is just an approximate size? Can we clarify a little bit here about what is expected to be returned by this API?
* is a `val`.
* @return true if currently spillable, false otherwise
*/
private[spill] def spillable: Boolean = sizeInBytes > 0
Why is this private to spill? Just curious, because if we want others to add spillable things in, does that mean they have to put the handles in com.nvidia.spark.rapids.spill to make them work?
// do we care if:
// handle didn't fit in the store as it is too large.
// we made memory for this so we are going to hold our noses and keep going
// we could spill `handle` at this point.
I thought we spoke about this already. There are currently two cases:
- We are running with no limit on the host memory, but a spill store limit.
- We have a host memory limit.
If we have a host memory limit, then the spill store is constrained by the host memory limit, so we can ignore it here. We will not have allocated a HostMemoryBuffer that is too large to fit.
If we have unlimited host memory, then making a SpillableHostBufferHandle is there for code compatibility, but it should not be added to the host store at all. It should never spill. We have unlimited host memory.
We should also document this somewhere.
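A minimal sketch of those two cases (placeholder names and methods, not actual framework code):

// Hypothetical helper, just to illustrate the two cases above.
def trackHostSpillable(
    handle: SpillableHostBufferHandle,
    hostStore: SpillableHostStore,
    hostMemoryLimited: Boolean): Unit = {
  if (hostMemoryLimited) {
    // the spill store is bounded by the host memory limit, so anything we
    // managed to allocate also fits within the store's accounting
    hostStore.track(handle)
  } else {
    // unlimited host memory: the handle exists only for code compatibility;
    // it is never added to the host store and never spills
  }
}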
This is a very large PR that I'd like some 👀 on. Marked it as draft as I still have some TODOs around more tests. The PR is NOT going to go to 24.12, it's just that we don't have a 25.02 branch available yet.

The main file to focus on is SpillFramework.scala (yes, one file; let me know if you want me to break it into multiple files). SpillFramework.scala has a comment describing how things should work, please take a look at that.

The main contribution here is a simplification of the framework: we remove the idea of a RapidsBuffer that has to be acquired and unacquired, in favor of a handle that just knows how to materialize. There isn't a concept of acquisition in the new framework.

There is a SpillableColumnarBatch API and a lazy-spillable API for Join that I did not touch and left there on purpose, but we can start to remove that API and create spillable handles that replicate the lazy behavior we wanted in lazy spillable, or the recomputing behavior we want for broadcasts. This is the second contribution of the PR: handles decide how to spill, not the framework.

There is one easily fixable shortcoming today in the multiple-spiller case, which I will fix in a follow-on PR. While we are spilling a handle, the handle holds a lock, and the same lock is used to figure out if the handle is spillable. A second thread that is trying to spill may need to wait for this lock (and the spill) to finish before it can figure out whether it needs to spill that handle or not. We can make this more straightforward by handling the spill state separately from the materialization/data state, but I'd like to submit that work as an improvement.
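To give a rough idea without opening the file, here is a simplified sketch of the handle shape, pieced together from fragments quoted in this review (not the exact code in SpillFramework.scala):

package com.nvidia.spark.rapids.spill

trait SpillableHandle extends AutoCloseable {
  // approximate size of the tracked object; 0-byte handles are tracked by the
  // stores (so they can be removed on shutdown or via close) but never spill
  val sizeInBytes: Long

  // spill this handle; returns sizeInBytes if it spilled, 0 for any other
  // reason (not spillable, already spilled, closed)
  def spill: Long

  private[spill] def spillable: Boolean = sizeInBytes > 0
}

trait DeviceSpillableHandle[T <: AutoCloseable] extends SpillableHandle {
  protected var dev: Option[T]
}

// Concrete handles add their own materialize method (with whatever extra
// arguments they need, e.g. Spark schema descriptors); materialize
// incRefCounts the object when it is still resident, replacing the old
// RapidsBuffer acquire/unacquire protocol.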
I have run this against NDS @ 3TB in our perf cluster and I don't see regressions, and have run it against spill prone cases and I am able to see multiple threads in the "spill path", and no deadlocks. I'll post more results when I can run them.