[FEA] Appending host columnar data into ColumnBuilder by batch #4565
If this really is faster (do you have empirical evidence?), why not put this logic in

Then the discussion pivots to memory usage, which is increased with this approach. It would be good to plot a curve of performance vs. array size to see if there's a "sweet spot," ideally with a relatively low buffer size. If we're willing to throw more memory at it, we could also avoid off-heap reallocations by tracking multiple array buffers and then when

Anyway, we need some hard numbers on how this idea performs across a variety of types and buffer sizes so we can make informed decisions. The first step is to prototype the easy case of fixed-width columns and see how it performs.
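A sketch of the "track multiple array buffers" idea (hypothetical names and sizes; this is not code from the plugin or cudf): append into fixed-size heap chunks so nothing is ever reallocated, and walk the chunks once at the end, e.g. to copy them into the final off-heap buffer.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: accumulate values in fixed-size heap chunks instead of
// growing a single array, then visit each chunk exactly once at the end.
final class ChunkedIntAppender(chunkSize: Int = 8192) {
  private val chunks = ArrayBuffer.empty[Array[Int]]
  private var current = new Array[Int](chunkSize)
  private var used = 0

  def append(value: Int): Unit = {
    if (used == chunkSize) {
      chunks += current
      current = new Array[Int](chunkSize)
      used = 0
    }
    current(used) = value
    used += 1
  }

  /** Visit every chunk in order; the second argument is the number of valid entries. */
  def foreachChunk(f: (Array[Int], Int) => Unit): Unit = {
    chunks.foreach(c => f(c, c.length))
    if (used > 0) f(current, used)
  }
}
```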
In Spark CPU the producer of the
Hi @jlowe, I ran some benchmarks locally with the batch-appending code shown above. The result was frustrating: row-wise appending beat batch-wise appending at every batch size (16, 64, 256, 1024, 4096, 8192). The result indicates that the batch size itself doesn't affect the performance. All batch-appending attempts were significantly slower than the improved row appending:

```scala
private def rowAppending[T: ClassTag](size: Int,
    @inline accessor: Int => T,
    @inline appender: T => Unit): Unit = {
  var i = 0
  while (i < size) {
    val data = accessor(i)
    appender(data)
    i += 1
  }
}

private def batchAppending[T: ClassTag](size: Int,
    batchSize: Int,
    @inline accessor: Int => T,
    @inline appender: Array[T] => Unit): Unit = {
  var i = 0
  var j = 0
  val buffer = Array.ofDim[T](batchSize min size)
  while (i < size) {
    buffer(j) = accessor(i)
    i += 1
    if (j == batchSize - 1) {
      // Flush the full buffer and start refilling from index 0.
      appender(buffer)
      j = 0
    } else {
      j += 1
    }
  }
  if (j > 0) {
    // Flush the remaining partial buffer.
    appender(buffer.slice(0, j))
  }
}
```

I tested with non-nullable fixed-length columns, each containing 1 billion rows.
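To make the comparison reproducible in isolation, a harness along these lines can drive the two helpers without any cudf dependency. AppendBench is just an assumed name, a direct IntBuffer stands in for the builder's off-heap buffer, and the rowAppending/batchAppending definitions above are assumed to be pasted into the object; this is a sketch, not the benchmark that produced the numbers above.

```scala
import java.nio.ByteBuffer

object AppendBench {
  // rowAppending and batchAppending from the snippet above go here.

  def main(args: Array[String]): Unit = {
    val size = 1 << 24                                           // scaled down from 1 billion rows
    val source = Array.tabulate(size)(identity)                  // stand-in for the host column data
    val sink = ByteBuffer.allocateDirect(size * 4).asIntBuffer() // stand-in for the off-heap buffer

    def time(label: String)(body: => Unit): Unit = {
      sink.clear()                                               // reset the write position
      val start = System.nanoTime()
      body
      println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    }

    time("row-wise")(rowAppending[Int](size, i => source(i), v => sink.put(v)))
    time("batch-wise (8192)")(
      batchAppending[Int](size, 8192, i => source(i), (chunk: Array[Int]) => sink.put(chunk)))
  }
}
```

Going through a generic T boxes each value, so absolute numbers from a toy harness like this should be taken with a grain of salt; real numbers should come from the actual ColumnBuilder path.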
The performance is going to be impacted by the number of columns much more than by the length of the columns. You also have to think about memory access patterns to guess at the performance, assuming that main memory is going to be the bottleneck.

When reading a single fixed-width column, the CPU will read an entire cache line of data around the part we are accessing. Because we are reading the data sequentially, we get all of the other reads on the same cache line essentially for free. Some CPUs can even recognize the read pattern and prefetch data ahead. So, depending on the CPU, we should be able to read the data in at close to memory speed and write it back out at close to memory speed.

When copying to an array first, even if the array fits in the cache, it is on the heap. Because of that we should still be able to read the data at main memory speed, like in the first case, but we are going to write it at least twice: once to the array, which will write through the cache to main memory, and then again to copy it to the final location. If the array is large enough that it does not fit in the cache, then we are reading and writing the data twice, so I would expect it to be about twice as slow.

When we start to get into multiple columns, they will fight with each other over space in the cache. The more columns you have, the more likely it is that one column will read data into the cache only for it to be evicted by another column before the rest of the same cache line can be used. This effectively slows things down until each entry is read from main memory and written to main memory separately. This is why in the copy code we try to copy a single column at a time.

Once we get to array columns, the APIs don't let us get at the underlying columnar data. If they did, we could copy it columnar-to-columnar at hopefully close to main memory speed.
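A toy illustration of the two access patterns being described (not the plugin's copy code; the column count and sizes are arbitrary, and whether and how much the interleaved copy is slower depends on the CPU's caches and prefetchers):

```scala
object CopyPatternDemo {
  def main(args: Array[String]): Unit = {
    val numCols = 128
    val numRows = 1 << 17 // 512 KB of Ints per column
    val cols = Array.fill(numCols)(Array.tabulate(numRows)(identity))
    val outs = Array.fill(numCols)(new Array[Int](numRows))

    def time(label: String)(body: => Unit): Unit = {
      val start = System.nanoTime()
      body
      println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    }

    // One column at a time: each cache line is read once and written once,
    // sequentially, so full cache lines and the prefetcher are used well.
    time("column at a time") {
      var c = 0
      while (c < numCols) {
        val src = cols(c)
        val dst = outs(c)
        var r = 0
        while (r < numRows) { dst(r) = src(r); r += 1 }
        c += 1
      }
    }

    // Row at a time across all columns: many read and write streams compete
    // for cache space and prefetcher slots, so a line may be evicted before
    // the rest of its data is used.
    time("row at a time across columns") {
      var r = 0
      while (r < numRows) {
        var c = 0
        while (c < numCols) { outs(c)(r) = cols(c)(r); c += 1 }
        r += 1
      }
    }
  }
}
```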
@revans2 Thank you for such a detailed elaboration. |
@sperlingxx closing unless you have other thoughts. |
While optimizing HostColumnVector.ColumnBuilder (rapidsai/cudf#10025), I realized that it is quite inefficient to append a large amount of data into the off-heap buffer row by row. Meanwhile, we perform this append numerous times in HostColumnarToGpu, because we can not access the Spark ColumnVector in a columnar manner.

I propose an alternative approach which may accelerate this process: cache the ColumnVector data in an Array buffer, and append the buffer into ColumnBuilder whenever the buffer is full.

For fixed-width types without nulls, it is relatively easy to implement the batch appending.
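A hedged sketch of the fixed-width idea (not the original snippet from this issue; batchAppendInts and appendChunk are hypothetical names, and appendChunk would be backed by the appendArray method proposed below):

```scala
import org.apache.spark.sql.vectorized.ColumnVector

// Hypothetical helper for a non-nullable INT column: buffer values on the
// heap and hand whole chunks to appendChunk, which would ultimately call the
// proposed ColumnBuilder.appendArray.
def batchAppendInts(cv: ColumnVector, numRows: Int, batchSize: Int,
    appendChunk: Array[Int] => Unit): Unit = {
  val buffer = new Array[Int](batchSize min numRows)
  var i = 0
  var j = 0
  while (i < numRows) {
    buffer(j) = cv.getInt(i)
    i += 1
    if (j == buffer.length - 1) {
      appendChunk(buffer)            // flush a full chunk
      j = 0
    } else {
      j += 1
    }
  }
  if (j > 0) {
    appendChunk(buffer.slice(0, j))  // flush the remaining tail
  }
}
```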
For nullable fixed-width types, we need to maintain a bit set as the valid buffer. For now, the function batchAppendingWithNull works under the assumption that it can be called only once, because we can not guarantee that the input sizes are multiples of 8, so a subsequent append may overwrite the validity data written by the previous one. Perhaps we can address the problem by recording the last byte of each append.

On the cuDF side, we need to implement appendArray and appendNullableArray methods.
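A standalone sketch of what such methods could look like. SketchBuilder and its direct ByteBuffers are stand-ins for ColumnBuilder and its off-heap data/validity buffers (hypothetical names and signatures, only to illustrate the intended semantics):

```scala
import java.nio.ByteBuffer

// Hypothetical sketch only; not cudf code.
final class SketchBuilder(capacityRows: Int) {
  private val data = ByteBuffer.allocateDirect(capacityRows * 4).asIntBuffer() // INT32 values
  private val valid = ByteBuffer.allocateDirect((capacityRows + 7) / 8)        // 1 validity bit per row
  private var rows = 0

  /** Append a batch of non-null values. */
  def appendArray(values: Array[Int]): Unit = {
    data.put(values)                          // one bulk copy for the whole batch
    var i = 0
    while (i < values.length) { setValidBit(rows + i); i += 1 }
    rows += values.length
  }

  /** Append a batch of values where nulls(i) == true marks a null row. */
  def appendNullableArray(values: Array[Int], nulls: Array[Boolean]): Unit = {
    data.put(values)
    var i = 0
    while (i < values.length) {
      if (!nulls(i)) setValidBit(rows + i)    // leave the bit 0 for null rows
      i += 1
    }
    rows += values.length
  }

  // OR each bit into the validity byte instead of overwriting whole bytes, so
  // a partial byte written by a previous append is preserved even when the
  // batch size is not a multiple of 8.
  private def setValidBit(row: Int): Unit = {
    val byteIdx = row >> 3
    val mask = (1 << (row & 7)).toByte
    valid.put(byteIdx, (valid.get(byteIdx) | mask).toByte)
  }

  def rowCount: Int = rows
}
```

Packing the batch's validity bits on the heap and merging only the first partial byte with the previously written last byte (the "record the last byte" idea above) would batch the validity writes as well; the bit-by-bit OR here is just the simplest way to show the semantics.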