
[FEA] Document cuDF allocation alignment/padding requirements #9389

Closed · jrhemstad opened this issue Oct 6, 2021 · 15 comments

Labels: doc, feature request, libcudf, Performance, proposal, Python

Comments

@jrhemstad (Contributor)

Is your feature request related to a problem? Please describe.

cuDF informally states that we follow the Arrow physical memory layout spec. However, we're a little loose on some of the particulars, especially around alignment/padding.

Arrow requires that all allocations be 8B aligned and padded to a multiple of 8B in size, and it recommends expanding this to 64B alignment and 64B padding.

cuDF isn't (formally) following this. Implicitly, we know all allocations made through RMM will be 256B aligned. As it happens, RMM will currently pad out to 8B (though it shouldn't do that). However, due to zero-copy slicing, we can't assume the column.data<T>() pointer will be 256B aligned, and can only expect the pointer alignment to be at least alignof(T).

Specifically for null masks, we expect (and allocate) the null mask of a column to always be padded to 64B.

Describe the solution you'd like

At the very least, we should write down our current expectations. These requirements are best captured as requirements of the column_view class, since column_view is the arbiter of how we interact with all memory consumed by libcudf.

  • data must be aligned to at least the alignment of the underlying type
  • It must be safe to dereference any address in the range [data, data + num_elements * size_of(data_type))
    • Note: This is currently incompatible with the Arrow spec. e.g., libcudf would currently allow an INT8 allocation to be any size, whereas Arrow would require it to be padded to a multiple of 8 bytes.
  • null_mask must be aligned to alignof(bitmask_type)
  • null_mask must point to an allocation padded to a multiple of 64 bytes such that it is safe to dereference any address in the range [null_mask, null_mask + bitmask_allocation_size_bytes(num_elements)) (the size calculation is sketched below)
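
For concreteness, the size in that last bullet works out as in the following small Python sketch of the arithmetic (a sketch only, not libcudf's actual implementation):

def bitmask_allocation_size_bytes(num_elements, pad_to=64):
    # One validity bit per element, packed into bytes ...
    raw_bytes = (num_elements + 7) // 8
    # ... then rounded up to a multiple of the 64B padding boundary.
    return ((raw_bytes + pad_to - 1) // pad_to) * pad_to

bitmask_allocation_size_bytes(100)   # 64  (13 bytes of validity bits, padded)
bitmask_allocation_size_bytes(1000)  # 128 (125 bytes of validity bits, padded)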

Additionally, we should consider expanding the padding requirement so that all allocations are padded out to a multiple of 64B. This could be achieved easily enough with a device_memory_resource adapter that rounds allocation sizes up to a multiple of 64B.
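
A minimal sketch of that rounding, written here in Python against rmm.DeviceBuffer rather than as the C++ adapter itself (padded_device_buffer is a made-up helper for illustration; the real adapter would do this in its do_allocate override):

import rmm

def padded_device_buffer(nbytes, pad_to=64):
    # Round the requested size up to the next multiple of pad_to (64B here),
    # mimicking what a rounding device_memory_resource adapter would do on
    # the C++ side before forwarding to the upstream resource.
    padded = ((nbytes + pad_to - 1) // pad_to) * pad_to
    return rmm.DeviceBuffer(size=padded)

buf = padded_device_buffer(10)
print(buf.size)  # 64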

Additional Considerations

As best I can tell, these requirements are technically incompatible with the CUDA Array Interface because the CAI doesn't say anything about padding/alignment. This seems like something we should push to be addressed in the CAI.

@jrhemstad added the feature request, proposal, doc, libcudf, Python, Performance labels on Oct 6, 2021
@shwina (Contributor) commented Oct 6, 2021

cc: @gmarkall @leofang @jakirkham for thoughts on:

As best I can tell, these requirements are technically incompatible with the CUDA Array Interface because the CAI doesn't say anything about padding/alignment. This seems like something we should push to be addressed in the CAI.

@gmarkall (Contributor) commented Oct 6, 2021

As best I can tell, these requirements are technically incompatible with the CUDA Array Interface because the CAI doesn't say anything about padding/alignment. This seems like something we should push to be addressed in the CAI.

This is something we can think about. I'll write a summary of thoughts and possibilities, open an issue on the Numba issue tracker (where CAI design discussions tend to take place), and link back to this issue.

@kkraus14 (Collaborator) commented Oct 6, 2021

  • null_mask must be aligned to alignof(bitmask_type)
  • null_mask must point to an allocation padded to a multiple of 64 bytes such that it is safe to dereference any address in the range [null_mask, null_mask + bitmask_allocation_size_bytes(num_elements))

Additionally, we should consider expanding the requirement for allocation padding to make it such that all allocations be padded out to 64B. This could be achieved easily enough with a device_memory_resource adapter that rounds allocations up to at least 64B.

I don't think the __cuda_array_interface__ protocol should necessarily be forced to meet these requirements, but when consuming objects we could check whether they follow these rules and, if they do not, copy them into new allocation(s) that do.
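
A rough Python sketch of such a consumer-side check (ensure_aligned is a made-up helper; note the CAI alone only exposes the data pointer's alignment, not the padding of the underlying allocation):

import cupy as cp

def ensure_aligned(obj, alignment=8):
    # Arrow's minimum is 8B-aligned allocations; check the CAI data pointer
    # and fall back to a fresh copy from CuPy's pool when it doesn't comply.
    ptr, _read_only = obj.__cuda_array_interface__["data"]
    if ptr % alignment == 0:
        return obj
    return cp.asarray(obj).copy()  # new allocation

arr = cp.arange(4, dtype="int8")
view = ensure_aligned(arr[1:])  # the slice's pointer is base + 1, so this copies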

@shwina (Contributor) commented Oct 6, 2021

I agree that the CAI shouldn't necessarily be influenced by the Arrow spec, but the following does worry me:

we could check if the object follows these rules, and if it does not we can copy it into new allocation(s) that do.

This will result in poorly performing code in a lot of cases where we currently zero-copy. For example, slices:

In [6]: arr = cp.array([1, 2, 3, 4], dtype="int8")

In [7]: arr.__cuda_array_interface__
Out[7]:
{'shape': (4,),
 'typestr': '|i1',
 'descr': [('', '|i1')],
 'stream': 1,
 'version': 3,
 'strides': None,
 'data': (140649290531840, False)}

In [8]: arr[1:].__cuda_array_interface__
Out[8]:
{'shape': (3,),
 'typestr': '|i1',
 'descr': [('', '|i1')],
 'stream': 1,
 'version': 3,
 'strides': None,
 'data': (140649290531841, False)}

@kkraus14 (Collaborator) commented Oct 6, 2021

This will result in poorly performing code in a lot of cases where we zero copy at the moment. For example, slices

Yes, if we require a 64B padded allocation that would be the case. Alternatively we'd have to do a bunch of non-trivial work and add complexity into __cuda_array_interface__ to describe the underlying memory allocation in addition to the relevant data.
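
For illustration only, the kind of extra information being described might look something like this (a purely hypothetical "allocation" field that is not part of any spec; pointer values are taken from the slicing example above, other values are illustrative):

# Hypothetical sketch only -- "allocation" is NOT part of __cuda_array_interface__.
sliced_cai = {
    "shape": (3,),
    "typestr": "|i1",
    "version": 3,
    "strides": None,
    "stream": 1,
    "data": (140649290531841, False),
    # Describes the underlying allocation the data pointer lives inside,
    # which is what a consumer would need to reason about padding/alignment.
    "allocation": {"ptr": 140649290531840, "size": 512, "alignment": 256},
}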

@jrhemstad (Contributor, Author)

It seems to me like if CAI wants to play ball in the Arrow world, it needs to be rebased on top of the Arrow physical memory spec. Otherwise, it defeats the purpose of having a common spec for memory layout to enable zero-copy interaction among libraries. For example, cuDF can't assume anything about padding/alignment that would otherwise be guaranteed by the Arrow spec, so we would need to deep copy data into the Arrow memory layout.

This may not have caused any problems yet, because cuDF doesn't really exploit the alignment/padding guarantees of Arrow (yet). Likewise, users of CAI are likely implicitly following the rule of ensuring data is naturally aligned (otherwise CUDA would error on a misaligned address), but Arrow is stricter still, requiring that all allocations be at least 8B aligned.

@kkraus14 (Collaborator) commented Oct 7, 2021

It seems to me like if CAI wants to play ball in the Arrow world, it needs to be rebased on top of the Arrow physical memory spec.

CAI is an array protocol and is not Arrow-specific in any way, so this is out of scope for it. What could be in scope is supporting a way to describe an underlying allocation separately from the actual array data; that is likely what would be needed to support the Arrow padding case we want to take advantage of here.

Also, the same problem exists for dlpack as it does for CAI.

@jrhemstad (Contributor, Author)

I get that CAI is just a way to pass around pointers and is solving a somewhat different (but still very similar) problem from Arrow's.

What is a library maintainer supposed to do if they build on the Arrow spec and say "I'll work on your data zero-copy if you follow the Arrow spec"? Suddenly CAI/dlpack break that promise, the whole premise of Arrow eliminating "copy & convert" is eroded, and we're back to a situation where you have to deep copy things just to ensure they're in the format you expect.

@kkraus14 (Collaborator) commented Oct 9, 2021

I just realized that the dlpack spec does indicate that pointers should be 256 byte aligned (https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h#L139-L141), but I'm not sure if we're following that properly.

Doesn't look like CuPy follows this as expected for sliced arrays either:

import cupy

test_base = cupy.array([1, 2, 3, 4, 5])
test_sliced = test_base[2:]

test_dlpack_base = test_base.toDlpack()
test_dlpack_sliced = test_sliced.toDlpack()

output_base = cupy.fromDlpack(test_dlpack_base)
output_sliced = cupy.fromDlpack(test_dlpack_sliced)

print("Base dlpack pointer is 256 byte aligned?", (output_base.data.ptr % 256) == 0)  # True
print("Sliced dlpack pointer is 256 byte aligned?", (output_sliced.data.ptr % 256) == 0)  # False

@kkraus14 (Collaborator) commented Oct 9, 2021

cc @leofang ⬆️

@jrhemstad (Contributor, Author) commented Oct 10, 2021

There's no way to make an arbitrarily sliced array be 256B aligned.

The Arrow spec states that allocations should be 8B (64B) aligned, but not that every pointer (e.g., for a sliced/offset array) must have such alignment.

In fact, in libcudf, there's no way to differentiate a pointer to the beginning of an allocation vs a pointer inside an allocation. I suppose Arrow can do this because all Arrow data structures are owning with a shared_ptr to the underlying buffer, so you always know the extent of the underlying allocation.

So that means that in libcudf we can never require a pointer to have stricter alignment than the natural alignment of the underlying type. That said, we could say that the pointer points inside an allocation that is at least 64B aligned, such that it is safe to dereference p - p % 64.
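
In other words (a trivial Python sketch, reusing the pointer values from the CAI example earlier in the thread; enclosing_aligned_address is a made-up name):

def enclosing_aligned_address(ptr, alignment=64):
    # Round the pointer down to the nearest 64B boundary at or below it;
    # under the proposed guarantee, that address is safe to dereference.
    return ptr - (ptr % alignment)

enclosing_aligned_address(140649290531841)  # -> 140649290531840, the base pointer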

@leofang (Member) commented Oct 27, 2021

Sorry I'm late to the party! On DLPack's 256B alignment requirement, I opened an issue in the Array API repo to discuss: data-apis/array-api#293. I (and others in that thread) agree with Jake's assessment that there's no way to make arbitrarily sliced arrays 256B aligned.

@github-actions (bot)

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions (bot)

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vyasr (Contributor) commented May 30, 2023

Based on recent discussions on this issue, Arrow's padding requirements are actually far less strict than we thought. 64 byte padding is recommended solely for the usual performance reasons (e.g., the ability to use vector instructions), and 8 byte padding is only required at the level of IPC packets (which cudf does not currently support), not individual columns. For a comprehensive summary of our discussions, see rapidsai/rmm#865 (comment).

With that in mind, I think we can close this issue since there are no alignment/padding requirements to be documented here. @jrhemstad please reopen if you think there is still some action to be taken on this issue.

@vyasr closed this as completed on May 30, 2023