
[FEA] Document cuDF allocation alignment/padding requirements #9389

Closed · jrhemstad opened this issue Oct 6, 2021 · 15 comments

Labels: doc, feature request, libcudf, Performance, proposal, Python

Comments

@jrhemstad (Contributor)

Is your feature request related to a problem? Please describe.

cuDF informally states that we follow the Arrow physical memory layout spec. However, we're a little loose on some of the particulars, especially around alignment/padding.

Arrow requires that all allocations be 8B aligned and padded to a multiple of 8B in size, and it recommends expanding this to 64B alignment and 64B padding.

cuDF isn't (formally) following this. Implicitly, we know all allocations made through RMM will be 256B aligned. As it happens, RMM will currently pad out to 8B (though it shouldn't do that). However, due to zero-copy slicing, we can't assume the column.data<T>() pointer will be 256B aligned, and can only expect the pointer alignment to be at least alignof(T).

Specifically for null masks, we expect (and allocate) the null mask of a column to always be padded to 64B.

Describe the solution you'd like

At the very least, we should write down our current expectations. These requirements are best captured as requirements of the column_view class, since column_view is the arbiter of how we interact with all memory consumed by libcudf.

  • data must be aligned to at least the alignment of the underlying type
  • It must be safe to dereference any address in the range [data, data + num_elements * size_of(data_type))
    • Note: This is currently incompatible with the Arrow spec. e.g., libcudf would currently allow an INT8 allocation to be any size, whereas Arrow would require it to be padded to a multiple of 8 bytes.
  • null_mask must be aligned to alignof(bitmask_type)
  • null_mask must point to an allocation padded to a multiple of 64 bytes such that it is safe to dereference any address in the range [null_mask, null_mask + bitmask_allocation_size_bytes(num_elements)) (the size calculation is sketched below)
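
For concreteness, the size in that last bullet works out as in the following small Python sketch of the arithmetic (a sketch only, not libcudf's actual implementation):

def bitmask_allocation_size_bytes(num_elements, pad_to=64):
    # One validity bit per element, packed into bytes ...
    raw_bytes = (num_elements + 7) // 8
    # ... then rounded up to a multiple of the 64B padding boundary.
    return ((raw_bytes + pad_to - 1) // pad_to) * pad_to

bitmask_allocation_size_bytes(100)   # 64  (13 bytes of validity bits, padded)
bitmask_allocation_size_bytes(1000)  # 128 (125 bytes of validity bits, padded)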

Additionally, we should consider expanding the padding requirement so that all allocations are padded out to a multiple of 64B. This could be achieved easily enough with a device_memory_resource adapter that rounds allocation sizes up to a multiple of 64B.
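
A minimal sketch of that rounding, written here in Python against rmm.DeviceBuffer rather than as the C++ adapter itself (padded_device_buffer is a made-up helper for illustration; the real adapter would do this in its do_allocate override):

import rmm

def padded_device_buffer(nbytes, pad_to=64):
    # Round the requested size up to the next multiple of pad_to (64B here),
    # mimicking what a rounding device_memory_resource adapter would do on
    # the C++ side before forwarding to the upstream resource.
    padded = ((nbytes + pad_to - 1) // pad_to) * pad_to
    return rmm.DeviceBuffer(size=padded)

buf = padded_device_buffer(10)
print(buf.size)  # 64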

Additional Considerations

As best I can tell, these requirements are technically incompatible with the CUDA Array Interface because the CAI doesn't say anything about padding/alignment. This seems like something we should push to be addressed in the CAI.

@jrhemstad added the feature request, proposal, doc, libcudf, Python, Performance labels on Oct 6, 2021
@shwina (Contributor) commented Oct 6, 2021

cc: @gmarkall @leofang @jakirkham for thoughts on:

As best I can tell, these requirements are technically incompatible with the CUDA Array Interface because the CAI doesn't say anything about padding/alignment. This seems like something we should push to be addressed in the CAI.

@gmarkall (Contributor) commented Oct 6, 2021

As best I can tell, these requirements are technically incompatible with the CUDA Array Interface because the CAI doesn't say anything about padding/alignment. This seems like something we should push to be addressed in the CAI.

This is something we can think about. I'll write a summary of thoughts and possibilities, open an issue on the Numba issue tracker (where CAI design discussions tend to take place), and link back to this issue.

@kkraus14 (Collaborator) commented Oct 6, 2021

  • null_mask must be aligned to alignof(bitmask_type)
  • null_mask must point to an allocation padded to a multiple of 64 bytes such that it is safe to dereference any address in the range [null_mask, null_mask + bitmask_allocation_size_bytes(num_elements))

Additionally, we should consider expanding the requirement for allocation padding to make it such that all allocations be padded out to 64B. This could be achieved easily enough with a device_memory_resource adapter that rounds allocations up to at least 64B.

I don't think the __cuda_array_interface__ protocol should necessarily be forced to meet these requirements, but when consuming objects we could check whether they follow these rules and, if they do not, copy them into new allocation(s) that do.
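
A rough Python sketch of such a consumer-side check (ensure_aligned is a made-up helper; note the CAI alone only exposes the data pointer's alignment, not the padding of the underlying allocation):

import cupy as cp

def ensure_aligned(obj, alignment=8):
    # Arrow's minimum is 8B-aligned allocations; check the CAI data pointer
    # and fall back to a fresh copy from CuPy's pool when it doesn't comply.
    ptr, _read_only = obj.__cuda_array_interface__["data"]
    if ptr % alignment == 0:
        return obj
    return cp.asarray(obj).copy()  # new allocation

arr = cp.arange(4, dtype="int8")
view = ensure_aligned(arr[1:])  # the slice's pointer is base + 1, so this copies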

@shwina (Contributor) commented Oct 6, 2021

I agree that the CAI shouldn't necessarily be influenced by the Arrow spec, but the following does worry me:

we could check if the object follows these rules, and if it does not we can copy it into new allocation(s) that do.

This will result in poorly performing code in a lot of cases where we currently zero-copy. For example, slices:

In [6]: arr = cp.array([1, 2, 3, 4], dtype="int8")

In [7]: arr.__cuda_array_interface__
Out[7]:
{'shape': (4,),
 'typestr': '|i1',
 'descr': [('', '|i1')],
 'stream': 1,
 'version': 3,
 'strides': None,
 'data': (140649290531840, False)}

In [8]: arr[1:].__cuda_array_interface__
Out[8]:
{'shape': (3,),
 'typestr': '|i1',
 'descr': [('', '|i1')],
 'stream': 1,
 'version': 3,
 'strides': None,
 'data': (140649290531841, False)}

@kkraus14 (Collaborator) commented Oct 6, 2021

This will result in poorly performing code in a lot of cases where we zero copy at the moment. For example, slices

Yes, if we require a 64B padded allocation that would be the case. Alternatively we'd have to do a bunch of non-trivial work and add complexity into __cuda_array_interface__ to describe the underlying memory allocation in addition to the relevant data.
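
For illustration only, the kind of extra information being described might look something like this (a purely hypothetical "allocation" field that is not part of any spec; pointer values are taken from the slicing example above, other values are illustrative):

# Hypothetical sketch only -- "allocation" is NOT part of __cuda_array_interface__.
sliced_cai = {
    "shape": (3,),
    "typestr": "|i1",
    "version": 3,
    "strides": None,
    "stream": 1,
    "data": (140649290531841, False),
    # Describes the underlying allocation the data pointer lives inside,
    # which is what a consumer would need to reason about padding/alignment.
    "allocation": {"ptr": 140649290531840, "size": 512, "alignment": 256},
}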

@jrhemstad (Contributor, Author)

It seems to me like if CAI wants to play ball in the Arrow world, it needs to be rebased on top of the Arrow physical memory spec. Otherwise, it defeats the purpose of having a common spec for memory layout to enable zero-copy interaction among libraries. For example, cuDF can't assume anything about padding/alignment that would otherwise be guaranteed by the Arrow spec, so we would need to deep copy data into the Arrow memory layout.

This may not have caused any problems yet, because cuDF doesn't really exploit the alignment/padding guarantees of Arrow (yet). Likewise, users of CAI are likely implicitly following the rule of ensuring data is naturally aligned (otherwise CUDA would error on a misaligned address), but Arrow is stricter still, requiring that all allocations be at least 8B aligned.

@kkraus14 (Collaborator) commented Oct 7, 2021

It seems to me like if CAI wants to play ball in the Arrow world, it needs to be rebased on top of the Arrow physical memory spec.

CAI is an array protocol and is not Arrow-specific in any way, so this is out of scope for it. What could be in scope is supporting a way to describe an underlying allocation separately from the actual array data; that is likely what would be needed to support the Arrow padding case we want to take advantage of here.

Also, the same problem exists for dlpack as it does for CAI.

@jrhemstad (Contributor, Author)

I get that CAI is just a way to pass around pointers and is solving a somewhat different (but still very similar) problem from Arrow's.

What is a library maintainer supposed to do if they build on the Arrow spec and say "I'll work on your data zero-copy if you follow the Arrow spec"? Suddenly CAI/dlpack break that promise, the whole premise of Arrow eliminating "copy & convert" is eroded, and we're back to a situation where you have to deep copy things just to ensure they're in the format you expect.

@kkraus14 (Collaborator) commented Oct 9, 2021

I just realized that the dlpack spec does indicate that pointers should be 256 byte aligned (https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h#L139-L141), but I'm not sure if we're following that properly.

Doesn't look like CuPy follows this as expected for sliced arrays either:

import cupy

test_base = cupy.array([1, 2, 3, 4, 5])
test_sliced = test_base[2:]

test_dlpack_base = test_base.toDlpack()
test_dlpack_sliced = test_sliced.toDlpack()

output_base = cupy.fromDlpack(test_dlpack_base)
output_sliced = cupy.fromDlpack(test_dlpack_sliced)

print("Base dlpack pointer is 256 byte aligned?", (output_base.data.ptr % 256) == 0)  # True
print("Sliced dlpack pointer is 256 byte aligned?", (output_sliced.data.ptr % 256) == 0)  # False

@kkraus14 (Collaborator) commented Oct 9, 2021

cc @leofang ⬆️

@jrhemstad (Contributor, Author) commented Oct 10, 2021

There's no way to make an arbitrarily sliced array be 256B aligned.

The Arrow spec states that allocations should be 8B (64B) aligned, but not that every pointer (e.g., for a sliced/offset array) must have such alignment.

In fact, in libcudf, there's no way to differentiate a pointer to the beginning of an allocation vs a pointer inside an allocation. I suppose Arrow can do this because all Arrow data structures are owning with a shared_ptr to the underlying buffer, so you always know the extent of the underlying allocation.

So that means that in libcudf we can never require a pointer to have stricter alignment than the natural alignment of the underlying type. That said, we could say that the pointer points inside an allocation that is at least 64B aligned, such that it is safe to dereference p - p % 64.
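
In other words (a trivial Python sketch, reusing the pointer values from the CAI example earlier in the thread; enclosing_aligned_address is a made-up name):

def enclosing_aligned_address(ptr, alignment=64):
    # Round the pointer down to the nearest 64B boundary at or below it;
    # under the proposed guarantee, that address is safe to dereference.
    return ptr - (ptr % alignment)

enclosing_aligned_address(140649290531841)  # -> 140649290531840, the base pointer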

@leofang (Member) commented Oct 27, 2021

Sorry I'm late to the party! On DLPack's 256B alignment requirement, I opened an issue in the Array API repo to discuss: data-apis/array-api#293. I (and others in that thread) agree with Jake's assessment that there's no way to make arbitrarily sliced arrays 256B aligned.

@github-actions (bot)

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions (bot)

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vyasr (Contributor) commented May 30, 2023

Based on recent discussions on this issue, Arrow's padding requirements are actually far less strict than we thought. 64 byte padding is recommended solely for the usual performance reasons (e.g., the ability to use vector instructions), and 8 byte padding is only required at the level of IPC packets (which cudf does not currently support), not individual columns. For a comprehensive summary of our discussions, see rapidsai/rmm#865 (comment).

With that in mind, I think we can close this issue since there are no alignment/padding requirements to be documented here. @jrhemstad please reopen if you think there is still some action to be taken on this issue.

@vyasr closed this as completed on May 30, 2023