Support `copy` and `device` keywords in `from_dlpack` #741
Conversation
Thanks Leo! Should I take over the first two commits in gh-602?
Let's just review/merge #602 first; this branch was based on the one there.
Is there a timeline issue here as well? `from_dlpack` should do nothing with the new `copy` and `device` keywords beyond forwarding them to the producer. So if the producer doesn't support that functionality yet, what happens?
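The forwarding behavior being asked about could be sketched roughly as follows. This is a minimal illustration, not any library's actual implementation; the helper `_wrap_capsule` and the keyword name `dl_device` are stand-ins.

```python
def _wrap_capsule(capsule):
    # Stand-in for the consumer's real capsule-to-array conversion.
    return capsule

def from_dlpack(x, /, *, device=None, copy=None):
    """Sketch of consumer-side forwarding of the new keywords.

    A producer whose __dlpack__ predates the copy/device keywords raises
    TypeError when called with them; we only retry without the keywords
    when the caller asked for the defaults anyway.
    """
    kwargs = {}
    if device is not None:
        kwargs["dl_device"] = device  # illustrative keyword name
    if copy is not None:
        kwargs["copy"] = copy
    try:
        capsule = x.__dlpack__(**kwargs)
    except TypeError:
        if kwargs:
            # Producer cannot honor the request; don't silently ignore it.
            raise BufferError(
                "producer does not support the copy/device keywords"
            ) from None
        capsule = x.__dlpack__()
    return _wrap_capsule(capsule)
```

The key point of the sketch is that the consumer holds no copy logic of its own: it either forwards the request or surfaces the producer's lack of support.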
Just leaving two notes below.
First, in the call today we discussed one issue in this PR. When two libraries exchange data via copying, potentially across devices, there might not be a common stream object that can be recognized and used by both libraries. The discussed solutions involve different ways of requesting synchronous copies (see HackMD).
See the 2nd note below.
In that discussion I was perhaps missing a more precise problem statement. Also note that the HackMD link isn't public, so here is what it says for potential solutions:

- @seberg suggested: The important thing is that
- @leofang suggested: piggyback on

I had a look at what PyTorch does, and it's basically the same as @seberg's suggestion, I believe. From https://pytorch.org/docs/master/notes/cuda.html#asynchronous-execution:

> In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation will proceed as if every operation was executed synchronously.

It goes on to explain a
Since we only want to enable copying to host (where there is no stream concept) at the moment, do we really need to go into this for the purposes of getting this PR in? It looks to me like this can be deferred.
Well, there are two things we need, I guess:
Streams really should always have been fully type safe, but OK (which would resolve this; we could just say: one of them, figure it out based on type). Maybe I should change my proposal to this: streams belong to the target device unless the producer knows it cannot possibly belong to it (the stream is typed to be a
I'd say that:
Since only
That sounds good to add. Needs to avoid the string
Well, the more I think about it the more I think it makes sense to say: If

That covers what we need, and in principle allows a

Now there might be a point for more complex synchronization schemes in the future. But maybe that just needs a future extension. (The annoyance is that it would be nice to not need a try/except for such a future extension, but beggars can't be choosers.)
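One way to picture the synchronization point being debated is a producer that performs the device-to-host copy itself and synchronizes before returning, so no shared stream object is ever needed. A toy sketch under invented names (string device labels, a dict standing in for the capsule — nothing here is a real library's code):

```python
class Array:
    """Toy producer showing one way __dlpack__ could handle copy/device.

    Device names and the dict "capsule" are stand-ins; a real library
    returns a PyCapsule wrapping a DLManagedTensor(Versioned).
    """

    def __init__(self, values, device="cuda"):
        self.values = list(values)
        self.device = device

    def __dlpack__(self, *, stream=None, dl_device=None, copy=None):
        target = dl_device or self.device
        if copy is False and target != self.device:
            raise BufferError("cross-device export requires a copy")
        if copy is True or target != self.device:
            # A real library would enqueue the device-to-host transfer and
            # synchronize before returning, so the consumer never needs a
            # stream object it shares with the producer.
            data = Array(self.values, device=target)
        else:
            data = self
        # The copied-or-not signal is what the accompanying DLPack flag
        # (dmlc/dlpack#136) is meant to carry.
        return {"data": data, "is_copied": data is not self}
```

This matches the "synchronous copy" framing: the consumer can treat the returned data as ready, regardless of which stream the producer used internally.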
I see that you are continuing the discussion we had during the meeting on the subject
I think Sebastian is right to conclude that
The

Given no major concern with this PR based on the discussions in the original issue, in the call, and above, let me proceed to make the necessary changes to refine this PR.
Interesting, I didn't know that. I thought we just need to log in to HackMD to view it?
The new introspection API should allow that, but I agree that it's probably not worth it to go there.
No, HackMD has permissions. It's also transient; for any issue/PR with a relevant discussion we should be posting a summary on GitHub.
Sounds right. One thing that stood out to me related to this in this PR is that the "only CPU/host is supported" is part of the
Overall this is starting to look close to ready - a few more comments.
In the later revision, it was clarified that a
Indeed, yes. I was thinking we want to limit user-visible effects. If array library maintainers think they can get creative in
That's the part I am not sure about... The contents of both the dlpack PR and this PR are cross-referenced and intertwined. I'm fine going either way once @tqchen has time to take a look and offer us feedback. I can also break the dlpack PR into two (one to only add the new flag, another to update the C/Python docs) if it helps, but I wouldn't do so unless Tianqi agrees that it would help.
Applied in commit 338742a.
(Requested Aaron's review from the array-api-compat perspective.)
Thanks for looping me in. Personally, I find there are two goals in here which are slightly intermingled:

My main thought is that we should keep API support for G0 simple and ensure

Because G1 might indeed open up significant implementation overhead, as well as different possible ways of synchronization, I think it would merit a different API (

My main question is how to phase in the API, so most libraries can start with G0. But happy to hear others' thoughts as well.
Nobody is required to implement the cross-device API; we could bump the minor API version (so consumers can know whether or not the flag should be set). Flagging

For the array API standard, this is a bit different: because I think there the intention is to additionally require
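The "bump the minor version" idea boils down to a capability probe on the consumer side. A hedged sketch of what that negotiation could look like, using the `max_version` keyword that DLPack 1.0's `__dlpack__` accepts (the exact calling convention here is illustrative):

```python
def request_versioned(producer):
    """Sketch of consumer-side capability negotiation via max_version.

    Producers that understand DLPack >= 1.0 accept max_version and can
    return the versioned struct (which carries the flags bit mask);
    older producers raise TypeError on the unknown keyword, and we fall
    back to the legacy, flag-less path.
    """
    try:
        capsule = producer.__dlpack__(max_version=(1, 0))
        return capsule, True   # versioned: flags (e.g. IS_COPIED) available
    except TypeError:
        return producer.__dlpack__(), False  # legacy: no flags to inspect
```

Under this scheme a library can start with G0 only and simply never set the copy flag, while consumers still have a deterministic way to tell which case they are in.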
That is the right framing I think indeed.
I think a separate API would be more problematic to do, because:
I think one should always start there if that's an option. Does it really matter though if this is using
Thanks @rgommers for the comments. I think as long as packages can phase their implementation status this would be fine.
Thanks @tqchen @rgommers @seberg for the comments. Indeed, G0 (zero copy) has been the goal and we've achieved that since the first version of the standard. What's new with this PR is
Q: Does anyone know why the doc build has a warning?
I thought
Those warnings are a real pain. I pushed a fix. For why it was needed here and not in
All looks good to me now. It looks like all comments were addressed; I plan to merge this within the next day or two unless there are more comments.
Reminder to all: remember to review dmlc/dlpack#136 too, which goes hand-in-hand with this PR.
The accompanying DLPack PR was merged, and DLPack 1.0rc was released. Let's merge this. Thanks everyone for the discussion!
Blocked by #602. Closes #626. As discussed there, we expand the DLPack data exchange protocol to formalize copies. In particular, we allow two common needs:

1. Requesting a copy (via the `copy` argument). The default is still `None` ("may copy") to preserve backward compatibility and align with other constructors (e.g. `asarray`).
2. Copying to host (via the `device` argument, plus setting `copy=True`).

In principle, arbitrary cross-device copies could be allowed too, but the consensus in #626 was that limiting to device-to-host copies is enough for now. This also avoids the need for exhaustive library support and for clarifying unforeseen semantic issues.

This PR requires an accompanying PR to the DLPack repo (dmlc/dlpack#136), in order for the producer to signal whether a copy has been made (through a new bit mask), so please review both together.
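For reference, consuming that bit mask is a simple flags check on the versioned tensor struct. The constant values below mirror what `dlpack.h` defines for `DLManagedTensorVersioned.flags` (the header is the authoritative source; treat these values as an assumption here):

```python
# Bit-mask values for DLManagedTensorVersioned.flags; dlpack.h is the
# authoritative source for these constants.
DLPACK_FLAG_BITMASK_READ_ONLY = 1 << 0
DLPACK_FLAG_BITMASK_IS_COPIED = 1 << 1

def decode_flags(flags: int) -> dict:
    """Decode the flags a producer set on the versioned tensor struct."""
    return {
        "read_only": bool(flags & DLPACK_FLAG_BITMASK_READ_ONLY),
        "is_copied": bool(flags & DLPACK_FLAG_BITMASK_IS_COPIED),
    }
```

A consumer that sees `is_copied` set knows it owns a fresh buffer and can, for example, skip a defensive copy of its own.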