Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support copy and device keywords in from_dlpack #741

Merged
merged 13 commits into from
Feb 14, 2024

Conversation

leofang
Copy link
Contributor

@leofang leofang commented Feb 7, 2024

Blocked by #602.

Close #626. As discussed there, we expand the DLPack data exchange protocol to formalize copies. In particular, we allow two common needs:

  1. Explicitly enable/disable data copies (through the new copy argument). The default is still None ("may copy") to preserve backward compatibility and align with other constructors (ex: asarray).
  2. Allow copying to CPU (through the new device argument, plus setting copy=True)

In principle, arbitrary cross-device copies could be allowed too, but the consensus in #626 was that limiting to device-to-host copies is enough for now. This also avoids the needs of actual exhaustive library support and clarifying unforeseen semantics issues.

This PR requires an accompanying PR to the DLPack repo (dmlc/dlpack#136), in order for the producer to signal if a copy has been made (through a new bit mask), so please review both together.

@leofang

This comment was marked as outdated.

@leofang
Copy link
Contributor Author

leofang commented Feb 8, 2024

src/array_api_stubs/_draft/array_object.py Outdated Show resolved Hide resolved
src/array_api_stubs/_draft/array_object.py Outdated Show resolved Hide resolved
@rgommers
Copy link
Member

rgommers commented Feb 8, 2024

Thanks Leo! Should I take over the first two commits in gh-602?

@leofang
Copy link
Contributor Author

leofang commented Feb 8, 2024

Let's just review/merge #602 first, this branch was based on the one there.

@rgommers rgommers added topic: DLPack DLPack. API extension Adds new functions or objects to the API. labels Feb 8, 2024
Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a timeline issue here as well? from_dlpack should do nothing with the new copy and device keywords beyond forwarding them to the producer. So if the producer doesn't support that functionality yet, what happens?

src/array_api_stubs/_draft/creation_functions.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving two notes below.

First, in the call today we discussed one issue in this PR. When two libraries exchange data via copying, potentially across devices, there might not be a common stream object that can be recognized and used by both libraries. The discussed solutions involve different ways of requesting synchronous copies (see HackMD).

See the 2nd note below.

src/array_api_stubs/_draft/creation_functions.py Outdated Show resolved Hide resolved
@rgommers
Copy link
Member

rgommers commented Feb 9, 2024

When two libraries exchange data via copying, potentially across devices, there might not be a common stream object that can be recognized and used by both libraries. The discussed solutions involve different ways of requesting synchronous copies (see HackMD).

In that discussion I was perhaps missing a more precise problem statement. Also note that that HackMD link isn't public, so here is what it says for potential solutions:

@seberg suggested: The important thing is that stream=None (or not passing) synchronizes if you specify device="cpu". What happens if a stream is passed, seems not to important to me the consumer passed it, so they must expect the right thing.

@leofang suggested: piggyback on stream=None to request sync copy, when device is not the same. Solution 2: add stream=-2 to signal the two devices do not have a common language (stream/queue).

I had a look at what PyTorch does, and it's basically the same as @seberg's suggestion I believe. From https://pytorch.org/docs/master/notes/cuda.html#asynchronous-execution: In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation will proceed as if every operation was executed synchronously. It goes on to explain a non_blocking keyword, but that's not relevant to the kDLCPU device (only pinned memory).

I am starting to think that defining synchronization based on the target device (i.e. the one passed in) may have some advantages:

Since we only want to enable copying to host (where there is no stream concept) at this moment, do we really need to go into this for the purposes of getting this PR in? It looks to me like this can be deferred.

@seberg
Copy link
Contributor

seberg commented Feb 9, 2024

do we really need to go into this for the purposes of getting this PR in? It looks to me like this can be deferred.

Well, there are two things we need, I guess:

  • We need to specify that device="cpu" without passing a stream must synchronize fully.
  • We shouldn't have random decisions about what a passed stream and device means. So arguably should say that a producer must ignore the stream object for the time being unless they are 100% sure that UB is impossible for some reason. I.e. as of now it is unclear if device="cpu", stream=123123 actually means a cuda stream and cupy must not hope it does.

Streams really should always have been fully type safe, but OK (which would resolve this, we could just say: one of them, figure it out based on type).

Maybe I should change my proposal to, this: Streams belong to the target device unless the producer knows it cannot possibly belong to it (the stream is typed to be a CudaStream, say an integer subclass). The important part being here that the default must be in terms of the target!

@rgommers
Copy link
Member

rgommers commented Feb 9, 2024

So arguably should say that a producer must ignore the stream object for the time being unless they are 100% sure that UB is impossible for some reason. I.e. as of now it is unclear if device="cpu", stream=123123 actually means a cuda stream and cupy must not hope it does.

I'd say that:

  1. The description of the stream parameter in array_api.array.__dlpack__.html seems pretty clear that the stream number is that of the consumer.
  2. Providing a stream number for a "to host" operation is nonsensical input I'd say, which we tend not to specify. If anything, let's make it even clearer that (1) is the case. However, I don't think this can be much clearer: The ownership of the stream stays with the consumer. On CPU and other device types without streams, only None is accepted.

Since only None is accepted, changing that to now require something else (like -2) looks like an unnecessary backwards compat break.

We need to specify that device="cpu" without passing a stream must synchronize fully.

That sounds good to add. Needs to avoid the string "cpu" and use text like "when the device= keyword receives an argument representing the CPU/host device".

@seberg
Copy link
Contributor

seberg commented Feb 9, 2024

Well, the more I think about it the more I think it makes sense to say:

If dl_device= is passed the stream must refer to the specified target devices definition for streams. For example if the dl_device=CPU only stream=None is valid and full synchronization must occur even if the data is currently on a CUDA device.
A producer may ignore the stream passed and opt to the safe default synchronization of the target device.

That covers what we need, and in principle allows a cpu -> gpu exporter to use a stream. A gpu -> cpu can also use stream if the CPU is specified as CUDAManaged or CUDAHost.
The main thing is: It doesn't leave an odd gap in the spec.

Now there might be a point for more complex synchronization schemes in the future. But maybe that just needs a future extension. (The annoyance is that it would be nice to not need a try/except for such a future extension, but beggars can't be choosers)

@leofang
Copy link
Contributor Author

leofang commented Feb 9, 2024

I see that you are continuing the discussion we had during the meeting on the subject

When two libraries exchange data via copying, potentially across devices, there might not be a common stream object that can be recognized and used by both libraries.

I think Sebastian is right to conclude that

If dl_device= is passed the stream must refer to the specified target devices definition for streams.

The stream object and the dl_device object must go in as a valid pair. One subtlety is that while it is in theory possible for a D2H copy to be asynchronously performed on a stream, with the current "one-way" DLPack protocol there is no safe way to allow this. DLPack is one-way in the sense that during handshaking only the consumer would offer stream (and now dl_device too, with this PR) to the producer, not the other way around.1

Given no major concern with this PR based on the discussions in the original issue, in the call, and above, let me proceed to make necessary changes to refine this PR.

Also note that that HackMD link isn't public, so here is what it says ...

Interesting, I didn't know that. I thought we just need to log in HackMD to view it?

Footnotes

  1. It would be a two-way handshaking if both parties offer their dl_device and stream for negotiation -- then even D2H copies can be done asynchronously, by a bunch of complex logic checking device compatibility and establishing ordering. (By the way, when I developed the draft and exchanged with Sebastian offline, I somehow mistakenly assumed we had two-way handshaking, or at least a part of it -- that a consumer can query what devices the producer supports.) But the ship has sailed and I guess it'd make the implementation very complicated that it's probably not worth it, even if we could start DLPack over from scratch.

@rgommers
Copy link
Member

rgommers commented Feb 9, 2024

or at least a part of it -- that a consumer can query what devices the producer supports.

The new introspection API should allow that, but I agree that it's probably not worth it to go there.

I thought we just need to log in HackMD to view it?

No, HackMD has permissions. It's also transient; for any issue/PR with a relevant discussion we should be posting a summary on GitHub.

If dl_device= is passed the stream must refer to the specified target devices definition for streams.

Sounds right. One thing that stood out to me related to this in this PR is that the "only CPU/host is supported" is part of the from_dlpack changes, but the __dlpack__ changes are more general. Is that on purpose (if so, I don't mind)?

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is starting to look close to ready - a few more comments.

src/array_api_stubs/_draft/array_object.py Outdated Show resolved Hide resolved
src/array_api_stubs/_draft/array_object.py Outdated Show resolved Hide resolved
src/array_api_stubs/_draft/creation_functions.py Outdated Show resolved Hide resolved
src/array_api_stubs/_draft/creation_functions.py Outdated Show resolved Hide resolved
src/array_api_stubs/_draft/creation_functions.py Outdated Show resolved Hide resolved
src/array_api_stubs/_draft/creation_functions.py Outdated Show resolved Hide resolved
@leofang
Copy link
Contributor Author

leofang commented Feb 12, 2024

Is there a timeline issue here as well? from_dlpack should do nothing with the new copy and device keywords beyond forwarding them to the producer. So if the producer doesn't support that functionality yet, what happens?

In the later revision, this was clarified that a try-except should be used to check if copy and device keywords are supported.

One thing that stood out to me related to this in this PR is that the "only CPU/host is supported" is part of the from_dlpack changes, but the __dlpack__ changes are more general. Is that on purpose (if so, I don't mind)?

Indeed, yes. I was thinking we want to limit user-visible effects. If array library maintainers think they can get creative in __dlpack__ in a way that wouldn't interfere with future evolutions of the array API standard, they can do so.

@leofang
Copy link
Contributor Author

leofang commented Feb 12, 2024

I guess it'll stay in draft until there is movement on dmlc/dlpack#136 (since this PR explicitly mentions setting a to-be-added C level flag)?

That's the part I am not sure... The contents in both the dlpack PR and this PR are cross referenced and intertwined. I am fine to go either way, once @tqchen has time to take a look and offer us feedbacks. I can also break the dlpack PR into two (one is to only add the new flag, another is updating the C/Python docs) if it helps, but I wouldn't do so unless Tianqi agrees that it would help.

I want the verbiage used in the spec to reflect the essential requirement mandated by the standard to make data available to Python interpreter process.

Applied in commit 338742a.

@leofang leofang requested a review from asmeurer February 12, 2024 01:31
@leofang
Copy link
Contributor Author

leofang commented Feb 12, 2024

(Requested Aaron's review from the array-api-compat perspective.)

@tqchen
Copy link

tqchen commented Feb 12, 2024

Thanks for looping me in. Personally, I find there are two goals in here which are slightly intermingled

  • G0: The ability to do zero copy exchange
  • G1: The ability to cross device copy

My main thought is that we should keep API support for G0 to be simple and ensure from_dlpack to come with clear intent as in zero copy.

Because G1 might indeed opens up significant implementation overhead, as well as different possible ways of synchronization that I think would merit a different API(to_device?).

My main question is how to phase out the API, so most libraries can start with G0. But happy to hear about others thoughts as well

@seberg
Copy link
Contributor

seberg commented Feb 12, 2024

My main question is how to phase out the API, so most libraries can start with G0

Nobody is required to implement the cross device API, we could bump minor API (so consumers can know whether or not the flag should be set). Flagging IS_COPY (and a code change) is thus only needed if the library is atypical and doesn't use zero-copy exchange.

For the Array-API, this is a bit different: because I think there the intention is to additionally require dl_device="cpu" to work. A maybe slightly overpowered way to allow comparing two libraries in tests (e.g. by converting both to the same CPU library, like NumPy).

@rgommers
Copy link
Member

Personally, I find there are two goals in here which are slightly intermingled

  • G0: The ability to do zero copy exchange
  • G1: The ability to cross device copy

That is the right framing I think indeed.

I think would merit a different API(to_device?).

I think a separate API would be more problematic to do, because:

  1. If two libraries don't support the same devices, anything like to_device will not work. Asking a library like CuPy to return a DLPack with host memory for under-the-hood usage is one thing; asking it to do so in a public/user-facing API looks like a nonstarter.
  2. DLPack is the only candidate for implementing this under the hood; the device support is needed and related issues like streams have been worked out. API-wise a new function not called from_dlpack but still doing everything through DLPack would lead to a strange asymmetry.

so most libraries can start with G0

I think one should always start there if that's an option. from_dlpack(y, device=x.device) will be zero-copy if the data is already on the same device. If we really wanted to ensure always-zero-copy by default, then the copy keyword could default to False rather than None; that way from_dlpack(y, device=x.device) would never be useful and authors would have to write from_dlpack(y, copy=None, device=x.device) (or copy=True).

Does it really matter though if this is using from_dlpack with the device keyword or some other function like xp.from_dlpack_with_devicetransfer? The former seems preferable, and I don't think it detracts from the "prefer zero-copy" nature of DLPack.

@tqchen
Copy link

tqchen commented Feb 13, 2024

Thanks @rgommers for the comments. I think as long as packages can phase their implementation status this would be fine.

@leofang leofang marked this pull request as ready for review February 13, 2024 14:49
@leofang
Copy link
Contributor Author

leofang commented Feb 13, 2024

Thanks @tqchen @rgommers @seberg for the comments. Indeed, G0 (zero copy) has been the goal and we've achieved that since the first version of the standard. What's new with this PR is

  1. we also enable G1 (cross-device copy) using DLPack (for which Ralf gave the rationale from to_device() -- any way to force back to host "portably?" #626; tl;dr is DLPack offers all the info we need to facilitate this, so no reason not use it), and
  2. use the kw arg copy to enforce G0 if needed (setting copy=False will ensure we never copy, a guarantee we couldn't have before).

@leofang
Copy link
Contributor Author

leofang commented Feb 13, 2024

Q: Does anyone know why the doc build has a warning?

/home/circleci/repo/src/array_api_stubs/_draft/array_object.py:docstring of array_api_stubs._draft.array_object._array.__dlpack__:1: WARNING: py:class reference target not found: Enum

I thought Enum has been used in the return value type before and it worked just fine?

@leofang leofang added this to the v2023 milestone Feb 13, 2024
@rgommers
Copy link
Member

I thought Enum has been used in the return value type before and it worked just fine?

Those warnings are a real pain. I pushed a fix. For why it was needed here and not in __dlpack_device__ right below it, I have no explanation beyond "autodoc is buggy"

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good to me now. It looks like all comments were addressed; I plan to merge this within the next day or two unless there are more comments.

@leofang
Copy link
Contributor Author

leofang commented Feb 13, 2024

Reminder to all: Remember to review dmlc/dlpack#136 too, which goes hand-in-hand with this PR.

@leofang
Copy link
Contributor Author

leofang commented Feb 14, 2024

The accompanying DLPack PR was merged, and DLPack 1.0rc was released. Let's merge this. Thanks everyone for the discussion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API extension Adds new functions or objects to the API. topic: DLPack DLPack.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_device() -- any way to force back to host "portably?"
5 participants