Adding .view(...)/reinterpret dtype method #266

Closed
jakirkham opened this issue Sep 16, 2021 · 10 comments
Labels
API extension Adds new functions or objects to the API.

Comments

@jakirkham
Member

In NumPy (and some other libraries) arrays have a method to view the data as another dtype. This is different from astype, as it takes data that may not be typed (like bytes or bytearray) and applies different dtype metadata on top of it. As an example, reinterpreting the data in this way can be particularly useful in a distributed setting, where the data goes through serialization/deserialization steps in which metadata is extracted, sent along, and then reapplied to the data. This can come up in other situations as well.
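A minimal sketch of the serialization round trip described above, assuming NumPy. The helper names `serialize` and `deserialize` and the header layout are hypothetical, chosen only for illustration:

```python
import numpy as np

def serialize(arr):
    # Hypothetical helper: split an array into metadata and raw bytes.
    header = {"dtype": arr.dtype.str, "shape": arr.shape}
    return header, arr.tobytes()

def deserialize(header, payload):
    # The payload first shows up as untyped bytes, which NumPy exposes as uint8.
    raw = np.asarray(memoryview(payload))
    # Reapply the dtype metadata without converting values, then the shape.
    return raw.view(header["dtype"]).reshape(header["shape"])

original = np.arange(6, dtype=np.float32).reshape(2, 3)
header, payload = serialize(original)
restored = deserialize(header, payload)
```

Here `view` only changes how the existing bytes are interpreted, whereas `astype` would convert each uint8 value to a float.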

cc @rgommers @kgryte (since we discussed this briefly earlier)

@rgommers
Member

> As an example reinterpreting the data in this way can be useful particularly in distributed setting where the data goes through serialization/deserialization steps where metadata is extracted, sent along, and then reapply to the data.

This is different from numpy.ndarray.view, right? In the latter case there is already an array instance, which must have a well-defined dtype; it's not just a block of memory. This particular example sounds closest to frombuffer. I'm left wondering a little why the deserialization doesn't use the correct metadata immediately, though. Can you point to a concrete example?

@jakirkham
Member Author

One could do np.asarray(memoryview(buf)).view(fmt), for example. Though yes, there are similarities to np.frombuffer.

The deserialization can't use the correct metadata immediately because the memory is allocated to receive the message before any of that information arrives (it needs to be written somewhere in memory). Only after the metadata and data are stored can they go through the deserialization process.
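A rough sketch of that ordering, assuming NumPy. The `bytearray` stands in for a receive buffer, and the format string `"<f4"` (little-endian float32) is an assumed piece of metadata that arrives after the data:

```python
import numpy as np

# Receive buffer allocated up front; only its size in bytes is known.
buf = bytearray(12)

# Stand-in for a network receive that fills the buffer in place.
buf[:] = b"\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@"

# The metadata is only available afterwards (assumed value for illustration).
fmt = "<f4"

# Patch the dtype onto the bytes that are already in memory.
arr = np.asarray(memoryview(buf)).view(fmt)

# The array aliases the buffer, so writing through it changes the bytes.
arr[0] = 1.0
```

No copy is made at any point; the array and the receive buffer refer to the same memory.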

@rgommers
Member

> One could do np.asarray(memoryview(buf)).view(fmt) for example

Equivalent to np.asarray(memoryview(buf), dtype=fmt)?

I think I understand the use case, but there's no way to get an array that's untyped in the API, so the "reinterpret memory" use case seems quite niche. And I expect that there will be libraries that don't allow this kind of thing, because memory layout is an implementation detail not exposed to the user. So I'm leaning towards "out of scope" here.

It seems like serialization falls under I/O, which is out of scope completely.

@jakirkham
Member Author

> > One could do np.asarray(memoryview(buf)).view(fmt) for example
>
> Equivalent to np.asarray(memoryview(buf), dtype=fmt)?

Not if dtype=... means .astype(...). I think this gets back into our discussion earlier.

Maybe a short example helps? Imagine b is received over the wire along with relevant metadata. The data is three float32 numbers (IOW Out[3] is what we want).

In [1]: import numpy as np

In [2]: b = b"\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@"

In [3]: np.asarray(memoryview(b)).view(np.float32)
Out[3]: array([0., 1., 2.], dtype=float32)

In [4]: np.asarray(memoryview(b), dtype=np.float32)
Out[4]: 
array([  0.,   0.,   0.,   0.,   0.,   0., 128.,  63.,   0.,   0.,   0.,   64.], dtype=float32)

> I think I understand the use case, but there's no way to get an array that's untyped in the API, so the "reinterpret memory" use case seems quite niche.

In our usual case it is not so much that the data is untyped, but that the type doesn't necessarily match what it should be. Taking the example above, we have...

In [6]: np.asarray(memoryview(b)).dtype
Out[6]: dtype('uint8')

IOW we often have something that is uint8 or int8.
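For comparison, a small sketch (assuming NumPy and little-endian float32 data) showing that np.frombuffer, mentioned earlier in the thread, gives the same reinterpretation of these bytes as the uint8 array followed by view:

```python
import numpy as np

b = b"\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@"

# Route 1: wrap the bytes as uint8, then reinterpret as float32.
via_view = np.asarray(memoryview(b)).view("<f4")

# Route 2: ask for float32 directly when wrapping the buffer.
via_frombuffer = np.frombuffer(b, dtype="<f4")

# Both read the same 12 bytes as three float32 values.
same_values = np.array_equal(via_view, via_frombuffer)

# Both arrays alias the memory of `b` rather than copying it.
shares = np.shares_memory(via_view, via_frombuffer)
```

The difference is the starting point: frombuffer needs a buffer-exporting object, while view starts from an array that already exists with the "wrong" dtype.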

> And I expect that there will be libraries that don't allow this kind of thing, because memory layout is an implementation detail not exposed to the user. So I'm leaning towards "out of scope" here.

For clarity, am not looking to manipulate the underlying memory in any way and don't really care how it is represented. Am just trying to patch on the correct formatting. Another way to think of this would be altering the dtype DLPack might use. Suppose one could hack around with the DLPack representation before it goes through the protocol, but that feels a bit clumsy.

> It seems like serialization falls under I/O, which is out of scope completely.

It is certainly useful in I/O contexts (communication, file I/O, etc.). Though am not really looking for the protocol to handle the I/O portion or even serialization. Just the ability to perform this cast.

@rgommers
Member

Thanks, that is helpful. The "it has the wrong dtype" has come up in at least one other place I think, using DLPack to transfer bool arrays - those weren't supported, so it was done as uint8.
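The bool-as-uint8 situation can be sketched in NumPy as follows. This only illustrates the dtype patching on either side of a transfer; the DLPack exchange itself is left out, and the variable names are made up for the example:

```python
import numpy as np

mask = np.array([True, False, True])

# Stand-in for a transfer path that cannot carry bool: expose it as uint8.
as_uint8 = mask.view(np.uint8)

# On the receiving side, patch the intended dtype back on.
restored = as_uint8.view(np.bool_)
```

Both dtypes occupy one byte per element, so the shape is unchanged and no data is copied.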

I think the next step here is to figure out how other array libraries do this (if they allow it).

@kgryte
Contributor

kgryte commented Sep 20, 2021

Another use case for reinterpretation is the ability to convert to and from the underlying byte representation of floating-point numbers.

This is common in the implementation of transcendental functions, where you want to manipulate the underlying bits of an IEEE 754 floating-point number directly. Go, e.g., provides dedicated APIs for such reinterpretation (Float64bits and Float64frombits, albeit only operating on a single number). JavaScript exposes an ArrayBuffer from which typed array views can be instantiated, allowing floating-point <=> bits reinterpretation.

The ability to reinterpret the underlying memory (i.e., have a data "view") can certainly be useful in certain classes of numerical algorithms and when you want to vectorize operations. The ability to reinterpret without needing to perform a copy would afford performance benefits.
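As an illustration of this kind of bit-level access, here is a minimal NumPy sketch that pulls the sign and unbiased exponent out of float64 values by viewing them as uint64. The field widths follow the IEEE 754 binary64 layout; the variable names are only for the example:

```python
import numpy as np

x = np.array([1.0, 2.0, 0.5, -8.0], dtype=np.float64)

# Reinterpret the 64-bit floats as unsigned integers without copying.
bits = x.view(np.uint64)

# IEEE 754 binary64 layout: 1 sign bit, 11 exponent bits, 52 fraction bits.
sign = (bits >> np.uint64(63)).astype(np.int64)
biased_exponent = ((bits >> np.uint64(52)) & np.uint64(0x7FF)).astype(np.int64)

# Remove the exponent bias of 1023 to get the power of two.
exponent = biased_exponent - 1023
```

Because the view is vectorized and copy-free, the same shifts and masks apply to the whole array at once.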

Currently, the only way to achieve reinterpretation according to the specification is via either (1) manual iteration and data copy or (2) a combination of __dlpack__ and from_dlpack (see interchange), which may or may not involve data copy.
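For contrast, a sketch of what option (1) can look like in practice, assuming NumPy for the arrays and the standard-library struct module for the reinterpretation. This is one possible manual route, not something the specification prescribes:

```python
import struct
import numpy as np

raw = np.asarray(memoryview(b"\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@"))

# Manual route: iterate over the uint8 elements and rebuild a bytes object.
payload = bytes(int(v) for v in raw)

# Reinterpret the bytes as three little-endian float32 values.
values = struct.unpack("<3f", payload)

# Build a fresh array from the unpacked values; this is a copy, not a view.
copied = np.asarray(values, dtype=np.float32)
```

Every element passes through Python objects here, which is the cost a zero-copy reinterpretation would avoid.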

@kgryte kgryte added this to the v2022 milestone Oct 4, 2021
@kgryte kgryte added the API extension Adds new functions or objects to the API. label Oct 4, 2021
@jakirkham
Member Author

cc @seberg (in case you have thoughts on this one :)

@seberg
Contributor

seberg commented Oct 6, 2022

For the use-case of reading blobs from the buffer protocol, I prefer the frombuffer API. OTOH, I guess Dask cannot export buffers and it doesn't match well for a "reinterpret cast" of an existing array.
So there may be a need for view as well (which is a bit more generic, I guess?), although it seems less important.

@jakirkham
Member Author

Think the main value of view is it allows reinterpreting an existing array and knowing the end array type will be the same (the dtype is ofc changed).

Whereas with frombuffer, asarray, etc., one needs to know the type of the array to call the right function. With a method, this confusion can be avoided.

@rgommers rgommers removed this from the v2022 milestone Dec 14, 2022
@kgryte
Contributor

kgryte commented Jun 29, 2023

As this proposal is currently without a champion, I'll go ahead and close.

@kgryte kgryte closed this as completed Jun 29, 2023

4 participants