Support for non-contiguous ("custom") strides #14

FirefoxMetzger · 2023-02-22T04:48:25Z

I was digging into the standard yesterday evening, and I have one question/topic that I can't seem to find an answer to:

Does the API support creating views with custom strides? Especially when those strides are non-contiguous?

The only point on views that I've found is that they might be used by conforming strided array libraries (upstream), and that upstream should be very mindful about their in-place behavior because it can lead to unintuitive behavior. At the same time, such views are also a main performance selling point, so I was hoping to find support for it in the array API.

The background is that I've written some linalg code recently to create custom homogeneous rotation matrices in a batched fashion. One of the steps was to fill the diagonal of each result matrix with ones. For single matrices in numpy, this can be done via np.fill_diagonal(matrix, 1), but that doesn't support batches (there is no axes kwarg). To solve this efficiently, I used a view into the array with custom strides to create a batch of diagonal elements like so:

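import numpy as np
from numpy.lib.stride_tricks import as_strided

# note: I *knew* that the last 2 dimensions are always core and others loop
# I also *knew* that the matrix is always 4x4 (homogeneous matrix for 3D)
# and that the array is C-contiguous
n_matrices = np.prod(matrix.shape[:-2], dtype=int)  # product of the loop dims only
itemsize = matrix.itemsize
# 16 items between matrices, 5 items (one row plus one column) between diagonal entries
view = as_strided(matrix, shape=(n_matrices, 4), strides=(16 * itemsize, 5 * itemsize))
view[:] = 1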

As you can see, no stride is exactly itemsize, meaning that consecutive elements of the view are not adjacent in memory. Of course, this has implications for efficiency (it is particularly cache-unfriendly), but that stems from the nature of the problem, not the choice of implementation. Anyhow, could I do this with the Array API?
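For anyone who wants to try this, here is a minimal, self-contained sketch of the trick above, checked against plain integer-array indexing. It assumes a C-contiguous NumPy array whose last two axes are 4x4; the function names are made up for illustration and are not part of any library.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def fill_diagonals_strided(matrix, value=1):
    # Assumption: matrix is C-contiguous and its last two axes are 4x4.
    n_matrices = int(np.prod(matrix.shape[:-2], dtype=int))
    itemsize = matrix.itemsize
    # Step 16 items to reach the next matrix, 5 items to reach the next diagonal entry.
    view = as_strided(
        matrix, shape=(n_matrices, 4), strides=(16 * itemsize, 5 * itemsize)
    )
    view[:] = value
    return matrix


def fill_diagonals_indexed(matrix, value=1):
    # Same result without touching strides: index the diagonal of every matrix.
    idx = np.arange(matrix.shape[-1])
    matrix[..., idx, idx] = value
    return matrix


strided_result = fill_diagonals_strided(np.zeros((2, 3, 4, 4)))
indexed_result = fill_diagonals_indexed(np.zeros((2, 3, 4, 4)))
```

Both variants mutate the input in place; the indexed one does not depend on the memory layout, which is the part that makes the strided one hard to port.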

FirefoxMetzger · 2023-02-22T08:50:50Z

I want to add two more use cases where custom strides are useful:

(1) Avoiding a copy during windowed operations. I know that one can often find specialist implementations for sliding window kernels that significantly reduce constant overhead, but windowed views are an amazing prototyping tool, and if there is ever a C version of the array API this will become very relevant.

from numpy.lib.stride_tricks import as_strided

# note: qualitative example for unitary core dimensions
# (i.e. no padding, stride/step of 1, ...)
window_size = N
loop_dims = x.shape[:-1]
loop_strides = x.strides[:-1]
n_windows = x.shape[-1] - window_size + 1
step = x.strides[-1]  # equals x.itemsize when the last axis is contiguous
view = as_strided(x, shape=(*loop_dims, n_windows, window_size), strides=(*loop_strides, step, step))
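As a sanity check, the windowed view above can be compared against NumPy's own sliding_window_view. This is only a sketch on a small made-up input; the view is created read-only because overlapping windows alias the same memory.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

x = np.arange(10.0).reshape(2, 5)  # small example input (an assumption)
window_size = 3

n_windows = x.shape[-1] - window_size + 1
step = x.strides[-1]  # byte distance between neighbours on the last axis

# Windows overlap in memory, so keep the view read-only to avoid surprises.
view = as_strided(
    x,
    shape=(*x.shape[:-1], n_windows, window_size),
    strides=(*x.strides[:-1], step, step),
    writeable=False,
)

reference = sliding_window_view(x, window_size, axis=-1)
```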

(2) Efficient ingestion of decoded data. Compression algorithms often operate on windows and produce result chunks. Video compression is a great example because it's currently relevant for deep learning. It typically operates on 16x16 pixel macro-blocks and, as a result, frames are padded to have multiples of 16 as shape (width/height). The (raw) output of a video decoder is exactly this: a strided array with padding. With custom strides, we can directly present this result buffer to compliant array libraries (numpy, torch, ...) without the need to copy data.

import numpy as np
from numpy.lib.stride_tricks import as_strided

# note: this is for a channel-first frame eg. YUV444
n_channels, n_rows, n_cols = (color, height, width)  # known result shape
itemsize = 1  # bytes per sample (8 bits per channel)

# align plane and line strides to multiples of 16
# (aka. account for padding)
linesize = ((n_cols + 15) & ~15) * itemsize
planesize = ((n_rows + 15) & ~15) * linesize

# assume buffer implements python's buffer protocol
# (can the array API ingest python buffers?)
raw_array = np.frombuffer(buffer, dtype=np.uint8)
view = as_strided(raw_array, shape=(n_channels, n_rows, n_cols), strides=(planesize, linesize, itemsize))
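To make the padding arithmetic concrete, here is a runnable sketch that uses a zero-filled bytearray as a stand-in for the decoder output. The frame size and the align16 helper are assumptions chosen for illustration, not anything a real decoder provides.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def align16(n):
    # Round n up to the next multiple of 16 (hypothetical helper).
    return (n + 15) & ~15


n_channels, n_rows, n_cols = 3, 18, 20  # made-up frame size
itemsize = 1  # one byte per 8-bit sample

linesize = align16(n_cols) * itemsize   # bytes per padded row
planesize = align16(n_rows) * linesize  # bytes per padded plane

# Stand-in for the decoder's output buffer: padded and zero-filled.
buffer = bytearray(n_channels * planesize)

raw_array = np.frombuffer(buffer, dtype=np.uint8)
frame = as_strided(
    raw_array,
    shape=(n_channels, n_rows, n_cols),
    strides=(planesize, linesize, itemsize),
)

# Writing through the view lands in the original buffer, so nothing was copied.
frame[1, 2, 3] = 255
```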

The result is very efficient. Planes and rows remain aligned to 64-byte cache lines and we have at most 120 bits (15 ints) of dead weight per row, which is minuscule compared to how long a row is for actual images these days (less than 1%). The dead weight also happens to align rows on 128-bit multiples (register length for SSE) so we have an easier time writing vectorized kernels for row-wise operations. In short, we can expect performance very close to that of contiguous arrays while avoiding a full copy of the frame.

Maybe this is too specialist a use-case and should be implemented for a specific library only. I do, however, feel that this implies reinventing the wheel several times in different places, which we can try and avoid. (Especially in the case of video decoding, where every CV/tensor library I am aware of currently maintains its own version of FFMPEG bindings at varying degrees of maturity.)

3 replies

rgommers Feb 22, 2023
Maintainer

Hi @FirefoxMetzger, thanks for the question and details on your use case. The very short answer is "no, there is no concept of (strided) views, and that is on purpose". I think you're asking two related but separate questions, one about why there aren't strided views, and one about implementing some use cases which rely on as_strided in a performant way.

The issue with something like a "strided view" is that memory layout is now managed by the user at a very granular level. Different array libraries work differently here, some not supporting views or non-contiguous arrays at all. So clearly that is not a portable concept for a standard. For a similar question and responses, see data-apis/array-api#571.

> At the same time, such views are also a main performance selling point, so I was hoping to find support for it in the array API.

I look at this slightly differently. Strides make sense conceptually at the end user level ("take every third element along this axis" is common and perfectly sensible). Views are an implementation detail that is needed for performance, but exposing that detail to end users was a mistake in the early NumPy design - the concept is not needed at the end user level; copy-on-write would give basically similar benefits with much clearer semantics.
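To illustrate the distinction, "take every third element" is just a slice with a step. The sketch below uses NumPy as a stand-in for any conforming namespace; whether the result shares memory with x is exactly the implementation detail in question, so the code does not rely on it.

```python
import numpy as np

x = np.arange(10)

# Strided selection at the user level: every third element along the axis.
every_third = x[::3]
```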

> Maybe this is too specialist a use-case and should be implemented for a specific library only. I do, however, feel that this implies reinventing the wheel several times in different places, which we can try and avoid.

There is something important here in your question though, that probably deserves a better answer. For as_strided and windowed operations, what comes to mind is:

- Why JAX doesn't have as_strided: "Equivalent of np.lib.stride_tricks.as_strided" (jax-ml/jax#3171) and "no api: numpy.lib.stride_tricks.as_strided" (jax-ml/jax#11354).
- NumPy has a sliding_window_view function since v1.20.0: https://numpy.org/devdocs/reference/generated/numpy.lib.stride_tricks.sliding_window_view.html.

sliding_window_view has the problem that it explicitly creates a view, and a large array. For what you actually want, that's only a temporary array though. So a function like

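from numpy.lib.stride_tricks import sliding_window_view

def moving_window(x, size, func):
    # func should be a reduction here; reduce over the trailing window axis
    return func(sliding_window_view(x, size), axis=-1)

moving_window(x, 3, xp.mean)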

could express windowing operations quite well, without explicitly dealing with views or non-contiguous memory. This is something that's more commonly used with dataframes (see, e.g., https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html#window-generic).
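As a rough illustration of why such a function need not expose views: for a moving mean, a library could get the same numbers from prefix sums without ever building a windowed array. This is a sketch for 1-D input only, and moving_mean_cumsum is a hypothetical name, not an existing API.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def moving_mean_cumsum(x, size):
    # Prefix sums with a leading zero: csum[k] is the sum of x[:k].
    csum = np.concatenate(([0.0], np.cumsum(x, dtype=float)))
    # Each window sum is the difference of two prefix sums.
    return (csum[size:] - csum[:-size]) / size


x = np.arange(10.0)
via_windows = sliding_window_view(x, 3).mean(axis=-1)
via_cumsum = moving_mean_cumsum(x, 3)
```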

> (can the array API ingest python buffers?)

Yes, asarray supports the buffer protocol.
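A minimal sketch of what that looks like, using NumPy's asarray as a stand-in for a conforming namespace and the standard library's array.array as the buffer-protocol object. Whether the data is copied is left to the library, so the example only checks dtype and values.

```python
import array

import numpy as np

# array.array exposes its memory through the Python buffer protocol.
buf = array.array("d", [1.0, 2.0, 3.0])

arr = np.asarray(buf)
```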

> With custom strides, we can directly present this result buffer to compliant array libraries (numpy, torch, ...) without the need to copy data.

This is more tricky perhaps. It's about directly interpreting raw memory. Could you have a look at the discussion in data-apis/array-api#266? I think that is essentially the same.

FirefoxMetzger Feb 22, 2023
Author

> I think you're asking two related but separate questions, one about why there aren't strided views, and one about implementing some use cases which rely on as_strided in a performant way.

You are right. It's essentially the same question (why no as_strided) from two angles. One asks about operating on arrays from conforming libraries in unconventional ways, and the other asks about having array-like data from elsewhere ingested by conforming libraries.

> Different array libraries work differently here, some not supporting views or non-contiguous arrays at all. So clearly that is not a portable concept for a standard.

Hm could you elaborate why this is reason enough to avoid as_strided? In my mind, all array libraries implement logic/kernels on top of a contiguous 1D buffer over which they walk in some strided fashion. From there, you almost immediately arrive at custom striding by allowing these strides to be variable or by allowing buffer sharing, so even if a library doesn't implement custom strides today it seems like something that can be easily added or emulated (though perhaps not in a crazy performant way).
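To spell out the mental model: an element's position in the flat buffer is the dot product of its index with the strides, so a different view is just a different stride tuple over the same buffer. A toy sketch in pure Python, with strides counted in elements rather than bytes and element_offset as a made-up helper:

```python
def element_offset(index, strides):
    # Offset into the flat buffer: sum of index * stride over all axes.
    return sum(i * s for i, s in zip(index, strides))


flat = list(range(12))     # backing buffer for a 3x4 row-major array
row_major_strides = (4, 1)
transposed_strides = (1, 4)  # same buffer, axes swapped

value = flat[element_offset((2, 1), row_major_strides)]
same_value = flat[element_offset((1, 2), transposed_strides)]
```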

Also, don't we have a similar compatibility issue with array.to_device? Some libraries (like numpy) don't offer support for multiple devices, so I feel like there is more to "as_strided doesn't work" than it not being part of a minimal set of currently supported common functions.

> I look at this slightly differently. [...] Views are an implementation detail [and] is not needed at the end user level; copy-on-write would give basically similar benefits with much clearer semantics.

Interesting. To me, views have always been one of the main selling points for numpy. Python - being interpreted - doesn't have an easy way to remove unnecessary buffer copies. Views introduce something akin to (very fancy) pointers allowing you to optimize this manually, and often doing just this makes your script fast enough to avoid having to introduce a compiler or custom C kernel.

Copy-on-write does do a lot of good things for sure, but I don't think it allows a full replacement of views. How would I go about writing to slices/views of an array? For example, how would I set the diagonal of a large batch of matrices? Alternatively, how would I go about preprocessing a large dataset where I have to apply different transformations to different examples? Either case is straightforward with views, but I don't think COW can help us here.

> For as_strided and windowed operations, [...]

I'm not sure I follow. Are you suggesting that we introduce an xp.sliding_window_view function?

> Could you have a look at the discussion in data-apis/array-api#266?

Sure. I'll reply in that issue.

Having written the above, I just realized that we may have different users/consumers in mind when thinking about using the array API. "as_strided is an implementation detail" makes a lot of sense when thinking about end-users (eg. scientists or analysts) who use the array API to perform some analysis and now want to switch their "backend" from numpy to dask/cupy because their script takes too long on a single core. At the same time, there is also the engineer who uses the array API to implement performant backend-agnostic kernels and facilitate the aforementioned switching. For this second group, as_strided (or some other mechanism to tweak striding) seems very much in scope because it's these kinds of "tricks" that give you a 10x+ runtime improvement.

Could it be that you are thinking of users in the first group, and I am thinking about users from the second group?

rgommers Feb 23, 2023
Maintainer

> Hm could you elaborate why this is reason enough to avoid as_strided? In my mind, all array libraries implement logic/kernels on top of a contiguous 1D buffer over which they walk in some strided fashion. [...]

I think you have a point that it could be implemented in principle by every library. In the end, it's defining a new array with a given shape and values after all. If you leave out the explicit view / memory layout part though, it may be pretty niche and non-performant. I think the following illustrates why it's not a good fit for this standard:

- NumPy leaves as_strided out of the main namespace, and opens with a warning that says "use with extreme care".
- PyTorch does have it in its main namespace, but has a similar warning box, "Prefer using other view functions ...".
- JAX doesn't have it, and jax#11354 says "we should consider adding an as_strided API, with clear warnings about the lack of operational semantics guarantees. It may be a lower priority item though".

So what I meant was that the semantics that you get from numpy, which do hinge on views and memory layout, are not portable.

> Also, don't we have a similar compatibility issue with array.to_device?

Not the same, I think: that either works, or gives a clear exception when the device is not supported or not present on the system.

> Views introduce something akin to (very fancy) pointers allowing you to optimize this manually, and often doing just this makes your script fast enough to avoid having to introduce a compiler or custom C kernel.

That's a nice description. And yes, it's kinda correct, but just like pointers in C/C++ it's a major source of confusion and bugs. So it's unhealthy once you learn of better ways (like using a JIT compiler for performance).

> Copy-on-write does do a lot of good things for sure, but I don't think it allows a full replacement of views. How would I go about writing to slices/views of an array?

Not a 100% replacement indeed, but in a large majority of cases good enough. I investigated how much slice assignment inside loops (which is where you get the real performance hit) is used in the likes of SciPy and scikit-learn; the answer was "a lot less often than we expected". Array construction like assigning to the diagonal of a 2-D array is indeed more expensive with CoW. But it's not that common that it's so expensive that it would show up as performance-relevant for full programs.
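For the diagonal case specifically, a non-mutating formulation could look like the sketch below. It builds a new array with a boolean mask instead of writing into the existing one, so it costs a full copy, which is the extra expense mentioned above. NumPy stands in for a conforming namespace here, and with_unit_diagonal is a made-up name.

```python
import numpy as np


def with_unit_diagonal(matrix):
    # Return a new array whose trailing square matrices have ones on the
    # diagonal; the input is left untouched.
    n = matrix.shape[-1]
    mask = np.eye(n, dtype=bool)  # broadcasts over the leading batch axes
    one = np.asarray(1, dtype=matrix.dtype)
    return np.where(mask, one, matrix)


batch = np.zeros((2, 3, 4, 4))
result = with_unit_diagonal(batch)
```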

> Could it be that you are thinking of users in the first group, and I am thinking about users from the second group?

That's very well possible. However, I'd think that for the second group it's still, on average, healthier to start with a clean implementation and then, to squeeze out more performance, use JAX/PyTorch/Numba/etc., rather than resort to things like as_strided.
