Support for non-contiguous ("custom") strides #14
-
I want to add two more use cases where custom strides are useful:

(1) Avoiding a copy during windowed operations. I know that one can often find specialist implementations of sliding-window kernels that significantly reduce constant overhead, but windowed views are an amazing prototyping tool, and once (if) there is a C version of the array API this will become very relevant.

import numpy as np
from numpy.lib.stride_tricks import as_strided

# note: qualitative example for unitary core dimensions
# (i.e. no padding, stride/step of 1, ...)
window_size = N
loop_dims = x.shape[:-1]
loop_strides = x.strides[:-1]
n_windows = x.shape[-1] - N + 1
itemsize = x.itemsize
view = as_strided(x, shape=(*loop_dims, n_windows, window_size), strides=(*loop_strides, itemsize, itemsize))

(2) Efficient ingestion of decoded data. Compression algorithms often operate on windows and produce result chunks. Video compression is a great example because it's currently relevant for deep learning. It typically operates on 16x16 pixel macro-blocks and, as a result, frames are padded so that their width and height are multiples of 16. The (raw) output of a video decoder is exactly this: a strided array with padding. With custom strides, we can directly present this result buffer to compliant array libraries (numpy, torch, ...) without the need to copy data.

# note: this is for a channel-first frame, e.g. YUV444
n_channels, n_rows, n_cols = (color, height, width) # known result shape
itemsize = 1 # 8 bits (1 byte) per channel; strides are given in bytes
# align plane and line strides to multiples of 16
# (aka. account for padding)
linesize = ((n_cols + 15) & ~15) * itemsize
planesize = ((n_rows + 15) & ~15) * linesize
# assume buffer implements python's buffer protocol
# (can the array API ingest python buffers?)
raw_array = np.frombuffer(buffer, dtype=np.uint8)
view = as_strided(raw_array, shape=(n_channels, n_rows, n_cols), strides=(planesize, linesize, itemsize))

The result is very efficient. Planes and rows remain aligned to 64-byte cache lines and we have at most 120 bits (15 bytes) of dead weight per row, which is minuscule compared to how long a row is for actual images these days (less than 1%). The dead weight also happens to

Maybe this is too specialist a use case and should be implemented for a specific library only. I do, however, feel that this implies reinventing the wheel several times in different places, which we can try to avoid. (Especially in the case of video decoding, where every CV/tensor library I am aware of currently maintains its own version of FFMPEG bindings at varying degrees of maturity.)
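For reference, the padded-frame case can be made runnable roughly as follows. This is only a sketch under assumptions: `round_up` and `view_padded_frame` are hypothetical helper names, the "decoder output" is a synthetic bytes object rather than anything produced by FFMPEG, and samples are taken to be 8-bit with both dimensions padded to multiples of 16.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def round_up(value, multiple=16):
    """Round value up to the next multiple (multiple must be a power of two)."""
    return (value + multiple - 1) & ~(multiple - 1)


def view_padded_frame(buffer, n_channels, n_rows, n_cols):
    """Return a read-only (channels, rows, cols) uint8 view of a padded planar buffer."""
    itemsize = 1  # one byte per 8-bit sample; strides are expressed in bytes
    linesize = round_up(n_cols) * itemsize   # bytes between consecutive rows
    planesize = round_up(n_rows) * linesize  # bytes between consecutive planes
    raw = np.frombuffer(buffer, dtype=np.uint8)
    if raw.size < n_channels * planesize:
        raise ValueError("buffer is smaller than the padded frame")
    # as_strided performs no bounds checking, hence the size check above.
    return as_strided(
        raw,
        shape=(n_channels, n_rows, n_cols),
        strides=(planesize, linesize, itemsize),
        writeable=False,
    )


# Synthetic stand-in for decoder output: 3 planes of 20x30 pixels padded to 32x32.
padded = (np.arange(3 * 32 * 32) % 251).astype(np.uint8).reshape(3, 32, 32)
frame = view_padded_frame(padded.tobytes(), n_channels=3, n_rows=20, n_cols=30)
```

Here `frame` exposes only the visible 20x30 region of each plane, while the padding stays in the underlying buffer and is simply skipped by the strides.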
-
I was digging into the standard yesterday evening, and I have one question/topic that I can't seem to find an answer to:
The only point on views that I've found is that they might be used by conforming strided array libraries (upstream), and that upstream should be very mindful about their in-place behavior because it can lead to unintuitive behavior. At the same time, such views are also a main performance selling point, so I was hoping to find support for it in the array API.
The background is that I've written some linalg code recently to create custom homogeneous rotation matrices in a batched fashion. One of the steps was to fill the diagonal of each result matrix with ones. For single matrices in numpy, this can be done via np.fill_diagonal(matrix, 1), but that doesn't support batches (there is no axes kwarg). To solve this efficiently, I used a view into the array with custom strides to create a batch of diagonal elements like so:

As you can see, no dimension has a stride of exactly itemsize, meaning that consecutive elements of the view don't touch each other in memory. Ofc, this has implications for efficiency (it is particularly cache-unfriendly), but that stems from the nature of the problem, not the choice of implementation. Anyhow, could I do this with the Array API?
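A strided diagonal view of the kind described above could look roughly like this in NumPy. This is a reconstruction under assumptions, not the snippet from the original post: `batched_diagonal_view` is a hypothetical helper name, and the input is taken to be a writeable array of shape (..., n, n).

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def batched_diagonal_view(matrices):
    """Return a (..., n) view onto the main diagonals of a (..., n, n) array."""
    *batch_shape, n_rows, n_cols = matrices.shape
    if n_rows != n_cols:
        raise ValueError("expected square matrices")
    *batch_strides, row_stride, col_stride = matrices.strides
    # Moving one step along the diagonal means one row down and one column right.
    return as_strided(
        matrices,
        shape=(*batch_shape, n_rows),
        strides=(*batch_strides, row_stride + col_stride),
    )


batch = np.zeros((4, 3, 3))
diagonals = batched_diagonal_view(batch)
diagonals[...] = 1  # writes through the view into every matrix of the batch
```

For a C-contiguous float64 batch of shape (4, 3, 3), the view has strides of 72 and 32 bytes, so neither dimension steps by the 8-byte itemsize. Plain integer indexing such as `batch[..., np.arange(3), np.arange(3)] = 1` gives the same result in NumPy without touching strides.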