GH-33923: [Docs] Tensor canonical extension type specification #33925

Merged
merged 27 commits into from
Mar 15, 2023
af571cb
Add Fixed size tensor spec to canonical extensions list
AlenkaF Jan 30, 2023
8231150
Apply suggestions from code review
AlenkaF Jan 30, 2023
884d871
Remove implementation-specific metadata
AlenkaF Jan 30, 2023
83edd70
Change order with is_row_major
AlenkaF Jan 30, 2023
16ef6f1
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Jan 30, 2023
4f4ccce
Update metadata part
AlenkaF Jan 30, 2023
92fd7c6
Correct True to true in json
AlenkaF Jan 31, 2023
7873676
Change name from fixed_size_tensor to fixed_shape_tensor
AlenkaF Jan 31, 2023
a4219e3
Add description for ListType parameters
AlenkaF Jan 31, 2023
37e83db
Change the description for ListType parameters
AlenkaF Feb 1, 2023
5c92ff0
Remove is_row_major from the spec
AlenkaF Feb 2, 2023
cb5e2dd
Add dim_names and permutation to optional metadata
AlenkaF Feb 15, 2023
b562b8d
Add notes to the usage of dim_names and permutations metadata
AlenkaF Feb 15, 2023
c44101b
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Feb 15, 2023
24e7c28
Add dim_names and permutation to optional parameters
AlenkaF Feb 15, 2023
333ae67
Add explicit explanation of permutation indices
AlenkaF Feb 15, 2023
4086dfb
Change order with layout
AlenkaF Feb 15, 2023
bd2a515
Rephrase text about absent permutation param
AlenkaF Feb 15, 2023
bc07d7a
Apply suggestions from code review - Joris
AlenkaF Feb 15, 2023
68c6244
Remove redundant sentence in permutations explanation
AlenkaF Feb 16, 2023
3e2bb25
Update value_type description
AlenkaF Feb 22, 2023
a49f14f
Update parameters description
AlenkaF Feb 22, 2023
89d8042
Add a logical layout shape example in the desc of the serialization
AlenkaF Feb 22, 2023
4ff7a65
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Feb 22, 2023
1daf820
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Feb 28, 2023
70059d9
Add note about IPC tensor
AlenkaF Mar 9, 2023
6f44296
Update docs/source/format/CanonicalExtensions.rst
AlenkaF Mar 10, 2023
74 changes: 73 additions & 1 deletion docs/source/format/CanonicalExtensions.rst
@@ -72,4 +72,76 @@ same rules as laid out above, and provide backwards compatibility guarantees.
Official List
=============

No canonical extension types have been standardized yet.
Fixed shape tensor
==================

* Extension name: ``arrow.fixed_shape_tensor``.

* The storage type of the extension: ``FixedSizeList`` where:

* **value_type** is the data type of individual tensor elements.
* **list_size** is the product of all the elements in the tensor shape.

* Extension type parameters:

* **value_type** = the Arrow data type of individual tensor elements.
* **shape** = the physical shape of the contained tensors
as an array.

Optional parameters describing the logical layout:

* **dim_names** = explicit names for the tensor dimensions
as an array. Its length should be equal to the length of
``shape``, that is, to the number of dimensions.

``dim_names`` can be used if the dimensions have well-known
names and they map to the physical layout (row-major).

* **permutation** = indices of the desired ordering of the
original dimensions, defined as an array.

The indices contain a permutation of the values [0, 1, .., N-1] where
N is the number of dimensions. The permutation indicates which
dimension of the logical layout corresponds to which dimension of the
physical tensor (the i-th dimension of the logical view corresponds
to the dimension with number ``permutation[i]`` of the physical tensor).

Permutation can be useful in case the logical order of
the tensor is a permutation of the physical order (row-major).

When logical and physical layout are equal, the permutation will always
be ([0, 1, .., N-1]) and can therefore be left out.

* Description of the serialization:

The metadata must be a valid JSON object that includes the shape of
the contained tensors as an array with key **"shape"**, plus optional
dimension names with key **"dim_names"** and an optional ordering of the
dimensions with key **"permutation"**.

- Example: ``{ "shape": [2, 5]}``
- Example with ``dim_names`` metadata for NCHW ordered data:

``{ "shape": [100, 200, 500], "dim_names": ["C", "H", "W"]}``

- Example of permuted 3-dimensional tensor:

``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``

This is the physical layout shape; the shape of the logical
layout would in this case be ``[500, 100, 200]``.
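
The serialization rules above can be illustrated with a minimal Python sketch. It is not part of the specification or of any Arrow implementation: the function name ``parse_fixed_shape_tensor_metadata`` and the returned dictionary layout are hypothetical, chosen only for illustration. The sketch parses the JSON metadata, checks the optional keys against the number of dimensions, and derives the storage ``list_size`` and the logical shape.

```python
import json
from math import prod


def parse_fixed_shape_tensor_metadata(serialized: str) -> dict:
    """Parse and check fixed shape tensor metadata (hypothetical helper)."""
    meta = json.loads(serialized)
    if not isinstance(meta, dict) or "shape" not in meta:
        raise ValueError('metadata must be a JSON object with a "shape" key')

    shape = list(meta["shape"])
    ndim = len(shape)

    # dim_names is optional; when present it needs one name per dimension.
    dim_names = meta.get("dim_names")
    if dim_names is not None and len(dim_names) != ndim:
        raise ValueError("dim_names must have one entry per dimension")

    # permutation is optional; when present it must be a permutation of
    # [0, 1, .., N-1]. When absent, logical and physical layout are equal.
    permutation = meta.get("permutation")
    if permutation is None:
        permutation = list(range(ndim))
    elif sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must be a permutation of 0..N-1")

    return {
        "shape": shape,
        "dim_names": dim_names,
        "permutation": list(permutation),
        # list_size of the FixedSizeList storage: product of the shape.
        "list_size": prod(shape),
        # The i-th logical dimension is physical dimension permutation[i].
        "logical_shape": [shape[p] for p in permutation],
    }


info = parse_fixed_shape_tensor_metadata(
    '{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}'
)
print(info["logical_shape"])  # [500, 100, 200]
print(info["list_size"])      # 10000000
```

For the permuted example, the derived logical shape matches the ``[500, 100, 200]`` stated in the text.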

.. note::

Elements in a fixed shape tensor extension array are stored
in row-major/C-contiguous order.
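
As a rough illustration of what row-major storage implies, the following Python sketch computes where a tensor element sits inside one ``FixedSizeList`` value. The helper names ``row_major_offset`` and ``logical_to_physical`` are hypothetical and not taken from any Arrow library; the mapping from logical to physical indices follows the ``permutation`` rule given in the parameter description.

```python
def row_major_offset(index, shape):
    """Flat offset of a physical multi-index in row-major (C) order."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same length")
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("index out of bounds for shape")
        # The last dimension varies fastest in row-major order.
        offset = offset * n + i
    return offset


def logical_to_physical(logical_index, permutation):
    """Reorder a logical multi-index into the physical dimension order."""
    physical = [0] * len(permutation)
    # Logical dimension i corresponds to physical dimension permutation[i].
    for logical_dim, physical_dim in enumerate(permutation):
        physical[physical_dim] = logical_index[logical_dim]
    return tuple(physical)


# Element (1, 3) of a tensor with shape [2, 5].
print(row_major_offset((1, 3), [2, 5]))  # 8

# Logical element (4, 1, 2) of the permuted example from the serialization.
physical = logical_to_physical((4, 1, 2), [2, 0, 1])
print(physical)                                     # (1, 2, 4)
print(row_major_offset(physical, [100, 200, 500]))  # 101004
```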

.. note::

Other Data Structures in Arrow include a
`Tensor (Multi-dimensional Array) <https://arrow.apache.org/docs/format/Other.html>`_
to be used as a message in the interprocess communication machinery (IPC).

This structure has no relationship with the Fixed shape tensor extension type defined
by this specification. By defining an extension type, one can use fixed shape tensors
as elements in a field of a RecordBatch or a Table.