GH-33923: [Docs] Tensor canonical extension type specification #33925

AlenkaF · 2023-01-30T08:47:11Z

Rationale for this change

There have been quite a lot of discussions connected to the tensor support in Arrow Tables/RecorBatches. This PR is a specification proposal to add tensors as a canonical type extensions and is meant to be sent to the Mailing list for discussion and vote.

What changes are included in this PR?

Specification for canonical extension type for fixed sized tensors.

Open question

Should metadata include the "dim_names" key to (optionally) specify dimension names when creating the Arrow FixedShapeTensorArray?

Closes: [Docs] Tensor canonical extension type specification #33923

github-actions · 2023-01-30T09:20:26Z

Closes: [Docs] Tensor canonical extension type specification #33923

github-actions · 2023-01-30T09:20:29Z

⚠️ GitHub issue #33923 has been automatically assigned in GitHub to PR creator.

docs/source/format/CanonicalExtensions.rst

rok

Thanks for doing this @AlenkaF !
I suggested some changes.
Also: a question I hit on the C++ implementation last week:
what happens when one wants to work with two fixed_size_tensor extensions (let's say they have different shapes) at once? I think this will be a common occurrence and because of the namespace collision two extensions can't be registered at once.

docs/source/format/CanonicalExtensions.rst

Co-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]>

AlenkaF · 2023-01-30T11:00:25Z

Also: a question I hit on the C++ implementation last week:
what happens when one wants to work with two fixed_size_tensor extensions (let's say they have different shapes) at once? I think this will be a common occurrence and because of the namespace collision two extensions can't be registered at once.

When working on pyarrow implementation locally, this wasn't an issue.
From the Python documentation:

Registration needs an extension type instance, but then works for any instance of the same subclass regardless of parametrization of the type.

And you can see this case in the tests I uploaded to gist:
https://gist.github.com/AlenkaF/95fb41f461fb792396bb20dd502b4112#file-02_tensor_extension_tests-py-L77-L96

I guess it is the same in C++?

rok · 2023-01-30T11:18:35Z

Only one extension type with a given name can be registered at a time. Multiple can coexist, but register can only keep one extension per name. I suppose users won't really want to register these types (or is that a requirement for using compute?), but will want an easy way to instantiate them.

Alternatively we could store shape in the extension name, e.g. arrow.fixed_size_tensor<int64,(5,2),row_major>, but I'm not sure that's a good idea.

jorisvandenbossche · 2023-01-30T12:36:31Z

Only one extension type with a given name can be registered at a time. Multiple can coexist, but register can only keep one extension per name.

It's only the name that is being registered, along with the methods to serialize / deserialize. The actual metadata (i.e. the only thing that differs between two different extension type instances of the same type) isn't part of the type class that is being registered, and so for a single registered (parametrized) type, you can have many instance alive at the same time with a different parametrization (i.e. different metadata).
That's the goal of having parametrized types (and so you don't want to have those parameters (metadata) in the name of the extension type. Otherwise you would have to register every possible parametrization).

docs/source/format/CanonicalExtensions.rst

Co-authored-by: Rok Mihevc <[email protected]>

AlenkaF · 2023-01-30T17:55:02Z

Thank you for reviewing Rok!

If Joris agrees this can be sent to the ML when the C++ implementation is ready. If I remember correctly, I would start a new discussion thread for the tensor canonical extension specification adding some explanation, link to the C++ implementation and pyarrow example. Please correct me if I am wrong.

I am working on PyArrow extension example to have it ready for illustration.

docs/source/format/CanonicalExtensions.rst

pitrou · 2023-01-31T11:17:08Z

Hey @lhoestq, you (and/or other people at HuggingFace) might be interested in reviewing this proposed addition. It will also be discussed on the Arrow dev ML.

docs/source/format/CanonicalExtensions.rst

mariosasko · 2023-01-31T14:30:09Z

Hi! (I'm one of the maintainers of HF Datasets)

The spec looks good to me. I also slightly prefer is_row_major over NumPy's order.

In Datasets, we store fixed-size tensors as variable-length lists, which is suboptimal as it also stores the offsets, so having this natively supported by Arrow would save us the hassle of implementing/maintaining a new extension type.

jorisvandenbossche · 2023-01-31T16:38:50Z

@mariosasko Thanks for the feedback!

In Datasets, we store fixed-size tensors as variable-length lists, which is suboptimal as it also stores the offsets

I noticed that in the HuggingFace Datasets implementation, there is a comment about using variable instead of fixed size list array. That is resolved now, and it's fine to use fixed-size lists?

Related to the is_row_major / order parameter, we were having some discussion in zulip chat (this public stream). Summarizing that here: I wonder to what extent the is_row_major / order keyword is parameter actually useful.

I think none of the existing custom tensor extension type implementations have this (eg neither of Ray and HuggingFace Datasets have this. @mariosasko do you know if that has come up in HuggingFace?)
The first dimension always needs to be the one that matches the length of the array and have the biggest stride (to match our list array layout). So even if you would store each individual tensor as column-major (F contiguous), the full ndarray representing the full TensorArray wouldn't be F-contiguous.
Not super familiar here, but I think some of the ML frameworks like tensorflow also only deal with row-major data anyway. @rok mentioned an example of a different "channel last" layout in pytorch (https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html). But that specific example also isn't exactly column-major for each individual tensor (after the first dimension), but something custom (strides of (1, 96, 3) for tensor shape of (3, 32, 32) in their example, so neither one of (1024, 32, 1) or (1, 3, 96)). So to support this use case, we would actually need a strides parameter, and not a order/is_row_major parameter. EDIT: that's just a view (with different strides) on the original row-major data, to give a view with consistent dimension order (channel-first), if the actual data is stored with a different dimension order (channel-last). So the useful information for an application to pass along would be the names of the dimensions (the data itself is still always row major).

mariosasko · 2023-01-31T22:57:39Z

@jorisvandenbossche

I noticed that in the HuggingFace Datasets implementation, there is a comment about using variable instead of fixed size list array. That is resolved now, and it's fine to use fixed-size lists?

Yes, the fixed-sized version seems to work now (requires pyarrow>=10.0.0 to pass the tests).

I think none of the existing custom tensor extension type implementations have this (eg neither of Ray and HuggingFace Datasets have this. @mariosasko do you know if that has come up in HuggingFace?)

No, that has yet to be requested on our side (so not super important for us right now).

docs/source/format/CanonicalExtensions.rst

Co-authored-by: Joris Van den Bossche <[email protected]>

docs/source/format/CanonicalExtensions.rst

Co-authored-by: Joris Van den Bossche <[email protected]>

extabgrad · 2023-03-02T10:24:09Z

MATLAB’s Deep Learning Toolbox uses n-dimensional arrays which fit quite well with the proposal. It also has a special datatype called “dlarray” which is responsible for automatic differentiation. This datatype allows labelled dimensions, however, the labels are quite compatible with a permutation. We group dimension labels into categories – spatial/continuous, channel, batch, temporal, unspecified, and potential extensions – where the spatial and unspecified labels (and perhaps others in future) are allowed to be used for multiple dimensions. We find that ‘H’ and ‘W’ are very image-centric and do not extend arbitrarily, while ‘U’ might be used for additional dimensions needed for a variety of purposes in different contexts, such as filter groups, point cloud index and so forth. Consequently we need to keep track of permutations within dimensions with the same name.

Therefore our main input would be to request that permutation and dim_names are not mutually exclusive.

We would also like to ensure that the format will support complex data.

Questions:

Could you clarify whether shape, dim_names and permutation are listed in left-to-right or right-to-left order. As in, is the contiguous dimension the right-most (C, python) or left-most (MATLAB, Fortran) dimension?
Are 1-dimensional and 0-dimensional arrays allowed?
“fixed_shape_array” or variations on multi-dimensional or nd array might be a more appropriate name given the mathematical implications of the term ‘tensor’.

Joss Knight
Development Manager
GPU and Deep Learning Team
MathWorks UK

rok · 2023-03-02T11:10:02Z

Thanks you for the input and description of MATLAB's Deep Learning Toolbox @extabgrad !

Therefore our main input would be to request that permutation and dim_names are not mutually exclusive.

As per discussion on the mailing list and this proposal I believe permutation and dim_names will not be mutually exclusive.

We would also like to ensure that the format will support complex data.

Just to be clear: by complex you mean diverse not complex as in complex numbers?

Questions:

Could you clarify whether shape, dim_names and permutation are listed in left-to-right or right-to-left order. As in, is the contiguous dimension the right-most (C, python) or left-most (MATLAB, Fortran) dimension?

shape, dim_names and permutation would map to a row-major (C, python) physical layout of data. Data in the underlying buffer would have row-major layout.

Are 1-dimensional and 0-dimensional arrays allowed?

This wasn't discussed, but the current language doesn't forbid them, so I suppose they are allowed. Do you think we should explicitly allow them?

“fixed_shape_array” or variations on multi-dimensional or nd array might be a more appropriate name given the mathematical implications of the term ‘tensor’.

That's an interesting point. Do you know of a source where this is discussed/argued? I'd be in favor of array-like naming too. The consideration here would be that deep learning frameworks are pushing tensor name and seem to have a lot of momentum.

jorisvandenbossche · 2023-03-02T11:36:36Z

We would also like to ensure that the format will support complex data.

Just to be clear: by complex you mean diverse not complex as in complex numbers?

And if you mean complex numbers: the value_type in this proposal can be any arrow data type. At the moment, there is no "complex" data type defined in the Arrow format spec, though. But there is a proposal to add a canonical extension type for complex data (#10565), and assuming that this would be included in Arrow, then you can specify a tensor array that uses complex numbers as the individual tensor elements.

3. “fixed_shape_array” or variations on multi-dimensional or nd array might be a more appropriate name given the mathematical implications of the term ‘tensor’.

That's an interesting point. Do you know of a source where this is discussed/argued? I'd be in favor of array-like naming too. The consideration here would be that deep learning frameworks are pushing tensor name and seem to have a lot of momentum.

I don't have an opinion on the whole "array vs tensor" debate, but one aspect in favor of "Tensor" is that it avoids the confusion about whether "FixedShapeArray" is actually an array or type class, and the duplication in the name of "FixedShapeArrayArray".

extabgrad · 2023-03-02T13:00:51Z

Just to be clear: by complex you mean diverse not complex as in complex numbers?

No, I mean complex numbers, which are increasingly used in AI workflows.

shape, dim_names and permutation would map to a row-major (C, python) physical layout of data. Data in the underlying buffer would have row-major layout.

I find 'row-major' to be ambiguous. You can have a row major layout and still list dimensions left-to-right [row column page batch].

Are 1-dimensional and 0-dimensional arrays allowed?

This wasn't discussed, but the current language doesn't forbid them, so I suppose they are allowed. Do you think we should explicitly allow them?

I don't know, I just know they are important in numpy.

That's an interesting point. Do you know of a source where this is discussed/argued? I'd be in favor of array-like naming too. The consideration here would be that deep learning frameworks are pushing tensor name and seem to have a lot of momentum.

Are you confident it is clear this datatype is only to be used in machine learning contexts? Perhaps it should be called 'machine-learning-tensor' then?

rok · 2023-03-02T13:49:55Z

No, I mean complex numbers, which are increasingly used in AI workflows.

Interesting! As Joris states there is an independent effort to enable that.

shape, dim_names and permutation would map to a row-major (C, python) physical layout of data. Data in the underlying buffer would have row-major layout.

I find 'row-major' to be ambiguous. You can have a row major layout and still list dimensions left-to-right [row column page batch].

This proposal currently states: Elements in a fixed shape tensor extension array are stored in row-major/C-contiguous order.. We can amend that to be more general. Could you state what left-to-right means? I assume it's equal to TensorFlow's minor-to-major.

Are you confident it is clear this datatype is only to be used in machine learning contexts? Perhaps it should be called 'machine-learning-tensor' then?

I hope it'll be used in a wider context. FixedShapeNDArray?

extabgrad · 2023-03-02T15:08:51Z

This proposal currently states: Elements in a fixed shape tensor extension array are stored in row-major/C-contiguous order.. We can amend that to be more general. Could you state what left-to-right means? I assume it's equal to TensorFlow's minor-to-major.

I can see it's confusing, since we are distinguishing between the order dimensions are indexed and the order they are stored. Left-to-right means new higher dimensions are added on the right, and right-to-left means they are added on the left.

The minor_to_major field definition states that the left-most entry refers to the contiguous dimension in memory. The order of indexing is then defined by the numbers in this field, so [0 1 2] means that X(i,j,k) is indexing position i+(Mj)+(MN*k). In typical parlance where i represents the row index, this is therefore column-major layout. A minor_to_major of [1 0 2] would therefore be a strict row-major layout so that j can represent the column index. But of course by row-major people typically mean [2 1 0], which means actually k is the column index, since the left-most dimension will be the highest (pages).

If MATLAB data were stored in row-major format but still indexed left-to-right then i would become the column index and no permutation would be needed to translate between MATLAB data and other row-major formats. However, since everyone uses the row,column indexing convention and MATLAB indexes left-to-right, the data is therefore stored in column-major. If 'permutation' is an inherently left-to-right spec then it is [1 0 2] for MATLAB (swap the 'first' two dimensions in memory). But if it's a right-to-left spec then it would be [0 2 1] (swap the 'last' two dimensions in memory).

Hence the ambiguity - so I thought it best to check!

I note that permutation is effectively the inverse of minor_to_major. The former is the memory order relative to the indexing order and the latter is the other way around.

docs/source/format/CanonicalExtensions.rst

Co-authored-by: David Li <[email protected]>

ursabot · 2023-03-15T23:24:25Z

Benchmark runs are scheduled for baseline = 3df5ba8 and contender = 8583076. 8583076 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.61% ⬆️0.06%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.66% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 85830764 ec2-t3-xlarge-us-east-2
[Failed] 85830764 test-mac-arm
[Finished] 85830764 ursa-i9-9960x
[Finished] 85830764 ursa-thinkcentre-m75q
[Finished] 3df5ba8b ec2-t3-xlarge-us-east-2
[Finished] 3df5ba8b test-mac-arm
[Finished] 3df5ba8b ursa-i9-9960x
[Finished] 3df5ba8b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

…pache#33925) ### Rationale for this change There have been quite a lot of discussions connected to the tensor support in Arrow Tables/RecorBatches. This PR is a specification proposal to add tensors as a canonical type extensions and is meant to be sent to the Mailing list for discussion and vote. ### What changes are included in this PR? Specification for canonical extension type for fixed sized tensors. **Open question** Should metadata include the `"dim_names"` key to (optionally) specify dimension names when creating the Arrow FixedShapeTensorArray? * Closes: apache#33923 Lead-authored-by: Alenka Frim <[email protected]> Co-authored-by: Alenka Frim <[email protected]> Co-authored-by: David Li <[email protected]> Co-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Alenka Frim <[email protected]>

Add Fixed size tensor spec to canonical extensions list

af571cb

This comment was marked as outdated.

Sign in to view

github-actions bot added the Component: Documentation label Jan 30, 2023

AlenkaF changed the title ~~Add Fixed size tensor spec to canonical extensions list~~ GH-33923: [Docs] Tensor canonical type extension specification Jan 30, 2023

jorisvandenbossche reviewed Jan 30, 2023

View reviewed changes

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

rok requested changes Jan 30, 2023

View reviewed changes

AlenkaF and others added 2 commits January 30, 2023 11:44

Apply suggestions from code review

8231150

Co-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]>

Remove implementation-specific metadata

884d871

Change order with is_row_major

83edd70

rok reviewed Jan 30, 2023

View reviewed changes

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

AlenkaF and others added 2 commits January 30, 2023 17:26

Update docs/source/format/CanonicalExtensions.rst

16ef6f1

Co-authored-by: Rok Mihevc <[email protected]>

Update metadata part

4f4ccce

rok approved these changes Jan 30, 2023

View reviewed changes

Correct True to true in json

92fd7c6

jorisvandenbossche changed the title ~~GH-33923: [Docs] Tensor canonical type extension specification~~ GH-33923: [Docs] Tensor canonical extension type specification Jan 31, 2023

jorisvandenbossche reviewed Jan 31, 2023

View reviewed changes

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

pitrou reviewed Jan 31, 2023

View reviewed changes

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

AlenkaF added 2 commits January 31, 2023 12:50

Change name from fixed_size_tensor to fixed_shape_tensor

7873676

Add description for ListType parameters

a4219e3

Change the description for ListType parameters

37e83db

jorisvandenbossche reviewed Feb 22, 2023

View reviewed changes

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

Update docs/source/format/CanonicalExtensions.rst

4ff7a65

Co-authored-by: Joris Van den Bossche <[email protected]>

This was referenced Feb 23, 2023

[Python] Support mask in FixedSizeListArray.from_arrays #34316

Closed

GeoArrow + Raster? geoarrow/geoarrow#24

Open

jorisvandenbossche reviewed Feb 28, 2023

View reviewed changes

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

Update docs/source/format/CanonicalExtensions.rst

1daf820

Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche approved these changes Feb 28, 2023

View reviewed changes

kylebarron mentioned this pull request Mar 8, 2023

saving and reading vector data cubes xarray-contrib/xvec#26

Open

Add note about IPC tensor

70059d9

github-actions bot added the awaiting review Awaiting review label Mar 9, 2023

AlenkaF commented Mar 9, 2023

View reviewed changes

docs/source/format/CanonicalExtensions.rst Outdated Show resolved Hide resolved

github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Mar 9, 2023

Update docs/source/format/CanonicalExtensions.rst

6f44296

Co-authored-by: David Li <[email protected]>

github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Mar 10, 2023

AlenkaF merged commit 8583076 into apache:main Mar 15, 2023

AlenkaF added this to the 12.0.0 milestone Mar 15, 2023

westonpace mentioned this pull request Apr 18, 2023

Different primitive types in different languages #35052

Closed

AlenkaF deleted the spec-canonical-extension-tensor branch June 5, 2023 06:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-33923: [Docs] Tensor canonical extension type specification #33925

GH-33923: [Docs] Tensor canonical extension type specification #33925

AlenkaF commented Jan 30, 2023 •

edited by jorisvandenbossche

Loading

This comment was marked as outdated.

github-actions bot commented Jan 30, 2023

github-actions bot commented Jan 30, 2023

rok left a comment

AlenkaF commented Jan 30, 2023

rok commented Jan 30, 2023 •

edited

Loading

jorisvandenbossche commented Jan 30, 2023

AlenkaF commented Jan 30, 2023

pitrou commented Jan 31, 2023

mariosasko commented Jan 31, 2023

jorisvandenbossche commented Jan 31, 2023 •

edited

Loading

mariosasko commented Jan 31, 2023

extabgrad commented Mar 2, 2023

rok commented Mar 2, 2023

jorisvandenbossche commented Mar 2, 2023 •

edited

Loading

extabgrad commented Mar 2, 2023

rok commented Mar 2, 2023

extabgrad commented Mar 2, 2023 •

edited

Loading

ursabot commented Mar 15, 2023

GH-33923: [Docs] Tensor canonical extension type specification #33925

GH-33923: [Docs] Tensor canonical extension type specification #33925

Conversation

AlenkaF commented Jan 30, 2023 • edited by jorisvandenbossche Loading

Rationale for this change

What changes are included in this PR?

This comment was marked as outdated.

github-actions bot commented Jan 30, 2023

github-actions bot commented Jan 30, 2023

rok left a comment

Choose a reason for hiding this comment

AlenkaF commented Jan 30, 2023

rok commented Jan 30, 2023 • edited Loading

jorisvandenbossche commented Jan 30, 2023

AlenkaF commented Jan 30, 2023

pitrou commented Jan 31, 2023

mariosasko commented Jan 31, 2023

jorisvandenbossche commented Jan 31, 2023 • edited Loading

mariosasko commented Jan 31, 2023

extabgrad commented Mar 2, 2023

rok commented Mar 2, 2023

jorisvandenbossche commented Mar 2, 2023 • edited Loading

extabgrad commented Mar 2, 2023

rok commented Mar 2, 2023

extabgrad commented Mar 2, 2023 • edited Loading

ursabot commented Mar 15, 2023

AlenkaF commented Jan 30, 2023 •

edited by jorisvandenbossche

Loading

rok commented Jan 30, 2023 •

edited

Loading

jorisvandenbossche commented Jan 31, 2023 •

edited

Loading

jorisvandenbossche commented Mar 2, 2023 •

edited

Loading

extabgrad commented Mar 2, 2023 •

edited

Loading