GH-33923: [Docs] Tensor canonical extension type specification #33925
Conversation
Thanks for doing this @AlenkaF! I suggested some changes.
Also, a question I hit on the C++ implementation last week: what happens when one wants to work with two fixed_size_tensor extensions (let's say they have different shapes) at once? I think this will be a common occurrence and, because of the namespace collision, two extensions can't be registered at once.
When working on the pyarrow implementation locally, this wasn't an issue. And you can see this case in the tests I uploaded to gist: I guess it is the same in C++?
Only one extension type with a given name can be registered at a time. Multiple can coexist, but the registry can only keep one extension per name. I suppose users won't really want to register these types (or is that a requirement for using compute?), but will want an easy way to instantiate them. Alternatively we could store the shape in the extension name, e.g.
It's only the name that is being registered, along with the methods to serialize / deserialize. The actual metadata (i.e. the only thing that differs between two different extension type instances of the same type) isn't part of the type class that is being registered, and so for a single registered (parametrized) type, you can have many instances alive at the same time with a different parametrization (i.e. different metadata).
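To illustrate the point, here is a minimal sketch using pyarrow's `ExtensionType` API (the `FixedShapeTensorType` class and its metadata layout are hypothetical, not the final spec): registration happens once per extension name, while the shape lives in per-instance metadata, so differently shaped instances can coexist.

```python
import json
import pyarrow as pa

class FixedShapeTensorType(pa.ExtensionType):
    """Hypothetical tensor extension type; shape is instance metadata."""

    def __init__(self, value_type, shape):
        self.shape = shape
        size = 1
        for dim in shape:
            size *= dim
        # Storage is a fixed-size list; the extension name is shared by
        # all instances of this class.
        super().__init__(pa.list_(value_type, size), "arrow.fixed_shape_tensor")

    def __arrow_ext_serialize__(self):
        # The serialized metadata is the only thing that differs between
        # two instances of the same registered type.
        return json.dumps({"shape": self.shape}).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        shape = json.loads(serialized.decode())["shape"]
        return cls(storage_type.value_type, shape)

# The type class is registered once, under one name ...
pa.register_extension_type(FixedShapeTensorType(pa.int64(), [2, 2]))

# ... but many differently parametrized instances can be alive at once.
t1 = FixedShapeTensorType(pa.float32(), [3, 4])
t2 = FixedShapeTensorType(pa.float32(), [28, 28])
```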
Thank you for reviewing, Rok! If Joris agrees, this can be sent to the ML when the C++ implementation is ready. If I remember correctly, I should start a new discussion thread for the tensor canonical extension specification, adding some explanation, a link to the C++ implementation and a pyarrow example. Please correct me if I am wrong. I am working on a PyArrow extension example to have it ready for illustration.
Hey @lhoestq, you (and/or other people at HuggingFace) might be interested in reviewing this proposed addition. It will also be discussed on the Arrow dev ML.
Hi! (I'm one of the maintainers of HF Datasets.) The spec looks good to me. I also slightly prefer
In Datasets, we store fixed-size tensors as variable-length lists, which is suboptimal as it also stores the offsets, so having this natively supported by Arrow would save us the hassle of implementing/maintaining a new extension type.
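For context, a sketch of the storage difference described here (plain pyarrow, no extension type involved): a variable-size list array carries an extra offsets buffer, while a fixed-size list array of the same values does not.

```python
import pyarrow as pa

values = pa.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], type=pa.float32())

# Variable-size lists: an int32 offsets buffer marks where each list ends.
var = pa.ListArray.from_arrays(pa.array([0, 3, 6], type=pa.int32()), values)

# Fixed-size lists: the size (3) is part of the type, so no offsets buffer.
fixed = pa.FixedSizeListArray.from_arrays(values, 3)

print(var.type)    # list<item: float>
print(fixed.type)  # fixed_size_list<item: float>[3]
```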
@mariosasko Thanks for the feedback!
I noticed that in the HuggingFace Datasets implementation, there is a comment about using a variable- instead of a fixed-size list array. Is that resolved now, and is it fine to use fixed-size lists? Related to the
Yes, the fixed-size version seems to work now (requires
No, that has yet to be requested on our side (so not super important for us right now).
MATLAB’s Deep Learning Toolbox uses n-dimensional arrays which fit quite well with the proposal. It also has a special datatype called “dlarray” which is responsible for automatic differentiation. This datatype allows labelled dimensions; however, the labels are not quite compatible with a permutation. We group dimension labels into categories – spatial/continuous, channel, batch, temporal, unspecified, and potential extensions – where the spatial and unspecified labels (and perhaps others in future) are allowed to be used for multiple dimensions.

We find that ‘H’ and ‘W’ are very image-centric and do not extend arbitrarily, while ‘U’ might be used for additional dimensions needed for a variety of purposes in different contexts, such as filter groups, point cloud index and so forth. Consequently we need to keep track of permutations within dimensions that share the same name. Therefore our main input would be to request that permutation and dim_names are not mutually exclusive.

We would also like to ensure that the format will support complex data.

Questions:
Joss Knight
Thank you for the input and the description of MATLAB's Deep Learning Toolbox @extabgrad!
As per the discussion on the mailing list and this proposal, I believe
Just to be clear: by complex do you mean diverse, not complex as in complex numbers?
This wasn't discussed, but the current language doesn't forbid them, so I suppose they are allowed. Do you think we should explicitly allow them?
That's an interesting point. Do you know of a source where this is discussed/argued? I'd be in favor of array-like naming too. The consideration here would be that deep learning frameworks are pushing the tensor name and seem to have a lot of momentum.
And if you mean complex numbers: the
I don't have an opinion on the whole "array vs tensor" debate, but one aspect in favor of "Tensor" is that it avoids the confusion about whether "FixedShapeArray" is actually an array or a type class, and the duplication in the name "FixedShapeArrayArray".
No, I mean complex numbers, which are increasingly used in AI workflows.
I find 'row-major' to be ambiguous. You can have a row-major layout and still list dimensions left-to-right [row column page batch].
I don't know, I just know they are important in numpy.
Are you confident it is clear this datatype is only to be used in machine learning contexts? Perhaps it should be called 'machine-learning-tensor' then?
Interesting! As Joris states there is an independent effort to enable that.
This proposal currently states:
I hope it'll be used in a wider context.
I can see it's confusing, since we are distinguishing between the order dimensions are indexed and the order they are stored. Left-to-right means new higher dimensions are added on the right, and right-to-left means they are added on the left.

The minor_to_major field definition states that the left-most entry refers to the contiguous dimension in memory. The order of indexing is then defined by the numbers in this field, so [0 1 2] means that X(i,j,k) is indexing position i + (M*j) + (M*N*k). In typical parlance where i represents the row index, this is therefore column-major layout. A minor_to_major of [1 0 2] would therefore be a strict row-major layout so that j can represent the column index. But of course by row-major people typically mean [2 1 0], which means k is actually the column index, since the left-most dimension will be the highest (pages).

If MATLAB data were stored in row-major format but still indexed left-to-right, then i would become the column index and no permutation would be needed to translate between MATLAB data and other row-major formats. However, since everyone uses the row, column indexing convention and MATLAB indexes left-to-right, the data is therefore stored in column-major.

If 'permutation' is an inherently left-to-right spec then it is [1 0 2] for MATLAB (swap the 'first' two dimensions in memory). But if it's a right-to-left spec then it would be [0 2 1] (swap the 'last' two dimensions in memory). Hence the ambiguity - so I thought it best to check!

I note that permutation is effectively the inverse of minor_to_major. The former is the memory order relative to the indexing order and the latter is the other way around.
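A small numeric check of the argument above (a sketch; `linear_offset` is an illustrative helper, not part of the proposal): with minor_to_major = [0, 1, 2] over an M x N x P array, X(i,j,k) lands at offset i + M*j + M*N*k, and the inverse relationship between permutation and minor_to_major can be computed with an argsort.

```python
import numpy as np

def linear_offset(index, shape, minor_to_major):
    """Linear offset of `index`, laying dimensions out minor to major."""
    offset, stride = 0, 1
    for dim in minor_to_major:  # the left-most entry is contiguous in memory
        offset += index[dim] * stride
        stride *= shape[dim]
    return offset

M, N, P = 3, 4, 5
i, j, k = 1, 2, 3
assert linear_offset((i, j, k), (M, N, P), [0, 1, 2]) == i + M*j + M*N*k

# If permutation is the inverse of minor_to_major, argsort recovers it:
minor_to_major = [2, 0, 1]
permutation = list(np.argsort(minor_to_major))  # [1, 2, 0]
```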
Benchmark runs are scheduled for baseline = 3df5ba8 and contender = 8583076. 8583076 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…pache#33925) ### Rationale for this change There have been quite a lot of discussions connected to tensor support in Arrow Tables/RecordBatches. This PR is a specification proposal to add tensors as a canonical extension type and is meant to be sent to the mailing list for discussion and vote. ### What changes are included in this PR? Specification for a canonical extension type for fixed-size tensors. **Open question** Should metadata include the `"dim_names"` key to (optionally) specify dimension names when creating the Arrow FixedShapeTensorArray? * Closes: apache#33923 Lead-authored-by: Alenka Frim <[email protected]> Co-authored-by: Alenka Frim <[email protected]> Co-authored-by: David Li <[email protected]> Co-authored-by: Rok Mihevc <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Alenka Frim <[email protected]>
Rationale for this change
There have been quite a lot of discussions connected to tensor support in Arrow Tables/RecordBatches. This PR is a specification proposal to add tensors as a canonical extension type and is meant to be sent to the mailing list for discussion and vote.
What changes are included in this PR?
Specification for a canonical extension type for fixed-size tensors.
Open question
Should metadata include the `"dim_names"` key to (optionally) specify dimension names when creating the Arrow FixedShapeTensorArray?
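For illustration, one possible shape of the serialized metadata with the optional key included (a sketch only; the exact canonical form is whatever the spec text in this PR defines):

```python
import json

# Hypothetical metadata for a 2x5 tensor with named dimensions; "dim_names"
# would be optional and, when present, name one entry per dimension.
metadata = json.dumps({"shape": [2, 5], "dim_names": ["C", "H"]})
```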