RFC: add data type inspection utilities to the array API specification #425
Comments
Also related: #152. Edit: if we look into type naming (briefly discussed), this discussion around type naming in Zarr may be of interest (zarr-developers/zarr-specs#131).

Update: I've updated the OP based on feedback here and in the last array API consortium meeting.
Some more prior art:
Given that dtype objects are immutable and have no state, this should also work for JAX et al. Not saying that that's my preference (I'm not yet sure), but this RFC proposes a lot of functions ... 3 underscores in a name like `has_complex_float_dtype`.
Whether methods or functions, surface area would be the same. The list can obviously be culled; however, I do think there is some advantage to matching the categories as used in the spec, especially for providing consistent APIs for input argument validation.
The number of underscores is not super important, IMO. Instead, we're probably concerned about the number of characters. I don't have a strong opinion here, although the current naming convention is arguably more literate.
Silly question, why not do:

```python
if array.dtype in <set_of_dtypes>:
    ...
```

and require implementations to provide some predefined sets, such as "set of all supported integer dtypes" or "set of all supported floating-point dtypes"?
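For concreteness, a minimal sketch of what such predefined sets might look like on top of NumPy (the set names here are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical predefined dtype sets an implementation could expose.
integer_dtypes = frozenset(np.dtype(t) for t in (
    np.int8, np.int16, np.int32, np.int64,
    np.uint8, np.uint16, np.uint32, np.uint64))
float_dtypes = frozenset(np.dtype(t) for t in (np.float32, np.float64))

x = np.ones(3, dtype=np.int32)
if x.dtype in integer_dtypes:
    print("integer input")  # membership test via dtype equality/hashing
```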
That does seem more appealing indeed; it's what can already be done today and it reads fairly well. I think I prefer that over both of the alternatives above.
I don't like … For …

I do think neither of these is particularly problematic. But I would say that this would not be a …
A minor pro of dtype sets is that it could be a way for a library to communicate which dtypes it supports (thinking of PyTorch, which only supports `uint8` among the unsigned integer dtypes).
@leofang pointed out that this is blocking for adding …
Other thoughts:
One alternative with a similar API surface is to add 3 functions with the same functionality instead. Those functions could be 3 of the ones in the issue description here (e.g., …).
I think the con is more important than the pro here, but I'd say either choice is pretty reasonable.
Just to make sure, @leofang: both flavors are fine for accelerators, right? When the spec says something should return a `bool` …
My preference would be to match more closely the spec on this. Namely, have the following objects:
This would mean 6 objects, which would, as it stands now, cover almost the entirety of the spec. As these are relatively trivial to implement and expose, I don't see this as imposing an undue burden on array libraries. However, if only … E.g., suppose we want to validate an array for a function which supports all numeric dtypes. With just the base sets:

```python
def foo(x: array):
    dt = x.dtype
    if dt in integer_dtypes or dt in float_dtypes or dt in complex_dtypes:
        ...
```

Given the opacity of what's intended in the conditional, one might be tempted to write a small helper function transforming the check into something more literate. And given the ubiquity of composite dtype categories in the spec, I'd argue we should just include the composite groups in the spec directly so that array library clients don't need to reimplement these groups from library to library.
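As an illustration of such a helper, here is a sketch building a composite group out of base sets (all names hypothetical, using NumPy dtypes for concreteness):

```python
import numpy as np

# Base dtype sets, as an implementation might expose them.
integer_dtypes = {np.dtype(t) for t in (np.int8, np.int16, np.int32, np.int64,
                                        np.uint8, np.uint16, np.uint32, np.uint64)}
float_dtypes = {np.dtype(t) for t in (np.float32, np.float64)}
complex_dtypes = {np.dtype(t) for t in (np.complex64, np.complex128)}

# Composite group built once, so call sites stay literate.
numeric_dtypes = integer_dtypes | float_dtypes | complex_dtypes

def foo(x):
    if x.dtype in numeric_dtypes:
        ...  # numeric implementation
```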
This is a good point, although in general this isn't done for library code, even if the library provides string/object/etc. dtypes. It is difficult to pick the right sets here.
I don't think that will work; the names don't map to current practice and are not intuitive enough.

```python
>>> x = np.ones(2, dtype=np.float64)
>>> x2 = np.ones(2, dtype=np.complex128)
>>> np.issubdtype(x.dtype, np.floating)
True
>>> np.issubdtype(x2.dtype, np.floating)
False
```
Understood. We're not starting from a blank slate. Although, presumably, at least for Torch, the need for … For NumPy, well, 🤷‍♂️. The notion of what is considered a "floating-point" dtype arose previously in the consortium. Then, it was decided that under the umbrella of floating-point are both real and complex. Hence, the OP. Unfortunately, however, I don't have, atm, a more intuitive name for "real + complex floating-point dtypes", but I don't think this negates the general desirability of composite groups.
NumPy calls … I have to think some more about the API choices we have here. I currently think that whatever definition we have for … The last point might mean that we should have an API to allow users to spell … Annoyingly, NumPy doesn't even have a good way to spell it. Basically, you would first use …
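For context, here is how such checks can be spelled in today's NumPy (a sketch of current practice, not a proposed API); note that "real or complex floating-point" requires combining two abstract scalar types:

```python
import numpy as np

def is_real_floating(dt) -> bool:
    # np.floating covers float16/float32/float64/longdouble, but not complex dtypes.
    return np.issubdtype(dt, np.floating)

def is_any_floating(dt) -> bool:
    # "Real or complex floating-point" needs two checks against abstract types.
    return np.issubdtype(dt, np.floating) or np.issubdtype(dt, np.complexfloating)

print(is_any_floating(np.dtype(np.complex128)))   # True
print(is_real_floating(np.dtype(np.complex128)))  # False
```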
It should always be the case that the standard says "must contain (x, y, z)", either explicitly saying or implying that it's fine to also contain other things.
Yeah, I was just thinking in that direction as well. If it's hard to figure out what sets we may need, plus we need it to be easy to extend, plus we have naming issues with "float", then perhaps something like this, which is concise and explicit:

```python
def has_dtype(x: Union[array, dtype], kind: Union[str, dtype, tuple[Union[str, dtype], ...]]) -> bool:
    """
    Examples
    --------
    >>> has_dtype(x, 'integer')
    >>> has_dtype(x, 'real')
    >>> has_dtype(x, 'complex')
    >>> has_dtype(x, ('real', 'complex'))  # avoid both 'floating' and 'inexact', those are not good names
    >>> has_dtype(x, 'numeric')  # shorthand for ('integer', 'real', 'complex')
    >>> has_dtype(x, 'signed integer')
    >>> has_dtype(x, 'unsigned integer')
    >>> has_dtype(x, (float64, complex128))
    >>> has_dtype(x, int32)  # supports dtype subclasses for libraries that support those, unlike `== int32`
    """
```

A couple of thoughts for why this may be nice:
I will note that I am slightly unsure about the …
I think all the examples except the last (which IMHO seems like a different thing) can be done with sets:

```python
x in xp.integer_dtypes
x in xp.real_dtypes
x in xp.complex_dtypes
x in xp.real_dtypes | xp.complex_dtypes
x in xp.numeric_dtypes
x in xp.integer_dtypes - xp.unsigned_dtypes  # if we only want to add unsigned as a special set
x in xp.integer_dtypes & xp.unsigned_dtypes  # I don't think there are unsigned dtypes that are not integers, but just to be sure
x in {xp.float64, xp.complex128}
x == xp.int32 or (isinstance(xp.int32, type) and isinstance(x, xp.int32))
```
That does look nicer syntactically, thanks @vnmabus. Rather than …, `x.dtype in xp.integer_dtypes` then, I think. We can do the union of array and dtype in a function, but not if it's a set, which is perfectly fine. The main thing that won't work, I believe, is dtype subclasses:

```python
>>> int32 = 'int32'
>>> np.int32 = 'int32'  # example to simulate a library with string identifiers for dtypes
>>> isinstance(int32, np.int32)
Traceback (most recent call last):
  Input In [11] in <cell line: 1>
    isinstance(int32, np.int32)
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union
```
I mentioned this in a call a few weeks back: the pandas `is_foo_dtype` functions accept both, and that is a design choice we have largely come to regret. The performance overhead of checking for a `.dtype` attribute adds up quickly.
Good point @jbrockmendel. I never noticed it in …

```python
>>> import numpy as np
>>> real_dtypes = {np.float16, np.float32, np.float64, np.longdouble}
>>> %timeit np.float64 in real_dtypes
48.7 ns ± 0.211 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit np.issubdtype(np.float64, np.floating)
257 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
To circle back to this: this is true only for user-defined dtypes. If those exist (which may be unique to NumPy) …
Just to be sure, custom dtypes (working for each backend) won't ever be added to the standard, right? It would be great to be able to implement the logic for a custom dtype once and have it working everywhere, but probably that would be difficult to standardize. I saw that for units, for example, you recommended wrapping the array API backends instead in https://discuss.scientific-python.org/t/advice-and-guidance-about-array-api-for-a-units-package/.
I think it's safe to say that custom dtype support won't be added. Most libraries don't have it, and for NumPy it's still a work in progress to define a good API with all the functionality downstream library authors may need. That said, it would be nice that standard-compliant code like …
Let me make a few points for why I am leaning against the set approach, although it is still not quite clear cut:
In the end, I am not certain yet that the set approach works well for NumPy proper. Of course that is not actually a blocker for this API since there can be differences.
IIUC, I don't think it's likely pandas would change our current usage.
Thanks, @seberg, for the nice thoughts. Just wanna add a quick note.
This is a very nice point. It seems …

Also a very good point. Since we include the Python types in the type lattice, I think it is legitimate to do said check even if we don't plan to support scalars.
So it looks like we're (a) leaning towards the single-function version, and (b) only having it accept either a dtype or an array (avoiding the union of both). For (b), most of the time the thing to check is an array. However, dtype checking is also needed, and getting a dtype from an array is trivial while getting an array from a dtype is not. If the input was an array, … So we'd be looking at some flavor of:

```python
def is_dtype(x: dtype, kind: Union[str, dtype, tuple[Union[str, dtype], ...]]) -> bool:
    """
    >>> is_dtype(x, 'integer')
    >>> is_dtype(x, 'real')
    >>> is_dtype(x, 'complex')
    >>> is_dtype(x, ('real', 'complex'))  # avoid both 'floating' and 'inexact', those are not good names
    >>> is_dtype(x, 'numeric')  # shorthand for ('integer', 'real', 'complex')
    >>> is_dtype(x, 'signed integer')
    >>> is_dtype(x, 'unsigned integer')
    >>> is_dtype(x, float32)
    >>> is_dtype(x, (float64, complex128))
    """
```

or:

```python
def is_dtype(x: dtype, kind: str) -> bool:
    """
    >>> is_dtype(x, 'integer')
    >>> is_dtype(x, 'real')
    >>> is_dtype(x, 'complex')
    >>> is_dtype(x, 'numeric')
    >>> is_dtype(x, 'signed integer')
    >>> is_dtype(x, 'unsigned integer')
    """
```

or something in between (e.g., …). Looking at the … So perhaps this is the way to go?

```python
def is_dtype(x: dtype, kind: str | dtype) -> bool: ...
```
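For concreteness, a minimal sketch of the string-only flavor on top of NumPy, mapping the kind names from the examples above onto NumPy's one-character `dtype.kind` codes (the mapping and names are illustrative, not normative):

```python
import numpy as np

_KIND_CODES = {
    'bool': 'b',
    'signed integer': 'i',
    'unsigned integer': 'u',
    'integer': 'iu',
    'real': 'f',
    'complex': 'c',
    'numeric': 'iufc',
}

def is_dtype(x, kind: str) -> bool:
    # Translate the requested kind into NumPy dtype.kind character codes.
    return np.dtype(x).kind in _KIND_CODES[kind]

print(is_dtype(np.int32, 'numeric'))   # True
print(is_dtype(np.complex64, 'real'))  # False
```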
We had another look at this yesterday. We want to go for a flavor of the function-based implementation here; there was no clear preference among the options above. So let's try a vote; use emojis on this comment: …
I know I'm very late to this discussion, but as the array API is now implemented in NumPy, I've been exploring what it would do to my code. A couple of problems with …
I personally prefer seberg's proposal to use ordinary Python …
I think this isn't really true? At least, I can't think of many APIs where enums are common, while I can think of lots of libraries that use string args for keywords. Enums have a major design flaw - namespace cluttering. Imho that is far more important, also for ergonomics, than static type checking. The array API standard doesn't have many strings, but if NumPy had enums instead of strings or True/False/None keywords everywhere, that would be hundreds of extra objects.
I think we still have a more fundamental issue to solve: how to annotate arrays themselves. This should be done using a … The same will apply to other objects. Given that we have to be concerned about usage by consuming libraries and end users in an array-library-agnostic way, where it's effectively impossible for objects to have a class relationship, this is nontrivial to design. We haven't spent a whole lot of time on that aspect yet, and we should do that. The …
In fairness, Python didn't have enums until Python 3.4, and after that there has been talk about updating old APIs to use them.
I do love the Array API's compact namespace. I understand being very judicious about what gets into the namespace. I agree that if every method got its own enumerations, then the namespace might become overwhelming. Perhaps the ABCs (number, integer, inexact, signedinteger, unsignedinteger, floating, and complexfloating) could be tucked into a separate namespace (e.g., `xp.abc`, as used below).
Right, I'm not suggesting that.
Yeah, I'm looking forward to this!
I understand, but if they don't, then it is impossible (as far as I can tell) to maintain the type checking of array dtypes, which already works in NumPy. Edit: I just realized, but isn't the comparison roughly:

```python
def f(x: xp.Array):  # x can be from any array API library.
    yp = array_api_of(x)  # IIUC, you have to get the right array API library to answer the question, since x's dtype may not be known to numpy.array_api.
    if yp.is_dtype(x.dtype, 'integer'):
        ...
```

versus:

```python
import numpy.array_api as xp

def f(x: xp.Array):  # x can be from any array API library.
    if issubclass(x.dtype.type, xp.abc.integer):  # all array API implementers can inherit appropriately.
        ...
```
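For what it's worth, the ABC approach can be prototyped with Python's standard `abc` machinery; here is a sketch (the `integer` ABC is hypothetical, registering NumPy's scalar-type hierarchy purely for illustration):

```python
import abc
import numpy as np

class integer(abc.ABC):
    """Hypothetical abstract base class for integer dtype scalar types."""

# An array library registers its concrete scalar types once:
integer.register(np.signedinteger)
integer.register(np.unsignedinteger)

print(issubclass(np.int32, integer))    # True, via NumPy's scalar-type hierarchy
print(issubclass(np.float64, integer))  # False
```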
Maybe, not sure... I have to think about the typing aspect; it's nontrivial. The …
That's how you should use any function in the whole namespace (your …).
This RFC proposes adding data type inspection utilities to the array API specification.
Overview
Currently, the array API specification requires that conforming implementations provide a specified set of data type objects (see https://data-apis.org/array-api/2021.12/API_specification/data_types.html) and casting functions (see https://data-apis.org/array-api/2021.12/API_specification/data_type_functions.html).
However, the specification does not include APIs for array data type inspection (e.g., an API for determining whether an array has a complex number data type or a floating-point data type, etc.).
Prior Art
NumPy and its derivatives have `dtype` objects with extensive properties, including a `kind` property, which returns a character code indicating the general "kind" of data. For example, for relevant dtypes in the specification, NumPy uses the following character codes:

- `b`: boolean
- `i`: signed integer
- `u`: unsigned integer
- `f`: floating-point (real-valued)
- `c`: complex floating-point

This availability of the `kind` property is useful when wanting to branch based on input array data types (e.g., applying summation algorithms).

In PyTorch, `dtype` objects have `is_complex` and `is_floating_point` properties for checking a data type "kind". Additionally, PyTorch offers functional APIs `is_complex` and `is_floating_point` providing equivalent behavior.

Proposal
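To illustrate the prior art concretely, using NumPy's `kind` codes (the PyTorch equivalents are shown as comments to keep the snippet NumPy-only):

```python
import numpy as np

# NumPy: the dtype.kind character code.
print(np.dtype(np.complex64).kind)  # 'c'
print(np.dtype(np.uint16).kind)     # 'u'

# PyTorch equivalents:
# import torch
# torch.complex64.is_complex              # True
# torch.is_floating_point(torch.ones(2))  # True
```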
Given the proposal for adding complex number support to the specification (see #373 and #418), a greater need arises for the specification to require conforming implementations to provide standardized ways for data type inspection.

For example, conforming implementations will need to branch in `abs(x)` depending on whether `x` is real-valued or complex-valued. Similarly, in downstream user code, we can expect that users will inevitably encounter situations where they need to branch based on input array data types (e.g., when choosing summation algorithms).

As this specification has favored functional APIs, this RFC follows suit and proposes adding the following APIs to the specification:
- `has_complex_float_dtype(x: Union[array, dtype]) -> bool`: Returns a `bool` indicating whether an input array has a complex number data type (e.g., `complex64` or `complex128`).
- `has_real_float_dtype(x: Union[array, dtype]) -> bool`: Returns a `bool` indicating whether an input array has a (real-valued) floating-point number data type (e.g., `float32` or `float64`).
- `has_float_dtype(x: Union[array, dtype]) -> bool`: Returns a `bool` indicating whether an input array has a complex or real-valued floating-point number data type (e.g., `float32`, `float64`, `complex64`, or `complex128`).
- `has_unsigned_int_dtype(x: Union[array, dtype]) -> bool`: Returns a `bool` indicating whether an input array has an unsigned integer data type (e.g., `uint8`, `uint16`, `uint32`, `uint64`).
- `has_signed_int_dtype(x: Union[array, dtype]) -> bool`: Returns a `bool` indicating whether an input array has a signed integer data type (e.g., `int8`, `int16`, `int32`, `int64`).
- `has_int_dtype(x: Union[array, dtype]) -> bool`: Returns a `bool` indicating whether an input array has an integer (signed or unsigned) data type.
- `has_real_dtype(x: Union[array, dtype]) -> bool`: Returns a `bool` indicating whether an input array has a real-valued (integer or floating-point) data type.
- `has_bool_dtype(x: Union[array, dtype]) -> bool`: Returns a `bool` indicating whether an input array has a boolean data type.

The above APIs cover the list of data types currently described in the specification, are sufficiently specific to cover most use cases, and can be composed to address most anticipated data type set combinations.