
RFC: add data type inspection utilities to the array API specification #425

Closed
kgryte opened this issue May 5, 2022 · 33 comments · Fixed by #503
Labels
API extension Adds new functions or objects to the API.
Comments

@kgryte (Contributor) commented May 5, 2022

This RFC proposes adding data type inspection utilities to the array API specification.

Overview

Currently, the array API specification requires that conforming implementations provide a specified set of data type objects (see https://data-apis.org/array-api/2021.12/API_specification/data_types.html) and casting functions (see https://data-apis.org/array-api/2021.12/API_specification/data_type_functions.html).

However, the specification does not include APIs for array data type inspection (e.g., an API for determining whether an array has a complex number data type or a floating-point data type).

Prior Art

NumPy and its derivatives have dtype objects with extensive properties, including a kind property, which returns a character code indicating the general "kind" of data. For example, for relevant dtypes in the specification, NumPy uses the following character codes:

  • b: boolean
  • i: signed integer
  • u: unsigned integer
  • f: floating-point (real-valued)
  • c: complex floating-point

In [1]: np.zeros((3,4)).dtype.kind
Out[1]: 'f'

The kind property is useful when branching on input array data types (e.g., when choosing a summation algorithm):

if x.dtype.kind == 'f':
    # do one thing
else:
    # do another thing

In PyTorch, dtype objects have is_complex and is_floating_point properties for checking a data type "kind".

Additionally, PyTorch offers functional APIs is_complex and is_floating_point providing equivalent behavior.
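
For illustration, a short snippet showing both flavors (a sketch of current PyTorch behavior, not normative):

import torch

x = torch.zeros(3, dtype=torch.complex64)

# dtype object properties
x.dtype.is_complex          # True
x.dtype.is_floating_point   # False

# functional equivalents (these take a tensor)
torch.is_complex(x)          # True
torch.is_floating_point(x)   # False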

Proposal

Given the proposal for adding complex number support to the specification (see #373 and #418), there is a greater need for the specification to require conforming implementations to provide standardized ways to inspect data types.

For example, conforming implementations will need to branch in abs(x) depending on whether x is real-valued or complex-valued. Similarly, in downstream user code, we can expect that users will inevitably encounter situations where they need to branch based on input array data types (e.g., when choosing summation algorithms).

As this specification has favored functional APIs, this RFC follows suit and proposes adding the following APIs to the specification:

has_complex_float_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a complex floating-point data type (e.g., complex64 or complex128).

has_real_float_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a real-valued floating-point data type (e.g., float32 or float64).

has_float_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a real- or complex-valued floating-point data type (e.g., float32, float64, complex64, or complex128).

has_unsigned_int_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has an unsigned integer data type (e.g., uint8, uint16, uint32, or uint64).

has_signed_int_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a signed integer data type (e.g., int8, int16, int32, or int64).

has_int_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has an integer (signed or unsigned) data type.

has_real_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a real-valued (integer or floating-point) data type.

has_bool_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a boolean data type.


The above APIs cover the list of data types currently described in the specification, are sufficiently specific to cover most use cases, and can be composed to address most anticipated data type set combinations.
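
As an illustration only (not part of the proposal), here is a NumPy-flavored sketch of how a few of these helpers could be backed by the dtype.kind codes shown in the Prior Art section; the handling of the Union[array, dtype] input is likewise just one possible choice:

import numpy as np

# Illustrative sketch: back the proposed helpers with NumPy's dtype.kind codes.
def _kind(x) -> str:
    # Accept either an array or a dtype object, per the proposed signatures.
    dt = x.dtype if isinstance(x, np.ndarray) else np.dtype(x)
    return dt.kind

def has_complex_float_dtype(x) -> bool:
    return _kind(x) == "c"

def has_real_float_dtype(x) -> bool:
    return _kind(x) == "f"

def has_float_dtype(x) -> bool:
    return _kind(x) in ("f", "c")

def has_int_dtype(x) -> bool:
    return _kind(x) in ("i", "u")

def has_bool_dtype(x) -> bool:
    return _kind(x) == "b"

has_float_dtype(np.zeros((3, 4)))   # True
has_int_dtype(np.dtype("uint8"))    # True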

@jakirkham (Member) commented May 5, 2022

Also related ( #152 )

Edit: If we look into type naming (briefly discussed), this discussion around type naming in Zarr may be of interest ( zarr-developers/zarr-specs#131 )

@honno (Member) commented May 18, 2022

Like can_cast() or result_type(), could these utils take both dtypes and arrays? I'd personally want these utils for dtype objects themselves, but admittedly my own use cases are not quite aligned with those of most array consumers.

@kgryte (Contributor, Author) commented May 23, 2022

Update: I've updated the OP as follows based on feedback here and in the last array API consortium meeting.

  1. Functions now accept both arrays and dtypes.
  2. Function names include a _dtype suffix (as suggested during the consortium meeting)
  3. Function names begin with a has_ prefix. This helps avoid conflicts with existing APIs (e.g., PyTorch) and matches how one might describe an array (e.g., has shape X, has data type Y, etc).
  4. Included both real-valued and generic float APIs to match specification data type categories.
  5. Included a generic real dtype API to match specification data type categories.

@rgommers (Member) commented:

Some more prior art:

As this specification has favored functional APIs

Given that dtype objects are immutable and have no state, this should also work for JAX et al. Not saying that that's my preference (I'm not yet sure), but this RFC proposes a lot of functions ...

3 underscores in a name like has_real_float_dtype is also not ideal.

@kgryte (Contributor, Author) commented May 23, 2022

Given that dtype objects are immutable and have no state, this should also work for JAX et al. Not saying that that's my preference (I'm not yet sure), but this RFC proposes a lot of functions ...

Whether methods or functions, surface area would be the same. The list can obviously be culled; however, I do think there is some advantage to matching the categories as used in the spec, especially for providing consistent APIs for input argument validation.

3 underscores in a name like has_real_float_dtype is also not ideal.

The number of underscores is not super important, IMO. Instead, we're probably concerned about number of characters. Originally, I left out the _dtype suffix, which would reduce the function name length; however, consortium members voiced desire for such a suffix in the array API meeting.

I don't have a strong opinion here, although the current naming convention is arguably more literate.

@vnmabus commented Jul 14, 2022

Silly question, why not do:

if array.dtype in <set_of_dtypes>:
    ...

and require implementations to provide some predefined sets, such as "set of all supported integer dtypes" or "set of all supported floating point dtypes"?
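
For concreteness, a minimal sketch of what such predefined collections could look like in a NumPy-backed namespace (the names and the frozenset choice are purely illustrative):

import numpy as np

# Illustrative predefined dtype collections
integer_dtypes = frozenset(
    np.dtype(t) for t in (np.int8, np.int16, np.int32, np.int64,
                          np.uint8, np.uint16, np.uint32, np.uint64)
)
floating_dtypes = frozenset(np.dtype(t) for t in (np.float32, np.float64))

x = np.arange(5)
if x.dtype in integer_dtypes:
    ...  # integer-specific code path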

@rgommers (Member) commented:

and require implementations to provide some predefined sets, such as "set of all supported integer dtypes" or "set of all supported floating point dtypes"?

That does seem more appealing indeed; it's what can already be done today and it reads fairly well. I think I prefer that over both the has_* functions in this proposal and the numpy issubdtype design.

@seberg (Contributor) commented Jul 18, 2022

I don't like issubdtype. For NumPy, I could imagine isinstance(arr.dtype, InexactDType) (or similar). So that way the API here would be isinstance(arr.dtype, some_object). The problem is that I am not sure if an isinstance API would work for everyone.

For arr.dtype in set_of_dtypes there are three things to keep in mind:

  1. The set_of_dtypes will be different for each library, because bfloat16, float16, and others do not exist in every implementation. Implementers can extend the API after all.
  2. For NumPy, users may extend the API reasonably soon, for example by adding bfloat16 or a multi-precision float object.
  3. Sets might be tricky for NumPy right now in either case (although that could likely be made to work). There is an arbitrary number of possible dtype instances; although they should compare equal to a limited set, that set is confusingly large (byte order matters).

I do think none of these is particularly problematic. But I would say that this should not be a set, but rather an opaque object that supports the in operator.

@honno (Member) commented Jul 21, 2022

A minor pro of dtype sets is that they could be a way for a library to communicate which dtypes it supports (thinking of PyTorch, which only supports uint8 among the unsigned integers). Useful here and there, like telling Hypothesis not to try generating uint{16/32/64}.

@rgommers (Member) commented Sep 6, 2022

@leofang pointed out that this is blocking the addition of real and conj (and I imagine imag too), so it'd be great to finalize this. The majority of folks who have weighed in seem to prefer a set/collection type of approach. So here's a suggested API for that, in line with @seberg's last sentence above.

  1. There must be objects integer_dtypes, floating_point_dtypes, and complex_dtypes,
  2. The syntax dtype in xxx_dtypes must yield a boolean value with the expected result (to be detailed out more in the spec),
  3. xxx_dtypes must contain all the expected dtypes that are part of the standard, and may contain additional dtypes of the same kind
  4. The objects may be of any kind, e.g. a set or a custom class instance.

Other thoughts:

  • No object for boolean is needed, because bool already supports __eq__, so array.dtype == bool is enough.
  • Also no separate signed/unsigned integer objects, because that's a bit much for the API / less needed. This is mostly a convenient way to spell array.dtype in (dtype1, dtype2, ...) anyway.
  • The one name where there's not a single obvious choice is floating_point_dtypes. It could also be float_dtypes, floating_dtypes, or real_dtypes for example.
  • Not specifying the type of these objects is on purpose, to make it easy to for example have an API that adds user-defined dtypes in.
    • That means that for static typing we need another Protocol. Not completely ideal, but imho better than restricting implementation choices for libraries (see point 3 in @seberg's comment above about why set is tricky for NumPy).
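
The Protocol mentioned in the last point could look roughly like this (purely illustrative; the standard would need to pin down the exact shape):

from typing import Protocol

class DTypeCollection(Protocol):
    # Anything usable as `dtype in xxx_dtypes` satisfies this protocol,
    # whether it is a set, a tuple, or a custom class instance.
    def __contains__(self, dtype, /) -> bool: ...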

One alternative with a similar API surface is to add 3 functions with the same functionality instead. Those functions could be 3 of the ones in the issue description here (e.g., has_integer_dtype, has_floating_point_dtype, has_complex_dtype). Considerations:

  • Pro: it's better for static typing,
  • Con: it introduces an asymmetry between supported and unsupported sets - we need the dtype in xxx anyway when the predefined objects aren't the right ones.

I think the con is more important than the pro here. But I'd say either choice is pretty reasonable here.

@rgommers (Member) commented Sep 6, 2022

Just to make sure @leofang, both flavors are fine for accelerators, right? When the spec says something should return a bool, that's not a problem - only Python control flow like if _expr_yielding_a_bool is. So a function is not preferred from that perspective. Or maybe there's a significant amount of extra implementation complexity for the dtype in xxx_dtypes version?

@kgryte (Contributor, Author) commented Sep 8, 2022

My preference would be to match more closely the spec on this. Namely, have the following objects:

  • numeric_dtypes: int8...64, uint8...64, float32/64, complex64/128
  • real_dtypes: int8...64, uint8...64, float32/64
  • float_dtypes: float32/64, complex64/128
  • real_float_dtypes: float32/64
  • complex_float_dtypes: complex64/128
  • integer_dtypes: int8...64, uint8...64

This would mean 6 objects, which would, as it stands now, cover almost the entirety of the spec. As these are relatively trivial to implement and expose, I don't see this as imposing an undue burden on array libraries.

However, if only integer, float, and complex collections are provided, repeating them to build the composite groups used in the spec (both in userland and in library implementations) would be mildly annoying and would likely just lead array libraries to implement the composite groups anyway.

E.g., suppose we want to validate an array for a function which supports all numeric dtypes. With just integer, float, and complex collections, I'd need to do

def foo(x: array):
    dt = x.dtype
    if dt in integer_dtypes or dt in float_dtypes or dt in complex_dtypes:
        ...

Given the opacity of what's intended in the conditional, one might be tempted to write a small helper function transforming the check to something more literate. And given the ubiquity of composite dtype categories in the spec, I'd argue we should just include the composite groups in the spec directly so that array library clients don't need to reimplement these groups from library to library.

@rgommers (Member) commented Sep 8, 2022

E.g., suppose we want to validate an array for a function which supports all numeric dtypes. With just integer, float, and complex collections, I'd need to do

This is a good point. Although in general this isn't done for library code, even if the library provided string/object/etc. dtypes. It is difficult to pick the right sets here.

My preference would be to match more closely the spec on this. Namely, have the following objects:

I don't think that will work; the names don't map to current practice and are not intuitive enough. float_dtypes in particular is bad. See torch.is_floating_point, and for numpy:

>>> x = np.ones(2, dtype=np.float64)
>>> x2 = np.ones(2, dtype=np.complex128)
>>> np.issubdtype(x.dtype, np.floating)
True
>>> np.issubdtype(x2.dtype, np.floating)
False

@kgryte (Contributor, Author) commented Sep 8, 2022

Understood. We're not starting from a blank slate. Although, presumably, at least for Torch, the need for is_complex and is_floating_point would no longer exist, opening up a path to eventual deprecation.

For NumPy, well, 🤷‍♂️.

The notion of what is considered a "floating-point" dtype arose previously in the consortium. Then, it was decided that under the umbrella of floating-point are both real and complex. Hence, the OP.

Unfortunately, however, I don't have, atm, a more intuitive name for "real + complex floating-point dtypes", but I don't think this negates the general desirability of composite groups.

@seberg (Contributor) commented Sep 8, 2022

NumPy calls real + complex floating point types inexact (not that the name is used often). The only problem with using floating point for both is that people may just not think about complex at all, and it may be a false friend in practice.

I have to think some more about the API choices we have here. I currently think that whatever definition we have for floating point dtypes should not be fixed to the standard? (Rather it should be valid to extend it e.g. with bfloat16?)

The last point might mean that we should have an API to allow users to spell isdtype(arr.dtype, (float32, float64)), or isofdtype(arr, (float32, float64)).

Annoyingly, NumPy doesn't even have a good way to spell it. Basically, you would first use issubdtype or dtype.kind == "f" to filter out floating (not future proof/extensible). And then use can_cast or <= on the dtypes, because == on dtypes is too strict to be useful. (dtype "equality" is in practice more like dtype1 <= dtype2 and dtype2 <= dtype1 rather than dtype1 == dtype2)
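
For example, a small NumPy snippet illustrating why == on dtypes is stricter than one might expect (byte order participates in equality); current behavior, not a proposal:

import numpy as np

# Two float64 dtypes that differ only in byte order:
little = np.dtype("<f8")
big = np.dtype(">f8")

little == big             # False: equality is sensitive to byte order
little.kind == big.kind   # True: both are 'f'
np.can_cast(big, little)  # True: casting between them is safe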

@rgommers (Member) commented Sep 8, 2022

I have to think some more about the API choices we have here. I currently think that whatever definition we have for floating point dtypes should not be fixed to the standard? (Rather it should be valid to extend it e.g. with bfloat16?)

It should always be the case that the standard says "must contain (x, y, z)" - and either explicitly say or imply that it's fine to also contain other things.

The last point might mean that we should have an API to allow users to spell isdtype(arr.dtype, (float32, float64)), or isofdtype(arr, (float32, float64)).

Yeah, I was just thinking in that direction as well. If it's hard to figure out what sets we may need, plus we need it to be easy to extend, plus we have naming issues with "float", then perhaps something like this which is concise and explicit:

def has_dtype(x: Union[array, dtype], kind: Union[str, tuple[Union[str, dtype], ...]]) -> bool:
    """
    Examples
    --------
    >>> has_dtype(x, 'integer')
    >>> has_dtype(x, 'real')
    >>> has_dtype(x, 'complex')
    >>> has_dtype(x, ('real', 'complex'))  # avoid both 'floating' and 'inexact', those are not good names
    >>> has_dtype(x, 'numeric')  # shorthand for ('integer', 'real', 'complex')
    >>> has_dtype(x, 'signed integer')
    >>> has_dtype(x, 'unsigned integer')
    >>> has_dtype(x, (float64, complex128))
    >>> has_dtype(x, int32)  # supports dtype subclasses for libraries that support those, unlike `== int32`
    """

A couple of thoughts for why this may be nice:

  • It's quite different from existing APIs, so no introduction issues or users confusing it with very similarly-named APIs,
  • It's pretty concise and explicit,
  • It avoids confusion around what "floating" or "floating-point" means,
  • It's only a single API addition - I think this is helpful; 3 would still be okay but appears to not be enough, and 6-8 or more is too much imho,
  • It's extensible, unlike any fixed collections of dtypes encoded in API name.

@seberg (Contributor) commented Sep 8, 2022

I will note that I am slightly unsure about the Union[array, dtype]. Not a concrete concern though; it is probably that NumPy is quite relaxed about what it accepts as a dtype (and also array-like...), which makes these unions feel brittle/unclear to me.

@vnmabus commented Sep 8, 2022

Yeah, I was just thinking in that direction as well. If it's hard to figure out what sets we may need, plus we need it to be easy to extend, plus we have naming issues with "float", then perhaps something like this which is concise and explicit

  • In case we go in that direction, I would not limit the type of kind to tuples, but I would accept any kind of collection.
  • If there are in the future API functions that can only be applied to a particular dtype, and we want Mypy to warn about it, with the set approach we could define a Protocol to tag the different kinds of types, and defining the different sets with the appropriate Protocol type would make Mypy narrow types after the check, I think. With this approach, has_dtype must return a TypeGuard to do the narrowing, and the guard depends on the kind parameter. It could be done with overloads, except for combinations of strings like ("real", "complex"), AFAIK.

I think all the examples except the last (which IMHO seems like a different thing) can be done with sets:

x in xp.integer_dtypes
x in xp.real_dtypes
x in xp.complex_dtypes
x in xp.real_dtypes | xp.complex_dtypes
x in xp.numeric_dtypes
x in xp.integer_dtypes - xp.unsigned_dtypes # If we only want to add unsigned as a special set
x in xp.integer_dtypes & xp.unsigned_dtypes # I don't think there are unsigned dtypes that are not integers but just to be sure
x in {xp.float64, xp.complex128}
x == xp.int32 or (isinstance(xp.int32, type) and isinstance(x, xp.int32))

@rgommers (Member) commented Sep 8, 2022

That does look nicer syntactically, thanks @vnmabus. Rather than x in xp.integer_dtypes it should be

x.dtype in xp.integer_dtypes

then I think - we can do the union of array and dtype in a function, but not if it's a set. Which is perfectly fine. The main thing that won't work I believe is dtype subclasses. dtype in a_set should compare with ==, not with isinstance. And explicit isinstance cannot work:

>>> int32 = 'int32'
>>> np.int32 = 'int32'  # example to simulate a library with string identifiers for dtypes
>>> isinstance(int32, np.int32)
Traceback (most recent call last):
  Input In [11] in <cell line: 1>
    isinstance(int32, np.int32)
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

@jbrockmendel commented:

I will note that I am slightly unsure about the Union[array, dtype]

I mentioned this in a call a few weeks back: the pandas is_foo_dtype functions accept both and that is a design choice we have largely come to regret. The performance overhead of checking for a .dtype attribute adds up quickly.

@rgommers (Member) commented Sep 8, 2022

Good point @jbrockmendel. I never noticed it in numpy, but did a quick check and yes these checks are expensive (still fast though):

>>> import numpy as np
>>> real_dtypes = {np.float16, np.float32, np.float64, np.longdouble}
>>> %timeit np.float64 in real_dtypes
48.7 ns ± 0.211 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit np.issubdtype(np.float64, np.floating)
257 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

The main thing that won't work I believe is dtype subclasses.

To circle back to this, this is true only for user-defined dtypes. If those exist (which may be unique to numpy), it's perhaps okay to then require registering them somehow so they get added to xxx_dtypes.

@vnmabus commented Sep 8, 2022

Just to be sure, custom dtypes (working for each backend) won't ever be added to the standard, right? It would be great to be able to implement the logic for a custom dtype once and have it working everywhere, but that would probably be difficult to standardize.

I saw that for units, for example, you recommended to wrap the API backends instead in https://discuss.scientific-python.org/t/advice-and-guidance-about-array-api-for-a-units-package/.

@rgommers (Member) commented Sep 8, 2022

I think it's safe to say that custom dtype support won't be added. Most libraries don't have it, and for NumPy it's still a work-in-progress to define a good API with all the functionality downstream library authors may need.

That said, it would be nice that standard-compliant code like x.dtype in real_dtypes in libraries like SciPy and scikit-learn will work for those NumPy users that do end up creating their own dtype. I think it will, as long as NumPy has an API that allows those users to extend real_dtypes with their new dtype.

@seberg (Contributor) commented Sep 13, 2022

Let me make a few points for why I am leaning against the set approach, although it is still not quite clear cut:

  • The non-set approach is similar to isinstance. I am not sure the set approach has a clear inspiration e.g. in typing? (The notation of using set operations has, but checking with in?)
  • The set approach just feels a bit too smart to me. :)
  • If floating_dtypes is just a set/tuple (NumPy cannot do that, I think), it is not clear that arr in floating_dtypes would raise an error rather than always returning False (it must be arr.dtype in floating_dtypes).
  • In NumPy, I can see things like is_of_dtype/has_dtype(1., np.floating) making sense. Where 1. is actually just a Python float. Allowing to generalize "dtype checking" to objects that may not have a .dtype attribute.
    Yes, this would be to have better support for scalars, which is something that ideally is not supposed to exist here. Would this be useful e.g. for pandas, @jbrockmendel (since I always wonder if pandas has more need of scalars than an array API)?
  • In NumPy it would be nice to use np.floating for this, but that is also the scalar type, which may lead to a bit strange overloading. If we have two functions (has_dtype and is_dtype), that becomes unproblematic. (An error could point to the other where appropriate.)

In the end, I am not certain yet that the set approach works well for NumPy proper. Of course that is not actually a blocker for this API since there can be differences.

@jbrockmendel commented:

Would this be useful e.g. for pandas, @jbrockmendel (since I always wonder if pandas has more need of scalars than an array API)?

IIUC, I don't think it's likely pandas would change our current usage

@leofang (Contributor) commented Sep 13, 2022

Thanks, @seberg, for the nice thoughts. Just wanna add a quick note.

  • If floating_dtypes is just a set/tuple (NumPy cannot do that, I think), it is not clear that arr in floating_dtypes would raise an error rather than always returning False (it must be arr.dtype in floating_dtypes).

This is a very nice point. It seems floating_dtypes cannot be a plain set/tuple, but would need to be at least a subclass of them with a custom __contains__ that first checks the type of the object before delegating to the parent class's in check.
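
A rough sketch of that kind of wrapper, assuming a frozenset of NumPy dtypes underneath (illustrative only):

import numpy as np

class DTypeSet(frozenset):
    # Reject arrays (and other non-dtype objects) up front instead of
    # silently returning False, then delegate to frozenset.__contains__.
    def __contains__(self, item) -> bool:
        if isinstance(item, np.ndarray):
            raise TypeError("expected a dtype, got an array; use `arr.dtype in ...`")
        return super().__contains__(np.dtype(item))

floating_dtypes = DTypeSet({np.dtype(np.float32), np.dtype(np.float64)})

np.float64 in floating_dtypes          # True (scalar type is coerced to a dtype)
np.dtype("int32") in floating_dtypes   # False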

  • In NumPy, I can see things like is_of_dtype/has_dtype(1., np.floating) making sense. Where 1. is actually just a Python float. Allowing to generalize "dtype checking" to objects that may not have a .dtype attribute.

Also a very good point. Since we include the Python types in the type lattice, I think it is legitimate to do the said check even if we don't plan to support scalars.

@rgommers (Member) commented Sep 20, 2022

So it looks like we're (a) leaning towards the single-function version, and (b) only have it accept either a dtype or an array (avoiding the union of both).

For (b), most of the time the thing to check is an array. However, dtype checking is also needed, and getting a dtype from an array is trivial while an array from a dtype is not. If the input was an array, has_dtype is a logical name. If it's a dtype, I think is_dtype is better. That is also a name that AFAIK isn't used anywhere.

So we'd be looking at some flavor of:

def is_dtype(x: dtype, kind: Union[str, dtype, tuple[Union[str, dtype], ...]]) -> bool:
    """
    >>> is_dtype(x, 'integer')
    >>> is_dtype(x, 'real')
    >>> is_dtype(x, 'complex')
    >>> is_dtype(x, ('real', 'complex'))  # avoid both 'floating' and 'inexact', those are not good names
    >>> is_dtype(x, 'numeric')  # shorthand for ('integer', 'real', 'complex')
    >>> is_dtype(x, 'signed integer')
    >>> is_dtype(x, 'unsigned integer')
    >>> is_dtype(x, float32)
    >>> is_dtype(x, (float64, complex128))
    """

or

def is_dtype(x: dtype, kind: str) -> bool:
    """
    >>> is_dtype(x, 'integer')
    >>> is_dtype(x, 'real')
    >>> is_dtype(x, 'complex')
    >>> is_dtype(x, 'numeric') 
    >>> is_dtype(x, 'signed integer')
    >>> is_dtype(x, 'unsigned integer')
    """

or something in between (e.g., kind: str | dtype).

Looking at the np.issubdtype usage in SciPy, there's a roughly equal mix between checking against a set of dtypes (e.g., np.issubdtype(dtype, np.complexfloating)) and checking against a single dtype (e.g., np.issubdtype(dtype, np.int32)). Both seem kinda useful. A combination (tuple of sets/dtypes) is probably not necessary.

So perhaps this is the way to go?

def is_dtype(x: dtype, kind: str | dtype) -> bool:
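
For concreteness, a minimal sketch of how a library could implement that signature on top of NumPy; the kind strings follow the docstring examples above, and none of this is normative:

import numpy as np

# Map the proposed kind strings onto NumPy dtype.kind character codes.
_KIND_CODES = {
    "bool": "b",
    "signed integer": "i",
    "unsigned integer": "u",
    "integer": "iu",
    "real": "f",
    "complex": "c",
    "numeric": "iufc",
}

def is_dtype(x, kind) -> bool:
    dt = np.dtype(x)
    if isinstance(kind, str):
        return dt.kind in _KIND_CODES[kind]
    return dt == np.dtype(kind)  # single dtype: compare by equality

is_dtype(np.dtype("float32"), "real")       # True
is_dtype(np.dtype("complex64"), "numeric")  # True
is_dtype(np.dtype("int32"), np.int32)       # True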

@rgommers (Member) commented Oct 7, 2022

We had another look at this yesterday. We want to go for a flavor of the function-based implementation here; there was no clear preference among the options above. So let's try a vote - use emojis on this comment:

  • 👍🏼 if you prefer is_dtype(x: dtype, kind: str)
  • 🎉 if you prefer is_dtype(x: dtype, kind: str | dtype)
  • 🚀 if you prefer is_dtype(x: dtype, kind: str | dtype | tuple[Union[str, dtype], ...])

@NeilGirdhar commented Dec 23, 2022

I know I'm very late to this discussion, but as the array API is now implemented in NumPy, I've been exploring what it would do to my code.

A couple of problems with is_dtype, compared to ordinary Python types (as proposed by seberg), are that:

  • is_dtype uses strings, which can be error-prone. If you have a typo, it may not be caught until you run your program and hit the offending code. Yes, you can annotate is_dtype with Literal (see the sketch at the end of this comment), but if the type codes are passed from other functions (as str), then the validation won't happen. It feels more ergonomic to me to have objects rather than magic strings.

    One of the things I love about the Array API is the constrained interface that feels way less bug prone. It's not a huge burden to have to import a special object instead of using a string, and it prevents mistakes and allows type-checkers to find errors. It's the same reason people generally prefer enumeration objects over strings.

  • The various kinds cannot be checked by type checkers. Right now, it's possible to annotate an array as numpy.typing.NDArray[np.floating[Any]]. I do this for various numpy array types, and this catches many bugs thanks to numpy's excellent implementation of type annotations. If you don't provide base classes, then how are you supposed to have these annotations?

    If I were to vote, I would have voted for:

    is_dtype(x: dtype, kind: dtype | tuple[dtype, ...])

    which may as well have been written as simply issubclass(x.type, kind).

I personally prefer seberg's proposal to use ordinary Python issubclass with a tree of Python types. Any thoughts on this? With is_dtype, how can I accomplish the above type annotations?
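
For what it's worth, the string flavor can be given some static safety with Literal; a sketch (the kind names mirror the examples above, and this does not address the dtype-narrowing point):

from typing import Literal, Union

DTypeKind = Literal[
    "bool", "signed integer", "unsigned integer", "integer",
    "real", "complex", "numeric",
]

def is_dtype(x, kind: Union[DTypeKind, tuple[DTypeKind, ...]]) -> bool: ...

A type checker would then flag a literal typo such as is_dtype(x, 'integre'), though, as noted, not when the string arrives as a plain str from elsewhere.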

@rgommers (Member) commented:

It's the same reason people generally prefer enumeration objects over strings.

I think this isn't really true? At least, I can't think of many APIs where enums are common, while I can think of lots of libraries that use string args for keywords.

Enums have a major design flaw - namespace cluttering. Imho that is far more important, also for ergonomics, than static type checking.

The array API standard doesn't have many strings, but if NumPy had enums instead of strings or True/False/None keywords everywhere, that would be hundreds of extra objects.

With is_dtype, how can I accomplish the above type annotations?

I think we still have a more fundamental issue to solve: how to annotate arrays themselves. This should be done using a Protocol I believe, see gh-229.

The same will apply to other objects. Given that we have to be concerned about usage by consuming libraries and end users in an array-library-agnostic way, where it's effectively impossible for objects to have a class relationship, this is nontrivial to design. We haven't spent a whole lot of time on that aspect yet - and we should do that.

The array[dtype] is one level more complex. And it's not just dtype, there's also device, dimensionality, etc. Even in NumPy this is still very much a work in progress. It's probably best to split that off into a new issue - I don't think dtypes having a class hierarchy or not is the primary issue here.

@NeilGirdhar commented Dec 23, 2022

I think this isn't really true? At least, I can't think of many APIs where enums are common, while I can think of lots of libraries that use string args for keywords.

In fairness, Python didn't have enums until Python 3.4, and after that there has been talk about updating old APIs to use them.

Enums have a major design flaw - namespace cluttering. Imho that is far more important, also for ergonomics, than static type checking.

I do love the Array API's compact namespace. I understand being very judicious about what gets into the namespace. I agree that if every method got its own enumerations, then the namespace might become overwhelming.

Perhaps the ABCs (number, integer, inexact, signedinteger, unsignedinteger, floating, and complexfloating) could be tucked into xp.abc? Then, the root namespace would only have one extra symbol (abc) and one fewer symbol (is_dtype). It also mirrors Python's ABCs in collections and numbers. What do you think?

The array API standard doesn't have many strings, but if NumPy had enums instead of strings or True/False/None keywords everywhere, that would be hundreds of extra objects.

Right, I'm not suggesting that.

I think we still have a more fundamental issue to solve: how to annotate arrays themselves. This should be done using a Protocol I believe, see #229.

Yeah, I'm looking forward to this!

I don't think dtypes having a class hierarchy or not is the primary issue here.

I understand, but if they don't, then it is impossible (as far as I can tell) to maintain the type checking of array dtypes, which already works in numpy.


Edit: I just realized: isn't the is_dtype approach more complex for users? You have to use it like this, right?

def f(x: xp.Array):  # x can be from any Array API library.
  yp = array_api_of(x)  # IIC, you have to get the right array API library to answer the question since x's dtype may not be known to numpy.array_api.
  if yp.is_dtype(x.dtype, 'integer'):

versus

import numpy.array_api as xp

def f(x: xp.Array):  # x can be from any Array API library.
  if issubclass(x.dtype.type, xp.abc.integer):  # All Array API implementers can inherit appropriately.

@rgommers (Member) commented:

Perhaps the ABCs (number, integer, inexact, signedinteger, unsignedinteger, floating, and complexfloating) could be tucked into xp.abc? Then, the root namespace would only have one extra symbol (abc) and one fewer symbol (is_dtype). It also mirrors Python's ABCs in collections and numbers. What do you think?

Maybe, not sure .... I have to think about the typing aspect, it's nontrivial. The collections ABCs are useful indeed. The numbers ones are terrible, and I believe even Guido is on record saying they were a mistake.

@rgommers (Member) commented:

Edit: I just realized, but isn't the is_dtype approach more complex for users? You have to use it like this, right?

That's how you should use any function in the whole namespace (your import numpy.array_api as xp alternative is non-portable), so I think they're equivalent. The is_dtype(x.dtype, 'integer') line is shorter and simpler I'd say.
