Check that literal strings/int/float belong to /is excluded from a set/range of values #478

Closed
sametmax opened this issue Oct 2, 2017 · 34 comments

Comments

@sametmax

sametmax commented Oct 2, 2017

Opened in python/mypy#4040, but moved here on @JukkaL's advice.

Some debate took place in there, but I'll copy the original post here for context:


It's a common practice to pass literal strings as arguments. In Python, it's even more important, as strings are often used instead of byte flags, constants or enums.

You often end up checking if those literals are passed correctly so you can give some debug information:

  • sorry, the parameter mode expects a string among "started, paused, cancelled";

  • you can only use an integer between 0 and 16.

etc.

The problem with that is that it's done at runtime despite the fact that those arguments are completely static. So the IDE can't use this information to help you write code. Typically, I would have code completion and linting helping me with "started/paused/cancelled" if it was an enum, but not if it's a string.
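
As a minimal sketch of the runtime checking described above (the function body and the `level` parameter name are assumptions for illustration, not code from any real library), the validation typically looks like this:

```python
MODES = ("started", "paused", "cancelled")


def foo(mode: str, level: int) -> None:
    """Validate both arguments by hand, at call time."""
    if mode not in MODES:
        raise ValueError(f"mode expects a string among {MODES}, got {mode!r}")
    if not 0 <= level <= 16:
        raise ValueError(f"level must be an integer between 0 and 16, got {level!r}")


foo("started", 6)        # passes
try:
    foo("Start", 6)      # the mistake is only reported once this line runs
except ValueError as exc:
    print(exc)
```

Nothing in the annotations tells an IDE or a type checker which strings are valid, which is the gap this issue is about.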

With this concept I could do:

MODES = ('started', 'paused', 'cancelled')
LEVELS = (
    range(0, 5),
    range(40, float('+inf')), 
)

def foo(mode: typing.Among(MODES), levels: typing.Among(LEVELS)):
    pass

So that mypy could alert my users if they do:

foo("Start", 6)

Of course, it's only for string/int/float literals, maybe bytes and tuples, as we need the information to be immutable and simple to be usable.

@ilevkivskyi
Member

a lot of people don't use enum. It's very overkill when you start a project, especially with Python and its simple syntax

I don't see how it is "very overkill" to write:

from your_lib import foo, mode

foo(mode.started)

instead of

from your_lib import foo

foo('started')

Anyway, if we are going to consider something like this, I would propose an already discussed idea: literal types, so that the first example will be

from typing import Union, Literal

Mode = Union[Literal['started'], Literal['paused'], Literal['cancelled']]

def foo(mode: Mode): ...

foo('started') # OK
foo('wrong') # Error!

x = 'started'
foo(x) # Also an error, we can't track a 'str' variable

The same would work e.g. to describe open (currently special-cased in mypy) etc.:

@overload
def open(name: str, mode: Literal['r']) -> IO[str]: ...
@overload
def open(name: str, mode: Literal['rb']) -> IO[bytes]: ...
@overload
def open(name: str, mode: str) -> IO[Any]: ...

This will not cover the integer ranges, but I think it is too advanced for a typechecker (I think we could support just simple/common dependent types).
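
For reference, a runnable sketch of the overload pattern, assuming the `Literal` form that was later standardized in PEP 586 and shipped in `typing` from Python 3.8. The wrapper name `open_file` is hypothetical, and only the static return type is refined; the runtime behaviour is plain `open`:

```python
import os
import tempfile
from typing import IO, Any, Literal, overload


@overload
def open_file(name: str, mode: Literal["r"]) -> IO[str]: ...
@overload
def open_file(name: str, mode: Literal["rb"]) -> IO[bytes]: ...
@overload
def open_file(name: str, mode: str) -> IO[Any]: ...
def open_file(name: str, mode: str) -> IO[Any]:
    # Single runtime implementation shared by all overloads.
    return open(name, mode)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w") as fh:
        fh.write("hello")
    with open_file(path, "r") as text_file:
        text = text_file.read()       # a checker is expected to see IO[str] here
    with open_file(path, "rb") as binary_file:
        raw = binary_file.read()      # and IO[bytes] here
```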

@shoyer

shoyer commented Oct 26, 2017

Another use-case is booleans, e.g., the return value from pandas.to_datetime() is either a pandas.DatetimeIndex or numpy.ndarray, based upon whether it is called with the arguments box=True or box=False.

Both boolean flags and strings as enums are endemic within Python's numerical computing ecosystem.

@JukkaL
Contributor

JukkaL commented Oct 27, 2017

@shoyer It would help us if you can give additional concrete examples of cases where string or boolean values affect types (in numerical libraries or elsewhere).

@shoyer

shoyer commented Oct 27, 2017

Let me give examples culled from the API docs for pandas, which I think reflect common usage in numerical libraries:

Format: argument_name=literal_value -> return_type

pandas.read_csv
squeeze=False -> DataFrame
squeeze=True -> Union[DataFrame, Series]

pandas.cut
retbins=False -> Series
retbins=True -> Tuple[Series, ndarray]

pandas.concat
axis=0 -> Series
axis=1 -> DataFrame

pandas.get_dummies
sparse=False -> DataFrame
sparse=True -> SparseDataFrame

pandas.to_datetime
box=True -> Series
box=False -> ndarray

pandas.Series.drop
inplace=False -> Series
inplace=True -> None

pandas.Series.reset_index
drop=True -> Series
drop=False -> DataFrame

pandas.DataFrame.to_dict
orient='dict' -> dict like {column -> {index -> value}}
orient='list' -> dict like {column -> [values]}
orient='series' -> dict like {column -> Series(values)}
orient='split' -> dict like {index -> [index], columns -> [columns], data -> [values]}
orient='records' -> list like [{column -> value}, ... , {column -> value}]
orient='index' -> dict like {index -> {column -> value}}

These are not cherry-picked examples -- these are common functions/methods used in a large fraction of code using pandas. So in practice, this will be a very real obstacle when/if we try to add type annotations to pandas (pandas-dev/pandas#14468). Certainly pandas does not exhibit best practices here, but nonetheless it's a very popular library.

This intentionally includes only cases where the type signature itself varies based on the literal value, which makes it almost impossible to type such a function usefully unless we have a way of recognizing literal values. There are many more examples, including in libraries like numpy or scipy, where a limited set of strings describes all valid values. From a principled perspective, most of these should probably be using enums instead of strings or booleans, but the value-dependent semantics remain.
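
To illustrate the kind of signature involved, here is a sketch of a flag-dependent return type written with overloads on `Literal[True]` / `Literal[False]` (assuming the `Literal` form from PEP 586, available in `typing` since Python 3.8). The function `cut_values` is a hypothetical pure-Python stand-in loosely modelled on the `retbins` flag above; it is not pandas code:

```python
from typing import List, Literal, Tuple, overload


@overload
def cut_values(data: List[float], n_bins: int,
               retbins: Literal[False] = ...) -> List[int]: ...
@overload
def cut_values(data: List[float], n_bins: int,
               retbins: Literal[True]) -> Tuple[List[int], List[float]]: ...
def cut_values(data, n_bins, retbins=False):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins or 1.0          # avoid a zero width for constant data
    edges = [lo + i * width for i in range(n_bins + 1)]
    labels = [min(int((x - lo) / width), n_bins - 1) for x in data]
    return (labels, edges) if retbins else labels


labels = cut_values([0.0, 1.0, 2.0, 3.0], 2)                       # List[int]
labels2, edges = cut_values([0.0, 1.0, 2.0, 3.0], 2, retbins=True)  # Tuple[...]
```

Without literal types, the only honest annotation for the return value would be a union of both shapes, which every caller would then have to narrow by hand.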

@JukkaL
Contributor

JukkaL commented Oct 27, 2017

@shoyer Thanks for the additional examples! They are very helpful. The current type system clearly seems to work poorly for numerical Python libraries.

Here are a few additional things that would be useful for moving this forward:

  • Do a similar analysis as above for numpy and scipy (and perhaps other popular numerical libraries).
  • Collect a corpus of open-source numerical Python code (from GitHub, perhaps) that we can use to look for additional real-world use cases.

@TeamSpen210

Another example is tkinter, although that's more of a wrapper around another language and isn't exactly Pythonic in a lot of cases. There, most function arguments are either numbers or string enums. It would be desirable to ensure they match the method that's called. It doesn't have many functions where the return type depends on a literal, although many do change between a return value or None depending on the number of arguments (if they match the default).

@shoyer

shoyer commented Oct 29, 2017

Looking through NumPy, I notice some themes. There are a few common arguments that only have particular valid values (enum equivalents):

  • order : {‘K’, ‘A’, ‘C’, ‘F’}
  • casting : {‘no’, ‘equiv’, ‘safe’, ‘same_kind’, ‘unsafe’}
  • subok : bool: if True, result can be an ndarray subclass, otherwise only a baseclass ndarray

These are found on a very large portion of NumPy's API (e.g., most array creation routines, shape changing routines, copying routines and all ufuncs).

NumPy doesn't use enums, so the "strings as enums" pattern is endemic throughout the library:
  • numpy.seterr and numpy.errstate: arguments can be of {‘ignore’, ‘warn’, ‘raise’, ‘call’, ‘print’, ‘log’}
  • numpy.fft routines: norm can be either None or 'ortho'
  • numpy.take, numpy.choose and numpy.put: mode can be any of {‘raise’, ‘wrap’, ‘clip’}
  • numpy.nditer: quite a few arguments, all of which can only be particular strings
  • numpy.linalg.norm: {non-zero int, inf, -inf, ‘fro’, ‘nuc’}
  • numpy.pad: mode can take on any of 10 string values
  • numpy.sort and related routines: kind can be any of {‘quicksort’, ‘mergesort’, ‘heapsort’}
  • numpy.correlate and numpy.convolve: mode can be any of {‘valid’, ‘same’, ‘full’}
  • numpy.histogram: bins can be an integer, sequence or any of 7 specific string values
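
As a sketch of how one of these "strings as enums" arguments could be annotated (assuming `Literal` as later added to `typing` in Python 3.8), the following uses the three `mode` values listed above. `take_one` is a hypothetical single-index stand-in written for illustration and is not NumPy's implementation:

```python
from typing import List, Literal, get_args

# Allowed values for the `mode` argument, taken from the list above.
TakeMode = Literal["raise", "wrap", "clip"]


def take_one(values: List[int], index: int, mode: TakeMode = "raise") -> int:
    """Pick one element, handling out-of-range indices according to `mode`."""
    n = len(values)
    if mode not in get_args(TakeMode):
        # Runtime fallback for callers that do not run a type checker.
        raise ValueError(f"mode must be one of {get_args(TakeMode)}, got {mode!r}")
    if mode == "wrap":
        index %= n                          # out-of-range indices wrap around
    elif mode == "clip":
        index = max(0, min(index, n - 1))   # out-of-range indices are clamped
    elif not -n <= index < n:
        raise IndexError(f"index {index} is out of range for length {n}")
    return values[index]
```

A call such as `take_one(data, 4, "Wrap")` is something a checker can flag before the code runs, while the `get_args` check keeps a runtime error for untyped callers.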

There are also a few functions whose return value depends on boolean flags:
  • numpy.linalg.svd: if compute_uv=True, returns a tuple of three arrays, otherwise just one
  • numpy.unique: returns either a single array, or a tuple of 2-4 arrays, depending on three boolean flag values

I have not looked into the more esoteric corners of the library (e.g., masked arrays, numpy.matrix, polynomials or financial functions).

So on the whole, NumPy is certainly in a much better position than pandas: there are only a handful of functions where the return type depends on a literal value (although they are widely used).

I'm not going to bother going through SciPy as the API is larger and more varied, and I'm less familiar with it, but I hope you'll trust me that its situation is very similar to that for NumPy. There is a large overlap in the community maintaining both libraries, and for a short while they were even integrated in a single project. Certainly strings are used as enums throughout. I don't know if there are any commonly used functions that can return multiple types.

One last thing I'll note is that there are quite a few further examples of type unstable behavior (even in NumPy) if we ever tried to make array shapes part of the type system.

@rowillia

@JukkaL I recall talking to @ambv about this at PyCon as well in the context of something simple like correctly typing open, which has the same problem (e.g. open(filename, 'w') returns IO[str] whereas open(filename, 'wb') returns IO[bytes]).

@gvanrossum
Member

Note that we solved the specific problem with open() using a plugin. That is, if the argument is a literal string, we parse it and infer the appropriate type. If the argument is a correctly-typed expression that's not a literal, the old behavior still applies.

In theory you can now write numpy-specific plugins that do the same thing for the numpy APIs you list above. As long as the call sites are typically passing literals.

Adding a new mechanism to the type system that would let you define a subtype of str that is restricted to a given set of values would be more challenging (though the plugin mechanism might be up to this as well -- I'm not sure, but in principle it has arbitrary powers, which include shooting yourself in the foot :-).

@emmatyping
Contributor

emmatyping commented Oct 30, 2017

While plugins work, they are less than ideal. I think it would be much better to have literal types in the type system. It is a not too uncommon pattern to have literal keyword arguments cause different behavior. Additionally, it is much more maintainable in my mind to write the types via an overload as compared to a plugin (less indirection, less magic, etc). That being said, for smaller cases, plugins are an acceptable stop-gap measure.

@gvanrossum
Member

No argument there! We've just been worried about the cost of adding literal types to the type system vs. the cost of developing a plugin system. (Also, the syntax for the mode argument to open() is a little too complicated, but this is an exception to the exceptions. :-)

@ilevkivskyi
Member

Another example of use in numpy is dtype='int8' and similar. I was thinking about this for some time, and scrolling through my old numeric code. It looks like we will need three ingredients:
1). Literal types, I would propose a simple syntax:

var: Literal['value']

2). Constant qualifier, it is needed for situations like this:

@overload
def func(mode: Literal['b']) -> IO[bytes]: ...
@overload
def func(mode: Literal['s']) -> IO[str]: ...

MODE: Const[str] = 'b'

func(MODE) # this should be IO[bytes]

This example is oversimplified, but I have seen such patterns for data types, array/matrix dimensions/sizes etc. (see the next point).
3). Integer generic or shape type, this is needed for fixed size arrays/matrices. This one is hardest and will definitely require a plugin.

I think that we could move step by step here. The first two are actually not so hard to implement, will already cover a large amount of numeric code, and could also be useful for other (non-numeric) code.
I would propose to start from these two (we can put them in typing_extensions first).

The idea is that Literal will be mostly used in library code, while Const will be mostly used in user code. I am not sure how much static constant "calculation" we need to support; it seems to me that basic constant propagation (like in the example above) will be sufficient for the vast majority of code.
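
A runnable sketch of the library/user split, using the spelling that was eventually adopted (`Final` from PEP 591 rather than `Const`, both available in `typing` since Python 3.8). The return types are simplified to `bytes` and `str` here so the example runs without opening files; that simplification is an assumption of this sketch:

```python
from typing import Final, Literal, Union, overload


# "Library" side: overloads keyed on literal values.
@overload
def func(mode: Literal["b"]) -> bytes: ...
@overload
def func(mode: Literal["s"]) -> str: ...
def func(mode: str) -> Union[bytes, str]:
    return b"data" if mode == "b" else "data"


# "User" side: a constant that is never reassigned.
MODE: Final = "b"

result = func(MODE)   # a checker such as mypy is expected to pick the bytes overload
```

Declaring `MODE` as a plain `str` variable instead would lose the literal value, and neither overload would match.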

@shoyer

shoyer commented Oct 30, 2017

Another example of use in numpy is dtype='int8' and similar. I was thinking about this for some time, and scrolling through my old numeric code.

This is true, but strings like 'int8' have always been a short-cut for np.int8. I would be OK requiring user code to type something like dtype=np.int8 if they want an array to type-check as having int8 dtype.

@gvanrossum @ilevkivskyi I assume by "plugin" you are referring specifically to mypy?

@emmatyping
Contributor

@ilevkivskyi I'm not sure I understand why Literal types are needed, especially if there is a Const type. I understand they are used in different ways, but isn't a Literal an in-place Const?

@JukkaL
Contributor

JukkaL commented Oct 31, 2017

The mypy plugin approach really only works well if the special signatures are rare enough. These "special" cases don't sound very special at all in numpy and pandas, so a plugin-based approach is less than optimal. Also, the plugin approach is very specific to mypy.

@ilevkivskyi Your proposal sounds pretty reasonable. I agree that 3) sounds much harder than the rest.

Before considering implementation, it would make sense to have some partial draft stubs for numpy and pandas that use the proposed features. There is a risk that there are other problems (such as array shapes) that we'd also need to solve before we can have useful stubs.

Here are some additional comments:

  • Const will be useful more generally, and it has been proposed several times in other contexts.
  • It would be good to have a more compact syntax such as s: Const = 'foo' as a short-hand for s: Const[str] = 'foo'.
  • I think that previously we were converging on Final as the desired spelling instead of Const.
  • Based on the above proposal, Const / Final would implicitly infer a literal type. However, it would be useful to also allow it to be used with non-literal expressions (e.g. x: Final[List[int]] = []). In this case the only effect would be to reject non-initialization assignments to x. The list would still be mutable.
  • It may be useful to allow conditional initialization of Const / Final variables -- based on the current platform, for example.
  • Large unions of Literal types will be pretty verbose (e.g. Union[Literal['a'], Literal['b'], Literal['c'], Literal['d'], Literal['e']]). Type aliases help here, though.
  • We could support Literal[...] at least for str, bytes, int and bool literals. float might also be useful, at least together with Const / Final.
  • The semantics of Const / Final attributes are unclear when there is inheritance.
  • Would we infer Literal[...] as the type of literal expressions? What about things like 1 + 3 or 'a' + 'b'? Would they also have Literal[...] types?
    • In cases like x = 'a' we'd likely promote Literal['a'] to str when inferring the type of x. Similarly, the type of ['x'] would be List[str] instead of List[Literal['x']].
    • Alternatively, we could only infer literal types in specific contexts (surrounding type context, using mypy terminology, has a literal type or a union of literal types, or in the initializer of a Const / Final variable).
  • It's unclear when we'd be able to use Literal[...] and Const[...] in typeshed. For example, pytype and PyCharm might decide not to support them until they are in stdlib, even if mypy supports them earlier through typing_extensions. One option would be to tag some stubs in typeshed as requiring specific type system extensions.
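
The point above about non-literal initializers can be sketched as follows, assuming the `Final` spelling from PEP 591 (in `typing` since Python 3.8). `Final` is not enforced at runtime, so the rejected line is shown as a comment:

```python
from typing import Final, List

x: Final[List[int]] = []

x.append(1)     # fine: the list object itself is still mutable
x.extend([2, 3])

# x = [4]       # a type checker is expected to reject this reassignment
```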

@vlasovskikh
Member

BTW value types could be useful in a broader sense, not only for literals per se. For example, one may have overloaded functions for various enum members:

class Color(Enum):
    RED = 1
    GREEN = 2

@overload
def f(x: Literal[Color.RED]) -> Any: ...

@overload
def f(x: Literal[Color.GREEN]) -> Any: ...
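
A runnable variant of the same idea, assuming `Literal` as later standardized in PEP 586 (which accepts enum members). The distinct return types and the implementation body are assumptions added here so that the effect of each overload is visible:

```python
from enum import Enum
from typing import Literal, Union, overload


class Color(Enum):
    RED = 1
    GREEN = 2


@overload
def f(x: Literal[Color.RED]) -> int: ...
@overload
def f(x: Literal[Color.GREEN]) -> str: ...
def f(x: Color) -> Union[int, str]:
    # One runtime implementation; the overloads only refine the static type.
    return 1 if x is Color.RED else "green"


red_result = f(Color.RED)       # a checker is expected to infer int
green_result = f(Color.GREEN)   # and str here
```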

@JukkaL
Contributor

JukkaL commented Oct 31, 2017

@vlasovskikh Good point, enums are a good match as well. I remember another discussion where overloading based on enum values was proposed but can't find it now.

@TeamSpen210

Another example is default arguments, where you want to detect no value is entered. You could do that with @overload, but that's slightly inaccurate since it is possible to pass in that value deliberately.

@JukkaL
Contributor

JukkaL commented Nov 1, 2017

@TeamSpen210 Can you give an example of how this would work with default arguments?

@ilevkivskyi
Member

@JukkaL

Sorry for the long silence, here are some comments:

Before considering implementation, it would make sense to have some partial draft stubs for numpy and pandas that use the proposed features. There is a risk that there are other problems (such as array shapes) that we'd also need to solve before we can have useful stubs.

I think Literal and Const will be useful enough beyond the numeric libs. But anyway I wanted to work on numpy stubs at some point. I think it will be important to have a draft before we move with integer generics/shape types, but I think we can already move forward with Literal and Const.

It would be good to have a more compact syntax such as s: Const = 'foo' as a short-hand for s: Const[str] = 'foo'.

I also wanted to say this, but I am not sure since it would be a single case where Something is not equivalent to Something[Any].

I think that previously we were converging on Final as the desired spelling instead of Const.

I am fine with either.

Large unions of Literal types will be pretty verbose (e.g. Union[Literal['a'], Literal['b'], Literal['c'], Literal['d'], Literal['e']]). Type aliases help here, though.

I think we can allow Literal['a', 'b', 'c'] as a shorthand for Union[Literal['a'], Literal['b'], Literal['c']]. This will be pretty common case (as in original post).
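
This shorthand is the form that ended up in `typing` (Python 3.8 and later). A small sketch, where `Letters` and `is_letter` are names invented for illustration, shows that the allowed values can also be read back at runtime with `typing.get_args`:

```python
from typing import Literal, get_args

Letters = Literal["a", "b", "c"]

print(get_args(Letters))   # ('a', 'b', 'c')


def is_letter(value: str) -> bool:
    """Runtime membership test that reuses the values from the type alias."""
    return value in get_args(Letters)
```

Keeping the values in one alias means the static annotation and any runtime check cannot drift apart.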

The semantics of Const / Final attributes are unclear when there is inheritance.

I think we can implement it first only for global and local variables. For class/instance variables this may be harder and seems related to python/mypy#4019

Would we infer Literal[...] as the type of literal expressions? What about things like 1 + 3 or 'a' + 'b'? Would they also have Literal[...] types?

I think the simplest way is to infer Literal types only in Literal/Union/Final context. Otherwise we risk problems with invariance. However, in the Literal context we can in principle allow any expression that can be evaluated statically, like 'a' + 'b' * 3. I think this should not be part of the specification; we can start by allowing only "atomic" expressions, and add more later if people ask.

@junjihashimoto

junjihashimoto commented Nov 30, 2017

Do you mean that allowing any expression in a Literal context would permit the following example (concatenation of sized vectors)?
The example is just pseudo-code. Vector[length, type] is a 1-D array like numpy's np.array([1, 2, 3]).
The concat function returns a vector of length 'a' + 'b'. This would be useful.

@overload
def concat(a0: Vector[Literal['a'],int],
           b0:Vector[Literal['b'],int])
       -> Vector[Literal['a'+'b'],int]: ...

But the matrix multiplication case does not work.
I hope the type checker will not only allow any expression in a Literal context,
but also allow constraints among types.

The following code shows an example of a type constraint.
Matrix[length, length, type] is a 2-D array like numpy's np.array([[1,2],[3,4]]).
with 'ay' == 'bx' means that the first matrix's second dimension is the same as the second matrix's first dimension.

@overload
def matmul(a0: Matrix[Literal['ax'],Literal['ay'],int],
           b0:Matrix[Literal['bx'],Literal['by'],int])
        -> Matrix[Literal['ax'],Literal['by'],int] with 'ay' == 'bx': ...

@iddan

iddan commented Jun 21, 2018

Has there been any progress with the discussion about this in another place?

@ilevkivskyi
Member

Has there been any progress with the discussion about this in another place?

At the PyCon typing meeting, most people liked the idea; now someone just needs to implement it. The original plan for mypy was the end of summer, but no guarantees.

@pawelswiecki

Will Literal and Const be introduced via a PEP?

@ilevkivskyi
Member

Will Literal and Const be introduced via a PEP?

a) Const will be likely called Final
b) They will be first implemented as experimental in typing_extensions with support in mypy, hopefully within a few weeks. The PEP will follow soon after we get some real-world experience with them.

@pawelswiecki

I see, thank you :)

@Michael0x2a
Contributor

Small update: we've started work on a PoC implementation of literal types in mypy and are hoping to have the core implementation work done sometime around early 2019.

We also have a preliminary draft of what we think the semantics of Literal should look like here: https://github.com/Michael0x2a/peps/blob/literal-types/pep-9999.rst -- I think the plan is to discuss it in our shiny new typing-sig mailing list?

Final has already been implemented in mypy and added to typing_extensions -- though I don't think we have a PEP for that ready yet. It'll probably end up being similar to the stuff that's currently in the mypy docs.

@dkgi

dkgi commented Nov 20, 2018

Thanks for the update @Michael0x2a. I have two questions around the process:

  1. Is there a way for us to discuss your Literal proposal with inline comments (I'm not super familiar with GitHub)? It seems like the discussion would be difficult to structure via email.
  2. What's the process for moving Final from mypy_extensions to typing? Does it move once there is an accepted PEP?

@ilevkivskyi
Member

  1. One can only comment on PRs and commits. So either @Michael0x2a will make a PR to his own repo, or maybe post the full text to the mailing list; this is how all PEP discussions happen on python-dev and python-ideas.
  2. Final is already in typing_extensions, not mypy_extensions. And yes, it will be moved to typing when the PEP is accepted. As a related comment I think we should move TypedDict to typing_extensions now.

@JukkaL
Contributor

JukkaL commented Nov 20, 2018

As a related comment I think we should move TypedDict to typing_extensions now.

I think that we can wait until there has been some discussion of the TypedDict PEP (which still needs to be written). It still seems possible to me that the definition syntax may be tweaked during the PEP process.

@mrkmndz
Contributor

mrkmndz commented Nov 20, 2018

In what forum is the TypedDict PEP going to be developed? I would love to be involved as I'm working on implementing them into Pyre now and would want to make sure we're heading in a standards compliant direction.

@JukkaL
Contributor

JukkaL commented Nov 20, 2018

@mrkmndz We'll share drafts of the TypedDict PEP on the typing-sig mailing list. It may be best to develop the initial draft as a GitHub PR to make commenting easy. This might happen in December/January. If you have questions before that, feel free to open an issue here or on the mypy issue tracker.

@Michael0x2a
Contributor

@dkgi -- I opened a pull request for my branch as Ivan suggested: Michael0x2a/peps#1

@ilevkivskyi
Member

Most things discussed in this issue are now supported by Literal and Final (PEPs accepted and implemented). So this can be closed now.
