usecols keyword argument of pd.read_csv says it expects list[str] but the documentation says otherwise #605

Closed
JasonMendoza2008 opened this issue Mar 30, 2023 · 3 comments · Fixed by #630

JasonMendoza2008 commented Mar 30, 2023

Describe the bug
usecols keyword argument of pd.read_csv says it expects list[str] but the documentation says otherwise:

[Image: PyCharm tooltip for the usecols parameter]

Documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

usecols list-like or callable, optional Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.
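
The three documented forms of usecols can be tried out with a small sketch like the one below. The CSV text and variable names are made up for illustration and are not from the issue; the calls themselves follow the documentation quoted above.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the issue.
CSV_TEXT = "foo,bar,baz\n1,2,3\n4,5,6\n"

# Positional selection: integer indices into the header row.
by_position = pd.read_csv(io.StringIO(CSV_TEXT), usecols=[0, 2])

# Selection by name; element order in usecols is ignored.
by_name = pd.read_csv(io.StringIO(CSV_TEXT), usecols=["baz", "foo"])

# Callable: keep the columns for which the function returns True.
by_callable = pd.read_csv(
    io.StringIO(CSV_TEXT),
    usecols=lambda name: name.upper() in ["FOO", "BAZ"],
)

# Reindex afterwards to get a specific column order.
reordered = by_name[["baz", "foo"]]
```

The integer form is the one that the list[str] hint flags, even though it is valid at runtime.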

To Reproduce

  1. Provide a minimal runnable pandas example that is not properly checked by the stubs.
    import pandas as pd; path_to_csv = "mycsv.csv"; df_db = pd.read_csv(path_to_csv, usecols=[0])
  2. Indicate which type checker you are using (mypy or pyright). PyCharm default type-checker, I haven't checked mypy.
  3. Show the error message received from that type checker while checking your example.
    [Image: warning shown by the type checker]

Please complete the following information:

  • OS: Windows
  • OS version: 11
  • Python version: 3.11.1
  • Type checker version: mypy 1.1.1
  • Installed pandas-stubs version: 1.5.3.230321

Additional context
Related to this SO post.

Dr-Irv commented Mar 30, 2023

Need to change

usecols: list[str]
    | tuple[str, ...]
    | Sequence[int]
    | Series
    | Index
    | npt.NDArray
    | Callable[[str], bool]
    | None = ...,

so that list[str] becomes list[HashableT] and Callable[[str], bool] becomes Callable[[Hashable], bool].

Also change other places in that file that have the usecols argument.
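
As a rough illustration of the broadened hint, here is a standard-library-only sketch. The function select_columns and its selection logic are hypothetical stand-ins, not the pandas-stubs signature or the pandas parser; the Series, Index and npt.NDArray members are left out so the example runs without third-party packages. A TypeVar is used for the list case because list is invariant, so a plain list[Hashable] would not accept a list[str].

```python
from collections.abc import Callable, Hashable, Sequence
from typing import TypeVar, Union

# Bound TypeVar so that list[str], list[int], etc. all match list[HashableT].
HashableT = TypeVar("HashableT", bound=Hashable)


def select_columns(
    header: Sequence[Hashable],
    usecols: Union[
        list[HashableT],
        tuple[HashableT, ...],
        Sequence[int],
        Callable[[Hashable], bool],
        None,
    ] = None,
) -> list[Hashable]:
    """Hypothetical helper mimicking how usecols picks columns from a header."""
    if usecols is None:
        return list(header)
    if callable(usecols):
        # The callable receives each column label, which may be any hashable.
        return [name for name in header if usecols(name)]
    wanted = list(usecols)
    if all(isinstance(item, int) for item in wanted):
        # All integers: treat them as positions; header order is kept.
        positions = set(wanted)
        return [name for i, name in enumerate(header) if i in positions]
    # Otherwise treat the elements as column labels; header order is kept.
    labels = set(wanted)
    return [name for name in header if name in labels]
```

With this shape, select_columns(["a", "b", "c"], [2, 0]) and select_columns(["a", "b", "c"], ("c", "a")) both type-check and both return the columns in header order.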

PR with tests welcome. Tests should be added near here:

df13: pd.DataFrame = pd.read_csv(path, usecols=pd.Series(data=["col1"]))

Dr-Irv pushed a commit that referenced this issue Apr 6, 2023
* gh-623: broaden 'names' param of read_csv

Broaden the type hint for the 'names' param of read_csv (and read_table,
which behaves similarly) from previous list[str], so that other valid
types are accepted by mypy.

* allow None as names param of read_clipboard

Noticed read_clipboard after making the changes to read_csv and
read_table. It calls read_csv, so its signature should match, but it
was missing None as an option.

* broaden 'names' param of read_clipboard

Match prior change to read_csv, since read_clipboard calls read_csv.

* broaden 'names' param of read_excel

Match prior change to read_csv, read_table, read_clipboard.

* gh-605: broader usecols param type hint

This fixes the pycharm tooltip problem in gh-605, as well as allowing
more list-like types of strings (tuples of strings, as well as mutable
sequences of strings other than list), and callables that accept
hashables, not just strings.

* test that read_excel accepts string for usecols

* test names and usecols correctly exclude strings

Strings aren't valid arguments here (except for read_excel, where we
have a test now to check that this is accepted). Adding tests to make
sure the type hints aren't overly wide and accept string arguments by
mistake.
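
The reason a bare string needs explicit exclusion can be seen with the standard library alone: str is itself a Sequence of str, so a hint such as Sequence[str] would accept it. The helper below is a hypothetical runtime analogue of that exclusion, not code from pandas-stubs.

```python
from collections.abc import Sequence

# A plain string satisfies the Sequence ABC, which is why a
# Sequence[str] hint alone cannot reject it.
string_is_sequence = isinstance("col1", Sequence)


def is_list_like_not_str(value: object) -> bool:
    """Hypothetical check: a sequence that is not a str or bytes object."""
    return isinstance(value, Sequence) and not isinstance(value, (str, bytes))
```

Under this check, ["col1"] and ("col1",) pass while "col1" does not, which mirrors what the added tests are meant to guarantee for the type hints.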

JasonMendoza2008 commented

When will the changes be released, so that I can do pip install -U pandas-stubs?

Dr-Irv commented Apr 7, 2023

When will the changes be released, so that I can do pip install -U pandas-stubs?

Unsure at the moment. I would like the next release to support the 2.0 features, but there is work described in #624 that needs to get done.

If you can't wait for that work to get done, I believe you could clone the repo, switch to the main branch, set up the dev environment, and run poetry build. That produces a wheel file in dist, which you can then install via pip install -U dist/name_of_wheel_file.whl.
