Consistency around the behavior of the schema argument across the API #11723

Open
nameexhaustion opened this issue Oct 14, 2023 · 4 comments
Labels
A-input-parsing Area: parsing input arguments bug Something isn't working P-medium Priority: medium python Related to Python Polars reference Reference issue for recurring topics

Comments

@nameexhaustion
Collaborator

nameexhaustion commented Oct 14, 2023

Description

In short, the schema argument has wildly different behaviors across different parts of the API, and would likely benefit from some standardization.

Examples

Using the following data for examples:

a,b
1,2
# DataFrame is initialized using this dict:
dict(a=1, b=2)

The comparisons are done between DataFrame, scan_csv and read_csv (more examples can be found by looking at the related issues linked below).

Case when the schema length does not match data:

schema = dict(b=pl.UInt32)
# DataFrame errors
ValueError: the given column-schema names do not match the data dictionary
# scan_csv loads as a partial overwrite:
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
# read_csv errors
exceptions.ComputeError: found more fields than defined in 'Schema'

Case when the schema length matches the data, contains the same keys but is in a different order:

schema = dict(b=pl.UInt32, a=pl.UInt64)
# DataFrame does not respect the order of the schema,
# but will confusingly re-order the columns in the output 
# to match the schema (note how it is now 2,1 instead of 1,2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 2   ┆ 1   │
└─────┴─────┘
# scan_csv does not respect the order of the schema
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u64 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
# read_csv respects the order of the schema
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘

Case when the schema length matches the data, but contains different keys:

schema = dict(x=pl.UInt32, y=pl.UInt64)
# DataFrame errors
ValueError: the given column-schema names do not match the data dictionary
# scan_csv and read_csv both respect the order of the schema
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘

Reproducible scripts

Schema arg for DataFrame, read_csv and scan_csv
import polars as pl
from functools import partial

path_in = ".env/input.csv"  # can change this


class funcs:
    DataFrame = partial(pl.DataFrame, dict(a=1, b=2))
    scan_csv = partial(pl.scan_csv, path_in)
    read_csv = partial(pl.read_csv, path_in)


funcs.DataFrame().write_csv(path_in)


def run(f):
    print(f"-- run {f=}")
    try:
        r = f()

        if hasattr(r, "collect"):
            r = r.collect()
    except Exception as e:
        # Show the exception instead of aborting, so every case gets printed.
        r = repr(e)
    print(r)


# Note the default type is Int64.
for schema in (
    dict(b=pl.UInt32),
    dict(b=pl.UInt32, a=pl.UInt64),
    dict(x=pl.UInt32, y=pl.UInt64),
):
    print(f"---- {schema=}")
    run(partial(funcs.DataFrame, schema=schema))
    run(partial(funcs.scan_csv, schema=schema))
    run(partial(funcs.read_csv, schema=schema))

print(f"{pl.__version__=}")

outputs

---- schema={'b': UInt32}
-- run f=functools.partial(<class 'polars.dataframe.frame.DataFrame'>, {'a': 1, 'b': 2}, schema={'b': UInt32})
ValueError('the given column-schema names do not match the data dictionary')
-- run f=functools.partial(<function scan_csv at 0x7f6f842e2660>, 'env/input.csv', schema={'b': UInt32})
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
-- run f=functools.partial(<function read_csv at 0x7f6f842e16c0>, 'env/input.csv', schema={'b': UInt32})
ComputeError("found more fields than defined in 'Schema'\n\nConsider setting 'truncate_ragged_lines=True'.")
---- schema={'b': UInt32, 'a': UInt64}
-- run f=functools.partial(<class 'polars.dataframe.frame.DataFrame'>, {'a': 1, 'b': 2}, schema={'b': UInt32, 'a': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 2   ┆ 1   │
└─────┴─────┘
-- run f=functools.partial(<function scan_csv at 0x7f6f842e2660>, 'env/input.csv', schema={'b': UInt32, 'a': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u64 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
-- run f=functools.partial(<function read_csv at 0x7f6f842e16c0>, 'env/input.csv', schema={'b': UInt32, 'a': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
---- schema={'x': UInt32, 'y': UInt64}
-- run f=functools.partial(<class 'polars.dataframe.frame.DataFrame'>, {'a': 1, 'b': 2}, schema={'x': UInt32, 'y': UInt64})
ValueError('the given column-schema names do not match the data dictionary')
-- run f=functools.partial(<function scan_csv at 0x7f6f842e2660>, 'env/input.csv', schema={'x': UInt32, 'y': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
-- run f=functools.partial(<function read_csv at 0x7f6f842e16c0>, 'env/input.csv', schema={'x': UInt32, 'y': UInt64})
shape: (1, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ u32 ┆ u64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
pl.__version__='0.19.9'

Solution ideas

One way to resolve this could be to make the schema argument always overwrite the existing names in the data and, when specified, require it to match the length of the data (and contain no duplicates, as @stinodego suggests in #11632). For cases that only need to define the dtypes of some or all of the columns without changing the original ordering, the schema_overrides / dtypes arguments could be used instead.

But this probably needs some further discussion / review.
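
A minimal sketch of that idea in plain Python, without Polars. The helper name `overwrite_with_schema` and the use of (name, dtype) pairs are assumptions made for illustration; this models the proposal only and is not how Polars currently behaves.

```python
def overwrite_with_schema(data_names, schema):
    """Map each data column to its new (name, dtype) by position.

    Hypothetical helper modelling the idea above; not a Polars function.
    `schema` is a sequence of (name, dtype) pairs so duplicates can be detected.
    """
    new_names = [name for name, _ in schema]
    if len(new_names) != len(data_names):
        raise ValueError("schema length must match the number of data columns")
    if len(set(new_names)) != len(new_names):
        raise ValueError("schema must not contain duplicate names")
    # The i-th data column takes the i-th schema name and dtype.
    return {old: pair for old, pair in zip(data_names, schema)}


# dtypes are plain strings here to keep the sketch free of dependencies
print(overwrite_with_schema(["a", "b"], [("x", "UInt32"), ("y", "UInt64")]))
```

Under this reading, a partial schema such as `[("b", "UInt32")]` would be rejected everywhere, and partial dtype changes would go through schema_overrides / dtypes instead.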

Related issues

@stinodego
Member

stinodego commented Oct 18, 2023

What I think the schema arguments should do:

If I pass a schema argument:

  • The resulting DataFrame/LazyFrame should have the given schema. The order of the given schema should be respected.
  • There should be no type inference. Thus, passing the argument should have a performance benefit.
  • An error should be thrown if:
    • the data has named columns that do not match the schema keys (ignoring ordering).
    • the number of columns does not match the number of keys in the schema
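
The schema rules above can be read as a small validation step. The sketch below models them in plain Python, without Polars; `resolve_schema` and its arguments are hypothetical names used for illustration, and dtypes are plain strings.

```python
def resolve_schema(data_names, schema):
    """Return the output (name, dtype) pairs under the proposed `schema` rules.

    Hypothetical helper: it models the proposal only, not Polars' behavior.
    """
    schema_names = list(schema)
    # Rule: the number of columns must match the number of schema keys.
    if len(data_names) != len(schema_names):
        raise ValueError(
            f"schema has {len(schema_names)} keys, data has {len(data_names)} columns"
        )
    # Rule: named data columns must match the schema keys, ignoring order.
    if set(data_names) != set(schema_names):
        raise ValueError("data column names do not match the schema keys")
    # Rule: the output follows the schema's order, with no type inference.
    return list(schema.items())


print(resolve_schema(["a", "b"], {"b": "UInt32", "a": "UInt64"}))
```

Applied to the three cases in the original post, this reading would reject `dict(b=pl.UInt32)` and `dict(x=pl.UInt32, y=pl.UInt64)` for data with named columns a and b, and would return the columns in the order b, a for `dict(b=pl.UInt32, a=pl.UInt64)`.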

If I pass a schema_overrides argument:

  • The columns specified should have the given data types. The order of the original data is respected.
  • There should be no type inference on the columns for which we specify a data type. Thus, passing the argument should have a performance benefit.
  • If schema is also passed, the data types in schema_overrides take precedence.
  • Keys not present in the data or schema are ignored. (This replaces the earlier proposal that an error should be thrown if the keys are not present in the data.) See the discussion in "SchemaError: nonexistent column when created from sequence" #15471.
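
A minimal sketch of the schema_overrides rules in plain Python, assuming the reading in which unknown keys are ignored rather than raising. `apply_overrides` and its parameters are hypothetical names, not part of the Polars API, and dtypes are plain strings.

```python
def apply_overrides(inferred, schema_overrides, schema=None):
    """Return a name -> dtype dict under the proposed `schema_overrides` rules.

    `inferred` holds the data's own columns, in their original order.
    Hypothetical helper: it models the proposal only, not Polars' behavior.
    """
    # Start from `schema` when it is given, otherwise from the data's own order.
    result = dict(schema) if schema is not None else dict(inferred)
    for name, dtype in schema_overrides.items():
        # Overrides take precedence; keys absent from the data/schema are ignored.
        if name in result:
            result[name] = dtype
    return result


print(apply_overrides({"a": "Int64", "b": "Int64"}, {"b": "UInt32", "z": "String"}))
```

Because assigning to an existing dict key keeps its position, the column order of the data (or of schema, when passed) is preserved and only the dtypes change.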

@nameexhaustion could you add reproducible examples to your post? Then we can determine if there's any bugs here we should fix.

@nameexhaustion
Collaborator Author

Added a script under reproducible scripts, for the examples in the post.

@stinodego stinodego added the accepted Ready for implementation label Jan 10, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 10, 2024
@stinodego stinodego added bug Something isn't working P-medium Priority: medium python Related to Python Polars and removed enhancement New feature or an improvement of an existing feature labels Jan 10, 2024
@stinodego stinodego removed the accepted Ready for implementation label Jan 12, 2024
@stinodego stinodego self-assigned this Jan 13, 2024
@stinodego stinodego moved this from Ready to Candidate in Backlog Jan 13, 2024
@stinodego stinodego added the A-input-parsing Area: parsing input arguments label Jan 23, 2024
@stinodego stinodego added the reference Reference issue for recurring topics label Apr 3, 2024
@stinodego stinodego removed their assignment May 26, 2024
@stinodego stinodego moved this from Candidate to Ready in Backlog May 26, 2024
@mcrumiller
Contributor

@stinodego under the proposed syntax, how would one select only columns 4 and 2 of a CSV, give them new names, and specify their dtypes? Do you need to specify the entire schema of the CSV, and then use both the 'columns' and 'new_columns' parameters?

@stinodego
Member

There are a few ways for CSV files.

  • If the CSV file has a header, you can specify schema_overrides on the original name, specify columns with those same names, and call .rename(...) on the result.
  • If the CSV has no header, you can specify schema_overrides by index (not sure if this is supported yet), and then specify columns by index as well.
  • If the CSV file has no header, you can specify new_columns to give new names to all the columns, and then specify schema_overrides to give a dtype to specific columns, and then specify columns to only select those.
