ValueError raised in pandas when lazy validating DataFrame with MultiIndexed Columns #589
Comments
hi @peter1456, #600 should address this issue
Hi @cosmicBboy, thanks a lot!
* Unique keyword arg (#580)
* add copy button to docs (#448)
* Add missing inplace arg to SchemaModel's validate (#450)
* link documentation to github (#449)
  Co-authored-by: Niels Bantilan
* intermediate commit for review by @cosmicBboy
* link documentation to github (#449)
  Co-authored-by: Niels Bantilan
* intermediate commit for review by @cosmicBboy
* WIP
* fix test errors, re-factor allow_duplicates handling
* fix io tests
* fix docs, remove _allow_duplicates private var
* update unique type signature in strategies
* completing tests for setters and lazy evaluation of unique kw
* small fix for the linting errors
* support dataframe-level uniqueness in strategies
* add docs, fix error formatting, add multiindex support
  Co-authored-by: Jean-Francois Zinque
  Co-authored-by: tfwillems
  Co-authored-by: fkroll8
  Co-authored-by: fkroll8
* Add support for timezone-aware datetime strategies (#595)
* add support for Any annotation in schema model (#594)
* add support for Any annotation in schema model
  the motivation behind this feature is to support column annotations that can have any type, to support use cases like the one described in #592, where custom checks can be applied to any column except for ones that are explicitly defined in the schema model class attributes
* update pylint, fix lint
* Docs/scaling - Bring Pandera to Spark and Dask (#588)
* scaling.rst
* edited conf
* finished first pass
* removing FugueWorkflow
* Update index.rst
* Update docs/source/scaling.rst
  Co-authored-by: Niels Bantilan
* add support for timezone-aware datetime strategies
* fix le/ge strategies with datetime
* fix mypy errors
  Co-authored-by: Niels Bantilan
  Co-authored-by: Kevin Kho
* schemas with multi-index columns correctly report errors (#600)
  fixes #589
* strategies module supports undefined checks in regex columns (#599)
* Add support for empty data type annotation in SchemaModel (#602)
* remove artifacts of py3.6 support
* add support for empty data type annotation in SchemaModel
* fix frictionless version in dev dependencies
* fix setuptools version instead of frictionless
* fix setuptools pinning
* remove frictionless from core pandera deps (#609)
* support frictionless primary keys with multiple fields (#608)
* fix validation of check raising error without message (#613)
* docs/requirements.txt pin setuptools (#611)
* bump version 0.7.1
  Co-authored-by: Jean-Francois Zinque
  Co-authored-by: tfwillems
  Co-authored-by: fkroll8
  Co-authored-by: fkroll8
  Co-authored-by: Kevin Kho
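Hi @cosmicBboy, sorry for not checking carefully, but it seems like the problem still occurs with the following code:

    import pandas as pd
    from pandera import Column, DataFrameSchema, Check

    # Schema keyed by a tuple, i.e. a MultiIndexed column name.
    schema = DataFrameSchema({
        ("foo", "bar"): Column(int, checks=Check(lambda s: s == 1)),
    })
    # Rows 2 and 3 fail the check, so lazy validation has failure cases to report.
    df = pd.DataFrame({
        ("foo", "bar"): [1, 2, 3],
    })
    schema.validate(df, lazy=True)

When it is run, the following error is thrown: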
It seems like
Desktop:
Hi @cosmicBboy, thanks again!
fixed by #622
Describe the bug
A `ValueError` is raised in pandas when a `pandas.DataFrame` object with MultiIndexed columns is lazily validated (using the parameter `lazy=True`) by a `pandera.DataFrameSchema` object and at least one check fails for the columns.
Running the code below, the following exception is raised:
Checking line 172 of `errors.py` in `pandera`, i.e.
It can be seen that the MultiIndexed column name `("foo", "baz")`, which has the type `tuple`, is not interpreted as a single value by `pandas`. pandas therefore fails to broadcast it across `err.failure_cases`, which causes the `ValueError` from `pandas` during the `assign` method call.
Code Sample, a copy-pastable example
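The broadcasting behaviour described above can be reproduced with pandas alone. The following is a minimal sketch, not pandera's own code: the frame `failure_cases` and the column name `column` are illustrative stand-ins for `err.failure_cases` and the keyword passed to `assign`.

```python
import pandas as pd

# Stand-in for err.failure_cases: three failing rows (assumed shape).
failure_cases = pd.DataFrame({"failure_case": [2, 3, 4]})

# A string is a scalar, so pandas repeats it in every row.
scalar_ok = failure_cases.assign(column="foo")

# A tuple is list-like, so pandas tries to match it to the rows
# element by element and rejects the length mismatch (2 vs 3).
try:
    failure_cases.assign(column=("foo", "baz"))
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# When the lengths happen to match, the tuple is silently spread
# over the rows instead of being stored as one value per row.
two_rows = pd.DataFrame({"failure_case": [2, 3]})
spread = two_rows.assign(column=("foo", "baz"))
print(spread["column"].tolist())  # ['foo', 'baz']
```

The second case is the more dangerous one, since no exception is raised and the column name is simply recorded incorrectly.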
Expected behavior
A `pandera.errors.SchemaErrors` exception should be raised, with the type mismatch on the column `("foo", "baz")` logged, which has the value `("foo", "baz")` in the column Column.
Desktop:
Additional context
If we change the code above to
The output would be
which shows that the column name `("foo", "bar")` is incorrectly interpreted as a `pandas.Series`-like object and treated as a column object when calling the method `assign` from `err.failure_cases`.
Potential Fix
A band-aid fix would be to manually broadcast the input for the column Column before assigning the column to `err.failure_cases`, i.e.
which seems to have fixed the problem.
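One way such manual broadcasting could look in plain pandas is sketched below. This is an illustration under assumptions, not the patch that was applied in pandera: the helper `assign_constant` and the frame used to exercise it are hypothetical names introduced here.

```python
import pandas as pd


def assign_constant(frame: pd.DataFrame, name: str, value: object) -> pd.DataFrame:
    """Return a copy of `frame` with column `name` holding `value` in every row.

    Hypothetical helper: the value is repeated once per row up front, so a
    tuple is stored whole instead of being matched element-wise to the rows.
    """
    repeated = pd.Series([value] * len(frame), index=frame.index, dtype=object)
    return frame.assign(**{name: repeated})


# Stand-in for err.failure_cases (assumed shape).
failure_cases = pd.DataFrame({"failure_case": [2, 3, 4]})
fixed = assign_constant(failure_cases, "column", ("foo", "baz"))
print(fixed)
```

Building the repeated values explicitly keeps the behaviour the same for string and tuple column names, at the cost of always storing the column with object dtype.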