ValueError raised in pandas when lazy validating DataFrame with MultiIndexed Columns #589

Closed
2 of 3 tasks
peter1456 opened this issue Aug 18, 2021 · 5 comments · Fixed by #622
Labels
bug Something isn't working


@peter1456

peter1456 commented Aug 18, 2021

Describe the bug

pandas raises a ValueError when a pandas.DataFrame with MultiIndex columns is lazily validated (lazy=True) by a pandera.DataFrameSchema and at least one column check fails.

Running the code below, the following exception is raised:

Traceback (most recent call last):
  line 18, in <module>
    print(schema.validate(df, lazy=True))
  File "Y:\Python39\lib\site-packages\pandera\schemas.py", line 613, in validate
    raise errors.SchemaErrors(
  File "Y:\Python39\lib\site-packages\pandera\errors.py", line 87, in __init__
    error_counts, failure_cases = self._parse_schema_errors(schema_errors)
  File "Y:\Python39\lib\site-packages\pandera\errors.py", line 172, in _parse_schema_errors
    failure_cases = err.failure_cases.assign(
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3699, in assign
    data[k] = com.apply_if_callable(v, data)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3044, in __setitem__
    self._set_item(key, value)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3120, in _set_item
    value = self._sanitize_column(key, value)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3768, in _sanitize_column
    value = sanitize_index(value, self.index)
  File "Y:\Python39\lib\site-packages\pandas\core\internals\construction.py", line 747, in sanitize_index
    raise ValueError(
ValueError: Length of values (2) does not match length of index (1)

Line 172 of errors.py in pandera reads:

failure_cases = err.failure_cases.assign(
    schema_context=err.schema.__class__.__name__,
    check=check_identifier,
    check_number=err.check_index,
    column=column,
)

The MultiIndex column name ("foo", "baz") is a tuple, so pandas treats it as a list-like of two values rather than as a single value. It therefore cannot be broadcast across err.failure_cases, which has one row here, and the assign call raises the ValueError.
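The behaviour can be reproduced with pandas alone. The sketch below is a minimal illustration, assuming a one-row stand-in for err.failure_cases (the frame and its values are hypothetical, not pandera's actual data):

```python
import pandas as pd

# Hypothetical stand-in for err.failure_cases with a single failing row.
failure_cases = pd.DataFrame({"index": [0], "failure_case": ["object"]})

try:
    # The tuple is treated as two values, not as one value to broadcast.
    failure_cases.assign(column=("foo", "baz"))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```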

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema({
    ("foo", "bar"): Column(int),
    ("foo", "baz"): Column(int)
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

print(schema.validate(df, lazy=True))

Expected behavior

A pandera.errors.SchemaErrors exception should be raised, with the type mismatch on the column ("foo", "baz") recorded in failure_cases and the tuple ("foo", "baz") as the value in the column field.

Desktop:

  • OS: Windows 10
  • Version: Python 3.9.0, with pandera 0.7.0, pandas 1.1.4 installed

Additional context

If we change the code above to

import pandas as pd
from pandera import Column, DataFrameSchema, Check

schema = DataFrameSchema({
    ("foo", "bar"): Column(int, checks=Check(lambda s: s == 1)),
    ("foo", "baz"): Column(str, name="b")
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

try:
    schema.validate(df, lazy=True)
except Exception as e:
    print(e.failure_cases)

The output would be

  schema_context column     check  check_number  failure_case  index
0         Column    foo  <lambda>             0             2      1
1         Column    bar  <lambda>             0             3      2

This shows that the column name ("foo", "bar") is interpreted as a list-like object. Because its length (2) happens to match the number of rows in err.failure_cases, assign spreads its elements over the rows ("foo" in row 0, "bar" in row 1) instead of storing the whole tuple in every row.
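A pandas-only sketch of this silent mis-assignment, assuming a two-row stand-in frame with the same index and failure_case values as in the output above:

```python
import pandas as pd

# Stand-in for err.failure_cases with two failing rows.
failure_cases = pd.DataFrame({"index": [1, 2], "failure_case": [2, 3]})

# The tuple has as many elements as the frame has rows, so no error is raised;
# each element lands in its own row.
result = failure_cases.assign(column=("foo", "bar"))

print(result["column"].tolist())
```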

Potential Fix

A band-aid fix would be to manually broadcast the value for the column field before assigning it to err.failure_cases, i.e.

failure_cases = err.failure_cases.assign(
    schema_context=err.schema.__class__.__name__,
    check=check_identifier,
    check_number=err.check_index,
    column=[column] * len(err.failure_cases),
)

This seems to fix the problem.
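The effect of the manual broadcast can be checked in isolation. This is only a sketch on a hypothetical stand-in frame, not the code that was eventually merged into pandera:

```python
import pandas as pd

column = ("foo", "baz")

# Hypothetical stand-in for err.failure_cases.
failure_cases = pd.DataFrame({"index": [1, 2], "failure_case": [2, 3]})

# One copy of the tuple per row, so each cell holds the full column name.
fixed = failure_cases.assign(column=[column] * len(failure_cases))

print(fixed["column"].tolist())
```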

@cosmicBboy
Collaborator

hi @peter1456, #600 should address this issue

@peter1456
Author

Hi @cosmicBboy, thanks a lot!

cosmicBboy added a commit that referenced this issue Sep 10, 2021
* Unique keyword arg (#580)

* add copy button to docs (#448)

* Add missing inplace arg to SchemaModel's validate (#450)

* link documentation to github (#449)

Co-authored-by: Niels Bantilan <[email protected]>

* intermediate commit for review by @cosmicBboy

* WIP

* fix test errors, re-factor allow_duplicates handling

* fix io tests

* fix docs, remove _allow_duplicates private var

* update unique type signature in strategies

* completing tests for setters and lazy evaluation of unique kw

* small fix for the linting errors

* support dataframe-level uniqueness in strategies

* add docs, fix error formatting, add multiindex support

Co-authored-by: Jean-Francois Zinque <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: fkroll8 <[email protected]>

* Add support for timezone-aware datetime strategies (#595)

* add support for Any annotation in schema model (#594)

* add support for Any annotation in schema model

the motivation behind this feature is to support column annotations
that can have any type, to support use cases like the one described
in #592, where
custom checks can be applied to any column except for ones that
are explicitly defined in the schema model class attributes

* update pylint, fix lint

* Docs/scaling - Bring Pandera to Spark and Dask (#588)

* scaling.rst

* edited conf

* finished first pass

* removing FugueWorkflow

* Update index.rst

* Update docs/source/scaling.rst

Co-authored-by: Niels Bantilan <[email protected]>

* add support for timezone-aware datetime strategies

* fix le/ge strategies with datetime

* fix mypy errors

Co-authored-by: Niels Bantilan <[email protected]>
Co-authored-by: Kevin Kho <[email protected]>

* schemas with multi-index columns correctly report errors (#600)

fixes #589

* strategies module supports undefined checks in regex columns (#599)

* Add support for empty data type annotation in SchemaModel (#602)

* remove artifacts of py3.6 support

* add support for empty data type annotation in SchemaModel

* fix frictionless version in dev dependencies

* fix setuptools version instead of frictionless

* fix setuptools pinning

* remove frictionless from core pandera deps (#609)

* support frictionless primary keys with multiple fields (#608)

* fix validation of check raising error without message (#613)

* docs/requirements.txt pin setuptools (#611)

* bump version 0.7.1

Co-authored-by: Jean-Francois Zinque <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: Kevin Kho <[email protected]>
@peter1456 peter1456 reopened this Sep 14, 2021
@peter1456
Author

peter1456 commented Sep 14, 2021

Hi @cosmicBboy,

Sorry for not checking carefully, but it seems that pandas does not broadcast properly even when the MultiIndex column name is wrapped in a list. See the following example:

import pandas as pd
from pandera import Column, DataFrameSchema, Check

schema = DataFrameSchema({
    ("foo", "bar"): Column(int, checks=Check(lambda s: s == 1)),
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
})

schema.validate(df, lazy=True)

When it is run, the following error is thrown:

Traceback (most recent call last):
  File "test_pandera.py", line 12, in <module>
    schema.validate(df, lazy=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandera\schemas.py", line 655, in validate
    raise errors.SchemaErrors(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandera\errors.py", line 87, in __init__
    error_counts, failure_cases = self._parse_schema_errors(schema_errors)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandera\errors.py", line 173, in _parse_schema_errors
    failure_cases = err.failure_cases.assign(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3695, in assign
    data[k] = com.apply_if_callable(v, data)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3040, in __setitem__
    self._set_item(key, value)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3116, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3764, in _sanitize_column
    value = sanitize_index(value, self.index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 747, in sanitize_index
    raise ValueError(
ValueError: Length of values (1) does not match length of index (2)

It seems that pandas tries to fit a list-like object into the column as-is: it interprets [("foo", "bar")] as a list of length 1 and raises an error because err.failure_cases has length 2. Before the column assignment, err.failure_cases looks like this:

   index  failure_case
0      1             2
1      2             3
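The length mismatch can be reproduced without pandera. The sketch below rebuilds the frame shown above by hand (an assumption made for illustration) and assigns the length-1 list to it:

```python
import pandas as pd

# Same shape and values as the failure_cases frame printed above.
failure_cases = pd.DataFrame({"index": [1, 2], "failure_case": [2, 3]})

try:
    # A list with one element is not broadcast to two rows.
    failure_cases.assign(column=[("foo", "bar")])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

A list passed to assign needs one entry per row, so wrapping the tuple in a single-element list only works when exactly one check fails.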

Desktop:

  • OS: Windows 10
  • Version: Python 3.8.5, with pandera 0.7.1, pandas 1.1.3 installed

@peter1456
Author

peter1456 commented Sep 14, 2021

Hi @cosmicBboy, thanks again!

@cosmicBboy
Collaborator

fixed by #622
