ValueError raised in pandas when lazy validating DataFrame with MultiIndexed Columns #589

peter1456 · 2021-08-18T06:56:31Z

Describe the bug

A ValueError is raised in pandas when a pandas.DataFrame object with MultiIndexed Columns is lazily validated (using the parameter lazy=True) by a pandera.DataFrameSchema object, and there is at least one failed check for the columns.

Running the code below, the following exception is raised:

Traceback (most recent call last):
  line 18, in <module>
    print(schema.validate(df, lazy=True))
  File "Y:\Python39\lib\site-packages\pandera\schemas.py", line 613, in validate
    raise errors.SchemaErrors(
  File "Y:\Python39\lib\site-packages\pandera\errors.py", line 87, in __init__
    error_counts, failure_cases = self._parse_schema_errors(schema_errors)
  File "Y:\Python39\lib\site-packages\pandera\errors.py", line 172, in _parse_schema_errors
    failure_cases = err.failure_cases.assign(
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3699, in assign
    data[k] = com.apply_if_callable(v, data)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3044, in __setitem__
    self._set_item(key, value)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3120, in _set_item
    value = self._sanitize_column(key, value)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3768, in _sanitize_column
    value = sanitize_index(value, self.index)
  File "Y:\Python39\lib\site-packages\pandas\core\internals\construction.py", line 747, in sanitize_index
    raise ValueError(
ValueError: Length of values (2) does not match length of index (1)

Looking at line 172 of errors.py in pandera:

failure_cases = err.failure_cases.assign(
    schema_context=err.schema.__class__.__name__,
    check=check_identifier,
    check_number=err.check_index,
    column=column,
)

The MultiIndex column name ("foo", "baz") is a tuple, so pandas does not treat it as a single value to broadcast across err.failure_cases. Instead it treats the tuple as a list-like of row values, and the assign call raises the ValueError because the tuple has two elements while err.failure_cases has only one row.
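The behavior described above can be reproduced with pandas alone. This is a minimal sketch, assuming a one-row stand-in for err.failure_cases; the frame contents are made up for illustration and are not taken from pandera.

```python
import pandas as pd

# Hypothetical stand-in for err.failure_cases with a single failing row.
failure_cases = pd.DataFrame({"failure_case": ["object"]})

# A MultiIndex column key is a plain Python tuple.
column = ("foo", "baz")

try:
    failure_cases.assign(column=column)
    error_message = None
except ValueError as exc:
    # pandas reads the 2-tuple as two row values, not as one scalar label,
    # so its length is compared against the length of the index.
    error_message = str(exc)

print(error_message)
```

The two-element tuple cannot be fitted into a one-row frame, which matches the length mismatch reported in the traceback.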

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema({
    ("foo", "bar"): Column(int),
    ("foo", "baz"): Column(int)
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

print(schema.validate(df, lazy=True))

Expected behavior

A pandera.errors.SchemaErrors should be raised, reporting the type mismatch on the column ("foo", "baz"), with ("foo", "baz") as the value in the column named column of the failure cases.

Desktop:

OS: Windows 10
Version: Python 3.9.0, with pandera 0.7.0, pandas 1.1.4 installed

Additional context

If we change the code above to

import pandas as pd
from pandera import Column, DataFrameSchema, Check
from pandera.errors import SchemaErrors

schema = DataFrameSchema({
    ("foo", "bar"): Column(int, checks=Check(lambda s: s == 1)),
    ("foo", "baz"): Column(str, name="b")
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

try:
    schema.validate(df, lazy=True)
except SchemaErrors as e:
    # lazy validation collects every failure on the raised SchemaErrors
    print(e.failure_cases)

The output would be

  schema_context column     check  check_number  failure_case  index
0         Column    foo  <lambda>             0             2      1
1         Column    bar  <lambda>             0             3      2

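This shows that the column name ("foo", "bar") is treated as a list-like of per-row values rather than as a single label: when assign is called on err.failure_cases, "foo" lands in the first row and "bar" in the second. No error is raised here only because the tuple length happens to equal the number of failure cases.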
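The silent variant of the problem can also be shown without pandera. A small sketch, assuming a two-row frame shaped like the failure cases in the output above:

```python
import pandas as pd

# Hypothetical stand-in for err.failure_cases with two failing rows.
failure_cases = pd.DataFrame({"index": [1, 2], "failure_case": [2, 3]})

# The tuple has exactly as many elements as there are rows, so pandas
# assigns one element per row instead of raising.
result = failure_cases.assign(column=("foo", "bar"))

print(result["column"].tolist())
```

Because the lengths agree, the tuple is split across the rows, which is why the report lists foo and bar as if they were two different columns.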

Potential Fix

A band-aid fix would be to manually broadcast the value for the column named column before assigning it to err.failure_cases, i.e.

failure_cases = err.failure_cases.assign(
    schema_context=err.schema.__class__.__name__,
    check=check_identifier,
    check_number=err.check_index,
    column=[column] * len(err.failure_cases),
)

which seems to have fixed the problem.
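The manual broadcast can be tried in isolation. The sketch below uses a hypothetical helper, add_column_label, which is not part of pandera; it only illustrates the idea of repeating the label once per row so that each row receives the whole tuple.

```python
import pandas as pd


def add_column_label(failure_cases, column):
    """Hypothetical helper: attach one label (possibly a tuple) to every row."""
    # Repeat the label once per row so pandas sees a list whose length
    # matches the index, with each element being the full label.
    labels = [column] * len(failure_cases)
    return failure_cases.assign(column=labels)


# Hypothetical stand-in for err.failure_cases with two failing rows.
failure_cases = pd.DataFrame({"index": [1, 2], "failure_case": [2, 3]})
labelled = add_column_label(failure_cases, ("foo", "bar"))

print(labelled["column"].tolist())
```

The same helper also works for ordinary string column names, since a repeated string is simply a list of equal strings.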


cosmicBboy · 2021-09-06T16:36:19Z

hi @peter1456, #600 should address this issue

peter1456 · 2021-09-09T13:16:12Z

Hi @cosmicBboy, thanks a lot!

peter1456 · 2021-09-14T04:27:00Z

Hi @cosmicBboy,

Sorry for not checking carefully, but it seems that pandas does not broadcast properly even when the MultiIndex column name is wrapped in a list. See the following example:

import pandas as pd
from pandera import Column, DataFrameSchema, Check

schema = DataFrameSchema({
    ("foo", "bar"): Column(int, checks=Check(lambda s: s == 1)),
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
})

schema.validate(df, lazy=True)

When it is run, the following error is thrown:

Traceback (most recent call last):
  File "test_pandera.py", line 12, in <module>
    schema.validate(df, lazy=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandera\schemas.py", line 655, in validate
    raise errors.SchemaErrors(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandera\errors.py", line 87, in __init__
    error_counts, failure_cases = self._parse_schema_errors(schema_errors)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandera\errors.py", line 173, in _parse_schema_errors
    failure_cases = err.failure_cases.assign(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3695, in assign
    data[k] = com.apply_if_callable(v, data)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3040, in __setitem__
    self._set_item(key, value)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3116, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3764, in _sanitize_column
    value = sanitize_index(value, self.index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 747, in sanitize_index
    raise ValueError(
ValueError: Length of values (1) does not match length of index (2)

It seems that pandas tries to fit a list-like object into the column as-is and fails: it interprets [("foo", "bar")] as a list of length 1 and throws an error because err.failure_cases has length 2. Before the column assignment, err.failure_cases looks like this:

   index  failure_case
0      1             2
1      2             3
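The difference between a scalar and a one-element list can be checked directly in pandas. A minimal sketch, assuming the two-row frame shown above as a stand-in for err.failure_cases:

```python
import pandas as pd

# Stand-in for err.failure_cases before the column assignment.
failure_cases = pd.DataFrame({"index": [1, 2], "failure_case": [2, 3]})

# A scalar string is broadcast to every row.
scalar_result = failure_cases.assign(column="foo")

# A one-element list is not broadcast: its length (1) must match the
# number of rows (2), so pandas raises instead.
try:
    failure_cases.assign(column=[("foo", "bar")])
    list_raised = False
except ValueError:
    list_raised = True

print(scalar_result["column"].tolist(), list_raised)
```

Wrapping the tuple in a list therefore only helps when there is exactly one failure case; with more rows the list has to be repeated to the length of the frame.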

Desktop:

OS: Windows 10
Version: Python 3.8.5, with pandera 0.7.1, pandas 1.1.3 installed

peter1456 · 2021-09-14T06:33:51Z

Hi @cosmicBboy, thanks again!

cosmicBboy · 2021-09-14T23:50:56Z

fixed by #622

peter1456 added the bug label Aug 18, 2021

cosmicBboy added a commit that referenced this issue Sep 6, 2021

schemas with multi-index columns correctly report errors

f3cc318

fixes #589

cosmicBboy mentioned this issue Sep 6, 2021

schemas with multi-index columns correctly report errors #600

Merged

cosmicBboy added a commit that referenced this issue Sep 6, 2021

schemas with multi-index columns correctly report errors (#600)

6b3a4e9

fixes #589

peter1456 closed this as completed Sep 9, 2021

peter1456 reopened this Sep 14, 2021

cosmicBboy mentioned this issue Sep 14, 2021

Bugfix/589: MultiIndex schema error reporting #622

Merged

cosmicBboy closed this as completed Sep 14, 2021
