
Pick up numeric-only statistics only when numeric columns exist #19

Open: otiai10 wants to merge 1 commit into base: master

@otiai10 otiai10 commented Aug 22, 2019

Problem

pandas raises a KeyError when infer_schema is applied to a table that contains no numeric columns.

Traceback (most recent call last):
  File "main.py", line 20, in <module>
    infer_schema.infer_schema(dataframe, 'foobar', sample_size=40)
  File "/usr/local/lib/python3.6/dist-packages/pydqc/infer_schema.py", line 208, in infer_schema
    'sample_uni_percentage', 'sample_min', 'sample_median', 'sample_max', 'sample_std']]
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 2981, in __getitem__
    indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1271, in _convert_to_indexer
    return self._get_listlike_indexer(obj, axis, **kwargs)[1]
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1078, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py", line 1171, in _validate_read_indexer
    raise KeyError("{} not in index".format(not_found))
KeyError: "['sample_std', 'sample_min', 'sample_max', 'sample_median'] not in index"

Reason

The missing keys (sample_min, sample_median, sample_max, sample_std) are added in _cal_column_stat, but only when the column type is 'numeric'. Afterwards, infer_schema selects them unconditionally, which assumes that at least one of the provided columns is 'numeric' and that those keys therefore exist.
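The failure mode can be reproduced outside pydqc with a few lines of pandas. This is only a sketch of the mechanism, not pydqc's code: `fake_column_stat`, the `column` and `type` keys, and the sample values are hypothetical stand-ins, and only the `sample_*` names are taken from the traceback.

```python
import pandas as pd

# Keys that, per the traceback, are missing when no column is numeric.
NUMERIC_KEYS = ["sample_min", "sample_median", "sample_max", "sample_std"]


def fake_column_stat(name, col_type):
    """Hypothetical stand-in for a per-column stat: numeric keys are
    only added when the column type is 'numeric'."""
    stat = {"column": name, "type": col_type, "sample_uni_percentage": 1.0}
    if col_type == "numeric":
        stat.update(
            {"sample_min": 0, "sample_median": 1, "sample_max": 2, "sample_std": 0.5}
        )
    return stat


# A table with only string columns: no row contributes the numeric keys,
# so the resulting frame has no such columns at all.
stats = pd.DataFrame(
    [fake_column_stat("name", "str"), fake_column_stat("city", "str")]
)

try:
    # Unconditional selection, as the description says infer_schema does.
    stats[["column", "type", "sample_uni_percentage"] + NUMERIC_KEYS]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)
```

With at least one numeric row, the same selection succeeds, because pandas fills the numeric keys with NaN for the non-numeric rows.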

Solution

If none of the columns is numeric, don't select those numeric-only keys.
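A minimal sketch of such a guard, assuming the per-column stats are held in a DataFrame with a `type` column. The helper name `select_schema_columns` and the `column` and `type` labels are hypothetical and not taken from pydqc; the actual change in this pull request may look different.

```python
import pandas as pd

# Keys that only exist when at least one column is numeric.
NUMERIC_KEYS = ["sample_min", "sample_median", "sample_max", "sample_std"]


def select_schema_columns(stats, base_columns):
    """Select base_columns, adding the numeric-only keys only when
    some row of `stats` is typed 'numeric' (hypothetical helper)."""
    wanted = list(base_columns)
    if (stats["type"] == "numeric").any():
        wanted += NUMERIC_KEYS
    return stats[wanted]


base = ["column", "type", "sample_uni_percentage"]

# No numeric column: the numeric keys are skipped, so no KeyError.
str_only = pd.DataFrame(
    [{"column": "name", "type": "str", "sample_uni_percentage": 1.0}]
)
print(list(select_schema_columns(str_only, base).columns))

# One numeric column: the numeric keys are present and selected.
mixed = pd.DataFrame(
    [
        {"column": "name", "type": "str", "sample_uni_percentage": 1.0},
        {
            "column": "age",
            "type": "numeric",
            "sample_uni_percentage": 0.9,
            "sample_min": 1,
            "sample_median": 2,
            "sample_max": 3,
            "sample_std": 0.5,
        },
    ]
)
print(list(select_schema_columns(mixed, base).columns))
```

An alternative would be to keep only the requested labels that are actually present in `stats.columns`. That tolerates any missing key, but it could also hide an unrelated bug that drops a column.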
