[FEA] Support nested column pruning in ORC reader when reading a struct column. #8848

Closed
firestarman opened this issue Jul 26, 2021 · 4 comments · Fixed by #9496
Labels: cuIO, feature request, libcudf

Comments


firestarman commented Jul 26, 2021

Assume there is an ORC file containing a single struct column, "_c0", as shown below.

>>> pandas.read_orc("/data/tmp/test.log")
                                                  _c0
0                                                None
1   {'child0': -21.0, 'child1': 770290356.0, 'chil...
2   {'child0': 112.0, 'child1': 741485586.0, 'chil...
3   {'child0': -50.0, 'child1': -2094070070.0, 'ch...
4   {'child0': -49.0, 'child1': -228841867.0, 'chi...
5   {'child0': -55.0, 'child1': -2116626173.0, 'ch...
6   {'child0': None, 'child1': 2117211837.0, 'chil...
7   {'child0': -61.0, 'child1': -456120487.0, 'chi...
8                                                None
9   {'child0': 126.0, 'child1': -850819620.0, 'chi...
10  {'child0': 88.0, 'child1': 2143646677.0, 'chil...
11  {'child0': -18.0, 'child1': -1698102546.0, 'ch...
12  {'child0': -121.0, 'child1': 1460680657.0, 'ch...
>>> 

The pandas ORC reader can read just one of the nested columns, e.g.

>>> pandas.read_orc("/data/tmp/test.log", columns=['_c0.child0'])
                   _c0
0                 None
1    {'child0': -21.0}
2    {'child0': 112.0}
3    {'child0': -50.0}
4    {'child0': -49.0}
5    {'child0': -55.0}
6     {'child0': None}
7    {'child0': -61.0}
8                 None
9    {'child0': 126.0}
10    {'child0': 88.0}
11   {'child0': -18.0}
12  {'child0': -121.0}
>>> 

or two of the nested columns,

>>> pandas.read_orc("/data/tmp/test.log", columns=['_c0.child1', '_c0.child0'])
                                           _c0
0                                         None
1     {'child0': -21.0, 'child1': 770290356.0}
2     {'child0': 112.0, 'child1': 741485586.0}
3   {'child0': -50.0, 'child1': -2094070070.0}
4    {'child0': -49.0, 'child1': -228841867.0}
5   {'child0': -55.0, 'child1': -2116626173.0}
6     {'child0': None, 'child1': 2117211837.0}
7    {'child0': -61.0, 'child1': -456120487.0}
8                                         None
9    {'child0': 126.0, 'child1': -850819620.0}
10    {'child0': 88.0, 'child1': 2143646677.0}
11  {'child0': -18.0, 'child1': -1698102546.0}
12  {'child0': -121.0, 'child1': 1460680657.0}
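The selection semantics visible in the outputs above can be mimicked on plain Python dictionaries. This is only a sketch: `prune_struct` and `SCHEMA_ORDER` are hypothetical names, only the two fields whose names are fully visible above are modelled, and this is not how pandas performs the read.

```python
from typing import List, Optional

# Field order as it appears in the file schema. Only the two fields whose
# names are fully visible in the outputs above are modelled here.
SCHEMA_ORDER = ["child0", "child1"]


def prune_struct(row: Optional[dict], paths: List[str], parent: str = "_c0") -> Optional[dict]:
    """Keep only the requested children of `parent`, in schema order."""
    if row is None:
        # A null struct row stays null.
        return None
    wanted = {p.split(".", 1)[1] for p in paths if p.startswith(parent + ".")}
    return {name: row[name] for name in SCHEMA_ORDER if name in wanted}


# Rows 0, 1 and 6 from the sample above, restricted to the two modelled fields.
rows = [
    None,
    {"child0": -21.0, "child1": 770290356.0},
    {"child0": None, "child1": 2117211837.0},
]

one = [prune_struct(r, ["_c0.child0"]) for r in rows]
two = [prune_struct(r, ["_c0.child1", "_c0.child0"]) for r in rows]
print(one)
print(two)
```

Note that `two` keeps `child0` before `child1` even though the request lists them the other way round, which matches the output shown above.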

However, cuDF raises an error in all of the cases above.

>>> cudf.read_orc("/data/tmp/test.log", columns=['_c0.child0'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/liangcail/miniconda3/envs/cudf-python-rt/lib/python3.7/site-packages/cudf/io/orc.py", line 302, in read_orc
    timestamp_type,
  File "cudf/_lib/orc.pyx", line 73, in cudf._lib.orc.read_orc
  File "cudf/_lib/orc.pyx", line 110, in cudf._lib.orc.read_orc
RuntimeError: cuDF failure at: ../src/io/orc/reader_impl.cu:540: Unknown column name : _c0.child0

Depends on #7830

firestarman added the "feature request" and "Needs Triage" labels Jul 26, 2021
@firestarman

You can get the test file in issue #8704.

@devavret

CC: @rgsl888prabhu

@vuule I think we also need to support this in the Parquet reader to address #7248 (comment).

beckernick added the "cuIO" and "libcudf" labels and removed the "Needs Triage" label Jul 26, 2021
beckernick added this to the IO Data Type Expansion milestone Jul 26, 2021

firestarman commented Jul 27, 2021

One more thing.
The column order in the pandas output follows the file schema ('child0', 'child1'), even when ['_c0.child1', '_c0.child0'] is specified.

I hope cuDF can return the output columns in the same order as the names passed in the `columns` parameter, i.e. ('child1', 'child0'), or at least provide an additional parameter to control this behavior.

@devavret

> I hope cuDF can return the output columns in the same order as the names passed in the `columns` parameter, i.e. ('child1', 'child0'), or at least provide an additional parameter to control this behavior.

Actually that's easier to support. It's what we do for non-nested column selection.
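The two ordering policies discussed here differ only in which list drives the iteration. A rough sketch follows; `order_children` is a hypothetical helper for illustration, not a cuDF or pandas function.

```python
def order_children(requested, schema, follow_request=True):
    """Return the selected child names in request order or in schema order."""
    if follow_request:
        # The order of the `columns` argument wins; repeated names keep
        # their first position.
        return list(dict.fromkeys(requested))
    # The schema order wins (the pandas behaviour reported in this thread).
    wanted = set(requested)
    return [name for name in schema if name in wanted]


schema = ["child0", "child1"]
requested = ["child1", "child0"]

print(order_children(requested, schema, follow_request=True))   # ['child1', 'child0']
print(order_children(requested, schema, follow_request=False))  # ['child0', 'child1']
```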

vuule self-assigned this Oct 8, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 28, 2021
Closes #8848

- Allows caller to specify nested column paths, so that the fields not listed in the `columns` parameter are excluded.
- The order of fields/columns in the output table is consistent with the order of paths/names in the `columns` parameter.
- Moved `aggregate_orc_metadata` implementation to a separate file (can be `.cpp`!)
- Added tests to cover different cases with a mix of nested and parent column selection.
- Changed a few fields from `uint32_t` to `int32_t` to avoid unsigned arithmetic.
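
The selection rules in the list above can be illustrated on a toy schema. The sketch below is not the libcudf implementation: `select_paths`, the nested-dict schema layout, the field `extra` and the column `_c1` are all assumptions made for illustration. A path that names a parent keeps all of its children, a path that names a child keeps only that child, and the output order follows the `columns` argument.

```python
def select_paths(schema, columns):
    """Prune a nested-dict schema down to the dotted paths in `columns`.

    `schema` maps a column name to None (leaf) or to a dict of children.
    The result is ordered by first mention in `columns`, not by the schema.
    """
    selected = {}  # name -> None (whole column) or list of sub-paths
    for path in columns:
        name, _, rest = path.partition(".")
        if name not in schema:
            raise KeyError(f"Unknown column name: {path}")
        if not rest:
            # Whole column requested; this overrides earlier child picks.
            selected[name] = None
        elif selected.get(name, []) is not None:
            selected.setdefault(name, []).append(rest)

    out = {}
    for name, sub in selected.items():
        children = schema[name]
        if sub is None:
            out[name] = children  # keep every child
        elif children is None:
            raise KeyError(f"Unknown column name: {name}.{sub[0]}")
        else:
            out[name] = select_paths(children, sub)
    return out


# Toy schema: `_c0` is a struct, `_c1` is a flat column.
schema = {"_c0": {"child0": None, "child1": None, "extra": None}, "_c1": None}

pruned = select_paths(schema, ["_c1", "_c0.child1", "_c0.child0"])
print(list(pruned))          # ['_c1', '_c0']
print(list(pruned["_c0"]))   # ['child1', 'child0']
```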

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

URL: #9496