ls test
part-00000-ddf3787b-958c-4c57-9968-f3afdec1eb7e-c000.snappy.orc part-00079-ddf3787b-958c-4c57-9968-f3afdec1eb7e-c000.snappy.orc
part-00039-ddf3787b-958c-4c57-9968-f3afdec1eb7e-c000.snappy.orc _SUCCESS
import dask_cudf
dask_cudf.read_orc('test/*.orc')
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-74c488c779a0> in <module>
1 import dask_cudf
2
----> 3 dask_cudf.read_orc('test/*.orc')
~/conda/envs/rapids/lib/python3.8/site-packages/dask_cudf/io/orc.py in read_orc(path, columns, filters, storage_options, **kwargs)
86
87 with fs.open(paths[0], "rb") as f:
---> 88 meta = cudf.read_orc(f, stripes=[0], columns=columns, **kwargs)
89
90 name = "read-orc-" + tokenize(fs_token, path, columns, **kwargs)
~/conda/envs/rapids/lib/python3.8/site-packages/cudf/io/orc.py in read_orc(filepath_or_buffer, engine, columns, filters, stripes, skiprows, num_rows, use_index, timestamp_type, **kwargs)
258 if engine == "cudf":
259 df = DataFrame._from_table(
--> 260 liborc.read_orc(
261 filepath_or_buffer,
262 columns,
cudf/_lib/orc.pyx in cudf._lib.orc.read_orc()
cudf/_lib/orc.pyx in cudf._lib.orc.read_orc()
RuntimeError: cuDF failure at: ../src/io/orc/orc.cpp:478: Invalid stripe index
The ORC files in test can all be read individually by cudf.read_orc. Note that some of the files have 0 rows. It seems like dask-cudf is failing when attempting to read an empty ORC file.
This does not happen with Parquet files.
Thanks for raising this @randerzander - Looks like dask_cudf may be failing to generate metadata from an empty file. I should be able to look into this tomorrow :)
Closes #8011
Dask-cuDF currently reads a single stripe to infer metadata in `read_orc`. When the first path corresponds to an empty file, there is no stripe "0" to read. This PR includes a simple fix (and test coverage).
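The idea can be illustrated with a minimal sketch. This is not the code from the PR; `stripes_for_meta` is a hypothetical helper, and it assumes the stripe count of the first file is already known (for example from the file's ORC metadata) and that passing `stripes=None` to `cudf.read_orc` means "no stripe selection".

```python
from typing import List, Optional


def stripes_for_meta(nstripes: int) -> Optional[List[int]]:
    """Choose the `stripes` argument for the metadata-inference read.

    Hypothetical helper, not part of dask_cudf. `nstripes` is the number
    of stripes in the file used to infer metadata.
    """
    if nstripes < 0:
        raise ValueError("nstripes must be non-negative")
    # Only request stripe 0 when the file actually has one. An empty ORC
    # file has no stripes, so asking for index 0 would trigger the
    # "Invalid stripe index" failure shown in the traceback.
    return [0] if nstripes > 0 else None


# Assumed stripe counts for three files, the first of which is empty.
nstripes_per_file = [0, 2, 1]
meta_stripes = stripes_for_meta(nstripes_per_file[0])
print(meta_stripes)

# The value would then be forwarded to the existing call, e.g.
#   meta = cudf.read_orc(f, stripes=meta_stripes, columns=columns, **kwargs)
```

Reading without a stripe selection is cheap for an empty file, since there is no data to load, while non-empty files still only pay for a single stripe.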
Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)
Approvers:
- Keith Kraus (https://github.com/kkraus14)
URL: #8021
I'm trying to use dask-cudf to read ORC files generated by a Spark job.
The above snippet results in 4 files in test: