Reading nested struct panics with OutOfSpec error #3942
Comments
@jorgecarleitao: I don't think that all the cases are covered in the current I would reopen the previous #3892 ticket but I cannot. cc: @ritchie46 |
@ritchie46, @jorgecarleitao: Any ETA on having this fix pulled from arrow2 into here? |
It already is. |
Let me pull the master and try the test again. |
@jorgecarleitao, I did run some tests and I did find another case with
Maybe this can help in some way until I am able to create a slim parquet file. The current file that produces this error is about |
@ritchie46, @jorgecarleitao: I managed to print out the conflicting data structures. This is how they are looking... Values at index
Values at index
The first line (the one with index The fields are:
I don't think the culprit is the data, because there is no issue in Spark. I think there is an issue with the |
Hey @andrei-ionescu. Thanks again for the patience and for the report - it is very useful 🙇. Sorry for the late reply, I am on vacation with limited access to the internet. Just to make sure I understood the last comment: "index 0" and "index 1" represent the column index, "line" represents the row number, and the issue is that the columns have a different number of rows. Are you able to create a (mock) file with e.g. pandas or pyarrow that reproduces the problem? |
@jorgecarleitao: Here is the file — part-00003-a422a23f-e65a-4cab-9bd0-6e877a8f7337-c000.snappy.parquet.zip — about |
@jorgecarleitao, @ritchie46: Is this cherry picked in polars? |
@jorgecarleitao, @ritchie46: I just tried latest arrow2 + latest polars (both straight from the git repo) + the file above and I still see the same Am I missing something?
|
No, that was on me. The fix was insufficient - I believe jorgecarleitao/arrow2#1188 fixes this. Your file is a really good fuzzy test. |
@jorgecarleitao: I'm glad that it's helpful. |
@jorgecarleitao, I just tested/checked the code changes you merged with the jorgecarleitao/arrow2#1188 and I can still see the issue. I also can validate that the error message now is the new one you changed in the PR:
|
Strange - I can read the file you posted here with
# in arrow2
cargo run --release --example parquet_read --features io_parquet,io_parquet_compression,io_print -- part-00003-a422a23f-e65a-4cab-9bd0-6e877a8f7337-c000.snappy.parquet
Changing |
The error still comes from |
I don't think the fix was already in the polars branch. |
I'm building the example from git with updated dependencies in Polars to reference the latest arrow2. |
It may be something in the
Maybe there is something wrong with the params received from polars. |
Is there an update on this? I am curious whether something else is required here, as this is an important use case. |
If you can run it in arrow, I expect this is something on our side. I will look into this |
I can also read the file on latest master:
>>> pl.read_parquet("nested_struct_OutOfSpec.snappy.parquet")
shape: (2, 1)
┌─────────────────────────────────────┐
│ dim                                 │
│ ---                                 │
│ struct[4]                           │
╞═════════════════════════════════════╡
│ {{null,null,null,null,null,null,... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {{null,null,null,"2gYhOc2Edy8GBw... │
└─────────────────────────────────────┘
Thanks for the fix upstream @jorgecarleitao. @andrei-ionescu we are close to a crates.io release. You can already point to latest master to have your fix working, but it will also work on crates.io soon. :) I will close this now. |
|
The issue occurs when appending structs of different chunk sizes. MWE:
import polars as pl

s = pl.Series([
    {'_experience': {
        'aaid': {'id': '7759804769753743647',
                 'namespace': {'code': '3245164418740504690'},
                 'primary': True},
        'mcid': {'id': None, 'namespace': {'code': None}, 'primary': None}}},
    {'_experience': {
        'aaid': {'id': '8337071409830986729',
                 'namespace': {'code': '3245164418740504690'},
                 'primary': False},
        'mcid': {'id': '6495617396286731444',
                 'namespace': {'code': '3624253825458969727'},
                 'primary': True}}},
    {'_experience': {
        'aaid': {'id': '5948492535810675291',
                 'namespace': {'code': '3245164418740504690'},
                 'primary': True},
        'mcid': {'id': None, 'namespace': {'code': None}, 'primary': None}}},
])
# Appending a 2-row slice to the 3-row series triggers the panic below.
s.append(s[:2])

stacktrace
thread '' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("The children must have an equal number of values.\n However, the values at index 1 have a length of 3, which is different from values at index 0, 2.")', /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/8604cb7/src/array/struct_/mod.rs:118:52
stack backtrace:
0: rust_begin_unwind
     at /rustc/93ffde6f04d3d24327a4e17a2a2bf4f63c246235/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
     at /rustc/93ffde6f04d3d24327a4e17a2a2bf4f63c246235/library/core/src/panicking.rs:142:14
2: core::result::unwrap_failed
     at /rustc/93ffde6f04d3d24327a4e17a2a2bf4f63c246235/library/core/src/result.rs:1814:5
3: core::result::Result::unwrap
     at /rustc/93ffde6f04d3d24327a4e17a2a2bf4f63c246235/library/core/src/result.rs:1107:23
4: arrow2::array::struct_::StructArray::new
     at /home/ritchie46/.cargo/git/checkouts/arrow2-8a2ad61d97265680/8604cb7/src/array/struct_/mod.rs:118:9
5: polars_core::chunked_array::logical::struct_::StructChunked::update_chunks
     at /home/ritchie46/code/polars/polars/polars-core/src/chunked_array/logical/struct_/mod.rs:76:32
6: polars_core::series::implementations::struct_::>::append
     at /home/ritchie46/code/polars/polars/polars-core/src/series/implementations/struct_.rs:128:9
7: polars_core::series::Series::append
     at /home/ritchie46/code/polars/polars/polars-core/src/series/mod.rs:210:9
8: polars_core::series::implementations::struct_::>::append
     at /home/ritchie46/code/polars/polars/polars-core/src/series/implementations/struct_.rs:126:13
9: polars_core::series::Series::append
     at /home/ritchie46/code/polars/polars/polars-core/src/series/mod.rs:210:9
10: polars_core::series::implementations::struct_::>::append
     at /home/ritchie46/code/polars/polars/polars-core/src/series/implementations/struct_.rs:126:13
11: polars_core::series::Series::append
     at /home/ritchie46/code/polars/polars/polars-core/src/series/mod.rs:210:9
12: polars::series::PySeries::append
     at /home/ritchie46/code/polars/py-polars/src/series.rs:493:9
13: polars::series::_::_::__init::__INVENTORY::__wrap::{{closure}}
     at /home/ritchie46/code/polars/py-polars/src/series.rs:198:1
14: std::panicking::try::do_call
     at /rustc/93ffde6f04d3d24327a4e17a2a2bf4f63c246235/library/std/src/panicking.rs:492:40
15: __rust_try
16: std::panicking::try
     at /rustc/93ffde6f04d3d24327a4e17a2a2bf4f63c246235/library/std/src/panicking.rs:456:19
17: std::panic::catch_unwind
     at /rustc/93ffde6f04d3d24327a4e17a2a2bf4f63c246235/library/std/src/panic.rs:137:14
18: polars::series::_::_::__init::__INVENTORY::__wrap
     at /home/ritchie46/code/polars/py-polars/src/series.rs:198:1
19: method_vectorcall_VARARGS_KEYWORDS
     at /opt/conda/conda-bld/python-split_1649141344976/work/Objects/descrobject.c:348
20: _PyObject_VectorcallTstate
     at /opt/conda/conda-bld/python-split_1649141344976/work/Include/cpython/abstract.h:118
21: PyObject_Vectorcall
     at /opt/conda/conda-bld/python-split_1649141344976/work/Include/cpython/abstract.h:127
22: call_function
     at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:5077
23: _PyEval_EvalFrameDefault
     at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:3506
24: _PyEval_EvalFrame
     at /opt/conda/conda-bld/python-split_1649141344976/work/Include/internal/pycore_ceval.h:40
25: _PyEval_EvalCode
     at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:4329
26: _PyFunction_Vectorcall
     at /opt/conda/conda-bld/python-split_1649141344976/work/Objects/call.c:396
119: PyObject_Vectorcall

PanicException Traceback (most recent call last)
File ~/code/polars/py-polars/polars/internals/series.py:1410, in Series.append(self, other, append_chunks)
PanicException: called |
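The panic message above states an invariant: every child of a struct array must hold the same number of values. The following is a minimal pure-Python sketch of that check, not the arrow2 or polars implementation. The names `check_children` and `rechunk` and the sample fields are hypothetical, and the rechunking step is only one assumed way to realign the chunks, not necessarily what the eventual fix does.

```python
from itertools import chain


def check_children(children):
    """Return the common length of all children, or raise if they differ."""
    expected = len(children[0]) if children else 0
    for index, child in enumerate(children):
        if len(child) != expected:
            raise ValueError(
                "children must have an equal number of values: "
                f"index {index} has length {len(child)}, "
                f"index 0 has length {expected}"
            )
    return expected


def rechunk(chunks):
    """Concatenate all chunks of one field into a single chunk."""
    return [list(chain.from_iterable(chunks))]


# Hypothetical struct column: both fields hold 3 values in total,
# but the values are split into chunks differently.
field_a = [["a0", "a1"], ["a2"]]  # chunk lengths 2 and 1
field_b = [["b0", "b1", "b2"]]    # a single chunk of length 3

# Pairing the first chunk of each field violates the invariant.
try:
    check_children([field_a[0], field_b[0]])
    mismatch = None
except ValueError as exc:
    mismatch = str(exc)

# After rechunking every field to one chunk, the lengths agree again.
aligned = [rechunk(field_a)[0], rechunk(field_b)[0]]
n_rows = check_children(aligned)
```

In this sketch the mismatch comes only from how the values are chunked, which matches the description of the bug: the data itself is consistent, but chunk boundaries differ between fields.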
#4217 fixes the issue. Note that we still cannot read the file because it contains a |
@ritchie46, thanks for looking into this.
|
Polars will not add the map dtype. Its benefits do not outweigh the extra complexity. Maybe we can investigate conversion of maps to struct. But I will have to explore that. |
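To illustrate what a map-to-struct conversion could look like, here is a small pure-Python sketch. It relies on the fact that an Arrow map is physically a list of key/value structs. The function names are hypothetical and this does not show how polars handles maps.

```python
def map_to_entries(mapping):
    """Represent one map value as a list of {"key", "value"} structs."""
    if mapping is None:
        return None
    return [{"key": key, "value": value} for key, value in mapping.items()]


def map_to_fixed_struct(mapping, field_names):
    """Alternative: project a map onto a fixed set of struct fields.

    Keys missing from the map become None; keys not listed are dropped.
    """
    source = mapping or {}
    return {name: source.get(name) for name in field_names}


row = {"browser": "firefox", "os": "linux"}
entries = map_to_entries(row)
fixed = map_to_fixed_struct(row, ["browser", "device"])
```

The first form keeps every key but yields a variable-length list per row. The second gives a stable struct schema, at the cost of knowing the field names up front.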
With #4226 we can read the entire file. The |
@ritchie46, @jorgecarleitao: We need to re-open this one more time. With the code given below and the previous file (part-00003-a422a23f-e65a-4cab-9bd0-6e877a8f7337-c000.snappy.parquet.zip) I get again the
let df = LazyFrame::scan_parquet(file_location, ScanArgsParquet::default())
    .unwrap()
    .filter(
        col("timestamp")
            .cast(DataType::Datetime(TimeUnit::Nanoseconds, None))
            .gt(datetime(DatetimeArgs {
                year: lit(2022),
                month: lit(1),
                day: lit(1),
                hour: None,
                minute: None,
                second: None,
                millisecond: None,
            })),
    )
    .select([
        count().alias("monthcount"),
        col("timestamp"),
    ])
    .collect()
    .unwrap();
dbg!(df);
When I remove the filter, it does not panic. Here is the panic error:
Could you have another look? |
@ritchie46, @jorgecarleitao Any updates on this? |
@andrei-ionescu found another issue, opened it upstream jorgecarleitao/arrow2#1239. |
@ritchie46, @jorgecarleitao Thanks for looking into it! I've seen the upstream ticket and fix PR are complete. Is it ready in this PR? Can I run another set of tests? |
Yes, give it a spin. :) |
Folks, I am facing a similar error on the latest version. Any pointers as to how I can fix this? |
Hi @rajatkb-sc - I also encountered the same issue in 0.19. I was able to narrow it down to an empty struct inside a nested list in a json file. I wrote a script to loop through the json and delete empty nodes before loading to a dataframe, and it resolved the issue. |
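A workaround along those lines might look like the sketch below. It is an assumption about what such a script does, not the commenter's actual code: it recursively drops empty objects and empty arrays from parsed JSON, using only the standard library. The names `prune_empty` and `_prune` and the sample document are hypothetical.

```python
import json

_EMPTY = object()  # sentinel marking a node that should be dropped


def _prune(node):
    """Recursively remove empty dicts and lists; scalars (even None) stay."""
    if isinstance(node, dict):
        kept = {}
        for key, value in node.items():
            pruned = _prune(value)
            if pruned is not _EMPTY:
                kept[key] = pruned
        return kept if kept else _EMPTY
    if isinstance(node, list):
        kept = [item for item in map(_prune, node) if item is not _EMPTY]
        return kept if kept else _EMPTY
    return node


def prune_empty(document):
    """Return a pruned copy of the document, or None if nothing is left."""
    result = _prune(document)
    return None if result is _EMPTY else result


raw = '{"id": 1, "items": [{"tags": [{}], "name": "x"}, {}], "meta": {}}'
cleaned = prune_empty(json.loads(raw))
```

Note that dropping elements from a list shortens it and shifts positions, so this only suits data where list position carries no meaning.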
What language are you using?
Rust
Which feature gates did you use?
"polars-io", "parquet", "lazy", "dtype-struct"
Have you tried latest version of polars?
What version of polars are you using?
Latest, master branch.
What operating system are you using polars on?
macOS Monterey 12.3.1
What language version are you using
Describe your bug.
Reading nested struct panics with OutOfSpec error.
What are the steps to reproduce the behavior?
Given the attached parquet file with only 2 rows: nested_struct_OutOfSpec.snappy.parquet.zip
Running the following code:
Results in this panic error:
What is the actual behavior?
The result is a panic error with this output:
What is the expected behavior?
The parquet file should have been correctly loaded.
The parquet-tools util shows it properly. Also, Apache Spark properly reads it and processes it.