Get schema from IPC (Feather v2) file without loading it first. #1114
Comments
Once getting columns from an IPC file is exposed, the following issue (once implemented) should help with faster loading of selected columns: |
I took a look at arrow2's code, and it seems that this is already possible: |
I first have to expose a schema object to the python API. @jorgecarleitao, I don't know much about IPC. Is it also possible to determine the size of a recordbatch, say only read |
It is possible, but not yet implemented ^_^ |
@ritchie46 Is the IPC schema already exposed to the polars python API? |
@ghuls, no, not yet. I also don't think arrow supports this at the moment. |
@ritchie46 You mean in arrow-rs? Isn't it supported here: https://github.com/apache/arrow-rs/blob/master/arrow/src/ipc/reader.rs#L649-L652 |
Check! |
A bit of a late realization, but since this is supported in pyarrow, I don't think we should compile this behavior; we can just create a Python function that returns the schema. For the schema I think we should use a |
Sounds good for now. This still takes 10 seconds for a feather file with 1 million columns.
Reading actual data takes ages when not using memory mapping with pyarrow. So support for projection pushdown when reading from an IPC file (as added in jorgecarleitao/arrow2#264) would be very welcome. |
I didn't realize this read data first. In that case we'd better look at this snippet: this could be done in the python bindings. |
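For context on why a schema lookup does not need to touch column data: in the Arrow IPC file format the schema is stored in a footer at the end of the file, followed by a 4-byte little-endian footer length and the magic bytes `ARROW1`. Below is a minimal stdlib-only sketch that locates that footer by seeking from the end of the file. It is not the snippet referenced above and not how polars or arrow2 implement it; the function name `read_ipc_footer_bytes` is hypothetical, and the returned bytes are a still-encoded flatbuffer that a real reader would go on to decode.

```python
import struct

MAGIC = b"ARROW1"  # magic bytes at both the start and the end of an Arrow IPC file


def read_ipc_footer_bytes(path):
    """Return the raw, still flatbuffer-encoded footer of an Arrow IPC file.

    Only the first 6 bytes and the tail of the file are read; the record
    batches in between are never touched.
    """
    with open(path, "rb") as f:
        if f.read(len(MAGIC)) != MAGIC:
            raise ValueError("missing leading ARROW1 magic")

        f.seek(0, 2)  # jump to the end to learn the file size
        file_size = f.tell()
        tail_len = 4 + len(MAGIC)  # int32 footer length + trailing magic
        if file_size < len(MAGIC) + tail_len:
            raise ValueError("file too small to be an Arrow IPC file")

        f.seek(file_size - tail_len)
        tail = f.read(tail_len)
        if tail[4:] != MAGIC:
            raise ValueError("missing trailing ARROW1 magic")

        (footer_len,) = struct.unpack("<i", tail[:4])
        if footer_len <= 0 or footer_len > file_size - tail_len - len(MAGIC):
            raise ValueError("implausible footer length")

        # The footer sits immediately before the length field.
        f.seek(file_size - tail_len - footer_len)
        return f.read(footer_len)
```

The cost of this lookup depends on the footer size rather than on the amount of column data, although with millions of columns the footer itself is large and decoding it still takes time.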
@ritchie46 polars/polars/polars-io/src/ipc.rs Line 97 in 629f501
|
Yes, this must be added now. Also a scan_ipc. |
Tested the new
In [13]: feather_file = 'test.rankings.v2.feather'
# Get schema from Feather file with polars.
In [14]: %time feather_schema_polars = pl.read_ipc_schema(feather_file)
CPU times: user 2.36 s, sys: 405 ms, total: 2.77 s
Wall time: 2.76 s
# Get schema from Feather file with pyarrow (pyarrow might do some more stuff here).
In [15]: %time feather_dataset = ds.dataset(feather_file, format='feather')
CPU times: user 6.87 s, sys: 324 ms, total: 7.19 s
Wall time: 7.19 s
In [16]: feather_schema_pyarrow = feather_dataset.schema
# Number of columns
In [47]: len(feather_schema_polars)
Out[47]: 2208374
# Compare column names from schema extracted by polars and pyarrow.
In [48]: list(feather_schema_polars.keys()) == feather_schema_pyarrow.names
Out[48]: True |
This is possible now. |
Are you planning to support projection directly (instead of passing None unconditionally)? |
Oh, yes. Maybe we could make an issue for that and label it |
Are you sure parquet supports it already in non-lazy mode? |
Not entirely. :P |
Created #1569, which adds lazy projection for IPC and eager column reading for IPC and parquet. |
Describe your feature request
Get schema from IPC (Feather v2) file without loading it first.
Similar to what is possible with pyarrow dataset API:
Getting all column names before loading the data is useful when you have a lot of columns (with unknown names) from which
you want to select only a subset.
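As a rough illustration of that use case, here is a small sketch of picking a subset once the column names are known. The helper `select_columns` and the example column names are hypothetical; the names themselves would come from the schema, for instance `list(pl.read_ipc_schema(feather_file).keys())` as used in the comments.

```python
from fnmatch import fnmatchcase


def select_columns(all_columns, patterns):
    """Return the column names matching any glob pattern, in schema order."""
    return [
        name
        for name in all_columns
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical column names standing in for a real schema.
names = ["gene_a", "gene_b", "motif_1", "motif_2", "index"]
wanted = select_columns(names, ["motif_*", "index"])
```

The resulting list can then be handed to whatever column-projection option the reader offers, so that only those columns are loaded.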