Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get schema from IPC (Feather v2) file without loading it first. #1114

Closed
ghuls opened this issue Aug 9, 2021 · 20 comments
Closed

Get schema from IPC (Feather v2) file without loading it first. #1114

ghuls opened this issue Aug 9, 2021 · 20 comments

Comments

@ghuls
Copy link
Collaborator

ghuls commented Aug 9, 2021

Describe your feature request

Get schema from IPC (Feather v2) file without loading it first.

Similar to what is possible with pyarrow dataset API:

ds = pyarrow.dataset.dataset(feather_file,` format="feather")

# Get all column names
ds.schema.names

Getting all column names before loading the data is useful in case you have a lot of columns (with unknown names). from which
you want to select only a subset.

import pyarrow as pa
import pyarrow.feather as pf
import pyarrow.dataset as ds

feather_file = 'test.feather'

In [9]: %time feather_v2_dataset = ds.dataset(feather_file, format="feather")
CPU times: user 8.74 s, sys: 1.19 s, total: 9.93 s
Wall time: 9.97 s

In [10]: %time column_names = feather_v2_dataset.schema.names
CPU times: user 485 ms, sys: 248 ms, total: 733 ms
Wall time: 733 ms

# Select 10 columns (2 million columns in total).
selected_column_names = column_names[0:100000:10000]

In [12]: len(selected_column_names)
Out[12]: 10

# Use pyarrow memory mapping to load only the columns of interest.
In [16]: %time pa_table = pf.read_table(feather_file, columns=selected_column_names, memory_map=True)
CPU times: user 6.85 s, sys: 32.5 s, total: 39.3 s
Wall time: 48 s

# Second time loading the data is faster (as it is still in cache). 
In [17]: %time pa_table = pf.read_table(feather_file, columns=selected_column_names, memory_map=True)
CPU times: user 6.17 s, sys: 1.75 s, total: 7.92 s
Wall time: 10.8 s

# Convert to polars dataframe.
%time pl_df = pl.DataFrame(pa_table)
CPU times: user 1.18 ms, sys: 183 µs, total: 1.37 ms
Wall time: 1.26 ms

# Without memory mapping it takes ages:
# Pyarrow Dataset API does not seem to use memory mapping at the moment and has similar timing

In [21]: %time pa_table = pf.read_table(feather_file, columns=selected_column_names, memory_map=False)
CPU times: user 9.38 s, sys: 4min 57s, total: 5min 7s
Wall time: 20min 47s

In [50]: %time pa_table_from_ds = feather_v2_dataset.to_table(columns=selected_column_names)
CPU times: user 7.36 s, sys: 5min 23s, total: 5min 30s
Wall time: 24min 13s
@ghuls
Copy link
Collaborator Author

ghuls commented Aug 9, 2021

Once getting columns from an IPC file is exposed, the following issue (once implemented) should help with faster loading of selected columns:
jorgecarleitao/arrow2#261
jorgecarleitao/arrow2#264

@jorgecarleitao
Copy link
Collaborator

I took a look at arrow2's code, and it seems that this is already possible: read_file_metadata returns FileMetadata, which contains FileMetadata::schema().

@ritchie46
Copy link
Member

I first have to expose a schema object to the python API.

@jorgecarleitao, I don't know much about IPC. Is it also possible to determine the size of a recordbatch, say only read n rows?

@jorgecarleitao
Copy link
Collaborator

It is possible, but not yet implemented ^_^

@ghuls
Copy link
Collaborator Author

ghuls commented Sep 13, 2021

@ritchie46 Is the IPC schema already exposed to the polars python API?

@ritchie46
Copy link
Member

@ghuls, no not yet. I also don't think arrow supports this at the moment.

@ghuls
Copy link
Collaborator Author

ghuls commented Sep 13, 2021

@ritchie46
Copy link
Member

Check!

@ritchie46
Copy link
Member

A bit late realization, but as this is supported in pyarrow. I don't think we should compile this behavior, and just create a python function that returns the schema. For the schema I think we should use a Dict[str, Polars DataType].

@ghuls
Copy link
Collaborator Author

ghuls commented Sep 14, 2021

Sounds good for now.

This still takes 10 seconds for a feather file with 1 million columns.

ds = pyarrow.dataset.dataset(feather_file,` format="feather")

# Get schema.
ds.schema

Reading actual data takes ages when not using memory mapping with pyarrow.

So support for projection pushdown when reading from an IPC file (as added in jorgecarleitao/arrow2#264) would be very welcome.

@ritchie46
Copy link
Member

I didn't realize this read data first. In that case we'd better look at this snippet:

#1114 (comment)

this could be done in the python bindings.

@ghuls
Copy link
Collaborator Author

ghuls commented Sep 16, 2021

@ritchie46
Can the projection option of arrow::io::ipc:read::FileReader::new:
https://github.com/jorgecarleitao/arrow2/blob/b77de38cc1cf804ba00e63daf7c0e9f26bb2fa5b/src/io/ipc/read/reader.rs#L234-L238
be exposed in polars (now None is passed unconditionally):

let ipc_reader = read::FileReader::new(&mut self.reader, metadata, None);

@ritchie46
Copy link
Member

@ritchie46
Can the projection option of arrow::io::ipc:read::FileReader:🆕
https://github.com/jorgecarleitao/arrow2/blob/b77de38cc1cf804ba00e63daf7c0e9f26bb2fa5b/src/io/ipc/read/reader.rs#L234-L238
be exposed in polars (now None is passed unconditionally):

let ipc_reader = read::FileReader::new(&mut self.reader, metadata, None);

Yes, this must be added now. Also a scan_ipc.

@ghuls
Copy link
Collaborator Author

ghuls commented Sep 17, 2021

Tested the new pl.read_ipc_schema function.

In [13]: feather_file = 'test.rankings.v2.feather'

# Get schem from Feather file with polars.
In [14]: %time feather_schema_polars = pl.read_ipc_schema(feather_file)
CPU times: user 2.36 s, sys: 405 ms, total: 2.77 s
Wall time: 2.76 s


# Get schema from Feather file with pyarrow (pyarrow might do some more stuff here).
In [15]: %time feather_dataset = ds.dataset(feather_file, format='feather')
CPU times: user 6.87 s, sys: 324 ms, total: 7.19 s
Wall time: 7.19 s

In [16]: feather_schema_pyarrow = feather_dataset.schema

# Number of columns
In [47]: len(feather_schema_polars)
Out[47]: 2208374

# Compare column names from schema extracted by polars and pyarrow.
In [48]: list(feather_schema_polars.keys()) == feather_schema_pyarrow.names
Out[48]: True

@ritchie46
Copy link
Member

This is possible now.

@ghuls
Copy link
Collaborator Author

ghuls commented Oct 20, 2021

@ritchie46
Can the projection option of arrow::io::ipc:read::FileReader:new
https://github.com/jorgecarleitao/arrow2/blob/b77de38cc1cf804ba00e63daf7c0e9f26bb2fa5b/src/io/ipc/read/reader.rs#L234-L238
be exposed in polars (now None is passed unconditionally):

let ipc_reader = read::FileReader::new(&mut self.reader, metadata, None);

Yes, this must be added now. Also a scan_ipc.

Are you planning to support projection directly (instead of passing None unconditionally)?

@ritchie46
Copy link
Member

Oh, yes. Maybe we could make an issue for that and label it good_first_issue. Its a copy of the parquet logic.

@ghuls
Copy link
Collaborator Author

ghuls commented Oct 20, 2021

Are you sure parquet supports it already in non lazy mode?

@ritchie46
Copy link
Member

Are you sure parquet supports it already in non lazy mode?

Not entrirely. :P

@ghuls
Copy link
Collaborator Author

ghuls commented Oct 20, 2021

Created: #1569 which adds lazy projection for IPC and eager column reading for IPC and parquet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants