default JSON/parquet handler write implementations should stream to object store #418

Open
zachschuermann opened this issue Oct 23, 2024 · 0 comments

Comments

@zachschuermann
Collaborator

zachschuermann commented Oct 23, 2024

`write_json_file`/`write_parquet` in the default JSON/parquet handlers should stream data to the object store instead of buffering the whole file in memory before issuing a single PUT.
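For reference, the desired behavior looks roughly like the sketch below. This is a minimal illustration assuming the `object_store` crate's `put`, `put_multipart`, and `WriteMultipart` APIs; the threshold, function name, and chunk iterator are made up for illustration and are not the kernel's actual implementation.

```rust
// Minimal sketch only: stream chunks to the object store, choosing a single
// PUT for small writes and a multipart upload for larger ones. The constant,
// function name, and chunk iterator are illustrative assumptions.
use bytes::Bytes;
use object_store::{path::Path, ObjectStore, PutPayload, WriteMultipart};

/// Size below which a single PUT is cheaper than a multipart upload (illustrative).
const MULTIPART_THRESHOLD: usize = 8 * 1024 * 1024;

async fn write_streaming(
    store: &dyn ObjectStore,
    location: &Path,
    chunks: impl IntoIterator<Item = Bytes>,
) -> object_store::Result<()> {
    let mut chunks = chunks.into_iter().peekable();
    let first = chunks.next().unwrap_or_else(Bytes::new);

    // Small, single-chunk writes: one PUT is fine.
    if first.len() < MULTIPART_THRESHOLD && chunks.peek().is_none() {
        store.put(location, PutPayload::from(first)).await?;
        return Ok(());
    }

    // Larger writes: stream parts through a multipart upload so the whole
    // file is never buffered in memory at once.
    let mut writer = WriteMultipart::new(store.put_multipart(location).await?);
    writer.write(&first);
    for chunk in chunks {
        writer.write(&chunk);
    }
    writer.finish().await?;
    Ok(())
}
```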

@zachschuermann changed the title from "default JSON handler write implementation should stream to object store" to "default JSON/parquet handler write implementations should stream to object store" on Oct 31, 2024
zachschuermann added a commit that referenced this issue Nov 8, 2024
This PR is the second (of two) major pieces for supporting simple blind
appends. It implements:
1. **new `Transaction` APIs** for appending data to delta tables:
   a. `get_write_context()` to get a `WriteContext` to pass to the data path,
      which includes all the information needed to write: `target directory`,
      `snapshot schema`, `transformation expression`, and (future: columns to
      collect stats on)
   b. `add_write_metadata(impl EngineData)` to add metadata about a write to
      the transaction, along with a new static method
      `transaction::get_write_metadata_schema` to provide the expected schema
      of this engine data
   c. new machinery in the `commit` method to commit a new `Add` action for
      each row of write metadata from the API above
2. **new default engine capabilities** for using the default engine to write
   parquet data (to append to tables):
   a. the parquet handler can now `write_parquet_file(EngineData)`
   b. a usage example lives in the `write.rs` tests for now (a rough sketch of
      the flow follows this list)
3. **new append tests** in the `write.rs` integration test suite
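For orientation, the end-to-end engine-side flow is sketched below. Only the names called out above (`get_write_context`, `WriteContext`, `add_write_metadata`, `transaction::get_write_metadata_schema`, `write_parquet_file`, `commit`, `get_parquet_handler`) come from this PR or the existing `Engine` trait; how the transaction is created, the exact signatures, and the helper calls are assumptions for illustration, not the real API.

```rust
// Hypothetical append flow; signatures and helpers not named in this PR
// (start_transaction, transform, target_dir, the write_parquet_file argument
// shape, etc.) are assumptions for illustration only.
use delta_kernel::{DeltaResult, Engine, EngineData, Table};

fn append_sketch(
    table: &Table,
    engine: &dyn Engine,
    data: Box<dyn EngineData>,
) -> DeltaResult<()> {
    // 1a. Open a transaction and get the WriteContext: target directory,
    //     snapshot schema, and the transformation expression to apply.
    let mut txn = table.start_transaction(engine)?;            // assumed constructor
    let write_context = txn.get_write_context();

    // Apply the transformation expression. Schema enforcement happens here:
    // evaluation is expected to fail if the output schema is wrong.
    let physical = write_context.transform(engine, data)?;     // assumed helper

    // 2a. Write the parquet file via the default engine's parquet handler,
    //     getting back metadata about the written file (path, size, ...).
    let write_metadata = engine
        .get_parquet_handler()
        .write_parquet_file(write_context.target_dir(), physical)?; // shape assumed

    // 1b. Hand the write metadata to the transaction; its schema must match
    //     `transaction::get_write_metadata_schema()`.
    txn.add_write_metadata(write_metadata);

    // 1c. Commit: one `Add` action is written per row of write metadata.
    txn.commit(engine)?;
    Ok(())
}
```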

Details and some follow-ups:
- The parquet writing (like the JSON writing) currently buffers everything in
  memory before issuing one big PUT. We should make this smarter: a single PUT
  for small data and a MultipartUpload for larger data. Tracked in #418.
- Schema enforcement is done at the data layer. This means it is up to the
  engine to run the expression evaluation, and we expect that evaluation to
  fail if the output schema is incorrect (see `test_append_invalid_schema` in
  the `write.rs` integration test). We may want to change this in the future
  to error eagerly, based on the engine providing a schema up front at
  metadata time (transaction creation time). A generic illustration of the
  data-layer check follows below.
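To make the "data layer" point concrete, here is a generic illustration in plain arrow-rs (not kernel code) of a check that only fires when a batch actually flows through the write path, which is why a bad schema surfaces at evaluation/write time rather than at transaction creation.

```rust
// Generic illustration with arrow-rs: the schema check runs only when a
// batch is pushed through the write path, not when the transaction starts.
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn enforce_schema(batch: &RecordBatch, expected: &Schema) -> Result<(), ArrowError> {
    if batch.schema().as_ref() != expected {
        return Err(ArrowError::SchemaError(format!(
            "write data has schema {:?}, expected {:?}",
            batch.schema(),
            expected
        )));
    }
    Ok(())
}

fn main() -> Result<(), ArrowError> {
    let expected = Schema::new(vec![Field::new("id", DataType::Int64, true)]);
    let batch = RecordBatch::try_new(
        Arc::new(expected.clone()),
        vec![Arc::new(Int64Array::from(vec![1i64, 2, 3])) as ArrayRef],
    )?;
    // Passes here; a batch with a mismatched schema would error only at this
    // point, mirroring the current behavior described above.
    enforce_schema(&batch, &expected)
}
```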

based on #370
resolves #390