default JSON/parquet handler write implementations should stream to object store #418

Open
zachschuermann opened this issue Oct 23, 2024 · 0 comments

Comments

@zachschuermann
Collaborator

zachschuermann commented Oct 23, 2024

`write_json_file`/`write_parquet` in the default JSON/parquet handlers should stream data to the object store instead of buffering the whole file in memory before issuing a single PUT.
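For reference, the desired behavior looks roughly like the sketch below. This is a minimal illustration assuming the `object_store` crate's `put`, `put_multipart`, and `WriteMultipart` APIs; the threshold, function name, and chunk iterator are made up for illustration and are not the kernel's actual implementation.

```rust
// Minimal sketch only: stream chunks to the object store, choosing a single
// PUT for small writes and a multipart upload for larger ones. The constant,
// function name, and chunk iterator are illustrative assumptions.
use bytes::Bytes;
use object_store::{path::Path, ObjectStore, PutPayload, WriteMultipart};

/// Size below which a single PUT is cheaper than a multipart upload (illustrative).
const MULTIPART_THRESHOLD: usize = 8 * 1024 * 1024;

async fn write_streaming(
    store: &dyn ObjectStore,
    location: &Path,
    chunks: impl IntoIterator<Item = Bytes>,
) -> object_store::Result<()> {
    let mut chunks = chunks.into_iter().peekable();
    let first = chunks.next().unwrap_or_else(Bytes::new);

    // Small, single-chunk writes: one PUT is fine.
    if first.len() < MULTIPART_THRESHOLD && chunks.peek().is_none() {
        store.put(location, PutPayload::from(first)).await?;
        return Ok(());
    }

    // Larger writes: stream parts through a multipart upload so the whole
    // file is never buffered in memory at once.
    let mut writer = WriteMultipart::new(store.put_multipart(location).await?);
    writer.write(&first);
    for chunk in chunks {
        writer.write(&chunk);
    }
    writer.finish().await?;
    Ok(())
}
```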

@zachschuermann changed the title from "default JSON handler write implementation should stream to object store" to "default JSON/parquet handler write implementations should stream to object store" on Oct 31, 2024
zachschuermann added a commit that referenced this issue Nov 8, 2024
This PR is the second (of two) major pieces for supporting simple blind
appends. It implements:
1. **new `Transaction` APIs** for appending data to delta tables:
   a. `get_write_context()` to get a `WriteContext` to pass to the data path,
      which includes all the information needed to write: `target directory`,
      `snapshot schema`, `transformation expression`, and (future: columns to
      collect stats on)
   b. `add_write_metadata(impl EngineData)` to add metadata about a write to
      the transaction, along with a new static method
      `transaction::get_write_metadata_schema` to provide the expected schema
      of this engine data
   c. new machinery in the `commit` method to commit a new `Add` action for
      each row of write metadata from the API above
2. **new default engine capabilities** for using the default engine to write
   parquet data (to append to tables):
   a. the parquet handler can now `write_parquet_file(EngineData)`
   b. a usage example lives in the `write.rs` tests for now (a rough sketch of
      the flow follows this list)
3. **new append tests** in the `write.rs` integration test suite
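For orientation, the end-to-end engine-side flow is sketched below. Only the names called out above (`get_write_context`, `WriteContext`, `add_write_metadata`, `transaction::get_write_metadata_schema`, `write_parquet_file`, `commit`, `get_parquet_handler`) come from this PR or the existing `Engine` trait; how the transaction is created, the exact signatures, and the helper calls are assumptions for illustration, not the real API.

```rust
// Hypothetical append flow; signatures and helpers not named in this PR
// (start_transaction, transform, target_dir, the write_parquet_file argument
// shape, etc.) are assumptions for illustration only.
use delta_kernel::{DeltaResult, Engine, EngineData, Table};

fn append_sketch(
    table: &Table,
    engine: &dyn Engine,
    data: Box<dyn EngineData>,
) -> DeltaResult<()> {
    // 1a. Open a transaction and get the WriteContext: target directory,
    //     snapshot schema, and the transformation expression to apply.
    let mut txn = table.start_transaction(engine)?;            // assumed constructor
    let write_context = txn.get_write_context();

    // Apply the transformation expression. Schema enforcement happens here:
    // evaluation is expected to fail if the output schema is wrong.
    let physical = write_context.transform(engine, data)?;     // assumed helper

    // 2a. Write the parquet file via the default engine's parquet handler,
    //     getting back metadata about the written file (path, size, ...).
    let write_metadata = engine
        .get_parquet_handler()
        .write_parquet_file(write_context.target_dir(), physical)?; // shape assumed

    // 1b. Hand the write metadata to the transaction; its schema must match
    //     `transaction::get_write_metadata_schema()`.
    txn.add_write_metadata(write_metadata);

    // 1c. Commit: one `Add` action is written per row of write metadata.
    txn.commit(engine)?;
    Ok(())
}
```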

Details and some follow-ups:
- The parquet writing (like the JSON writing) currently buffers everything in
  memory before issuing one big PUT. We should make this smarter: a single PUT
  for small data and a MultipartUpload for larger data. Tracked in #418.
- Schema enforcement is done at the data layer. This means it is up to the
  engine to run the expression evaluation, and we expect that evaluation to
  fail if the output schema is incorrect (see `test_append_invalid_schema` in
  the `write.rs` integration test). We may want to change this in the future
  to error eagerly, based on the engine providing a schema up front at
  metadata time (transaction creation time). A generic illustration of the
  data-layer check follows below.
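To make the "data layer" point concrete, here is a generic illustration in plain arrow-rs (not kernel code) of a check that only fires when a batch actually flows through the write path, which is why a bad schema surfaces at evaluation/write time rather than at transaction creation.

```rust
// Generic illustration with arrow-rs: the schema check runs only when a
// batch is pushed through the write path, not when the transaction starts.
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn enforce_schema(batch: &RecordBatch, expected: &Schema) -> Result<(), ArrowError> {
    if batch.schema().as_ref() != expected {
        return Err(ArrowError::SchemaError(format!(
            "write data has schema {:?}, expected {:?}",
            batch.schema(),
            expected
        )));
    }
    Ok(())
}

fn main() -> Result<(), ArrowError> {
    let expected = Schema::new(vec![Field::new("id", DataType::Int64, true)]);
    let batch = RecordBatch::try_new(
        Arc::new(expected.clone()),
        vec![Arc::new(Int64Array::from(vec![1i64, 2, 3])) as ArrayRef],
    )?;
    // Passes here; a batch with a mismatched schema would error only at this
    // point, mirroring the current behavior described above.
    enforce_schema(&batch, &expected)
}
```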

based on #370
resolves #390