
append transaction data path #390

Closed
Tracked by #377
zachschuermann opened this issue Oct 11, 2024 · 0 comments · Fixed by #393
zachschuermann commented Oct 11, 2024

  • write_context API
  • expression fixup for physical to logical transform
  • write_metadata API for new add files
@zachschuermann zachschuermann changed the title data path append transaction data path Oct 11, 2024
@zachschuermann zachschuermann self-assigned this Oct 11, 2024
zachschuermann added a commit that referenced this issue Nov 8, 2024
This PR is the second (of two) major pieces for supporting simple blind
appends. It implements:
1. **new `Transaction` APIs** for appending data to delta tables:
   a. `get_write_context()` to get a `WriteContext` to pass to the data
      path, which includes all information needed to write: `target
      directory`, `snapshot schema`, `transformation expression`, and
      (future: columns to collect stats on)
   b. `add_write_metadata(impl EngineData)` to add metadata about a write
      to the transaction, along with a new static method
      `transaction::get_write_metadata_schema` to provide the expected
      schema of this engine data
   c. new machinery in the `commit` method to commit a new `Add` action
      for each row of write_metadata from the API above
2. **new default engine capabilities** for using the default engine to
write parquet data (to append to tables):
  a. parquet handler can now `write_parquet_file(EngineData)`
  b. usage example in `write.rs` tests for now
3. **new append tests** in the `write.rs` integration test suite
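The new APIs above can be sketched end to end. This is an illustrative, self-contained mock of the flow (get a write context, write data, record write metadata, commit one `Add` per row); the kernel types `Transaction`, `WriteContext`, and `EngineData` are stubbed with simplified fields here, so none of these struct definitions are the real delta-kernel-rs signatures.

```rust
// Self-contained sketch of the append flow. All types are simplified
// stand-ins for the kernel's `Transaction` / `WriteContext` / `EngineData`.

#[derive(Debug, Clone)]
struct WriteContext {
    target_dir: String,
    snapshot_schema: Vec<String>, // simplified: just column names
}

// Stand-in for the engine data describing one written file.
#[derive(Debug)]
struct WriteMetadata {
    path: String,
    size: u64,
}

struct Transaction {
    table_root: String,
    schema: Vec<String>,
    write_metadata: Vec<WriteMetadata>,
}

impl Transaction {
    fn new(table_root: &str, schema: Vec<String>) -> Self {
        Transaction {
            table_root: table_root.into(),
            schema,
            write_metadata: Vec::new(),
        }
    }

    // Mirrors `get_write_context()`: everything the data path needs.
    fn get_write_context(&self) -> WriteContext {
        WriteContext {
            target_dir: self.table_root.clone(),
            snapshot_schema: self.schema.clone(),
        }
    }

    // Mirrors `add_write_metadata(impl EngineData)`.
    fn add_write_metadata(&mut self, meta: WriteMetadata) {
        self.write_metadata.push(meta);
    }

    // Mirrors `commit`: emit one `Add` action per row of write metadata.
    fn commit(self) -> Vec<String> {
        self.write_metadata
            .iter()
            .map(|m| format!("Add {{ path: {}, size: {} }}", m.path, m.size))
            .collect()
    }
}

fn main() {
    let mut txn = Transaction::new("s3://bucket/table", vec!["id".into(), "value".into()]);
    let ctx = txn.get_write_context();
    // The engine writes a parquet file into ctx.target_dir, then records it:
    txn.add_write_metadata(WriteMetadata {
        path: format!("{}/part-0.parquet", ctx.target_dir),
        size: 1024,
    });
    let actions = txn.commit();
    assert_eq!(actions.len(), 1);
    println!("{}", actions[0]);
}
```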

Details and some follow-ups:
- the parquet writing (similar to JSON) currently just buffers everything
into memory before issuing one big PUT. We should make this smarter: a
single PUT for small data and a MultipartUpload for larger data.
Tracked in #418
- schema enforcement is done at the data layer. This means it is up to
the engine to call the expression evaluation, and we expect this to fail
if the output schema is incorrect (see `test_append_invalid_schema` in
the `write.rs` integration test). We may want to change this in the
future to eagerly error based on the engine providing a schema up front
at metadata time (transaction creation time)
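The first follow-up above (single PUT vs. MultipartUpload, #418) could be sketched as a simple size-based dispatch. Everything here is hypothetical, not kernel API; the 5 MiB threshold is chosen to match the S3 minimum multipart part size.

```rust
// Hypothetical sketch of the smarter upload strategy suggested for #418:
// issue a single PUT for small buffers, split into a multipart upload
// otherwise. The enum and threshold are illustrative assumptions.

const MULTIPART_THRESHOLD: usize = 5 * 1024 * 1024; // 5 MiB, S3 minimum part size

#[derive(Debug, PartialEq)]
enum UploadStrategy {
    SinglePut,
    Multipart { parts: usize },
}

fn choose_upload(buf_len: usize) -> UploadStrategy {
    if buf_len <= MULTIPART_THRESHOLD {
        UploadStrategy::SinglePut
    } else {
        // ceil-divide the buffer into threshold-sized parts
        UploadStrategy::Multipart { parts: buf_len.div_ceil(MULTIPART_THRESHOLD) }
    }
}

fn main() {
    assert_eq!(choose_upload(1024), UploadStrategy::SinglePut);
    assert_eq!(
        choose_upload(12 * 1024 * 1024),
        UploadStrategy::Multipart { parts: 3 }
    );
}
```

The buffering itself would stay as-is for the small case; only the large case changes behavior, so existing writes are unaffected.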

based on #370
resolves #390