Write support #41
Can the `_generate_datafile_filename` method be a bit more robust? The UUID in the filename is not for uniqueness; it is to identify files from the same write. In the format used by Spark, uniqueness is a combination of the task ID, the write-specific UUID, and the file counter for the task. The partition ordinal is used to preserve locality in file names.
It would be nice to expose some of these options in the WriteTask, like a unique UUID, unique task ID, and part ordinal. Then the counter would be handled locally.
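For illustration, a minimal sketch of such a filename scheme, assuming a `WriteTask` exposes these fields (the names and the exact pattern here are assumptions, not the final API):

```python
import uuid

def generate_datafile_filename(
    partition_ordinal: int,  # preserves locality when file names are sorted
    task_id: int,            # unique per task within one write
    write_uuid: uuid.UUID,   # identifies all files from the same write
    counter: int,            # per-task file counter, maintained locally
    extension: str = "parquet",
) -> str:
    # Uniqueness comes from (task_id, write_uuid, counter); the partition
    # ordinal only affects ordering, not uniqueness.
    return f"{partition_ordinal:05d}-{task_id}-{write_uuid}-{counter:05d}.{extension}"

print(generate_datafile_filename(partition_ordinal=0, task_id=7, write_uuid=uuid.uuid4(), counter=0))
```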
Keep in mind that we write a single file in the first iteration. I've moved the uuid and task-id to the `WriteTask`.
Hi Fokko! I am working with @syun64 to test out the impending write feature. During the test, we realized the field ids are not being set in the written parquet file. To help illustrate this, I put together a diff against your working branch. With the current behavior the field_ids do not end up in the parquet schema; after switching to a different metadata key for the field id in the arrow schema, they are written correctly. It seems to be a peculiarity of `pyarrow.parquet.ParquetWriter`: the field_ids must be defined in the metadata of the `pyarrow.schema` using a particular key, "PARQUET:field_id", instead of "field_id". Do you think we should use a different pyarrow schema when we write the parquet file, i.e. prefix the key with 'PARQUET:'?
Thanks @jqin61 for testing this; it is paramount that the field-IDs are written properly. I was able to reproduce this locally. After changing the metadata key to `PARQUET:field_id` it is indeed fixed. Thanks for flagging this!
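A minimal sketch of the behavior in question (the schema here is made up for illustration): pyarrow propagates a field's `PARQUET:field_id` metadata entry into the Parquet field id, while a plain `field_id` key is silently ignored.

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field("id", pa.int64(), nullable=False, metadata={"PARQUET:field_id": "1"}),
    pa.field("name", pa.string(), metadata={"PARQUET:field_id": "2"}),
])
table = pa.table({"id": [1, 2], "name": ["a", "b"]}, schema=schema)
pq.write_table(table, "/tmp/example.parquet")

# The written Parquet schema now carries the field ids:
print(pq.read_schema("/tmp/example.parquet"))
```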
If I have a plain `df`, should I expect it to have a `pa.Schema` that can be converted with `pyarrow_to_schema`? Even the modified one presented in the diff above? I wasn't able to get either branch to work, as the schema of the `df` above has no `metadata` for its fields.
Hey @robtandy, thanks for chiming in here. I think the PyArrow schema conversion should also take the field-id metadata into account. When you create a new table, it should re-assign the field-ids if they are missing.
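A hedged sketch of what such a re-assignment could look like for a schema without field-id metadata; `with_fresh_field_ids` is an illustrative helper, not the actual pyiceberg API, and it only handles top-level fields (nested types would need recursion):

```python
import pyarrow as pa

def with_fresh_field_ids(schema: pa.Schema) -> pa.Schema:
    """Attach sequential PARQUET:field_id metadata to fields that lack it."""
    fields = []
    for new_id, field in enumerate(schema, start=1):
        metadata = dict(field.metadata or {})
        metadata.setdefault(b"PARQUET:field_id", str(new_id).encode())
        fields.append(field.with_metadata(metadata))
    return pa.schema(fields)

df_schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
print(with_fresh_field_ids(df_schema))
```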
It's confusing that `df` is used for both the input dataframe and the output data file.
I agree, we should not do that 👍
This should also come from the write if possible, so we don't make an S3 request here.
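One way to avoid the extra stat call, sketched here under the assumption that the file is staged in memory before upload: track the size at write time instead of asking the object store for it afterwards.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3]})

buf = pa.BufferOutputStream()
pq.write_table(table, buf)

data = buf.getvalue()   # pa.Buffer holding the complete file contents
file_size = len(data)   # size is known locally, no S3 HEAD request needed
print(file_size)
```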
This isn't the sort order of the table at the time the file was written; it is the sort used to order the records in the data file. If we can't guarantee that the records are in this order, we should not apply this metadata.
This should come from the caller (write task?) and default to None.
Since this is an unpartitioned write, we need to ensure that this is the unpartitioned spec in the table.
We check if the partition spec is empty:
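Roughly the guard being referred to (a sketch; the exact message and location in the PR may differ):

```python
# `table` is the pyiceberg Table being appended to
if len(table.spec().fields) > 0:
    raise ValueError("Cannot write to partitioned tables")
```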
Since `write_file` is a public method, we can't guarantee that the caller did this check. I agree that it is safe when called from `append` or `overwrite`, but a caller could use this method directly to create a data file for a partitioned table, right? Wouldn't it be easy to just pass the spec ID and partition tuple (an empty `Record`) through `WriteTask` for now? I think it would make sense if a `WriteTask` were for a single partition.
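A hedged sketch of the `WriteTask` shape being discussed; the field names below are assumptions based on this thread, not the final API:

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

import pyarrow as pa

from pyiceberg.typedef import Record

@dataclass
class WriteTask:
    write_uuid: uuid.UUID                  # shared by all files of one write
    task_id: int                           # unique per task within the write
    record_batches: List[pa.RecordBatch]   # data for this task
    spec_id: int = 0                       # the unpartitioned spec for now
    partition: Record = field(default_factory=Record)  # empty partition tuple
    sort_order_id: Optional[int] = None
```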
Data files are never written with `equality_ids`. That is only used for equality delete files. It should always be null for data files.
Ah, I messed up here; I thought it was referring to the `identifier_field_ids`.
I think this is required for the manifest list. It is always known when we write the manifest list, because manifest lists are written for every commit attempt (after the sequence number has been updated).
I agree, let me check!
I don't think this is correct because the sequence number should always be passed in.
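To illustrate the point, a sketch of the signature this implies (illustrative only, not the actual pyiceberg API): since the manifest list is written once the sequence number has been assigned, the writer can require it rather than defaulting it.

```python
from typing import Optional

def write_manifest_list(
    snapshot_id: int,
    parent_snapshot_id: Optional[int],
    sequence_number: int,  # required: always known at commit time
) -> None:
    """Sketch: write the manifest list for one commit attempt."""
    ...
```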