-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-5537: [JS] Support delta dictionaries in RecordBatchWriter and DictionaryBuilder #4502
Conversation
73d8c01
to
4574bb0
Compare
4574bb0
to
6a70a25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks! Just a couple of minor comments.
I'm a little confused about what's going on with the hard coded nullable
in Field
. It looks like maybe it's something that was changed for testing and then left in?
} else if (val && val.type && (values[++valueIndex] = val)) { | ||
val instanceof Data && (values[valueIndex] = val = Vector.new(val) as Vector); | ||
fields[++fieldIndex] = Field.new(field, val.type, val.nullCount > 0) as Field<T[keyof T]>; | ||
fields[++fieldIndex] = Field.new(field, val.type, true) as Field<T[keyof T]>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are all these checks hard-coded to true now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheNeuralBit Ah yes, I forgot I had to change this. It's not for testing, but for streaming. The problem goes like this:
- We start pulling from a source stream and appending values to the builder
- The builder reaches the high-water-mark, and yields a RecordBatch to the Writer
- The Writer sees it hasn't written a Schema yet, so it uses the Schema of the first batch
If the first RecordBatch doesn't contain nulls, the Schema Message's Field will have nullable=false
. If the source later yields a null, changing the Field's nullable=true
won't be reflected in the stream since the Schema has already been sent. It's hacky, but I changed the default value of nullable=true
to accommodate this unless explicitly set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically nullable
doesn't indicate the existence of nulls, just that it's possible there will be nulls. I'm not sure if other implementations infer nullable=true
to also mean there's a non-zero-length null bitmap buffer. If so, we can update the writer to synthesize one for compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. The spec says you're allowed to elide validity bitmaps when the null count is 0, so I think we're ok there:
Arrays having a 0 null count may choose to not allocate the null bitmap. Implementations may choose to always allocate one anyway as a matter of convenience, but this should be noted when memory is being shared.
(from http://arrow.apache.org/docs/format/Layout.html#null-bitmaps)
Doesn't need to happen here, but could we set nullable based on the builder options? e.g. it there are no null_values
. I guess we would probably need another builder option (like nullable = false
) to do that right, since you could still append null
or undefined
even without null_values
.
Adds support for building and writing delta dictionaries. Moves the
dictionary
Vector pointer to the Data class, similar to #4316.Forked from #4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: b12d842