Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5537: [JS] Support delta dictionaries in RecordBatchWriter and DictionaryBuilder #4502

Closed
wants to merge 1 commit into from

Conversation

trxcllnt
Copy link
Contributor

@trxcllnt trxcllnt commented Jun 9, 2019

Adds support for building and writing delta dictionaries. Moves the dictionary Vector pointer to the Data class, similar to #4316.

Forked from #4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: b12d842

@trxcllnt trxcllnt force-pushed the js/delta-dictionaries branch from 4574bb0 to 6a70a25 Compare June 19, 2019 21:38
@trxcllnt trxcllnt marked this pull request as ready for review June 19, 2019 21:38
@trxcllnt trxcllnt changed the title ARROW-5537: [JS] [WIP] Support delta dictionaries in RecordBatchWriter and DictionaryBuilder ARROW-5537: [JS] Support delta dictionaries in RecordBatchWriter and DictionaryBuilder Jun 19, 2019
Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks! Just a couple of minor comments.

I'm a little confused about what's going on with the hard coded nullable in Field. It looks like maybe it's something that was changed for testing and then left in?

} else if (val && val.type && (values[++valueIndex] = val)) {
val instanceof Data && (values[valueIndex] = val = Vector.new(val) as Vector);
fields[++fieldIndex] = Field.new(field, val.type, val.nullCount > 0) as Field<T[keyof T]>;
fields[++fieldIndex] = Field.new(field, val.type, true) as Field<T[keyof T]>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all these checks hard-coded to true now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheNeuralBit Ah yes, I forgot I had to change this. It's not for testing, but for streaming. The problem goes like this:

  • We start pulling from a source stream and appending values to the builder
  • The builder reaches the high-water-mark, and yields a RecordBatch to the Writer
  • The Writer sees it hasn't written a Schema yet, so it uses the Schema of the first batch

If the first RecordBatch doesn't contain nulls, the Schema Message's Field will have nullable=false. If the source later yields a null, changing the Field's nullable=true won't be reflected in the stream since the Schema has already been sent. It's hacky, but I changed the default value of nullable=true to accommodate this unless explicitly set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically nullable doesn't indicate the existence of nulls, just that it's possible there will be nulls. I'm not sure if other implementations infer nullable=true to also mean there's a non-zero-length null bitmap buffer. If so, we can update the writer to synthesize one for compatibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. The spec says you're allowed to elide validity bitmaps when the null count is 0, so I think we're ok there:

Arrays having a 0 null count may choose to not allocate the null bitmap. Implementations may choose to always allocate one anyway as a matter of convenience, but this should be noted when memory is being shared.

(from http://arrow.apache.org/docs/format/Layout.html#null-bitmaps)

Doesn't need to happen here, but could we set nullable based on the builder options? e.g. it there are no null_values. I guess we would probably need another builder option (like nullable = false) to do that right, since you could still append null or undefined even without null_values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants