-
Notifications
You must be signed in to change notification settings - Fork 841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add deprecation warnings for everything related to dict_id
#6873
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @brancz -- I took a look at PR and it looks good to me
In my opinion it is better to have these fields explicitly called out as deprecated (and eventually removed) than in the code with the implication they might be used (but are not really)
As @tustvold has pointed out a few times, we could make the kernels use Arc::ptr_eq
to test for the same dictionaries if we want to optimize for that case
I think to get the CI to pass we could just sprinkle
#[allow(deprecated)]
Appropriately around the codebase
@@ -65,6 +65,10 @@ pub struct IpcWriteOptions { | |||
/// schema or generate unique dictionary IDs internally during encoding. | |||
/// | |||
/// Defaults to `false` | |||
#[deprecated( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
#[deprecated( | ||
since = "54.0.0", | ||
note = "The ability to preserve dictionary IDs will be removed. With it, all fields related to it." | ||
)] | ||
dict_id: i64, | ||
dict_is_ordered: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should deprecate dict_is_ordered as well 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that one is a bit less clear to me. I can see how this could be used to optimize the processing of queries, and at least at the moment I can't think of a better place for this, nor does it create problems in the same way as dict_id
does.
63e466f
to
a3140ac
Compare
a3140ac
to
7b4de7c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry a bit that we're deprecating functionality that people depend on, without a viable alternative. Am I right in thinking with this change there will be no non-deprecated way to control dictionary allocation when converting IPC data? I wonder if we need to add (or better document this functionality if it exists) before we deprecate the Field level functionality.
I think it would be very compelling if the current tests could be rewritten to use whatever this new API is, as opposed to just sprinkling allow(deprecated)
FYI @jhorstmann
The problem is that it requires deprecations one way or another. If the deprecations aren't here then they're about the dictionary tracker to allow that functionality. I agree it's a dilemma. Or like Andrew said, there are ways to do this without the user having to do anything, but then the field would be unused. |
How would I write an IPC dictionary with an ID of 32 without using Field::dict_id? As written the deprecation notice basically reads, "we've removed this functionality, if you need it 🤷♂️", I think we need to actually provide a viable alternative. This doesn't need to be using Field::dict_id, and could be some setup using DictionaryTracker, but we do need to document whatever it is. Otherwise people lack a clear migration path. |
I can help with this -- is there some test / example of what assigning dict_ids in the IPC means (aka some example of the use that I can use an example to craft the API?) |
I think the existing tests that are currently relying on |
I think we need to be a bit more specific, because how does this make sense for any of the Regarding still having another ability to set dict IDs: We can introduce a new option on |
Ok so is the thesis that dictionary ID are just an implementation detail of the IPC format, if so then I guess this is fine, and we can effectively use the deprecation as a scream test. I'm a little wary we are removing functionality here, but I'm also very out of the loop w.r.t the dictionary tracker changes. If Andrew is happy feel free to ignore me. |
Been checked out on this one for a bit but IIRC this was the original insight behind all of this. Browsing around through other arrow implementations in other languages they don't seem to expose the dictionary ID at all in the schema and leave it as in internal detail of IPC serde. If people are using the dictionary ID it's probably for something that should properly be done with kv metadata instead. |
Ok I also took the time to analyze every single allow deprecate in tests and tl;dr is there are only 2 tests that actually test the dictionary ID ending up on the wire, but as a side effect, they actually want to test something else. Here is my full analysis:tests in
test in
tests in
tests in
tests in
Originally coming from Arrow's implementation in Go I can also attest that this is the only library around that I've found to be exposing this and not just making it a detail of the wire format. After going through this in detail, I conclude these deprecations are safe to do, and we should follow up with testing dictionary equality (in the now default mode) to actually set dictionary IDs equal for equal dictionaries, however, that doesn't influence the APIs, so in my opinion we can go ahead with this as is. |
Given that this PR deprecates the APIs and doesn't remove them and no one seems to have come up with any reasonable usecase for the current dict_id encoding (despite some clearly non trivial analysis by @brancz 🙏 and @thinkharderdev ) I think this PR cuts a nice balance between:
Let's merge this and see if anyone comes up with a motiviating usecase. Thank you again for sticking with this @brancz |
I'll plan to merge this PR tomorrow unless there are other comments / concerns cc @etseidl in case you are interested in this sort of thing |
Once merged I’ll start drafting changes to add the optimizations discussed as well as the dice_id removal which I’ll continually rebase over the next months and then we can land them for the March release. Thank you very much for the productive collaboration everyone! |
🚀 here we go |
Well I already found a potential issue in DataFusion, see apache/datafusion#13663 (comment) We can wait to see how that conversation is resolved |
Which issue does this PR close?
Related to #5981 and #1206
Rationale for this change
With #6788 not preserving the dictionary ID is now the default, meaning we can finally remove preserving dictionary IDs, and more generally get rid of them in the
Field
struct. This adds the deprecation warnings that this will happen in the future.What changes are included in this PR?
All actual code changes have already landed, this just adds the deprecation warnings so we can do the breaking changes in the future.
Are there any user-facing changes?
None yet, this is preparation for future breaking changes.
@alamb @tustvold