Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add deprecation warnings for everything related to dict_id #6873

Merged
merged 2 commits into from
Dec 16, 2024

Conversation

brancz
Copy link
Contributor

@brancz brancz commented Dec 12, 2024

Which issue does this PR close?

Related to #5981 and #1206

Rationale for this change

With #6788 not preserving the dictionary ID is now the default, meaning we can finally remove preserving dictionary IDs, and more generally get rid of them in the Field struct. This adds the deprecation warnings that this will happen in the future.

What changes are included in this PR?

All actual code changes have already landed, this just adds the deprecation warnings so we can do the breaking changes in the future.

Are there any user-facing changes?

None yet, this is preparation for future breaking changes.

@alamb @tustvold

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brancz -- I took a look at PR and it looks good to me

In my opinion it is better to have these fields explicitly called out as deprecated (and eventually removed) than in the code with the implication they might be used (but are not really)

As @tustvold has pointed out a few times, we could make the kernels use Arc::ptr_eq to test for the same dictionaries if we want to optimize for that case

I think to get the CI to pass we could just sprinkle

        #[allow(deprecated)]

Appropriately around the codebase

cc @thinkharderdev

@@ -65,6 +65,10 @@ pub struct IpcWriteOptions {
/// schema or generate unique dictionary IDs internally during encoding.
///
/// Defaults to `false`
#[deprecated(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

#[deprecated(
since = "54.0.0",
note = "The ability to preserve dictionary IDs will be removed. With it, all fields related to it."
)]
dict_id: i64,
dict_is_ordered: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should deprecate dict_is_ordered as well 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that one is a bit less clear to me. I can see how this could be used to optimize the processing of queries, and at least at the moment I can't think of a better place for this, nor does it create problems in the same way as dict_id does.

@github-actions github-actions bot added parquet Changes to the parquet crate arrow-flight Changes to the arrow-flight crate labels Dec 13, 2024
@brancz brancz force-pushed the dict-id-deprecations branch from 63e466f to a3140ac Compare December 13, 2024 09:04
@brancz brancz force-pushed the dict-id-deprecations branch from a3140ac to 7b4de7c Compare December 13, 2024 09:07
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry a bit that we're deprecating functionality that people depend on, without a viable alternative. Am I right in thinking with this change there will be no non-deprecated way to control dictionary allocation when converting IPC data? I wonder if we need to add (or better document this functionality if it exists) before we deprecate the Field level functionality.

I think it would be very compelling if the current tests could be rewritten to use whatever this new API is, as opposed to just sprinkling allow(deprecated)

FYI @jhorstmann

@brancz
Copy link
Contributor Author

brancz commented Dec 13, 2024

The problem is that it requires deprecations one way or another. If the deprecations aren't here then they're about the dictionary tracker to allow that functionality. I agree it's a dilemma. Or like Andrew said, there are ways to do this without the user having to do anything, but then the field would be unused.

@tustvold
Copy link
Contributor

tustvold commented Dec 13, 2024

How would I write an IPC dictionary with an ID of 32 without using Field::dict_id? As written the deprecation notice basically reads, "we've removed this functionality, if you need it 🤷‍♂️", I think we need to actually provide a viable alternative. This doesn't need to be using Field::dict_id, and could be some setup using DictionaryTracker, but we do need to document whatever it is. Otherwise people lack a clear migration path.

@alamb
Copy link
Contributor

alamb commented Dec 14, 2024

How would I write an IPC dictionary with an ID of 32 without using Field::dict_id? As written the deprecation notice basically reads, "we've removed this functionality, if you need it 🤷‍♂️", I think we need to actually provide a viable alternative. This doesn't need to be using Field::dict_id, and could be some setup using DictionaryTracker, but we do need to document whatever it is. Otherwise people lack a clear migration path.

I can help with this -- is there some test / example of what assigning dict_ids in the IPC means (aka some example of the use that I can use an example to craft the API?)

@tustvold
Copy link
Contributor

I think the existing tests that are currently relying on allow(deprecated) would be a good starting point

@brancz
Copy link
Contributor Author

brancz commented Dec 14, 2024

I think the existing tests that are currently relying on allow(deprecated) would be a good starting point

I think we need to be a bit more specific, because how does this make sense for any of the new_dict calls? Practically none of the instances where new_dict is called is the dictionary ID actually relevant but it just happens to be part of the function signature. If I'm understanding you right though, you mean all the tests where the dictionary IDs value actually matters, right?

Regarding still having another ability to set dict IDs: We can introduce a new option on IpcWriteOptions where the DictionaryTracker can be pre-set. And then we allow a dictionary tracker to be instantiated with a pre-populated list of dict IDs. Then a user could pre-populate the IDs, so we would have a second way of setting dict IDs that will still exist when the field doesn't exist anymore. I'll put it together and we can discuss more (really want to make this work because dict_id being on the Field is causing a ton of problems for us regarding equality for merging schemas).

@tustvold
Copy link
Contributor

tustvold commented Dec 14, 2024

Ok so is the thesis that dictionary ID are just an implementation detail of the IPC format, if so then I guess this is fine, and we can effectively use the deprecation as a scream test. I'm a little wary we are removing functionality here, but I'm also very out of the loop w.r.t the dictionary tracker changes. If Andrew is happy feel free to ignore me.

@thinkharderdev
Copy link
Contributor

Ok so is the thesis that dictionary ID are just an implementation detail of the IPC format

Been checked out on this one for a bit but IIRC this was the original insight behind all of this. Browsing around through other arrow implementations in other languages they don't seem to expose the dictionary ID at all in the schema and leave it as in internal detail of IPC serde. If people are using the dictionary ID it's probably for something that should properly be done with kv metadata instead.

@brancz
Copy link
Contributor Author

brancz commented Dec 14, 2024

Ok I also took the time to analyze every single allow deprecate in tests and tl;dr is there are only 2 tests that actually test the dictionary ID ending up on the wire, but as a side effect, they actually want to test something else.

Here is my full analysis:

tests in arrow-flight/src/encode.rs:

  • test_schema_metadata_encoded: doesn't care about new_with_preserve_dict_id, can be changed to new
  • flight_data_from_arrow_batch uses default already

test in arrow-integration-test/src/schema.rs: just marshals whatever the field is, it doesn't care about the dict_id (simply happens to use new_dict)
test in arrow-integration-testing/src/flight_server_scenarios/integration_test.rs: uses new_with_preserve_dict_id, can be changed to regular new
test in arrow-ipc/src/convert.rs: just tests roundtripping, doesn't care about dict_id (simply happens to use new_dict)
tests in arrow-ipc/src/reader.rs:

  • test_roundtrip_nested_dict_no_preserve_dict_id: uses new_with_preserve_dict_id, can be changed to regular new
  • test_roundtrip_stream_nested_dict_of_map_of_dict: doesn't care about dict_id (simply happens to use new_dict)
  • test_roundtrip_stream_dict_of_list_of_dict: doesn't care about dict_id (simply happens to use new_dict)
  • test_roundtrip_stream_dict_of_fixed_size_list_of_dict: doesn't care about dict_id (simply happens to use new_dict)
  • test_roundtrip_view_types_nested_dict: doesn't care about dict_id (simply happens to use new_dict)
  • test_unaligned: uses new_with_preserve_dict_id, can be changed to regular new
  • test_unaligned_throws_error_with_require_alignment: uses new_with_preserve_dict_id, can be changed to regular new
  • test_same_dict_id_without_preserve: uses new_with_preserve_dict_id, can be changed to regular new, but also uses new_dict
  • test_read_ree_dict_record_batches_from_buffer: doesn't care about dict_id and IpcWriteOptions can be changed to default
  • track_union_nested_dict, track_struct_nested_dict: actually tests dict_id to be set on the wire (but not sure why it doesn't need to for the purpose of the test)
  • test_dictionary_ordered: uses new_dict but doesn't care about its value

tests in arrow-schema/src/field.rs:

  • need to look into uses of test_new_dict_with_string and schema_field_with_dict_id, person_schema
  • test_fields_with_dict_id, test_field_comparison_case, schema_field_accessors: specifically tests the existance of the functions/fields we want to remove (not the functionality)

tests in parquet/src/arrow/arrow_writer/mod.rs:

  • arrow_writer_string_dictionary, arrow_writer_primitive_dictionary, arrow_writer_string_dictionary_unsigned_index: just tests roundtripping, doesn't care about dict_id (simply happens to use new_dict)

tests in parquet/src/arrow/schema/mod.rs:

  • test_arrow_schema_roundtrip: doesn't care about new_dict, simply calls it.

Originally coming from Arrow's implementation in Go I can also attest that this is the only library around that I've found to be exposing this and not just making it a detail of the wire format.

After going through this in detail, I conclude these deprecations are safe to do, and we should follow up with testing dictionary equality (in the now default mode) to actually set dictionary IDs equal for equal dictionaries, however, that doesn't influence the APIs, so in my opinion we can go ahead with this as is.

@alamb
Copy link
Contributor

alamb commented Dec 15, 2024

Ok so is the thesis that dictionary ID are just an implementation detail of the IPC format, if so then I guess this is fine, and we can effectively use the deprecation as a scream test. I'm a little wary we are removing functionality here, but I'm also very out of the loop w.r.t the dictionary tracker changes. If Andrew is happy feel free to ignore me.

Given that this PR deprecates the APIs and doesn't remove them and no one seems to have come up with any reasonable usecase for the current dict_id encoding (despite some clearly non trivial analysis by @brancz 🙏 and @thinkharderdev ) I think this PR cuts a nice balance between:

  1. a real path for removing dict_ids
  2. Not causing immense downstream pain by removing an API that is needed

Let's merge this and see if anyone comes up with a motiviating usecase.

Thank you again for sticking with this @brancz

@alamb
Copy link
Contributor

alamb commented Dec 15, 2024

I'll plan to merge this PR tomorrow unless there are other comments / concerns

cc @etseidl in case you are interested in this sort of thing

@brancz
Copy link
Contributor Author

brancz commented Dec 15, 2024

Once merged I’ll start drafting changes to add the optimizations discussed as well as the dice_id removal which I’ll continually rebase over the next months and then we can land them for the March release.

Thank you very much for the productive collaboration everyone!

@alamb alamb merged commit fc6936a into apache:main Dec 16, 2024
28 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 16, 2024

🚀 here we go

etseidl pushed a commit to etseidl/arrow-rs that referenced this pull request Dec 16, 2024
…6873)

* Add deprecation warnings for everything related to `dict_id`

* Add allow deprecations where necessary
@alamb
Copy link
Contributor

alamb commented Dec 16, 2024

Well I already found a potential issue in DataFusion, see apache/datafusion#13663 (comment)

We can wait to see how that conversation is resolved

@brancz brancz deleted the dict-id-deprecations branch December 17, 2024 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants