Immutability for data collators #30603

vasqu · 2024-05-01T21:14:50Z

What does this PR do?

Introduces new tests that check whether a data collator introduces side effects, i.e. whether the given input changes after the call to the collator. Motivated by #30556.
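
As a rough illustration of the idea (not the actual tests in this PR), a side-effect check can snapshot the input, call the collator, and compare. The two collators below are hypothetical stand-ins written for this sketch; neither is part of transformers.

```python
import copy


def mutating_collator(features):
    # Hypothetical collator that pads labels IN PLACE; this is the kind of
    # side effect the immutability tests are meant to catch.
    max_len = max(len(f["labels"]) for f in features)
    for f in features:
        f["labels"] += [-100] * (max_len - len(f["labels"]))
    return {"labels": [f["labels"] for f in features]}


def pure_collator(features):
    # Hypothetical collator that builds new padded lists and leaves the input alone.
    max_len = max(len(f["labels"]) for f in features)
    return {
        "labels": [f["labels"] + [-100] * (max_len - len(f["labels"])) for f in features]
    }


def leaves_input_unchanged(collator, features):
    # Snapshot the input, run the collator, then compare against the snapshot.
    snapshot = copy.deepcopy(features)
    collator(features)
    return features == snapshot


features = [
    {"input_ids": [0, 1, 2], "labels": [0, 1]},
    {"input_ids": [0, 1, 2, 3], "labels": [0, 1, 2]},
]
```

Calling `leaves_input_unchanged(pure_collator, copy.deepcopy(features))` should report no change, while the mutating variant should be flagged.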

Furthermore, fixes the seq2seq collator to not introduce side effects on the given input's labels. This is done by:

Passing only relevant features to the tokenizer to pad.
Manually crafting the labels afterwards.
Reintroducing tokenizer behaviour by converting labels to the respective datatype (pt, tf, np).
As a side note, added some more checks on None labels, especially when given "labels": None in the dictionary.
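
The steps above can be sketched roughly as follows, using plain Python lists. This is a simplified illustration under stated assumptions, not the code in data_collator.py: `collate_labels` is a hypothetical helper, the non-label features are simply returned instead of being passed to the tokenizer's pad call, and the final conversion to pt/tf/np tensors is left out.

```python
def collate_labels(features, label_name="labels", label_pad_token_id=-100, padding_side="right"):
    # Hypothetical helper for illustration only.
    labels = [f[label_name] for f in features] if label_name in features[0] else None
    # This might occur when we pass {..., "labels": None}.
    if labels is not None and all(label is None for label in labels):
        labels = None

    # Step 1: only the non-label features would be handed to the tokenizer for padding.
    non_labels_features = [
        {k: v for k, v in f.items() if k != label_name} for f in features
    ]
    if labels is None:
        return non_labels_features, None

    # Step 2: craft the padded labels manually, building NEW lists so the
    # caller's label lists are never modified.
    max_len = max(len(label) for label in labels)
    padded = []
    for label in labels:
        pad = [label_pad_token_id] * (max_len - len(label))
        padded.append(list(label) + pad if padding_side == "right" else pad + list(label))

    # Step 3 (omitted here): convert `padded` to the requested tensor type.
    return non_labels_features, padded
```

Because the padded lists are fresh copies, the original `features` keep their unpadded labels after the call.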

Last remarks:

The test handles most of the cases that are introduced in the base tests but if I missed something give me a heads up :D
I'm not sure how to handle the case where the user passes None labels. For now, I return them as None again.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts @Rocketknight1


vasqu · 2024-05-01T21:17:32Z

Oh yea, one last thing. I've created separate classes for the immutability tests. Thought it got too convoluted otherwise.

src/transformers/data/data_collator.py

HuggingFaceDocBuilderDev · 2024-05-02T12:20:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1

This looks good to me - I'm happy with the change to the data collator code itself! The one issue I'd raise is that it seems like there's a lot of code duplication in the tests, with just some changes like torch.tensor -> np.array or different values for return_tensors. This is probably fine, though - that kind of explicit duplication makes it easier to locate test errors.

cc @amyeroberts for core maintainer review (and also let me know if you think I'm wrong about the code duplication)

src/transformers/data/data_collator.py

vasqu · 2024-05-03T12:54:20Z

@Rocketknight1 By code duplication, do you mean across the separate pt/tf/np classes? I agree with that. It's more of a dependency check: it verifies that the collator doesn't unexpectedly branch into something unwanted and that the different input datatypes are handled correctly. For example, when the default collator is used, it branches into separate np, tf, and pt calls (i.e. numpy_default_data_collator, tf_default_data_collator, torch_default_data_collator). I haven't looked into where else something like this might happen.
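
To make the branching concrete, here is a small sketch of a dispatch on return_tensors. The `_torch_collate`, `_tf_collate`, and `_numpy_collate` functions are hypothetical stand-ins for the three framework-specific collators named above, and `dispatch_collate` is an illustrative name, not the library's implementation.

```python
def _torch_collate(features):
    # Stand-in for the pt branch; only tags which branch ran.
    return ("pt", features)


def _tf_collate(features):
    # Stand-in for the tf branch.
    return ("tf", features)


def _numpy_collate(features):
    # Stand-in for the np branch.
    return ("np", features)


_BACKENDS = {"pt": _torch_collate, "tf": _tf_collate, "np": _numpy_collate}


def dispatch_collate(features, return_tensors="pt"):
    # Pick the framework-specific branch; each branch is a separate code path,
    # which is why each one gets its own immutability test.
    try:
        collate = _BACKENDS[return_tensors]
    except KeyError:
        raise ValueError(f"Unsupported return_tensors: {return_tensors!r}") from None
    return collate(features)
```

Since every value of return_tensors reaches different code, a side effect in one branch would not be caught by a test that only exercises another.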

Rocketknight1 · 2024-05-07T14:13:31Z

Yes, that's what I was referring to - and I think it's fine to keep it as-is!

amyeroberts

Thanks for tackling this and adding tests ❤️

Just a few formatting style comments (to be propagated to TF and Flax testers too). Otherwise looks great

amyeroberts · 2024-05-07T15:14:02Z

src/transformers/data/data_collator.py

+        # this might occur when we pass {..., "labels": None}
+        if labels is not None and all(label is None for label in labels):
+            labels = None
+        no_labels_features = [{k: v for k, v in feature.items() if k != label_name} for feature in features]


nit - I'd call this non_labels_features instead of no_labels_features. I'd parse the latter as features which have no corresponding label, which isn't necessarily the case

Suggested change

-        no_labels_features = [{k: v for k, v in feature.items() if k != label_name} for feature in features]
+        non_labels_features = [{k: v for k, v in feature.items() if k != label_name} for feature in features]

src/transformers/data/data_collator.py

amyeroberts · 2024-05-08T09:55:46Z