Mitigate a conflict when using sentencepiece #33327

tengomucho · 2024-09-05T14:58:42Z

What does this PR do?

This fixes a conflict between fast tokenizer usage and the sentencepiece module.
The conflict arises because the protobuf C implementation uses a global pool for all added descriptors, so if two different files add descriptors, they end up conflicting.

I added a test showing the problem and a guard when loading the proto file to mitigate it. Note that the problem is not completely removed: if for some obscure reason an invalid sentencepiece proto file was loaded earlier, that one will keep being used. Also, if sentencepiece is loaded after the tokenizers load the proto file, the error will occur again (but at least there will be a way to avoid it).

Before submitting

Did you read the contributor guideline, Pull Request section?
Original discussion in internal slack: https://huggingface.slack.com/archives/C014N4749J9/p1725371896664959
Did you write any new necessary tests?

Who can review?

@ydshieh

ydshieh · 2024-09-05T15:17:11Z

src/transformers/utils/sentencepiece_model_pb2_new.py

+pool = _descriptor_pool.Default()
+# Before adding the serialized file, try to find it in the pool (that is global).
+try:
+    DESCRIPTOR = pool.FindFileByName("sentencepiece_model.proto")


My concern here is: when we get this, does it (always) work with transformers? (We might get something strange for this object from other libraries.)

No, we have no real guarantee that it will always work. That is why I said it mitigates the issue; it doesn't remove it altogether. Having said that, if some other code loaded before transformers uses protobuf and adds a file whose name matches sentencepiece_model.proto, I guess it's not going to be anything other than sentencepiece! Alternatively, I could do it the other way around: try to add the file, and if that raises a TypeError, fall back to FindFileByName.
In both cases I could also add a warning if FindFileByName is used.

No, I don't mean whether it's sentencepiece or not. What worries me a bit is that it may not be the one we have, and we don't know what would happen ...

That could probably be controlled by an environment variable, however.

The problem is that if sentencepiece_model.proto was already added in the current process, there is no way to add it again. We cannot be sure it's the right one, as it does not provide a version or a hash. And if we try to load it, the C extension raises an exception. That is why I thought that assuming it's the same proto is probably a fair fallback solution.

if you have a better option, I'd be happy to consider it though!
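The "try to add, fall back to a lookup" variant mentioned in this thread could look roughly like the sketch below. It is only an illustration: `load_file_descriptor` and `StubPool` are hypothetical names, and the stub merely mimics the behaviour described here (adding a different file under an already-registered name raises `TypeError`, looking up a missing name raises `KeyError`) so that the snippet runs without protobuf installed. With the real library, the pool would be `google.protobuf.descriptor_pool.Default()`.

```python
import warnings


class StubPool:
    """Hypothetical stand-in for protobuf's process-wide descriptor pool."""

    def __init__(self):
        self._files = {}

    def AddSerializedFile(self, serialized):
        # Toy encoding "name|payload"; real protobuf parses the file name
        # out of the serialized FileDescriptorProto instead.
        name = serialized.split(b"|", 1)[0].decode()
        existing = self._files.get(name)
        if existing is not None and existing != serialized:
            raise TypeError(f"A file named {name!r} is already in the pool")
        self._files[name] = serialized
        return serialized

    def FindFileByName(self, name):
        return self._files[name]  # raises KeyError if the name is unknown


def load_file_descriptor(pool, file_name, serialized):
    """Try to register the file; on a name conflict, reuse the existing one."""
    try:
        return pool.AddSerializedFile(serialized)
    except TypeError:
        # We cannot verify that the registered file matches ours (no version
        # or hash is available), so at least make the fallback visible.
        warnings.warn(f"{file_name} already registered; reusing existing descriptor")
        return pool.FindFileByName(file_name)


pool = StubPool()
name = "sentencepiece_model.proto"
first = load_file_descriptor(pool, name, b"sentencepiece_model.proto|copy-A")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    second = load_file_descriptor(pool, name, b"sentencepiece_model.proto|copy-B")
```

In this sketch the second caller silently gets whatever the first caller registered, which is exactly the uncertainty raised in the review: the fallback assumes both copies describe the same proto.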

HuggingFaceDocBuilderDev · 2024-09-05T15:19:24Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ydshieh · 2024-09-05T15:21:29Z

tests/tokenization/test_tokenization_utils.py

+
+    @require_sentencepiece
+    def test_sentencepiece_cohabitation(self):
+        from sentencepiece import sentencepiece_model_pb2 as _original_protobuf  # noqa: F401


When we do this, we import sentencepiece.sentencepiece_model_pb2, and it persists during the whole (remaining) lifetime of the Python test process.

I would probably avoid this, and run this test in a separate test

are you suggesting to move this test to its own file?

sorry, typo, I mean in a separate python subprocess.

But we can wait for a core maintainer's review.

OK! Who is the core maintainer? Can we tag him/her?

I already pinged Arthur
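For reference, running an import-sensitive check in a separate Python subprocess, as suggested in this thread, can be sketched with the standard library alone. `run_isolated` is a hypothetical helper, and the `_isolation_marker` entry is just a placeholder for the real side effect (importing `sentencepiece.sentencepiece_model_pb2` and then loading the transformers proto module).

```python
import subprocess
import sys
import textwrap


def run_isolated(snippet, timeout=60):
    """Run `snippet` in a fresh interpreter and return the CompletedProcess."""
    return subprocess.run(
        [sys.executable, "-c", textwrap.dedent(snippet)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


result = run_isolated(
    """
    import sys
    # Placeholder for an import with process-wide side effects.
    sys.modules["_isolation_marker"] = sys
    print("_isolation_marker" in sys.modules)
    """
)

# The child saw its own side effect; the parent process is untouched.
child_saw_marker = result.stdout.strip() == "True"
parent_is_clean = "_isolation_marker" not in sys.modules
```

A test built this way would assert on `result.returncode` (and possibly `result.stderr`), so whatever the child imports cannot persist in the main test process.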

ArthurZucker · 2024-09-06T12:08:47Z

Hey! Could you provide the reproducer as well?
I really don't mind merging as is, but I just want to make sure: you 1. had a project that imported protobuf, and 2. used from transformers.convert_slow_tokenizer import import_protobuf, and that failed, right?
What is a bit weird is that we have this guard:

def import_protobuf(error_message=""):
    if is_protobuf_available():
        import google.protobuf

        if version.parse(google.protobuf.__version__) < version.parse("4.0.0"):
            from transformers.utils import sentencepiece_model_pb2
        else:
            from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
        return sentencepiece_model_pb2
    else:
        raise ImportError(PROTOBUF_IMPORT_ERROR.format(error_message))

Is there no way to check whether it was already imported? (Meaning: try the import, and if the error is an import error, we expect ...?)
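One way to ask "was it already imported?" from Python is to look at `sys.modules`. This is a minimal sketch with a hypothetical helper name, and it comes with a caveat: it only reports whether that Python module was imported in this process, not whether a file with the same name was registered in protobuf's descriptor pool by some other route.

```python
import sys


def already_imported(module_name):
    """Return True if `module_name` has been imported in this process."""
    return module_name in sys.modules


# Module path taken from the test in this PR.
SPM_PB2 = "sentencepiece.sentencepiece_model_pb2"
spm_pb2_loaded = already_imported(SPM_PB2)
```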

tengomucho · 2024-09-06T13:40:19Z

Hello! I first saw this problem when I tried to use Jetstream/Pytorch, which uses sentencepiece and its sentencepiece_pb.py. Transformers has a copy of the same file, with a few changes that make some variables global. So, while in the Python world they are technically two different modules, the C protobuf extension shares the same space, and they conflict when calling AddSerializedFile. That's why I proposed this option: I didn't find any other way to avoid this conflict.

I was just thinking that an alternative could be to check whether the sentencepiece module is available and, if so, import sentencepiece_pb.py from that package, which avoids the conflict. We could then delete our "sentencepiece_pb_new.py" file. But that means it would only work when the sentencepiece module is available.

Let me know if you prefer this option and I can modify this PR accordingly.


tengomucho · 2024-09-10T15:45:20Z

I force-pushed a cleaner solution: I check whether sentencepiece is available and, if it is, we use its protobuf module instead of our own proto version, so we avoid conflicts.
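The selection logic described here can be sketched as a small pure function. This is an assumption-laden illustration, not the merged code: `choose_pb2_module` is a hypothetical name, it returns a module path rather than importing anything, and it compares only the major protobuf version instead of using `packaging.version` as the existing guard does.

```python
import importlib.util


def choose_pb2_module(protobuf_version=None, sentencepiece_available=None):
    """Return the dotted path of the proto module to import."""
    if sentencepiece_available is None:
        # Detect the package without importing it.
        sentencepiece_available = importlib.util.find_spec("sentencepiece") is not None
    if sentencepiece_available:
        # Reuse sentencepiece's own module so only one copy of the file is
        # ever registered in the global descriptor pool.
        return "sentencepiece.sentencepiece_model_pb2"
    if protobuf_version is None:
        raise ImportError("protobuf is required when sentencepiece is not installed")
    major = int(protobuf_version.split(".")[0])
    if major < 4:
        return "transformers.utils.sentencepiece_model_pb2"
    return "transformers.utils.sentencepiece_model_pb2_new"


with_spm = choose_pb2_module("4.25.1", sentencepiece_available=True)
without_spm = choose_pb2_module("4.25.1", sentencepiece_available=False)
```

A caller would then pass the result to `importlib.import_module`. Preferring the sentencepiece copy sidesteps the name clash, at the cost of falling back to the bundled copy (and its potential conflict) when sentencepiece is installed later in a different way or not at all.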

ArthurZucker

A lot better now, thanks

* test(tokenizers): add a test showing conflict with sentencepiece

  This is due to the fact that protobuf C implementation uses a global pool for all added descriptors, so if two different files add descriptors, they will end up conflicting.

* fix(tokenizers): mitigate sentencepiece/protobuf conflict

  When sentencepiece is available, use that protobuf instead of the internal one.

* chore(style): fix with ruff

tengomucho marked this pull request as ready for review September 5, 2024 14:58

ydshieh reviewed Sep 5, 2024


ydshieh requested a review from ArthurZucker September 5, 2024 15:30

tengomucho force-pushed the alvaro/sentencepiece-conflict branch from f4dd1d2 to 3f3bca5 Compare September 10, 2024 15:41

tengomucho added 2 commits September 10, 2024 17:43

test(tokenizers): add a test showing conflict with sentencepiece

536bcba

This is due to the fact that protobuf C implementation uses a global pool for all added descriptors, so if two different files add descriptors, they will end up conflicting.

fix(tokenizers): mitigate sentencepiece/protobuf conflict

e815f86

When sentencepiece is available, use that protobuf instead of the internal one.

tengomucho force-pushed the alvaro/sentencepiece-conflict branch from 3f3bca5 to e815f86 Compare September 10, 2024 15:45

chore(style): fix with ruff

6d944e9

tengomucho requested a review from ydshieh September 11, 2024 07:08

ArthurZucker approved these changes Sep 11, 2024


ArthurZucker merged commit 7a56598 into main Sep 13, 2024
22 of 24 checks passed

ArthurZucker deleted the alvaro/sentencepiece-conflict branch September 13, 2024 11:19
