Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate a conflict when using sentencepiece #33327

Merged
merged 3 commits into from
Sep 13, 2024

Conversation

tengomucho
Copy link
Contributor

@tengomucho tengomucho commented Sep 5, 2024

What does this PR do?

This is a fix to a conflict between the fast tokenizers usage and sentencepiece module.
This is due to the fact that protobuf C implementation uses a global pool for all added descriptors, so if two different files add descriptors, they will end up conflicting.

I added a test showing the problem and a guard when loading the proto file to mitigate the problem. Note that the problem is not completly removed: if for any obscure reason an invalid sentencepiece proto file was loaded before, it will keep on using that. Also, if sentencepiece was loaded after the tokenizers load the proto file, the error will occur again (but at least there will be a way to avoid it).

Before submitting

Who can review?

@ydshieh

@tengomucho tengomucho marked this pull request as ready for review September 5, 2024 14:58
pool = _descriptor_pool.Default()
# Before adding the serialized file, try to find it in the pool (that is global).
try:
DESCRIPTOR = pool.FindFileByName("sentencepiece_model.proto")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is: when we get this, does it (always) works with transformers? (we might get strange stuff for this object from other libraries).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we do not have a real guarantee that it will always work. That is why I said it mitigates the issues, it doesn't remove it altogether. Having said that, if some other code loaded before transformers uses protobuf and adds a file whose name matches sentencepiece_model.proto, I guess it's not going to be anything that is not sentencepiece! Alternatively I could do it the other way around, I can try to add the fail, and if it raises a TypeError I can try to do the FindFileByName.
In both cases I could also add a warning if FindFileByName is used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't mean it's sentencepiece or not sentencepiece. What I worry a bit is it's (may) not the one we have and we don't know what would happen ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that could probably be controlled by an environment variable however.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is that is the sentencepiece_model.proto was added in the current process, there is no way to add it again. We cannot be sure it's the right one, as it does not provide a version or a hash. And if we try to load it, it will raise an exception from the C extension. So that is why I thought that assuming it's the same proto is probably a fair fallback solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have a better option, I'd be happy to consider it though!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@require_sentencepiece
def test_sentencepiece_cohabitation(self):
from sentencepiece import sentencepiece_model_pb2 as _original_protobuf # noqa: F401
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we do this, we are importing sentencepiece.sentencepiece_model_pb2 and it will persists druing the whole (remaining) life time of the python test process.

I would probably avoid this, and run this test in a separate test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you suggesting to move this test to its own file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, typo, I mean in a separate python subprocess.

But we can wait for a core maintainer's review.

Copy link
Contributor Author

@tengomucho tengomucho Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! Who is the core maintainer? Can we tag him/her?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already ping Arthur

@ArthurZucker
Copy link
Collaborator

Hey! Could you provide the reproducer as well?
I really don't mind merging as is, but just want to make sure, you 1st have a project that imported protobuf, and 2. used from transformers.convert_slow_tokenizer import import_protobuf and that failed right?
What is a bit weird is we have this guard:

def import_protobuf(error_message=""):
    if is_protobuf_available():
        import google.protobuf

        if version.parse(google.protobuf.__version__) < version.parse("4.0.0"):
            from transformers.utils import sentencepiece_model_pb2
        else:
            from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
        return sentencepiece_model_pb2
    else:
        raise ImportError(PROTOBUF_IMPORT_ERROR.format(error_message))

is there no way to check if it was not already imported? (meaning try important, if the error is import error, we expect ?

@tengomucho
Copy link
Contributor Author

tengomucho commented Sep 6, 2024

Hello! I saw first this problem because I tried to use Jetstream/Pytorch, and they use sentencepiece and its sentencepiece_pb.py. In transformers there is a copy of the same file, with few changes to make some variables globals. So, while in python world they are technically two different modules, the C protobuf extension shares the same space, and they conflict when calling AddSerializedFile. That's why I proposed this option, because I didn't find any other way to figure out how to avoid this conflict.

I was just thinking now that an alternative could be to check if the sentencepiece module is available, import the sentencepiece_pb.py from that package, and avoid conflicts. We could then delete our "sentencepiece_pb_new.py" file. But that means it will only work when the sentencepiece module is available.

Let me know if you prefer this option and I can modify this PR accordingly.

This is due to the fact that protobuf C implementation uses a global
pool for all added descriptors, so if two different files add
descriptors, they will end up conflicting.
When sentencepiece is available, use that protobuf instead of the
internal one.
@tengomucho
Copy link
Contributor Author

I force-pushed a cleaner solution, I check if sentencepiece is available and if it's the case, we use that instead of our proto version, so we avoid conflicts.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot better now, thanks

@ArthurZucker ArthurZucker merged commit 7a56598 into main Sep 13, 2024
22 of 24 checks passed
@ArthurZucker ArthurZucker deleted the alvaro/sentencepiece-conflict branch September 13, 2024 11:19
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
* test(tokenizers): add a test showing conflict with sentencepiece

This is due to the fact that protobuf C implementation uses a global
pool for all added descriptors, so if two different files add
descriptors, they will end up conflicting.

* fix(tokenizers): mitigate sentencepiece/protobuf conflict

When sentencepiece is available, use that protobuf instead of the
internal one.

* chore(style): fix with ruff
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
* test(tokenizers): add a test showing conflict with sentencepiece

This is due to the fact that protobuf C implementation uses a global
pool for all added descriptors, so if two different files add
descriptors, they will end up conflicting.

* fix(tokenizers): mitigate sentencepiece/protobuf conflict

When sentencepiece is available, use that protobuf instead of the
internal one.

* chore(style): fix with ruff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants