Mitigate a conflict when using sentencepiece #33327
Conversation
```python
pool = _descriptor_pool.Default()
# Before adding the serialized file, try to find it in the pool (that is global).
try:
    DESCRIPTOR = pool.FindFileByName("sentencepiece_model.proto")
```
My concern here is: when we get this, does it (always) work with transformers? (We might get strange stuff for this object from other libraries.)
No, we do not have a real guarantee that it will always work. That is why I said it mitigates the issue, it doesn't remove it altogether. Having said that, if some other code loaded before transformers uses protobuf and adds a file whose name matches sentencepiece_model.proto, I guess it's not going to be anything that is not sentencepiece! Alternatively I could do it the other way around: I can try to add the file, and if it raises a TypeError I can try the FindFileByName. In both cases I could also add a warning if FindFileByName is used.
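The add-then-fall-back ordering discussed above can be sketched with a toy stand-in for protobuf's process-global descriptor pool (all names here are hypothetical, not the PR's actual code):

```python
import warnings


class ToyDescriptorPool:
    """Toy stand-in for protobuf's process-global descriptor pool."""

    def __init__(self):
        self._files = {}

    def AddSerializedFile(self, name, payload):
        # The protobuf C extension raises TypeError when the same file
        # name is registered twice; mirror that behaviour here.
        if name in self._files:
            raise TypeError(f"duplicate file name {name!r}")
        self._files[name] = payload
        return payload

    def FindFileByName(self, name):
        return self._files[name]


_POOL = ToyDescriptorPool()


def register_or_reuse(name, payload):
    """Try to add the file first; on a duplicate, warn and reuse it."""
    try:
        return _POOL.AddSerializedFile(name, payload)
    except TypeError:
        warnings.warn(f"{name} already registered; reusing the existing entry")
        return _POOL.FindFileByName(name)
```

Whichever of the two candidate descriptors is registered first wins; the second caller silently inherits it, which is exactly the "we cannot be sure it's the right one" risk raised below.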
No, I don't mean it's sentencepiece or not sentencepiece. What I worry about a bit is that it may not be the one we have, and we don't know what would happen ...
That could probably be controlled by an environment variable, however.
The problem is that if the sentencepiece_model.proto was added in the current process, there is no way to add it again. We cannot be sure it's the right one, as it does not provide a version or a hash. And if we try to load it, it will raise an exception from the C extension. So that is why I thought that assuming it's the same proto is probably a fair fallback solution.
if you have a better option, I'd be happy to consider it though!
```python
@require_sentencepiece
def test_sentencepiece_cohabitation(self):
    from sentencepiece import sentencepiece_model_pb2 as _original_protobuf  # noqa: F401
```
When we do this, we are importing sentencepiece.sentencepiece_model_pb2 and it will persist during the whole (remaining) lifetime of the python test process. I would probably avoid this, and run this test in a separate test
are you suggesting to move this test to its own file?
sorry, typo, I mean in a separate python subprocess.
But we can wait for a core maintainer's review.
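The subprocess isolation suggested here can be sketched with the standard library alone; `run_isolated` is a hypothetical helper, not part of the transformers test suite:

```python
import subprocess
import sys
import textwrap


def run_isolated(snippet: str) -> int:
    """Run `snippet` in a fresh Python interpreter and return its exit code.

    Any modules the snippet imports live and die with the child process,
    so they cannot leak into (or collide with) the long-lived test process.
    """
    return subprocess.run(
        [sys.executable, "-c", textwrap.dedent(snippet)]
    ).returncode
```

A test wrapped this way can freely import sentencepiece.sentencepiece_model_pb2 without polluting the module cache of the main pytest process.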
OK! Who is the core maintainer? Can we tag him/her?
I already pinged Arthur
Hey! Could you provide the reproducer as well?

```python
def import_protobuf(error_message=""):
    if is_protobuf_available():
        import google.protobuf

        if version.parse(google.protobuf.__version__) < version.parse("4.0.0"):
            from transformers.utils import sentencepiece_model_pb2
        else:
            from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
        return sentencepiece_model_pb2
    else:
        raise ImportError(PROTOBUF_IMPORT_ERROR.format(error_message))
```

Is there no way to check if it was not already imported? (Meaning try importing, and if the error is an import error, we expect ... ?)
Hello! I first saw this problem because I tried to use Jetstream/Pytorch, which uses sentencepiece. I was just thinking now that an alternative could be to check if the module was already imported. Let me know if you prefer this option and I can modify this PR accordingly.
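The "check if it was already imported" alternative can be sketched with `sys.modules`; the helper name is hypothetical and not code from this PR:

```python
import sys


def already_imported(module_name: str) -> bool:
    """Hypothetical guard: was this module already imported in-process?

    `sys.modules` is the interpreter-wide cache of imported modules, so
    a hit means some earlier code (possibly another library) loaded it
    and may already have registered its descriptors in the global pool.
    """
    return module_name in sys.modules


# e.g. already_imported("sentencepiece.sentencepiece_model_pb2")
```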
Force-pushed from f4dd1d2 to 3f3bca5.
This is due to the fact that protobuf C implementation uses a global pool for all added descriptors, so if two different files add descriptors, they will end up conflicting.
When sentencepiece is available, use that protobuf instead of the internal one.
Force-pushed from 3f3bca5 to e815f86.
I force-pushed a cleaner solution: I check if sentencepiece is available, and if so I use its protobuf module instead of the internal one.
A lot better now, thanks
* test(tokenizers): add a test showing conflict with sentencepiece. This is due to the fact that the protobuf C implementation uses a global pool for all added descriptors, so if two different files add descriptors, they will end up conflicting.
* fix(tokenizers): mitigate sentencepiece/protobuf conflict. When sentencepiece is available, use that protobuf instead of the internal one.
* chore(style): fix with ruff
What does this PR do?
This fixes a conflict between fast tokenizer usage and the sentencepiece module.
The conflict arises because the protobuf C implementation uses a global pool for all added descriptors, so if two different files add descriptors, they will end up conflicting.
I added a test showing the problem and a guard when loading the proto file to mitigate it. Note that the problem is not completely removed: if for any obscure reason an invalid sentencepiece proto file was loaded before, it will keep on using that. Also, if sentencepiece is loaded after the tokenizers load the proto file, the error will occur again (but at least there will be a way to avoid it).
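The guard described above boils down to preferring an externally provided module (sentencepiece's own generated protobuf file) over a bundled copy. A generic, hypothetical sketch of that prefer-the-first-importable-module pattern:

```python
import importlib


def first_importable(*names):
    """Return the first module among `names` that can be imported.

    Mirrors the mitigation's shape: try the module another library may
    already have registered descriptors for, then fall back to a
    bundled equivalent, so both sides share one descriptor registration.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {names!r} could be imported")
```

With real module names this would be called as, e.g., `first_importable("sentencepiece.sentencepiece_model_pb2", "transformers.utils.sentencepiece_model_pb2")`.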
Before submitting
Did you read the contributor guideline, Pull Request section?
Who can review?
@ydshieh