Make Regex, RegexTokenizer, Vocab, Vectors, SentencePiece pickle-able #1104
Codecov Report
@@           Coverage Diff           @@
##           master    #1104   +/-   ##
=======================================
  Coverage   77.54%   77.54%
=======================================
  Files          45       45
  Lines        3086     3086
=======================================
  Hits         2393     2393
  Misses        693      693
I cannot find the usage of
@@ -111,25 +111,49 @@ def test_vectors_add_item(self):
         self.assertEqual(vectors_obj['b'], tensorB)
         self.assertEqual(vectors_obj['not_in_it'], unk_tensor)
 
-    def test_vectors_load_and_save(self):
+    def test_vectors_update(self):
Splitting the test for updating as I think it should be tested separately from serialization.
The issue with the Linux conda build has nothing to do with the changes in this PR.
This should be resolved with #1106.
This commit makes Regex, RegexTokenizer, Vocab, Vectors and SentencePiece pickle-able on both PyBind11 and TorchScript.

The approach is:
1. Define `_serialize_XXX` and `_deserialize_XXX`.
   These replace `_get_states_XXX` and `_set_states_XXX`. I saw that the names of the original functions were flipped and used wrongly in `__getstate__` and `__setstate__`, so I changed the function names to something more descriptive.
2. Use `c10::intrusive_ptr` as the holder for custom classes when using pybind11.
   This allows the same serialization/deserialization functions to be used for both PyBind11 and TorchScript. See https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#smart-pointers for details on holders.
This PR makes Regex, RegexTokenizer, Vocab, Vectors and SentencePiece
pickle-able on both PyBind11 and TorchScript.
closes #1085
Approach
1. Define `_serialize_XXX` and `_deserialize_XXX` next to these classes.
   These replace `_get_states_XXX` and `_set_states_XXX`. I saw that the names of the original functions were flipped and used wrongly in `__getstate__` and `__setstate__`, so I changed the function names to something more descriptive and less confusing.
2. Use `c10::intrusive_ptr` as the holder for custom classes when using pybind11.
   This allows the same serialization/deserialization functions to be used for both PyBind11 and TorchScript. See https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#smart-pointers for details on holders.
3. For pickling TorchScript-bound `SentencePiece`, use a byte Tensor as the bytes container.
   The serialized form of `SentencePiece` is a byte string, and returning `std::string` to the Python realm causes a decoding error because Python tries to decode it as UTF-8. PyBind11 can work around this with the `pybind11::bytes` type, but TorchScript does not support byte strings, so this approach uses a byte Tensor as a container/intermediate format to pass the byte string to Python.
   Problem
   TorchScript does not support `bytes`, thus SentencePiece bound via TorchScript is not pickle-able.
4. Added I/O round trip tests for:
   - Regex
   - RegexTokenizer
   - BasicEnglishNormalize
   - Vocab
   - Vectors
   - SentencePiece