Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Regex, RegexTokenizer, Vocab, Vectors, SentencePiece pickle-able #1104

Merged
merged 7 commits into from
Dec 22, 2020

Conversation

mthrok
Copy link
Contributor

@mthrok mthrok commented Dec 16, 2020

This PR makes Regex, RegexTokenizer, Vocab, Vectors and SentencePiece
pickle-able on both PyBind11 and TorchScript.

closes #1085

Approach

  1. define _serialize_XXX and _deserialize_XXX next to these classes
    This is the replacement of _get_states_XXX and _set_states_XXX.
    I saw the names of the original functions were flipped, and used wrongly in
    __getstate__ and __setstate__ so I changed the function names to something
    more descriptive and less confusing.

  2. Use c10::intrusive_ptr as holder for custom class when using pybind11.
    This allows to use the same serialization/deserialization function for both
    PyBind11 and TorchScript.
    See https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#smart-pointers
    for the detail of holder.

  3. For pickling TorchScript-bound SentencePiece, use byte Tensor as bytes container
    The serialized form of SentencePiece is byte string and returning std::string to
    Python realm causes decoding error as Python tries to decode it as UTF-8.
    PyBind11 can work around this with pybind11::bytes type, but TorchScript does not
    support byte string, this approach uses bytes Tensor as a container/intermediate format
    to pass byte string to Python.

Problem

Added I/O round trip tests

class PyBind11 TorchScript
Regex
RegexTokenizer
BasicEnglishNormalize
Vocab
Vectors
SentencePiece

@codecov
Copy link

codecov bot commented Dec 17, 2020

Codecov Report

Merging #1104 (d6249f8) into master (9562f80) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1104   +/-   ##
=======================================
  Coverage   77.54%   77.54%           
=======================================
  Files          45       45           
  Lines        3086     3086           
=======================================
  Hits         2393     2393           
  Misses        693      693           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9562f80...d6249f8. Read the comment docs.

@mthrok mthrok marked this pull request as ready for review December 17, 2020 03:31
@mthrok
Copy link
Contributor Author

mthrok commented Dec 17, 2020

I cannot find the usage of Regex class in Python realm. What should I do with that?

@@ -111,25 +111,49 @@ def test_vectors_add_item(self):
self.assertEqual(vectors_obj['b'], tensorB)
self.assertEqual(vectors_obj['not_in_it'], unk_tensor)

def test_vectors_load_and_save(self):
def test_vectors_update(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting the test for updating as I think it should be tested separately from serialization.

@mthrok
Copy link
Contributor Author

mthrok commented Dec 17, 2020

The issue with linux conda build has nothing to do with the changes in this PR

conda/conda-package-handling#71

@seemethere
Copy link
Member

The issue with linux conda build has nothing to do with the changes in this PR

conda/conda-package-handling#71

This should be resolved with #1106

This commit makes Regex, RegexTokenizer, Vocab, Vectors and SentencePiece
pickle-able on both PyBind11 and TorchScript.

The approach is

1. define `_serialize_XXX` and `_deserialize_XXX`
    This is the replacement of `_get_states_XXX` and `_set_states_XXX`.
    I saw the names of the original functions were flipped, and used wrongly in
    `__getstate__` and `__setstate__` so I changed the function names to something
    more descriptive.

2. Use `c10::intrusive_ptr` as holder for custom class when using pybind11.
    This allows to use the same serialization/deserialization function for both
    PyBind11 and TorchScript.
    See https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#smart-pointers
    for the detail of holder.
@zhangguanheng66 zhangguanheng66 merged commit aa7176b into pytorch:master Dec 22, 2020
@mthrok mthrok deleted the sp-binding2 branch December 22, 2020 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants