Subword Tokenizer HuggingFace like API #7942

Merged
Changes from 38 commits
Commits
43 commits
a949b2e
first_successful_compilation
VibhuJawa Apr 9, 2021
0f3623f
working_subword_tokenizer
VibhuJawa Apr 9, 2021
991085c
minor bug fixes
VibhuJawa Apr 9, 2021
56e0552
first_successful_compilation
VibhuJawa Apr 9, 2021
96f1921
working_subword_tokenizer
VibhuJawa Apr 9, 2021
30405c5
minor bug fixes
VibhuJawa Apr 9, 2021
100ec9e
Merge branch 'fea_subword_inmem_hash_bindings' of github.com:vibhujaw…
VibhuJawa Apr 9, 2021
0b7956e
Added cleaner API
VibhuJawa Apr 12, 2021
20a5c24
some API cleanup and initial tests
VibhuJawa Apr 15, 2021
0843cd7
test cleanup
VibhuJawa Apr 15, 2021
6d22915
cleanup + working tests
VibhuJawa Apr 15, 2021
37196d0
Modified CI
VibhuJawa Apr 15, 2021
daa087e
Documentation Changes
VibhuJawa Apr 15, 2021
25049c0
fix_test
VibhuJawa Apr 15, 2021
cecfaa4
fixed style issues
VibhuJawa Apr 15, 2021
9e13a52
Fixing style issues with subword_tokenize.pyx
VibhuJawa Apr 15, 2021
c2b936e
style fix to /subword_tokenize.pxd
VibhuJawa Apr 15, 2021
221ccf7
Fixed some import and ci installation issues
VibhuJawa Apr 16, 2021
44acf86
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
f3c474d
Update python/cudf/cudf/_lib/cpp/nvtext/subword_tokenize.pxd
VibhuJawa Apr 19, 2021
74a11a1
Update python/cudf/cudf/_lib/nvtext/subword_tokenize.pyx
VibhuJawa Apr 19, 2021
24f66ff
Update python/cudf/cudf/_lib/nvtext/subword_tokenize.pyx
VibhuJawa Apr 19, 2021
0ec8201
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
837e3a1
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
ff3c458
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
beb90a1
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
e52922d
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
ab868f3
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
e979eff
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
8ada172
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
3af9c2e
Update python/cudf/cudf/core/subword_tokenizer.py
VibhuJawa Apr 19, 2021
5fac42e
Addressed reviews and code style related changes
VibhuJawa Apr 19, 2021
4679c85
Added Transformers dependency
VibhuJawa Apr 19, 2021
6d3e86c
Added transformers to setup.py
VibhuJawa Apr 19, 2021
ee8a95b
merged branch-0.20 into fea_subword_inmem_hash_bindings
VibhuJawa Apr 21, 2021
0387d97
fixed style check in python/cudf/setup.py
VibhuJawa Apr 21, 2021
b940580
Removed transformers ci/gpu/build.sh
VibhuJawa Apr 21, 2021
0018779
Address reviews to the API
VibhuJawa Apr 21, 2021
26cc440
fixed documentation formatting
VibhuJawa Apr 22, 2021
e5cc893
mypy style fixes
VibhuJawa Apr 22, 2021
23e9fa4
added transformers to dev_requirements.txt
VibhuJawa Apr 22, 2021
01b4f43
Added a test to the old subword tokenizer API to triage CI error
VibhuJawa Apr 23, 2021
2aef505
Merge branch 'branch-0.20' into fea_subword_inmem_hash_bindings
VibhuJawa May 3, 2021
1 change: 1 addition & 0 deletions conda/environments/cudf_dev_cuda11.0.yml
@@ -60,6 +60,7 @@ dependencies:
  - protobuf
  - nvtx>=0.2.1
  - cachetools
  - transformers
  - pip:
      - git+https://github.com/dask/dask.git@main
      - git+https://github.com/dask/distributed.git@main
1 change: 1 addition & 0 deletions conda/environments/cudf_dev_cuda11.1.yml
@@ -60,6 +60,7 @@ dependencies:
  - protobuf
  - nvtx>=0.2.1
  - cachetools
  - transformers
  - pip:
      - git+https://github.com/dask/dask.git@main
      - git+https://github.com/dask/distributed.git@main
1 change: 1 addition & 0 deletions conda/environments/cudf_dev_cuda11.2.yml
@@ -60,6 +60,7 @@ dependencies:
  - protobuf
  - nvtx>=0.2.1
  - cachetools
  - transformers
  - pip:
      - git+https://github.com/dask/dask.git@main
      - git+https://github.com/dask/distributed.git@main
6 changes: 6 additions & 0 deletions docs/cudf/source/api.rst
@@ -206,6 +206,12 @@ Window
.. autoclass:: Rolling
    :members:

SubwordTokenizer
----------------
.. currentmodule:: cudf.core.subword_tokenizer

.. autoclass:: SubwordTokenizer
    :members:

General utility functions
-------------------------
28 changes: 27 additions & 1 deletion python/cudf/cudf/_lib/cpp/nvtext/subword_tokenize.pxd
@@ -3,7 +3,8 @@
from libcpp cimport bool
from libcpp.memory cimport unique_ptr
from libcpp.string cimport string
from libc.stdint cimport uint32_t
from libc.stdint cimport uint16_t, uint32_t


from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.column.column_view cimport column_view
@@ -17,6 +18,31 @@ cdef extern from "nvtext/subword_tokenize.hpp" namespace "nvtext" nogil:
        unique_ptr[column] tensor_attention_mask
        unique_ptr[column] tensor_metadata

    cdef struct hashed_vocabulary "nvtext::hashed_vocabulary":
        uint16_t first_token_id
        uint16_t separator_token_id
        uint16_t unknown_token_id
        uint32_t outer_hash_a
        uint32_t outer_hash_b
        uint16_t num_bin
        unique_ptr[column] table
        unique_ptr[column] bin_coefficients
        unique_ptr[column] bin_offsets

    cdef unique_ptr[hashed_vocabulary] load_vocabulary_file(
        const string &filename_hashed_vocabulary
    ) except +

    cdef tokenizer_result subword_tokenize(
        const column_view & strings,
        hashed_vocabulary & hashed_vocabulary_obj,
        uint32_t max_sequence_length,
        uint32_t stride,
        bool do_lower,
        bool do_truncate,
        uint32_t max_rows_tensor
    ) except +

    cdef tokenizer_result subword_tokenize(
        const column_view &strings,
        const string &filename_hashed_vocabulary,
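Both `subword_tokenize` declarations above take `max_sequence_length`, `stride`, and `do_truncate`. The following is a minimal pure-Python sketch of how such parameters can turn one token sequence into fixed-width rows. It is an illustration only, not the nvtext implementation: it assumes `stride` is the step between the starts of successive rows (so neighbouring rows overlap by `max_sequence_length - stride` tokens), and the function name `split_into_rows` and the `pad_id` parameter are hypothetical.

```python
from typing import List


def split_into_rows(
    token_ids: List[int],
    max_sequence_length: int = 64,
    stride: int = 48,
    do_truncate: bool = False,
    pad_id: int = 0,  # hypothetical padding value, not taken from the PR
) -> List[List[int]]:
    """Illustrative only: break one token sequence into fixed-width rows."""
    if max_sequence_length <= 0 or stride <= 0:
        raise ValueError("max_sequence_length and stride must be positive")

    if do_truncate or len(token_ids) <= max_sequence_length:
        # One row per input: keep the first max_sequence_length tokens.
        windows = [token_ids[:max_sequence_length]]
    else:
        # Several rows per input: advance the window start by `stride`.
        windows = []
        start = 0
        while True:
            windows.append(token_ids[start:start + max_sequence_length])
            if start + max_sequence_length >= len(token_ids):
                break
            start += stride

    # Pad every row on the right so all rows share the same width.
    return [w + [pad_id] * (max_sequence_length - len(w)) for w in windows]


rows = split_into_rows(list(range(1, 11)), max_sequence_length=4, stride=3)
```

With `do_truncate=True` the sketch always yields exactly one row per input, which is the behaviour the flag name suggests; the library's actual row layout and metadata may differ.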
58 changes: 52 additions & 6 deletions python/cudf/cudf/_lib/nvtext/subword_tokenize.pyx
@@ -9,27 +9,74 @@ from libc.stdint cimport uintptr_t

from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.column.column_view cimport column_view
from cudf._lib.cpp.nvtext.subword_tokenize cimport (
from cudf._lib.cpp.nvtext.subword_tokenize cimport(
    subword_tokenize as cpp_subword_tokenize,
    hashed_vocabulary as cpp_hashed_vocabulary,
    load_vocabulary_file as cpp_load_vocabulary_file,
    tokenizer_result as cpp_tokenizer_result,
    move as tr_move
    move as tr_move,
)
from cudf._lib.column cimport Column


def subword_tokenize(
cdef class Hashed_Vocabulary:
    cdef unique_ptr[cpp_hashed_vocabulary] c_obj

    def __cinit__(self, hash_file):
        cdef string c_hash_file = <string>str(hash_file).encode()
        with nogil:
            self.c_obj = move(cpp_load_vocabulary_file(c_hash_file))


def subword_tokenize_inmem_hash(
    Column strings,
    object hash_file,
    Hashed_Vocabulary hashed_vocabulary,
    uint32_t max_sequence_length=64,
    uint32_t stride=48,
    bool do_lower=True,
    bool do_truncate=False,
    uint32_t max_rows_tensor=500
):
    """
    Subword tokenizes text series by using the pre-loaded hashed vocabulary
    """
    cdef column_view c_strings = strings.view()
    cdef string c_hash_file = <string>str(hash_file).encode()
    cdef cpp_tokenizer_result c_result
    with nogil:
        c_result = tr_move(
            cpp_subword_tokenize(
                c_strings,
                hashed_vocabulary.c_obj.get()[0],
                max_sequence_length,
                stride,
                do_lower,
                do_truncate,
                max_rows_tensor
            )
        )
    # return the 3 tensor components
    tokens = Column.from_unique_ptr(move(c_result.tensor_token_ids))
    masks = Column.from_unique_ptr(move(c_result.tensor_attention_mask))
    metadata = Column.from_unique_ptr(move(c_result.tensor_metadata))
    return tokens, masks, metadata


def subword_tokenize_vocab_file(
    Column strings,
    object hash_file,
    uint32_t max_sequence_length=64,
    uint32_t stride=48,
    bool do_lower=True,
    bool do_truncate=False,
    uint32_t max_rows_tensor=500
):
    """
    Subword tokenizes text series by using the hashed vocabulary
    stored on disk
    """
    cdef column_view c_strings = strings.view()
    cdef cpp_tokenizer_result c_result
    cdef string c_hash_file = <string>str(hash_file).encode()
    with nogil:
        c_result = tr_move(
            cpp_subword_tokenize(
@@ -42,7 +89,6 @@ def subword_tokenize(
                max_rows_tensor
            )
        )

    # return the 3 tensor components
    tokens = Column.from_unique_ptr(move(c_result.tensor_token_ids))
    masks = Column.from_unique_ptr(move(c_result.tensor_attention_mask))
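The two Cython entry points in this file differ mainly in when the hashed vocabulary is read: `subword_tokenize_vocab_file` receives a path on every call, while `subword_tokenize_inmem_hash` receives a `Hashed_Vocabulary` that was loaded once in `__cinit__`. The sketch below mimics that load-once, reuse-many design in plain Python. Everything in it is hypothetical and for illustration: `LoadedVocab`, `tokenize_with_loaded`, and `tokenize_with_lines` are made-up names, the vocabulary is a list of lines rather than a hash file, and the whitespace split stands in for real subword tokenization.

```python
from typing import Dict, List


class LoadedVocab:
    """Hypothetical stand-in for a vocabulary object that is parsed once."""

    load_count = 0  # counts how many times a vocabulary has been parsed

    def __init__(self, lines: List[str]) -> None:
        LoadedVocab.load_count += 1
        # Toy vocabulary table: each token maps to its line index.
        self.token_to_id: Dict[str, int] = {
            tok.strip(): i for i, tok in enumerate(lines)
        }
        self.unknown_id = self.token_to_id.get("[UNK]", 0)


def tokenize_with_loaded(texts: List[str], vocab: LoadedVocab) -> List[List[int]]:
    """Reuse an already-parsed vocabulary (analogous to the in-memory path)."""
    return [
        [vocab.token_to_id.get(word, vocab.unknown_id) for word in text.split()]
        for text in texts
    ]


def tokenize_with_lines(texts: List[str], lines: List[str]) -> List[List[int]]:
    """Parse the vocabulary on every call (analogous to the file-path path)."""
    return tokenize_with_loaded(texts, LoadedVocab(lines))


vocab_lines = ["[UNK]", "hello", "world"]
vocab = LoadedVocab(vocab_lines)          # parsed once
first = tokenize_with_loaded(["hello world"], vocab)
second = tokenize_with_loaded(["hello gpu"], vocab)  # no re-parse here
```

Holding the parsed vocabulary in an object keeps repeated calls from paying the load cost again, which appears to be the motivation for exposing `Hashed_Vocabulary` alongside the existing file-based function.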
4 changes: 2 additions & 2 deletions python/cudf/cudf/core/column/string.py
@@ -40,7 +40,7 @@
    porter_stemmer_measure as cpp_porter_stemmer_measure,
)
from cudf._lib.nvtext.subword_tokenize import (
    subword_tokenize as cpp_subword_tokenize,
    subword_tokenize_vocab_file as cpp_subword_tokenize_vocab_file,
)
from cudf._lib.nvtext.tokenize import (
    _count_tokens_column as cpp_count_tokens_column,
@@ -4435,7 +4435,7 @@ def subword_tokenize(
        array([[0, 0, 2],
               [1, 0, 1]], dtype=uint32)
        """
        tokens, masks, metadata = cpp_subword_tokenize(
        tokens, masks, metadata = cpp_subword_tokenize_vocab_file(
            self._column,
            hash_file,
            max_length,