Python/Cython bindings for multibyte_split #8998

Merged
merged 88 commits into from
Sep 17, 2021
Commits
e1b71e6
multibyte-split scaffolding
cwharris Jun 27, 2021
836773a
cudf::io::text::input_stream
cwharris Jun 27, 2021
3e06c18
trie test scaffolding
cwharris Jul 2, 2021
ac14dbd
superstate + tests
cwharris Jul 7, 2021
ea8cee2
added device trie
cwharris Jul 7, 2021
a4a8dd0
add superstate to multibyte_split
cwharris Jul 7, 2021
094d2d2
cub block scan superstates
cwharris Jul 8, 2021
1117ab8
block-wide superstate matching
cwharris Jul 9, 2021
51b1444
fix superstate constructor bug where only the first 8 states were ini…
cwharris Jul 9, 2021
d1f7eb3
multibyte_split multiple delimeter support
cwharris Jul 9, 2021
a628d73
scan output-offsets in multibyte_split
cwharris Jul 9, 2021
e1cc84d
printf offsets in multibyte_split
cwharris Jul 9, 2021
c7177bc
add match-length to trie to adjust for output offset in multibyte_split
cwharris Jul 9, 2021
42dc014
adjust multibyte_split test case to expect delimiters to be retained …
cwharris Jul 9, 2021
5171711
printf match_begin and match_end for multibyte_split
cwharris Jul 9, 2021
6b62ceb
multibyte_split test passing
cwharris Jul 10, 2021
a2c9756
add multibyte_split comments, break test intentionally to work on mul…
cwharris Jul 12, 2021
21b8b25
multibyte_split add multi-block support
cwharris Jul 13, 2021
f59a93e
rename BYTES_PER_TILE to ITEMS_PER_TILE
cwharris Jul 13, 2021
5fa112a
add bounds check to multibyte_split load and flag
cwharris Jul 14, 2021
cf42fd0
multibyte_split benchmark scaffolding
cwharris Jul 14, 2021
e6e9741
multibyte_split increase threads per block and adjust test case.
cwharris Jul 14, 2021
b5c2e05
use circular buffer in multibyte_split to allow for stream inputs
cwharris Jul 16, 2021
738af48
update multibyte_split to work with streaming inputs
cwharris Jul 16, 2021
0121b22
consolidate two passes of stream-scanning to a single function
cwharris Jul 16, 2021
a233ca2
add tile_state partial to multibyte_split but dont use yet
cwharris Jul 16, 2021
4946058
add reusable tilestate callback to `multibyte_split`
cwharris Jul 16, 2021
d69aeca
begin working on warp-reduce window aggregation of tile state in mult…
cwharris Jul 16, 2021
079d1ea
fix multibyte_split bug where non-streaming approach would hang
cwharris Jul 17, 2021
970aac2
interleaved streaming io for multibyte_split
cwharris Jul 18, 2021
fee7ebb
use no-copy string column construction in multibyte_split
cwharris Jul 19, 2021
e5a5204
document multibyte_split minimum tile count requirements
cwharris Jul 19, 2021
216d620
Merge branch 'branch-21.10' into multibyte-split
cwharris Jul 19, 2021
65af4de
multibyte_split tunable concurrency via stream pool
cwharris Jul 22, 2021
a4fe128
multibyte_split remove device_istream replace with data_chunk_reader
cwharris Jul 23, 2021
9bc6c89
add data_chunk_source factories, nvtx ranges to multibyte_split, use …
cwharris Jul 23, 2021
08b3069
use make_device_uvector_async in trie.hpp
cwharris Jul 23, 2021
7088791
rm device_istream
cwharris Jul 23, 2021
b61c14f
multibyte_split add some docs, add more test cases
cwharris Jul 23, 2021
017f05d
revert CMakeLists ordering
cwharris Jul 23, 2021
f432e68
convert trie storage from SOA to AOS
cwharris Jul 25, 2021
59a70a9
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 26, 2021
f1d3b4a
fix spelling mistakes
cwharris Jul 26, 2021
51ac35c
break multibyte_split by adding queue/multistate support
cwharris Jul 27, 2021
3d04556
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 28, 2021
1fb36ee
fix `abac` pattern matching test, introduce new bug :(
cwharris Jul 29, 2021
ecf440a
fix multibyte_split aggregation strategy to avoid assuming T{} is an …
cwharris Jul 29, 2021
9e34efb
Merge branch 'multibyte-split-queue' into multibyte-split
cwharris Jul 29, 2021
fc014e5
add second host buffer to istream_data_chunk_reader to facilitate ove…
cwharris Jul 29, 2021
896ed31
actually add second buffer to istream_data_chunk_reader
cwharris Jul 29, 2021
7792521
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 30, 2021
2f75b50
clean up multibyte_split code
cwharris Jul 30, 2021
162e9cf
adjust copyright
cwharris Jul 30, 2021
ade1150
remove confusing test case in multibyte_split
cwharris Jul 30, 2021
8e08012
limit multibyte_split to 32 threads, because of a bug that needs fixi…
cwharris Jul 30, 2021
5ad2148
fix emoji bits documentation
cwharris Jul 31, 2021
511ab9f
style adjustments and documentation update to multibyte_split
cwharris Aug 2, 2021
69280e8
move tile-scanning utilites to detail namespace
cwharris Aug 2, 2021
2d37dc9
remove "inline" from constexpr members in cudf::io::text
cwharris Aug 2, 2021
9c6bf2a
fix large input bug in multibyte_split where offsets were not account…
cwharris Aug 3, 2021
ee817b1
improve data_chunk_reader docs
cwharris Aug 3, 2021
4cdbee5
make multibyte_split accept data_chunk_source as a const& arg
cwharris Aug 3, 2021
c3783db
add tile_state.hpp to meta.yaml
cwharris Aug 3, 2021
432399c
create bad-case scenario benchmark
cwharris Aug 3, 2021
ad21c4f
remove data_chunk in favor of device_span until it becomes clear an r…
cwharris Aug 4, 2021
18e0863
use std::vector<cuda_stream_view> instread of stream_pool
cwharris Aug 4, 2021
45e5b65
rename ticket to h_ticket
cwharris Aug 4, 2021
ee122a8
adjust `scan_tile_state_view::get_prefix` to make the purpose of thre…
cwharris Aug 4, 2021
c9d2889
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 5, 2021
ca6bbac
fix UB in multibyte_split concurrent kernel execution, improve perf
cwharris Aug 6, 2021
d68d951
add error messages to multibyte_split to indicate unsupported use cases
cwharris Aug 6, 2021
7a9a217
Add cuIO plumbing for read_text
jdye64 Aug 6, 2021
9684646
remove __threadfence() in favor of cuda::atomic
cwharris Aug 9, 2021
c564869
Python/Cython bindings for multibyte_split
jdye64 Aug 9, 2021
da7be52
Python/Cython bindings for multibyte_split
jdye64 Aug 9, 2021
a2a7277
Merge branch 'multibyte-split' into read_text_pyx_only
jdye64 Aug 9, 2021
c06da69
Merge branch 'multibyte-split' into read_text_pyx_only
jdye64 Aug 10, 2021
9f0eff4
merge with usptream/branch-21.10
jdye64 Aug 31, 2021
d3f3845
removed reader_impl that is no longer required after certain refactor…
jdye64 Aug 31, 2021
dee6b21
Remove vector of delimiters and only allow a single delimiter
jdye64 Aug 31, 2021
886fb77
remove cyclic dependency import causing test issues
jdye64 Sep 3, 2021
5cf04a9
Merge remote-tracking branch 'upstream/branch-21.10' into read_text_p…
jdye64 Sep 8, 2021
fca30e6
address spacing in except+
jdye64 Sep 8, 2021
80e83d6
Merge remote-tracking branch 'upstream/branch-21.10' into read_text_p…
jdye64 Sep 11, 2021
bd2711e
updates to read_text
jdye64 Sep 13, 2021
d2ff48f
Merge remote-tracking branch 'upstream/branch-21.10' into read_text_p…
jdye64 Sep 13, 2021
ba3eb40
Merge remote-tracking branch 'upstream/branch-21.10' into read_text_p…
jdye64 Sep 14, 2021
6a06a5d
updates per review
jdye64 Sep 15, 2021
1 change: 1 addition & 0 deletions python/cudf/cudf/__init__.py
@@ -96,6 +96,7 @@
    read_json,
    read_orc,
    read_parquet,
    read_text,
)
from cudf.utils.dtypes import _NA_REP
from cudf.utils.utils import set_allocator
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/__init__.py
@@ -36,6 +36,7 @@
    table,
    transpose,
    unary,
    text,
)

MAX_COLUMN_SIZE = np.iinfo(np.int32).max
27 changes: 27 additions & 0 deletions python/cudf/cudf/_lib/cpp/io/text.pxd
@@ -0,0 +1,27 @@
# Copyright (c) 2020-2021, NVIDIA CORPORATION.

from libcpp.memory cimport unique_ptr
from libcpp.string cimport string

from cudf._lib.cpp.column.column cimport column


cdef extern from "cudf/io/text/data_chunk_source.hpp" \
        namespace "cudf::io::text" nogil:

    cdef cppclass data_chunk_source:
        data_chunk_source() except +

cdef extern from "cudf/io/text/data_chunk_source_factories.hpp" \
        namespace "cudf::io::text" nogil:

    unique_ptr[data_chunk_source] make_source(string data) except +
    unique_ptr[data_chunk_source] \
        make_source_from_file(string filename) except +


cdef extern from "cudf/io/text/multibyte_split.hpp" \
        namespace "cudf::io::text" nogil:

    unique_ptr[column] multibyte_split(data_chunk_source source,
                                       string delimiter) except +
39 changes: 39 additions & 0 deletions python/cudf/cudf/_lib/text.pyx
@@ -0,0 +1,39 @@
# Copyright (c) 2020-2021, NVIDIA CORPORATION.

import cudf

from cython.operator cimport dereference
from libcpp.memory cimport make_unique, unique_ptr
from libcpp.string cimport string
from libcpp.utility cimport move

from cudf._lib.column cimport Column
from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.io.text cimport (
    data_chunk_source,
    make_source,
    make_source_from_file,
    multibyte_split,
)


def read_text(object filepaths_or_buffers,
              object delimiter=None):
    """
    Cython function to call into libcudf API, see `multibyte_split`.

    See Also
    --------
    cudf.io.text.read_text
    """
    cdef string filename = filepaths_or_buffers.encode()
    cdef string delim = delimiter.encode()

    cdef unique_ptr[data_chunk_source] datasource
    cdef unique_ptr[column] c_col

    with nogil:
        datasource = move(make_source_from_file(filename))
        c_col = move(multibyte_split(dereference(datasource), delim))

    return {None: Column.from_unique_ptr(move(c_col))}
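Note on the binding above: `delimiter` defaults to `None`, but the body calls `delimiter.encode()` unconditionally, and in plain Python `None` has no `encode` method, so omitting the delimiter would raise `AttributeError`. A minimal sketch of an explicit check follows. The helper name `encode_delimiter` is hypothetical and is not part of this PR; it only illustrates one way the argument could be validated before it reaches the C++ call.

```python
from typing import Optional


def encode_delimiter(delimiter: Optional[str]) -> bytes:
    """Hypothetical helper (not in the PR): validate and encode a delimiter."""
    if delimiter is None:
        # Fail with a clear message instead of AttributeError on None.encode().
        raise ValueError("delimiter must be provided")
    if not isinstance(delimiter, str):
        raise TypeError(
            f"delimiter must be a str, got {type(delimiter).__name__}"
        )
    # str.encode() defaults to UTF-8, so multibyte characters become
    # multiple bytes in the result.
    return delimiter.encode()


print(encode_delimiter("1."))
```

Raising early in Python keeps the error message tied to the user-facing argument rather than to an internal attribute lookup.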
1 change: 1 addition & 0 deletions python/cudf/cudf/io/__init__.py
@@ -12,3 +12,4 @@
    read_parquet_metadata,
    write_to_dataset,
)
from cudf.io.text import read_text
28 changes: 28 additions & 0 deletions python/cudf/cudf/io/text.py
@@ -0,0 +1,28 @@
# Copyright (c) 2018-2021, NVIDIA CORPORATION.

from io import BytesIO, StringIO

from nvtx import annotate

import cudf
from cudf._lib import text as libtext
from cudf.utils import ioutils


@annotate("READ_TEXT", color="purple", domain="cudf_python")
@ioutils.doc_read_text()
def read_text(
    filepath_or_buffer, delimiter=None, **kwargs,
):
    """{docstring}"""

    filepath_or_buffer, compression = ioutils.get_filepath_or_buffer(
        path_or_data=filepath_or_buffer,
        compression=None,
        iotypes=(BytesIO, StringIO),
        **kwargs,
    )

    return cudf.Series._from_data(
        libtext.read_text(filepath_or_buffer, delimiter=delimiter,)
    )
16 changes: 16 additions & 0 deletions python/cudf/cudf/tests/data/text/chess.pgn
@@ -0,0 +1,16 @@
[Event "F/S Return Match"]
[Site "Belgrade, Serbia JUG"]
[Date "1992.11.04"]
[Round "29"]
[White "Fischer, Robert J."]
[Black "Spassky, Boris V."]
[Result "1/2-1/2"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 {This opening is called the Ruy Lopez.}
4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 d6 8. c3 O-O 9. h3 Nb8 10. d4 Nbd7
11. c4 c6 12. cxb5 axb5 13. Nc3 Bb7 14. Bg5 b4 15. Nb1 h6 16. Bh4 c5 17. dxe5
Nxe4 18. Bxe7 Qxe7 19. exd6 Qf6 20. Nbd2 Nxd6 21. Nc4 Nxc4 22. Bxc4 Nb6
23. Ne5 Rae8 24. Bxf7+ Rxf7 25. Nxf7 Rxe1+ 26. Qxe1 Kxf7 27. Qe3 Qg5 28. Qxg5
hxg5 29. b3 Ke6 30. a3 Kd6 31. axb4 cxb4 32. Ra5 Nd5 33. f3 Bc8 34. Kf2 Bf5
35. Ra7 g6 36. Ra6+ Kc5 37. Ke1 Nf4 38. g3 Nxh3 39. Kd2 Kb5 40. Rd6 Kc5 41. Ra6
Nf2 42. g4 Bd3 43. Re6 1/2-1/2
26 changes: 26 additions & 0 deletions python/cudf/cudf/tests/test_text.py
@@ -8,6 +8,11 @@
from cudf.testing._utils import assert_eq


@pytest.fixture(scope="module")
def datadir(datadir):
    return datadir / "text"


def test_tokenize():
    strings = cudf.Series(
        [
@@ -877,3 +882,24 @@ def test_is_vowel_consonant():
    actual = strings.str.is_consonant(indices)
    assert type(expected) == type(actual)
    assert_eq(expected, actual)


def test_read_text(datadir):
    chess_file = str(datadir) + "/chess.pgn"
    delimiter = "1."

    with open(chess_file, "r") as f:
        content = f.read().split(delimiter)

    # Since Python split removes the delimiter and read_text does
    # not we need to add it back to the 'content'
    expected = cudf.Series(
        [
            c + delimiter if i < (len(content) - 1) else c
            for i, c in enumerate(content)
        ]
    )

    actual = cudf.read_text(chess_file, delimiter=delimiter)

    assert_eq(expected, actual)
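The expected value in `test_read_text` is built by splitting with Python's `str.split` and then re-appending the delimiter to every piece except the last. A minimal pure-Python sketch of that delimiter-retaining split is shown here. The function name `split_keep_delimiter` is illustrative only and does not exist in cudf; it mirrors how the test constructs its expectation, not how `multibyte_split` is implemented on the GPU.

```python
from typing import List


def split_keep_delimiter(text: str, delimiter: str) -> List[str]:
    """Split `text` on `delimiter`, keeping the delimiter at the end of each piece."""
    # str.split raises ValueError for an empty separator, so no extra check is needed.
    parts = text.split(delimiter)
    # Every piece except the last was followed by a delimiter in the input.
    return [p + delimiter for p in parts[:-1]] + [parts[-1]]


pieces = split_keep_delimiter("a1.b1.c", "1.")
print(pieces)  # ['a1.', 'b1.', 'c']

# Because nothing is discarded, joining the pieces reproduces the input.
assert "".join(pieces) == "a1.b1.c"
```

Since the match is on the raw substring, a delimiter such as `"1."` also matches inside longer tokens like `"11."`, which is worth keeping in mind when reading the pieces produced from `chess.pgn`.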
22 changes: 22 additions & 0 deletions python/cudf/cudf/utils/ioutils.py
@@ -1012,6 +1012,28 @@
doc_kafka_datasource = docfmt_partial(docstring=_docstring_kafka_datasource)


_docstring_text_datasource = """
Configuration object for a text Datasource

Parameters
----------
filepath_or_buffer : str, path object, or file-like object
    Either a path to a file (a `str`, `pathlib.Path`, or
    `py._path.local.LocalPath`), URL (including http, ftp, and S3 locations),
    or any object with a `read()` method (such as builtin `open()` file handler
    function or `StringIO`).
delimiter : string, default None, The delimiter that should be used
    for splitting text chunks into separate cudf column rows. Currently
    only a single delimiter is supported.

Returns
-------
result : GPU ``Series``

"""
doc_read_text = docfmt_partial(docstring=_docstring_text_datasource)


def is_url(url):
    """Check if a string is a valid URL to a network location.
