Python/Cython bindings for multibyte_split #8998

Merged
merged 88 commits into from
Sep 17, 2021
Changes from 81 commits
Commits
e1b71e6
multibyte-split scaffolding
cwharris Jun 27, 2021
836773a
cudf::io::text::input_stream
cwharris Jun 27, 2021
3e06c18
trie test scaffolding
cwharris Jul 2, 2021
ac14dbd
superstate + tests
cwharris Jul 7, 2021
ea8cee2
added device trie
cwharris Jul 7, 2021
a4a8dd0
add superstate to multibyte_split
cwharris Jul 7, 2021
094d2d2
cub block scan superstates
cwharris Jul 8, 2021
1117ab8
block-wide superstate matching
cwharris Jul 9, 2021
51b1444
fix superstate constructor bug where only the first 8 states were ini…
cwharris Jul 9, 2021
d1f7eb3
multibyte_split multiple delimeter support
cwharris Jul 9, 2021
a628d73
scan output-offsets in multibyte_split
cwharris Jul 9, 2021
e1cc84d
printf offsets in multibyte_split
cwharris Jul 9, 2021
c7177bc
add match-length to trie to adjust for output offset in multibyte_split
cwharris Jul 9, 2021
42dc014
adjust multibyte_split test case to expect delimiters to be retained …
cwharris Jul 9, 2021
5171711
printf match_begin and match_end for multibyte_split
cwharris Jul 9, 2021
6b62ceb
multibyte_split test passing
cwharris Jul 10, 2021
a2c9756
add multibyte_split comments, break test intentionally to work on mul…
cwharris Jul 12, 2021
21b8b25
multibyte_split add multi-block support
cwharris Jul 13, 2021
f59a93e
rename BYTES_PER_TILE to ITEMS_PER_TILE
cwharris Jul 13, 2021
5fa112a
add bounds check to multibyte_split load and flag
cwharris Jul 14, 2021
cf42fd0
multibyte_split benchmark scaffolding
cwharris Jul 14, 2021
e6e9741
multibyte_split increase threads per block and adjust test case.
cwharris Jul 14, 2021
b5c2e05
use circular buffer in multibyte_split to allow for stream inputs
cwharris Jul 16, 2021
738af48
update multibyte_split to work with streaming inputs
cwharris Jul 16, 2021
0121b22
consolidate two passes of stream-scanning to a single function
cwharris Jul 16, 2021
a233ca2
add tile_state partial to multibyte_split but dont use yet
cwharris Jul 16, 2021
4946058
add reusable tilestate callback to `multibyte_split`
cwharris Jul 16, 2021
d69aeca
begin working on warp-reduce window aggregation of tile state in mult…
cwharris Jul 16, 2021
079d1ea
fix multibyte_split bug where non-streaming approach would hang
cwharris Jul 17, 2021
970aac2
interleaved streaming io for multibyte_split
cwharris Jul 18, 2021
fee7ebb
use no-copy string column construction in multibyte_split
cwharris Jul 19, 2021
e5a5204
document multibyte_split minimum tile count requirements
cwharris Jul 19, 2021
216d620
Merge branch 'branch-21.10' into multibyte-split
cwharris Jul 19, 2021
65af4de
multibyte_split tunable concurrency via stream pool
cwharris Jul 22, 2021
a4fe128
multibyte_split remove device_istream replace with data_chunk_reader
cwharris Jul 23, 2021
9bc6c89
add data_chunk_source factories, nvtx ranges to multibyte_split, use …
cwharris Jul 23, 2021
08b3069
use make_device_uvector_async in trie.hpp
cwharris Jul 23, 2021
7088791
rm device_istream
cwharris Jul 23, 2021
b61c14f
multibyte_split add some docs, add more test cases
cwharris Jul 23, 2021
017f05d
revert CMakeLists ordering
cwharris Jul 23, 2021
f432e68
convert trie storage from SOA to AOS
cwharris Jul 25, 2021
59a70a9
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 26, 2021
f1d3b4a
fix spelling mistakes
cwharris Jul 26, 2021
51ac35c
break multibyte_split by adding queue/multistate support
cwharris Jul 27, 2021
3d04556
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 28, 2021
1fb36ee
fix `abac` pattern matching test, introduce new bug :(
cwharris Jul 29, 2021
ecf440a
fix multibyte_split aggregation strategy to avoid assuming T{} is an …
cwharris Jul 29, 2021
9e34efb
Merge branch 'multibyte-split-queue' into multibyte-split
cwharris Jul 29, 2021
fc014e5
add second host buffer to istream_data_chunk_reader to facilitate ove…
cwharris Jul 29, 2021
896ed31
actually add second buffer to istream_data_chunk_reader
cwharris Jul 29, 2021
7792521
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 30, 2021
2f75b50
clean up multibyte_split code
cwharris Jul 30, 2021
162e9cf
adjust copyright
cwharris Jul 30, 2021
ade1150
remove confusing test case in multibyte_split
cwharris Jul 30, 2021
8e08012
limit multibyte_split to 32 threads, because of a bug that needs fixi…
cwharris Jul 30, 2021
5ad2148
fix emoji bits documentation
cwharris Jul 31, 2021
511ab9f
style adjustments and documentation update to multibyte_split
cwharris Aug 2, 2021
69280e8
move tile-scanning utilites to detail namespace
cwharris Aug 2, 2021
2d37dc9
remove "inline" from constexpr members in cudf::io::text
cwharris Aug 2, 2021
9c6bf2a
fix large input bug in multibyte_split where offsets were not account…
cwharris Aug 3, 2021
ee817b1
improve data_chunk_reader docs
cwharris Aug 3, 2021
4cdbee5
make multibyte_split accept data_chunk_source as a const& arg
cwharris Aug 3, 2021
c3783db
add tile_state.hpp to meta.yaml
cwharris Aug 3, 2021
432399c
create bad-case scenario benchmark
cwharris Aug 3, 2021
ad21c4f
remove data_chunk in favor of device_span until it becomes clear an r…
cwharris Aug 4, 2021
18e0863
use std::vector<cuda_stream_view> instread of stream_pool
cwharris Aug 4, 2021
45e5b65
rename ticket to h_ticket
cwharris Aug 4, 2021
ee122a8
adjust `scan_tile_state_view::get_prefix` to make the purpose of thre…
cwharris Aug 4, 2021
c9d2889
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 5, 2021
ca6bbac
fix UB in multibyte_split concurrent kernel execution, improve perf
cwharris Aug 6, 2021
d68d951
add error messages to multibyte_split to indicate unsupported use cases
cwharris Aug 6, 2021
7a9a217
Add cuIO plumbing for read_text
jdye64 Aug 6, 2021
9684646
remove __threadfence() in favor of cuda::atomic
cwharris Aug 9, 2021
c564869
Python/Cython bindings for multibyte_split
jdye64 Aug 9, 2021
da7be52
Python/Cython bindings for multibyte_split
jdye64 Aug 9, 2021
a2a7277
Merge branch 'multibyte-split' into read_text_pyx_only
jdye64 Aug 9, 2021
c06da69
Merge branch 'multibyte-split' into read_text_pyx_only
jdye64 Aug 10, 2021
9f0eff4
merge with usptream/branch-21.10
jdye64 Aug 31, 2021
d3f3845
removed reader_impl that is no longer required after certain refactor…
jdye64 Aug 31, 2021
dee6b21
Remove vector of delimiters and only allow a single delimiter
jdye64 Aug 31, 2021
886fb77
remove cyclic dependency import causing test issues
jdye64 Sep 3, 2021
5cf04a9
Merge remote-tracking branch 'upstream/branch-21.10' into read_text_p…
jdye64 Sep 8, 2021
fca30e6
address spacing in except+
jdye64 Sep 8, 2021
80e83d6
Merge remote-tracking branch 'upstream/branch-21.10' into read_text_p…
jdye64 Sep 11, 2021
bd2711e
updates to read_text
jdye64 Sep 13, 2021
d2ff48f
Merge remote-tracking branch 'upstream/branch-21.10' into read_text_p…
jdye64 Sep 13, 2021
ba3eb40
Merge remote-tracking branch 'upstream/branch-21.10' into read_text_p…
jdye64 Sep 14, 2021
6a06a5d
updates per review
jdye64 Sep 15, 2021
1 change: 1 addition & 0 deletions python/cudf/cudf/__init__.py
@@ -96,6 +96,7 @@
    read_json,
    read_orc,
    read_parquet,
    read_text,
)
from cudf.utils.dtypes import _NA_REP
from cudf.utils.utils import set_allocator
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/__init__.py
@@ -36,6 +36,7 @@
    table,
    transpose,
    unary,
    text,
)

MAX_COLUMN_SIZE = np.iinfo(np.int32).max
27 changes: 27 additions & 0 deletions python/cudf/cudf/_lib/cpp/io/text.pxd
@@ -0,0 +1,27 @@
# Copyright (c) 2020-2021, NVIDIA CORPORATION.

from libcpp.memory cimport unique_ptr
from libcpp.string cimport string

from cudf._lib.cpp.column.column cimport column


cdef extern from "cudf/io/text/data_chunk_source.hpp" \
        namespace "cudf::io::text" nogil:

    cdef cppclass data_chunk_source:
        data_chunk_source() except+

cdef extern from "cudf/io/text/data_chunk_source_factories.hpp" \
        namespace "cudf::io::text" nogil:

    unique_ptr[data_chunk_source] make_source(string data) except+
    unique_ptr[data_chunk_source] \
        make_source_from_file(string filename) except+


cdef extern from "cudf/io/text/multibyte_split.hpp" \
        namespace "cudf::io::text" nogil:

    unique_ptr[column]multibyte_split(data_chunk_source source,
                                      string delimiter) except+
43 changes: 43 additions & 0 deletions python/cudf/cudf/_lib/text.pyx
@@ -0,0 +1,43 @@
# Copyright (c) 2020-2021, NVIDIA CORPORATION.

import cudf

from cython.operator cimport dereference
from libcpp.memory cimport make_unique, unique_ptr
from libcpp.string cimport string
from libcpp.utility cimport move

from cudf._lib.column cimport Column
from cudf._lib.cpp.column.column cimport column
from cudf._lib.cpp.io.text cimport (
    data_chunk_source,
    make_source,
    make_source_from_file,
    multibyte_split,
)


cpdef read_text(object filepaths_or_buffers,
                object delimiter=None):
    """
    Cython function to call into libcudf API, see `read_text`.

    See Also
    --------
    cudf.io.text.read_text
    """
    cdef string filename = filepaths_or_buffers.encode()
    cdef string delim = delimiter.encode()

    cdef unique_ptr[data_chunk_source] datasource
    cdef unique_ptr[column] c_col

    with nogil:
        datasource = move(make_source_from_file(filename))
        c_col = move(multibyte_split(dereference(datasource), delim))

    col = Column.from_unique_ptr(move(c_col))
    df = cudf.DataFrame._from_data(
        cudf.core.column_accessor.ColumnAccessor({"col_name": col}))

    return df
1 change: 1 addition & 0 deletions python/cudf/cudf/io/__init__.py
@@ -12,3 +12,4 @@
    read_parquet_metadata,
    write_to_dataset,
)
from cudf.io.text import read_text
30 changes: 30 additions & 0 deletions python/cudf/cudf/io/text.py
@@ -0,0 +1,30 @@
# Copyright (c) 2018-2021, NVIDIA CORPORATION.

from io import BytesIO, StringIO

from nvtx import annotate

import cudf
from cudf._lib import text as libtext
from cudf.utils import ioutils


@annotate("READ_TEXT", color="purple", domain="cudf_python")
@ioutils.doc_read_text()
def read_text(
    filepath_or_buffer, delimiter=None, **kwargs,
):
    """{docstring}"""

    filepath_or_buffer = ioutils.get_filepath_or_buffer(
        path_or_data=filepath_or_buffer,
        compression=None,
        iotypes=(BytesIO, StringIO),
        **kwargs,
    )

    df = cudf.DataFrame._from_table(
        libtext.read_text(filepath_or_buffer, delimiter=delimiter,)
    )

    return df
18 changes: 18 additions & 0 deletions python/cudf/cudf/utils/ioutils.py
@@ -1012,6 +1012,24 @@
doc_kafka_datasource = docfmt_partial(docstring=_docstring_kafka_datasource)


_docstring_text_datasource = """
Configuration object for a text Datasource

Parameters
----------
filepath_or_buffer : str, path object, or file-like object
    Either a path to a file (a `str`, `pathlib.Path`, or
    `py._path.local.LocalPath`), URL (including http, ftp, and S3 locations),
    or any object with a `read()` method (such as builtin `open()` file handler
    function or `StringIO`).
delimiter : string, default None, The delimiter that should be used
    for splitting text chunks into separate cudf column rows. Currently
    only a single delimiter is supported.

"""
doc_read_text = docfmt_partial(docstring=_docstring_text_datasource)


def is_url(url):
    """Check if a string is a valid URL to a network location.

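
Note on the splitting behaviour: the docstring above says the delimiter splits text into separate column rows, and the commit list mentions that delimiters are expected to be retained. The following is a minimal pure-Python sketch of that row-splitting behaviour on the CPU, for illustration only. It is not the libcudf implementation (which runs on the GPU), `split_keep_delimiter` is a hypothetical helper name that does not appear in this PR, and dropping an empty remainder after a trailing delimiter is an assumption of this sketch rather than confirmed `multibyte_split` behaviour.

```python
from typing import List


def split_keep_delimiter(text: str, delimiter: str) -> List[str]:
    """Split `text` after each occurrence of `delimiter`, keeping the delimiter.

    Hypothetical reference helper, not part of cudf.
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")

    rows = []
    start = 0
    while True:
        pos = text.find(delimiter, start)
        if pos == -1:
            break
        end = pos + len(delimiter)
        # Each row ends with the delimiter that terminated it.
        rows.append(text[start:end])
        start = end

    # Assumption: text after the last delimiter forms a final row,
    # and an empty remainder is dropped.
    if start < len(text):
        rows.append(text[start:])
    return rows


rows = split_keep_delimiter("alpha::beta::gamma", "::")
print(rows)
```

With a single multi-character delimiter such as `"::"`, each row in this sketch keeps its trailing delimiter, so concatenating the rows reproduces the input text.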