
merge branch-22.12 #5

Merged

merged 24 commits into from Sep 28, 2022

Conversation

etseidl
Owner

@etseidl etseidl commented Sep 28, 2022

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

raydouglass and others added 24 commits September 23, 2022 11:38
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
Fixes: #11011

This PR:

- [x] Adds a side-section for `list` & `struct` handling.
- [x] Reduces duplication.
- [x] Exposes more `ListMethods` APIs.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #11770
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
Fixes: #11721 
This PR:
- [x] Fixes: #11721 by not going through the `fill` & `fill_inplace` APIs, which don't support `struct` and `list` columns.
- [x] Fixes an issue in caching when constructing a `struct` or `list` scalar: `list` & `dict` objects are not hashable, so we were running into the following errors:
```python
In [9]: i = cudf.Scalar([10, 11])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/core/scalar.py:51, in CachedScalarInstanceMeta.__call__(self, value, dtype)
     49 try:
     50     # try retrieving an instance from the cache:
---> 51     self.__instances.move_to_end(cache_key)
     52     return self.__instances[cache_key]

KeyError: ([10, 11], <class 'list'>, None, <class 'NoneType'>)

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In [9], line 1
----> 1 i = cudf.Scalar([10, 11])

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/core/scalar.py:57, in CachedScalarInstanceMeta.__call__(self, value, dtype)
     53 except KeyError:
     54     # if an instance couldn't be found in the cache,
     55     # construct it and add to cache:
     56     obj = super().__call__(value, dtype=dtype)
---> 57     self.__instances[cache_key] = obj
     58     if len(self.__instances) > self.__maxsize:
     59         self.__instances.popitem(last=False)

TypeError: unhashable type: 'list'
```
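For illustration, the caching pattern and a fallback for unhashable values can be sketched in plain Python. The names `CachedInstanceMeta` and `Scalar` below are simplified stand-ins inferred from the traceback above, not cudf's actual implementation:

```python
from collections import OrderedDict

class CachedInstanceMeta(type):
    """Sketch of an instance-caching metaclass. Unhashable values
    (list, dict) raise TypeError on cache lookup, so we fall back to
    plain uncached construction instead of crashing."""
    __maxsize = 128

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls.__instances = OrderedDict()

    def __call__(cls, value):
        key = (value, type(value))
        try:
            # Hashing an unhashable key raises TypeError here.
            cls.__instances.move_to_end(key)
            return cls.__instances[key]
        except (KeyError, TypeError):
            obj = super().__call__(value)
            try:
                cls.__instances[key] = obj
                if len(cls.__instances) > cls.__maxsize:
                    cls.__instances.popitem(last=False)  # evict LRU entry
            except TypeError:
                pass  # unhashable value: skip caching entirely
            return obj

class Scalar(metaclass=CachedInstanceMeta):
    def __init__(self, value):
        self.value = value

a, b = Scalar(10), Scalar(10)
assert a is b            # hashable values hit the cache
c = Scalar([10, 11])     # unhashable: constructed without error
```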

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #11760
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
This PR fixes #11159 by returning the correct object type for the result of `isna` & `notna` on `Index`.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #11769
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
Fixes: #11683, #10823

This PR:

- [x] Removes `kwargs` in CSV reader & writer such that users get clear errors when they misspell a parameter.
- [x] Re-orders `read_csv` & `to_csv` parameters to match pandas.

The diff boils down to adding `storage_options` to `read_csv` & `to_csv` after removing `kwargs`; the rest is re-ordering parameters accordingly.
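The motivation can be shown with a toy example (hypothetical `read_csv_lenient`/`read_csv_strict` functions, not the cudf API): a `**kwargs` signature silently swallows a misspelled parameter, while an explicit signature raises immediately:

```python
# With **kwargs, a misspelled parameter is silently swallowed:
def read_csv_lenient(path, sep=",", **kwargs):
    # 'seperator' (a typo for 'sep') disappears into kwargs unnoticed
    return {"path": path, "sep": sep}

# With an explicit signature, the same typo fails loudly:
def read_csv_strict(path, sep=","):
    return {"path": path, "sep": sep}

result = read_csv_lenient("data.csv", seperator="|")
assert result["sep"] == ","  # the typo had no effect: a silent bug

try:
    read_csv_strict("data.csv", seperator="|")
except TypeError as exc:
    print("clear error:", exc)
```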

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11762
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
…ut table (#11709)

By definition, the `cudf::partition*` APIs return a vector of offsets whose size is at least the number of partitions. As such, an empty output table should be associated with an offsets array like `[0, 0, ..., 0]` (all zeros). However, the output offsets in such situations are currently an empty array.

This PR corrects the implementation for such corner cases.

Closes #11700.
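As a host-side illustration of the corrected contract (a hypothetical toy, not the libcudf implementation), an empty input should still yield an all-zeros offsets vector rather than an empty one:

```python
def partition_offsets(keys, num_partitions):
    """Toy analogue of the offsets a partition API returns: the start
    offset of each partition plus a final end offset.  For an empty
    input this must be all zeros, not an empty vector."""
    counts = [0] * num_partitions
    for k in keys:          # keys[i] = destination partition of row i
        counts[k] += 1
    offsets = [0]
    for c in counts:        # prefix-sum the per-partition counts
        offsets.append(offsets[-1] + c)
    return offsets

empty_case = partition_offsets([], 4)   # all zeros, never []
```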

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11709
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
This PR generates the JSON column from the traversed JSON tree. It has the following parts:
1. `reduce_to_column_tree` - reduces the node tree into a column tree by aggregating each property of each column and the number of rows in each column.
2. `make_json_column2` - creates the GPU JSON column tree structure from the tree and column info.
3. `json_column_to_cudf_column2` - converts this GPU JSON column to a cudf column.
4. `parse_nested_json2` - combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on device.

Depends on PRs #11518 and #11610.
For code review, use PR karthikeyann#5, which contains only these changes.

### Overview

- PR #11264 Tokenizes the JSON string to Tokens
- PR #11518 Converts Tokens to Nodes (tree representation)
- PR #11610 Traverses this node tree, assigning a column id and row index to each node.
- This PR #11714 Converts this traversed tree into a JSON column, which in turn is translated to `cudf::column`

JSON has 5 categories of nodes: STRUCT, LIST, FIELD, VALUE, and STRING.
STRUCT and LIST are nested types.
FIELD nodes are struct columns' keys.
A VALUE node is similar to a STRING column but without double quotes. The actual datatype conversion happens in `json_column_to_cudf_column2`.

Tree Representation `tree_meta_t` has 4 data members.
1. node categories
2. node parents' id
3. node level
4. node's string range {begin, end} (as 2 vectors)

Currently supported JSON formats are records orient and JSON lines.
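The two formats carry the same rows in different shapes, which plain `json` can illustrate:

```python
import json

# "records" orient: a single JSON array of objects
records = '[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'
rows_records = json.loads(records)

# JSON lines: one standalone object per line
json_lines = '{"a": 1, "b": "x"}\n{"a": 2, "b": "y"}'
rows_jl = [json.loads(line) for line in json_lines.splitlines()]

assert rows_records == rows_jl  # both decode to the same rows
```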

### This PR - Detailed explanation
This PR has 4 steps.
1. `reduce_to_column_tree`
    - Required to compute the total number of columns, column types, the nested column structure, and the number of rows in each column.
    - Generates `tree_meta_t` data members for each column.
      - Sort the node tree by col_id (stable sort).
      - `reduce_by_key` with a custom op on node_categories collapses node categories to a column category.
      - `unique_by_key_copy` by col_id copies the first parent_node_id and string_ranges. This parent_node_id will be transformed to a parent_column_id.
      - `reduce_by_key` with max on row_offsets gives the maximum row offset in each column. List columns propagate their max row offset to their children, because structs may sometimes miss entries, so the parent list gives the correct count.
2. `make_json_column2`
    - Converts nodes to GPU JSON columns in a tree structure.
      - Get the column tree and transfer column names to host.
      - Create a `d_json_column` for each non-field column.
      - If 2 columns occur on the same path, and one of them is nested while the other is a string column, discard the string column.
      - For STRUCT, LIST, VALUE, and STRING nodes, set the validity bits, and copy the string {begin, end} range to string offsets and string lengths.
      - Compute list offsets.
      - Perform a max-scan on the offsets (to fill 0's with the previous offset value).
    - Now the `d_json_column` is nested, and contains offsets, validity bits, and unparsed, unconverted string information.
3. `json_column_to_cudf_column2` - converts this GPU JSON column to a cudf column.
    - Recursively goes over each `d_json_column` and converts it to a `cudf::column` by inferring the type, parsing the strings to that type, and further setting validity bits.
4. `parse_nested_json2` - combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on device.
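The sort/reduce primitives in `reduce_to_column_tree` and the max-scan over offsets in `make_json_column2` can be sketched on the host with toy data. This is a hypothetical simplification; the PR runs these as device algorithms on GPU vectors:

```python
from itertools import groupby, accumulate

# Toy node table: (col_id, category, row_offset) -- hypothetical data
# standing in for the device vectors the PR processes.
nodes = [
    (1, "STRING", 0), (0, "STRUCT", 0), (1, "STRING", 1),
    (0, "STRUCT", 1), (2, "LIST", 0), (2, "LIST", 1),
]

# reduce_to_column_tree analogue: stable sort by col_id, then a
# reduce_by_key-style pass collapsing per-node info to one record
# per column.
nodes_sorted = sorted(nodes, key=lambda n: n[0])  # sorted() is stable
columns = {}
for col_id, group in groupby(nodes_sorted, key=lambda n: n[0]):
    g = list(group)
    columns[col_id] = {
        "category": g[0][1],              # custom reduce op (simplified)
        "max_row": max(n[2] for n in g),  # reduce_by_key with max
    }

# make_json_column2 analogue: a max-scan over list offsets fills 0's
# (rows where a list had no entry) with the previous offset value.
offsets = [0, 2, 0, 0, 5, 0, 7]
scanned = list(accumulate(offsets, max))
```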

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Elias Stehle (https://github.com/elstehle)
  - Yunsong Wang (https://github.com/PointKernel)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Tobias Ribizel (https://github.com/upsj)
  - https://github.com/nvdbaranec
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11714
This adds a BGZIP `data_chunk_reader` usable with `multibyte_split`. The BGZIP format is a modified GZIP format that consists of multiple blocks of at most 65536 bytes of compressed data, each describing at most 65536 bytes of uncompressed data. The data can be accessed via record offsets provided by Tabix index files, which contain so-called virtual offsets (unsigned 64 bit) of the following form
```
63                    16       0
+----------------------+-------+
|      block offset    | local |
+----------------------+-------+
```
The lower 16 bits describe the offset inside the uncompressed data belonging to a single compressed block; the upper 48 bits describe the offset of the compressed block inside the BGZIP file. The interface allows two modes: reading a full compressed file, and reading between the locations described by two Tabix virtual offsets.

For a description of the BGZIP format, check section 4 in the [SAM specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf).
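The virtual-offset layout above splits cleanly with shifts and masks; a minimal sketch (hypothetical helper names, following the 48/16-bit layout shown):

```python
def split_virtual_offset(voffset: int):
    """Split a Tabix virtual offset into (compressed block offset,
    offset within the uncompressed block)."""
    block_offset = voffset >> 16      # upper 48 bits
    local_offset = voffset & 0xFFFF   # lower 16 bits
    return block_offset, local_offset

def make_virtual_offset(block_offset: int, local_offset: int) -> int:
    """Recombine the two parts; local must fit in 16 bits."""
    assert 0 <= local_offset < (1 << 16)
    return (block_offset << 16) | local_offset
```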

Closes #10466 

## TODO
- [x] Use events to avoid clobbering data that is still in use
- [x] stricter handling of local_begin (currently it may overflow into subsequent blocks)
- [x] add tests where  local_begin and local_end are in the same chunk or even block
- [x] ~~add cudf deflate fallback if nvComp doesn't support it~~ this should not be necessary, since we only test with compatible nvcomp versions

Authors:
  - Tobias Ribizel (https://github.com/upsj)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11652
This PR plumbs `schema_element` and `keep_quotes` support into the JSON reader.

**Deprecation:** This PR also contains changes deprecating `list` inputs for `dtype`. This is a very outdated legacy feature that we continued to support, and it cannot be supported with `schema_element`.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #11746
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
This PR adds support for the use of the `str.istitle()` method within UDFs for `apply`.

Authors:
  - https://github.com/brandon-b-miller
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11738
[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci]
@etseidl etseidl merged commit d3378fc into etseidl:feature/parquetv2 Sep 28, 2022
etseidl pushed a commit that referenced this pull request Jun 9, 2023
This implements stacktrace and adds a stacktrace string into any exception thrown by cudf. By doing so, the exception carries information about where it originated, allowing the downstream application to trace back with much less effort.

Closes rapidsai#12422.

### Example:
```
#0: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::sorted_order<false>(cudf::table_view, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x446
#1: cudf/cpp/build/libcudf.so : cudf::detail::sorted_order(cudf::table_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x113
#2: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::segmented_sorted_order_common<(cudf::detail::sort_method)1>(cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x66e
#3: cudf/cpp/build/libcudf.so : cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x88
#4: cudf/cpp/build/libcudf.so : cudf::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::mr::device_memory_resource*)+0xb9
#5: cudf/cpp/build/gtests/SORT_TEST : ()+0xe3027
#6: cudf/cpp/build/lib/libgtest.so.1.13.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f
#7: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::Test::Run()+0xd6
#8: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestInfo::Run()+0x195
#9: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestSuite::Run()+0x109
#10: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::internal::UnitTestImpl::RunAllTests()+0x44f
#11: cudf/cpp/build/lib/libgtest.so.1.13.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x87
#12: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::UnitTest::Run()+0x95
#13: cudf/cpp/build/gtests/SORT_TEST : ()+0xdb08c
#14: /lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90
#15: /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80
#16: cudf/cpp/build/gtests/SORT_TEST : ()+0xdf3d5
```

### Usage

In order to retrieve a stacktrace with fully human-readable symbols, some compiling options must be adjusted. To make such adjustment convenient and effortless, a new cmake option (`CUDF_BUILD_STACKTRACE_DEBUG`) has been added. Just set this option to `ON` before building cudf and it will be ready to use.

For downstream applications, whenever a cudf-type exception is thrown, it can retrieve the stored stacktrace and do whatever it wants with it. For example:
```
try {
  // cudf API calls
} catch (cudf::logic_error const& e) {
  std::cout << e.what() << std::endl;
  std::cout << e.stacktrace() << std::endl;
  throw;  // rethrow the original exception
} 
// similar with catching other exception types
```
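A rough Python analogue of the same idea, capturing a stacktrace when the exception object is constructed so the catch site can inspect where it originated (hypothetical `LogicError`, not a cudf API):

```python
import traceback

class LogicError(Exception):
    """Exception that records a stacktrace at the throw site, so a
    downstream handler can trace back to where it originated."""
    def __init__(self, msg):
        super().__init__(msg)
        # Capture the stack at construction time; drop the __init__
        # frame itself so the trace ends at the throw site.
        self.stacktrace = "".join(traceback.format_stack()[:-1])

def failing_operation():
    raise LogicError("invalid argument")

try:
    failing_operation()
except LogicError as e:
    print(e)             # the error message
    print(e.stacktrace)  # where it was constructed/thrown
```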

### Follow-up work

The next step would be patching `rmm` to attach stacktrace into `rmm::` exceptions. Doing so will allow debugging various memory exceptions thrown from libcudf using their stacktrace.


### Note:
 * This feature doesn't require libcudf to be built in Debug mode.
 * The flag `CUDF_BUILD_STACKTRACE_DEBUG` should not be turned on in production as it may affect code optimization. Instead, libcudf compiled with that flag turned on should be used only when needed, e.g., when debugging exceptions thrown by cudf.
 * This flag removes the current optimization flag from compilation (such as `-O2` or `-O3` in Release mode) and replaces it with `-Og` (optimize for debugging).
 * If this option is not set to `ON`, the stacktrace will not be available. This avoids expensive stacktrace retrieval when thrown exceptions are expected.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#13298
etseidl pushed a commit that referenced this pull request Nov 8, 2023
Raise ValueError if DataFrame column length does not match data