[gpuCI] Forward-merge branch-22.10 to branch-22.12 [skip gpuci] #11784

Merged
merged 3 commits into from
Sep 27, 2022

GPUtester
Collaborator

Forward-merge triggered by push to branch-22.10 that creates a PR to keep branch-22.12 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.

karthikeyann and others added 3 commits September 27, 2022 09:49
This PR implements JSON column creation from the traversed JSON tree. It has the following parts:
1. `reduce_to_column_tree` - Reduces the node tree into a column tree by aggregating each property of each column and the number of rows in each column.
2. `make_json_column2` - Creates the GPU JSON column tree structure from the tree and column info.
3. `json_column_to_cudf_column2` - Converts this GPU JSON column to a cudf column.
4. `parse_nested_json2` - Combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on device.

Depends on PRs #11518 and #11610.
For code review, use PR karthikeyann#5, which contains only these tree changes.

### Overview

- PR #11264 tokenizes the JSON string into tokens.
- PR #11518 converts tokens to nodes (tree representation).
- PR #11610 traverses this node tree and assigns a column id and row index to each node.
- This PR (#11714) converts the traversed tree into a JSON column, which in turn is translated to `cudf::column`.

JSON has 5 categories of nodes: STRUCT, LIST, FIELD, VALUE, and STRING.
STRUCT and LIST are nested types.
FIELD nodes are the keys of struct columns.
A VALUE node is similar to a STRING node but without double quotes. The actual datatype conversion happens in `json_column_to_cudf_column2`.

The tree representation `tree_meta_t` has 4 data members:
1. node categories
2. parent node ids
3. node levels
4. node string ranges {begin, end} (stored as 2 vectors)
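
As a rough illustration of this layout, here is a minimal Python sketch of a struct-of-arrays tree for the input `[{"a": 1}]`. The names `NodeCategory`, `TreeMeta`, and `node_text`, the `-1` root sentinel, and the exact string ranges (field name stored without its quotes) are assumptions made for illustration; they are not the libcudf definitions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class NodeCategory(Enum):
    # Hypothetical enum; names follow the five categories listed above.
    STRUCT = 0
    LIST = 1
    FIELD = 2
    VALUE = 3
    STRING = 4


@dataclass
class TreeMeta:
    # Struct-of-arrays: index i in every list describes node i.
    node_categories: List[NodeCategory]
    parent_node_ids: List[int]   # -1 marks the root (assumed sentinel)
    node_levels: List[int]
    node_range_begin: List[int]  # string range kept as two vectors
    node_range_end: List[int]


text = '[{"a": 1}]'

# Nodes in input order: the list, the struct, the field "a", the value 1.
tree = TreeMeta(
    node_categories=[NodeCategory.LIST, NodeCategory.STRUCT,
                     NodeCategory.FIELD, NodeCategory.VALUE],
    parent_node_ids=[-1, 0, 1, 2],
    node_levels=[0, 1, 2, 3],
    node_range_begin=[0, 1, 3, 7],
    node_range_end=[1, 2, 4, 8],
)


def node_text(tree: TreeMeta, text: str, i: int) -> str:
    """Return the slice of the input that node i refers to."""
    return text[tree.node_range_begin[i]:tree.node_range_end[i]]
```

Keeping each property in its own vector means a later step can sort or reduce one property by key without touching the others.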

The currently supported JSON formats are records orient and JSON lines.

### This PR - Detailed explanation
This PR has 3 steps, plus `parse_nested_json2`, which ties them together.
1. `reduce_to_column_tree`
    - Required to compute the total number of columns, the column types, the nested column structure, and the number of rows in each column.
    - Generates the `tree_meta_t` data members for columns.
        - Sort the node tree by col_id (stable sort).
        - `reduce_by_key` with a custom op on node_categories collapses them to a column category.
        - `unique_by_key_copy` by col_id copies the first parent_node_id and string_ranges. This parent_node_id is then transformed to parent_column_id.
        - `reduce_by_key` with max on row_offsets gives the maximum row offset in each column. The max row offset of list columns' children is propagated to their children because structs may sometimes miss entries, so the parent list gives the correct count.
2. `make_json_column2`
    - Converts nodes to GPU JSON columns in a tree structure.
        - Get the column tree and transfer the column names to host.
        - Create a `d_json_column` for each non-field column.
        - If 2 columns occur on the same path, and one of them is nested and the other is a string column, discard the string column.
        - For STRUCT, LIST, VALUE, and STRING nodes, set the validity bits, and copy the string {begin, end} range to string_offsets and string length.
        - Compute the list offsets.
        - Perform a max scan on the offsets (to fill 0s with the previous offset value).
    - Now the `d_json_column` is nested and contains offsets, validity bits, and unparsed, unconverted string information.
3. `json_column_to_cudf_column2` - Converts this GPU JSON column to a cudf column.
    - Recursively goes over each `d_json_column` and converts it to a `cudf::column` by inferring the type, parsing the string to that type, and setting further validity bits.
4. `parse_nested_json2` - Combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on device.
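
The reduction and scan steps listed above can be sketched sequentially in plain Python. This is only a host-side illustration of the idea, not the device implementation: the function names other than `reduce_to_column_tree`, the string category labels, the `-1` root sentinel, and in particular the combining rule in `combine_category` are assumptions, and the propagation of max row offsets to descendants is omitted.

```python
from functools import reduce
from itertools import accumulate, groupby

LEAF = {"VALUE", "STRING"}


def combine_category(a, b):
    # Assumed rule (hypothetical): equal categories stay as they are,
    # two different leaf categories fall back to STRING,
    # and a non-leaf category wins over a leaf.
    if a == b:
        return a
    if a in LEAF and b in LEAF:
        return "STRING"
    if a in LEAF:
        return b
    if b in LEAF:
        return a
    return "ERROR"  # hypothetical marker for conflicting non-leaf types


def reduce_to_column_tree(col_ids, categories, parent_node_ids, row_offsets):
    # Stable sort of node indices by col_id (Python's sorted is stable).
    order = sorted(range(len(col_ids)), key=lambda i: col_ids[i])
    columns = {}
    # groupby plays the role of reduce_by_key / unique_by_key_copy here.
    for col_id, group in groupby(order, key=lambda i: col_ids[i]):
        nodes = list(group)
        parent_node = parent_node_ids[nodes[0]]  # first node of the column
        columns[col_id] = {
            "category": reduce(combine_category, (categories[i] for i in nodes)),
            # Translate the parent node id into the parent's column id.
            "parent_col_id": -1 if parent_node == -1 else col_ids[parent_node],
            "max_row_offset": max(row_offsets[i] for i in nodes),
        }
    return columns


def fill_offsets(offsets):
    # Inclusive max scan: each zero takes the previous non-zero offset.
    return list(accumulate(offsets, max))


# Traversed tree for the records-orient input [{"a": 1}, {"a": 2}].
col_ids = [0, 1, 2, 3, 1, 2, 3]
categories = ["LIST", "STRUCT", "FIELD", "VALUE", "STRUCT", "FIELD", "VALUE"]
parent_node_ids = [-1, 0, 1, 2, 0, 4, 5]
row_offsets = [0, 0, 0, 0, 1, 1, 1]

column_tree = reduce_to_column_tree(col_ids, categories, parent_node_ids, row_offsets)
```

In this example the seven nodes collapse into four columns, and `fill_offsets([0, 2, 0, 0, 5, 0])` yields `[0, 2, 2, 2, 5, 5]`.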

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Elias Stehle (https://github.com/elstehle)
  - Yunsong Wang (https://github.com/PointKernel)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Tobias Ribizel (https://github.com/upsj)
  - https://github.com/nvdbaranec
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11714
This adds a BGZIP `data_chunk_reader` usable with `multibyte_split`. The BGZIP format is a modified GZIP format that consists of multiple blocks, each holding at most 65536 bytes of compressed data that describe at most 65536 bytes of uncompressed data. The data can be accessed with record offsets provided by Tabix index files, which contain so-called virtual offsets (unsigned 64-bit) of the following form:
```
63                    16       0
+----------------------+-------+
|      block offset    | local |
+----------------------+-------+
```
The lower 16 bits describe the offset inside the uncompressed data belonging to a single compressed block, the upper 48 bits describe the offset of the compressed block inside the BGZIP file. The interface allows two modes: Reading a full compressed file, and reading between the locations described by two Tabix virtual offsets.
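
The bit layout of a virtual offset can be illustrated with a short Python sketch. The helper names `split_virtual_offset` and `make_virtual_offset` are hypothetical and are not part of the reader's interface; they only show the arithmetic implied by the diagram.

```python
LOCAL_BITS = 16
LOCAL_MASK = (1 << LOCAL_BITS) - 1  # 0xFFFF


def split_virtual_offset(virtual_offset: int):
    """Split a 64-bit virtual offset into (block offset, local offset)."""
    if not 0 <= virtual_offset < (1 << 64):
        raise ValueError("virtual offset must fit in an unsigned 64-bit integer")
    # Upper 48 bits: position of the compressed block in the BGZIP file.
    # Lower 16 bits: position inside that block's uncompressed data.
    return virtual_offset >> LOCAL_BITS, virtual_offset & LOCAL_MASK


def make_virtual_offset(block_offset: int, local_offset: int) -> int:
    """Pack a block offset and a local offset into one virtual offset."""
    if not 0 <= block_offset < (1 << 48):
        raise ValueError("block offset must fit in 48 bits")
    if not 0 <= local_offset <= LOCAL_MASK:
        raise ValueError("local offset must fit in 16 bits")
    return (block_offset << LOCAL_BITS) | local_offset
```

For example, `split_virtual_offset(0x12345)` returns `(1, 0x2345)`: the block starts at byte 1 of the file and the record starts 0x2345 bytes into that block's uncompressed data.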

For a description of the BGZIP format, check section 4 in the [SAM specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf).

Closes #10466 

## TODO
- [x] Use events to avoid clobbering data that is still in use
- [x] stricter handling of local_begin (currently it may overflow into subsequent blocks)
- [x] add tests where local_begin and local_end are in the same chunk or even block
- [x] ~~add cudf deflate fallback if nvComp doesn't support it~~ this should not be necessary, since we only test with compatible nvcomp versions

Authors:
  - Tobias Ribizel (https://github.com/upsj)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11652
This PR plumbs `schema_element` and `keep_quotes` support into the JSON reader.

**Deprecation:** This PR also deprecates passing `dtype` as a `list`. This appears to be an outdated legacy feature that we continued to support, and it cannot be supported with `schema_element`.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #11746
@GPUtester GPUtester requested review from a team as code owners September 27, 2022 09:52
@GPUtester GPUtester merged commit 1d7af9e into branch-22.12 Sep 27, 2022
@GPUtester
Collaborator Author

SUCCESS - forward-merge complete.

@github-actions github-actions bot added the CMake (CMake build issue), Python (Affects Python cuDF API), and libcudf (Affects libcudf (C++/CUDA) code) labels Sep 27, 2022