From 6db757b7434610f70a5f47c3d2e8d9d1b56e9bca Mon Sep 17 00:00:00 2001
From: Michael Wang
Date: Tue, 25 May 2021 07:23:58 -0700
Subject: [PATCH] Fix structs column description in dev docs (#8318)

Currently structs column section of developer documentation mentions there's an offset column tied to its layout. This is not true and this PR fixes that.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - MithunR (https://github.com/mythrocks)
  - Mark Harris (https://github.com/harrism)
  - https://github.com/nvdbaranec

URL: https://github.com/rapidsai/cudf/pull/8318
---
 cpp/docs/DEVELOPER_GUIDE.md | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/cpp/docs/DEVELOPER_GUIDE.md b/cpp/docs/DEVELOPER_GUIDE.md
index 3abc35f9bd2..b1d62261225 100644
--- a/cpp/docs/DEVELOPER_GUIDE.md
+++ b/cpp/docs/DEVELOPER_GUIDE.md
@@ -964,21 +964,18 @@ this compound column representation of strings.
 
 ## Structs columns
 
-Structs are represented similarly to lists, except that they have multiple child data columns.
-The parent column's type is `STRUCT` and contains no data, but its size represents the number of
-structs in the column, and its null mask represents the validity of each struct element. The parent
-has `N + 1` children, where `N` is the number of fields in the struct.
+A struct is a nested data type with a set of child columns each representing an individual field
+of a logical struct. Field names are not represented.
 
-1. A non-nullable column of `INT32` elements that indicates the offset to the beginning of each
-   struct in each dense column of elements.
-2. For each field, a column containing the actual field data and optional null mask for all elements
-   of all the structs packed together.
-
-With this representation, `child[0][offsets[i]]` is the first field of struct `i`,
-`child[1][offsets[i]]` is the second field of struct `i`, etc.
+A structs column with `N` fields has `N` children. Each child is a column storing all the data
+of a single field packed column-wise, with an optional null mask. The parent column's type is
+`STRUCT` and contains no data, its size represents the number of struct rows in the column, and its
+null mask represents the validity of each struct element.
+
+With this representation, `child[0][10]` is row 10 of the first field of the struct, `child[1][42]`
+is row 42 of the second field of the struct.
 
-As defined in the [Apache Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout),
-in addition to the struct column's null mask, each struct field column has its own optional null
+Notice that in addition to the struct column's null mask, each struct field column has its own optional null
 mask. A struct field's validity can vary independently from the corresponding struct row. For
 instance, a non-null struct row might have a null field. However, the fields of a null struct row
 are deemed to be null as well. For example, consider a struct column of type