From 22ad1463a33660f62a7aa70c198f549ba30738c1 Mon Sep 17 00:00:00 2001
From: Michael Wang
Date: Fri, 21 May 2021 12:27:19 -0700
Subject: [PATCH 1/4] .

---
 cpp/docs/DEVELOPER_GUIDE.md | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/cpp/docs/DEVELOPER_GUIDE.md b/cpp/docs/DEVELOPER_GUIDE.md
index 3abc35f9bd2..0c107828abe 100644
--- a/cpp/docs/DEVELOPER_GUIDE.md
+++ b/cpp/docs/DEVELOPER_GUIDE.md
@@ -964,21 +964,20 @@ this compound column representation of strings.
 
 ## Structs columns
 
-Structs are represented similarly to lists, except that they have multiple child data columns.
-The parent column's type is `STRUCT` and contains no data, but its size represents the number of
-structs in the column, and its null mask represents the validity of each struct element. The parent
-has `N + 1` children, where `N` is the number of fields in the struct.
+A struct is a nested data type parametrized by ordered sequence of types, called fields. Unlike
+[Apache Arrow](https://arrow.apache.org/docs/format/Columnar.html#struct-layout), cuDF struct
+type only differentiate with other type by the field data type and the field order. Field name is
+not represented.
 
-1. A non-nullable column of `INT32` elements that indicates the offset to the beginning of each
-   struct in each dense column of elements.
-2. For each field, a column containing the actual field data and optional null mask for all elements
-   of all the structs packed together.
-
-With this representation, `child[0][offsets[i]]` is the first field of struct `i`,
-`child[1][offsets[i]]` is the second field of struct `i`, etc.
+A structs column with `N` fields has `N` children. Each child is a column storing all the data
+of a single field packed column-wise, with an optional null mask. The parent column's type is
+`STRUCT` and contains no data, its size represents the number of structs in the column, and its
+null mask represents the validity of each struct element.
+
+With this representation, `child[0][10]` is the first field of struct `10`, `child[1][42]` is the second
+field of struct `42`, etc.
 
-As defined in the [Apache Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#struct-layout),
-in addition to the struct column's null mask, each struct field column has its own optional null
+Notice that in addition to the struct column's null mask, each struct field column has its own optional null
 mask. A struct field's validity can vary independently from the corresponding struct row. For
 instance, a non-null struct row might have a null field. However, the fields of a null struct row
 are deemed to be null as well. For example, consider a struct column of type

From b86742ecdc17ebddcc34f34936abe08e4dd0ea1d Mon Sep 17 00:00:00 2001
From: Michael Wang
Date: Fri, 21 May 2021 12:43:34 -0700
Subject: [PATCH 2/4] Apply suggestions from code review

Co-authored-by: nvdbaranec <56695930+nvdbaranec@users.noreply.github.com>
---
 cpp/docs/DEVELOPER_GUIDE.md | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/cpp/docs/DEVELOPER_GUIDE.md b/cpp/docs/DEVELOPER_GUIDE.md
index 0c107828abe..65bda1066b1 100644
--- a/cpp/docs/DEVELOPER_GUIDE.md
+++ b/cpp/docs/DEVELOPER_GUIDE.md
@@ -964,18 +964,14 @@ this compound column representation of strings.
 
 ## Structs columns
 
-A struct is a nested data type parametrized by ordered sequence of types, called fields. Unlike
-[Apache Arrow](https://arrow.apache.org/docs/format/Columnar.html#struct-layout), cuDF struct
-type only differentiate with other type by the field data type and the field order. Field name is
-not represented.
+A struct is a nested data type with a set of child columns each representing an individual field of a logical struct. Field names are not represented.
 
 A structs column with `N` fields has `N` children. Each child is a column storing all the data
 of a single field packed column-wise, with an optional null mask. The parent column's type is
 `STRUCT` and contains no data, its size represents the number of structs in the column, and its
 null mask represents the validity of each struct element.
 
-With this representation, `child[0][10]` is the first field of struct `10`, `child[1][42]` is the second
-field of struct `42`, etc.
+With this representation, `child[0][10]` is row 10 of the first field of the struct, `child[1][42]` is row 42 of the second field of the struct.
 
 Notice that in addition to the struct column's null mask, each struct field column has its own optional null
 mask. A struct field's validity can vary independently from the corresponding struct row. For

From fe556155b5d161e5fd714083019a07dc41074fad Mon Sep 17 00:00:00 2001
From: Michael Wang
Date: Fri, 21 May 2021 12:47:41 -0700
Subject: [PATCH 3/4] formats

---
 cpp/docs/DEVELOPER_GUIDE.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/cpp/docs/DEVELOPER_GUIDE.md b/cpp/docs/DEVELOPER_GUIDE.md
index 65bda1066b1..27a70867f6e 100644
--- a/cpp/docs/DEVELOPER_GUIDE.md
+++ b/cpp/docs/DEVELOPER_GUIDE.md
@@ -964,14 +964,16 @@ this compound column representation of strings.
 
 ## Structs columns
 
-A struct is a nested data type with a set of child columns each representing an individual field of a logical struct. Field names are not represented.
+A struct is a nested data type with a set of child columns each representing an individual field
+of a logical struct. Field names are not represented.
 
 A structs column with `N` fields has `N` children. Each child is a column storing all the data
 of a single field packed column-wise, with an optional null mask. The parent column's type is
 `STRUCT` and contains no data, its size represents the number of structs in the column, and its
 null mask represents the validity of each struct element.
 
-With this representation, `child[0][10]` is row 10 of the first field of the struct, `child[1][42]` is row 42 of the second field of the struct.
+With this representation, `child[0][10]` is row 10 of the first field of the struct, `child[1][42]`
+is row 42 of the second field of the struct.
 
 Notice that in addition to the struct column's null mask, each struct field column has its own optional null
 mask. A struct field's validity can vary independently from the corresponding struct row. For

From 88f92bb4b5777d6ccabe09acad16c132601891e3 Mon Sep 17 00:00:00 2001
From: Michael Wang
Date: Fri, 21 May 2021 13:25:05 -0700
Subject: [PATCH 4/4] Update cpp/docs/DEVELOPER_GUIDE.md

Co-authored-by: MithunR
---
 cpp/docs/DEVELOPER_GUIDE.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cpp/docs/DEVELOPER_GUIDE.md b/cpp/docs/DEVELOPER_GUIDE.md
index 27a70867f6e..b1d62261225 100644
--- a/cpp/docs/DEVELOPER_GUIDE.md
+++ b/cpp/docs/DEVELOPER_GUIDE.md
@@ -969,7 +969,7 @@ of a logical struct. Field names are not represented.
 
 A structs column with `N` fields has `N` children. Each child is a column storing all the data
 of a single field packed column-wise, with an optional null mask. The parent column's type is
-`STRUCT` and contains no data, its size represents the number of structs in the column, and its
+`STRUCT` and contains no data, its size represents the number of struct rows in the column, and its
 null mask represents the validity of each struct element.
 
 With this representation, `child[0][10]` is row 10 of the first field of the struct, `child[1][42]`
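
The layout that the final revision of this section describes (`N` children for `N` fields, a parent null mask, an optional null mask per field, and fields of a null struct row deemed null) can be sketched in plain host-side C++. This is a minimal illustration under stated assumptions, not the libcudf API: `FieldColumn`, `StructsColumn`, `mask_allows`, `struct_is_valid`, `field_is_valid`, and `make_example` are hypothetical names, every field is assumed to be `INT32`, and `std::vector<bool>` stands in for the bitmask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one child column: all values of a single field,
// packed column-wise. An empty `valid` vector means "no null mask".
struct FieldColumn {
  std::vector<std::int32_t> data;  // one value per struct row
  std::vector<bool> valid;         // optional per-field null mask
};

// Hypothetical model of the parent STRUCT column: it holds no data itself,
// only a row count, an optional null mask, and one child per field.
struct StructsColumn {
  std::size_t size;                   // number of struct rows
  std::vector<bool> valid;            // validity of each struct row
  std::vector<FieldColumn> children;  // N fields => N children
};

// A row is valid if there is no mask at all or its mask entry is set.
inline bool mask_allows(std::vector<bool> const& mask, std::size_t row) {
  return mask.empty() || mask[row];
}

inline bool struct_is_valid(StructsColumn const& col, std::size_t row) {
  assert(row < col.size);
  return mask_allows(col.valid, row);
}

// A field value counts as valid only if both the struct row and the field's
// own mask say so: fields of a null struct row are deemed null as well.
inline bool field_is_valid(StructsColumn const& col, std::size_t field, std::size_t row) {
  assert(field < col.children.size() && row < col.size);
  return struct_is_valid(col, row) && mask_allows(col.children[field].valid, row);
}

// Two INT32 fields, three struct rows:
//   row 0: valid struct, both fields valid
//   row 1: valid struct, second field null
//   row 2: null struct (so both fields are deemed null)
inline StructsColumn make_example() {
  FieldColumn first{{1, 2, 3}, {}};  // no null mask
  FieldColumn second{{10, 20, 30}, {true, false, true}};
  StructsColumn col{3, {true, true, false}, {first, second}};
  return col;
}
```

In this sketch, the guide's `child[1][1]` corresponds to `col.children[1].data[1]`, and `field_is_valid(col, 1, 1)` reports that value as null even though struct row 1 itself is valid.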