
[Go] Allow adding existing arrays into structs #35

Open
sfc-gh-pfus opened this issue Apr 18, 2024 · 15 comments
Labels
Type: enhancement New feature or request

Comments

@sfc-gh-pfus

sfc-gh-pfus commented Apr 18, 2024

Describe the enhancement requested

Hi! We have a use case in which we have existing Arrow arrays and want to compose them into a struct. Unfortunately, StructBuilder does not currently support this, so we have to copy all values from each source Arrow array into the corresponding array in the struct, one by one. It would be great if we could just reuse the existing arrays without manual copying.

What we do now:
Something like this:

for j := 0; j < int(numRows); j++ {
	sb.Append(true)
	for i := 0; i < structCol.NumField(); i++ {
		newInternalCol := internalCols[i]
		if newInternalCol.DataType().ID() == arrow.STRING {
			sb.FieldBuilder(i).(*array.StringBuilder).Append(newInternalCol.(*array.String).Value(j))
		} else if newInternalCol.DataType().ID() == arrow.INT8 {
			sb.FieldBuilder(i).(*array.Int8Builder).Append(newInternalCol.(*array.Int8).Value(j))
		} // ... and so on for every supported type
	}
}

What we want:

sb.SetColumn(i, internalCols[i])

The main problem here is not the few lines of code we need to write, but the manual memory copying. Is this doable?

Component(s)

Go

@sfc-gh-pfus sfc-gh-pfus added the Type: enhancement New feature or request label Apr 18, 2024
@zeroshade
Member

Could you just use the existing interface https://pkg.go.dev/github.com/apache/arrow/go/[email protected]/arrow/array#NewStructArray to create your struct array?

newStruct, err := array.NewStructArray(internalCols, []string{"a", "b", "c", "d"....})

@sfc-gh-pfus
Author

Actually no, because what you suggest works on standard Go arrays. I have an Arrow array (for instance *array.String).

@zeroshade
Member

The function I'm referring to takes a slice of arrow.Array interfaces. You would be able to do something like:

var cols []arrow.Array

sb := array.NewStringBuilder(memory.DefaultAllocator)
defer sb.Release()

sb.AppendValues([]string{"a", "b", "c"}, nil)
cols = append(cols, sb.NewStringArray()) // NewStringArray resets the builder for reuse
sb.AppendValues([]string{"d", "e", "f"}, nil)
cols = append(cols, sb.NewStringArray())

newStruct, err := array.NewStructArray(cols, []string{"col1", "col2"})
if err != nil {
	// handle the error before using newStruct
}
defer newStruct.Release()

and newStruct would be a Struct array with two string columns. If this doesn't work, could you provide a more complete code example of your scenario?

@sfc-gh-pfus
Author

@zeroshade Sorry, I misunderstood your example! This is great. Are there similar functions for creating lists and maps? You probably know the topic from the Snowflake driver :) We have a use case like this:

  1. Snowflake returns data in Arrow arrays (strings, ints, etc.). We have an Arrow batches API which returns this data to the client.
  2. In many cases we return these arrays as-is, but sometimes we have to perform conversions, for instance pruning decimal128 to int64 or converting a two-field struct representing a timestamp into an Arrow Timestamp.
  3. Now we are introducing structured objects, lists, and maps. So instead of arrays of "primitives" we have Arrow arrays of complex structures, such as structs (of various types), lists of specific types, or maps. But we still need to apply the same conversion schema (like pruning decimal to int64) to each "internal" field.

The API you presented is great for structs. Is there anything similar for lists and maps? So for each field in a struct:

newArr := convertFunction(origArr)

I can apply it to all arrays in a struct and build a new struct from them using array.NewStructArray. Is there something similar that I can use in:

newListArr := convertFunction(listArr.ListValues())
newKeyArr := convertFunction(mapArr.Keys())
newValueArr := convertFunction(mapArr.Items())

And build new list/map?

@zeroshade
Member

There isn't currently a convenience function like the one we have for structs, but you can do it easily by using the array.Data object directly:

newValues := convertFunction(listArr.ListValues())
defer newValues.Release()

newListData := array.NewData(listArr.DataType(), listArr.Len(), listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()}, listArr.NullN(), /* offset=*/0)
defer newListData.Release()

newListArr := array.NewListData(newListData)

Similarly, you can construct map arrays, which are physically laid out as the equivalent of a list<element: struct<key: K, value: V>>.
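For example, a minimal sketch of that approach, assuming newKeys and newItems are the already-converted key/item arrays for an existing *array.Map called mapArr (the variable and field names here are illustrative):

entries, err := array.NewStructArray([]arrow.Array{newKeys, newItems}, []string{"key", "value"})
if err != nil {
	return nil, err
}
defer entries.Release()

// reuse the original map's offsets/validity buffers; the single child is the entries struct
mapData := array.NewData(arrow.MapOf(newKeys.DataType(), newItems.DataType()),
	mapArr.Len(), mapArr.Data().Buffers(), []arrow.ArrayData{entries.Data()},
	mapArr.NullN(), 0)
defer mapData.Release()

newMapArr := array.NewMapData(mapData)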

That said, I think it would be a reasonable ask to add new convenience functions for this to make it easier for consumers to construct new list arrays and map arrays without having to manipulate the Data object themselves directly.

@sfc-gh-pfus
Author

sfc-gh-pfus commented Apr 19, 2024

@zeroshade it works! But why is this so complicated? I would never have figured it out myself :(

I only had to make one tiny change: when creating newListData, we should use arrow.ListOf(newValues.DataType()) instead of listArr.DataType().
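In other words (same variables as the snippet above), the corrected construction is:

newListData := array.NewData(arrow.ListOf(newValues.DataType()), listArr.Len(),
	listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()}, listArr.NullN(), 0)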

Let me check if I understand how to do it for maps.

@sfc-gh-pfus
Author

That worked:

structArr, err := array.NewStructArray([]arrow.Array{keyCol, valueCol}, []string{"k", "v"})
if err != nil {
	return nil, err
}
defer structArr.Release()
newData := array.NewData(arrow.MapOf(keyCol.DataType(), valueCol.DataType()), mapCol.Len(), mapCol.Data().Buffers(), []arrow.ArrayData{structArr.Data()}, mapCol.NullN(), 0)
defer newData.Release()
return array.NewMapData(newData), nil

Thanks a lot! It would be nice to have some helper functions though :)

@zeroshade
Member

Because of the need for the offsets buffer and so on, it's tough to think of what the interface for such helper functions for Lists and Maps would be. What do you think would make sense as helper constructors for those?
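For illustration only, here is one possible shape for such a helper, sketched with the same array.NewData technique shown above (the function name is invented here; it is not an existing API):

// Hypothetical helper: rebuild a list array around replacement child values,
// reusing the original validity and offsets buffers.
func listWithNewValues(orig *array.List, newValues arrow.Array) *array.List {
	data := array.NewData(arrow.ListOf(newValues.DataType()), orig.Len(),
		orig.Data().Buffers(), []arrow.ArrayData{newValues.Data()},
		orig.NullN(), orig.Data().Offset()) // carry any slice offset through
	defer data.Release()
	return array.NewListData(data)
}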

@kou kou changed the title Allow adding existing arrays into structs [Go] Allow adding existing arrays into structs Apr 23, 2024
@candiduslynx
Contributor

newListData := array.NewData(listArr.DataType(), listArr.Len(), listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()}, listArr.NullN(), /* offset=*/0)
defer newListData.Release()

newListArr := array.NewListData(newListData)

@zeroshade I think there's one vital thing to actually address here: the offset should also be used (as the passed-in record/array may be sliced), but not for struct arrays (they are special & I don't know why).
I actually implemented the bulk of the conversion in our filetypes package:
https://github.com/cloudquery/filetypes/blob/7fe7b10f029d7a853bf118eed6cf156eff375807/parquet/read.go#L77-L102
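As a sketch of the distinction (same variables as the earlier list snippet; the struct behavior follows from the setData code quoted later in this thread):

// list arrays: propagate the slice offset of the original array
newListData := array.NewData(arrow.ListOf(newValues.DataType()), listArr.Len(),
	listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()},
	listArr.NullN(), listArr.Data().Offset())

// struct arrays: the children of a sliced struct are already sliced,
// so the offset passed when rebuilding must stay 0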

@zeroshade
Member

@candiduslynx I'm very confused by that file; it looks like you're using a builder to make a deep copy of the arrays, which doesn't make any sense to me.

You have this function:

func slice(r arrow.Record) []arrow.Record {
	res := make([]arrow.Record, r.NumRows())
	for i := int64(0); i < r.NumRows(); i++ {
		res[i] = r.NewSlice(i, i+1)
	}
	return res
}

If I'm reading this right, you're slicing the record into a slice of records of exactly 1 row each? Why?

But I'm more confused by this reverseTransformArray function. reverseTransformRecord appears to loop through the columns and call reverseTransformArray with each column's type and the column itself, which means reverseTransformArray always receives a datatype identical to the datatype of the Array being passed in. So something like reverseTransformFromString is always called with a string array, and you're creating a builder and building up a new array identical to the one that was passed in, rather than just calling .Retain() on the array and returning it, or using the .Copy method on the ArrayData object and then calling MakeFromData. If you really wanted to make a deep copy (but why??), you could do it in a generic way by explicitly copying the buffers, regardless of the type, to create new ArrayData objects.
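A minimal sketch of those alternatives (the generic deep copy follows the buffer-copying idea above; dictionary arrays and other details are ignored):

// cheapest: share the same array by bumping its refcount
func passthrough(arr arrow.Array) arrow.Array {
	arr.Retain() // caller is responsible for the matching Release
	return arr
}

// generic deep copy: copy every buffer, recurse into children,
// then rebuild with array.NewData + array.MakeFromData
func copyData(mem memory.Allocator, d arrow.ArrayData) arrow.ArrayData {
	bufs := make([]*memory.Buffer, len(d.Buffers()))
	for i, b := range d.Buffers() {
		if b == nil {
			continue
		}
		nb := memory.NewResizableBuffer(mem)
		nb.Resize(b.Len())
		copy(nb.Bytes(), b.Bytes())
		bufs[i] = nb
	}
	children := make([]arrow.ArrayData, len(d.Children()))
	for i, c := range d.Children() {
		children[i] = copyData(mem, c)
	}
	return array.NewData(d.DataType(), d.Len(), bufs, children, d.NullN(), d.Offset())
}

// usage: deep := array.MakeFromData(copyData(memory.DefaultAllocator, arr.Data()))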

the offset should also be used (as the passed-in record/array may be sliced), but not for struct arrays (they are special & I don't know why).

What do you mean "special"? The offset handling for struct arrays should work precisely the same as any other type. Can you elaborate on what the issue there is?

@candiduslynx
Contributor

If I'm reading this right, you're slicing the record into a slice of records of exactly 1 row each? Why?

A test requirement; it'll be removed soon (we need to sort rows in tests, so we slice into single-row records).

But I'm more confused by this reverseTransformArray function. reverseTransformRecord appears to loop through the columns and call reverseTransformArray with each column's type and the column itself.

The schema passed to the reverseTransformRecord function doesn't necessarily match the schema of the record read.
We have to convert some columns to other types so they are better represented in Parquet (maybe we should revisit this, actually). We have similar code in our DuckDB plugin, where we use the Parquet formatting to put the data into the tables. Unfortunately, DuckDB doesn't support all of the types 1-to-1, so we convert some values. That also means that to reconstruct the record read for tests, we need to perform the reverse transformation.

I'll revisit the code in the filetypes package, as there seems to be some discrepancy, but overall it is what it is.

the offset should also be used (as the passed-in record/array may be sliced), but not for struct arrays (they are special & I don't know why).

What do you mean "special"? The offset handling for struct arrays should work precisely the same as any other type. Can you elaborate on what the issue there is?

cloudquery/filetypes#279
I noticed that when working with sliced struct arrays, arr.Data().Offset() returns unusable info. That's because the underlying arrays are sliced too, so constructing the struct array this way fails (you have to use 0).
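A sketch of the failure mode (structArr is a hypothetical struct array; the behavior follows from the setData code quoted below):

sliced := array.NewSlice(structArr, 2, 5).(*array.Struct)

// sliced.Data().Offset() is 2, but sliced.Field(i) is *also* already sliced
// to rows [2, 5), so rebuilding with both the sliced children and the
// non-zero offset applies the slice twice:
bad := array.NewData(sliced.DataType(), sliced.Len(), sliced.Data().Buffers(),
	[]arrow.ArrayData{sliced.Field(0).Data()}, sliced.NullN(),
	sliced.Data().Offset()) // has to be 0 here instead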

@zeroshade
Member

Tests requirement, it'll be removed soon (we need to sort rows in tests, so we slice to single-row records).

I've been wanting to add sort functions to the compute package (which already has Take & Filter). If you want to try your hand at contributing sort functions, I'd be happy to review them, and it would solve your issue of making single-row slices :)

I noticed that when working with sliced struct arrays the arr.Data().Offset() would return unusable info. That's because the underlying arrays are sliced, too, so constructing the struct array this way fails (you have to use 0).

I'd be curious what a minimal reproducer of this might look like. In theory for a sliced array, only the top-level needs the offset. So if your child arrays are already sliced, then it makes sense you don't need to have an offset in the top struct as offsets are only related to the Array they are attached to.

So I can think of three possible situations here:

  • There's a bug in the slicing logic that propagated an offset to children during slicing when it shouldn't have
  • There's a bug in the accessing logic (IsValid/IsNull/Value/etc.) which didn't properly handle the offset somewhere
  • There's a mistake somewhere in the code when constructing the arrays regarding handling what should or shouldn't be sliced and where

@candiduslynx
Contributor

  • There's a mistake somewhere in the code when constructing the arrays regarding handling what should or shouldn't be sliced and where
func (a *Struct) setData(data *Data) {
	a.array.setData(data)
	a.fields = make([]arrow.Array, len(data.childData))
	for i, child := range data.childData {
		if data.offset != 0 || child.Len() != data.length {
			sub := NewSliceData(child, int64(data.offset), int64(data.offset+data.length))
			a.fields[i] = MakeFromData(sub)
			sub.Release()
		} else {
			a.fields[i] = MakeFromData(child)
		}
	}
}

I think this is the culprit.

@zeroshade
Copy link
Member

I'll take a look later this week and see what I can figure out.

@assignUser assignUser transferred this issue from apache/arrow Aug 30, 2024
@candiduslynx
Contributor

cc: @yevgenypats @erezrokah (seems like the repo moved)
