Add Performant Low-level Operation For Adding Row #34

AnthonyMBonafide · 2021-06-28T19:52:06Z

The SchemaWriter interface supports adding row-level data via the AddData method. This method accepts the row as a map[string]interface{}, which lets the caller provide the column name as the key (string) and the value in whatever form is appropriate for the underlying data (e.g. string, int64, []byte, bool). However, this comes with a performance cost in the form of heap allocations and more memory for the garbage collector to manage. This is because the map values are of type interface{}, which results in the use of pointers and values escaping to the heap. To improve performance, could a new method be added that accepts row data in a way that reduces heap allocations while still giving the caller control over dynamic data, as in the CSV to Parquet tool?
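To make the allocation concern concrete, here is a minimal sketch that compares the two row shapes with `testing.AllocsPerRun`. The `rowWriter` interface, the `discard` implementation, and the `field` type are hypothetical stand-ins written for this comparison, not the library's API. Exact counts depend on the Go version, so treat the output as indicative only.

```go
package main

import (
	"fmt"
	"testing"
)

// field is a hypothetical typed cell, used only for this comparison.
type field struct {
	Name       string
	Int64Value int64
}

// rowWriter is a hypothetical stand-in for an interface such as SchemaWriter.
type rowWriter interface {
	AddData(row map[string]interface{}) error
	AddFields(row []field) error
}

// discard sums the integer values so the work cannot be optimized away.
type discard struct{ sum int64 }

func (d *discard) AddData(row map[string]interface{}) error {
	for _, v := range row {
		if n, ok := v.(int64); ok {
			d.sum += n
		}
	}
	return nil
}

func (d *discard) AddFields(row []field) error {
	for _, f := range row {
		d.sum += f.Int64Value
	}
	return nil
}

// w is a package-level interface value, so calls use dynamic dispatch and
// the compiler has to assume the arguments escape.
var w rowWriter = &discard{}

// mapRowAllocs reports the average allocations per row when each row is
// built as a fresh map[string]interface{}.
func mapRowAllocs(a, b int64) float64 {
	return testing.AllocsPerRun(1000, func() {
		_ = w.AddData(map[string]interface{}{"id": a, "count": b})
	})
}

// fieldRowAllocs reports the average allocations per row when a typed
// slice is reused between rows.
func fieldRowAllocs(a, b int64) float64 {
	buf := make([]field, 0, 2)
	return testing.AllocsPerRun(1000, func() {
		buf = append(buf[:0],
			field{Name: "id", Int64Value: a},
			field{Name: "count", Int64Value: b},
		)
		_ = w.AddFields(buf)
	})
}

func main() {
	fmt.Println("map row allocs/op:   ", mapRowAllocs(1000, 2000))
	fmt.Println("struct row allocs/op:", fieldRowAllocs(1000, 2000))
}
```

The typed slice can be reused across rows because its elements hold the values inline, whereas each non-trivial value stored in an interface{} generally needs its own heap allocation.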

From a quick scan of the code base, and to my untrained eye, one way to achieve this may be to create a generic struct that encapsulates the data and to use that rather than a map.

// RowData represents a row of data in a CSV file, and can be provided to the `SchemaWriter`
type RowData struct {
	Values []FieldData
}

// FieldData represents each field/column for a row in a CSV file.
// (Named FieldData here so it does not collide with RowData above.)
type FieldData struct {
	DataName string

	/*
		Different data types.
		Only one of these should be populated at a time
	*/
	StringValue string
	IntValue    int
	Int16Value  int16
	Int32Value  int32
	Int64Value  int64
	BoolValue   bool
}

The SchemaWriter interface can be updated to accept these new types. For example:

// AddDataRow writes a row of data to the underlying writer using the specified data and metadata
func AddDataRow(data RowData) error
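
As a rough illustration of how a caller such as a CSV conversion tool might build the row for a method like this, the sketch below parses one CSV record into a reusable row value. It assumes the per-column type has its own name (`FieldData` here) and uses a trimmed set of value fields. The `column`, `columnKind`, and `fillRow` names are hypothetical helpers for this example and are not part of the library.

```go
package main

import (
	"fmt"
	"strconv"
)

// FieldData and RowData mirror the proposal above with fewer value fields.
type FieldData struct {
	DataName    string
	StringValue string
	Int64Value  int64
	BoolValue   bool
}

type RowData struct {
	Values []FieldData
}

// columnKind is a hypothetical tag saying which value field a column uses.
type columnKind int

const (
	kindString columnKind = iota
	kindInt64
	kindBool
)

// column is a hypothetical description of one CSV column.
type column struct {
	Name string
	Kind columnKind
}

// fillRow parses one CSV record into row, reusing the backing array of
// row.Values so that steady-state rows need no new slice allocation.
func fillRow(row *RowData, cols []column, record []string) error {
	if len(record) != len(cols) {
		return fmt.Errorf("record has %d fields, want %d", len(record), len(cols))
	}
	row.Values = row.Values[:0]
	for i, c := range cols {
		f := FieldData{DataName: c.Name}
		switch c.Kind {
		case kindString:
			f.StringValue = record[i]
		case kindInt64:
			n, err := strconv.ParseInt(record[i], 10, 64)
			if err != nil {
				return fmt.Errorf("column %q: %w", c.Name, err)
			}
			f.Int64Value = n
		case kindBool:
			b, err := strconv.ParseBool(record[i])
			if err != nil {
				return fmt.Errorf("column %q: %w", c.Name, err)
			}
			f.BoolValue = b
		}
		row.Values = append(row.Values, f)
	}
	return nil
}

func main() {
	cols := []column{{"name", kindString}, {"age", kindInt64}, {"active", kindBool}}
	records := [][]string{{"alice", "42", "true"}, {"bob", "7", "false"}}

	var row RowData // reused for every record
	for _, rec := range records {
		if err := fillRow(&row, cols, rec); err != nil {
			fmt.Println("error:", err)
			continue
		}
		// A real caller would pass row to the writer here, e.g. AddDataRow(row).
		fmt.Println(row.Values[0].StringValue, row.Values[1].Int64Value, row.Values[2].BoolValue)
	}
}
```

One open design question with this shape is that a zero value (0, "", false) cannot be told apart from an unset field, so a real implementation would likely need the schema, or a tag like `columnKind`, to know which field to read.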

I am wondering whether my assumptions regarding performance are correct, whether there are any known workarounds other than adding new functionality, whether this is something that is desired for this project, and what the preferred way to achieve it would be.


panamafrancis · 2022-02-02T09:41:07Z

We won't consider union types. However, with the advent of Go generics, we will take another look at this topic when we have the chance.
