The SchemaWriter interface provides the AddData method for adding row-level data. The method accepts a row as a map[string]interface{}, where the key is the column name (a string) and the value is whatever type is appropriate for the underlying data (e.g. string, int64, []byte, bool). However, this has a performance cost in the form of heap allocations and more memory for the garbage collector to manage. This is because the values are of type interface{}, which typically results in pointers to data that escapes to the heap. To improve performance, could a new method be added that accepts row data in a way that reduces heap allocations while still letting the caller handle dynamic data, as the CSV to Parquet tool does?
From a quick scan of the code base, and to my untrained eye, one way to achieve this may be to create a generic struct that encapsulates the data and to use that instead of a map.
// RowData represents a row of data in a CSV file, and can be provided to the `SchemaWriter`
type RowData struct {
	Values []FieldData
}

// FieldData represents each field/column for a row in a CSV file
// (named FieldData so it does not collide with RowData above)
type FieldData struct {
	DataName string

	/* Different data types. Only one of these should be populated at a time */
	StringValue string
	IntValue    int
	Int16Value  int16
	Int32Value  int32
	Int64Value  int64
	BoolValue   bool
}
The SchemaWriter interface can be updated to accept these new types. For example:
// AddDataRow writes a row of data to the underlying writer using the specified data and metadata
func AddDataRow(data RowData) error
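To make the idea more concrete, here is a minimal sketch of how a method like this could read a typed row without any interface{} values. Everything in it is hypothetical: the Kind tag, the Field and Row types, and describeRow (a stand-in for the writer) are for illustration only and are not part of the library. The Kind tag is an assumption added so the receiver can tell which value is set, since a zero value on its own is ambiguous.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Kind says which value of a Field is populated (hypothetical).
type Kind uint8

const (
	KindString Kind = iota + 1
	KindInt64
	KindBool
)

// Field is one column value of a row (hypothetical).
type Field struct {
	Name        string
	Kind        Kind
	StringValue string
	Int64Value  int64
	BoolValue   bool
}

// Row is one row of typed column values (hypothetical).
type Row struct {
	Values []Field
}

// describeRow stands in for an AddDataRow-style method. It dispatches on
// Kind with a plain switch instead of a type switch on interface{}.
func describeRow(row Row) (string, error) {
	var b strings.Builder
	for i, f := range row.Values {
		if i > 0 {
			b.WriteString(", ")
		}
		b.WriteString(f.Name)
		b.WriteByte('=')
		switch f.Kind {
		case KindString:
			b.WriteString(f.StringValue)
		case KindInt64:
			b.WriteString(strconv.FormatInt(f.Int64Value, 10))
		case KindBool:
			b.WriteString(strconv.FormatBool(f.BoolValue))
		default:
			return "", fmt.Errorf("field %q: no value set", f.Name)
		}
	}
	return b.String(), nil
}

func main() {
	row := Row{Values: []Field{
		{Name: "name", Kind: KindString, StringValue: "alice"},
		{Name: "age", Kind: KindInt64, Int64Value: 30},
		{Name: "active", Kind: KindBool, BoolValue: true},
	}}
	out, err := describeRow(row)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

A caller processing many rows could reuse the same Values slice by resetting it with row.Values = row.Values[:0] before each row, which avoids a new allocation per row.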
I am wondering whether my assumptions about performance are correct, whether there are any known workarounds that avoid adding new functionality, whether this is something the project wants, and, if so, what the preferred way to implement it would be.