Background and motivation

The current arithmetic and computation API of the DataFrame is inconsistent and quite slow in scenarios that involve columns of different types, because casting a column to a different type requires copying the entire column data.
Moreover, some methods of the computation API may produce inaccurate results (due to overflow).
The motivation for this change includes:
- reusing generic operators from the future TensorPrimitives API for fast computation over DataFrame columns of different types (and possibly even chaining such operations to avoid the intensive memory usage of storing intermediate calculation results)
- accurate calculation of aggregation functions
- increased performance, with SIMD instructions used for a significant part of DataFrame operations
- reduced DataFrame code duplication
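The chaining idea mentioned above can be sketched in a few lines. This is a hypothetical, language-neutral Python illustration (not the DataFrame or TensorPrimitives API): each chunk flows through the whole pipeline of elementwise operations before the next chunk is read, so no full-size intermediate column is ever materialized.

```python
def chain(source, *ops, chunk_size=4):
    """Apply a pipeline of elementwise operations chunk by chunk,
    avoiding a full-size intermediate result per operation."""
    out = []
    for start in range(0, len(source), chunk_size):
        chunk = source[start:start + chunk_size]
        for op in ops:
            # only one small chunk lives in memory between operations
            chunk = [op(x) for x in chunk]
        out.extend(chunk)
    return out

data = list(range(10))
# (x + 1) * 2 computed in one pass over the data
result = chain(data, lambda x: x + 1, lambda x: x * 2)
print(result)  # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```

A real implementation would hand each chunk to vectorized kernels instead of Python lambdas, but the memory behavior is the same.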
Details of current implementation limitations

Inconsistency of the API is related to the types of data returned by arithmetic operations over PrimitiveColumn<T> instances and their concrete aliases (like DoubleDataFrameColumn or Int32DataFrameColumn). For example, adding two columns of different types (e.g. int and double) referenced through their concrete aliases results in a sum column containing int values, while the same code referring to the columns through the parent type results in a sum column containing double values.

Aggregation functions can produce inaccurate results. Currently an aggregation is calculated using the same type as the input column; for a byte column, for example, the accumulator is also a byte. Summing a byte column can therefore overflow and produce, for instance, a sum equal to 1 and a mean equal to 0.5, which is obviously incorrect.
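The overflow described above can be reproduced without the DataFrame API at all. The following Python sketch simulates a byte-typed accumulator (wrapping modulo 256); the input values `[128, 129]` are an assumed example chosen because they yield exactly the sum of 1 and mean of 0.5 mentioned above.

```python
def byte_sum(values):
    """Sum values the way a byte-typed accumulator would:
    the running total wraps around at 256 on overflow."""
    acc = 0
    for v in values:
        acc = (acc + v) % 256  # byte accumulator silently wraps
    return acc

column = [128, 129]        # illustrative byte column; true sum is 257
s = byte_sum(column)       # 257 wraps to 1
mean = s / len(column)     # 0.5 instead of the correct 128.5
print(s, mean)             # 1 0.5
```

With a widened (e.g. long) accumulator, the same input would correctly give a sum of 257 and a mean of 128.5.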
Proposal

The proposal is:
- to use a common numeric type for the arithmetic output (the smallest numeric type which can accommodate any value of any input). If any input is a floating point type, the common numeric type is the widest floating point type among the inputs; otherwise the common numeric type is integral, and is signed if any input is signed.
- to return the widest possible type from accumulating aggregation functions (like sum, product, etc.): double for floating point types (double, float, Half), long for signed integers (long, int, short, sbyte), ulong for unsigned integers (ulong, uint, ushort, byte), and the input type for other aggregation functions (like min, max, first, etc.).

This approach is used in Apache Arrow Compute Functions.
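The common-numeric-type rule above can be made concrete with a small sketch. This is a hypothetical helper, not part of any proposed API; it encodes the C# type names as strings and picks the smallest type whose value range covers both inputs (the `"double"` fallback for pairs like long + ulong, which have no common integral type, is my assumption, not part of the proposal).

```python
FLOATS = ["Half", "float", "double"]  # narrowest -> widest

# (min, max) value ranges of the C# integral types
RANGES = {
    "sbyte": (-2**7, 2**7 - 1),   "byte": (0, 2**8 - 1),
    "short": (-2**15, 2**15 - 1), "ushort": (0, 2**16 - 1),
    "int": (-2**31, 2**31 - 1),   "uint": (0, 2**32 - 1),
    "long": (-2**63, 2**63 - 1),  "ulong": (0, 2**64 - 1),
}
SIGNED = ["sbyte", "short", "int", "long"]      # narrowest -> widest
UNSIGNED = ["byte", "ushort", "uint", "ulong"]  # narrowest -> widest

def common_numeric_type(a, b):
    """Smallest type that can accommodate any value of either input."""
    if a in FLOATS or b in FLOATS:
        # widest floating point type among the floating point inputs
        idx = [FLOATS.index(t) for t in (a, b) if t in FLOATS]
        return FLOATS[max(idx)]
    lo = min(RANGES[a][0], RANGES[b][0])
    hi = max(RANGES[a][1], RANGES[b][1])
    # integral result: signed if any input is signed
    candidates = SIGNED if lo < 0 else UNSIGNED
    for t in candidates:  # smallest candidate covering both ranges
        if RANGES[t][0] <= lo and RANGES[t][1] >= hi:
            return t
    return "double"  # assumed fallback, e.g. for long + ulong

print(common_numeric_type("int", "float"))   # float
print(common_numeric_type("sbyte", "byte"))  # short
print(common_numeric_type("int", "uint"))    # long
```

Note how sbyte + byte promotes to short: neither input type can hold the full combined range -128..255, so the result widens.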
Implementation

The implementation can reuse the low-level parts of the future generic TensorPrimitives API, like IUnaryOperator<T>, IBinaryOperator<T>, and their concrete implementations.
These operators allow invoking highly efficient vectorized calculations over arguments of the same type. The new implementation should extend this functionality by providing efficient vectorized converters and conversion rules for the cases when the arguments have different data types.
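The shape of such an executor can be sketched as follows. This is a hypothetical Python illustration, not the TensorPrimitives API: both inputs are first widened to the common type, and only then does the same-type kernel (the part a real implementation would hand to a SIMD-vectorized IBinaryOperator<T>) run.

```python
import array

def widen(column, typecode):
    """Convert a whole column to a wider element type in one pass
    (a real implementation would use a vectorized widener)."""
    return array.array(typecode, column)

def execute_binary(op, left, right, common_typecode):
    """Widen both inputs to the common type, then run the
    same-type elementwise kernel over them."""
    lw = widen(left, common_typecode)
    rw = widen(right, common_typecode)
    return array.array(common_typecode,
                       (op(a, b) for a, b in zip(lw, rw)))

ints = array.array("i", [1, 2, 3])           # int column
doubles = array.array("d", [0.5, 0.5, 0.5])  # double column
result = execute_binary(lambda a, b: a + b, ints, doubles, "d")
print(list(result))  # [1.5, 2.5, 3.5]
```

The conversion rules proposed above would be responsible for choosing `common_typecode`; the kernel itself never has to deal with mixed types.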
Example

I created a POC of a possible implementation.
Apache Arrow doesn't have a .NET implementation of Compute Functions, so I took that as an exercise for the POC. DataFrame follows the same paradigm, so all the ideas can be applied to it as well.
I extended several operators from the TensorPrimitives API to have generic versions (this is done to illustrate how the future generic Tensor API could look and be used). I also created several OperatorExecutors and Functions classes that provide the rules for argument type conversion in several cases (binary arithmetic calculations and two kinds of aggregation functions), as well as a list of wideners and converters that leverage SIMD vectorized calculation as an example.
This is great. Could the DataFrame also support streaming operations (similar to how Apache Arrow Acero is architected)? I've recently been looking to implement support for Acero in the C# Arrow client: apache/arrow#37544

@davesearle DataFrame was designed to handle situations where all the data is in memory, so streaming was not a primary goal. However, DataFrame allows conversion to a collection of Arrow RecordBatches (in most cases without copying memory), and that's why it can potentially be used as an input for an Acero record_batch_source.