# Multidimensional Data

vincentherrmann edited this page Jun 20, 2016 · 14 revisions

The MultidimensionalData protocol provides powerful tools for handling multidimensional data of any kind. An entity conforming to the MultidimensionalData protocol is essentially a multidimensional array. To prevent misunderstandings, we generally avoid the term "dimension" (except in the name of the package and protocol). What we would think of as one layer in a nested multidimensional array we call a "mode", and the number of elements in this layer is called the "modeSize".

All elements of a MultidimensionalData object are stored in one big flat values array. Analogously to row-major matrices, as used in C (or as a generalization of them), the first mode is the one highest up in the hierarchy. This means that consecutive elements of the first mode are spread out at equal distances over the whole values array, whereas consecutive elements of the last mode are consecutive elements of values.
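This row-major layout can be sketched with a small helper that maps a multi-index to a position in the flat values array. `flatIndex` and its parameter names are hypothetical, chosen only for illustration; they are not part of the actual protocol:

```swift
// Row-major flattening: the first mode has the largest stride,
// the last mode varies fastest.
func flatIndex(_ index: [Int], modeSizes: [Int]) -> Int {
    var flat = 0
    for (i, size) in zip(index, modeSizes) {
        flat = flat * size + i
    }
    return flat
}

let modeSizes = [2, 3, 4]
// consecutive in the last mode: right next to each other in values
flatIndex([0, 0, 1], modeSizes: modeSizes)  // == 1
// consecutive in the first mode: separated by a stride of 3 * 4 = 12
flatIndex([1, 0, 0], modeSizes: modeSizes)  // == 12
```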

*[figure: multidimensional data unfolding]*

In a C-style nested array, the inner arrays of the same layer may have different sizes. This is not possible for this protocol. The modes can all have different sizes, but each specific mode must have the same size everywhere, independent of the position in the data. We can think of a MultidimensionalData object as a "hypercuboid" that cannot be ragged.

*[figure: nested array vs multidimensional data]*

## get, set, slice

Getting and setting single elements is done via a subscript that takes one integer per mode, defining the index in that mode. The number of integers therefore has to equal the number of modes. The integers can be passed in either as a single array (`data[[3, 4, 1]]`) or as variadic arguments (`data[3, 4, 1]`); both ways are equivalent.
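A rough sketch of how such a subscript pair could be implemented; `MiniData` is a hypothetical stand-in for illustration, not the actual protocol:

```swift
// Minimal stand-in for a MultidimensionalData-style type: a flat
// row-major values array plus the size of each mode.
struct MiniData {
    var modeSizes: [Int]
    var values: [Float]

    private func flatten(_ index: [Int]) -> Int {
        var flat = 0
        for (i, size) in zip(index, modeSizes) { flat = flat * size + i }
        return flat
    }
    // array form: data[[3, 4, 1]]
    subscript(index: [Int]) -> Float {
        get { return values[flatten(index)] }
        set { values[flatten(index)] = newValue }
    }
    // variadic form: data[3, 4, 1] forwards to the array form
    subscript(index: Int...) -> Float {
        get { return self[index] }
        set { self[index] = newValue }
    }
}

var data = MiniData(modeSizes: [2, 3], values: [0, 1, 2, 10, 11, 12])
data[1, 2] = 99  // same element as data[[1, 2]]
```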

It is also possible to get and set a slice. The slice is itself a multidimensional data object of the same type with fewer modes, smaller modes, or both. Both arrays and ranges of integers can be used to define the indices of each mode contained in the slice. If a mode is retained completely in a slice, it can be indexed with the global variable `all`. If only one index of a mode is retained, this mode vanishes in the slice (in this case it is not possible to write just a single integer; instead, a range or an array with count one should be used). Again, either variadic arguments (`data[2...5, [3], all]`) or an argument array can be used; the array, however, must be preceded by the `slice:` keyword (`data[slice: [2...5, [3], all]]`). If there are fewer arguments than modes, the remaining last modes are retained completely.
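These slicing semantics can be sketched on a flat values array; `sliceData` and its parameter names are made up for this sketch. Each mode gets a list of retained indices (a fully retained mode, the role of `all`, keeps all of them), and a mode indexed with a single element vanishes from the result:

```swift
// Hedged sketch: gather the elements selected by per-mode index lists
// (a range like 2...5 would be passed as Array(2...5) here).
func sliceData(values: [Float], modeSizes: [Int], subscripts: [[Int]]) -> (values: [Float], modeSizes: [Int]) {
    var out: [Float] = []
    // recursively walk every selected index combination in row-major order
    func walk(_ mode: Int, _ flat: Int) {
        if mode == modeSizes.count { out.append(values[flat]); return }
        for i in subscripts[mode] { walk(mode + 1, flat * modeSizes[mode] + i) }
    }
    walk(0, 0)
    // a mode with only one retained index vanishes in the slice
    let newSizes = subscripts.map { $0.count }.filter { $0 > 1 }
    return (out, newSizes)
}

// a 2x3 object with values 0...5
let sliced = sliceData(values: [0, 1, 2, 3, 4, 5], modeSizes: [2, 3], subscripts: [[1], [0, 2]])
// sliced.values == [3, 5], sliced.modeSizes == [2]: the first mode vanished
```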

*[figure: 1 mode slicing]*

*[figure: 2 mode slicing]*

This kind of syntax turns out to be a bit of a struggle with Swift generics. To allow heterogeneous arrays as slicing arguments, the DataSliceSubscript protocol is defined, and Array and Range are extended to conform to it. Under the hood, the arguments have to be dynamically type checked. The slicing indices for a data object can in any case be stored as a `[DataSliceSubscript]` array.
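The idea can be sketched like this. `DataSliceSubscriptSketch` and `sliceIndices` are invented names; the real DataSliceSubscript protocol differs, and the conditional conformances shown require a newer Swift than the one this page was written for:

```swift
// A common protocol lets ranges and integer arrays live in one
// heterogeneous array of slicing arguments.
protocol DataSliceSubscriptSketch {
    var sliceIndices: [Int] { get }
}
extension Array: DataSliceSubscriptSketch where Element == Int {
    var sliceIndices: [Int] { return self }
}
extension ClosedRange: DataSliceSubscriptSketch where Bound == Int {
    var sliceIndices: [Int] { return Array(self) }
}

// a range and an array stored side by side, resolved dynamically
let arguments: [DataSliceSubscriptSketch] = [2...5, [3]]
```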

## perform and combine

There are two functions that help implement operations on MultidimensionalData objects. They reduce boilerplate code and provide automatic parallel computation. In many cases it is straightforward to extend operations on vectors and matrices to operations on arbitrary tensors by using these functions in combination with a default implementation of the underlying operation.

### performForOuterModes

```swift
func performForOuterModes(outerModes: [Int], 
                    inout outputData: [Self], 
                           calculate: (currentIndex: [DataSliceSubscript], outerIndex: [DataSliceSubscript], sourceData: Self) -> [Self], 
                         writeOutput: (currentIndex: [DataSliceSubscript], outerIndex: [DataSliceSubscript], inputData: [Self], inout outputData: [Self]) -> ())
```

We often want to operate on a subset of the modes of a single data object. These modes will either be contracted and vanish, be extracted into a new object, or be treated differently from the other modes in some other way. We could call these the "inner modes". The outer modes, which we pass to performForOuterModes as the first argument, are all modes of the data object that are not inner modes. performForOuterModes recursively constructs nested for-loops, and for every index combination of the outer modes the calculate and writeOutput closures are called.

Before we call performForOuterModes, the output data objects (often just one object) have to be created (e.g. with zeros()) and passed to the function in an array as the outputData argument.

The calculate closure will be called asynchronously by the performForOuterModes function and provided with three arguments:

  • currentIndex: The indices in self that are currently affected. Slicing self with currentIndex yields a data object containing only the inner modes, but with different values each time calculate gets called.
  • outerIndex: The current index in only the outer modes. Although it is a [DataSliceSubscript] array, it defines the position of exactly one element of the outer modes.
  • sourceData: Simply self; it has to be passed in as an argument because of the parallel computation.

One or several results have to be returned as an array. These results are then written synchronously to the outputData objects by the writeOutput closure (technically this could also be done asynchronously, but then the dispatch engine gets confused). Here we have four arguments:

  • currentIndex and outerIndex: Same as for calculate.
  • inputData: Exactly the array of data objects returned by calculate. This has to be written to the outputData.
  • inout outputData: Reference to the array of data objects given to the performForOuterModes function.

One of the simplest examples of performForOuterModes is the `sum(overModes:)` function for tensors. Here the modes to be summed over are the inner modes; they are contracted, i.e. they vanish. The outputData tensor has only the remaining, outer modes. In calculate, the values of the currentIndex slice are summed and a single scalar tensor (wrapped in an array) is returned. In writeOutput, this scalar is then written to the outerIndex of the outputData.
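As a plain illustration of what `sum(overModes:)` computes, here is a simplified serial sketch on a flat array; it is not the library's parallel implementation, and the names are hypothetical:

```swift
// Sum over the inner modes of a flat row-major array; the result keeps
// only the outer modes.
func sum(values: [Float], modeSizes: [Int], overModes inner: [Int]) -> (values: [Float], modeSizes: [Int]) {
    let outer = Array(0..<modeSizes.count).filter { !inner.contains($0) }
    let outerSizes = outer.map { modeSizes[$0] }
    var output = [Float](repeating: 0, count: max(outerSizes.reduce(1, *), 1))
    var index = [Int](repeating: 0, count: modeSizes.count)
    for (flat, v) in values.enumerated() {
        // decode the flat position into a multi-index (row-major)
        var rest = flat
        for m in stride(from: modeSizes.count - 1, through: 0, by: -1) {
            index[m] = rest % modeSizes[m]
            rest /= modeSizes[m]
        }
        // accumulate into the position given by the outer modes only
        var outFlat = 0
        for m in outer { outFlat = outFlat * modeSizes[m] + index[m] }
        output[outFlat] += v
    }
    return (output, outerSizes)
}

// summing mode 1 of a 2x3 object contracts it to a vector of size 2
let summed = sum(values: [0, 1, 2, 3, 4, 5], modeSizes: [2, 3], overModes: [1])
// summed.values == [3, 12]
```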

Slightly more complex is the `normalize(overModes:)` function for tensors. We can define an arbitrary set of modes to normalize over. Along these modes, the normalized tensor has mean zero and a standard deviation of one. All modes that are not normalized over are the outer modes. This function has three outputData objects: the normalized tensor, the mean tensor and the deviation tensor. The normalized tensor obviously has the same modes as the tensor that gets normalized, only with scaled and shifted elements. The mean and deviation tensors have only the outer modes; they contain the mean and deviation of the corresponding slice of normalized inner modes. In the calculate closure, the values of the currentIndex slice get normalized by a standard array normalization function. The normalized values are returned as a tensor with all inner modes, the mean and deviation as scalar tensors. In writeOutput, the normalized slice is written to the currentIndex of the normalized tensor; mean and deviation are written to the outerIndex of the mean and deviation tensor, respectively.
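The per-slice work inside calculate boils down to standard mean/deviation normalization of a flat values array. A sketch with an invented name (the library presumably uses an accelerated routine for this):

```swift
import Foundation  // for sqrt

// Shift a slice's values to mean zero and scale to standard deviation one;
// mean and deviation are what gets written to the outer-modes tensors.
func normalizeSlice(_ values: [Float]) -> (normalized: [Float], mean: Float, deviation: Float) {
    let mean = values.reduce(0, +) / Float(values.count)
    let variance = values.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / Float(values.count)
    let deviation = sqrt(variance)
    let normalized = values.map { ($0 - mean) / deviation }
    return (normalized, mean, deviation)
}
```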

### combine

```swift
func combine<T: MultidimensionalData>(a: T, forOuterModes outerModesA: [Int], 
                                 with b: T, forOuterModes outerModesB: [Int], 
                       inout outputData: [T],
                              calculate: (indexA: [DataSliceSubscript], indexB: [DataSliceSubscript], outerIndex: [DataSliceSubscript], sourceA: T, sourceB: T) -> [T],
                            writeOutput: (indexA: [DataSliceSubscript], indexB: [DataSliceSubscript], outerIndex: [DataSliceSubscript], inputData: [T], inout outputData: [T]) -> ())
```

The combine function is used for operations that combine two multidimensional data objects of the same type (but not necessarily with the same modes) in some way and return one or several such objects. It works very similarly to performForOuterModes, only with two input objects (a and b). The constellation of inner modes depends on the calculation to be performed, as we will see in the examples. The outer modes of the two objects are independent. We can think of the outer modes of a and the outer modes of b as being concatenated (a before b). For every index combination of this outer-modes compound, calculate and writeOutput are called by the combine function.

The calculate closure now has five arguments:

  • indexA and indexB: The currently affected indices of the inner modes of a and b. While outerIndex cycles through the outer modes of a, indexB does not change, and vice versa.
  • outerIndex: The current index in the combined outer modes.
  • sourceA and sourceB: Again simply a and b, for parallel computational reasons.

Again, the resulting data objects are returned. The workings of writeOutput should be clear.

Two different kinds of tensor multiplication will serve as examples for the application of the combine function. First we look at multiplyElementwise. Here the inner modes simply get multiplied elementwise, hence they must be equivalent: same number of modes, same order, and same mode sizes. In the calculate step, a conventional function for elementwise vector multiplication is used to multiply the values arrays of the a and b slices. The returned tensor consists of the inner modes and has the product array as values. This tensor is written to the outerIndex slice of the outputData.
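The calculate step of multiplyElementwise then reduces to an ordinary vector multiplication of the two slices' values arrays, sketched here with plain zip instead of an accelerated routine:

```swift
// Because the inner modes of a and b are equivalent, their slices'
// flat values arrays have equal length and can be multiplied pairwise.
func multiplyElementwise(_ a: [Float], _ b: [Float]) -> [Float] {
    precondition(a.count == b.count, "inner modes must have the same sizes")
    return zip(a, b).map(*)
}
```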

The multiply function for tensors is a lot more sophisticated; the use of the combine function, however, is quite similar. In this kind of tensor multiplication, some modes that exist in both tensors may be summed over. To achieve this, we want to make as much use as possible of accelerated matrix multiplication algorithms. For this reason, the modes of both tensors are reordered if necessary, and the combine function is applied. Note that the inner modes in this case are not the summation modes. Instead, they are the modes whose slices, or rather the values arrays of those slices, can be used as matrices in a matrix multiplication. This means that the indexA slice, the indexB slice and the returned slice might all be different. The exact properties of these slices have been calculated beforehand. The outer modes are those modes that cannot be integrated into a matrix multiplication.