DataSeries object for time-based objects #247

Open
wants to merge 62 commits into
base: main

Conversation

lewardo
Member

@lewardo lewardo commented Aug 22, 2023

@weefuzzy For many potential new objects, implementing a DataSeries object for datasets encompassing time would be incredibly useful.

In terms of implementation, I had a few ideas but do not know which would fit best with the rest of the codebase:

  1. A client wrapper around a DataSetClient, with each point being N * frameLen long and dealing with the views and accessing in the wrapper/client
  2. A client based on a DataSet of rank 2, and the time being captured in the second point dimension
  3. A whole new DataSeries algorithm that is similar to the DataSet but separates the subtleties out in a new alg for cleanliness

The issue I'm seeing with all of them is memory allocation when points are updated in real time. I wanted to confirm that the way FluidTensors grow and shrink is memory-efficient, so that updating the length of a point in the middle of a dataset won't have to shift the rest of it down in memory.
Currently, you can update a point but not add to it, so pushing a time frame onto a point would involve copying that point in its entirety (not a FluidTensorView), concatenating externally, and then replacing that point in the DataSet, unless FluidTensor were given an equivalent of push_back that would allow in-place expansion of central elements.

For the interface, @tremblap liked the idea of keeping one similar to DataSet, namely that addPoint be the message to push another frame from a buffer to that ID in the dataset. Additional messages could be implemented to load a whole series from a buffer (the issue here would be that the DataSeries time dimension would not be the buffer time dimension; time in the buffer would have to be captured from channel 1 to T), but all of these ideas are still up for discussion.

This object is the gateway to implementing the more useful algs like DTW and various flavours of RNN.

Apologies for the barrage of UI and implementation questions. I am aware you haven't much time, but any guidance, however brief, would be appreciated: I don't know which style of implementation would be best, or which nuances of my implementation may or may not lead to terrible memory performance.

@lewardo lewardo changed the title from "Data series object for time-based" to "Data series object for time-based objects" Aug 22, 2023
@lewardo lewardo changed the title from "Data series object for time-based objects" to "DataSeries object for time-based objects" Aug 22, 2023
@weefuzzy
Member

Thanks @lewardo

Yes, we need something like this container for time-series algorithms, but it's a can of worms. I'd started to think about it for my Echo State Network work (a kind of RNN), but ran out of time.

So, aspects of the can-of-worms-ness:

– The basic interface I had in mind was that each point would indeed be m frames of n-dim vectors, with the imposed constraint that within a DataSeries (or whatever) n is invariant across the whole container.
– That's immediately a pain in the arse on PD, which doesn't have multichannel arrays, and therefore no convenient mechanism for adding points (though the existing hackyness with clone could presumably be made to work)
– Meanwhile, it wasn't at all clear that the underlying data structure of DataSet would do for this. DataSet is neither especially cache-friendly, nor at all thread-safe, so I wasn't wild about making any more code depend on it.
– Updates into the middle of FluidTensor aren't especially efficient, because the underlying container is a std::vector. There are other types of semi-contiguous container that try and strike different balances between insertion cost and cache friendliness, but not yet in the standard lib. Moreover, we have to deal with a map structure as well, and the looming problems of thread safety.
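To make the insertion cost in the last point concrete, here is a minimal sketch of a flat, fully contiguous layout. `FlatSeries`, its `offsets` table and `appendFrame` are hypothetical names invented for illustration; this is not FluidTensor or DataSet code, only a plain std::vector standing in for "one contiguous block".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat layout: all frames of all points live in one std::vector,
// and offsets[i] marks where point i starts (offsets.back() is the end).
struct FlatSeries
{
  std::size_t              dims;     // n, fixed for the whole container
  std::vector<double>      data;     // every frame, back to back
  std::vector<std::size_t> offsets;  // offsets.size() == numPoints + 1

  explicit FlatSeries(std::size_t n) : dims(n), offsets(1, 0) {}

  // Append an empty point at the end.
  void addPoint() { offsets.push_back(data.size()); }

  // Append one frame to point p. Returns how many existing elements had to
  // move to make room, which is the cost under discussion.
  std::size_t appendFrame(std::size_t p, const std::vector<double>& frame)
  {
    assert(frame.size() == dims && p + 1 < offsets.size());
    std::size_t insertAt = offsets[p + 1];
    std::size_t moved = data.size() - insertAt;
    data.insert(data.begin() + static_cast<std::ptrdiff_t>(insertAt),
                frame.begin(), frame.end());
    // Every later point now starts one frame further along.
    for (std::size_t i = p + 1; i < offsets.size(); ++i) offsets[i] += dims;
    return moved;
  }
};
```

Appending to the last point moves nothing, while appending to an earlier point moves everything stored after it, so the cost grows with the amount of data that follows.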

FWIW, DataSet should never be mutated in the audio thread, until we have more sophisticated stuff in place both for concurrency and allocation.

@lewardo
Member Author

lewardo commented Aug 22, 2023

As a first concept to allow parallel development of time-based algorithms, do you think a std::vector of DataSets would be a wise, if hacky, temporary solution? It would circumvent the middle-access issue, as each DataSet (representing a single time slice/set of frames) isn't ordered, and as long as I deal with having the same ID for each slice it could feasibly be more efficient than other options. It does, however, introduce the overhead of many identical ID sets etc., so I could implement a lightweight version of a DataSet that is one of many in the DataSeries object, each time slice being one of those, unordered, with the ID-index mappings allowing middle insertion to be more efficient?

@weefuzzy
Copy link
Member

I think an adapted container that used a std::vector<std::vector<T>> internally, and a single map, would cause you less pain. Longer-term, I don't think middle insertion is so common that its efficiency should necessarily be of prime importance, compared to thinking about the cache friendliness of typical access patterns.
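A rough sketch of what such an adapted container might look like, assuming each series is stored as flattened frames of a fixed dimension. `SeriesStore`, `addFrame`, `numFrames` and `at` are hypothetical names for illustration, not anything in the codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container: one flattened frame list per ID, one ID -> index map.
template <typename T>
class SeriesStore
{
public:
  explicit SeriesStore(std::size_t dims) : mDims(dims) { assert(dims > 0); }

  // Push one n-dimensional frame onto the series for id, creating the
  // series if needed. Only that series' own vector grows.
  void addFrame(const std::string& id, const std::vector<T>& frame)
  {
    assert(frame.size() == mDims);
    auto found = mIndex.find(id);
    if (found == mIndex.end())
    {
      found = mIndex.emplace(id, mSeries.size()).first;
      mSeries.emplace_back();
    }
    std::vector<T>& series = mSeries[found->second];
    series.insert(series.end(), frame.begin(), frame.end());
  }

  // Number of frames stored for id (0 if the ID is unknown).
  std::size_t numFrames(const std::string& id) const
  {
    auto found = mIndex.find(id);
    return found == mIndex.end() ? 0 : mSeries[found->second].size() / mDims;
  }

  // Value of dimension d in frame t of id; the ID must exist.
  T at(const std::string& id, std::size_t t, std::size_t d) const
  {
    const std::vector<T>& series = mSeries[mIndex.at(id)];
    assert(t < series.size() / mDims && d < mDims);
    return series[t * mDims + d];
  }

private:
  std::size_t                                  mDims;    // n, invariant
  std::vector<std::vector<T>>                  mSeries;  // one entry per ID
  std::unordered_map<std::string, std::size_t> mIndex;   // ID -> entry
};
```

Because each series owns its storage, pushing a frame onto any ID never shifts another ID's data; the trade-off is that the series are scattered across separate allocations.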

@lewardo
Member Author

lewardo commented Aug 23, 2023

In terms of integrating that into the FluidTensor ecosystem, do you think the best route would be to create a raw std::vector of FluidTensors, or a FluidTensor of rank one higher? The latter would bring back the 'middle insertion' issues when appending a new frame to the end of a central element (iirc under the hood it's all one contiguous std::vector).

@weefuzzy
Copy link
Member

Hmm. Probably a std::vector<FluidTensor<T,2>> makes life easiest – then indexing map just needs to have an integral type for the index of a given ID.
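As an illustration of why an integral index in the map is convenient, here is a sketch of removal by swap-and-pop. `Frames` is a plain stand-in for FluidTensor<T,2> (the real FluidTensor API is not used), and `IndexedSeries` with its members is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for FluidTensor<T, 2>: a row-major frames x dims matrix.
// The real FluidTensor API is NOT used here.
struct Frames
{
  std::size_t         rows;
  std::size_t         cols;
  std::vector<double> values;
};

struct IndexedSeries
{
  std::vector<Frames>                          series;  // one matrix per ID
  std::vector<std::string>                     ids;     // index -> ID
  std::unordered_map<std::string, std::size_t> index;   // ID -> index

  void add(const std::string& id, Frames f)
  {
    assert(index.count(id) == 0);
    index[id] = series.size();
    series.push_back(std::move(f));
    ids.push_back(id);
  }

  // Remove by swapping with the last entry, so no other series moves and
  // only one map entry needs re-pointing.
  bool remove(const std::string& id)
  {
    auto found = index.find(id);
    if (found == index.end()) return false;
    std::size_t i = found->second;
    std::size_t last = series.size() - 1;
    if (i != last)
    {
      std::swap(series[i], series[last]);
      std::swap(ids[i], ids[last]);
      index[ids[i]] = i;  // existing key, so found stays valid
    }
    series.pop_back();
    ids.pop_back();
    index.erase(found);
    return true;
  }
};
```

The order of entries in the vector is not preserved by this kind of removal, which is acceptable only if the order of IDs carries no meaning.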

@lewardo
Member Author

lewardo commented Aug 23, 2023

Also, from what I understand, with a single ID-index map, if a user were to add multiple frames to one point before adding to another point later, it would introduce the issue that T=n for point X will not be at the same index for every X.
E.g. addPoint id-1 inBuf; addPoint id-2 inBuf; addPoint id-2 inBuf would, as far as I can see, lead to a data structure like

T=0 {id-1}{id-2}
T=1 {id-2}

so the std::unordered_map would have to map id-2 to index 1 for T=0 and index 0 for T=1?
Or should pushing a point result in the same index in each vector of frames, but then the order in which the user pushes the points will lead to potentially massive overhead with blank vector indices...
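The index mismatch described here can be reproduced with a small sketch of the time-major layout, assuming one ID-to-position map per time slice. `TimeSlice`, `TimeMajorSeries` and this `addPoint` are hypothetical stand-ins for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical time-major layout: slices[t] holds every ID's frame at time t.
struct TimeSlice
{
  std::vector<std::vector<double>>             frames;  // one frame per ID present
  std::unordered_map<std::string, std::size_t> index;   // ID -> position in frames
};

struct TimeMajorSeries
{
  std::vector<TimeSlice>                       slices;
  std::unordered_map<std::string, std::size_t> lengths;  // frames pushed so far per ID

  // Push the next frame for id into the slice matching its current length.
  void addPoint(const std::string& id, const std::vector<double>& frame)
  {
    std::size_t t = lengths[id]++;
    if (t == slices.size()) slices.emplace_back();
    TimeSlice& slice = slices[t];
    slice.index[id] = slice.frames.size();
    slice.frames.push_back(frame);
  }
};
```

After the three addPoint calls from the example, id-2 sits at position 1 in slice 0 and position 0 in slice 1, so a single shared map cannot describe both slices.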

@lewardo
Member Author

lewardo commented Aug 23, 2023

or perhaps I'm thinking of it the wrong way around, would time be down the std::vector along the FluidTensor?

@weefuzzy
Member

weefuzzy commented Aug 23, 2023

Ah, not quite how I was imagining the representation. In my mind, each ID maps to a time sequence, e.g:

ID      index  content
point1  0      T0..Tx
point2  1      T0..Ty
point3  2      T0..Tz

So, for an RNN of any variety, each training point is itself a time sequence (and remember that, for training an RNN, we'll want to shuffle the order of the sequences without breaking their internal order, because the component frames aren't independent).
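One way to shuffle sequences without disturbing the frames inside them is to shuffle a list of indices rather than the data. This is a sketch only; `shuffledOrder` is a hypothetical helper, and the seed parameter is there just to make runs repeatable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Return a shuffled visiting order over numSequences sequences. The sequences
// themselves are never touched, so each keeps its internal frame order.
std::vector<std::size_t> shuffledOrder(std::size_t numSequences, unsigned seed)
{
  std::vector<std::size_t> order(numSequences);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::mt19937 rng(seed);
  std::shuffle(order.begin(), order.end(), rng);
  return order;
}
```

A training loop would then visit sequence order[0], order[1], and so on, reading each one from T0 to its end in the stored order.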

@weefuzzy
Member

Oh, our replies crossed

@lewardo
Member Author

lewardo commented Aug 24, 2023

After a brief meeting with @tremblap to go through example patches for use cases, we settled on an interface for reading and writing time series of data from a DataSeries object, which is now functional.
The getseries, setseries, updateseries and removeseries messages all accept an ID and (with the exception of the last) a buffer to read from/write to.
The buffer will have as many channels as the extent of each datum/frame, and as many frames as there are for that ID in the DataSeries.
We are aware that this diverges from the interface for getting/setting single frames, where the buffer time dimension is the frame extent, whereas for getting/setting series, time is time and the channels are the extent. Any ideas for UI refactoring for consistency would be welcome, although in the many example patches we went through in our meeting it seemed intuitive enough.
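The series layout described above (channels as frame extent, buffer frames as time steps) can be sketched as follows. `Buffer`, `seriesToBuffer` and `bufferToSeries` are hypothetical names, and the channel-major sample order is an assumption rather than how any particular host stores its buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host buffer: channel-major, samples[c * numFrames + t].
struct Buffer
{
  std::size_t         numChans;
  std::size_t         numFrames;
  std::vector<double> samples;
};

using Series = std::vector<std::vector<double>>;  // series[t][d]

// Sketch of what a getseries-style read might produce: one channel per
// dimension of a frame, one buffer frame per time step.
Buffer seriesToBuffer(const Series& series, std::size_t dims)
{
  Buffer buf{dims, series.size(), std::vector<double>(dims * series.size())};
  for (std::size_t t = 0; t < series.size(); ++t)
  {
    assert(series[t].size() == dims);
    for (std::size_t d = 0; d < dims; ++d)
      buf.samples[d * buf.numFrames + t] = series[t][d];
  }
  return buf;
}

// The inverse, as a setseries-style write might consume it.
Series bufferToSeries(const Buffer& buf)
{
  Series series(buf.numFrames, std::vector<double>(buf.numChans));
  for (std::size_t t = 0; t < buf.numFrames; ++t)
    for (std::size_t d = 0; d < buf.numChans; ++d)
      series[t][d] = buf.samples[d * buf.numFrames + t];
  return series;
}
```

A single-frame get/set would instead lay the frame's dimensions along the buffer's time axis, which is the inconsistency mentioned above.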

@lewardo
Member Author

lewardo commented Aug 24, 2023

There are various slicing operations that could be useful and that would use DataSet objects. For example, the getDataSet message now gets a time slice of the DataSeries and returns it as a DataSet, ignoring any IDs that don't have that time frame.
Other similar messages could include getting a DataSet view into a DataSeries time slice, so that writing to the view changes the DataSeries from another object? Of course, this would be in addition to the usual {replace,remove,add,get}DataSet messages to DataSeries.
In essence, these would be much like the frame-based and series-based operations, but acting on a different slice of the underlying tensor.
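A minimal sketch of the time-slice idea, assuming each series is simply a list of frames keyed by ID. `timeSlice` is a hypothetical free function, and the std::map return type only stands in for a DataSet.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Frame  = std::vector<double>;
using Series = std::vector<Frame>;  // frames in time order

// Collect frame t of every series long enough to have one. Shorter series
// are skipped, so their IDs do not appear in the result.
std::map<std::string, Frame> timeSlice(const std::map<std::string, Series>& all,
                                       std::size_t t)
{
  std::map<std::string, Frame> slice;
  for (const auto& entry : all)
    if (t < entry.second.size()) slice[entry.first] = entry.second[t];
  return slice;
}
```

This version copies the frames out; a writable view into the series, as floated above, would need the slice to reference the original storage instead.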

@tremblap
Member

Hey @lewardo do you want assert report (on print function) here or elsewhere?

3 participants