[API] Redesign towards pytorch-forecasting 2.0 #1736
Having reviewed multiple code bases: the model layer will mostly follow the data layer. My suggestions for high-level requirements on data loaders:
Observations of the current API:
The explicit specifications can be reconstructed from usage and docstrings, for convenience listed below:
Question for @jdb78 - why did you design the current data loading API this way? Side note: one possible design is to have data loaders that are specific to neural networks, facing a more general API.
@geetu040, what I mean is the format of the data loader output. So, for ...
@fkiraly, what are your views on model initialization in the current design?
The idea is that basically all models can be represented as encoder/decoder. In some cases they are the same.
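For concreteness, a minimal sketch of what such a shared representation could look like - `EncoderDecoderBase`, `_encode`, and `_decode` are hypothetical names for illustration, not existing pytorch-forecasting API:

```python
import torch
from torch import nn


class EncoderDecoderBase(nn.Module):
    """Hypothetical base class: forward = decode(encode(x))."""

    def _encode(self, x: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    def _decode(self, z: torch.Tensor) -> torch.Tensor:
        # identity by default, covering the case where encoder and
        # decoder "are the same"
        return z

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._decode(self._encode(x))
```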
Is that really true, though, for all models out there? And do we need this as a base assumption? See for instance Amazon Chronos, or Google TimesFM. What I think we need for 2.0 is an API design that can cover all torch-based forecasting models.
I think there is no serious problem with that, as ultimately it calls ...
Yup, the interface complication is the main concern, considering use cases like those involving a single model. But yes, the positive and negative sides need a cost-benefit analysis before a final decision.
Some further thoughts about the design:
I would hence suggest, on the ...
I would like to add the following topics to the discussion of redesigning the API:
Specific replies to thoughts from above.
Should the ...
Why should the loader be model specific? The loader's task is to provide an iterable by combining a dataset and a sampler. Unfortunately, I haven't attended any of the planning sessions, so the following question might already be answered. Why are we currently aiming to have one ...?
@benHeid, excellent points! Replies to your raised design angles:
Good idea to introduce a separation here. Question: why do you think we cannot modify the trainer in the current state of the package? More generally, what architectures can you think of that would allow treating all the abovementioned trainers under one interface?
I think this is more complex than it needs to be and needs a redesign, but it does not affect the base API as such, as far as I can see - therefore I would leave it until later, after the base API rework. My approach would be to replace inheritance with tags and/or polymorphism in a smaller number of base classes. The model specific ones perhaps need to stay, but the type specific ones could be merged. However, I think this makes sense only after we have reworked the base API, as that will determine how the base classes will look.
Good question. The idiomatic architecture for torch is that a `Dataset` returns individual samples, while the `DataLoader` combines them into batches. As mentioned, the current architecture violates this idiom, since the current `TimeSeriesDataSet` takes on much of this work itself.
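For reference, the torch idiom referred to here, in minimal form - `WindowDataset` is an illustrative toy class, not proposed API:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class WindowDataset(Dataset):
    """Toy dataset: each item is one (window, target) pair."""

    def __init__(self, series: torch.Tensor, window: int):
        self.series = series
        self.window = window

    def __len__(self) -> int:
        return len(self.series) - self.window

    def __getitem__(self, i: int):
        # returns a single sample; batching is the DataLoader's job
        return self.series[i : i + self.window], self.series[i + self.window]


series = torch.arange(100, dtype=torch.float32)
loader = DataLoader(WindowDataset(series, window=24), batch_size=8, shuffle=True)
```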
Specific replies to thoughts from above.
I would disagree with the opinion that slicing should be part of the dataset. I interpret the architectural intention (which is not very clear in the docs) as: datasets provide samples, loaders assemble batches. If we accept this architectural intention as correct, this implies:
My reasoning is that some, but not all, models require a loader that has encoder/decoder specifics. Examples are most current models; counterexamples are some models in DSIPTS and the above linked foundation models. If we accept that some models require encoder/decoder batches, while some others do not, and we think that this should be done in a loader, then it follows that (some) loaders must be model specific - see the sketch below.
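One hedged way to reconcile this with a generic dataset is a model-specific `collate_fn` on the stock loader, so the encoder/decoder layout lives in the loader, not the dataset. A minimal sketch - all names and the 18/6 split are illustrative only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def encdec_collate(batch):
    # split each pre-sliced window into encoder (context) and
    # decoder (horizon) parts; this layout is model specific
    windows = torch.stack([item[0] for item in batch])
    return {
        "encoder_input": windows[:, :18],
        "decoder_input": windows[:, 18:],
    }


windows = torch.randn(100, 24)  # 100 pre-sliced windows of length 24
loader = DataLoader(TensorDataset(windows), batch_size=8, collate_fn=encdec_collate)
batch = next(iter(loader))      # dict with "encoder_input", "decoder_input"
```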
I think this is a corollary of how you envision the separation of `DataSet` and `DataLoader`.
To clarify, that is not the aim in my design. In my design, there would probably be a small number of loader classes. We should of course not need to define one loader per model. The key problem is actually the state you are pointing out as problematic - currently, we have to define one per model.
No problem, we will have a big one with new mentees, @agobbifbk plus team, in the new year (the FBK team will return next week from holiday). @thawn, you are very welcome to attend and participate in planning too. We will align on a date for this in the ...
Technically it is possible. However, the user has to change the Trainer in that case. So I think introducing a Trainer with a major version release is more intuitive. Then the user does not have to change it.
I think all architectures can be trained with that trainer. However, there might be features that we would like to implement that are time series specific, e.g., truncated backpropagation through time, which was removed from the lightning trainer: #1581
With regard to the discussion of where the slicing is located: I am still not convinced that a custom DataLoader is the right solution, and I still think that tasks like slicing are intended to be part of datasets, not of DataLoaders. The reasons why I think this are:
I do not completely grasp the context - could you perhaps add two code snippets, current and speculative, for modifying the trainer?
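(For orientation, my rough guess at the two variants under discussion - both snippets are purely illustrative, and `TimeSeriesTrainer` with its `tbptt_steps` argument is hypothetical, not existing API:)

```python
# current: the user configures the stock lightning Trainer themselves
import lightning.pytorch as pl

trainer = pl.Trainer(max_epochs=10)
# trainer.fit(model, train_dataloaders=train_loader)


# speculative: a package-provided Trainer that bundles time series
# specifics (e.g., truncated BPTT), so users never have to swap trainers
class TimeSeriesTrainer(pl.Trainer):
    def __init__(self, *args, tbptt_steps=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.tbptt_steps = tbptt_steps  # placeholder for ts-specific logic


trainer = TimeSeriesTrainer(max_epochs=10, tbptt_steps=16)
```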
Hm, I see how one could make an argument either way. Then, would the implied design be (a) a raw time series dataset without slicing, e.g., as a ...? To avoid chaining, we can always have an all-in-one delegator.
Currently, we are importing directly from ...
Yes.
This PR carries out a clean-up refactor of `TimeSeriesDataSet`. No changes are made to the logic. This is in preparation for major work items impacting the logic, e.g., removal of the `pandas` coupling (see #1685), or a 2.0 rework (see #1736). In general, a clean state will make these easier. Work carried out:

* clear and complete docstrings, in numpydoc format
* separating logic, e.g., for parameter checking, data formatting, default handling
* reducing cognitive complexity and max indentation, addressing "code smells"
* linting
@fkiraly I am back from vacation. I hope you had a relaxing time during the holidays. Before I get started with figuring out how I can help with the integration of Time-Series-Library: have you tried contacting the original developers of that library? I think this is important for two reasons:
I think somewhere I pinged them, but you are right - we should also let them know and ask very explicitly, with the current state of discussions. I see multiple options that make technical sense:
Both options will require agreeing on a common framework level API - I still think DSIPTS is closest to what we want; of course, comments are much appreciated. FYI @wuhaixu2016, @ZDandsomSP, @GuoKaku, @DigitalLifeYZQiu, @Musongwhk (please feel free to add further pings if I forgot someone)
Agree, at least credit needs to be given clearly to the authors of time-series-library! A complication is that there is much copy-pasting going on in the neural network area historically; some (but not all) code in time-series-library is even copied from elsewhere, afaik. I think we should also come up with a way to fairly assign credit while backtracking the entire copy-merge-modify tree, although that might be a bit of work. A further complication is that the people actively maintaining the code may be different from those who wrote parts of it (but no longer maintain it), and all deserve credit! sktime/sktime#6850 The first PR includes handling of ...
Regarding the API of Time-Series-Library: from my (limited) grasp of the code, it consists of the following:
My feeling is that adapting this to any more general API will require major work. Using TimesNet in my code required me to write my own dataset and dataloaders, as well as a model wrapper. Edit: this should not be taken as criticism of Time-Series-Library. I am very grateful to the authors for their code and their papers. Their work helped me a lot in my project. It just shows that the library was written for a specific purpose (benchmarking) and not as a general purpose API like what is planned here.
A lot of models in DSIPTS are adaptations of online repositories with the same logic as Time-Series-Library. What I've done so far is to align and standardize these API calls in the DSIPTS data preparation. I also found it hard to understand the input parameters sometimes, which is why I leverage Hydra.
For discussion on Friday, here is a speculative design for a raw data container.
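(As a purely illustrative guess at the rough shape such a container could take - one item per whole series, metadata attached, no slicing; all names hypothetical:)

```python
import torch
from torch.utils.data import Dataset


class RawTimeSeriesDataset(Dataset):
    """Hypothetical 'bottleneck' layer: one item = one whole series
    plus its static metadata. No windowing or slicing happens here."""

    def __init__(self, series, metadata):
        assert len(series) == len(metadata)
        self.series = series      # list of (T_i, n_features) tensors
        self.metadata = metadata  # list of per-series metadata dicts

    def __len__(self):
        return len(self.series)

    def __getitem__(self, i):
        return {"y": self.series[i], "meta": self.metadata[i]}


ds = RawTimeSeriesDataset(
    series=[torch.randn(48, 2), torch.randn(96, 2)],  # unequal lengths are fine
    metadata=[{"id": "a"}, {"id": "b"}],
)
```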
outcome from prioritization meeting on Jan 15: data layer - dataset, dataloader 👍👍👍👍👍👍👍 💬 ✔️✔️✔️
model layer - base classes, configs, unified API 👍👍👍👍 ✔️
foundation models, model hubs 👍 👍 👍
documentation👍✔️
benchmarking 👍 👍💬
mlops and scaling (distributed, cluster etc)👍 👍
more learning tasks supported
I came up with this: https://zarr.readthedocs.io/en/stable/. It is widely used when you have a large dataset that cannot fit in RAM/VRAM. The idea is to create the zarr before creating the dataset. It generates chunks of data along a given dimension (or dimensions), in our case the temporal dimension; the dataset opens the zarr, and in the `__getitem__` function you retrieve the window you ask for - see the sketch after the pros/cons below.
PROs:
CONs:
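A minimal sketch of that pattern, assuming `zarr` is installed - path, shapes, and chunk sizes are illustrative only:

```python
import numpy as np
import torch
import zarr
from torch.utils.data import Dataset

# one-off step: write the data to disk, chunked along the temporal dimension
z = zarr.open("series.zarr", mode="w", shape=(100_000, 8),
              chunks=(10_000, 8), dtype="f4")
z[:] = np.random.randn(100_000, 8).astype("f4")


class ZarrWindowDataset(Dataset):
    """Illustrative dataset: opens the zarr lazily and reads only the
    requested window in __getitem__, keeping RAM usage bounded."""

    def __init__(self, path, window):
        self.z = zarr.open(path, mode="r")  # lazy, chunked, on disk
        self.window = window

    def __len__(self):
        return self.z.shape[0] - self.window

    def __getitem__(self, i):
        # only the chunks touched by this slice are read from disk
        return torch.from_numpy(self.z[i : i + self.window])
```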
+1 from me. We are also using zarr as file backend in our time series project. |
As the default data format in our minimal dataset class, I realized that xarray may be a great choice: xarray natively supports column names (i.e., metadata) as well as n-dimensional columns, which makes it an ideal replacement for both numpy and pandas. Furthermore, it has the advantage of dask support (which handles memory issues very well). By using dask arrays, xarray enables native chunking (ideal for non-overlapping time windows) as well as overlapping chunks (for sliding time windows).
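A minimal sketch of the features mentioned, assuming `xarray` with the dask extra installed - variable names and sizes illustrative:

```python
import numpy as np
import xarray as xr

# named dims/coords replace positional numpy axes and pandas columns
da = xr.DataArray(
    np.random.randn(10_000, 3),
    dims=("time", "feature"),
    coords={"feature": ["y", "temp", "price"]},
)

# dask-backed chunking along time: non-overlapping blocks, lazy compute
chunked = da.chunk({"time": 1_000})

# overlapping (sliding) windows via rolling + construct: adds a "window" dim
windows = da.rolling(time=24).construct("window")
```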
@agobbifbk, @thawn, I think there are multiple solutions that have a similar feature set - as long as we have consistent ...
FYI, a simplified version of the previous design sketch here: #1757 - for discussion.
Desiderata from the DataSet component:
Completed a design proposal draft here: sktime/enhancement-proposals#39 - input appreciated!
moved the various notes to the top. Are there any missing notes? |
Discussion thread for API re-design for `pytorch-forecasting`, next 1.X and towards 2.0. Comments appreciated from everyone! Link to enhancement proposal: sktime/enhancement-proposals#39
Context and goals
High-level directions:

* `pytorch-forecasting 2.0`. We will need to homogenize interfaces, consolidate design ideas, and ensure downwards compatibility.
* the `thuml` project, also see [ENH] neural network libraries in thuml time-series-library sktime#7243.
* `sktime`.

High-level features for 2.0 with MoSCoW analysis:

* `sktime` and DSIPTS, but as closely to the `pytorch` level as possible. The API need not cover forecasters in general, only `torch` based forecasters.
* `skbase` can be used to curate the forecasters as records, with tags, etc.
* `thuml`
Meeting notes
Summary of discussion on Dec 20, 2024 and prior
FYI @agobbifbk, @thawn, @sktime/core-developers.
Todos:
0. Update documentation on DSIPTS to signpost the above, README etc.
Roadmap planning Jan 15, 2025
👋 Attendees
Prioritization
👍👍👍👍👍👍👍👍👍👍👍👍👍
✔️✔️✔️
💬 💬 💬
data layer - dataset, dataloader 👍👍👍👍👍👍👍 💬 ✔️✔️✔️
model layer - base classes, configs, unified API 👍👍👍👍 ✔️
foundation models, model hubs 👍 👍 👍
documentation👍✔️
benchmarking 👍 👍💬
mlops and scaling (distributed, cluster etc)👍 👍
more learning tasks supported
Tech meeting Jan 20, 2025
Attendees:
Agenda
* `__getitem__` output convention
* `__init__` input convention(s)

References
Umbrella issue design: #1736
Notes
number of classes, dataset, dataloader, "bottleneck" idea
AG: should be making it as modular as possible (not one `TimeSeriesDataSet` that does everything)

* option 1: slicing at `__init__` - more memory intensive, clear distinction between train and inference; naive implementation needs to load everything in memory
* option 2: slicing at `__getitem__` time (dataset or dataloader) - feels this might be compute intensive, if we are recomputing and not caching etc
* `__getitem__` should be as general as possible

FHN: slicing at `__getitem__` time, option 2.

PB:

S:

T: `__getitem__` protocol

FK:
* idea of "bottleneck" or "least common denominator" did not come up, surprised (came up before)
* think we need at least one class, likely a `DataSet`, for "raw time series" (collection of, with all metadata)
* Benedikt (not here today) also suggested this idea, and that `DataSet`-s could depend on each other
* current best guess for a structure:
  * `DataSet`-s, these inherit from a common base and handle pandas as well as hard drive data
  * `DataSet`-s, these could add re-sampling on top, normalization etc
  * `DataLoader`-s, these are specific to data sets and classes of neural networks
* alternative structure:
  * `DataSet`-s only have minimal representation of "time series"
  * `DataLoader`-s that adapt data sets to neural networks

T: one of the "final layer" classes - or middle layer classes - could be an adapter to the 1.X API of pytorch-forecasting (current), ensuring downwards compatibility.
FK: big question for me is how many "layers" to have, e.g., two dataset layers and one data loader layer, or single dataset layer and one data loader layer (where data loaders do more).
T: had assumed we will use the standard pytorch dataloader - if that is the case, we will need two datasets for downwards compatibility.
FHN: if we keep using vanilla torch dataloader, we need two data set layers
* is this a contradiction to the dataset being "minimal"?
* FK: thinks not a contradiction, since there are two layers of datasets
* lower layer is "minimal" as discussed
* 2nd layer is specialized and specific to neural network(s)
FK: feels there is convergence but with two open questions:
(`__getitem__` format to be handled in next agenda point)

AG: there is one more complication - "stacked models", which are composites that use other models and their outputs to generate improved outputs
FK: we could have both options with a flag or two classes; this is really about internals of the class and does not impact the rest of the design.
T: commenting about "stacked models"
strong opinions on using vanilla dataloader vs two dataset layers, vs custom dataloader and one dataset
`__getitem__` output convention

FHN: unsure

T: "as simple as possible" - a `dict` with arrays (tensors etc) inside

S: do we have a clear picture of what should be there?
T: would prefer pure tensors
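To make the options concrete, one purely illustrative `__getitem__` output in the "dict of tensors" style - keys are hypothetical, not an agreed convention:

```python
import torch

# one hypothetical sample as returned by __getitem__
sample = {
    "encoder_cont": torch.randn(24, 3),  # past continuous features
    "decoder_cont": torch.randn(6, 3),   # known future features
    "target": torch.randn(6),            # values to predict
    "groups": torch.tensor([0]),         # series id, encoded as a tensor
}
```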
Tech meeting Jan 24, 2025
Attendees:

Notes
Recap
* need to define dataset/dataloader layers, `__init__`, and "output API"

FK: suggest to focus on output first - `__getitem__` designs based on last time

AG design suggestion
Current DSIPTS structure:
FK comments:
this looks like the top layer. It is closer to the "raw" or "bottleneck" layer, but it already has the data resampled.
The "sample" index is the first index in the input to
__init__
.FK opinion: the resampling should be part of a pipeline to prepare a data loader.
So we have different artefacts:

* A: raw data
* B: `DataSet`, obtained from raw data via resampling/normalization utility
* C: `DataLoader`, using the output of `__getitem__`

observation:

* pytorch-forecasting covers A-C in a single DataSet
* DSIPTS covers B-C in a single DataSet, and A-B in utilities
FK: last time, agreed we should have two layers of `DataSet`

* but none of the current solutions has the "bottleneck" layer
* ptf should take a `DataSet` instead of a DataFrame
* DSIPTS has A-B outside torch idiomatic structures
* alternatively, we could have a custom class handle conversions up to the dataloader format, or the input required for the dataset closest to the model; a sketch follows below
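A hedged sketch of that "custom conversion class" idea, chaining artefacts A (raw frame) to B (`DataSet`) to C (`DataLoader`) - all names hypothetical:

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class WindowedDataset(Dataset):
    """B: dataset built from raw data (A) by a conversion utility."""

    def __init__(self, values, window):
        self.values = torch.as_tensor(values, dtype=torch.float32)
        self.window = window

    def __len__(self):
        return len(self.values) - self.window

    def __getitem__(self, i):
        return self.values[i : i + self.window]


class DataPipeline:
    """Hypothetical all-in-one delegator: raw frame in, loader out."""

    def __init__(self, window=24, batch_size=8):
        self.window, self.batch_size = window, batch_size

    def to_dataset(self, frame: pd.DataFrame) -> WindowedDataset:  # A -> B
        return WindowedDataset(frame["y"].to_numpy(), self.window)

    def to_loader(self, frame: pd.DataFrame) -> DataLoader:       # A -> C
        return DataLoader(self.to_dataset(frame), batch_size=self.batch_size)
```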
FK design suggestion