
Handle large datasets efficiently #582

Open
dalonsoa opened this issue Oct 9, 2024 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@dalonsoa
Collaborator

dalonsoa commented Oct 9, 2024

  • Some models are going to require data at a much higher temporal resolution than the wider model update tick. An example here is sub-daily or daily inputs to the Abiotic model.
  • The input data files for this use case can be very large – not something we really want to ingest into the Data object at model startup and hold in RAM.
  • So, where do we store this kind of data, and is there a way to load the data lazily as required? This might be something that dask is well suited to, as it handles lazy loading of chunked data.
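As a rough illustration of the lazy-loading idea above, here is a minimal sketch using dask arrays. The shapes, chunk sizes, and the daily-mean reduction are hypothetical stand-ins for real climate inputs; in practice the array would come from a NetCDF file opened lazily (e.g. via xarray with a `chunks` argument) rather than being generated in memory.

```python
import dask.array as da

# Hypothetical dimensions: hourly climate data on a 90x90 grid for one year.
hours, ny, nx = 24 * 365, 90, 90

# Build a lazily evaluated array chunked into one-day (24-hour) slabs.
# With a real file this would instead be something like:
#   xr.open_dataset("climate.nc", chunks={"time": 24})
lazy = da.random.random((hours, ny, nx), chunks=(24, ny, nx))

# No data is materialised yet: this only records the computation graph.
daily_mean = lazy.reshape(365, 24, ny, nx).mean(axis=1)

# Only on .compute() are chunks streamed through memory, a day at a time,
# so peak RAM stays far below the size of the full dataset.
result = daily_mean.compute()
print(result.shape)  # (365, 90, 90)
```

The key point is that the simulation would only ever hold the chunks it is currently using, rather than ingesting the whole file into the Data object at startup.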
@dalonsoa dalonsoa added the enhancement New feature or request label Oct 9, 2024
@dalonsoa
Collaborator Author

dalonsoa commented Oct 9, 2024

@vgro, we will need an example simulation with at least one BIG file and some indication of where it is used, so we can explore how best to handle it memory-wise.

@alexdewar alexdewar self-assigned this Oct 9, 2024
@alexdewar
Collaborator

@vgro Do you happen to have a big file like this lying around? No pressure -- I've got lots to be getting on with elsewhere -- but I won't be able to start on this until there's some data for me to work with, so if you do have a chance to look at it over the next few weeks, that'd be great.

@davidorme davidorme added this to the Core structures milestone Jan 14, 2025
@vgro
Collaborator

vgro commented Jan 14, 2025

@alexdewar I'm terribly sorry that I haven't replied to this; I never received an email about the issue, and we haven't checked the issues systematically in a while. I have a few urgent tasks this week, but I'll try to get something to you by the end of next week or so.

@alexdewar
Collaborator

No worries @vgro. If it had been really urgent I'd have sent an email... Whenever you can send it through is fine.

@vgro
Collaborator

vgro commented Jan 16, 2025

> Nw @vgro. If it had been really urgent I'd have sent an email... Whenever you can send it through is fine.

If you want to run a simulation, you would probably need all the input data to have the same dimensions? For example, would you need the climate data to have the same spatial extent and time steps? Or would it be enough to provide one variable, say precipitation?

@alexdewar
Collaborator

Ideally it would have the same dimensions. I haven't looked into it enough to know exactly what I'd need, but big files with somewhat realistic input data should do the trick. Don't spend too long on this -- feel free to just send it through once you've got something and I can let you know if I need anything different.
