
Calculate storage use #6614

Closed
normanrz opened this issue Nov 7, 2022 · 4 comments · Fixed by #6685

Comments

@normanrz
Member

normanrz commented Nov 7, 2022

Detailed Description

In order to implement storage quotas, we need to capture the on-disk storage use of datasets. I would suggest calculating this on import and perhaps also via a regular cron job. I think it would be useful to store the storage use at the mag level.
The aggregated storage use per organization needs to be exposed via an API so that the frontend can display it on the organization page and use it to enforce upload blocks.
Remote datasets and symlinked layers should not count towards the storage quota.
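As an illustration of the API exposure mentioned above, a minimal sketch of a per-organization report (hypothetical names, not the actual webKnossos API):

```scala
// Hypothetical response model: aggregated storage use for one organization,
// as the frontend would need it to display usage and enforce upload blocks.
// Remote datasets and symlinked layers are assumed to be excluded from the sum.
case class OrganizationStorageReport(
    organizationId: String,
    usedStorageBytes: Long
)
```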

@fm3
Member

fm3 commented Nov 14, 2022

Do you think it is fair to rely on du being present on the host system for this feature? There do not seem to be reliable pure-Java APIs for this.

@normanrz
Member Author

I guess du would be okay, since we typically deploy in a Docker container.
However, I think an async/parallelized file walk in Java might be even faster.
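A minimal sketch of what such a walk could look like, assuming java.nio on a Scala 2.13 backend (illustrative only, not the actual webKnossos implementation). Files.walk does not follow symlinks by default, and symlink entries are filtered out, so symlinked layers would not be counted:

```scala
import java.nio.file.{Files, LinkOption, Path}
import scala.concurrent.{ExecutionContext, Future}
import scala.jdk.CollectionConverters._
import scala.util.Using

object StorageUsageSketch {
  // Sums the sizes of all regular files below datasetDir.
  // Symlink entries are excluded so that symlinked layers are not counted.
  def measureBytes(datasetDir: Path)(implicit ec: ExecutionContext): Future[Long] =
    Future {
      Using.resource(Files.walk(datasetDir)) { stream =>
        stream
          .iterator()
          .asScala
          .filter(p => Files.isRegularFile(p, LinkOption.NOFOLLOW_LINKS))
          .map(p => Files.size(p))
          .sum
      }
    }
}
```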

@fm3
Member

fm3 commented Nov 21, 2022

A few more questions have come up:

  • You mention storing the storage use on a per-mag basis. What do we want to do with this fine-grained information? For organization storage use, a by-dataset (or even by-datastore) aggregation seems sufficient, and it would certainly be easier to manage.
  • Do we want to count the contents of the .uploading, .forConversion, .converting (i.e. the worker working directory) and .trash directories?

@normanrz
Member Author

  • You mention storing the storage use on a per-mag basis. What do we want to do with this fine-grained information? For organization storage use, a by-dataset (or even by-datastore) aggregation seems sufficient, and it would certainly be easier to manage.

I don't see why it is harder to manage on a per-mag basis. Aggregating to by-dataset or by-datastore is just a simple SQL query (see the sketch below).

Do we want to count the contents of the .uploading, .forConversion, .converting (i.e. the worker working directory) and .trash directories?

Temp data should not count against the storage quota and does not need to be counted.
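For example, a hypothetical per-mag table (names are illustrative, not the actual webKnossos schema) can be rolled up to the organization level with a single GROUP BY; a sketch, kept as a plain SQL string as one might embed in the Scala backend:

```scala
// Hypothetical: one row per (dataset, mag) with the measured byte count,
// aggregated to the organization level for quota checks.
val usedStorageByOrganization: String =
  """SELECT d.organization_id, SUM(s.used_storage_bytes) AS used_storage_bytes
    |FROM dataset_mag_storage s
    |JOIN datasets d ON d.id = s.dataset_id
    |GROUP BY d.organization_id""".stripMargin
```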

@fm3 fm3 mentioned this issue Dec 7, 2022
@fm3 fm3 self-assigned this Dec 9, 2022
@fm3 fm3 closed this as completed in #6685 Jan 11, 2023