Figure out optimal upload/download strategy #1

Open
vsoch opened this issue Apr 2, 2020 · 4 comments

Comments

vsoch (Owner) commented Apr 2, 2020

  • Look into how standard clients do it, or perhaps in the spec.
ad-m commented Oct 27, 2020

The basic requirement should be to shift as much of the data-transfer load as possible to the back-end storage, with clients downloading directly from that storage, e.g. AWS S3 / GCP GCS / Minio. Otherwise, given the use of synchronous Django and Python, performance may be low; some optimizations are possible, e.g. Nginx X-Sendfile, but with only moderate scalability.
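
As a rough sketch of that pattern with boto3 (the endpoint, credentials, bucket, and key below are illustrative; the same call works against Minio by pointing `endpoint_url` at it):

```python
import boto3

# Works against AWS S3 directly, or against Minio by pointing
# endpoint_url at the Minio server; all values here are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# The registry hands this URL to the client, which then downloads
# directly from storage, so the bytes never pass through Django.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "registry-blobs", "Key": "sha256:abc123"},
    ExpiresIn=300,  # seconds
)
print(url)
```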

vsoch (Owner, Author) commented Oct 27, 2020

Agreed! For Singularity Hub, for example, we just generate signed URLs for Google Storage and offload the transfer there, and the same goes for the Minio backend for Singularity Registry Server (neither is OCI, however, which is why I'm creating this module). You are correct that Django + Python tends to be a bottleneck - the use case here is less a huge industry registry and more a smaller research-oriented one that serves far fewer requests.
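
For reference, generating such a signed URL with the google-cloud-storage client looks roughly like this (the bucket and object names are illustrative, and credentials are assumed to come from the environment):

```python
from datetime import timedelta

from google.cloud import storage

# Assumes application default credentials are configured;
# the bucket and object names below are illustrative.
client = storage.Client()
blob = client.bucket("shub-containers").blob("sha256:abc123")

# V4 signed URL: the client downloads straight from Google Storage,
# so the Django process never streams the container bytes itself.
url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),
    method="GET",
)
print(url)
```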

ad-m commented Oct 27, 2020

From my experience with the Distribution Registry, it is easy to end up with sudden bursts of high load, even in small environments (a team of 2-3 people), now that CI systems can launch many workers at once, each connected to a high-speed link.

For example, it is easy to run 20 parallel jobs in GitHub Actions (see the usage limits of the free plan: https://docs.github.com/en/free-pro-team@latest/actions/reference/usage-limits-billing-and-administration#usage-limits ), each on a separate virtual machine, and many of them will download some data from the registry at the start of the job and push data back to it at the end. Triggering that many jobs only takes a few commits in a short period of time and/or a build matrix that checks various configurations, e.g. different Python versions in the case of a library.

I understand that Python and Django are good enough for your needs. I just wanted to point out a few things, based on experience, that I did not expect at the beginning.

vsoch (Owner, Author) commented Oct 27, 2020

Thanks for the feedback! If we have a storage backend with signed URLs for upload/download, I don't think the core registry running on Django (but offloading transfers to that storage) would be an issue. But you are totally right that filesystem storage, for example running alongside the action, might be too much.
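
In that setup the registry view would only hand back a redirect, something like this sketch (get_signed_url is a hypothetical helper wrapping whichever storage backend is configured):

```python
from django.http import HttpResponseRedirect

def download_blob(request, digest):
    # get_signed_url is a hypothetical helper that asks the configured
    # storage backend (S3, GCS, or Minio) for a time-limited signed URL;
    # Django only issues the redirect and never serves the blob bytes.
    url = get_signed_url(digest)
    return HttpResponseRedirect(url)
```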
