Build: improve upload step #9448
I created a Metabase question to identify which projects are most affected by this problem. These are projects that output a lot of files when building their documentation. We track the number of HTML files, not all files, but at least it gives some insight: https://ethicalads.metabaseapp.com/question/254-projects-with-many-html-files
Does rclone do anything different here? I ask because rclone uses the same API endpoints storages uses, though I'm not familiar with what rclone is actually doing in those calls. When I've used rclone with S3, it seems to perform similar API calls for every file in a directory: one to see if the file exists, another to see if the file needs to be updated, and finally one to send the file. It could provide some benefit with threading, or perhaps checksum matching, but as far as I know, rclone still has to send at least one API call per file in a directory. I don't believe rclone does anything special to upload a directory in bulk. I believe this is what S3 Batch Operations are for, but they seemed rather complex (it involves Lambda).
It does work differently to some degree. We are not performing any kind of timestamp/hash check before uploading a file, so we are always uploading a file even if it didn't change:

readthedocs.org/readthedocs/builds/storage.py
Lines 94 to 103 in 7de0e35

We could do a quick test with one of the projects listed in the Metabase question using our current approach and compare the time against rclone.
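For illustration, here is a minimal sketch (not the existing Read the Docs code) of what a pre-upload checksum check could look like with boto3: compare a local file's MD5 against the object's S3 ETag and skip the upload when they match. The bucket and key names are hypothetical, and the ETag-equals-MD5 assumption only holds for non-multipart uploads.

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def upload_if_changed(local_path, bucket, key):
    """Upload ``local_path`` only when its MD5 differs from the S3 ETag.

    Sketch only: the ETag is the plain MD5 just for non-multipart uploads.
    """
    with open(local_path, "rb") as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()

    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        if head["ETag"].strip('"') == local_md5:
            return False  # Unchanged; skip the upload.
    except ClientError:
        pass  # Object doesn't exist yet; fall through and upload it.

    s3.upload_file(local_path, bucket, key)
    return True
```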
Start collecting how much time (in seconds) it takes to upload all the build artifacts for each project. Related #9448
Yeah, the overall API call count will be about equal I imagine, but checksum matching could be a place where there are gains. Doing a manual test first would be a great idea.
We just deployed the log line that will give us this data. I created a query in New Relic that gives us the projects spending more than 60 seconds on the "Uploading" step: https://onenr.io/02wdVM279jE For now, ~3 minutes is the maximum we have, which doesn't seem terrible. I think it makes sense to compare that time with the total build time (which I don't have in that log line) to get the percentage of time used on each step. We could get that time, but it will require some extra work. I will take another look at this in the following days when we have more data and see if I find projects with worse scenarios. Then, we could use those projects for the rclone comparison.
I executed this query again over the last 7 days and found that "Read the Docs for Business" users are the most affected by this: https://onenr.io/0PwJKzg4gR7. I checked the first result, which took ~900 seconds in the "Uploading" step. The build took 2631 seconds in total and it has a lot of images. In cases like this, using rclone could make a real difference.
This feature may be of particular interest to the Python documentation community. See python/docs-community#10 (comment)
For next steps, we should see what rclone timing looks like. An easy first step would be manually replicating the rclone command from production to our dev/prod S3 bucket, perhaps for one of the problematic repos. I think we can assume that rclone run manually would be faster; the next question is how much faster the rclone-storages implementation would be. I suspect there might be additional operations in that approach, compared to just a raw rclone command.
The rclone storage backend uses subprocess.Popen, so I don't think there will be much difference 😬
It runs in multiple processes, so it will definitely be faster. We need something that uploads with a decent bit of concurrency, whether it's Python or Go.
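As a rough sketch of that idea in Python (hypothetical paths, bucket, and prefix; not the actual implementation), a thread pool gets most of the concurrency benefit with very little code:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

s3 = boto3.client("s3")  # boto3 clients are thread-safe; resources are not.


def upload_tree(local_dir, bucket, prefix, max_workers=8):
    """Upload every file under ``local_dir`` to S3 using a thread pool."""
    local_dir = Path(local_dir)
    files = [p for p in local_dir.rglob("*") if p.is_file()]

    def _upload(path):
        key = f"{prefix}/{path.relative_to(local_dir).as_posix()}"
        s3.upload_file(str(path), bucket, key)
        return key

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_upload, files))


# Hypothetical usage:
# upload_tree("/tmp/astropy/latest", "readthedocs-media-dev", "html/astropy/latest")
```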
I wrote that comment in the original post about deleting multiple files, and I don't think there is a way to efficiently do it in django-storages. However, the AWS API definitely supports deleting multiple objects in a single API call (docs ref), so it should be possible to do this more efficiently. Likely it's possible to upload multiple files efficiently as well.
So, are we in favor of using rclone, or of improving our code to handle bulk operations? S3 has an option for bulk deletion (up to 1k objects per request): https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_objects. It doesn't have an option for bulk upload, but we can make use of multi-threading for that. Also, the aws-cli has a sync command: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html.
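To make the bulk-deletion part concrete, here is a minimal sketch (not existing project code) that deletes keys in batches of up to 1,000, the per-request limit of the DeleteObjects API; the bucket and key list are assumed to come from elsewhere:

```python
import boto3

s3 = boto3.client("s3")


def bulk_delete(bucket, keys, chunk_size=1000):
    """Delete ``keys`` from ``bucket`` in batches of up to 1,000 objects."""
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in chunk], "Quiet": True},
        )
```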
We haven't decided yet what the right tool is.
Yea, I think rclone is worth testing, but if it's easy to speed up our code that's probably better. If it's really complex to speed up our solution, we should just use an external one.
And the results.

setup

The chosen project was astropy, taking 322 seconds to sync in our application (taken from New Relic). The versions were first downloaded locally, e.g.:

aws s3 cp s3://readthedocs-media-prod/html/astropy/v5.2.x/ /tmp/astropy/v5.2.x/ --recursive

aws s3 sync

time aws s3 sync /tmp/astropy/v5.1.x/ s3://readthedocs-media-dev/html/astropy/latest/ --delete

All other tests with awscli have been omitted, since they rely on the last modified time: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html

test rclone

time rclone sync -v /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 1m4.711s
time rclone sync -v /tmp/astropy/v5.2.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m44.929s
time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m45.047s
time rclone sync -v /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m44.506s
time rclone sync -v /tmp/astropy/v2.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m51.970s
time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
real 1m2.271s
time rclone sync -v /tmp/astropy/v1.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m39.047s
time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
real 1m3.660s
time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m4.635s

test rclone --transfers=8

time rclone sync -v --transfers=8 /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m29.927s
time rclone sync -v --transfers=8 /tmp/astropy/v5.2.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m22.516s
time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m20.110s
time rclone sync -v --transfers=8 /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m22.167s
time rclone sync -v --transfers=8 /tmp/astropy/v2.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m26.751s
time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m29.770s
time rclone sync -v --transfers=8 /tmp/astropy/v1.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m17.271s
time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m33.835s
time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
real 0m4.676s

verdict

rclone with an increased number of parallel uploads (--transfers=8) gives the best results: roughly half the time of the default rclone settings, and well below the 322 seconds our current implementation takes. We could also test what happens if we improve our code.
I'm in favor of giving it a try, though using an external command has its cons.
@stsewd Great work. I do think it's worth moving forward with using rclone for this. I agree with your limitations though. I do know in the past we had a Syncer abstraction that we could reuse for this: https://github.com/readthedocs/readthedocs.org/pull/6535/files#diff-369f6b076f78f7e41570254c681a3871b455a192428d138f2eeda28dc2eaf8c3 -- that would allow us to keep things working as normal in local tests. But I think rclone can also easily support local <-> local file transfers, so maybe we don't need anything that complex? I generally trust your ideas around implementation, so happy to move forward with what you think is best.
This is great!
Why not give this a try? It still executes rclone under the hood.
I also assumed we were talking about this solution as the first to try. I'm rather hesitant to fiddle with threading in our own application implementation, or to try to get too technically correct here. We don't have many projects being negatively affected by this issue, so our solution should match the severity. A drop-in replacement is a great option. Even if it's only twice as fast, that's good value for little effort.
django-rclone-storage doesn't have a sync option. It shouldn't be hard to implement that option, but that package is really just a thin wrapper around rclone (https://github.com/ElnathMojo/django-rclone-storage/), so I'm not sure we need to introduce a new dependency just for that. I was thinking of using rclone just for the sync operation, and having the other parts rely on django-storages (which I kind of prefer, since it has a whole community behind it compared to the other package).
I haven't customized storages, so I don't know how hard the replacement would be, but this still seems like a fair approach. A package does still seem easiest though, and I'm not dependency-averse here unless the project is unmaintained or something. I'll defer to one of you who have customized storages more. I'd have questions about our implementation if we're considering our own threading though; that's a rather expensive can of worms.
@agjohnson we would just need to override (or replace) this method in our storage class (readthedocs/builds/storage.py). All other operations would be handled by django-storages as usual.
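A minimal sketch of that idea, with hypothetical class and remote names (this is not the actual Read the Docs implementation): a mixin that shells out to rclone only for the directory sync, while every other operation keeps going through django-storages:

```python
import subprocess

from storages.backends.s3boto3 import S3Boto3Storage


class RCloneSyncMixin:
    """Delegate directory syncs to rclone; everything else stays on django-storages."""

    # Hypothetical rclone remote name, configured out of band.
    rclone_remote = "s3remote"

    def rclone_sync_directory(self, source, destination):
        # ``rclone sync`` makes the destination identical to the source,
        # deleting remote files that no longer exist locally.
        subprocess.run(
            [
                "rclone",
                "sync",
                "--transfers=8",
                source,
                f"{self.rclone_remote}:{self.bucket_name}/{destination}",
            ],
            check=True,
        )


class BuildMediaStorage(RCloneSyncMixin, S3Boto3Storage):
    # Uploads, deletes, URL generation, etc. are still handled by django-storages.
    pass
```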
I think shelling out to rclone is probably fine? Especially since the existing package doesn't do what we want. It does seem a bit more explicit. We can try it out behind a feature flag, and if we find issues with it we can contribute something to the package?
Noting here that we should take into account the symlink issue that we found in the last few weeks when swapping the backend. In #9800 all of that logic was moved into the storage class.
- Put this new feature under a feature flag.
- Works out of the box with our current settings; no rclone configuration file is required.
- Uses the local filesystem when running tests, and MinIO during development.
- We need to install rclone in our builders for this to work.
- I'm using the checks implemented in #9890, which needs to be merged first.
- If we want even faster upload times for Sphinx, we can merge readthedocs/readthedocs-sphinx-ext#119, since right now we are re-uploading all files.

To test this, you need to re-build your docker containers.

Closes #9448
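As an illustration only, a tiny sketch of how the feature-flag gate described above might route between the two code paths; the flag handling and method names are made up here and stand in for whatever the real PR does:

```python
def sync_build_artifacts(storage, source, destination, use_rclone=False):
    """Route the sync through rclone only when the feature flag is on.

    ``use_rclone`` stands in for the real feature-flag check, and the two
    storage methods are the hypothetical ones sketched earlier in the thread.
    """
    if use_rclone:
        storage.rclone_sync_directory(source, destination)
    else:
        storage.sync_directory(source, destination)
```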
We are using Django storages to delete/upload/sync the files from the builders into S3. This is good because it uses the same Django storages API, no matter what backend is behind the scenes.
However, S3/the API does not support "in bulk" uploads/deletions, so a lot of API requests have to be made to delete/upload a full directory. The code that does this is at:
readthedocs.org/readthedocs/builds/storage.py
Lines 48 to 136 in 0a9afda
This amount of API requests makes the upload process slow, in particular when there are many files. We talked about improving this by using something like rclone (https://rclone.org/) or similar. There is a django-rclone-storage package (https://pypi.org/project/django-rclone-storage/) where we can get some inspiration for this.

Slightly related to: #9179