
Build: improve upload step #9448

Closed
humitos opened this issue Jul 25, 2022 · 23 comments · Fixed by #9842
Assignees
Labels
Accepted Accepted issue on our roadmap Improvement Minor improvement to code

Comments

@humitos
Member

humitos commented Jul 25, 2022

We are using django-storages to delete/upload/sync files from the builders to S3. This is good because it uses the same Django storages API, regardless of the backend behind the scenes.

However, the S3 API does not support "in bulk" uploads/deletions, so a lot of API requests have to be made to delete or upload a full directory. The code that does this is:

```python
def delete_directory(self, path):
    """
    Delete all files under a certain path from storage.

    Many storage backends (S3, Azure storage) don't care about "directories".
    The directory effectively doesn't exist if there are no files in it.
    However, in these backends, there is no "rmdir" operation so you have to
    recursively delete all files.

    :param path: the path to the directory to remove
    """
    if path in ('', '/'):
        raise SuspiciousFileOperation('Deleting all storage cannot be right')

    log.debug('Deleting path from media storage', path=path)
    folders, files = self.listdir(self._dirpath(path))
    for folder_name in folders:
        if folder_name:
            # Recursively delete the subdirectory
            self.delete_directory(self.join(path, folder_name))
    for filename in files:
        if filename:
            self.delete(self.join(path, filename))

def copy_directory(self, source, destination):
    """
    Copy a directory recursively to storage.

    :param source: the source path on the local disk
    :param destination: the destination path in storage
    """
    log.debug(
        'Copying source directory to media storage',
        source=source,
        destination=destination,
    )
    source = Path(source)
    for filepath in source.iterdir():
        sub_destination = self.join(destination, filepath.name)
        if filepath.is_dir():
            # Recursively copy the subdirectory
            self.copy_directory(filepath, sub_destination)
        elif filepath.is_file():
            with filepath.open('rb') as fd:
                self.save(sub_destination, fd)

def sync_directory(self, source, destination):
    """
    Sync a directory recursively to storage.

    Overwrites files in remote storage with files from ``source``
    (no timestamp/hash checking).
    Removes files and folders in remote storage that are not present
    in ``source``.

    :param source: the source path on the local disk
    :param destination: the destination path in storage
    """
    if destination in ('', '/'):
        raise SuspiciousFileOperation('Syncing all storage cannot be right')

    log.debug(
        'Syncing to media storage.',
        source=source,
        destination=destination,
    )
    source = Path(source)
    copied_files = set()
    copied_dirs = set()
    for filepath in source.iterdir():
        sub_destination = self.join(destination, filepath.name)
        if filepath.is_dir():
            # Recursively sync the subdirectory
            self.sync_directory(filepath, sub_destination)
            copied_dirs.add(filepath.name)
        elif filepath.is_file():
            with filepath.open('rb') as fd:
                self.save(sub_destination, fd)
            copied_files.add(filepath.name)

    # Remove files that are not present in ``source``
    dest_folders, dest_files = self.listdir(self._dirpath(destination))
    for folder in dest_folders:
        if folder not in copied_dirs:
            self.delete_directory(self.join(destination, folder))
    for filename in dest_files:
        if filename not in copied_files:
            filepath = self.join(destination, filename)
            log.debug('Deleting file from media storage.', filepath=filepath)
            self.delete(filepath)
```

This number of API requests makes the upload process slow, particularly when there are many files. We talked about improving this by using something like rclone (https://rclone.org/) or similar. There is a django-rclone-storage package (https://pypi.org/project/django-rclone-storage/) we can draw some inspiration from.

Slightly related to: #9179

@humitos humitos added Improvement Minor improvement to code Needed: design decision A core team decision is required labels Jul 25, 2022
@humitos humitos moved this to Planned in 📍Roadmap Aug 23, 2022
@humitos
Member Author

humitos commented Aug 30, 2022

I created a Metabase question to find the projects most affected by this problem: those that output a lot of files when building their documentation. We track the number of HTML files, not all files, but at least it gives some insight: https://ethicalads.metabaseapp.com/question/254-projects-with-many-html-files

@agjohnson
Contributor

Does rclone do anything different here? I ask because rclone uses the same API endpoints storages uses, though I'm not familiar with what rclone is actually doing in those calls. When I've used rclone with S3, it seems to perform similar API calls for every file in a directory: to see if the file exists, and then to see if the file needs updating, before finally sending the file.

It could provide some benefit with threading, or perhaps checksum matching, but as far as I know, rclone still has to send at least one API call per file in a directory. I don't believe rclone does anything special to upload a directory in bulk. I believe this is what S3 Batch Operations are for, but they seemed rather complex (it involves Lambda).

@humitos
Member Author

humitos commented Aug 31, 2022

It does work differently to some degree. We are not performing any kind of timestamp/hash check before uploading a file, so we always upload a file even if it didn't change:

```python
def sync_directory(self, source, destination):
    """
    Sync a directory recursively to storage.

    Overwrites files in remote storage with files from ``source``
    (no timestamp/hash checking).
    Removes files and folders in remote storage that are not present
    in ``source``.

    :param source: the source path on the local disk
    :param destination: the destination path in storage
    """
```

Our current BuildMediaStorageMixin.sync_directory is pretty simple and not too smart. It seems that rclone works in a similar way, but we can save some time by avoiding re-uploads of files that already exist. They have a whole section about "Reducing Costs" that we can take advantage of: https://rclone.org/s3/#reducing-costs

We could do a quick test with one of the projects listed in the Metabase question using our current approach, and compare the time against rclone. That will give us some insights about this.
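For reference, the kind of check rclone performs boils down to comparing a local file's MD5 digest against the S3 ETag (for non-multipart uploads, the ETag is exactly the object's MD5). A minimal sketch of that idea, with hypothetical helper names, in case we ever want it in our own code:

```python
import hashlib


def local_md5(filepath, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a local file.

    For non-multipart uploads, S3's ETag equals this digest, which is
    what rclone compares to skip unchanged files.
    """
    md5 = hashlib.md5()
    with open(filepath, "rb") as fd:
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def needs_upload(filepath, remote_etag):
    """Return True when the local file differs from the stored object.

    ``remote_etag`` is the ETag reported by the backend (quotes stripped),
    or None when the object doesn't exist yet.
    """
    if remote_etag is None:
        return True
    return local_md5(filepath) != remote_etag
```

Note this still costs one HEAD-style request per file to fetch the remote ETag; the saving is in skipped transfers, not skipped requests.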

humitos added a commit that referenced this issue Aug 31, 2022
Start collecting how much time (in seconds) it takes to upload all the build
artifacts for each project.

Related #9448
@agjohnson
Contributor

Yeah, the overall API count will be about equal I imagine, but checksum matching could be a place where there are gains. Doing a manual test first would be a great idea.

@humitos
Member Author

humitos commented Sep 6, 2022

We just deployed the log line that will give us this data. I created a query in New Relic that gives us the projects spending more than 60 seconds on the "Uploading" step: https://onenr.io/02wdVM279jE

For now, ~3 minutes is the maximum we have, which doesn't seem terrible. I think it makes sense to compare that time with the total build time (which I don't have in that log line) to get the percentage of time used on each step. We could get that time, but it will require some extra work.

I will take another look at this in the following days when we have more data and see if I find projects with worse scenarios. Then we could use those projects for the rclone tests we talked about and compare times.

@humitos
Member Author

humitos commented Sep 12, 2022

I executed this query again covering the last 7 days and found that "Read the Docs for Business" users are the most affected by this: https://onenr.io/0PwJKzg4gR7.

I checked the first result, which took ~900 seconds in the "Uploading" step. The build took 2631 seconds in total and it has a lot of images. In these cases, using rclone will be pretty helpful since we won't upload those images over and over again.

@humitos
Member Author

humitos commented Sep 15, 2022

This feature may be of particular interest to the Python documentation community. See python/docs-community#10 (comment)

@agjohnson
Contributor

For next steps, we should see what rclone timing looks like. An easy first step would be manually replicating the rclone command from production to our dev/prod s3 bucket -- perhaps for one of the problematic repos.

I think we can assume that rclone run manually would be faster; the next question is how much faster the rclone-storages implementation would be. I suspect there might be additional operations in that approach, compared to just a raw rclone command.

@humitos
Member Author

humitos commented Sep 15, 2022

rclone-storage uses subprocess.Popen, so I think there won't be any differences 😬
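To make the "shelling out" approach concrete, a wrapper around rclone could look roughly like this. This is a sketch only: the flag choices are assumptions based on rclone's documented options, not django-rclone-storage's actual code.

```python
import subprocess


def build_rclone_sync_command(source, destination, transfers=8):
    """Build the rclone invocation (illustrative flags)."""
    return [
        "rclone", "sync",
        # Number of parallel file transfers.
        "--transfers", str(transfers),
        # Compare checksums instead of modification times: build artifacts
        # are re-generated (and re-timestamped) on every build.
        "--checksum",
        str(source), str(destination),
    ]


def rclone_sync(source, destination, transfers=8):
    """Shell out to rclone, much like django-rclone-storage does."""
    return subprocess.run(
        build_rclone_sync_command(source, destination, transfers),
        capture_output=True, text=True, check=True,
    )
```

The `--checksum` flag matters for our case because aws-cli-style mtime comparison would re-upload everything after each build.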

@ericholscher
Member

ericholscher commented Nov 8, 2022

It runs in multiple processes, so it will definitely be faster. We need something that uploads with a decent bit of concurrency, whether it's Python or Go.

@humitos humitos added Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Nov 9, 2022
@davidfischer
Contributor

I wrote that comment in the original post about deleting multiple files, and I don't think there is a way to do it efficiently in django-storages. However, the AWS API definitely supports deleting multiple objects in a single API call (docs ref), so it should be possible to do this more efficiently. It's likely possible to upload multiple files efficiently as well.

@stsewd
Member

stsewd commented Nov 29, 2022

So, are we in favor of using rclone or improving our code to handle bulk operations?
Have we done the manual test with rclone yet?

S3 has the option of bulk deletion (up to 1,000 keys per request): https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_objects. It doesn't have an option for bulk upload, but we can use multi-threading for that.

Also, aws-cli has a sync command: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html.
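To illustrate the bulk-deletion option: boto3's `delete_objects` accepts up to 1,000 keys per call, so deleting a directory becomes one request per 1,000 objects instead of one request per object. A sketch (the batching helper is hypothetical):

```python
def chunked(items, size=1000):
    """Yield lists of at most ``size`` items (DeleteObjects accepts 1,000 keys)."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_delete(s3_client, bucket, keys):
    """Delete keys in batches via the DeleteObjects API,
    instead of issuing one DELETE request per object."""
    for batch in chunked(keys, 1000):
        s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch]},
        )
```

With this, removing a version with 3,942 files would take 4 API calls rather than 3,942.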

@humitos
Member Author

humitos commented Nov 29, 2022

@stsewd

So, are we in favor of using rclone or improving our code to handle bulk operations?
Have we done the manual test with rclone yet?

We haven't decided what the right tool is yet, but we guess rclone will be way better. We haven't done the tests, no. The next step is to do those manual tests with rclone 1, using some of the projects that take the most time uploading files from the New Relic query. Let me know if you need some help figuring this out.

Footnotes

  1. and maybe aws-cli, if you want to try it, in case you think it would be better/simpler

@stsewd stsewd moved this from Planned to In progress in 📍Roadmap Nov 29, 2022
@ericholscher
Member

Yea, I think rclone is worth testing, but if it's easy to speed up our code that's probably better. If it's really complex to speed up our solution, we should just use an external one.

@stsewd
Member

stsewd commented Nov 29, 2022

And the results.

setup

The chosen project was astropy, which takes 322 seconds to sync in our application (taken from New Relic).
The tests start with a clean upload, then re-sync other versions to test performance when files change or are deleted, and finally re-upload the same version to see what happens when no files change.

aws s3 cp s3://readthedocs-media-prod/html/astropy/v5.2.x/ /tmp/astropy/v5.2.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v5.1.x/ /tmp/astropy/v5.1.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v2.0.x/ /tmp/astropy/v2.0.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v1.0.x/ /tmp/astropy/v1.0.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/latest/ /tmp/astropy/latest/ --recursive

aws s3 sync

time aws s3 sync /tmp/astropy/v5.1.x/ s3://readthedocs-media-dev/html/astropy/latest/ --delete
real 0m23.395s
user 0m13.512s
sys 0m2.168s

All other tests with awscli have been omitted, since it relies on the last modified time,
and given that files are re-generated on each build, the tool re-uploads everything.

The following sync command syncs files to a local directory from objects in a specified bucket and prefix by downloading > s3 objects. An s3 object will require downloading if one of the following conditions is true:

The s3 object does not exist in the local directory.

The size of the s3 object differs from the size of the local file.

The last modified time of the s3 object is older than the last modified time of the local file.

https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html

test rclone

time rclone sync -v /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 112.658 MiB / 112.658 MiB, 100%, 1.019 MiB/s, ETA 0s
Transferred: 3942 / 3942, 100%
Elapsed time: 1m4.7s

real 1m4.711s
user 0m3.860s
sys 0m1.399s

time rclone sync -v /tmp/astropy/v5.2.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 111.094 MiB / 111.094 MiB, 100%, 3.058 MiB/s, ETA 0s
Checks: 3942 / 3942, 100%
Deleted: 3 (files), 0 (dirs)
Transferred: 2539 / 2539, 100%
Elapsed time: 44.9s

real 0m44.929s
user 0m4.151s
sys 0m1.443s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 96.770 MiB / 96.770 MiB, 100%, 2.376 MiB/s, ETA 0s
Checks: 3980 / 3980, 100%
Deleted: 7 (files), 0 (dirs)
Transferred: 2111 / 2111, 100%
Elapsed time: 45.0s

real 0m45.047s
user 0m4.235s
sys 0m1.377s

time rclone sync -v /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 108.828 MiB / 108.828 MiB, 100%, 2.707 MiB/s, ETA 0s
Checks: 3975 / 3975, 100%
Deleted: 36 (files), 0 (dirs)
Transferred: 2503 / 2503, 100%
Elapsed time: 44.4s

real 0m44.506s
user 0m4.267s
sys 0m1.380s

time rclone sync -v /tmp/astropy/v2.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 123.839 MiB / 123.839 MiB, 100%, 1.569 MiB/s, ETA 0s
Checks: 3942 / 3942, 100%
Deleted: 1203 (files), 0 (dirs)
Transferred: 2724 / 2724, 100%
Elapsed time: 51.9s

real 0m51.970s
user 0m3.943s
sys 0m1.165s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 114.121 MiB / 114.121 MiB, 100%, 1.786 MiB/s, ETA 0s
Checks: 3252 / 3252, 100%
Deleted: 514 (files), 0 (dirs)
Transferred: 3452 / 3452, 100%
Elapsed time: 1m2.2s

real 1m2.271s
user 0m4.418s
sys 0m1.458s

time rclone sync -v /tmp/astropy/v1.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 35.893 MiB / 35.893 MiB, 100%, 716.606 KiB/s, ETA 0s
Checks: 3975 / 3975, 100%
Deleted: 2472 (files), 0 (dirs)
Transferred: 1560 / 1560, 100%
Elapsed time: 39.0s

real 0m39.047s
user 0m2.709s
sys 0m0.803s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 114.163 MiB / 114.163 MiB, 100%, 1.669 MiB/s, ETA 0s
Checks: 1881 / 1881, 100%
Deleted: 378 (files), 0 (dirs)
Transferred: 3654 / 3654, 100%
Elapsed time: 1m3.6s

real 1m3.660s
user 0m4.358s
sys 0m1.314s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 3975 / 3975, 100%
Elapsed time: 4.6s

real 0m4.635s
user 0m1.316s
sys 0m0.316s

test rclone --transfers=8

time rclone sync -v --transfers=8 /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 112.658 MiB / 112.658 MiB, 100%, 3.396 MiB/s, ETA 0s
Transferred: 3942 / 3942, 100%
Elapsed time: 29.9s

real 0m29.927s
user 0m3.699s
sys 0m1.075s

time rclone sync -v --transfers=8 /tmp/astropy/v5.2.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 111.094 MiB / 111.094 MiB, 100%, 5.113 MiB/s, ETA 0s
Checks: 3942 / 3942, 100%
Deleted: 3 (files), 0 (dirs)
Transferred: 2539 / 2539, 100%
Elapsed time: 22.5s

real 0m22.516s
user 0m4.220s
sys 0m1.229s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 96.770 MiB / 96.770 MiB, 100%, 4.924 MiB/s, ETA 0s
Checks: 3980 / 3980, 100%
Deleted: 7 (files), 0 (dirs)
Transferred: 2111 / 2111, 100%
Elapsed time: 20.0s

real 0m20.110s
user 0m4.015s
sys 0m1.100s

time rclone sync -v --transfers=8 /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 108.828 MiB / 108.828 MiB, 100%, 5.049 MiB/s, ETA 0s
Checks: 3975 / 3975, 100%
Deleted: 36 (files), 0 (dirs)
Transferred: 2503 / 2503, 100%
Elapsed time: 22.1s

real 0m22.167s
user 0m4.173s
sys 0m1.265s

time rclone sync -v --transfers=8 /tmp/astropy/v2.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 123.839 MiB / 123.839 MiB, 100%, 4.412 MiB/s, ETA 0s
Checks: 3942 / 3942, 100%
Deleted: 1203 (files), 0 (dirs)
Transferred: 2724 / 2724, 100%
Elapsed time: 26.7s

real 0m26.751s
user 0m3.666s
sys 0m1.246s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 114.121 MiB / 114.121 MiB, 100%, 3.966 MiB/s, ETA 0s
Checks: 3252 / 3252, 100%
Deleted: 514 (files), 0 (dirs)
Transferred: 3452 / 3452, 100%
Elapsed time: 29.7s

real 0m29.770s
user 0m4.287s
sys 0m1.322s

time rclone sync -v --transfers=8 /tmp/astropy/v1.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 35.893 MiB / 35.893 MiB, 100%, 2.103 MiB/s, ETA 0s
Checks: 3975 / 3975, 100%
Deleted: 2472 (files), 0 (dirs)
Transferred: 1560 / 1560, 100%
Elapsed time: 17.2s

real 0m17.271s
user 0m2.586s
sys 0m0.713s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 114.163 MiB / 114.163 MiB, 100%, 3.150 MiB/s, ETA 0s
Checks: 1881 / 1881, 100%
Deleted: 378 (files), 0 (dirs)
Transferred: 3654 / 3654, 100%
Elapsed time: 33.8s

real 0m33.835s
user 0m4.224s
sys 0m1.277s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 3975 / 3975, 100%
Elapsed time: 4.6s

real 0m4.676s
user 0m1.410s
sys 0m0.213s

verdict

rclone with an increased number of parallel uploads (--transfers=8) is fast, obviously.
But since aws-cli doesn't work as expected, we don't have anything else to compare against.

We could also test what happens if we improve our code:

  • Checking the file hash, so we don't re-upload unchanged files (we could even do this using threads).
  • Doing deletions in bulk.
  • Using multiple threads to upload files.

but if it's easy to speed up our code that's probably better. If it's really complex to speed up our solution, we should just use an external one.

I'm in favor of giving it a try.

Using an external command has its cons:

  • Extra setup / installation steps
  • Can't be easily tested (we would be mocking the whole command)
  • Logging won't be that nice, since we would only have access to stdout
  • Could be problematic if we use the command for some operations and our own code for others (e.g., changes to how our code reads/writes/deletes files won't be reflected in the external tool)
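As a rough illustration of the "multiple threads to upload files" idea from the list of possible improvements above (a sketch only: `storage.save` follows the django-storages signature, the path joining is simplified, and backend thread-safety is assumed):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def upload_directory_threaded(storage, source, destination, max_workers=8):
    """Upload every file under ``source`` concurrently.

    ``storage`` is assumed to expose a django-storages style
    ``save(name, fd)``; this is a sketch, not production code.
    """
    source = Path(source)
    files = [p for p in source.rglob("*") if p.is_file()]

    def upload(filepath):
        # Simplified path joining; the real code uses storage.join().
        remote_name = "/".join([destination, str(filepath.relative_to(source))])
        with filepath.open("rb") as fd:
            storage.save(remote_name, fd)
        return remote_name

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(upload, files))
```

This keeps one API request per file, but overlaps them, which is essentially what rclone's `--transfers` flag does.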

@ericholscher
Member

@stsewd Great work. I do think it's worth moving forward with using rclone for this, though I agree with the limitations you listed. In the past we had a Syncer abstraction that we could reuse for this: https://github.com/readthedocs/readthedocs.org/pull/6535/files#diff-369f6b076f78f7e41570254c681a3871b455a192428d138f2eeda28dc2eaf8c3 -- that would allow us to keep things working as normal in local tests. But I think rclone can also easily support local <-> local file transfers, so maybe we don't need anything that complex?

I generally trust your ideas around implementation, so happy to move forward with what you think is best.

@humitos
Member Author

humitos commented Nov 30, 2022

This is great! rclone is ~15x faster than our current approach! 💪🏼 Maybe it's possible to get the best of both approaches?

There is a django-rclone-storage (https://pypi.org/project/django-rclone-storage/) where we can get some inspiration for this.

Why not give this a try? It still executes the rclone command behind the scenes, but at least it gives us a layer of abstraction via a Django storage that integrates well with our code and reduces the cons you mentioned.

@agjohnson
Contributor

agjohnson commented Nov 30, 2022

There is a django-rclone-storage

I also assumed we were talking about this solution as the first to try.

I'm rather hesitant to fiddle with threading in our own application implementation, or to try to get too technically correct here. We don't have many projects negatively affected by this issue, so our solution should match the severity.

A drop in replacement is a great option. Even if it's only twice as fast, that's good value for little effort.

@stsewd
Member

stsewd commented Dec 1, 2022

django-rclone-storage doesn't have a sync option. It shouldn't be hard to implement one, but the package is really just a thin wrapper around rclone (https://github.com/ElnathMojo/django-rclone-storage/), so I'm not sure we need to introduce a new dependency just for that. I was thinking of using rclone just for the sync operation and having the other parts rely on django-storages (which I kind of prefer, since it has a whole community behind it compared to the other package).

@agjohnson
Contributor

I haven't customized storages, so I don't know how hard the replacement would be, but this still seems like a fair approach. A package still seems easiest though, and I'm not dependency-averse here unless the project is unmaintained or something.

I'll defer to one of you all that have customized storages more. I'd have questions on our implementation if we're considering our own threading though. That's a rather expensive can of worms.

@stsewd
Member

stsewd commented Dec 1, 2022

@agjohnson we would just need to override (or replace) this method

def sync_directory(self, source, destination):

in

class S3BuildMediaStorage(BuildMediaStorageMixin, OverrideHostnameMixin, S3Boto3Storage):

All other operations would be handled by django-storages as usual.
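The override could look roughly like this. This is a sketch under stated assumptions: `rclone_remote` and `bucket_name` are placeholder attributes, the rclone flags are illustrative, and the real code would raise `SuspiciousFileOperation` rather than `ValueError`.

```python
import subprocess


class RCloneSyncMixin:
    """Sketch: replace only ``sync_directory`` with an rclone call,
    leaving every other storage operation to django-storages."""

    rclone_remote = "remote"  # placeholder remote name
    bucket_name = None        # set by the concrete storage class

    def _rclone_sync_command(self, source, destination):
        target = f"{self.rclone_remote}:{self.bucket_name}/{destination}"
        return ["rclone", "sync", "--checksum", str(source), target]

    def sync_directory(self, source, destination):
        # Same safety guard as the current implementation.
        if destination in ("", "/"):
            raise ValueError("Syncing all storage cannot be right")
        subprocess.run(self._rclone_sync_command(source, destination), check=True)
```

Keeping the rest of the mixin untouched means reads, writes, and single-file deletes still go through django-storages, addressing one of the cons listed earlier.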

@ericholscher
Member

ericholscher commented Dec 1, 2022

I think shelling out to rclone is probably fine? Especially since the existing package doesn't do what we want. It does seem a bit more explicit. We can try it out behind a feature flag, and if we find issues with it we can contribute something back to the package?

@humitos
Member Author

humitos commented Dec 13, 2022

Noting here that we should take into account the symlink issue we found in recent weeks when swapping the backend. In #9800 all the logic was moved into safe_open (used for opening users' configuration files and also when uploading files to storage), which will help here. We will need to use the same logic 1 immediately before uploading the files with rclone.

Footnotes

  1. check that the symlink target is inside DOCROOT (or, more specifically, the project's path if possible)
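The containment check described in the footnote could be sketched like this (the function name is hypothetical, and the real safe_open logic in #9800 is more involved):

```python
from pathlib import Path


def is_safe_to_upload(filepath, docroot):
    """Return True when the (possibly symlinked) file resolves inside docroot.

    Mirrors the idea behind ``safe_open``: resolve symlinks first, then check
    containment, so a symlink pointing outside the project can't leak
    arbitrary files into storage.
    """
    resolved = Path(filepath).resolve()
    docroot = Path(docroot).resolve()
    return resolved == docroot or docroot in resolved.parents
```

Running this check per file just before handing the directory to rclone would preserve the guarantee that nothing outside the build's docroot gets uploaded.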

@stsewd stsewd moved this from In progress to Needs review in 📍Roadmap Jan 17, 2023
@github-project-automation github-project-automation bot moved this from Needs review to Done in 📍Roadmap Jan 25, 2023
stsewd added a commit that referenced this issue Jan 25, 2023
- Put this new feature under a feature flag.
- Works out of the box with our current settings, no rclone configuration file required.
- Uses the local filesystem when running tests, and MinIO during development.
- We need to install rclone in our builders for this to work.
- I'm using the checks implemented in #9890, that needs to be merged first.
- If we want even faster upload times for sphinx, we can merge readthedocs/readthedocs-sphinx-ext#119,
  since right now we are re-uploading all files.

To test this, you need to re-build your docker containers.

Closes #9448