Multi-thread hashing of large S3 files #1788
Conversation
Force-pushed from 34c38d6 to f46af33.
Codecov Report
@@ Coverage Diff @@
## master #1788 +/- ##
==========================================
+ Coverage 89.44% 89.53% +0.09%
==========================================
Files 62 62
Lines 7190 7284 +94
==========================================
+ Hits 6431 6522 +91
- Misses 759 762 +3
My biggest concern here is introducing asyncio for just one function, while using regular thread pools everywhere else. I think asyncio is great and we should probably switch everything to it - but it's a fairly big and risky change, so we should do it separately, not as part of a performance fix for one function. We should think about proper design, and possibly even expose async functions to the user. But that will require lots of testing. For now, I think it's best to fix the download the "old" way.
One potential problem with parallel downloads: because hashing has to be done sequentially and cannot be parallelized (for a given file), there is a possibility of running out of memory. If the network is very fast and CPU cannot keep up with the download, this can potentially load every chunk of the file into memory at the same time.
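To make the memory concern concrete, here is a minimal sketch (not the PR's actual code) of one way to bound it: a semaphore caps the number of in-flight parts while a single consumer hashes them strictly in order. `fetch_part` is a hypothetical coroutine standing in for the ranged S3 GET, and SHA-256 stands in for whatever digest the hashing actually uses.

```python
import asyncio
import hashlib

async def hash_parts(fetch_part, num_parts, max_pending=40):
    sem = asyncio.Semaphore(max_pending)
    hasher = hashlib.sha256()

    async def fetch(i):
        await sem.acquire()         # blocks once max_pending parts are buffered
        return await fetch_part(i)  # hypothetical ranged GET returning bytes

    tasks = [asyncio.ensure_future(fetch(i)) for i in range(num_parts)]
    for t in tasks:                 # hashing must stay in part order
        hasher.update(await t)
        sem.release()               # part consumed; allow another download
    return hasher.hexdigest()
```

Without the semaphore, a fast network and a slow CPU would let every part accumulate in memory at once, which is exactly the failure mode described above.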
@sir-sigurd I talked a bit more with Dima about this and we're OK with …
Force-pushed from 408f619 to 550ed9a.
Tested on a large, complex package and got the same top-hash (nice work); a few comments inline.
s3_client_provider = S3ClientProvider()
# This controls how many parts can be stored in memory,
# including the ones that are being downloaded or hashed.
s3_max_pending_parts = s3_transfer_config.max_request_concurrency * 4
why * 4?
noted; it should be a named constant; might even make sense to adjust it based on available memory
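A sketch of that suggestion, with hypothetical names and an assumed psutil dependency (8 MiB matches boto3's default multipart chunk size); this is an illustration, not the PR's code:

```python
import psutil  # assumed dependency for reading available memory

PENDING_PARTS_PER_WORKER = 4   # the bare "* 4" above, now named
PART_SIZE = 8 * 1024 * 1024    # boto3's default multipart chunk size

def max_pending_parts(max_request_concurrency):
    by_workers = max_request_concurrency * PENDING_PARTS_PER_WORKER
    # Cap buffered parts at roughly a quarter of available memory.
    by_memory = psutil.virtual_memory().available // (4 * PART_SIZE)
    return max(max_request_concurrency, min(by_workers, by_memory))
```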
I got the following error in Jupyter (can you repro in IPython as well)? A lot of our users are using …
Regarding Jupyter: that's my concern about adding asyncio :-\ We don't know what's going to break.
This might be solved in the new Jupyter. Still, it seems kinda strict to require a specific version of Jupyter. How different exactly would this code look like with …
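For context on the Jupyter failure: the kernel already runs an asyncio event loop in the main thread, so calling asyncio.run() or loop.run_until_complete() from a cell raises RuntimeError. A common workaround, sketched here with a hypothetical run_coro helper (not part of the PR), is to detect the running loop and fall back to a separate thread:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def run_coro(coro):
    """Run a coroutine whether or not an event loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no running loop: plain scripts, most CLIs
    # Jupyter's loop is running in this thread; asyncio.run() would raise
    # RuntimeError, so execute the coroutine on its own thread and loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```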
Force-pushed from 09d6cd5 to 896f870.
Force-pushed from 02b6709 to 1056a19.
Force-pushed from 1056a19 to c1bfcda.
Codecov Report
@@ Coverage Diff @@
## master #1788 +/- ##
==========================================
+ Coverage 88.72% 88.85% +0.12%
==========================================
Files 93 93
Lines 11772 11958 +186
==========================================
+ Hits 10445 10625 +180
- Misses 1327 1333 +6
I'm not confident in my ability to review these changes, but I don't see any issues. It would be great to get @dimaryaz to take a look.
else:
    stopped_event.set()
finally:
    if not f.cancelled():
Why is the semaphore release only called if not f.cancelled()?
Is that so the hashing of that part will be retried?
A future can be cancelled only before it has started running, so the semaphore wasn't acquired when f.cancelled() is true.
It's still not great to have acquire and release in different places - if nothing else, it's hard to prove that the logic is correct. Theoretically, if logger.debug('%r part %s: download enqueued', src, part_number) throws an exception, then release will be called without an acquire. It's pretty difficult to make that happen, so kind of a negligible risk, but still...
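A sketch of that point: with async with, acquire and release live in one lexical scope, so the pairing holds even if the logger call raises (fetch_part is a hypothetical coroutine, not the PR's code):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def download_part(semaphore, src, part_number, fetch_part):
    async with semaphore:  # released on every exit path, exception or not
        logger.debug('%r part %s: download enqueued', src, part_number)
        return await fetch_part(src, part_number)
```

The trade-off is that the PR holds the semaphore until the part is hashed, not just downloaded, which is precisely why acquire and release end up in different places.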
Force-pushed from c1bfcda to 1d63b2f.
It's still a bit more complicated than I'd like, but I think it's correct, so don't want to block on that. We can improve it in the future.
Benchmarks on m5a.large (before/after charts omitted here) covered three scenarios: single/several large files of the same size; multiple small files plus a single large file; multiple small files plus two large files.
This PR changes the number of parallel S3 requests: previously it was the same as max_workers of ThreadPoolExecutor; now it's the same as for other multi-thread operations (10). A rough sketch of that thread-pool pattern follows the TODO list.

TODO
- clean up code in util
- get rid of asyncio
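As a hedged illustration of the thread-pool approach (a sketch only; get_part is a hypothetical ranged-GET callable, and SHA-256 stands in for the actual digest):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def hash_file_parts(get_part, num_parts, max_workers=10):
    hasher = hashlib.sha256()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in submission order, so hashing stays
        # sequential while downloads overlap. Completed parts are
        # buffered, so real code still needs an explicit memory bound,
        # per the discussion above.
        for part in pool.map(get_part, range(num_parts)):
            hasher.update(part)
    return hasher.hexdigest()
```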