s3 open: seek operation tries to read the rest of the file into the buffer, which gives the following read a timeout risk #362
Comments
Are you sure that actually reads the file? `self._body = _get(self._object, self._version_id, Range=range_string)['Body']` I think the `_body` in this case is a file object. Until you actually `.read()` from it, nothing gets transferred over the wire. If you have sample code that demonstrates the opposite, I'd be interested in having a look.
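As an aside, the `Range=range_string` argument above is a standard HTTP byte-range header value (RFC 7233), which is what lets `seek` map a file offset to a partial GET. A minimal sketch of how such a string is built; `make_range_string` is a hypothetical helper for illustration, not smart_open's actual code:

```python
def make_range_string(start, stop=None):
    """Build an HTTP Range header value for a byte-range request (RFC 7233).

    `stop` is inclusive; omit it to request everything from `start`
    to the end of the object.
    """
    if stop is None:
        return 'bytes=%d-' % start
    return 'bytes=%d-%d' % (start, stop)

print(make_range_string(100))   # bytes=100-
print(make_range_string(0, 4))  # bytes=0-4
```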
Here is where the API call is made.

Sample code is in the works.
```python
import boto3
from moto import mock_s3
from smart_open import open

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)  # if bytes get transferred, `value` is in the buffer; otherwise, `value` should not be in memory yet
    # we have exited mock_s3, which means S3 is not available now
    print(f.read())  # value
```
I think you're misunderstanding how this works. First, just because you've exited the context manager doesn't mean S3 is unavailable now. Try this:

```python
import boto3
from moto import mock_s3

with mock_s3():
    c = boto3.client('s3')
    c.create_bucket(Bucket='bucket')
    c.put_object(Bucket='bucket', Key='key', Body=b'value')
    session = boto3.Session()
    s3 = session.resource('s3')
    obj = s3.Object('bucket', 'key')
    body = obj.get()['Body']  # At this stage, we've opened a stream, but have not read any bytes

print(body.read())
```

Second, you can verify that smart_open isn't loading anything into its buffers by slightly modifying your original example:

```python
import boto3
from moto import mock_s3
from smart_open import open

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)
        print("f.stream: %r buffer length: stream._buffer: %r" % (f.stream, len(f.stream._buffer)))
```

Running the above code:

So, after seeking, you can see that the buffer is empty.
Yes, you opened the stream but haven't read from it yet. However, the network traffic has already happened: the S3 server has already sent the bytes to the client. The S3 server does not have an API that sends the bytes on demand (like passive mode in FTP). Other than exiting mock_s3, you could also suspend the network:

```python
import boto3
from moto import mock_s3
from smart_open import open
import socket

def guard(*args, **kwargs):
    raise Exception("I told you not to use the Internet!")

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)  # if bytes get transferred, `value` is in the buffer; otherwise, `value` should not be in memory yet
        # we suspend the network and fake S3
        socket.socket = guard
        print(f.read())  # value
        f.seek(0)  # network error, just to prove the network block works
```
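The socket-patching trick above can be wrapped into a reusable context manager that restores the real socket afterwards; this is a sketch, and `no_network` is a hypothetical name, not part of smart_open or moto:

```python
import socket
from contextlib import contextmanager

@contextmanager
def no_network():
    """Temporarily replace socket.socket so any new connection attempt raises."""
    real_socket = socket.socket

    def guard(*args, **kwargs):
        raise RuntimeError("network access blocked")

    socket.socket = guard
    try:
        yield
    finally:
        socket.socket = real_socket  # always restore, even on error

# Usage: code that tries to open a new socket inside the block fails fast.
with no_network():
    try:
        socket.socket()
    except RuntimeError as e:
        print(e)  # network access blocked
```

Unlike the one-shot `socket.socket = guard` assignment, this cleans up after itself, so later tests in the same process can still use the network.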
OK, so what's the actual problem here? Something out there (not smart_open) may be doing some buffering, but are you sure it reads the full file? Have you tried opening a large file on actual S3 and seeking? On my end, seeking e.g. to the end of a large file is much faster than reading it.
I will try that and come back with more detailed info, thanks for helping so far :)
I just uploaded a large file (s3://oss-playground/big.bin, ~1 GB) to S3 and tried to capture the network traffic. So, the transfer really does start at `seek` time. Then the issue becomes: when you call `seek`, the rest of the file is fetched into the buffer. The proposal is still the same: move the API call into `read` (or, even better, keep an active HTTP streaming connection and fire another one if AWS closed it :)
OK, thank you for investigating. I think your proposal is OK. We can delay this block until read. Are you able to make a PR? We can probably make a pre-read callback or something to call that code just before reading happens. |
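That deferred-fetch idea could be sketched like this; `LazyReader` and the `fetch_range` callable are hypothetical names for illustration, not smart_open's actual API. `seek` only records the new position, and the ranged GET fires on the first `read`:

```python
import io

class LazyReader(io.RawIOBase):
    """Sketch of a reader that defers the ranged GET until read() is called.

    `fetch_range` is any callable(start) -> file-like object; in smart_open
    it would wrap the boto3 get_object call with a Range header.
    """

    def __init__(self, fetch_range, size):
        self._fetch_range = fetch_range
        self._size = size
        self._position = 0
        self._body = None  # no network call made yet

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._position = offset
        elif whence == io.SEEK_CUR:
            self._position += offset
        elif whence == io.SEEK_END:
            self._position = self._size + offset
        self._body = None  # drop any open stream; refetch lazily on next read
        return self._position

    def read(self, size=-1):
        if self._body is None:
            # The API call happens here, not in seek().
            self._body = self._fetch_range(self._position)
        data = self._body.read() if size < 0 else self._body.read(size)
        self._position += len(data)
        return data

# Usage with an in-memory stand-in for S3:
blob = b'value'
reader = LazyReader(lambda start: io.BytesIO(blob[start:]), len(blob))
reader.seek(2)        # no bytes transferred yet
print(reader.read())  # b'lue'
```

With this shape, a cheap `seek` followed by another `seek` costs nothing, and the connection is only opened when bytes are actually needed.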
Here an API call is made to fetch the rest of the file into the buffer when calling `seek`, which makes `seek` very slow. The API call could be put in the `read` method instead.