
s3 open seek operation tries to read the rest of the file into the buffer, which gives subsequent reads a timeout risk #362

Closed
vuryleo opened this issue Oct 10, 2019 · 10 comments · Fixed by #389
vuryleo commented Oct 10, 2019

Here an API call is made to fetch the rest of the file into the buffer when calling seek, which makes seek very slow. The API call could be made in the read method instead.
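
For reference, the seek path looks roughly like this (my paraphrase of the linked code, not a verbatim copy):

def seek(self, offset, whence=0):
    # ... compute the new absolute position ...
    range_string = 'bytes=%d-' % new_position
    # the ranged GET is issued here, inside seek(), instead of being
    # deferred until the first read() that actually needs the bytes
    self._body = _get(self._object, self._version_id, Range=range_string)['Body']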

mpenkov (Collaborator) commented Oct 10, 2019

Are you sure that actually reads the file?

self._body = _get(self._object, self._version_id, Range=range_string)['Body']

I think the _body in this case is a file object. Until you actually .read() from it, nothing gets transferred over the wire.

If you have sample code that demonstrates the opposite, I'd be interested in having a look.

vuryleo (Author) commented Oct 10, 2019

Here the API call to get_object is made, and the bytes are transferred.

vuryleo (Author) commented Oct 10, 2019

Sample code is in the works.

vuryleo (Author) commented Oct 10, 2019

import boto3
from moto import mock_s3
from smart_open import open

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)  # if bytes get transferred, `value` is in buffer, otherwise, `value` should not be in memory yet.
    # we exited mock_s3, which means s3 is not available now
    print(f.read())  # value

mpenkov (Collaborator) commented Oct 10, 2019

I think you're misunderstanding how this works.

First, just because you've exited the context manager doesn't mean "S3 is unavailable now". Try this:

import boto3
from moto import mock_s3

with mock_s3():
    c = boto3.client('s3')
    c.create_bucket(Bucket='bucket')
    c.put_object(Bucket='bucket', Key='key', Body=b'value')

    session = boto3.Session()
    s3 = session.resource('s3')
    obj = s3.Object('bucket', 'key')
    body = obj.get()['Body']  # At this stage, we've opened a stream, but have not read any bytes

print(body.read())

Second, you can verify that smart_open isn't loading anything into its buffers by slightly modifying your original example:

import boto3
from moto import mock_s3
from smart_open import open

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)
        print("f.stream: %r buffer length: stream._buffer: %r" % (f.stream, len(f.stream._buffer)))

Running the above code:

(smart_open) misha@cabron:~/git/smart_open$ python seek.py 
f.stream: <smart_open.s3.SeekableBufferedInputBase object at 0x7fd2d1d72e48> buffer length: stream._buffer: 0

So, after seeking, you can see that the buffer is empty.

vuryleo (Author) commented Oct 10, 2019

Yes, you opened the stream but have not read from it yet; the network traffic has already happened, though, and the S3 server has already sent the bytes to the client. The S3 server does not have an API that sends the bytes individually on demand (like passive mode in FTP). Instead of exiting mock_s3, you can suspend the network as well:

import boto3
from moto import mock_s3
from smart_open import open

import socket

def guard(*args, **kwargs):
    raise Exception("I told you not to use the Internet!")

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)  # if bytes get transferred, `value` is in buffer, otherwise, `value` should not be in memory yet.

    # we suspend the network and fake s3
    socket.socket = guard
    print(f.read())  # value

    f.seek(0)  # network error, just to prove network block works.

mpenkov (Collaborator) commented Oct 10, 2019

OK, so what's the actual problem here? Something out there (not smart_open) may be doing some buffering, but are you sure it reads the full file? Have you tried opening a large file on actual S3 and seeking? On my end, seeking e.g. to the end of a large file is much faster than reading it.
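
For instance, a rough comparison (the bucket and key here are placeholders):

import time
from smart_open import open

with open('s3://my-bucket/big.bin', 'rb') as f:
    t0 = time.perf_counter()
    f.seek(0, 2)  # seek to the end of the object
    print('seek: %.3fs' % (time.perf_counter() - t0))

with open('s3://my-bucket/big.bin', 'rb') as f:
    t0 = time.perf_counter()
    f.read()  # actually download the entire object
    print('read: %.3fs' % (time.perf_counter() - t0))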

vuryleo (Author) commented Oct 10, 2019

I will try that and come back with more detailed info, thanks for helping so far :)

vuryleo (Author) commented Oct 10, 2019

I just uploaded a large file (s3://oss-playground/big.bin, ~1 GB) to S3 and tried to capture the network traffic.

So, the get_object call does not read the whole file content into memory directly; it keeps an HTTP streaming connection open to the S3 server, which has some kind of buffer (~4K? IIRC 128K). That's why the socket trick did not prevent the read: the connection had already been created, and the value was small enough to fit in that buffer.

The issue then becomes: when you call get_object in seek, it creates an HTTP connection to S3 and tries to read it through to the end. But if the read happens too late, the S3 server will close the connection on its side, and you will receive a 'Connection reset by peer' error.

The proposal is still the same: move the get_object call into the read function rather than the seek function. Otherwise, if the time between seek and read (or between two reads) is too long for AWS S3, which is common if you process a file chunk by chunk, the S3 server will reset the connection and you cannot read from it anymore.

(Or, better still, keep an active HTTP streaming connection and fire another one if AWS closes it. :)
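
Something like the following against a real bucket should show the failure mode (my-bucket/big.bin is a placeholder; the sleep simulates slow chunk-by-chunk processing):

import time
from smart_open import open

with open('s3://my-bucket/big.bin', 'rb') as f:
    f.seek(1024)     # the ranged GET connection is opened here, inside seek()
    time.sleep(600)  # long pause, e.g. processing the previous chunk
    f.read(4096)     # may fail with 'Connection reset by peer' if S3 has
                     # closed the idle streaming connection in the meantime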

vuryleo changed the title from "s3 open seek operation reads rest of file into buffer, which makes seek O(N)" to "s3 open seek operation tries to read the rest of the file into the buffer, which gives subsequent reads a timeout risk" Oct 10, 2019
mpenkov (Collaborator) commented Oct 11, 2019

OK, thank you for investigating.

I think your proposal is OK. We can delay this block until read.

Are you able to make a PR? We can probably make a pre-read callback or something to call that code just before reading happens.
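
A minimal sketch of that idea, assuming a boto3 s3.Object (illustrative only, not smart_open's actual code):

class LazyReader:
    """Defers the ranged GET from seek() to the next read()."""

    def __init__(self, s3_object):
        self._object = s3_object
        self._position = 0
        self._body = None  # opened lazily

    def seek(self, offset):
        self._position = offset
        self._body = None  # drop any stale stream; reopen on the next read

    def read(self, size=-1):
        if self._body is None:
            # open the streaming connection only now, just before reading
            range_string = 'bytes=%d-' % self._position
            self._body = self._object.get(Range=range_string)['Body']
        data = self._body.read(size if size >= 0 else None)
        self._position += len(data)
        return data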
