
s3 open seek operation tries to read the rest of the file into the buffer, which gives subsequent reads a timeout risk #362

Closed
vuryleo opened this issue Oct 10, 2019 · 10 comments · Fixed by #389
vuryleo commented Oct 10, 2019

Here an API call is made to fetch the rest of the file into the buffer when calling seek, which makes seek very slow. The API call could be made in the read method instead.
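
For reference, the seek path looks roughly like this (my paraphrase of the linked code, not a verbatim copy):

def seek(self, offset, whence=0):
    # ... compute the new absolute position ...
    range_string = 'bytes=%d-' % new_position
    # the ranged GET is issued here, inside seek(), instead of being
    # deferred until the first read() that actually needs the bytes
    self._body = _get(self._object, self._version_id, Range=range_string)['Body']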

mpenkov (Collaborator) commented Oct 10, 2019

Are you sure that actually reads the file?

self._body = _get(self._object, self._version_id, Range=range_string)['Body']

I think the _body in this case is a file object. Until you actually .read() from it, nothing gets transferred over the wire.

If you have sample code that demonstrates the opposite, I'd be interested in having a look.

vuryleo (Author) commented Oct 10, 2019

Here the API call to get_object is made, and the bytes are transferred.

vuryleo (Author) commented Oct 10, 2019

Sample code is in the works.

vuryleo (Author) commented Oct 10, 2019

import boto3
from moto import mock_s3
from smart_open import open

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)  # if bytes get transferred, `value` is in buffer, otherwise, `value` should not be in memory yet.
    # we exited mock_s3, which means s3 is not available now
    print(f.read())  # value

mpenkov (Collaborator) commented Oct 10, 2019

I think you're misunderstanding how this works.

First, just because you've exited the context manager doesn't mean "S3 is unavailable now". Try this:

import boto3
from moto import mock_s3

with mock_s3():
    c = boto3.client('s3')
    c.create_bucket(Bucket='bucket')
    c.put_object(Bucket='bucket', Key='key', Body=b'value')

    session = boto3.Session()
    s3 = session.resource('s3')
    obj = s3.Object('bucket', 'key')
    body = obj.get()['Body']  # At this stage, we've opened a stream, but have not read any bytes

print(body.read())

Second, you can verify that smart_open isn't loading anything into its buffers by slightly modifying your original example:

import boto3
from moto import mock_s3
from smart_open import open

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)
        print("f.stream: %r buffer length: stream._buffer: %r" % (f.stream, len(f.stream._buffer)))

Running the above code:

(smart_open) misha@cabron:~/git/smart_open$ python seek.py 
f.stream: <smart_open.s3.SeekableBufferedInputBase object at 0x7fd2d1d72e48> buffer length: stream._buffer: 0

So, after seeking, you can see that the buffer is empty.

vuryleo (Author) commented Oct 10, 2019

Yes, you opened the stream but have not read from it yet; the network traffic has already happened, though, and the S3 server has already sent the bytes to the client. The S3 server does not have an API that sends the bytes individually on demand (like passive mode in FTP). Instead of exiting mock_s3, you can suspend the network as well:

import boto3
from moto import mock_s3
from smart_open import open

import socket

def guard(*args, **kwargs):
    raise Exception("I told you not to use the Internet!")

if __name__ == '__main__':
    with mock_s3():
        c = boto3.client('s3')
        c.create_bucket(Bucket='bucket')
        c.put_object(Bucket='bucket', Key='key', Body=b'value')
        f = open('s3://bucket/key', 'r')
        f.seek(0)  # if bytes get transferred, `value` is in buffer, otherwise, `value` should not be in memory yet.

    # we suspend the network and fake s3
    socket.socket = guard
    print(f.read())  # value

    f.seek(0)  # network error, just to prove network block works.

mpenkov (Collaborator) commented Oct 10, 2019

OK, so what's the actual problem here? Something out there (not smart_open) may be doing some buffering, but are you sure it reads the full file? Have you tried opening a large file on actual S3 and seeking? On my end, seeking e.g. to the end of a large file is much faster than reading it.
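
For instance, a rough comparison (the bucket and key here are placeholders):

import time
from smart_open import open

with open('s3://my-bucket/big.bin', 'rb') as f:
    t0 = time.perf_counter()
    f.seek(0, 2)  # seek to the end of the object
    print('seek: %.3fs' % (time.perf_counter() - t0))

with open('s3://my-bucket/big.bin', 'rb') as f:
    t0 = time.perf_counter()
    f.read()  # actually download the entire object
    print('read: %.3fs' % (time.perf_counter() - t0))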

vuryleo (Author) commented Oct 10, 2019

I will try that and come back with more detailed info, thanks for helping so far :)

vuryleo (Author) commented Oct 10, 2019

I just uploaded a large file (s3://oss-playground/big.bin, ~1 GB) to S3 and tried to capture the network traffic.

So, the get_object call does not read the whole file content into memory directly; it keeps an HTTP streaming connection open to the S3 server, which has some kind of buffer (~4K? IIRC 128K). That's why the socket trick did not prevent the read: the connection had already been created, and the value was small enough to fit in that buffer.

The issue then becomes: when you call get_object in seek, it creates an HTTP connection to S3 and tries to read it through to the end. But if the read happens too late, the S3 server will close the connection on its side, and you will receive a 'Connection reset by peer' error.

The proposal is still the same: move the get_object call into the read function rather than the seek function. Otherwise, if the time between seek and read (or between two reads) is too long for AWS S3, which is common if you process a file chunk by chunk, the S3 server will reset the connection and you cannot read from it anymore.

(Or, better still, keep an active HTTP streaming connection and fire another one if AWS closes it. :)
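
Something like the following against a real bucket should show the failure mode (my-bucket/big.bin is a placeholder; the sleep simulates slow chunk-by-chunk processing):

import time
from smart_open import open

with open('s3://my-bucket/big.bin', 'rb') as f:
    f.seek(1024)     # the ranged GET connection is opened here, inside seek()
    time.sleep(600)  # long pause, e.g. processing the previous chunk
    f.read(4096)     # may fail with 'Connection reset by peer' if S3 has
                     # closed the idle streaming connection in the meantime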

vuryleo changed the title from "s3 open seek operation reads rest of file into buffer, which makes seek O(N)" to "s3 open seek operation tries to read the rest of the file into the buffer, which gives subsequent reads a timeout risk" Oct 10, 2019
mpenkov (Collaborator) commented Oct 11, 2019

OK, thank you for investigating.

I think your proposal is OK. We can delay this block until read.

Are you able to make a PR? We can probably make a pre-read callback or something to call that code just before reading happens.
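
A minimal sketch of that idea, assuming a boto3 s3.Object (illustrative only, not smart_open's actual code):

class LazyReader:
    """Defers the ranged GET from seek() to the next read()."""

    def __init__(self, s3_object):
        self._object = s3_object
        self._position = 0
        self._body = None  # opened lazily

    def seek(self, offset):
        self._position = offset
        self._body = None  # drop any stale stream; reopen on the next read

    def read(self, size=-1):
        if self._body is None:
            # open the streaming connection only now, just before reading
            range_string = 'bytes=%d-' % self._position
            self._body = self._object.get(Range=range_string)['Body']
        data = self._body.read(size if size >= 0 else None)
        self._position += len(data)
        return data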
