s3fs is significantly slower than boto3 for large file uploads #900

Open
b23g5r42i opened this issue Oct 14, 2024 · 5 comments
@b23g5r42i

I've noticed that when uploading large files (greater than 1GB), writing through s3fs is around three times slower than the boto3 upload_file() API.

Is this slower performance expected when using s3fs, and are there any configurations or optimizations that could improve its upload speed?

import os
import time

import boto3
import s3fs

# MinIO server setup
endpoint_url = 'http://localhost:9000'
access_key = ''
secret_key = ''
bucket_name = 'test-bucket'
file_path = 'large_file.bin'
key = 'uploaded_file.bin'

file_size_mb = 4096
with open(file_path, 'wb') as f:
    f.write(os.urandom(file_size_mb * 1024 * 1024))
print(f"Created a {file_size_mb}MB test file.")

fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': endpoint_url}, key=access_key, secret=secret_key)

s3 = boto3.client('s3', endpoint_url=endpoint_url, aws_access_key_id=access_key, aws_secret_access_key=secret_key)

start_time = time.time()
with fs.open(f'{bucket_name}/{key}', 'wb') as f:
    f.write(open(file_path, 'rb').read())
print(f"s3fs upload time: {time.time() - start_time} seconds")

start_time = time.time()
s3.upload_file(Filename=file_path, Bucket=bucket_name, Key=key)
print(f"boto3 upload time: {time.time() - start_time} seconds")
@martindurant
Member

In file-like mode (using open()), you are limited by the blocking nature of the API. s3fs will upload one block_size chunk at a time, serially. This is configurable via block_size= and is only 5MB by default, optimised for minimum memory use. We have discussed greatly increasing the default, but you can do this yourself in the call.
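
For example, a minimal sketch of passing a larger block size in the open() call (the 64MB value here is an illustrative assumption, not a recommended default):

# hypothetical: a larger block_size means fewer, bigger multipart parts per flush
with fs.open(f'{bucket_name}/{key}', 'wb', block_size=64 * 1024 * 1024) as f:
    f.write(data)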

The non-file upload-from-disk method is

fs.put_file(file_path, f'{bucket_name}/{key}')

and already has a larger 50MB block (called chunksize in the call). These calls are also made concurrently. This is what you want.
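
A sketch of overriding that chunk size explicitly (assuming chunksize is forwarded to the underlying upload; the 100MB figure is arbitrary):

# hypothetical: raise the multipart part size above the 50MB default
fs.put_file(file_path, f'{bucket_name}/{key}', chunksize=100 * 1024 * 1024)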

@b23g5r42i
Author

@martindurant Thanks for the reply! Increasing the block_size can indeed boost speed by up to 50% in my tests, but it's still about twice as slow as boto3, which I believe benefits from better concurrency handling.

The put_file() method looks useful, but it seems to work only with file paths. Is there a variant in fsspec that could, for example, support pickle.dump() directly? My main goal is to unify I/O operations with upath + fsspec, including S3.

@martindurant
Member

Would you care to test with #901?

It's worth pointing out that s3transfer, and maybe boto3, use threads and/or processes for parallelism, which matters in low-latency situations where the CPU time for stream compression might be significant. s3fs is single-threaded.
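
For reference, boto3 exposes that parallelism through s3transfer's TransferConfig; a minimal sketch (the particular chunk size and thread count are illustrative assumptions):

from boto3.s3.transfer import TransferConfig

# s3transfer splits the file into parts and uploads them from a thread pool
config = TransferConfig(multipart_chunksize=50 * 1024 * 1024, max_concurrency=10)
s3.upload_file(Filename=file_path, Bucket=bucket_name, Key=key, Config=config)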

@martindurant
Member

it seems to work only with file paths

I was trying to match your code, which writes out the whole contents of a file.
To push bytes from memory in one go:

fs.pipe(path, value)

(where you could have value = pickle.dumps(..)).
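
A sketch of that pattern for the pickle case (the object and target key here are placeholders):

import pickle

obj = {'example': 'payload'}  # whatever you would normally pickle.dump()
fs.pipe(f'{bucket_name}/{key}', pickle.dumps(obj))  # write the whole byte string in one call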

I also started #901 for you to speed test.

@b23g5r42i
Author

b23g5r42i commented Oct 17, 2024

Hi, with pickle.dumps(), it goes through fsspec's write() interface, so I think we only need to change that one line to enable the new block size. Meanwhile, in my testing I find max_concurrency has no impact on performance...

Here is my testing code:

with fs.open(f'{bucket_name}/{key}', 'wb', block_size=BLOCK_SIZE, max_concurrency=CONCURRENCY) as f:
    f.write(open(file_path, 'rb').read())

With a 1GB object, the default 5MB block size gives me 40MB/s, while 50MB gives 70MB/s and 500MB gives 90MB/s. But boto3 gives 160MB/s.
Changing max_concurrency from 1 to 10 has no impact on the speed at all.
