s3fs is significantly slower than boto3 for large file uploads #900

Open
b23g5r42i opened this issue Oct 14, 2024 · 5 comments
@b23g5r42i

I've noticed that when uploading large files (greater than 1GB), writing through s3fs is around three times slower than the boto3 upload_file() API.

Is this slower performance expected when using s3fs, and are there any configurations or optimizations that could improve its upload speed?

import os
import time

import boto3
import s3fs

# MinIO server setup
endpoint_url = 'http://localhost:9000'
access_key = ''
secret_key = ''
bucket_name = 'test-bucket'
file_path = 'large_file.bin'
key = 'uploaded_file.bin'

file_size_mb = 4096
with open(file_path, 'wb') as f:
    f.write(os.urandom(file_size_mb * 1024 * 1024))
print(f"Created a {file_size_mb}MB test file.")

fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': endpoint_url}, key=access_key, secret=secret_key)

s3 = boto3.client('s3', endpoint_url=endpoint_url, aws_access_key_id=access_key, aws_secret_access_key=secret_key)

start_time = time.time()
with fs.open(f'{bucket_name}/{key}', 'wb') as f:
    f.write(open(file_path, 'rb').read())
print(f"s3fs upload time: {time.time() - start_time} seconds")

start_time = time.time()
s3.upload_file(Filename=file_path, Bucket=bucket_name, Key=key)
print(f"boto3 upload time: {time.time() - start_time} seconds")
@martindurant
Member

In file-like mode (using open()), you are limited by the blocking nature of the API. s3fs will upload one block_size chunk at a time, serially. This is configurable via block_size= and is only 5MB by default, optimised for minimum memory use. We have discussed greatly increasing the default, but you can do this yourself in the call.
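
For example, a minimal sketch of passing a larger block size in the open() call (the 64MB value here is an illustrative assumption, not a recommended default):

# hypothetical: a larger block_size means fewer, bigger multipart parts per flush
with fs.open(f'{bucket_name}/{key}', 'wb', block_size=64 * 1024 * 1024) as f:
    f.write(data)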

The non-file upload-from-disk method is

fs.put_file(file_path, f'{bucket_name}/{key}')

and already has a larger 50MB block (called chunksize in the call). These calls are also made concurrently. This is what you want.
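
A sketch of overriding that chunk size explicitly (assuming chunksize is forwarded to the underlying upload; the 100MB figure is arbitrary):

# hypothetical: raise the multipart part size above the 50MB default
fs.put_file(file_path, f'{bucket_name}/{key}', chunksize=100 * 1024 * 1024)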

@b23g5r42i
Author

@martindurant Thanks for the reply! Increasing the block_size can indeed boost speed by up to 50% in my tests, but it's still about twice as slow as boto3, which I believe benefits from better concurrency handling.

The put_file() method looks useful, but it seems to work only with file paths. Is there a variant in fsspec that could, for example, support pickle.dump() directly? My main goal is to unify I/O operations with upath + fsspec, including S3.

@martindurant
Member

Would you care to test with #901?

It's worth pointing out that s3transfer, and maybe boto3, use threads and/or processes for parallelism, which matters in low-latency situations where the CPU time for stream compression might be significant. s3fs is single-threaded.
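
For reference, boto3 exposes that parallelism through s3transfer's TransferConfig; a minimal sketch (the particular chunk size and thread count are illustrative assumptions):

from boto3.s3.transfer import TransferConfig

# s3transfer splits the file into parts and uploads them from a thread pool
config = TransferConfig(multipart_chunksize=50 * 1024 * 1024, max_concurrency=10)
s3.upload_file(Filename=file_path, Bucket=bucket_name, Key=key, Config=config)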

@martindurant
Member

it seems to work only with file paths

I was trying to match your code, which writes out the whole contents of a file.
To push bytes from memory in one go:

fs.pipe(path, value)

(where you could have value = pickle.dumps(..)).
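
A sketch of that pattern for the pickle case (the object and target key here are placeholders):

import pickle

obj = {'example': 'payload'}  # whatever you would normally pickle.dump()
fs.pipe(f'{bucket_name}/{key}', pickle.dumps(obj))  # write the whole byte string in one call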

I also started #901 for you to speed test.

@b23g5r42i
Author

b23g5r42i commented Oct 17, 2024

Hi, with pickle.dumps(), it goes through fsspec's write() interface, so I think we only need to change that one line to enable the new block size. Meanwhile, in my testing I find max_concurrency has no impact on performance...

Here is my testing code:

with fs.open(f'{bucket_name}/{key}', 'wb', block_size=BLOCK_SIZE, max_concurrency=CONCURRENCY) as f:
    f.write(open(file_path, 'rb').read())

With a 1GB object, the default 5MB block size gives me 40MB/s, while 50MB gives 70MB/s and 500MB gives 90MB/s. But boto3 gives 160MB/s.
Changing max_concurrency from 1 to 10 has no impact on the speed at all.
