Impl s3fs cursor #272

Open
laughingman7743 opened this issue Jan 23, 2022 · 1 comment

@laughingman7743
Owner

laughingman7743 commented Jan 23, 2022

Implement a cursor that reads CSV files in S3 without using Pandas.
It would be good to be able to use awsathena+s3fs in SQLAlchemy.

https://github.com/fsspec/s3fs
https://docs.python.org/3/library/csv.html
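
As a rough illustration of the idea, and not the planned implementation, the sketch below streams rows from a binary file-like object with the standard `csv` module. The helper name `iter_csv_rows` is hypothetical, and the `io.BytesIO` object is only a stand-in for an S3 object; with s3fs, the file object returned by `S3FileSystem.open(path, "rb")` could presumably be passed in the same way.

```python
import csv
import io
from typing import BinaryIO, Iterator, List


def iter_csv_rows(raw: BinaryIO, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield CSV rows from a binary stream without loading it all at once.

    Hypothetical helper for illustration; not part of any existing library.
    """
    # newline="" lets the csv module handle newlines inside quoted fields.
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    yield from csv.reader(text)


# In-memory stand-in for an S3 object body (assumption for this sketch).
sample = io.BytesIO(b'id,name\n1,"a,b"\n2,c\n')
rows = list(iter_csv_rows(sample))
```

Because the rows are produced lazily, a cursor built on this could fetch one row or a batch of rows at a time instead of materializing a DataFrame.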

@laughingman7743
Owner Author

laughingman7743 commented Jul 31, 2022

AbstractFileSystem
https://github.com/fsspec/filesystem_spec/blob/2022.7.1/fsspec/spec.py#L92
AbstractBufferedFile
https://github.com/fsspec/filesystem_spec/blob/2022.7.1/fsspec/spec.py#L1299

S3FileSystem
https://github.com/fsspec/s3fs/blob/2022.7.1/s3fs/core.py#L168
S3File
https://github.com/fsspec/s3fs/blob/2022.7.1/s3fs/core.py#L1822

It appears that awswrangler splits files into smaller chunks and uses ThreadPoolExecutor to retrieve them in parallel.
https://github.com/awslabs/aws-data-wrangler/blob/2.16.1/awswrangler/s3/_fs.py#L262-L300

s3fs depends on aiobotocore, which pins botocore to a strict version range, so it seems like a good idea to write my own S3 file system that uses ThreadPoolExecutor, similar to awswrangler's approach, instead of asyncio.
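
A minimal sketch of the chunked, thread-based read described above, under stated assumptions: `split_ranges` and `fetch_parallel` are hypothetical names, and the `fetch_range` callable is injected so the example runs against in-memory bytes. It is not awswrangler's code. With boto3, `fetch_range` could wrap `get_object` with a `Range` header, as noted in the comment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_ranges(size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges covering [0, size)."""
    return [(s, min(s + chunk_size, size)) for s in range(0, size, chunk_size)]


def fetch_parallel(
    fetch_range: Callable[[int, int], bytes],
    size: int,
    chunk_size: int,
    max_workers: int = 4,
) -> bytes:
    """Fetch every range in a thread pool and join the chunks in order."""
    ranges = split_ranges(size, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so chunks join back in sequence.
        chunks = executor.map(lambda r: fetch_range(*r), ranges)
        return b"".join(chunks)


# In-memory stand-in for an S3 object (assumption for this sketch).
data = bytes(range(256)) * 10


def fake_fetch(start: int, end: int) -> bytes:
    # With boto3 this could be something like:
    #   client.get_object(Bucket=bucket, Key=key,
    #                     Range=f"bytes={start}-{end - 1}")["Body"].read()
    # where the HTTP Range header uses an inclusive end offset.
    return data[start:end]


result = fetch_parallel(fake_fetch, len(data), chunk_size=100)
```

This only needs a synchronous client such as boto3, so it avoids the aiobotocore dependency entirely.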
