S3 support #426

madsbk · 2024-07-31T15:17:35Z

Implements AWS S3 read support.

Tasks

Make the dependency of AWS SDK optional
Test with a libcudf io source (AWS S3 IO through KvikIO cudf#16499)
Docs

cmake RemoteHandle::read() remove StreamFuture(StreamFuture&&) python bindings and tests read to device memory dependencies: aws-sdk-cpp parse_s3_path cmake: adding AWSSDK to python built RemoteFile.from_url() benchmark benchmark: --api numpy ibenchmark: --api cudf cleanup benchmark: adding --api cudf-fsspec read(): use PushAndPopContext impl. pread ensure_aws_s3_api_is_initalized AwsS3Client S3Context clean up SameThreadExecutor use shared pointer cmake: AWSSDK COMPONENTS s3 transfer create_transfer_manager clean up BufferAsStream test_read: larger data size KVIKIO_NVTX_FUNC_RANGE benchmark: use random RemoteHandle::read_to_host(): print bandwidth benchmark clean up remove the use of the transfer module ci: some more aws-sdk-cpp benchmark: pytest.importorskip("moto") remote_handle.pyx remote_file.py make remote IO optional don't use typing_extensions dependencies: boto3 and moto more aws-sdk-cpp trigger CI error if remote module wasn't built

cpp/include/kvikio/remote_handle.hpp

madsbk · 2024-08-13T11:28:38Z

python/kvikio/tests/test_aws_s3.py

+# TODO: remove before PR merge. Trigger CI error if the remote module wasn't built
+import kvikio._lib.remote_handle  # isort: skip


[debug] remove before PR merging

madsbk · 2024-09-03T11:20:50Z

@KyleFromNVIDIA, what is the status of the building work? Is it ready for review/merge?

bdice

I have some concerns about the handling of the aws-sdk-cpp dependency and how it will constrain our compatibility between RAPIDS and the conda-forge ecosystem. This dependency is used by many packages and it is tightly pinned, meaning that RAPIDS will need to track and continually migrate its dependency to align exactly with the conda-forge ecosystem. See threads in this review.

bdice · 2024-09-06T13:38:31Z

build.sh

   clean                       - remove all existing build artifacts and configuration (start over)
   libkvikio                   - build and install the libkvikio C++ code
   kvikio                      - build and install the kvikio Python package (requires libkvikio)
+   --no-s3                     - build with no AWS S3 support


Can this just be defined as a part of --cmake-args, with no special --no-s3 flag? It is possible for users to specify the option via --cmake-args. If the user gives a value there, we probably want to respect that with higher priority than defaulting to "ON" based on the absence of a --no-s3 flag.
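The precedence being asked for here (an explicit value in `--cmake-args` wins over the built-in default) can be sketched roughly as follows. This is only an illustration in Python, not the `build.sh` logic, and the option name `KvikIO_S3_SUPPORT` is a hypothetical placeholder rather than the project's actual CMake option.

```python
import shlex

# Hypothetical CMake option name, used only for illustration.
S3_OPTION = "KvikIO_S3_SUPPORT"


def resolve_s3_option(cmake_args: str, default: str = "ON") -> str:
    """Return the S3 option value, preferring an explicit -D in cmake_args."""
    value = default
    for token in shlex.split(cmake_args):
        # Accept both -DNAME=VALUE and -DNAME:TYPE=VALUE forms.
        if token.startswith("-D") and "=" in token:
            name, _, val = token[2:].partition("=")
            if name.split(":")[0] == S3_OPTION:
                value = val  # in this sketch, the last occurrence wins
    return value


# No explicit value: fall back to the default.
print(resolve_s3_option(""))
# An explicit user value takes priority over the default.
print(resolve_s3_option("-DKvikIO_S3_SUPPORT=OFF -DCMAKE_BUILD_TYPE=Release"))
```

With this shape, a dedicated `--no-s3` flag would not be needed: the default only applies when the user has not set the option themselves.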

bdice · 2024-09-06T13:48:57Z

conda/recipes/kvikio/meta.yaml

@@ -64,11 +64,13 @@ requirements:
    - rapids-build-backend >=0.3.0,<0.4.0.dev0
    - scikit-build-core >=0.10.0
    - libkvikio ={{ version }}
+    - aws-sdk-cpp>=1.11.267


⚠️ We will have to be very careful with this pinning. We must match conda-forge exactly because aws-sdk-cpp requires a nearly-exact match at runtime (version x.x.x) and we need compatibility with conda-forge. Note that we will need to maintain this pinning very actively and align with conda-forge for every RAPIDS release, or we risk breaking environment compatibility with conda-forge. This is a large part of why I am hesitant on adding this dependency -- it constrains us tightly, and we do not benefit from the automated migrations that are used for conda-forge recipes to "rebuild the world" when pinnings like aws-sdk-cpp are updated.

We always need to be matching this line: https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/2e592a612d925988747cd6daa9e271328ceb3bfc/recipe/conda_build_config.yaml#L268

Suggested change

-    - aws-sdk-cpp>=1.11.267
+    - aws-sdk-cpp==1.11.379

Sorry for my ignorance, but why do we need an exact match? Naively, I would have thought that a lower limit would be more flexible. Do all packages in conda-forge need to maintain this pinning if they use aws-sdk-cpp?

@bdice, would it help if we remove the version requirement?

First I'll explain the migration logic. Then I'll answer the questions you posed about other options.

Migrations and why they're needed

The aws-sdk-cpp release cadence is very frequent. Some of those releases (there's a huge number of them) are picked up by conda-forge, but taking the new version requires "rebuilding the world" because aws-sdk-cpp is a core C++ dependency of many packages (see mamba repoquery whoneeds -c conda-forge aws-sdk-cpp). Updates happen via a conda-forge migration (tracked here) which rebuild everything that depends on aws-sdk-cpp.

See the number of recent AWS-related rebuilds of libarrow, for instance:
https://github.com/conda-forge/arrow-cpp-feedstock/commits/main/recipe/meta.yaml

I'll ignore the aws-crt-cpp and other rebuilds, and only focus on aws-sdk-cpp because that's the dependency we are introducing here.

In the past 12 months, there have been 6 rebuilds of libarrow for new versions of aws-sdk-cpp (1.11.379, 1.11.329, 1.11.267, 1.11.242, 1.11.210, 1.11.182).

Every time that conda-forge migrates to a new version, we must use the same version so that RAPIDS (KvikIO specifically) can be installed alongside the latest conda-forge packages.

Try this:

mamba create -n test --dry-run libarrow

Note that aws-sdk-cpp 1.11.379 is included in the environment because there has been a recent rebuild of libarrow.

Can we use a lower limit (or no version pinning)?

If we were to use a lower limit (or no version pinning), kvikio would use the latest aws-sdk-cpp when it is built. That would impose a runtime pinning on the latest aws-sdk-cpp. However, if a conda-forge migration has not happened yet (or is incomplete), those kvikio packages would be incompatible with the latest conda-forge packages of things like libarrow. We need to match the migrators so we don't get "latest aws-sdk-cpp" and instead align with "current version used by conda-forge's ecosystem-wide pinnings". Typically conda-forge maintainers will start a migration as soon as a new aws-sdk-cpp is released, but that's not guaranteed to happen immediately. Sometimes migrations are delayed or skipped, which would leave our packages unusable if we had picked up an aws-sdk-cpp version that conda-forge did not adopt with a migration.

So what do we do?

RAPIDS is not built with conda-forge infrastructure, but aims to be compatible with the conda-forge ecosystem. We do not benefit from conda-forge's automation for version migrations but we must track them in order to be compatible. (The conda-forge bots that "rebuild the world" do not rebuild RAPIDS recipes, but maybe we could create such a tool.)

To remain compatible with conda-forge, we need to match the exact pinnings for core packages like fmt, spdlog, aws-sdk-cpp, and others. We already have a lot of pain from matching conda-forge migrations for fmt and spdlog (see Keith's warning and this problem in cuspatial that comes from us being behind on spdlog/fmt migrations).

Our best course of action for now is to:

* pin aws-sdk-cpp to the exact version in this YAML file: https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/f291cfdaf8c328337cb8cd1f63c63caceeda8991/recipe/conda_build_config.yaml#L267-L268 (that's a permalink but we should track the latest as it changes)
  * we cannot pin more loosely -- aws-sdk-cpp imposes run-exports that pin very tightly, which typically indicates that there is no guarantee of stable ABI across versions
* manually update that version when conda-forge issues a migration
* consider adding tooling to help automate migrations

I opened rapidsai/build-planning#100 to think more about conda-forge migration support in RAPIDS.
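As a rough idea of what such tooling could check, the sketch below compares an exact pin in a recipe against the version listed in the conda-forge pinning file. It is a minimal illustration, not an existing RAPIDS tool: the function names are hypothetical, the inputs are inline sample strings rather than downloaded files, and it assumes the pinning file spells the key with underscores (`aws_sdk_cpp`).

```python
import re
from typing import Optional


def pinned_version(pinning_yaml: str, package: str) -> Optional[str]:
    """Extract the first pinned version for `package` from pinning-file text."""
    # Assumption: the pinning file uses underscores in keys (aws_sdk_cpp).
    key = package.replace("-", "_")
    match = re.search(
        rf"^{re.escape(key)}:\s*\n\s*-\s*([\w.]+)", pinning_yaml, re.MULTILINE
    )
    return match.group(1) if match else None


def recipe_pin(recipe_text: str, package: str) -> Optional[str]:
    """Return the version of an exact `package==X.Y.Z` pin, or None."""
    match = re.search(rf"-\s*{re.escape(package)}\s*==\s*([\w.]+)", recipe_text)
    return match.group(1) if match else None


def is_aligned(recipe_text: str, pinning_yaml: str, package: str) -> bool:
    """True only if the recipe pins exactly the conda-forge version."""
    ours = recipe_pin(recipe_text, package)
    return ours is not None and ours == pinned_version(pinning_yaml, package)


# Sample inputs for illustration only.
PINNING = "aws_sdk_cpp:\n  - 1.11.379\n"
RECIPE = "    - aws-sdk-cpp==1.11.379\n"

print(is_aligned(RECIPE, PINNING, "aws-sdk-cpp"))
```

A lower-bound pin such as `aws-sdk-cpp>=1.11.267` is deliberately treated as not aligned, since only an exact match guarantees compatibility with the current conda-forge migration.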

Wow, thanks for the explanation @bdice! I now see why you were hesitant from the get-go 😊

What if we move to libcurl instead? They take API and ABI stability very seriously: https://curl.se/libcurl/features.html#stableapi

I was actually thinking of using libcurl from the start since we only need the very basic S3 operations. I don’t think it will be too hard to implement and it will make Azure Blob Storage support straightforward.

That might be a better option than aws-sdk-cpp if we can use libcurl. It seems like libcurl is already a dependency of some packages we depend on in RAPIDS, so that gives me more confidence in it. See $ mamba repoquery whoneeds -c conda-forge libcurl 2>&1 | awk '{print $1;}' | sort | uniq.

Great, I will give libcurl a try

bdice · 2024-09-06T13:49:28Z

conda/recipes/kvikio/meta.yaml

  run:
    - python
    - numpy >=1.23,<3.0a0
    - cupy >=12.0.0
    - zarr
+    - aws-sdk-cpp>=1.11.267


Do not list a run dependency; this will be handled by run-exports: https://github.com/conda-forge/aws-sdk-cpp-feedstock/blob/f82d968670bdb9939ed7604a5fb7bb4885e2e6ba/recipe/meta.yaml#L14

Suggested change

-    - aws-sdk-cpp>=1.11.267

bdice · 2024-09-06T13:49:53Z

conda/recipes/libkvikio/meta.yaml

@@ -52,6 +52,7 @@ requirements:
    {% else %}
    - libcufile-dev  # [linux]
    {% endif %}
+    - aws-sdk-cpp>=1.11.267


Match conda-forge.

Suggested change

-    - aws-sdk-cpp>=1.11.267
+    - aws-sdk-cpp==1.11.379

bdice · 2024-09-06T13:50:02Z

conda/recipes/libkvikio/meta.yaml

@@ -83,6 +84,7 @@ outputs:
        {% else %}
        - libcufile-dev  # [linux]
        {% endif %}
+        - aws-sdk-cpp>=1.11.267


Remove this and add a host dependency on aws-sdk-cpp==1.11.379 here so that we can have compatible runtime pinnings inserted automatically by the run-exports.

python/kvikio/kvikio/remote_file.py

python/kvikio/tests/test_benchmarks.py

bdice · 2024-09-06T16:34:40Z

python/kvikio/tests/test_benchmarks.py

+
+    retcode = run_cmd(
+        cmd=[
+            sys.executable or "python",


I think sys.executable should always be defined? Are there cases where that isn't safe?
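For context on the question: the Python documentation notes that `sys.executable` is an empty string or `None` when the interpreter cannot determine the path to its own binary (for example in some embedded setups), so the `or "python"` fallback is defensive rather than strictly redundant. A minimal sketch of the same pattern, with a hypothetical helper name:

```python
import sys


def python_executable(fallback: str = "python") -> str:
    """Return the interpreter path, or `fallback` if it is unavailable."""
    # `sys.executable` may be an empty string or None when Python cannot
    # determine the path of its own binary; both are falsy.
    return sys.executable or fallback


print(python_executable())
```

In a normal test run the fallback is never taken, which is why it is reasonable to ask whether it is needed at all.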

bdice · 2024-09-06T16:35:18Z

python/kvikio/tests/test_benchmarks.py

+        ],
+        cwd=benchmarks_path,
+    )
+    assert retcode == 0


Should this assert anything about the stdout/stderr outputs?
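One way to check the output as well as the exit code is to capture stdout and stderr with `subprocess.run`. The sketch below is an assumption-laden illustration: `run_and_capture` is a hypothetical stand-in for the test's `run_cmd` helper, and the inline command and the `bandwidth` marker are made up for the example.

```python
import subprocess
import sys


def run_and_capture(cmd, cwd=None):
    """Run `cmd` and return (returncode, stdout, stderr) as text."""
    result = subprocess.run(
        cmd, cwd=cwd, capture_output=True, text=True, check=False
    )
    return result.returncode, result.stdout, result.stderr


# Hypothetical stand-in for a benchmark script that prints a result line.
retcode, out, err = run_and_capture(
    [sys.executable, "-c", "print('bandwidth: 1.0 GiB/s')"]
)
assert retcode == 0, err  # surface stderr in the failure message
assert "bandwidth" in out  # check for an expected marker in stdout
```

Passing `err` as the assertion message means a failing benchmark shows its stderr directly in the pytest report.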

GregoryKimball · 2024-09-23T19:49:11Z

@madsbk Would you please share the status of this work? Are you planning to include AWS C++ SDK changes in 24.12?

madsbk · 2024-09-24T06:44:39Z

> @madsbk Would you please share the status of this work? Are you planning to include AWS C++ SDK changes in 24.12?

The plan is to use libcurl instead of aws-sdk-cpp to support S3. The first step is implemented in #464.

Support reading directly from an HTTP server like:

```python
import kvikio
import cupy

with kvikio.RemoteFile.from_http_url("http://127.0.0.1:9000/myfile") as f:
    ary = cupy.empty(f.nbytes, dtype="uint8")
    f.read(ary)
```

This PR is the first step to support S3 using libcurl instead of [aws-s3-sdk](#426), which has some pros and cons:

* Pros
  * The [global conda pinning issue](#426 (comment)) is less of a problem.
  * We can support other protocols such as http, ftp, and Azure’s storage, without much work.
  * We avoid the [free-after-main issue in aws-s3-sdk](https://github.com/rapidsai/kvikio/blob/000126516db430988ab9af5ee1576ca3fe6afe27/cpp/include/kvikio/remote_handle.hpp#L87-L94). This is huge since we would otherwise have to pass around a `S3Context` in libcudf and cudf to handle shutdown correctly. This is not a problem in libcurl, see https://curl.se/libcurl/c/libcurl.html under `Global constants`.
* Cons
  * Hard to support the AWS configuration file. We will require the user to either specify the options programmatically or through environment variables like `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #464
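The "programmatically or through environment variables" trade-off could look roughly like the sketch below: explicit arguments win, and the two environment variables named above are the fallback. This is only an illustration of the lookup order; `resolve_credentials` is a hypothetical helper, not a KvikIO function.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_credentials(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, str]:
    """Prefer explicit arguments, then fall back to environment variables."""
    key_id = access_key_id or env.get("AWS_ACCESS_KEY_ID")
    secret = secret_access_key or env.get("AWS_SECRET_ACCESS_KEY")
    if not key_id or not secret:
        # No AWS configuration file lookup: the caller must supply both values.
        raise ValueError(
            "Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or pass them explicitly"
        )
    return key_id, secret


# Example with an explicit mapping instead of the real environment.
fake_env = {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
print(resolve_credentials(env=fake_env))
```

Because nothing here reads `~/.aws/config` or `~/.aws/credentials`, users who rely on those files would need to export the variables themselves.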

vyasr · 2024-10-22T18:10:26Z

Replaced by #479.

Implements AWS S3 read support using libcurl:

```python
import kvikio
import cupy

with kvikio.RemoteFile.from_s3_url("s3://my-bucket/my-file") as f:
    ary = cupy.empty(f.nbytes, dtype="uint8")
    f.read(ary)
```

Supersedes #426

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #479
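An S3 URL of this kind carries two pieces of information, the bucket and the object key. As a rough sketch of that split (not the parsing KvikIO performs internally; `parse_s3_url` is a hypothetical helper built on the standard library):

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_s3_url(url: str) -> Tuple[str, str]:
    """Split an `s3://<bucket>/<key>` URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got: {url!r}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")  # drop the separator after the bucket
    if not bucket or not key:
        raise ValueError(f"URL must contain a bucket and an object key: {url!r}")
    return bucket, key


print(parse_s3_url("s3://my-bucket/my-file"))
```

Checking the scheme up front means a mistyped prefix is rejected with a clear error instead of being sent to the server.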

madsbk force-pushed the s3_support branch from 9e4887d to 7be1ed4 Compare July 31, 2024 15:18

madsbk added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Jul 31, 2024

madsbk mentioned this pull request Aug 6, 2024

AWS S3 IO through KvikIO rapidsai/cudf#16499

Merged

3 tasks

madsbk force-pushed the s3_support branch 2 times, most recently from 25d4fe9 to f76ff32 Compare August 12, 2024 08:43

madsbk added 2 commits August 13, 2024 08:32

benchmark: use cudf.set_option

4a5884c

madsbk force-pushed the s3_support branch from 4638aa8 to 4a5884c Compare August 13, 2024 06:50

madsbk added 9 commits August 13, 2024 09:37

revert some minor changes

804fd99

doc

bbbe637

doc

a752eee

doc

ce4218a

doc

5ba191d

--bundled-server-lifetime

5126b44

clean up

185e687

doc

502f7cf

cleanup

740a15d

madsbk marked this pull request as ready for review August 13, 2024 11:22

madsbk requested review from a team as code owners August 13, 2024 11:22

madsbk requested a review from jameslamb August 13, 2024 11:22

madsbk changed the title ~~[WIP] S3 support~~ S3 support Aug 13, 2024

doc

7fbabdf

madsbk commented Aug 13, 2024

cpp/include/kvikio/remote_handle.hpp

madsbk commented Aug 13, 2024

doc

5b41e74

KyleFromNVIDIA added a commit to KyleFromNVIDIA/ci-imgs that referenced this pull request Aug 15, 2024

Build and install aws-sdk-cpp in wheel images

11bd577

Contributes to rapidsai/kvikio#426

KyleFromNVIDIA added 5 commits August 29, 2024 11:50

Re-run CI

4e4389f

Install libcurl4-openssl-dev for pip devcontainers

cebc0a8

Link against aws-cpp-sdk-core

61a5d50

Try excluding curl

d55030a

Style

2123666

mhaseeb123 reviewed Aug 30, 2024

cpp/include/kvikio/bounce_buffer.hpp

KyleFromNVIDIA and others added 5 commits August 30, 2024 09:40

Merge branch 'branch-24.10' into s3_support

0c2f09b

Try AWS's HTTP client

bb445a3

No need to install libcurl

4701b52

libcudf_s3_io

2bd411e

Merge branch 'branch-24.10' of https://github.com/rapidsai/kvikio int…

b40a0ad

…o s3_support

madsbk added 6 commits September 4, 2024 08:28

Merge branch 'branch-24.10' into s3_support

7bcfe5b

Merge branch 'branch-24.10' of https://github.com/rapidsai/kvikio int…

2a9f934

…o s3_support

test: ignore deprecation warning

b978323

test: remove the kvikio._lib.remote_handle trigger

00304e7

Revert "test: ignore deprecation warning"

f9b49f2

This reverts commit b978323.

pytest: ignore deprecation warning in botocore

0403686

bdice requested changes Sep 6, 2024

Apply suggestions from code review

0001265

Co-authored-by: Bradley Dice <[email protected]>

bdice mentioned this pull request Sep 10, 2024

Track conda-forge migrations with automated tooling rapidsai/build-planning#100

Open

madsbk marked this pull request as draft September 24, 2024 06:30

madsbk mentioned this pull request Sep 24, 2024

Remote IO: http support #464

Merged

madsbk mentioned this pull request Oct 9, 2024

Remote IO: S3 support #479

Merged

madsbk closed this Oct 22, 2024
