Add documentation for querying S3 data with CLI (#3631)
* Add documentation for querying S3 data with CLI

* add s3 example

* update test

* fix example, use AWS_REGION

* prettier

* toml fmt
andygrove authored Sep 28, 2022
1 parent b4c0601 commit 06a4f79
Showing 5 changed files with 141 additions and 2 deletions.
35 changes: 35 additions & 0 deletions datafusion-cli/README.md
@@ -65,6 +65,41 @@ DataFusion CLI v12.0.0
1 row in set. Query took 0.017 seconds.
```

## Querying S3 Data Sources

The CLI can query data in S3 if the following environment variables are defined:

- `AWS_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`

Note that the region must be set to the region where the bucket exists until the following issue is resolved:

- https://github.com/apache/arrow-rs/issues/2795

Example:

```bash
$ aws s3 cp test.csv s3://my-bucket/
upload: ./test.csv to s3://my-bucket/test.csv

$ export AWS_REGION=us-east-1
$ export AWS_SECRET_ACCESS_KEY=***************************
$ export AWS_ACCESS_KEY_ID=**************

$ ./target/release/datafusion-cli
DataFusion CLI v12.0.0
❯ create external table test stored as csv location 's3://my-bucket/test.csv';
0 rows in set. Query took 0.374 seconds.
❯ select * from test;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1 | 2 |
+----------+----------+
1 row in set. Query took 0.171 seconds.
```
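
Since the CLI depends on all three variables being present, it can help to check them before starting a session. The following is a minimal sketch using only the Rust standard library. The helper names `missing_vars` and `missing_aws_vars` are hypothetical and are not part of DataFusion or `object_store`; treating an empty value as missing is also an assumption made for this example.

```rust
use std::env;

/// Environment variables listed as required in the section above.
const REQUIRED_VARS: [&str; 3] = [
    "AWS_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
];

/// Hypothetical helper: returns the required variables that are unset or
/// empty. `lookup` reads a variable by name, which lets the check be
/// exercised without touching the real process environment.
fn missing_vars<F>(lookup: F) -> Vec<&'static str>
where
    F: Fn(&str) -> Option<String>,
{
    REQUIRED_VARS
        .iter()
        .copied()
        .filter(|&name| lookup(name).map_or(true, |value| value.is_empty()))
        .collect()
}

/// Hypothetical wrapper: runs the same check against the process environment.
#[allow(dead_code)]
fn missing_aws_vars() -> Vec<&'static str> {
    missing_vars(|name| env::var(name).ok())
}
```

Calling `missing_aws_vars()` before launching a query and printing any returned names gives a clearer message than a failure deep inside the S3 client.
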
## DataFusion-Cli
Build the `datafusion-cli` by `cd` into the sub-directory:
4 changes: 2 additions & 2 deletions datafusion-cli/src/object_storage.rs
@@ -138,8 +138,8 @@ mod tests {
             .unwrap_err();
         assert!(err.to_string().contains("Generic S3 error: Missing region"));
 
-        env::set_var("AWS_DEFAULT_REGION", "us-east-1");
+        env::set_var("AWS_REGION", "us-east-1");
         assert!(provider.get_by_url(&Url::from_str(s3).unwrap()).is_ok());
-        env::remove_var("AWS_DEFAULT_REGION");
+        env::remove_var("AWS_REGION");
     }
 }
1 change: 1 addition & 0 deletions datafusion-examples/Cargo.toml
@@ -39,6 +39,7 @@ async-trait = "0.1.41"
datafusion = { path = "../datafusion/core" }
futures = "0.3"
num_cpus = "1.13.0"
object_store = { version = "0.5.0", features = ["aws"] }
prost = "0.11.0"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.82"
68 changes: 68 additions & 0 deletions datafusion-examples/examples/query-aws-s3.rs
@@ -0,0 +1,68 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

use datafusion::error::Result;
use datafusion::prelude::*;
use object_store::aws::AmazonS3Builder;
use std::env;
use std::sync::Arc;

/// This example demonstrates querying data in an S3 bucket.
///
/// The following environment variables must be defined:
///
/// - AWS_ACCESS_KEY_ID
/// - AWS_SECRET_ACCESS_KEY
///
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // the region must be set to the region where the bucket exists until the following
    // issue is resolved
    // https://github.com/apache/arrow-rs/issues/2795
    let region = "us-east-1";
    let bucket_name = "nyc-tlc";

    let s3 = AmazonS3Builder::new()
        .with_bucket_name(bucket_name)
        .with_region(region)
        .with_access_key_id(env::var("AWS_ACCESS_KEY_ID").unwrap())
        .with_secret_access_key(env::var("AWS_SECRET_ACCESS_KEY").unwrap())
        .build()?;

    ctx.runtime_env()
        .register_object_store("s3", bucket_name, Arc::new(s3));

    // cannot query the parquet files from this bucket because the path contains a whitespace
    // and we don't support that yet
    // https://github.com/apache/arrow-rs/issues/2799
    let path = format!(
        "s3://{}/csv_backup/yellow_tripdata_2022-02.csv",
        bucket_name
    );
    ctx.register_csv("trips", &path, CsvReadOptions::default())
        .await?;

    // execute the query
    let df = ctx.sql("SELECT * FROM trips LIMIT 10").await?;

    // print the results
    df.show().await?;

    Ok(())
}
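
The example above notes that object paths containing whitespace are not supported yet (arrow-rs issue 2799). One way to surface that limitation early is to validate the key while building the URL. This is a minimal standard-library sketch; `s3_url` is a hypothetical helper, not part of DataFusion or `object_store`, and stripping a leading slash from the key is an assumption made for this example.

```rust
/// Hypothetical helper: builds an `s3://bucket/key` URL and rejects keys
/// that contain whitespace, which the example above cannot query yet.
fn s3_url(bucket: &str, key: &str) -> Result<String, String> {
    if key.chars().any(char::is_whitespace) {
        return Err(format!("object key {:?} contains whitespace", key));
    }
    // Assumption: drop leading slashes so the URL has a single separator.
    Ok(format!("s3://{}/{}", bucket, key.trim_start_matches('/')))
}
```

With such a helper, the `format!` call in the example could be replaced by `s3_url(bucket_name, "csv_backup/yellow_tripdata_2022-02.csv")`, with the error mapped into whatever error type the caller uses.
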
35 changes: 35 additions & 0 deletions docs/source/user-guide/cli.md
@@ -151,6 +151,41 @@ STORED AS CSV
LOCATION '/path/to/aggregate_test_100.csv';
```

## Querying S3 Data Sources

The CLI can query data in S3 if the following environment variables are defined:

- `AWS_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`

Note that the region must be set to the region where the bucket exists until the following issue is resolved:

- https://github.com/apache/arrow-rs/issues/2795

Example:

```bash
$ aws s3 cp test.csv s3://my-bucket/
upload: ./test.csv to s3://my-bucket/test.csv

$ export AWS_REGION=us-east-2
$ export AWS_SECRET_ACCESS_KEY=***************************
$ export AWS_ACCESS_KEY_ID=**************

$ ./target/release/datafusion-cli
DataFusion CLI v12.0.0
❯ create external table test stored as csv location 's3://my-bucket/test.csv';
0 rows in set. Query took 0.374 seconds.
❯ select * from test;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1 | 2 |
+----------+----------+
1 row in set. Query took 0.171 seconds.
```

## Commands
Available commands inside DataFusion CLI are: