
Cannot query file from S3 #559

Closed
skamalj opened this issue Dec 9, 2022 · 7 comments

Labels
bug Something isn't working

Comments


skamalj commented Dec 9, 2022

Describe the bug
I am trying to query a CSV file from S3 using ballista-cli, and I get a "No object store found" error.

Ballista CLI v0.10.0

❯ CREATE EXTERNAL TABLE foo2 (a INT, b INT) STORED AS CSV LOCATION 's3://skamalj-s3/data.csv';
0 rows in set. Query took 0.000 seconds.
❯ SELECT * FROM foo2;
[2022-12-09T10:46:59Z ERROR ballista_scheduler::scheduler_server::query_stage_scheduler] Error planning job QnD94VQ: DataFusionError(Execution("No object store available for s3://skamalj-s3/data.csv"))
[2022-12-09T10:46:59Z ERROR ballista_scheduler::scheduler_server::query_stage_scheduler] Job QnD94VQ failed: Error planning job QnD94VQ: DataFusionError(Execution("No object store available for s3://skamalj-s3/data.csv"))
[2022-12-09T10:46:59Z ERROR ballista_core::execution_plans::distributed_query] Job QnD94VQ failed: Error planning job QnD94VQ: DataFusionError(Execution("No object store available for s3://skamalj-s3/data.csv"))
DataFusionError(ArrowError(ExternalError(Execution("Job QnD94VQ failed: Error planning job QnD94VQ: DataFusionError(Execution("No object store available for s3://skamalj-s3/data.csv"))"))))

To Reproduce
Steps are copied above. The data.csv file was created with the command $ echo "1,2" > data.csv

Expected behavior
The query should return results, just as it does when the same CSV file is queried from the local filesystem.

Additional context
The scheduler and executor were installed with cargo install (version 0.10).
AWS credentials are set on the local machine, and the aws s3 ls command on the same machine returns the listing.

skamalj added the bug (Something isn't working) label on Dec 9, 2022
thinkharderdev (Contributor) commented

Hi @skamalj, are you building the CLI locally? This should work if the s3 feature is enabled.


skamalj commented Dec 13, 2022

Hello @thinkharderdev, no, I did not build the CLI; I just installed it using cargo. How do I enable the s3 feature?

thinkharderdev (Contributor) commented

> Hello @thinkharderdev, no, I did not build the CLI; I just installed it using cargo. How do I enable the s3 feature?

I think you should be able to run cargo install ballista-cli --features s3 to build with the s3 feature.
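
For reference, a minimal sketch of installing all three components with S3 support. The CLI install command is the one given above; the scheduler and executor lines assume a local checkout of the Ballista repository (the crate paths are illustrative) and use the ballista-core/s3 feature flag that skamalj reports using later in this thread:

$ # CLI from crates.io, with the s3 feature enabled
$ cargo install ballista-cli --features s3
$ # Scheduler and executor built from a source checkout with S3 support
$ # (illustrative paths; feature flag as reported later in this thread)
$ cargo install --path ballista/scheduler --features ballista-core/s3
$ cargo install --path ballista/executor --features ballista-core/s3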


skamalj commented Dec 14, 2022

Thanks @thinkharderdev, it works with ballista-cli now. I have now built the scheduler and executor as well with the ballista-core/s3 feature flag and connected to the instance with this ballista-cli. I am now getting a missing region error. I have tried setting AWS_REGION and AWS_DEFAULT_REGION in both the scheduler and ballista-cli shells, but I get the same error.

I have verified that it is finding the S3 location, because the CREATE command fails if I give a non-existent path.

(base) kamal@Kamal:~/.aws$ ballista-cli --host localhost --port 50050
Ballista CLI v0.10.0
❯ create external table test2 stored as csv location 's3://skamalj-s3/data.csv';
0 rows in set. Query took 0.539 seconds.
❯ select * from test2;
[2022-12-14T18:25:28Z ERROR ballista_core::execution_plans::distributed_query] Job KdKKEnv failed: Error planning job KdKKEnv: DataFusionError(ObjectStore(Generic { store: "S3", source: MissingRegion }))
DataFusionError(ArrowError(ExternalError(Execution("Job KdKKEnv failed: Error planning job KdKKEnv: DataFusionError(ObjectStore(Generic { store: "S3", source: MissingRegion }))"))))


r4ntix commented Dec 15, 2022

> I am now getting a missing region error. I have tried setting AWS_REGION and AWS_DEFAULT_REGION in both the scheduler and ballista-cli shells, but I get the same error. […]

It is also necessary to set the S3-related configuration in the environment when the ballista-executor starts:

> export AWS_ACCESS_KEY_ID=XXXX
> export AWS_SECRET_ACCESS_KEY=XXXX
> export AWS_DEFAULT_REGION=XXXX
> export AWS_ENDPOINT=https://xxxx

thinkharderdev (Contributor) commented

> It is also necessary to set the S3-related configuration in the environment when the ballista-executor starts: […]

Yeah, both the scheduler and executor would need credentials for the S3 API.
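
As a sketch of the full launch sequence, assuming the scheduler and executor are each started from a shell where the credentials are exported (us-east-1 is only an example region; the binary names are the ones installed by cargo):

$ export AWS_ACCESS_KEY_ID=XXXX
$ export AWS_SECRET_ACCESS_KEY=XXXX
$ export AWS_DEFAULT_REGION=us-east-1   # example region; resolves the MissingRegion error
$ ballista-scheduler &
$ ballista-executor &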


skamalj commented Dec 15, 2022

Thanks to both @thinkharderdev and @r4ntix. It works when credentials are set for both.
