scheduler now verifies that file:// ListingTable URLs are accessible #414

Merged
merged 7 commits into apache:master from validate-listing-table-urls
Oct 23, 2022

Conversation

andygrove
Member

@andygrove andygrove commented Oct 20, 2022

Which issue does this PR close?

Closes #353

Rationale for this change

My query now fails rather than silently producing the wrong results (empty result set).

Error planning job JfrneIG: 
General(\"logical plan refers to path that is not accessible in scheduler file system: /mnt/bigdata/tpcds/sf100-parquet/store_returns.parquet/: IoError(Os { code: 2, kind: NotFound, message: \\\"No such file or directory\\\" })\")"))))

What changes are included in this PR?

  • Scheduler verifies files exist
  • Logging changes

Are there any user-facing changes?

@andygrove andygrove requested a review from yahoNanJing October 20, 2022 15:50
@andygrove
Member Author

@avantgardnerio PTAL

@andygrove andygrove marked this pull request as ready for review October 20, 2022 15:51
@@ -256,11 +260,46 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan> SchedulerState<T,
plan: &LogicalPlan,
) -> Result<()> {
let start = Instant::now();

// optimizing the plan here is redundant because the physical planner will do this again
// but it is helpful to see what the optimized plan will be
Contributor

Nice catch. How about changing this to the following:

    if log::max_level() >= log::Level::Debug {
        let optimized_plan = session_ctx.optimize(plan)?;
        debug!("Optimized plan: {}", optimized_plan.display_indent());
    }

so that we can avoid unnecessary optimization when the log level is set to info.
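
The suggestion above amounts to running the expensive step only when its output will actually be logged. A minimal std-only sketch of that pattern follows; the `Level` enum and `debug_optimized_plan` are hypothetical stand-ins for the `log` crate and the scheduler code, not the PR's implementation.

```rust
/// Hypothetical stand-in for a logging level, ordered from least to most verbose.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Runs `optimize` only when the configured level is Debug or more verbose.
/// Returns the message that would be logged, or None if the work was skipped.
fn debug_optimized_plan<F>(max_level: Level, optimize: F) -> Option<String>
where
    F: FnOnce() -> String,
{
    if max_level >= Level::Debug {
        // The closure is the expensive part; it is never called at Info or below.
        Some(format!("Optimized plan: {}", optimize()))
    } else {
        None
    }
}
```

Passing the optimization as a closure keeps the cost at zero when the level is info, which is the behavior the review comment asks for.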

Member Author

That's a good point. Maybe we just need to check the first path.

for url in table.table_paths() {
// remove file:/// prefix and verify that the file is accessible
let url = url.as_str();
let url = url.strip_prefix("file:///").unwrap_or(url);
Contributor

Will it check files on S3 or HDFS? Sometimes table_paths may contain tens of thousands of entries, so this check could be very time-consuming.

Member Author

I pushed a change so that this check is only performed for files on the local file system (URLs starting with file:///), and it now only checks the first file.
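
The behavior described in this reply (skip non-local URLs, inspect only the first path) can be sketched as follows. This is an illustration using only the standard library, not the merged code: `verify_first_local_path` is a hypothetical name, and the sketch assumes the `file://` scheme prefix is stripped so that the leading `/` of the absolute path is kept.

```rust
use std::fs;

/// Hypothetical helper: verifies that the first table path is accessible,
/// but only if it refers to the local file system.
fn verify_first_local_path(table_paths: &[&str]) -> Result<(), String> {
    // Only the first path is inspected, so the cost does not grow
    // with the number of files in the table.
    if let Some(url) = table_paths.first() {
        // Remote schemes (s3://, hdfs://, ...) have no `file://` prefix
        // and are skipped entirely.
        if let Some(path) = url.strip_prefix("file://") {
            fs::metadata(path).map_err(|e| {
                format!(
                    "logical plan refers to path that is not accessible \
                     in scheduler file system: {}: {:?}",
                    path, e
                )
            })?;
        }
    }
    Ok(())
}
```

A missing local path produces an error at planning time, while object-store URLs pass through unchecked, which addresses the concern about tables with very many paths.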

@andygrove
Member Author

Thanks for the review @yahoNanJing. I have pushed changes to address your feedback.

@andygrove andygrove changed the title from "scheduler now verifies that ListingTable URLs are accessible" to "scheduler now verifies that file:// ListingTable URLs are accessible" Oct 22, 2022
@@ -229,6 +229,7 @@ pub fn remove_unresolved_shuffles(
.iter()
.map(|c| c
.iter()
+.filter(|l| !l.path.is_empty())
Member Author

Unrelated change, but this avoids printing lots of newlines in the log.
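
To illustrate why the filter helps, here is a small sketch of joining partition paths for a log message. The `Location` struct and `format_locations` function are hypothetical and simplified; the surrounding code in `remove_unresolved_shuffles` is not shown in this diff, so the join separators are assumptions.

```rust
/// Hypothetical, simplified stand-in for a partition location.
struct Location {
    path: String,
}

/// Builds a log-friendly string with one line per partition,
/// skipping locations whose path is empty.
fn format_locations(partitions: &[Vec<Location>]) -> String {
    partitions
        .iter()
        .map(|c| {
            c.iter()
                // Without this filter, each empty path would contribute
                // an empty entry to the joined output.
                .filter(|l| !l.path.is_empty())
                .map(|l| l.path.as_str())
                .collect::<Vec<_>>()
                .join(",")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```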

@@ -71,7 +71,7 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan> SchedulerGrpc
task_status,
} = request.into_inner()
{
-debug!("Received poll_work request for {:?}", metadata);
+trace!("Received poll_work request for {:?}", metadata);
Member Author

This was too verbose for the debug level.

@andygrove
Member Author

I am going to go ahead and merge this so that I can cut the RC today.

@andygrove andygrove merged commit 0a518ae into apache:master Oct 23, 2022
@andygrove andygrove deleted the validate-listing-table-urls branch October 23, 2022 17:05

Successfully merging this pull request may close these issues.

Scheduler silently replaces ParquetExec with EmptyExec if data path is not correctly mounted in container
2 participants