Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect and report task failure related statistics in QueryCompletedEvent #10734

Closed
Tracked by #9101
losipiuk opened this issue Jan 21, 2022 · 1 comment · Fixed by #11327
Closed
Tracked by #9101

Collect and report task failure related statistics in QueryCompletedEvent #10734

losipiuk opened this issue Jan 21, 2022 · 1 comment · Fixed by #11327
Assignees

Comments

@losipiuk
Copy link
Member

losipiuk commented Jan 21, 2022

High level goal

The goal for this issue is to have more visibility on the nature of query execution with regard to failure recovery. Specifically, we want to be able to understand:

  • how many tasks failed through query
  • how much extra work (CPU, network) was performed due to task failures
  • Possibly how much more wall-time did query take due to failures (may be hard)
  • With what memory requirements were the query tasks run (change TaskInfo?)

We want this extra information to be available to end-users via SPI's QueryCompletedEvent and via UI (tracked separately)

SPI

At the SPI level, we should extend QueryCompletedEvent in to contain extra fields.
Preferably we should not break the backward compatibility of SPI too much. So we should rather just add new fields (not rename old ones or restructure whole classes).

Initially let's focus on top-level statistic fields:

The fields which we already have should correspond to all tasks which were executed during the query, including failed ones (this is actually the semantics we have right now).
We should add new fields which will filter out the statistics for failed tasks (just count the tasks which completed successfully).

  • for cpuTime -> cpuTimeNoFailed
  • for peakUserMemoryBytes -> peakUserMemoryBytesNoFailed
  • for outputBytes -> outputBytesNoFailed

QQ: is the naming convention ok

QQ: maybe more natural would be to have stats which count all tasks, and stats which count just failed tasks?

QueryStats

SPI's QueryCompletedEvent is constructed on top of internal QueryStats class. QueryStats is also directly consumed by web UI (via QueryInfo).
When it comes to changes to QueryStats/QueryInfo we are not constrained with backward compatibility that much, so we can restructure it a bit if needed.
The simplest approach could be to not modify QueryStats at all and just have two fields in QueryInfo:

  • QueryInfo.queryStats
  • QueryInfo.queryStatsNoFailedTasks
    For some of the fields within QueryStats the failed/non-failed distinction is not relevant (e.g. createTime or lastHeartbeat) . We can just keep the same value for those fields in both instances.

The QueryInfo object is constructed from collection of StageInfo objects. StageInfo will also need to contain separate statistics build from all the tasks for the given stage, and just non-failed ones.
The code structure in QueryInfo and StageInfo should be very similar.

UI (followup)

On top of this work we should extend UI (tracked by #10754).
Details to be defined but at very least we should be able to extend the "Resource Utilization Summary" so both values (including and excluding failed tasks) are reported
image

@losipiuk losipiuk mentioned this issue Jan 21, 2022
31 tasks
@losipiuk losipiuk changed the title Collect and report failures and additional statistics in QueryCompletedEvent Collect and report task failure related statistics in QueryCompletedEvent Jan 21, 2022
@losipiuk losipiuk self-assigned this Jan 21, 2022
@losipiuk
Copy link
Member Author

@arhimondr, @findepi I put high level attack plan for this issue in the description. Let's discuss if it does make sense, and I will make necessary corrections.

cc: @martint

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

1 participant