The goal of this issue is to gain more visibility into query execution with regard to failure recovery. Specifically, we want to be able to understand:
how many tasks failed during the query
how much extra work (CPU, network) was performed due to task failures
possibly, how much additional wall time the query took due to failures (may be hard to measure)
what memory requirements the query tasks were run with (change TaskInfo?)
We want this extra information to be available to end users via the SPI's QueryCompletedEvent and via the UI (tracked separately).
SPI
At the SPI level, we should extend QueryCompletedEvent to contain extra fields.
Preferably, we should not break the backward compatibility of the SPI too much, so we should just add new fields (not rename old ones or restructure whole classes).
Initially let's focus on top-level statistic fields:
The fields we already have should cover all tasks executed during the query, including failed ones (this is the semantics we have right now).
We should add new fields that filter out the statistics for failed tasks (counting only the tasks that completed successfully):
for cpuTime -> cpuTimeNoFailed
for peakUserMemoryBytes -> peakUserMemoryBytesNoFailed
for outputBytes -> outputBytesNoFailed
QQ: is the naming convention OK?
QQ: would it be more natural to have stats that count all tasks, and stats that count only failed tasks?
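To make the additive approach concrete, here is a minimal sketch of how the existing fields and the proposed *NoFailed fields could sit side by side. TaskSample, TopLevelStats, fromTasks and cpuTimeSpentOnFailedTasks are hypothetical names for illustration only, not the real QueryStatistics or TaskInfo API. Taking the maximum of per-task peaks is a simplification; the real query-level peak memory cannot be derived this way.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical, simplified per-task sample; NOT Trino's real TaskInfo.
record TaskSample(boolean failed, Duration cpuTime, long peakUserMemoryBytes, long outputBytes) {}

// Hypothetical subset of top-level statistics. The first three fields keep
// their current meaning (all tasks, failed ones included); the *NoFailed
// fields are purely additive, so existing consumers are unaffected.
record TopLevelStats(
        Duration cpuTime,
        long peakUserMemoryBytes,
        long outputBytes,
        Duration cpuTimeNoFailed,
        long peakUserMemoryBytesNoFailed,
        long outputBytesNoFailed)
{
    static TopLevelStats fromTasks(List<TaskSample> tasks)
    {
        Duration cpuAll = Duration.ZERO;
        Duration cpuOk = Duration.ZERO;
        long peakAll = 0;
        long peakOk = 0;
        long outAll = 0;
        long outOk = 0;
        for (TaskSample task : tasks) {
            // Every task contributes to the existing "all tasks" fields.
            cpuAll = cpuAll.plus(task.cpuTime());
            // Simplification: max of per-task peaks stands in for the query peak.
            peakAll = Math.max(peakAll, task.peakUserMemoryBytes());
            outAll += task.outputBytes();
            if (!task.failed()) {
                // Only non-failed tasks contribute to the new fields.
                cpuOk = cpuOk.plus(task.cpuTime());
                peakOk = Math.max(peakOk, task.peakUserMemoryBytes());
                outOk += task.outputBytes();
            }
        }
        return new TopLevelStats(cpuAll, peakAll, outAll, cpuOk, peakOk, outOk);
    }

    // Extra CPU attributable to failed tasks, derived from the two variants.
    Duration cpuTimeSpentOnFailedTasks()
    {
        return cpuTime.minus(cpuTimeNoFailed);
    }
}
```

With both variants present, the "failed tasks only" view raised in the second QQ can be derived by subtraction for additive metrics such as CPU time and output bytes, though not for peak memory.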
QueryStats
The SPI's QueryCompletedEvent is constructed on top of the internal QueryStats class. QueryStats is also consumed directly by the web UI (via QueryInfo).
When it comes to changes to QueryStats/QueryInfo, we are not constrained by backward compatibility that much, so we can restructure them a bit if needed.
The simplest approach could be to not modify QueryStats at all and just have two fields in QueryInfo:
QueryInfo.queryStats
QueryInfo.queryStatsNoFailedTasks
For some of the fields within QueryStats, the failed/non-failed distinction is not relevant (e.g. createTime or lastHeartbeat). We can just keep the same value for those fields in both instances.
The QueryInfo object is constructed from a collection of StageInfo objects. StageInfo will also need to contain separate statistics: one set built from all the tasks for the given stage, and one built from just the non-failed ones.
The code structure in QueryInfo and StageInfo should be very similar.
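The two-instance idea can be sketched roughly as follows. All types here (SketchTask, SketchStats, SketchStageInfo, SketchQueryInfo) and their fields are hypothetical stand-ins, not the real QueryStats, StageInfo or QueryInfo classes. The sketch only illustrates that both levels carry an "all tasks" and a "no failed tasks" instance, and that a field such as createTime is simply copied into both.

```java
import java.util.List;

// Hypothetical per-task input; NOT Trino's real TaskInfo.
record SketchTask(boolean failed, long cpuNanos) {}

// Hypothetical stand-in for QueryStats/StageStats with one field where the
// failed/non-failed distinction is irrelevant (createTimeMillis) and two
// fields where it matters.
record SketchStats(long createTimeMillis, int taskCount, long cpuNanos)
{
    static SketchStats aggregate(long createTimeMillis, List<SketchTask> tasks, boolean includeFailed)
    {
        int count = 0;
        long cpu = 0;
        for (SketchTask task : tasks) {
            if (includeFailed || !task.failed()) {
                count++;
                cpu += task.cpuNanos();
            }
        }
        return new SketchStats(createTimeMillis, count, cpu);
    }

    SketchStats plus(SketchStats other)
    {
        // createTimeMillis is kept as-is; only the additive fields are summed.
        return new SketchStats(createTimeMillis, taskCount + other.taskCount, cpuNanos + other.cpuNanos);
    }
}

// Stage level: both variants are built from the same task list.
record SketchStageInfo(SketchStats stageStats, SketchStats stageStatsNoFailedTasks)
{
    static SketchStageInfo fromTasks(long createTimeMillis, List<SketchTask> tasks)
    {
        return new SketchStageInfo(
                SketchStats.aggregate(createTimeMillis, tasks, true),
                SketchStats.aggregate(createTimeMillis, tasks, false));
    }
}

// Query level: same shape as the stage level, built by summing the stages.
record SketchQueryInfo(SketchStats queryStats, SketchStats queryStatsNoFailedTasks)
{
    static SketchQueryInfo fromStages(long createTimeMillis, List<SketchStageInfo> stages)
    {
        SketchStats all = new SketchStats(createTimeMillis, 0, 0);
        SketchStats noFailed = new SketchStats(createTimeMillis, 0, 0);
        for (SketchStageInfo stage : stages) {
            all = all.plus(stage.stageStats());
            noFailed = noFailed.plus(stage.stageStatsNoFailedTasks());
        }
        return new SketchQueryInfo(all, noFailed);
    }
}
```

Under these assumptions, the number of failed tasks falls out as queryStats.taskCount() minus queryStatsNoFailedTasks.taskCount(), without a dedicated field.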
UI (followup)
On top of this work we should extend the UI (tracked by #10754).
Details are to be defined, but at the very least we should be able to extend the "Resource Utilization Summary" so that both values (including and excluding failed tasks) are reported.
losipiuk changed the title from "Collect and report failures and additional statistics in QueryCompletedEvent" to "Collect and report task failure related statistics in QueryCompletedEvent" on Jan 21, 2022
@arhimondr, @findepi I put a high-level plan of attack for this issue in the description. Let's discuss whether it makes sense, and I will make the necessary corrections.
Code referenced in the description (top-level statistic fields): trino/core/trino-spi/src/main/java/io/trino/spi/eventlistener/QueryStatistics.java, lines 39 to 53 and line 30 in 5b90642