Collect and report task failure related statistics in `QueryCompletedEvent` #10734

losipiuk · 2022-01-21T15:44:45Z

High level goal

The goal for this issue is to have more visibility on the nature of query execution with regard to failure recovery. Specifically, we want to be able to understand:

how many tasks failed through query
how much extra work (CPU, network) was performed due to task failures
Possibly how much more wall-time did query take due to failures (may be hard)
With what memory requirements were the query tasks run (change TaskInfo?)

We want this extra information to be available to end-users via SPI's QueryCompletedEvent and via UI (tracked separately)

SPI

At the SPI level, we should extend QueryCompletedEvent in to contain extra fields.
Preferably we should not break the backward compatibility of SPI too much. So we should rather just add new fields (not rename old ones or restructure whole classes).

Initially let's focus on top-level statistic fields:

trino/core/trino-spi/src/main/java/io/trino/spi/eventlistener/QueryStatistics.java

Lines 39 to 53 in 5b90642

    
           private final long peakUserMemoryBytes; 
        
           // peak of user + system memory 
        
           private final long peakTotalNonRevocableMemoryBytes; 
        
           private final long peakTaskUserMemory; 
        
           private final long peakTaskTotalMemory; 
        
           private final long physicalInputBytes; 
        
           private final long physicalInputRows; 
        
           private final long internalNetworkBytes; 
        
           private final long internalNetworkRows; 
        
           private final long totalBytes; 
        
           private final long totalRows; 
        
           private final long outputBytes; 
        
           private final long outputRows; 
        
           private final long writtenBytes; 
        
           private final long writtenRows;

trino/core/trino-spi/src/main/java/io/trino/spi/eventlistener/QueryStatistics.java

Line 30 in 5b90642

private final Duration cpuTime;

The fields which we already have should correspond to all tasks which were executed during the query, including failed ones (this is actually the semantics we have right now).
We should add new fields which will filter out the statistics for failed tasks (just count the tasks which completed successfully).

for cpuTime -> cpuTimeNoFailed
for peakUserMemoryBytes -> peakUserMemoryBytesNoFailed
for outputBytes -> outputBytesNoFailed

QQ: is the naming convention ok

QQ: maybe more natural would be to have stats which count all tasks, and stats which count just failed tasks?

QueryStats

SPI's QueryCompletedEvent is constructed on top of internal QueryStats class. QueryStats is also directly consumed by web UI (via QueryInfo).
When it comes to changes to QueryStats/QueryInfo we are not constrained with backward compatibility that much, so we can restructure it a bit if needed.
The simplest approach could be to not modify QueryStats at all and just have two fields in QueryInfo:

QueryInfo.queryStats
QueryInfo.queryStatsNoFailedTasks
For some of the fields within QueryStats the failed/non-failed distinction is not relevant (e.g. createTime or lastHeartbeat) . We can just keep the same value for those fields in both instances.

The QueryInfo object is constructed from collection of StageInfo objects. StageInfo will also need to contain separate statistics build from all the tasks for the given stage, and just non-failed ones.
The code structure in QueryInfo and StageInfo should be very similar.

UI (followup)

On top of this work we should extend UI (tracked by #10754).
Details to be defined but at very least we should be able to extend the "Resource Utilization Summary" so both values (including and excluding failed tasks) are reported

The text was updated successfully, but these errors were encountered:

losipiuk · 2022-01-24T09:35:54Z

@arhimondr, @findepi I put high level attack plan for this issue in the description. Let's discuss if it does make sense, and I will make necessary corrections.

cc: @martint

losipiuk mentioned this issue Jan 21, 2022

Support Failure Recovery #9101

Closed

31 tasks

losipiuk changed the title ~~Collect and report failures and additional statistics in QueryCompletedEvent~~ Collect and report task failure related statistics in QueryCompletedEvent Jan 21, 2022

losipiuk self-assigned this Jan 21, 2022

losipiuk mentioned this issue Jan 24, 2022

Allow assigning exclusive node for task execution #10432

Merged

losipiuk assigned losipiuk and unassigned losipiuk Jan 25, 2022

losipiuk mentioned this issue Mar 4, 2022

Collect separate statistics for failed tasks #11327

Merged

losipiuk closed this as completed in #11327 Mar 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collect and report task failure related statistics in `QueryCompletedEvent` #10734

Collect and report task failure related statistics in `QueryCompletedEvent` #10734

losipiuk commented Jan 21, 2022 •

edited

Loading

losipiuk commented Jan 24, 2022

Collect and report task failure related statistics in QueryCompletedEvent #10734

Collect and report task failure related statistics in QueryCompletedEvent #10734

Comments

losipiuk commented Jan 21, 2022 • edited Loading

High level goal

SPI

QueryStats

UI (followup)

losipiuk commented Jan 24, 2022

Collect and report task failure related statistics in `QueryCompletedEvent` #10734

Collect and report task failure related statistics in `QueryCompletedEvent` #10734

losipiuk commented Jan 21, 2022 •

edited

Loading