[BUG] Wrong date on one node creates an error expired event which can't be deleted #1508

shessane · 2024-08-20T11:30:47Z

Describe the bug
We had an incident with one node on our Service Fabric cluster. The system date of the server was changed to a date in the future. We corrected the date, but Service Fabric still holds an OK event timestamped in the future, which causes an error on the partition that runs the system service fabric:/System/FailoverManagerService.

On 14/08/2024, the date on one of the nodes was changed to 29/09/2024 and stayed that way for about 5 hours before it was corrected. The node is a normal node (not a seed node).

Error:

PS D:\> Get-ServiceFabricPartitionHealth

cmdlet Get-ServiceFabricPartitionHealth at command pipeline position 1
Supply values for the following parameters:


PartitionId           : 00000000-0000-0000-0000-000000000001
AggregatedHealthState : Error
UnhealthyEvaluations  :
                        The OK reported by 'System.FMM' for property 'State' is expired. The report was applied at 2024-08-14 01:00:27.218 with TTL 15:00.000.
                        Partition is healthy.

ReplicaHealthStates   :
                        ReplicaId             : 132601204022355426
                        AggregatedHealthState : Ok

                        ReplicaId             : 132601204008971912
                        AggregatedHealthState : Ok

                        ReplicaId             : 132601204022355427
                        AggregatedHealthState : Ok

HealthEvents          :
                        SourceId              : System.FMM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 133720681240404006
                        SentAt                : 29/09/2024 07:22:04
                        ReceivedAt            : 14/08/2024 01:00:27
                        TTL                   : 00:15:00
                        RemoveWhenExpired     : False
                        IsExpired             : True
                        HealthReportID        : FMM_7.0_1009
                        Transitions           : Warning->Ok = 12/07/2024 22:50:20, LastError = 01/01/0001 00:00:00

HealthStatistics      :
                        Replica               : 3 Ok, 0 Warning, 0 Error
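
The timestamps in the output above can be cross-checked with a few lines of arithmetic. The sketch below is an illustration only: it assumes, without confirmation from Service Fabric documentation, that SequenceNumber can be read as a Windows FILETIME value (100 ns ticks since 1601-01-01). Under that assumption the sequence number decodes to the same instant as SentAt, which would mean that reports sent after the clock was corrected carry lower sequence numbers than the stuck event. The helper name filetime_to_datetime is hypothetical.

```python
from datetime import datetime, timedelta

# Windows FILETIME counts 100 ns ticks from this date (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1)


def filetime_to_datetime(ticks: int) -> datetime:
    """Interpret an integer as 100 ns ticks since 1601-01-01 (microsecond precision)."""
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


# Values copied from the HealthEvents output (dates are day/month/year).
sequence_number = 133720681240404006
sent_at = datetime(2024, 9, 29, 7, 22, 4)      # SentAt
received_at = datetime(2024, 8, 14, 1, 0, 27)  # ReceivedAt

# Assumption: the sequence number is a FILETIME taken from the node's clock.
decoded = filetime_to_datetime(sequence_number)

# How far the node's clock was ahead when the report was received.
skew = sent_at - received_at

print("sequence number read as FILETIME:", decoded)
print("SentAt is ahead of ReceivedAt by:", skew)
```

If this reading is right, it is one possible explanation for why later reports from 'System.FMM' do not replace the event until the real date passes 29/09/2024; the issue itself does not confirm this.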

Area/Component:
Partition that runs the system service fabric:/System/FailoverManagerService

To Reproduce
Steps to reproduce the behavior:

Set the date of one node far into the future.
The cluster receives events from this node with timestamps in the future.
Correct the date on this node.
The partition reports the error: The OK reported by 'System.FMM' for property 'State' is expired...

Expected behavior
Correcting the date on the node should generate new events that clear the error.

Observed behavior:
The cluster status is Error. This blocks Service Fabric package updates.

Screenshots

Service Fabric Runtime Version:
9.1.1390.9590

Environment:

Standalone
OS: Windows Server 2016
Version 9.1.1390.9590

If this is a regression, which version did it regress from?

Additional context
We tried restarting the VMs.
We also tried sending a partition health report: Send-ServiceFabricPartitionHealthReport -PartitionId 00000000-0000-0000-0000-000000000001 -SourceId "System.FMM" -HealthProperty "State" -HealthState Ok -TimeToLiveSec 30 -RemoveWhenExpired
We also tried a repair: Repair-ServiceFabricPartition -PartitionId 00000000-0000-0000-0000-000000000001

None of these reset the event status, and we have found no other way to do so.

Assignees: /cc @microsoft/service-fabric-triage


dribblor · 2024-10-01T13:35:07Z

Same here. It is present on our 10.0.1949.9590 single node cluster and is still there after updating it to 10.1.2338.9590.

shessane added the type-code-defect label (Something isn't working) · Aug 20, 2024
