[BUG] Wrong date on one node creates an error expired event which can't be deleted #1508

Open
shessane opened this issue Aug 20, 2024 · 1 comment
Labels
type-code-defect Something isn't working

Comments

@shessane

Describe the bug
We had an incident with one node in our Service Fabric cluster: the server's system date was changed to a date in the future. We corrected the date, but Service Fabric still holds an OK health event with a future timestamp, which causes an error on the partition that runs the system service fabric:/System/FailoverManagerService.

On 14/08/2024, the date on one of the nodes was set to 29/09/2024 for about 5 hours before it was corrected. The node is a regular node (not a seed node).

Error :

PS D:\> Get-ServiceFabricPartitionHealth

cmdlet Get-ServiceFabricPartitionHealth at command pipeline position 1
Supply values for the following parameters:


PartitionId           : 00000000-0000-0000-0000-000000000001
AggregatedHealthState : Error
UnhealthyEvaluations  :
                        The OK reported by 'System.FMM' for property 'State' is expired. The report was applied at 2024-08-14 01:00:27.218 with TTL 15:00.000.
                        Partition is healthy.

ReplicaHealthStates   :
                        ReplicaId             : 132601204022355426
                        AggregatedHealthState : Ok

                        ReplicaId             : 132601204008971912
                        AggregatedHealthState : Ok

                        ReplicaId             : 132601204022355427
                        AggregatedHealthState : Ok

HealthEvents          :
                        SourceId              : System.FMM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 133720681240404006
                        SentAt                : 29/09/2024 07:22:04
                        ReceivedAt            : 14/08/2024 01:00:27
                        TTL                   : 00:15:00
                        RemoveWhenExpired     : False
                        IsExpired             : True
                        HealthReportID        : FMM_7.0_1009
                        Transitions           : Warning->Ok = 12/07/2024 22:50:20, LastError = 01/01/0001 00:00:00

HealthStatistics      :
                        Replica               : 3 Ok, 0 Warning, 0 Error
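
One detail in the output above is worth checking: the SequenceNumber looks like a Windows FILETIME value (100-nanosecond ticks since 1601-01-01 UTC). The following is a minimal sketch, assuming that interpretation, which decodes the number and compares it with the SentAt and ReceivedAt fields. The helper name `filetime_to_datetime` is hypothetical, and the output does not state that the sequence number is generated from the node clock. That is only an inference from the values lining up.

```python
from datetime import datetime, timedelta, timezone

# Windows FILETIME epoch: 1601-01-01 UTC, counted in 100-nanosecond ticks.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(ticks: int) -> datetime:
    """Interpret an integer as a Windows FILETIME and return a UTC datetime.

    Assumption: the value is a FILETIME; the health output does not say so.
    """
    # timedelta has microsecond resolution, so drop the last decimal digit.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


# Values copied from the HealthEvents output above.
sequence_number = 133720681240404006
sent_at = datetime(2024, 9, 29, 7, 22, 4)
received_at = datetime(2024, 8, 14, 1, 0, 27)

decoded = filetime_to_datetime(sequence_number)

# Compare at one-second resolution, ignoring the time zone label.
matches_sent_at = decoded.replace(microsecond=0, tzinfo=None) == sent_at
clock_skew = sent_at - received_at

print(matches_sent_at)  # True
print(clock_skew)       # 46 days, 6:21:37
```

If that reading is right, the stuck event carries a sequence number about 46 days ahead of the time at which the cluster received it.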

Area/Component:
Partition that runs the system service fabric:/System/FailoverManagerService

To Reproduce
Steps to reproduce the behavior:

  1. Set the date on one node far into the future.
  2. The cluster should receive events from this node with dates in the future.
  3. Fix the date on this node.
  4. The partition will report the error: The OK reported by 'System.FMM' for property 'State' is expired...
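
One possible explanation for these steps, offered as a hypothesis rather than a confirmed root cause: Service Fabric health reports carry a sequence number that the health store uses to detect stale reports, and a report whose sequence number is not higher than the last applied one for the same source and property is discarded. The sketch below is a toy model of that rule. `ToyHealthStore` and `Report` are hypothetical names, not Service Fabric types, and the lower sequence number is an invented value standing in for a report sent after the clock was corrected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    source_id: str
    health_property: str
    sequence_number: int


class ToyHealthStore:
    """Hypothetical stand-in for a health store; not Service Fabric's code."""

    def __init__(self) -> None:
        self._latest: dict[tuple[str, str], Report] = {}

    def apply(self, report: Report) -> bool:
        """Apply a report unless it looks stale; return True if applied."""
        key = (report.source_id, report.health_property)
        current = self._latest.get(key)
        if current is not None and report.sequence_number <= current.sequence_number:
            return False  # treated as stale and dropped
        self._latest[key] = report
        return True


store = ToyHealthStore()

# Sequence number taken from the health output in this issue.
from_future_clock = Report("System.FMM", "State", 133720681240404006)
# Hypothetical lower value, as a corrected clock might produce.
after_clock_fix = Report("System.FMM", "State", 133685000000000000)

first_applied = store.apply(from_future_clock)
second_applied = store.apply(after_clock_fix)

print(first_applied)   # True
print(second_applied)  # False
```

Under this assumption, reports sent after the clock fix would keep being dropped until their sequence numbers pass the one recorded while the clock was wrong.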

Expected behavior
Fixing the date on the node should generate new events that clear the error event.

Observed behavior:
The cluster status is Error. This blocks Service Fabric package updates.

Screenshots
image

Service Fabric Runtime Version:
9.1.1390.9590

Environment:

  • Standalone
  • OS: Windows Server 2016
  • Version 9.1.1390.9590

If this is a regression, which version did it regress from?

Additional context
We tried restarting the VMs.
We also tried sending a partition health report: Send-ServiceFabricPartitionHealthReport -PartitionId 00000000-0000-0000-0000-000000000001 -SourceId "System.FMM" -HealthProperty "State" -HealthState Ok -TimeToLiveSec 30 -RemoveWhenExpired
We also tried a repair: Repair-ServiceFabricPartition -PartitionId 00000000-0000-0000-0000-000000000001

We found no way to reset the event status.


Assignees: /cc @microsoft/service-fabric-triage

@shessane shessane added the type-code-defect Something isn't working label Aug 20, 2024
@dribblor

dribblor commented Oct 1, 2024

Same here. It is present on our 10.0.1949.9590 single node cluster and is still there after updating it to 10.1.2338.9590.

image
