Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(nucleus): shadow manager stops syncing on upgrade from 2.9.6 to 2.11.1 #1510

Closed
crispinboylan-int opened this issue Aug 3, 2023 · 12 comments
Labels
bug Something isn't working needs-triage Needs eyeballs

Comments

@crispinboylan-int
Copy link

Describe the bug
When upgrading from nucleus 2.9.6 to 2.11.1 every device i've tried so far stopped syncing shadows with the cloud. A manual restart of greengrass is required to bring back the sync.

To Reproduce
Upgrade device from 2.9.6 to 2.11.1 (also requires updating shadow manager from 2.3.1 to 2.3.3)

Expected behavior

Syncing resumes.

Actual behavior
No syncing, or errors.

Environment

  • OS: Ubuntu 22.04
  • JDK version: OpenJDK Runtime Environment (build 17.0.8+7-Ubuntu-122.04)
  • Nucleus version: 2.11.1

Additional context

You can see in the logs that when the device is finished updating the connection gets interrupted, the syncing stops and never resumes.

The sync operations only involve the local shadow version and never mention 'cloud-version' subsequently.

Relevant log lines:

2023-08-03T10:11:38.797Z [INFO] (Thread-4) com.aws.greengrass.deployment.IotJobsHelper: Job status update was accepted. {Status=SUCCEEDED, ThingName=home-XXXXXXXXXXXX, JobId=60c75523-5e00-4c1c-ac1f-1c004bc17ba3}
2023-08-03T10:11:38.873Z [WARN] (Thread-8) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Connection interrupted. {clientId=home-XXXXXXXXXXXX#2, error=Mqtt5 client connection interrupted by user request.}
2023-08-03T10:11:38.874Z [INFO] (pool-2-thread-29) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=THING_GROUP_DEPLOYMENT, serviceName=FleetStatusService, currentState=RUNNING}
2023-08-03T10:11:38.875Z [INFO] (Thread-8) com.aws.greengrass.shadowmanager.sync.strategy.BaseSyncStrategy: sync. Stop syncing. {}
2023-08-03T10:11:38.876Z [INFO] (Thread-8) com.aws.greengrass.shadowmanager.sync.strategy.BaseSyncStrategy: sync. Cancel 1 sync thread(s). {}
2023-08-03T10:11:38.876Z [INFO] (Thread-8) com.aws.greengrass.shadowmanager.sync.strategy.BaseSyncStrategy: sync. Waiting for in-flight sync requests to finish. {}
2023-08-03T10:11:38.876Z [INFO] (pool-2-thread-29) com.aws.greengrass.deployment.DeploymentDirectoryManager: Persist link to last deployment. {link=/greengrass/v2/deployments/previous-success}
2023-08-03T10:11:38.877Z [INFO] (pool-2-thread-29) com.aws.greengrass.deployment.DeploymentDirectoryManager: Clean up link to earlier deployment. {link=/greengrass/v2/deployments/previous-success}
2023-08-03T10:11:38.998Z [INFO] (Thread-7) com.aws.greengrass.shadowmanager.ipc.UpdateThingShadowRequestHandler: Successfully updated shadow. {service-name=co.mgh.ShadowPortal, thing name=home-XXXXXXXXXXXX, shadow name=Lighting, local-version=62}
2023-08-03T10:11:39.017Z [INFO] (Thread-7) com.aws.greengrass.shadowmanager.ipc.UpdateThingShadowRequestHandler: Successfully updated shadow. {service-name=co.mgh.ShadowPortal, thing name=home-XXXXXXXXXXXX, shadow name=Shading, local-version=94}
2023-08-03T10:11:39.015Z [INFO] (pool-2-thread-43) com.aws.greengrass.shadowmanager.ShadowManagerDAOImpl: Updating sync info. {thing name=home-XXXXXXXXXXXX, shadow name=Lighting, cloud-version=1577, local-version=61}
2023-08-03T10:11:39.032Z [INFO] (Thread-8) com.aws.greengrass.shadowmanager.sync.strategy.BaseSyncStrategy: sync. Finished waiting for sync requests to finish. {}
2023-08-03T10:11:39.039Z [WARN] (pool-2-thread-43) com.aws.greengrass.shadowmanager.sync.strategy.BaseSyncStrategy: sync. Interrupted while waiting for sync requests. {}
2023-08-03T10:11:39.041Z [INFO] (pool-2-thread-43) com.aws.greengrass.shadowmanager.sync.strategy.BaseSyncStrategy: sync. Finished processing sync requests. {}
2023-08-03T10:11:39.070Z [INFO] (Thread-7) com.aws.greengrass.shadowmanager.ipc.UpdateThingShadowRequestHandler: Successfully updated shadow. {service-name=co.mgh.ShadowPortal, thing name=home-XXXXXXXXXXXX, shadow name=Energy, local-version=69551}
2023-08-03T10:11:39.083Z [INFO] (Thread-7) com.aws.greengrass.shadowmanager.ipc.UpdateThingShadowRequestHandler: Successfully updated shadow. {service-name=co.mgh.ShadowPortal, thing name=home-XXXXXXXXXXXX, shadow name=Shading, local-version=95}
2023-08-03T10:11:39.097Z [INFO] (Thread-7) com.aws.greengrass.shadowmanager.ipc.UpdateThingShadowRequestHandler: Successfully updated shadow. {service-name=co.mgh.ShadowPortal, thing name=home-XXXXXXXXXXXX, shadow name=Climate, local-version=293258}

@crispinboylan-int crispinboylan-int added bug Something isn't working needs-triage Needs eyeballs labels Aug 3, 2023
@crispinboylan-int
Copy link
Author

this is similar to the intermittent issue I saw on greengrass losing internet connectivity:

aws-greengrass/aws-greengrass-shadow-manager#164

@crispinboylan-int
Copy link
Author

with debug enabled in greengrass:

2023-08-03T09:09:27.761Z [DEBUG] (pool-2-thread-47) com.aws.greengrass.mqttclient.MqttClient: Getting new MQTT connection. {clientId=home-XXXXXXXXXXXX#2}
2023-08-03T09:09:27.761Z [DEBUG] (pool-2-thread-34) software.amazon.awssdk.http.apache.internal.conn.SdkTlsSocketFactory: socket.getSupportedProtocols(): [TLSv1.3, TLSv1.2, TLSv1.1, TLSv1, SSLv3, SSLv2Hello], socket.getEnabledProtocols(): [TLSv1.3, TLSv1.2].
{}
2023-08-03T09:09:27.765Z [DEBUG] (Thread-8) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Attempting to connect to AWS IoT Core. {clientId=home-XXXXXXXXXXXX#2}
2023-08-03T09:09:27.775Z [DEBUG] (pool-2-thread-34) software.amazon.awssdk.http.apache.internal.net.SdkSslSocket: created: logs.eu-west-2.amazonaws.com/18.133.45.16:443.
{}
2023-08-03T09:09:27.822Z [DEBUG] (pool-2-thread-34) software.amazon.awssdk.requestId: Received successful response: 200, Request ID: 4cc704f3-925c-4a15-bc80-f326e6b420f1, Extended Request ID: not available.
{}
2023-08-03T09:09:27.822Z [DEBUG] (pool-2-thread-34) software.amazon.awssdk.request: Received successful response: 200, Request ID: 4cc704f3-925c-4a15-bc80-f326e6b420f1, Extended Request ID: not available.
{}
2023-08-03T09:09:27.877Z [INFO] (Thread-8) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Successfully connected to AWS IoT Core. {clientId=home-XXXXXXXXXXXX#2, sessionPresent=false}
2023-08-03T09:09:27.877Z [DEBUG] (Thread-8) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Subscribing to topic. {clientId=home-XXXXXXXXXXXX#2, qos=AT_LEAST_ONCE, topic=$aws/things/home-XXXXXXXXXXXX/jobs/6b087888-6826-448c-8191-5ea0436a3796/namespace-aws-gg-deployment/update/rejected}
2023-08-03T09:09:27.913Z [DEBUG] (Thread-8) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Successfully subscribed to topic. {clientId=home-XXXXXXXXXXXX#2, qos=AT_LEAST_ONCE, topic=$aws/things/home-XXXXXXXXXXXX/jobs/6b087888-6826-448c-8191-5ea0436a3796/namespace-aws-gg-deployment/update/rejected}
2023-08-03T09:09:27.932Z [INFO] (Thread-4) com.aws.greengrass.deployment.IotJobsHelper: Job status update was accepted. {Status=SUCCEEDED, ThingName=home-XXXXXXXXXXXX, JobId=6b087888-6826-448c-8191-5ea0436a3796}
2023-08-03T09:09:27.946Z [DEBUG] (pool-2-thread-47) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Unsubscribing from topic. {clientId=home-XXXXXXXXXXXX, topic=$aws/things/home-XXXXXXXXXXXX/jobs/6b087888-6826-448c-8191-5ea0436a3796/namespace-aws-gg-deployment/update/accepted}
2023-08-03T09:09:27.970Z [DEBUG] (pool-2-thread-47) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Unsubscribing from topic. {clientId=home-XXXXXXXXXXXX#2, topic=$aws/things/home-XXXXXXXXXXXX/jobs/6b087888-6826-448c-8191-5ea0436a3796/namespace-aws-gg-deployment/update/rejected}
2023-08-03T09:09:27.985Z [DEBUG] (pool-2-thread-47) com.aws.greengrass.status.FleetStatusService: Updating Fleet Status service for deployment. {serviceName=FleetStatusService, currentState=RUNNING, deployment details={ConfigurationArn=arn:aws:greengrass:eu-west-2:163040377999:configuration:thinggroup/testHomesTopLevel:509, DeploymentId=6b087888-6826-448c-8191-5ea0436a3796, DeploymentRootPackages=[co.mgh.ConfigurationManager, co.mgh.DomesticRulesEngine, com.docker.postgres, aws.greengrass.LogManager, co.mgh.CoreMetricsForwarder, co.mgh.SubscriptionService, aws.greengrass.clientdevices.IPDetector, co.mgh.ShadowPortal, aws.greengrass.SecureTunneling, aws.greengrass.ShadowManager, com.docker.RabbitMQ-Management, co.mgh.OnPremApiService, aws.greengrass.Cloudwatch, aws.greengrass.Nucleus, aws.greengrass.telemetry.NucleusEmitter, co.mgh.DeviceController], DeploymentStatus=SUCCEEDED, DeploymentStatusDetails={detailed-deployment-status=SUCCESSFUL}, DeploymentType=IOT_JOBS, GreengrassDeploymentId=4e35af1e-8ddc-4f41-84e6-4f16af2f8564}}
2023-08-03T09:09:27.997Z [DEBUG] (pool-2-thread-47) com.aws.greengrass.util.MqttChunkedPayloadPublisher: MQTT publish succeeded. {topic=$aws/things/home-XXXXXXXXXXXX/greengrassv2/health/json}
2023-08-03T09:09:27.998Z [DEBUG] (pool-2-thread-14) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Disconnecting from AWS IoT Core. {clientId=home-XXXXXXXXXXXX#2}
2023-08-03T09:09:28.018Z [INFO] (pool-2-thread-47) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=THING_GROUP_DEPLOYMENT, serviceName=FleetStatusService, currentState=RUNNING}
2023-08-03T09:09:28.018Z [DEBUG] (pool-2-thread-47) com.aws.greengrass.lifecyclemanager.KernelAlternatives: Cleaning up previous kernel launch directory. {oldDir=/greengrass/v2/alts/bbc1f45d-568b-4af4-84b4-a2a43c20dde7}
2023-08-03T09:09:28.019Z [WARN] (Thread-8) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Connection interrupted. {clientId=home-XXXXXXXXXXXX#2, error=Mqtt5 client connection interrupted by user request.}
2023-08-03T09:09:28.019Z [INFO] (pool-2-thread-47) com.aws.greengrass.deployment.DeploymentDirectoryManager: Persist link to last deployment. {link=/greengrass/v2/deployments/previous-success}
2023-08-03T09:09:28.019Z [INFO] (pool-2-thread-47) com.aws.greengrass.deployment.DeploymentDirectoryManager: Clean up link to earlier deployment. {link=/greengrass/v2/deployments/previous-success}

@crispinboylan-int
Copy link
Author

its like shadowmanager is using connection #2 and something pulls that out from under it, but shadowmanager never bothers to reconnect?

@MikeDombo
Copy link
Member

MikeDombo commented Aug 3, 2023

Hello,

The logs you provided show logs from Nucleus version 2.11, which means that Greengrass has restarted at least once (upgrading from 2.9.6 to 2.11.1 requires a Nucleus restart). The logs do not show enough information to understand what the issue is.

Please provide full logs, debug level would be great as well, showing the full time period from deployment start to shadow stopping, and then some time afterward to show that shadow does not start up again.

The first log you showed does say "BaseSyncStrategy: sync. Finished processing sync requests.". Are you making further changes to the shadow either locally or in the cloud? How do you know that the shadow is not syncing?

@crispinboylan-int
Copy link
Author

crispinboylan-int commented Aug 3, 2023

ok have attached the log from before. the deploy of 2.11 starts at 09:04:12 and finishes at 09:09. 1510.csv.gz

as you should see, the logs show we're updating regularly

"2023-08-03T08:55:00.155Z [DEBUG] (pool-2-thread-27884) com.aws.greengrass.shadowmanager.sync.model.CloudUpdateSyncRequest: Updating cloud shadow document. {thing name=home-XXXXX, shadow name=AirQuality, cloud-version=1310876, local-version=4059826}"

the last CloudUpdateSyncRequest happens at 09:04:

"2023-08-03T09:04:00.049Z [DEBUG] (pool-2-thread-27884) com.aws.greengrass.shadowmanager.sync.model.CloudUpdateSyncRequest: Updating cloud shadow document. {thing name=home-XXXXX, shadow name=AirQuality, cloud-version=1310893, local-version=4059844}"

the update job completes at:

"2023-08-03T09:09:27.728Z [INFO] (pool-2-thread-47) com.aws.greengrass.deployment.IotJobsHelper: Updating status of persisted deployment. {Status=SUCCEEDED, StatusDetails={detailed-deployment-status=SUCCESSFUL}, ThingName=home-XXXXX, JobId=6b087888-6826-448c-8191-5ea0436a3796}"

we see the sync stop:

"2023-08-03T09:09:28.019Z [WARN] (Thread-8) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Connection interrupted. {clientId=home-XXXXX#2, error=Mqtt5 client connection interrupted by user request.}"
2023-08-03T09:09:28.019Z [INFO] (pool-2-thread-47) com.aws.greengrass.deployment.DeploymentDirectoryManager: Persist link to last deployment. {link=/greengrass/v2/deployments/previous-success}
2023-08-03T09:09:28.019Z [INFO] (pool-2-thread-47) com.aws.greengrass.deployment.DeploymentDirectoryManager: Clean up link to earlier deployment. {link=/greengrass/v2/deployments/previous-success}
2023-08-03T09:09:28.061Z [INFO] (Thread-8) com.aws.greengrass.shadowmanager.sync.strategy.BaseSyncStrategy: sync. Stop syncing. {}
2023-08-03T09:09:28.061Z [INFO] (Thread-8) com.aws.greengrass.shadowmanager.sync.strategy.BaseSyncStrategy: sync. Cancel 1 sync thread(s). {}

there after greengrass/shadowmanager is only updating the local shadow and nothing is being synced.

"2023-08-03T09:10:34.456Z [INFO] (Thread-7) com.aws.greengrass.shadowmanager.ipc.UpdateThingShadowRequestHandler: Successfully updated shadow. {service-name=co.mgh.ShadowPortal, thing name=home-XXXXX, shadow name=AirQuality, local-version=4059849}"

@saranyailla
Copy link
Member

From the logs you shared, this is how the IoT Core connectivity is working for the client with id: clientId=home-XXXXX#2

2023-08-03T09:04:25.306Z xxxx Connecting to AWS IoT Core. {clientId=home-XXXXX#2}
2023-08-03T09:04:25.424Z xxxx Successfully disconnected from AWS IoT Core. {clientId=home-XXXXX#2}

2023-08-03T09:04:36.275Z xxxx Connecting to AWS IoT Core. {clientId=home-XXXXX#2} 
 ....
"2023-08-03T09:04:36.364Z xxxx Updating cloud shadow document. {thing name=home-XXXXX, shadow name=AirQuality}"
....
2023-08-03T09:04:36.414Z xxxx Successfully connected to AWS IoT Core {clientId=home-XXXXX#2, sessionPresent=false}

2023-08-03T09:05:24.244Z xxxx Disconnecting from AWS IoT Core. {clientId=home-XXXXX#2}
2023-08-03T09:09:27.877Z xxxx Successfully connected to AWS IoT Core {clientId=home-XXXXX#2, sessionPresent=false}
2023-08-03T09:09:27.998Z xxxx Disconnecting from AWS IoT Core. {clientId=home-XXXXX#2}

The client home-XXXXX#2 never reconnected after it was disconnected at 2023-08-03T09:09:27.998Z. So this behavior of shadow manager is expected where cloud shadow syncs don't work when the device is offline.

@crispinboylan-int
Copy link
Author

But the device was never offline. Manually restarting the greengrass service resulted in shadows syncing with cloud again. I've tried this with 3 devices, all devices had internet connectivity the whole time.

@saranyailla
Copy link
Member

Yes, we're still investigating the root cause for why this is happening. That note was just for the context. Thank you for your patience.

@MikeDombo
Copy link
Member

Thank you for your report. It appears that this is a bug in the Nucleus MQTT 5 client (which was introduced in 2.10) which is happening when you have more than 50 MQTT subscriptions (which spawns a second client) and then drop below 50 subscriptions (which closes the now unnecessary second client).

We have a fix for this in #1511.

The workarounds would be to remain on Nucleus version before 2.10 or configure syncing additional shadows such that there are consistently more than 50 subscriptions. You could also configure Nucleus 2.11 to use MQTT 3 client which would also avoid the problem.

We're working on fully validating the fix and then releasing it as soon as we're able.

@crispinboylan-int
Copy link
Author

thanks guys!

@parrottsquawk
Copy link

We are releasing the fix over the next 24 hours. Stay tuned!

@MikeDombo
Copy link
Member

Resolved with Nucleus 2.11.2 please let us know if that's working for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs-triage Needs eyeballs
Projects
None yet
Development

No branches or pull requests

4 participants