Description
Our team has been running Greengrass within a Snap that is based on the one worked on by Canonical/AWS earlier this year.
We are running into an issue that is unique to Greengrass running inside of a Snap. It manifests as Greengrass boot-looping during a deployment that updates multiple components requiring a Nucleus restart, which prevents us from upgrading those components. We are able to reproduce this with versions 2.5.4 and 2.5.6.
Given:
The Greengrass root directory has a symlinked ancestor directory, e.g. /var/snap/device-agent/current/greengrass/v2, where current is a symlink.
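To check whether a given installation matches this precondition, something like the following minimal sketch may help. It is not part of Greengrass; the helper name `symlinked_ancestors` is made up for illustration, and the path in the usage section is simply the example path from this report.

```python
from pathlib import Path


def symlinked_ancestors(path):
    """Return each component of `path` (the path itself and every parent)
    that is a symlink, ordered from the deepest component upwards.

    Hypothetical helper for illustration only; not a Greengrass API.
    """
    p = Path(path)
    # Path.parents walks from the immediate parent up to the filesystem root.
    candidates = [p, *p.parents]
    # is_symlink() returns False for components that do not exist.
    return [str(c) for c in candidates if c.is_symlink()]


if __name__ == "__main__":
    # Example path taken from this report; adjust for your own device.
    root = "/var/snap/device-agent/current/greengrass/v2"
    print(symlinked_ancestors(root))
```

A non-empty result means the root directory sits under at least one symlink, which is the setup in which we see the boot loop.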
Symptoms
Nucleus restarts in order to finish the bootstrap step, and then keeps restarting in a boot loop. Logs are below:
2022-06-07T17:12:42.158Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: system-shutdown. {main=null}
2022-06-07T17:12:42.189Z [INFO] (main) com.aws.greengrass.lifecyclemanager.Kernel: effective-config-dump-complete. {file=/var/snap/device-agent/current/greengrass/v2/config/effectiveConfig.yaml}
2022-06-07T17:12:42.192Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: Waiting for services to shutdown. {}
2022-06-07T17:12:42.194Z [INFO] (Serialized listener processor) com.aws.greengrass.lifecyclemanager.KernelLifecycle: executor-service-shutdown-initiated. {}
2022-06-07T17:12:42.194Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: Waiting for executors to shutdown. {}
2022-06-07T17:12:42.194Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: executor-service-shutdown-complete. {executor-terminated=true, scheduled-executor-terminated=true}
2022-06-07T17:12:42.194Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: context-shutdown-initiated. {}
2022-06-07T17:12:42.195Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: context-shutdown-complete. {}
2022-06-07T17:12:42.884Z [INFO] (main) com.aws.greengrass.util.platforms.Platform: Getting platform instance com.aws.greengrass.util.platforms.unix.linux.LinuxPlatform.. {}
2022-06-07T17:12:44.957Z [INFO] (main) com.aws.greengrass.util.platforms.Platform: Getting platform instance com.aws.greengrass.util.platforms.unix.linux.LinuxPlatform.. {}
2022-06-07T17:12:45.152Z [INFO] (main) com.aws.greengrass.config.Configuration: config-loading. Read configuration from a file path. {path=/var/snap/device-agent/current/greengrass/v2/deployments/535acb5e-0f0f-48b8-8f3b-aeed67a0a99f/target_config.tlog}
2022-06-07T17:12:45.415Z [INFO] (main) com.aws.greengrass.lifecyclemanager.Kernel: effective-config-dump-complete. {file=/var/snap/device-agent/current/greengrass/v2/config/effectiveConfig.yaml}
2022-06-07T17:12:45.566Z [INFO] (main) com.aws.greengrass.lifecyclemanager.Kernel: Resume deployment. {deploymentStage=BOOTSTRAP}
2022-06-07T17:12:45.567Z [INFO] (main) com.aws.greengrass.lifecyclemanager.Kernel: Completed all bootstrap tasks. Continue to activate deployment changes. {}
2022-06-07T17:12:45.567Z [INFO] (main) com.aws.greengrass.lifecyclemanager.Kernel: Nucleus restart requested to complete bootstrap task. {}
2022-06-07T17:12:45.567Z [INFO] (main) com.aws.greengrass.lifecyclemanager.KernelLifecycle: system-shutdown. {main=null}
2022-06-07T17:12:45.604Z [INFO] (main) com.aws.greengrass.lifecyclemanager.Kernel: effective-config-dump-complete. {file=/var/snap/device-agent/current/greengrass/v2/config/effectiveConfig.yaml}
This occurs when Greengrass receives a deployment in which multiple components run a Bootstrap step, and any component other than the last one to be bootstrapped returns exit code 100 (which signals to Nucleus that it must restart). For example, this can happen when we upgrade two components whose bootstrap steps both require a restart: Nucleus and LogManager.
Greengrass Nucleus provides a mechanism for the Components it runs to request that the Nucleus itself restarts as part of the Component installation process during a Deployment. When a Component requests this restart and other Component Bootstrap scripts (run on initial installation and version changes) are still pending, the Nucleus will restart and then resume executing the remaining Bootstrap scripts. We found that on Ubuntu Core, or any installation where the Greengrass root directory is a child of a symlinked directory, the Nucleus fails to read the remaining Bootstrap tasks to complete and perpetually restarts itself.
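To make the expected behaviour concrete, here is a toy model of the restart-and-resume flow described above. It is only a sketch of our understanding, not Nucleus's actual implementation: the state file name `bootstrap_tasks.json`, the function names, and the `max_restarts` guard are all assumptions made for illustration. The only detail taken from this report is that exit code 100 requests a Nucleus restart.

```python
import json
import tempfile
from pathlib import Path

RESTART_REQUESTED = 100  # exit code that asks Nucleus to restart (see above)


def run_once(state_file: Path) -> bool:
    """Simulate one Nucleus lifetime: load the pending bootstrap tasks,
    run them in order, and stop at the first one that requests a restart.

    Returns True if a restart was requested. Hypothetical model only.
    """
    pending = json.loads(state_file.read_text())
    while pending:
        _name, exit_code = pending.pop(0)
        if exit_code == RESTART_REQUESTED:
            # Persist what is left so the next lifetime can resume from it.
            state_file.write_text(json.dumps(pending))
            return True
    state_file.write_text("[]")
    return False


def deploy(tasks, max_restarts=10) -> int:
    """Run a deployment's bootstrap tasks and return the restart count.

    `tasks` is a list of [component_name, exit_code] pairs.
    """
    with tempfile.TemporaryDirectory() as tmp:
        state_file = Path(tmp) / "bootstrap_tasks.json"  # assumed name
        state_file.write_text(json.dumps(tasks))
        restarts = 0
        while run_once(state_file):
            restarts += 1
            if restarts > max_restarts:
                raise RuntimeError("restart limit exceeded: boot loop")
        return restarts
```

In this model, `deploy([["Nucleus", 100], ["LogManager", 100]])` finishes after two restarts, because each restart resumes from the persisted remainder of the list. What we observe on the device is different: the restarts never stop, which is consistent with the remaining tasks not being read back correctly after a restart.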
Potential fixes/workarounds
- Move the Greengrass root directory into common, or some other place that is not a child of a linked directory. This is not ideal, since we would need to migrate every device's Greengrass root directory to another location.
- A fix in a new version of Nucleus so that it processes bootstrap tasks correctly when it is installed within a symlinked directory.
We've reproduced this issue in IoT Greengrass Core software version v2.5.4 and v2.5.6.