Rollback and Recovery
Introduction
This page documents the rollback and recovery mechanisms in AosCore — how the system detects failed deployments, reverts to known-good states, and recovers from crashes that interrupt an update in progress. These mechanisms ensure that Units in the field remain operational even when updates fail, and that no partial deployment leaves the system in an inconsistent state.
AosCore provides rollback at multiple levels:
- SOTA rollback: the cloud sends a new desired state with the previous service version
- FOTA rollback: CM sends a RevertUpdate command to the Update Manager before the update is committed
- Boot runtime rollback: A/B partition switching with health-check-based automatic rollback
- Rootfs runtime rollback: action-file-based state machine with health-check-driven revert
- CM crash recovery: persisted update state enables resumption after process restart
SOTA Rollback
SOTA (Software Over The Air) rollback is handled implicitly through the desired-state convergence model. There is no explicit "revert" command for software updates.
How It Works
- The Service Manager (SM) deploys a new service version via RunInstances
- If the new instance fails to start or crashes, the error is reported to CM
- CM reports the failure to AosCloud as part of the Unit status
- AosCloud sends a new desired state specifying the previous service version
- CM processes the new desired state and instructs SM to run the old version
- Previous service images remain in the local image store — no re-download is needed
Characteristics
| Aspect | Detail |
|---|---|
| Trigger | AosCloud sends a new desired state with the previous version |
| Mechanism | SM stops failed instances, starts previous version |
| Reboot required | No |
| Rollback window | Unlimited — previous images remain in the local store |
| Automatic | No — requires AosCloud to decide and send the rollback desired state |
| Scope | Individual service instances on specific Nodes |
Failure Detection
For SOTA, failure is detected when:
- A service instance fails to reach the running state within the WaitingActive timeout (10 minutes)
- A service instance crashes after starting
- The SM reports an instance in the eFailed state
The CM aggregates these statuses and reports them to AosCloud, which can then decide whether to roll back.
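The detection rules above can be sketched as a single predicate. This is a minimal illustration, not AosCore's implementation: InstanceState, InstanceStatus, and IsSotaFailure are hypothetical names, and only the eFailed state and the 10-minute WaitingActive timeout come from this page.

```cpp
#include <chrono>

// Hypothetical, simplified instance states; only eFailed is named on this page.
enum class InstanceState { eActivating, eActive, eFailed };

// Hypothetical status record for one instance as CM might see it.
struct InstanceStatus {
    InstanceState        state;
    std::chrono::seconds sinceLaunch; // time elapsed since the launch was requested
};

// WaitingActive timeout stated on this page (10 minutes).
constexpr std::chrono::minutes cWaitingActiveTimeout{10};

// Returns true if the instance should be reported to AosCloud as failed.
inline bool IsSotaFailure(const InstanceStatus& status)
{
    if (status.state == InstanceState::eFailed) {
        return true; // SM reported the instance as failed
    }

    // Not yet running although the WaitingActive timeout has elapsed.
    return status.state != InstanceState::eActive && status.sinceLaunch >= cWaitingActiveTimeout;
}
```

The predicate only flags a failure. Whether to roll back is still decided by AosCloud, as described above.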
FOTA Rollback
FOTA (Firmware Over The Air) rollback uses an explicit two-phase commit model through the Update Manager (UM) protocol.
The firmware update is not permanent until CM sends ApplyUpdate.
How It Works
- CM sends PrepareUpdate: the UM downloads and verifies firmware images
- CM sends StartUpdate: the UM applies the firmware (system may reboot)
- The UM reports UPDATED state: firmware is applied but not committed
- If verification succeeds, CM sends ApplyUpdate: the update becomes permanent
- If verification fails or the UM reports FAILED, CM sends RevertUpdate
- The UM restores the previous firmware version and returns to IDLE
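The commit-or-revert decision in the steps above can be sketched as follows. The enum and function names are hypothetical; only the state names (IDLE, UPDATED, FAILED) and the commands (ApplyUpdate, RevertUpdate) are taken from this page, and the real UM protocol may define more states.

```cpp
// UM states named on this page (the real protocol may define more).
enum class UmState { eIdle, eUpdated, eFailed };

// Commands CM can send in the commit phase; eNone means "nothing to send".
enum class Command { eNone, eApplyUpdate, eRevertUpdate };

// Hypothetical helper: picks the next CM command from the UM state and the
// outcome of post-update verification.
inline Command NextCommand(UmState state, bool verificationPassed)
{
    switch (state) {
    case UmState::eFailed:
        return Command::eRevertUpdate; // update failed: restore previous firmware
    case UmState::eUpdated:
        // Applied but not committed: commit only if verification passed.
        return verificationPassed ? Command::eApplyUpdate : Command::eRevertUpdate;
    case UmState::eIdle:
        return Command::eNone; // nothing applied, nothing to commit or revert
    }

    return Command::eNone;
}
```

Keeping the decision in one pure function makes the two-phase property easy to see: ApplyUpdate is reachable only from UPDATED with a passing verification.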
RevertUpdate Command
The RevertUpdate command is sent by CM when:
- The UM reports FAILED state after StartUpdate
- CM determines the update should be abandoned (e.g., a new desired state arrives)
- The system fails health checks after firmware is applied
The UM is responsible for implementing the actual revert operation — typically switching back to the previous boot partition or restoring the previous firmware image.
Rollback Window
| Phase | Rollback Possible |
|---|---|
| Before StartUpdate | Yes, simply do not proceed |
| After StartUpdate, before ApplyUpdate | Yes, RevertUpdate restores the previous version |
| After ApplyUpdate | No, the previous version is discarded |
Multi-UM Coordination
When multiple Update Managers are registered (e.g., one for rootfs and one for MCU firmware), CM coordinates rollback across all of them:
- If any UM reports FAILED during the update cycle, CM sends RevertUpdate to all UMs that have already applied their updates
- This ensures the system returns to a fully consistent state, with no partial firmware updates remaining
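A minimal sketch of this coordination rule is shown below. UmStatus and UmsToRevert are hypothetical names. The sketch also assumes that the failed UM itself receives RevertUpdate, which matches the FOTA steps on this page but is not spelled out for the multi-UM case.

```cpp
#include <algorithm>
#include <string>
#include <vector>

enum class UmState { eIdle, eUpdated, eFailed };

// Hypothetical status record for one registered Update Manager.
struct UmStatus {
    std::string id;
    UmState     state;
};

// Returns the IDs of the UMs that should receive RevertUpdate.
// If no UM failed, nothing is reverted. Otherwise every UM that is not idle
// (already updated, or failed: an assumption of this sketch) is reverted.
inline std::vector<std::string> UmsToRevert(const std::vector<UmStatus>& ums)
{
    const bool anyFailed = std::any_of(ums.begin(), ums.end(),
        [](const UmStatus& um) { return um.state == UmState::eFailed; });

    std::vector<std::string> result;

    if (!anyFailed) {
        return result;
    }

    for (const auto& um : ums) {
        if (um.state != UmState::eIdle) {
            result.push_back(um.id);
        }
    }

    return result;
}
```

For example, with a rootfs UM in UPDATED and an MCU UM in FAILED, both IDs are returned, so neither component is left on the new version.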
Boot Runtime Rollback
The boot runtime manages system-level components (such as the kernel or bootloader) using an A/B partition scheme. It provides automatic rollback based on health checks after reboot.
A/B Partition Architecture
The boot runtime maintains two boot partitions (cNumBootPartitions = 2). At any time, one partition is the main
(active) partition and the other holds the previous version:
| Concept | Description |
|---|---|
| Current partition | The partition the system actually booted from |
| Main partition | The partition marked as the default boot target |
| Installed data | Metadata about the currently committed version |
| Pending data | Metadata about an in-progress update |
Update Flow
- SM instructs the boot runtime to start a new instance (new firmware version)
- The runtime writes the new image to the next partition (alternate from current)
- The runtime sets the main boot to the new partition via SetMainBoot()
- The runtime requests a reboot via RebootRequired()
- After reboot, the system boots from the new partition
- The runtime calls SetBootOK() to confirm successful boot
- A health check runs against configured systemd services (mHealthCheckServices)
- If the health check passes, the update is committed: the new partition becomes the installed version
- If the health check fails, the pending update is marked as eFailed and the system reverts to the previous partition
Automatic Rollback
The boot runtime's rollback is automatic and requires no cloud intervention:
    Boot from new partition
      → SetBootOK()
      → Run health check (SystemdUpdateChecker)
      → Health check passes?
          YES → SetMainBoot(new partition), commit update
          NO  → Mark pending as FAILED, keep previous partition as main
If the health check fails, the boot controller does not update the main boot pointer. On the next reboot, the system boots from the previous (still-main) partition, effectively rolling back the firmware.
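The post-reboot decision described in this section can be sketched with an in-memory stand-in for the boot controller. The method names are taken from this page, but the signatures, the FakeBootController class, and FinishBootUpdate are assumptions for illustration; the real BootControllerItf most likely also reports errors.

```cpp
#include <cstddef>
#include <functional>

// Simplified in-memory stand-in for BootControllerItf (assumed signatures).
class FakeBootController {
public:
    FakeBootController(std::size_t current, std::size_t main)
        : mCurrent(current)
        , mMain(main)
    {
    }

    std::size_t GetCurrentBoot() const { return mCurrent; }
    std::size_t GetMainBoot() const { return mMain; }
    void        SetMainBoot(std::size_t index) { mMain = index; }
    void        SetBootOK() { mBootOK = true; }
    bool        IsBootOK() const { return mBootOK; }

private:
    std::size_t mCurrent;
    std::size_t mMain;
    bool        mBootOK = false;
};

enum class UpdateResult { eCommitted, eFailed };

// Confirms the boot, runs the health check, and moves the main boot pointer
// only when the check passes.
inline UpdateResult FinishBootUpdate(FakeBootController& ctrl, const std::function<bool()>& healthCheck)
{
    ctrl.SetBootOK();

    if (!healthCheck()) {
        // Main boot pointer is left untouched, so the next reboot rolls back.
        return UpdateResult::eFailed;
    }

    ctrl.SetMainBoot(ctrl.GetCurrentBoot());

    return UpdateResult::eCommitted;
}
```

Passing the health check as a callable keeps the sketch independent of how SystemdUpdateChecker is actually invoked.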
Partition Synchronization
After a successful update, the boot runtime synchronizes the inactive partition with the active one (SyncPartition()).
This ensures both partitions contain the same version, so the system has a known-good fallback at all times.
Boot Controller Interface
The boot controller abstraction (BootControllerItf) provides the hardware-specific operations:
| Method | Purpose |
|---|---|
| GetPartitionDevices() | Returns the list of boot partition device paths |
| GetCurrentBoot() | Returns the index of the partition the system booted from |
| GetMainBoot() | Returns the index of the default boot partition |
| SetMainBoot(index) | Sets which partition to boot from next |
| SetBootOK() | Confirms the current boot is successful (prevents watchdog rollback) |
The default implementation uses EFI boot variables (EFIBootController), but OEMs can provide custom implementations
for their hardware platform.
Rootfs Runtime Rollback
The rootfs runtime manages system-level components that are deployed as squashfs images (e.g., the root filesystem). It uses an action-file-based state machine to track update progress across reboots and perform health-check-driven rollback.
Action-Based State Machine
The rootfs runtime persists its state using action files in the working directory. Each action file represents the next operation to perform after a reboot:
| Action File | Meaning |
|---|---|
| do_update | An update image is staged and ready to be applied (contains update type: "full" or "incremental") |
| updated | The update has been applied; health check should run |
| do_apply | Health check passed; the update should be committed |
| failed | Health check failed; the update should be reverted |
Update Flow
- SM instructs the rootfs runtime to start a new instance (new rootfs version)
- The runtime copies the image to the working directory and creates a do_update action file
- The runtime requests a reboot
- An external update agent (e.g., a systemd service) detects do_update and applies the image
- After reboot, the runtime reads the action file:
  - If updated → run health check
  - If failed → revert to previous version
  - If there is no action file and a pending update exists → update was committed externally (do_apply was processed)
Health Check and Rollback
After reboot with the updated action:
- The runtime starts a health check thread (RunHealthCheck)
- The SystemdUpdateChecker verifies that configured systemd services are running correctly
- If the health check passes: the runtime writes a do_apply action file and requests another reboot to commit
- If the health check fails: the runtime writes a failed action file and requests a reboot to revert
On the next boot after a failed action:
- The runtime detects the failed action file
- The pending instance is reported with eFailed state
- The current (previous) instance is reported as eActive
eActive - Update artifacts are cleaned up
Image Types
The rootfs runtime supports two update image types, determined by the OCI layer media type:
| Type | Media Type Prefix | Description |
|---|---|---|
| Full | vnd.aos.image.component.full | Complete rootfs image replacement |
| Incremental | vnd.aos.image.component.inc | Delta update applied to current rootfs |
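Selecting the image type from the table above amounts to a prefix match. The sketch below assumes the media type string starts exactly with the prefixes listed; ImageType and ImageTypeFromMediaType are hypothetical names, and the real code may normalize the media type first.

```cpp
#include <string>

enum class ImageType { eFull, eIncremental, eUnknown };

// Returns true if `value` begins with `prefix`.
inline bool StartsWith(const std::string& value, const std::string& prefix)
{
    return value.compare(0, prefix.size(), prefix) == 0;
}

// Maps an OCI layer media type to the rootfs image type by prefix.
inline ImageType ImageTypeFromMediaType(const std::string& mediaType)
{
    if (StartsWith(mediaType, "vnd.aos.image.component.full")) {
        return ImageType::eFull;
    }

    if (StartsWith(mediaType, "vnd.aos.image.component.inc")) {
        return ImageType::eIncremental;
    }

    return ImageType::eUnknown; // not a rootfs component image
}
```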
CM Crash Recovery
The CM Update Manager persists its state on every transition, enabling recovery after a process crash or system restart.
Persisted State
The StorageItf interface provides two persistence operations:
| Method | What is Stored |
|---|---|
| StoreUpdateState() | Current position in the update state machine (Downloading, Pending, Installing, Launching, WaitingActive, Finalizing) |
| StoreDesiredStatus() | The complete desired status being processed (Nodes, Unit config, Deployable Items, instances, certificates) |
Recovery Process
When the CM starts (or restarts after a crash):
- The DesiredStatusHandler calls GetUpdateState() to check for a persisted state
- If the state is not eNone, an update was in progress when the process stopped
- The handler calls GetDesiredStatus() to retrieve the stored desired status
- The update resumes from the persisted state position
    CM Start
      → GetUpdateState() returns persisted state
      → State != eNone?
          YES → GetDesiredStatus(), resume update from persisted state
          NO  → Wait for new desired state from cloud
State-Specific Recovery Behavior
| Persisted State | Recovery Action |
|---|---|
| Downloading | Restart the download phase; Image Manager re-downloads any incomplete items |
| Pending | Proceed to Installing, since downloads were already complete |
| Installing | Re-apply configuration changes (Node states, Unit config) |
| Launching | Re-send RunInstances to SM and/or re-send commands to UMs |
| WaitingActive | Resume waiting for instances/components to reach target state |
| Finalizing | Re-commit installed items and send Unit status to cloud |
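The recovery rules in the table above reduce to a small mapping from the persisted state to the phase CM re-enters. The sketch below uses an in-memory stand-in for the update-state half of StorageItf; the class, its signatures, and ResumePhase are assumptions for illustration, while the state names come from this page.

```cpp
// Update states listed on this page, plus eNone for "no update in progress".
enum class UpdateState {
    eNone,
    eDownloading,
    ePending,
    eInstalling,
    eLaunching,
    eWaitingActive,
    eFinalizing
};

// In-memory stand-in for the update-state part of StorageItf (assumed shape).
class InMemoryStorage {
public:
    void        StoreUpdateState(UpdateState state) { mState = state; }
    UpdateState GetUpdateState() const { return mState; }

private:
    UpdateState mState = UpdateState::eNone;
};

// Returns the phase to (re-)enter on start-up. eNone means there is nothing
// to resume and CM waits for a new desired state from the cloud.
inline UpdateState ResumePhase(const InMemoryStorage& storage)
{
    const UpdateState persisted = storage.GetUpdateState();

    // Pending means downloads already completed, so recovery moves on to
    // Installing; every other state is re-entered as-is.
    return persisted == UpdateState::ePending ? UpdateState::eInstalling : persisted;
}
```

Re-entering a phase is safe only if its actions are idempotent, which is why the table describes each recovery action as a re-send or re-apply.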
Cancellation During Recovery
If a new desired state arrives while the CM is recovering a previous update:
- The CM compares the new desired state with the stored one
- If they differ, the current recovery is cancelled
- The new desired state is stored and a fresh update cycle begins from Downloading
- If they are the same, the recovery continues normally
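The comparison step can be sketched as follows. DesiredStatus here is a heavily reduced, hypothetical struct (the real desired status carries Nodes, Unit config, Deployable Items, instances, and certificates), and OnNewDesiredStatus is an illustrative name rather than an AosCore function.

```cpp
#include <string>
#include <vector>

// Hypothetical, reduced desired status used only for this sketch.
struct DesiredStatus {
    std::string              unitConfigVersion;
    std::vector<std::string> instances;

    bool operator==(const DesiredStatus& other) const
    {
        return unitConfigVersion == other.unitConfigVersion && instances == other.instances;
    }
};

enum class RecoveryDecision { eContinue, eRestartFromDownloading };

// Compares the incoming desired status with the stored one. If they differ,
// the stored copy is replaced and a fresh cycle starts from Downloading.
inline RecoveryDecision OnNewDesiredStatus(DesiredStatus& stored, const DesiredStatus& incoming)
{
    if (incoming == stored) {
        return RecoveryDecision::eContinue; // same target: keep recovering
    }

    stored = incoming; // stands in for StoreDesiredStatus() in this sketch

    return RecoveryDecision::eRestartFromDownloading;
}
```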
Failure Detection Summary
| Layer | Detection Mechanism | Response |
|---|---|---|
| SOTA | Instance fails to start or crashes; WaitingActive timeout (10 min) | Error reported to cloud; cloud decides rollback |
| FOTA | UM reports FAILED state with error details | CM sends RevertUpdate to affected UMs |
| Boot runtime | Systemd health check fails after reboot | Automatic — previous partition remains main |
| Rootfs runtime | Systemd health check fails after reboot | Automatic — failed action triggers revert on next boot |
| CM crash | Process restart with non-None persisted state | Automatic — resume from last persisted state |
Related Pages
- Deployment Flows — section overview with deployment architecture and update orchestration
- Update Flow Overview — end-to-end update sequence showing all phases including error handling
- SOTA vs FOTA — detailed comparison of software and firmware update mechanisms
- Update Handler State Machine — complete FOTA Update Manager state machine with RevertUpdate command details
- Update Manager (CM module) — internal architecture of the CM Update Manager including crash recovery
- Service Instance States — service state machine including failure states
- Update Failure and Rollback — error handling perspective on update failures