Version: v1.1

Rollback and Recovery

Introduction

This page documents the rollback and recovery mechanisms in AosCore — how the system detects failed deployments, reverts to known-good states, and recovers from crashes that interrupt an update in progress. These mechanisms ensure that Units in the field remain operational even when updates fail, and that no partial deployment leaves the system in an inconsistent state.

AosCore provides rollback at multiple levels:

  • SOTA rollback — the cloud sends a new desired state with the previous service version
  • FOTA rollback — CM sends a RevertUpdate command to the Update Manager before the update is committed
  • Boot runtime rollback — A/B partition switching with health-check-based automatic rollback
  • Rootfs runtime rollback — action-file-based state machine with health-check-driven revert
  • CM crash recovery — persisted update state enables resumption after process restart

SOTA Rollback

SOTA (Software Over The Air) rollback is handled implicitly through the desired-state convergence model. There is no explicit "revert" command for software updates.

How It Works

  1. The Service Manager (SM) deploys a new service version via RunInstances
  2. If the new instance fails to start or crashes, the error is reported to CM
  3. CM reports the failure to AosCloud as part of the Unit status
  4. AosCloud sends a new desired state specifying the previous service version
  5. CM processes the new desired state and instructs SM to run the old version
  6. Previous service images remain in the local image store — no re-download is needed
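
The convergence steps above can be sketched as a small reconciliation function. This is a minimal Python model, not AosCore code: `plan_actions`, the dictionary shapes, and the action tuples are all assumptions made for illustration.

```python
def plan_actions(desired, running, image_store):
    """Return the actions needed to move `running` toward `desired`.

    desired / running: dict mapping service ID -> version string.
    image_store: set of (service ID, version) pairs already stored locally.
    All names and shapes here are hypothetical.
    """
    actions = []
    for service, version in sorted(desired.items()):
        if running.get(service) == version:
            continue  # already converged, nothing to do
        if (service, version) not in image_store:
            actions.append(("download", service, version))
        actions.append(("run", service, version))
    for service in sorted(set(running) - set(desired)):
        actions.append(("stop", service, running[service]))
    return actions


# Version 2.0.0 failed, so the cloud sends a desired state naming 1.0.0 again.
store = {("web", "1.0.0"), ("web", "2.0.0")}
rollback = plan_actions({"web": "1.0.0"}, {"web": "2.0.0"}, store)
upgrade = plan_actions({"web": "3.0.0"}, {"web": "2.0.0"}, store)
```

In this model a rollback is just another desired state. Because the 1.0.0 image is still in the local store, the rollback plan contains a run action only, while the upgrade to an unseen version also needs a download.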

Characteristics

| Aspect | Detail |
|---|---|
| Trigger | AosCloud sends a new desired state with the previous version |
| Mechanism | SM stops failed instances, starts previous version |
| Reboot required | No |
| Rollback window | Unlimited; previous images remain in the local store |
| Automatic | No; requires AosCloud to decide and send the rollback desired state |
| Scope | Individual service instances on specific Nodes |

Failure Detection

For SOTA, failure is detected when:

  • A service instance fails to reach the running state within the WaitingActive timeout (10 minutes)
  • A service instance crashes after starting
  • The SM reports an instance with eFailed state

The CM aggregates these statuses and reports them to AosCloud, which can then decide whether to roll back.
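
The detection rules above might be expressed as a single predicate. The sketch below is a simplified model under stated assumptions: the `InstanceStatus` type and its state labels are hypothetical, and only the 10-minute timeout comes from the text.

```python
from dataclasses import dataclass

WAITING_ACTIVE_TIMEOUT_S = 10 * 60  # the 10-minute WaitingActive timeout


@dataclass
class InstanceStatus:
    state: str           # hypothetical labels: "activating", "active", "failed"
    seconds_waited: int  # time since the instance was asked to start


def is_failed(status: InstanceStatus) -> bool:
    """True if the instance should be reported to the cloud as failed."""
    if status.state == "failed":
        # SM reported eFailed (start failure or crash after starting).
        return True
    if status.state != "active":
        # Not running yet: failed only once the timeout has elapsed.
        return status.seconds_waited >= WAITING_ACTIVE_TIMEOUT_S
    return False
```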

FOTA Rollback

FOTA (Firmware Over The Air) rollback uses an explicit two-phase commit model through the Update Manager (UM) protocol. The firmware update is not permanent until CM sends ApplyUpdate.

How It Works

  1. CM sends PrepareUpdate — the UM downloads and verifies firmware images
  2. CM sends StartUpdate — the UM applies the firmware (system may reboot)
  3. The UM reports UPDATED state — firmware is applied but not committed
  4. If verification succeeds, CM sends ApplyUpdate and the update becomes permanent
  5. If verification fails or the UM reports FAILED, CM sends RevertUpdate
  6. The UM restores the previous firmware version and returns to IDLE
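
The two-phase commit above can be modelled as a small state machine. This is an illustrative sketch, not the UM protocol itself: IDLE, UPDATED and FAILED come from the text, while the PREPARED state, the `StartUpdateFailed` event, and the `UpdateManagerModel` class are assumptions.

```python
# Allowed transitions: (current state, command or event) -> next state.
TRANSITIONS = {
    ("IDLE", "PrepareUpdate"): "PREPARED",
    ("PREPARED", "StartUpdate"): "UPDATED",
    ("PREPARED", "StartUpdateFailed"): "FAILED",  # hypothetical failure event
    ("PREPARED", "RevertUpdate"): "IDLE",
    ("UPDATED", "ApplyUpdate"): "IDLE",    # commit: the update becomes permanent
    ("UPDATED", "RevertUpdate"): "IDLE",   # roll back to the previous firmware
    ("FAILED", "RevertUpdate"): "IDLE",
}


class UpdateManagerModel:
    def __init__(self, committed="1.0"):
        self.state = "IDLE"
        self.committed = committed  # firmware version currently committed
        self.staged = None          # version prepared or applied, not committed

    def handle(self, command, version=None):
        key = (self.state, command)
        if key not in TRANSITIONS:
            raise ValueError(f"{command} not allowed in state {self.state}")
        if command == "PrepareUpdate":
            self.staged = version
        elif command == "ApplyUpdate":
            self.committed, self.staged = self.staged, None
        elif command == "RevertUpdate":
            self.staged = None  # the committed version is left untouched
        self.state = TRANSITIONS[key]
        return self.state


um = UpdateManagerModel()
um.handle("PrepareUpdate", "2.0")
um.handle("StartUpdate")
um.handle("RevertUpdate")  # back to IDLE, still on version 1.0
```

The model makes the rollback window explicit: once ApplyUpdate has returned the UM to IDLE, RevertUpdate is no longer an allowed command.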

RevertUpdate Command

The RevertUpdate command is sent by CM when:

  • The UM reports FAILED state after StartUpdate
  • CM determines the update should be abandoned (e.g., a new desired state arrives)
  • The system fails health checks after firmware is applied

The UM is responsible for implementing the actual revert operation — typically switching back to the previous boot partition or restoring the previous firmware image.

Rollback Window

| Phase | Rollback Possible |
|---|---|
| Before StartUpdate | Yes; simply do not proceed |
| After StartUpdate, before ApplyUpdate | Yes; RevertUpdate restores previous version |
| After ApplyUpdate | No; previous version is discarded |

Multi-UM Coordination

When multiple Update Managers are registered (e.g., one for rootfs and one for MCU firmware), CM coordinates rollback across all of them:

  1. If any UM reports FAILED during the update cycle, CM sends RevertUpdate to all UMs that have already applied their updates
  2. This ensures the system returns to a fully consistent state — no partial firmware updates remain
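
A minimal sketch of that coordination rule follows. The function name, the UM identifiers, and the decision to include the failed UM itself in the revert list are assumptions; the last of these is based on the RevertUpdate conditions listed earlier on this page.

```python
def ums_to_revert(um_states):
    """Given {UM id: reported state}, return the UM ids that need RevertUpdate.

    If no UM reported FAILED, nothing is reverted. Otherwise every UM that
    already applied its update (state "UPDATED") is reverted together with
    the failed ones, so that no partial firmware update remains.
    """
    if "FAILED" not in um_states.values():
        return []
    return sorted(um for um, state in um_states.items()
                  if state in ("UPDATED", "FAILED"))


# Hypothetical unit with three UMs: one applied, one failed, one untouched.
to_revert = ums_to_revert({"rootfs": "UPDATED", "mcu": "FAILED", "modem": "IDLE"})
```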

Boot Runtime Rollback

The boot runtime manages system-level components (such as the kernel or bootloader) using an A/B partition scheme. It provides automatic rollback based on health checks after reboot.

A/B Partition Architecture

The boot runtime maintains two boot partitions (cNumBootPartitions = 2). At any time, one partition is the main (active) partition and the other holds the previous version:

| Concept | Description |
|---|---|
| Current partition | The partition the system actually booted from |
| Main partition | The partition marked as the default boot target |
| Installed data | Metadata about the currently committed version |
| Pending data | Metadata about an in-progress update |

Update Flow

  1. SM instructs the boot runtime to start a new instance (new firmware version)
  2. The runtime writes the new image to the next partition (alternate from current)
  3. The runtime sets the main boot to the new partition via SetMainBoot()
  4. The runtime requests a reboot via RebootRequired()
  5. After reboot, the system boots from the new partition
  6. The runtime calls SetBootOK() to confirm successful boot
  7. A health check runs against configured systemd services (mHealthCheckServices)
  8. If the health check passes, the update is committed — the new partition becomes the installed version
  9. If the health check fails, the pending update is marked as eFailed and the system reverts to the previous partition

Automatic Rollback

The boot runtime's rollback is automatic and requires no cloud intervention:

    Boot from new partition
      → SetBootOK()
      → Run health check (SystemdUpdateChecker)
      → Health check passes?
          YES → SetMainBoot(new partition), commit update
          NO  → Mark pending as FAILED, keep previous partition as main
If the health check fails, the boot controller does not update the main boot pointer. On the next reboot, the system boots from the previous (still-main) partition, effectively rolling back the firmware.
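
The post-reboot decision can be sketched with an in-memory stand-in for the boot controller. This is a simplified Python model, not the C++ BootControllerItf: the class, the snake_case method names, and the return values are assumptions. It covers only the decision after reboot and assumes, as the flow above describes, that the main boot pointer still names the previous partition when the health check runs.

```python
class InMemoryBootController:
    """Illustrative stand-in for a boot controller implementation."""

    def __init__(self, current_boot, main_boot):
        self.current_boot = current_boot  # partition the system booted from
        self.main_boot = main_boot        # default boot target
        self.boot_ok = False

    def get_current_boot(self):
        return self.current_boot

    def get_main_boot(self):
        return self.main_boot

    def set_main_boot(self, index):
        self.main_boot = index

    def set_boot_ok(self):
        self.boot_ok = True


def finish_update_after_reboot(controller, health_check):
    """Run the post-reboot steps; return "committed" or "failed"."""
    controller.set_boot_ok()
    if health_check():
        # Healthy: make the partition we booted from the default target.
        controller.set_main_boot(controller.get_current_boot())
        return "committed"
    # Unhealthy: leave the main boot pointer alone, so the next reboot
    # starts from the previous partition.
    return "failed"


# Booted from partition 1 on trial while partition 0 is still main.
controller = InMemoryBootController(current_boot=1, main_boot=0)
result = finish_update_after_reboot(controller, health_check=lambda: False)
```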

Partition Synchronization

After a successful update, the boot runtime synchronizes the inactive partition with the active one (SyncPartition()). This ensures both partitions contain the same version, so the system has a known-good fallback at all times.

Boot Controller Interface

The boot controller abstraction (BootControllerItf) provides the hardware-specific operations:

| Method | Purpose |
|---|---|
| GetPartitionDevices() | Returns the list of boot partition device paths |
| GetCurrentBoot() | Returns the index of the partition the system booted from |
| GetMainBoot() | Returns the index of the default boot partition |
| SetMainBoot(index) | Sets which partition to boot from next |
| SetBootOK() | Confirms the current boot is successful (prevents watchdog rollback) |

The default implementation uses EFI boot variables (EFIBootController), but OEMs can provide custom implementations for their hardware platform.

Rootfs Runtime Rollback

The rootfs runtime manages system-level components that are deployed as squashfs images (e.g., the root filesystem). It uses an action-file-based state machine to track update progress across reboots and perform health-check-driven rollback.

Action-Based State Machine

The rootfs runtime persists its state using action files in the working directory. Each action file represents the next operation to perform after a reboot:

| Action File | Meaning |
|---|---|
| do_update | An update image is staged and ready to be applied (contains update type: "full" or "incremental") |
| updated | The update has been applied; health check should run |
| do_apply | Health check passed; the update should be committed |
| failed | Health check failed; the update should be reverted |

Update Flow

  1. SM instructs the rootfs runtime to start a new instance (new rootfs version)
  2. The runtime copies the image to the working directory and creates a do_update action file
  3. The runtime requests a reboot
  4. An external update agent (e.g., a systemd service) detects do_update and applies the image
  5. After reboot, the runtime reads the action file:
    • If updated → run health check
    • If failed → revert to previous version
    • If no action file is present but a pending update exists → the update was committed externally (do_apply was processed)

Health Check and Rollback

After reboot with the updated action:

  1. The runtime starts a health check thread (RunHealthCheck)
  2. The SystemdUpdateChecker verifies that configured systemd services are running correctly
  3. If the health check passes: the runtime writes a do_apply action file and requests another reboot to commit
  4. If the health check fails: the runtime writes a failed action file and requests a reboot to revert

On the next boot after a failed action:

  1. The runtime detects the failed action file
  2. The pending instance is reported with eFailed state
  3. The current (previous) instance is reported as eActive
  4. Update artifacts are cleaned up
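
The health-check and revert steps above can be sketched with real files in a temporary directory. This is a simplified model under stated assumptions: the helper names and return values are hypothetical, the action files are created empty (the real do_update file carries the update type), and the work done by the external update agent is not modelled.

```python
import tempfile
from pathlib import Path

ACTIONS = ("do_update", "updated", "do_apply", "failed")


def read_action(work_dir):
    """Return the name of the action file present in work_dir, or None."""
    for name in ACTIONS:
        if (Path(work_dir) / name).exists():
            return name
    return None


def write_action(work_dir, name):
    """Replace any existing action file with an empty file called `name`."""
    for old in ACTIONS:
        (Path(work_dir) / old).unlink(missing_ok=True)
    (Path(work_dir) / name).touch()


def on_boot(work_dir, health_check):
    """Decide what the runtime does after a reboot (simplified)."""
    action = read_action(work_dir)
    if action == "updated":
        passed = health_check()
        write_action(work_dir, "do_apply" if passed else "failed")
        return "reboot_to_commit" if passed else "reboot_to_revert"
    if action == "failed":
        # Report pending as eFailed and current as eActive, then clean up.
        (Path(work_dir) / "failed").unlink()
        return "reverted"
    return "nothing_to_do"


with tempfile.TemporaryDirectory() as work_dir:
    write_action(work_dir, "updated")
    first_boot = on_boot(work_dir, health_check=lambda: False)
    second_boot = on_boot(work_dir, health_check=lambda: True)
```

Because the state lives in a file rather than in memory, the second call sees the `failed` action written by the first, which is what lets the decision survive a reboot.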

Image Types

The rootfs runtime supports two update image types, determined by the OCI layer media type:

| Type | Media Type Prefix | Description |
|---|---|---|
| Full | vnd.aos.image.component.full | Complete rootfs image replacement |
| Incremental | vnd.aos.image.component.inc | Delta update applied to current rootfs |
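
Selecting the update type from the media type could look like the sketch below. The two prefixes come from the table above; the function name and the `.v1+squashfs` suffix used in the example are made up for illustration, and real media type strings may differ.

```python
FULL_PREFIX = "vnd.aos.image.component.full"
INCREMENTAL_PREFIX = "vnd.aos.image.component.inc"


def update_type(media_type):
    """Map an OCI layer media type to "full" or "incremental"."""
    if media_type.startswith(FULL_PREFIX):
        return "full"
    if media_type.startswith(INCREMENTAL_PREFIX):
        return "incremental"
    raise ValueError(f"unsupported media type: {media_type}")


# Hypothetical media type string; only the prefix is taken from the table.
kind = update_type("vnd.aos.image.component.inc.v1+squashfs")
```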

CM Crash Recovery

The CM Update Manager persists its state on every transition, enabling recovery after a process crash or system restart.

Persisted State

The StorageItf interface provides two persistence operations:

| Method | What is Stored |
|---|---|
| StoreUpdateState() | Current position in the update state machine (Downloading, Pending, Installing, Launching, WaitingActive, Finalizing) |
| StoreDesiredStatus() | The complete desired status being processed (Nodes, Unit config, Deployable Items, instances, certificates) |

Recovery Process

When the CM starts (or restarts after a crash):

  1. The DesiredStatusHandler calls GetUpdateState() to check for a persisted state
  2. If the state is not eNone, an update was in progress when the process stopped
  3. The handler calls GetDesiredStatus() to retrieve the stored desired status
  4. The update resumes from the persisted state position

    CM Start
      → GetUpdateState() returns persisted state
      → State != eNone?
          YES → GetDesiredStatus(), resume update from persisted state
          NO  → Wait for new desired state from cloud
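
The startup check can be sketched as follows. `InMemoryStorage` is a hypothetical stand-in for StorageItf that keeps values in memory (a real implementation would persist them), and the snake_case method names and state strings are assumptions.

```python
class InMemoryStorage:
    """Illustrative stand-in for the persistence interface."""

    def __init__(self):
        self._state = "eNone"
        self._desired = None

    def store_update_state(self, state):
        self._state = state

    def get_update_state(self):
        return self._state

    def store_desired_status(self, desired):
        self._desired = desired

    def get_desired_status(self):
        return self._desired


def on_cm_start(storage):
    """Return (state to resume from, desired status), or None if idle."""
    state = storage.get_update_state()
    if state == "eNone":
        return None  # nothing in flight: wait for a desired state from the cloud
    return state, storage.get_desired_status()


storage = InMemoryStorage()
idle = on_cm_start(storage)

# Simulate a crash in the middle of the Installing phase.
storage.store_desired_status({"services": {"web": "2.0.0"}})
storage.store_update_state("Installing")
resumed = on_cm_start(storage)
```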

State-Specific Recovery Behavior

| Persisted State | Recovery Action |
|---|---|
| Downloading | Restart the download phase; Image Manager re-downloads any incomplete items |
| Pending | Proceed to Installing; downloads were already complete |
| Installing | Re-apply configuration changes (Node states, Unit config) |
| Launching | Re-send RunInstances to SM and/or re-send commands to UMs |
| WaitingActive | Resume waiting for instances/components to reach target state |
| Finalizing | Re-commit installed items and send Unit status to cloud |

Cancellation During Recovery

If a new desired state arrives while the CM is recovering a previous update:

  1. The CM compares the new desired state with the stored one
  2. If they differ, the current recovery is cancelled
  3. The new desired state is stored and a fresh update cycle begins from Downloading
  4. If they are the same, the recovery continues normally
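
A minimal sketch of the cancellation rule follows. The function name is hypothetical, and comparing desired states with plain equality is an assumption; the actual comparison may be more selective.

```python
def handle_desired_state(new_desired, stored_desired, current_state):
    """Return (desired status to process, state to continue from)."""
    if new_desired == stored_desired:
        # Same target: keep recovering from where the update stopped.
        return stored_desired, current_state
    # Different target: cancel the recovery and start a fresh cycle.
    return new_desired, "Downloading"


stored = {"services": {"web": "2.0.0"}}
same = handle_desired_state({"services": {"web": "2.0.0"}}, stored, "Launching")
changed = handle_desired_state({"services": {"web": "1.0.0"}}, stored, "Launching")
```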

Failure Detection Summary

| Layer | Detection Mechanism | Response |
|---|---|---|
| SOTA | Instance fails to start or crashes; WaitingActive timeout (10 min) | Error reported to cloud; cloud decides rollback |
| FOTA | UM reports FAILED state with error details | CM sends RevertUpdate to affected UMs |
| Boot runtime | Systemd health check fails after reboot | Automatic: previous partition remains main |
| Rootfs runtime | Systemd health check fails after reboot | Automatic: failed action triggers revert on next boot |
| CM crash | Process restart with non-None persisted state | Automatic: resume from last persisted state |