Version: v1.1

Rollback and Recovery

Introduction

This page documents the rollback and recovery mechanisms in AosCore — how the system detects failed deployments, reverts to known-good states, and recovers from crashes that interrupt an update in progress. These mechanisms ensure that Units in the field remain operational even when updates fail, and that no partial deployment leaves the system in an inconsistent state.

AosCore provides rollback and recovery at several levels:

- SOTA rollback: AosCloud sends a new desired state that specifies the previous service version
- FOTA rollback: CM sends a RevertUpdate command to the Update Manager (UM) before the update is committed
- Boot runtime rollback: A/B partition switching with automatic, health-check-based rollback
- Rootfs runtime rollback: an action-file-based state machine with health-check-driven revert
- CM crash recovery: persisted update state lets an interrupted update resume after a process restart

SOTA Rollback

SOTA (Software Over The Air) rollback is handled implicitly through the desired-state convergence model. There is no explicit "revert" command for software updates.

How It Works

1. The Service Manager (SM) deploys a new service version via RunInstances
2. If the new instance fails to start or crashes, SM reports the error to CM
3. CM reports the failure to AosCloud as part of the Unit status
4. AosCloud sends a new desired state specifying the previous service version
5. CM processes the new desired state and instructs SM to run the old version

Previous service images remain in the local image store, so no re-download is needed.
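The convergence model above can be illustrated with a minimal Python sketch. It is not AosCore code: the `Action` type, the `converge` function, and the "navigation" service are hypothetical names chosen for illustration, and removal of services that are absent from the desired state is left out. The point it shows is that a rollback is just another desired state, and that no download is needed when the previous image is still in the local store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str      # "download", "stop" or "start"
    service: str
    version: str


def converge(running, desired, image_store):
    """Return the actions that move `running` towards `desired`.

    running / desired: dict mapping service name -> version
    image_store: set of (service, version) pairs available locally
    """
    actions = []
    for service, version in desired.items():
        current = running.get(service)
        if current == version:
            continue  # already converged, nothing to do
        if (service, version) not in image_store:
            actions.append(Action("download", service, version))
        if current is not None:
            actions.append(Action("stop", service, current))
        actions.append(Action("start", service, version))
    return actions


# Version 2.0.0 failed, so the cloud sends a desired state naming 1.0.0 again.
store = {("navigation", "1.0.0"), ("navigation", "2.0.0")}
rollback = converge({"navigation": "2.0.0"}, {"navigation": "1.0.0"}, store)
```

Here `rollback` contains only a stop of 2.0.0 followed by a start of 1.0.0, because both images are already present in the store.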

Characteristics

Aspect	Detail
Trigger	AosCloud sends a new desired state with the previous version
Mechanism	SM stops failed instances, starts previous version
Reboot required	No
Rollback window	Unlimited — previous images remain in the local store
Automatic	No — requires AosCloud to decide and send the rollback desired state
Scope	Individual service instances on specific Nodes

Failure Detection

For SOTA, failure is detected when:

- A service instance fails to reach the running state within the WaitingActive timeout (10 minutes)
- A service instance crashes after starting
- SM reports an instance in the eFailed state

The CM aggregates these statuses and reports them to AosCloud, which can then decide whether to roll back.
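As a rough illustration of these detection rules, the following Python sketch classifies a single instance. The state strings ("failed", "active", "activating") are stand-ins chosen for this example and are not the real AosCore enumerators; only the 10-minute timeout is taken from the text above.

```python
from datetime import timedelta

WAITING_ACTIVE_TIMEOUT = timedelta(minutes=10)  # value stated in this section


def is_failed(state, waited):
    """Return True when an instance should be reported as failed.

    state: "failed" stands in for eFailed (start failure or crash),
           "active" stands in for a running instance (assumed names)
    waited: time spent waiting for the instance to become active
    """
    if state == "failed":
        return True
    if state != "active" and waited >= WAITING_ACTIVE_TIMEOUT:
        return True  # never reached the running state in time
    return False
```

For example, `is_failed("activating", timedelta(minutes=9))` is false, while the same call with ten minutes elapsed is true.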

FOTA Rollback

FOTA (Firmware Over The Air) rollback uses an explicit two-phase commit model through the Update Manager (UM) protocol. The firmware update is not permanent until CM sends ApplyUpdate.

How It Works

1. CM sends PrepareUpdate: the UM downloads and verifies the firmware images
2. CM sends StartUpdate: the UM applies the firmware (the system may reboot)
3. The UM reports the UPDATED state: the firmware is applied but not yet committed
4. If verification succeeds, CM sends ApplyUpdate and the update becomes permanent
5. If verification fails or the UM reports FAILED, CM sends RevertUpdate
6. On RevertUpdate, the UM restores the previous firmware version and returns to IDLE
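The two-phase behaviour can be modelled with a small state machine. This is a toy sketch rather than the UM implementation: the class name, the method names, and the intermediate "PREPARED" state are assumptions, while IDLE, UPDATED, and FAILED are the states named in the text. It shows that a revert is possible until the update is applied, and not afterwards.

```python
class UpdateManagerModel:
    """Toy model of the UM side of the two-phase update protocol."""

    def __init__(self, version):
        self.state = "IDLE"
        self.committed = version  # version restored by a revert
        self.staged = None        # version being updated to, if any

    def _require(self, *allowed):
        if self.state not in allowed:
            raise RuntimeError(f"command not valid in state {self.state}")

    def prepare_update(self, version):
        self._require("IDLE")
        self.staged = version
        self.state = "PREPARED"   # assumed intermediate state name

    def start_update(self):
        self._require("PREPARED")
        self.state = "UPDATED"    # applied, but not yet committed

    def fail(self):
        self._require("PREPARED", "UPDATED")
        self.state = "FAILED"

    def apply_update(self):
        self._require("UPDATED")
        self.committed, self.staged = self.staged, None
        self.state = "IDLE"       # the previous version is now gone

    def revert_update(self):
        self._require("PREPARED", "UPDATED", "FAILED")
        self.staged = None
        self.state = "IDLE"       # still on self.committed
```

Calling `revert_update()` after `apply_update()` raises an error in this model, which mirrors the closed rollback window once the update is committed.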

RevertUpdate Command

The RevertUpdate command is sent by CM when:

- The UM reports FAILED state after StartUpdate
- CM determines the update should be abandoned (e.g., a new desired state arrives)
- The system fails health checks after firmware is applied

The UM is responsible for implementing the actual revert operation — typically switching back to the previous boot partition or restoring the previous firmware image.

Rollback Window

Phase	Rollback Possible
Before `StartUpdate`	Yes — simply don't proceed
After `StartUpdate`, before `ApplyUpdate`	Yes — `RevertUpdate` restores previous version
After `ApplyUpdate`	No — previous version is discarded

Multi-UM Coordination

When multiple Update Managers are registered (e.g., one for rootfs and one for MCU firmware), CM coordinates rollback across all of them:

- If any UM reports FAILED during the update cycle, CM sends RevertUpdate to all UMs that have already applied their updates
- This returns the system to a fully consistent state in which no partial firmware update remains
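The coordination rule can be sketched as a single selection function. The function name and the UM identifiers are hypothetical. The sketch assumes that the failed UM itself is also sent RevertUpdate, in line with the RevertUpdate Command section, in addition to the UMs that have already applied their updates.

```python
def ums_to_revert(reported):
    """Pick the UMs that should receive RevertUpdate.

    reported: dict mapping UM id -> last reported state
    """
    if "FAILED" not in reported.values():
        return []  # nothing failed, no coordinated rollback needed
    return sorted(um for um, state in reported.items()
                  if state in ("UPDATED", "FAILED"))
```

With `{"rootfs": "UPDATED", "mcu": "FAILED", "modem": "IDLE"}` the result is `["mcu", "rootfs"]`; the idle UM has nothing to revert.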

Boot Runtime Rollback

The boot runtime manages system-level components (such as the kernel or bootloader) using an A/B partition scheme. It provides automatic rollback based on health checks after reboot.

A/B Partition Architecture

The boot runtime maintains two boot partitions (cNumBootPartitions = 2). At any time, one partition is the main (active) partition and the other holds the previous version. The runtime tracks the following concepts:

Concept	Description
Current partition	The partition the system actually booted from
Main partition	The partition marked as the default boot target
Installed data	Metadata about the currently committed version
Pending data	Metadata about an in-progress update

Update Flow

1. SM instructs the boot runtime to start a new instance (new firmware version)
2. The runtime writes the new image to the next partition (alternate from current)
3. The runtime sets the main boot to the new partition via SetMainBoot()
4. The runtime requests a reboot via RebootRequired()
5. After reboot, the system boots from the new partition
6. The runtime calls SetBootOK() to confirm successful boot
7. A health check runs against configured systemd services (mHealthCheckServices)
8. If the health check passes, the update is committed and the new partition becomes the installed version
9. If the health check fails, the pending update is marked as eFailed and the system reverts to the previous partition

Automatic Rollback

The boot runtime's rollback is automatic and requires no cloud intervention:

Boot from new partition
    → SetBootOK()
    → Run health check (SystemdUpdateChecker)
    → Health check passes?
        YES → SetMainBoot(new partition), commit update
        NO  → Mark pending as FAILED, keep previous partition as main

If the health check fails, the boot controller does not update the main boot pointer. On the next reboot, the system boots from the previous (still-main) partition, effectively rolling back the firmware.
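The post-reboot decision in this section can be sketched against an in-memory boot controller. The Python method names mirror the BootControllerItf operations but are illustrative only; `FakeBootController` and `finish_boot_update` are hypothetical, and the health check is reduced to a callable that returns a boolean.

```python
class FakeBootController:
    """In-memory stand-in for a BootControllerItf implementation."""

    def __init__(self, current, main):
        self.current = current   # partition the system booted from
        self.main = main         # default boot partition
        self.boot_ok = False

    def get_current_boot(self):
        return self.current

    def get_main_boot(self):
        return self.main

    def set_main_boot(self, index):
        self.main = index

    def set_boot_ok(self):
        self.boot_ok = True


def finish_boot_update(ctrl, health_check):
    """Run the decision taken after booting from the new partition."""
    ctrl.set_boot_ok()
    if health_check():
        ctrl.set_main_boot(ctrl.get_current_boot())
        return "committed"
    # Main boot pointer is left alone, so the next reboot
    # starts from the previous partition again.
    return "failed"
```

For a controller created with `current=1, main=0`, a failing health check leaves `main` at 0, while a passing one moves it to 1.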

Partition Synchronization

After a successful update, the boot runtime synchronizes the inactive partition with the active one (SyncPartition()). This ensures both partitions contain the same version, so the system has a known-good fallback at all times.

Boot Controller Interface

The boot controller abstraction (BootControllerItf) provides the hardware-specific operations:

Method	Purpose
`GetPartitionDevices()`	Returns the list of boot partition device paths
`GetCurrentBoot()`	Returns the index of the partition the system booted from
`GetMainBoot()`	Returns the index of the default boot partition
`SetMainBoot(index)`	Sets which partition to boot from next
`SetBootOK()`	Confirms the current boot is successful (prevents watchdog rollback)

The default implementation uses EFI boot variables (EFIBootController), but OEMs can provide custom implementations for their hardware platform.

Rootfs Runtime Rollback

The rootfs runtime manages system-level components that are deployed as squashfs images (e.g., the root filesystem). It uses an action-file-based state machine to track update progress across reboots and perform health-check-driven rollback.

Action-Based State Machine

The rootfs runtime persists its state using action files in the working directory. Each action file represents the next operation to perform after a reboot:

Action File	Meaning
`do_update`	An update image is staged and ready to be applied (contains update type: "full" or "incremental")
`updated`	The update has been applied; health check should run
`do_apply`	Health check passed; the update should be committed
`failed`	Health check failed; the update should be reverted

Update Flow

1. SM instructs the rootfs runtime to start a new instance (new rootfs version)
2. The runtime copies the image to the working directory and creates a do_update action file
3. The runtime requests a reboot
4. An external update agent (e.g., a systemd service) detects do_update and applies the image
5. After reboot, the runtime reads the action file:
   - If updated → run health check
   - If failed → revert to previous version
   - If no action file and a pending update exists → update was committed externally (do_apply was processed)
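The action-file dispatch after reboot might look like the following Python sketch. The function name and the returned step labels are hypothetical, and treating `do_update` and `do_apply` as "still waiting for the external agent" is an assumption made for this example.

```python
import tempfile
from pathlib import Path


def next_step(workdir, has_pending):
    """Decide what the runtime does after boot, based on action files."""
    if (workdir / "updated").exists():
        return "run_health_check"
    if (workdir / "failed").exists():
        return "revert"
    if (workdir / "do_update").exists() or (workdir / "do_apply").exists():
        return "wait_for_agent"  # assumption: the external agent acts on these
    if has_pending:
        return "committed"       # do_apply was already processed externally
    return "idle"


# Example: the agent applied the image and left an `updated` file behind.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "updated").touch()
    step = next_step(workdir, has_pending=True)
```

In the example, `step` is `"run_health_check"` because the `updated` file is present.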

Health Check and Rollback

After reboot with the updated action:

1. The runtime starts a health check thread (RunHealthCheck)
2. The SystemdUpdateChecker verifies that the configured systemd services are running correctly
3. If the health check passes, the runtime writes a do_apply action file and requests another reboot to commit
4. If the health check fails, the runtime writes a failed action file and requests a reboot to revert

On the next boot after a failed action:

1. The runtime detects the failed action file
2. The pending instance is reported with eFailed state
3. The current (previous) instance is reported as eActive
4. Update artifacts are cleaned up
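A minimal sketch of how the health check outcome could be turned into the next action file, and of the statuses reported after a failed update. Both function names are hypothetical, and removing the `updated` file at this point is an assumption; the text only states which file is written next.

```python
import tempfile
from pathlib import Path


def record_health_result(workdir, healthy):
    """Replace the `updated` action with the follow-up action file."""
    (workdir / "updated").unlink(missing_ok=True)  # assumption: consumed here
    action = "do_apply" if healthy else "failed"
    (workdir / action).touch()
    return action  # the caller would now request a reboot


def statuses_after_failed(current_version, pending_version):
    """Instance states reported on the boot that finds a `failed` file."""
    return {current_version: "eActive", pending_version: "eFailed"}


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "updated").touch()
    written = record_health_result(workdir, healthy=False)
    left_behind = sorted(p.name for p in workdir.iterdir())
```

After the failing check in the example, the working directory holds only the `failed` file, which is what the runtime finds on the next boot.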

Image Types

The rootfs runtime supports two update image types, determined by the OCI layer media type:

Type	Media Type Prefix	Description
Full	`vnd.aos.image.component.full`	Complete rootfs image replacement
Incremental	`vnd.aos.image.component.inc`	Delta update applied to current rootfs
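Selecting the update type from the layer media type can be sketched as a prefix match. The function name is hypothetical, and the sketch assumes the media type string starts directly with the prefix from the table; the "full" and "incremental" labels are the update types recorded in the do_update action file.

```python
FULL_PREFIX = "vnd.aos.image.component.full"
INCREMENTAL_PREFIX = "vnd.aos.image.component.inc"


def update_type(media_type):
    """Map an OCI layer media type to the rootfs update type."""
    if media_type.startswith(FULL_PREFIX):
        return "full"
    if media_type.startswith(INCREMENTAL_PREFIX):
        return "incremental"
    raise ValueError(f"unsupported media type: {media_type}")
```

Any media type that matches neither prefix is rejected here rather than guessed at.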

CM Crash Recovery

The CM Update Manager persists its state on every transition, enabling recovery after a process crash or system restart.

Persisted State

The StorageItf interface provides two persistence operations:

Method	What is Stored
`StoreUpdateState()`	Current position in the update state machine (Downloading, Pending, Installing, Launching, WaitingActive, Finalizing)
`StoreDesiredStatus()`	The complete desired status being processed (Nodes, Unit config, Deployable Items, instances, certificates)

Recovery Process

When the CM starts (or restarts after a crash):

1. The DesiredStatusHandler calls GetUpdateState() to check for a persisted state
2. If the state is not eNone, an update was in progress when the process stopped
3. The handler calls GetDesiredStatus() to retrieve the stored desired status
4. The update resumes from the persisted state position

CM Start
    → GetUpdateState() returns persisted state
    → State != eNone?
        YES → GetDesiredStatus(), resume update from persisted state
        NO  → Wait for new desired state from cloud
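The recovery check can be illustrated with a file-backed stand-in for StorageItf. This is a sketch under assumptions: `FileStorage`, `on_start`, the JSON file layout, and the returned labels are all hypothetical, and the real storage backend may differ. What it shows is that a new process instance sees the state written before the crash.

```python
import json
import tempfile
from pathlib import Path


class FileStorage:
    """Hypothetical file-backed stand-in for StorageItf."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"state": "eNone", "desired": None}

    def _save(self, data):
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(data))
        tmp.replace(self.path)  # rename over the old file in one step

    def store_update_state(self, state):
        data = self._load()
        data["state"] = state
        self._save(data)

    def store_desired_status(self, desired):
        data = self._load()
        data["desired"] = desired
        self._save(data)

    def get_update_state(self):
        return self._load()["state"]

    def get_desired_status(self):
        return self._load()["desired"]


def on_start(storage):
    """Decide at startup whether to resume an interrupted update."""
    state = storage.get_update_state()
    if state == "eNone":
        return ("wait_for_desired_state", None)
    return ("resume_" + state, storage.get_desired_status())


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "update.json"
    before_crash = FileStorage(path)
    before_crash.store_desired_status({"navigation": "2.0.0"})
    before_crash.store_update_state("Installing")
    after_restart = on_start(FileStorage(path))  # fresh object, as after a crash
    clean_start = on_start(FileStorage(Path(tmp) / "empty.json"))
```

In the example, `after_restart` resumes from Installing with the stored desired status, while `clean_start` waits for a new desired state.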

State-Specific Recovery Behavior

Persisted State	Recovery Action
`Downloading`	Restart the download phase — Image Manager re-downloads any incomplete items
`Pending`	Proceed to Installing — downloads were already complete
`Installing`	Re-apply configuration changes (Node states, Unit config)
`Launching`	Re-send `RunInstances` to SM and/or re-send commands to UMs
`WaitingActive`	Resume waiting for instances/components to reach target state
`Finalizing`	Re-commit installed items and send Unit status to cloud
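One way to picture resumption is as re-entering an ordered list of phases at the persisted position. The sketch below assumes that the states run in the order in which they are listed in this section; the function name is hypothetical.

```python
PHASES = ["Downloading", "Pending", "Installing",
          "Launching", "WaitingActive", "Finalizing"]


def remaining_phases(persisted):
    """Phases still to run when recovery re-enters at `persisted`."""
    return PHASES[PHASES.index(persisted):]
```

For instance, a crash during Launching leaves Launching, WaitingActive, and Finalizing to run, so the download is not repeated.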

Cancellation During Recovery

If a new desired state arrives while the CM is recovering a previous update:

1. The CM compares the new desired state with the stored one
2. If they differ, the current recovery is cancelled, the new desired state is stored, and a fresh update cycle begins from Downloading
3. If they are the same, the recovery continues normally
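The comparison rule can be expressed in a few lines. The function name and the returned labels are hypothetical, and desired states are modelled as plain dictionaries for the sake of the example.

```python
def on_new_desired_state(stored, incoming):
    """Return (decision, desired state to keep) during recovery."""
    if incoming == stored:
        return ("continue_recovery", stored)
    # A different desired state cancels recovery and replaces the stored one.
    return ("restart_from_Downloading", incoming)
```

Receiving the same desired state again therefore does not restart work that recovery has already resumed.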

Failure Detection Summary

Layer	Detection Mechanism	Response
SOTA	Instance fails to start or crashes; WaitingActive timeout (10 min)	Error reported to cloud; cloud decides rollback
FOTA	UM reports `FAILED` state with error details	CM sends `RevertUpdate` to affected UMs
Boot runtime	Systemd health check fails after reboot	Automatic — previous partition remains main
Rootfs runtime	Systemd health check fails after reboot	Automatic — `failed` action triggers revert on next boot
CM crash	Process restart with non-None persisted state	Automatic — resume from last persisted state

Related Pages

- Deployment Flows: section overview with deployment architecture and update orchestration
- Update Flow Overview: end-to-end update sequence showing all phases including error handling
- SOTA vs FOTA: detailed comparison of software and firmware update mechanisms
- Update Handler State Machine: complete FOTA Update Manager state machine with RevertUpdate command details
- Update Manager (CM module): internal architecture of the CM Update Manager including crash recovery
- Service Instance States: service state machine including failure states
- Update Failure and Rollback: error handling perspective on update failures
