Version: v1.1

Update Failure and Rollback

Introduction

This page documents how AosCore detects, reports, and recovers from update failures — covering every stage of the deployment state machine from download through finalization. Each failure scenario has a specific detection mechanism, error reporting path, and recovery action. Understanding these mechanisms is essential for OEMs diagnosing deployment issues and implementing custom Update Managers that handle failures correctly.

The CM Update Manager drives updates through a sequential state machine (Downloading → Pending → Installing → Launching → WaitingActive → Finalizing). A failure at any stage halts forward progress, reports the error to AosCloud, and — depending on the stage — may trigger automatic rollback or require a new update cycle to recover.

Failure Scenarios by Update Stage

Download Failures (Downloading State)

During the Downloading phase, the Image Manager downloads OCI image blobs for all Deployable Items specified in the desired state. Failures can occur at multiple points:

Failure Type	Detection Mechanism	Error Reported
Network timeout	HTTP request timeout (10s)	`eTimeout` — download timed out
Connection refused	curl connection error	`eRuntime` — wraps system errno
Partial download / interrupted	Retry exhausted after 3 attempts	`eRuntime` — last download error
Insufficient storage	Space allocator rejects allocation	`eNoMemory` or space allocation error
Download cancelled	Cancel flag set during retry loop	`eRuntime` — "download cancelled"

Retry behavior: The Downloader retries each blob download up to 3 times with exponential backoff (1s initial delay, doubling to a maximum of 5s between attempts). If all retries are exhausted, the download is considered failed.
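
The retry schedule above can be sketched as follows. This is a minimal illustration in Python, not AosCore code: `backoff_delays`, `download_with_retry`, the `fetch` callable, and the injectable `sleep` parameter are hypothetical names. Only the attempt count and the delay values come from the text.

```python
import time

MAX_ATTEMPTS = 3       # from the text: up to 3 attempts per blob
INITIAL_DELAY_S = 1.0  # from the text: 1s initial delay
MAX_DELAY_S = 5.0      # from the text: delay capped at 5s


def backoff_delays(attempts=MAX_ATTEMPTS):
    """Return the delays slept between consecutive attempts."""
    delays, delay = [], INITIAL_DELAY_S
    for _ in range(attempts - 1):
        delays.append(delay)
        delay = min(delay * 2, MAX_DELAY_S)
    return delays


def download_with_retry(fetch, sleep=time.sleep):
    """Call fetch() until it succeeds or all attempts are exhausted.

    fetch is a hypothetical callable that raises on failure.
    The last error is re-raised when every attempt fails.
    """
    delays = backoff_delays()
    last_error = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fetch()
        except Exception as err:  # sketch only: real code would be narrower
            last_error = err
            if attempt < len(delays):
                sleep(delays[attempt])
    raise last_error
```

With three attempts, only the 1s and 2s delays are ever used. The 5s cap would come into play only if the attempt count were raised.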

Recovery action: The update state machine transitions to eNone (idle). The system remains on its previous desired state. The failed Deployable Items are marked with eFailed state and the error is reported to AosCloud via the Unit status. A subsequent desired state from the cloud (even if identical) triggers a fresh download attempt.

Verification Failures (Downloading State)

After downloading each blob, the Image Manager verifies its integrity by comparing the computed SHA-256 digest against the expected digest from the OCI image index.

Failure Type	Detection Mechanism	Error Reported
Digest mismatch	SHA-256 of downloaded blob ≠ expected digest	`eInvalidChecksum`: integrity verification failed
Size mismatch	Downloaded file size ≠ expected size from descriptor	Size verification error
Corrupted OCI index	Index file cannot be parsed as valid OCI spec	Parse error from OCISpecItf
Decryption failure	CryptoHelper cannot decrypt encrypted blob	Crypto error from CryptoHelperItf

Detection mechanism: The Image Manager calls VerifyBlobChecksum() for each downloaded blob, comparing the file's SHA-256 hash against the digest encoded in the blob's filename (which comes from the OCI content descriptor). Additionally, VerifyBlobsIntegrity() rechecks all blobs for an item when the item is installed in the Finalizing phase.
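
The digest comparison can be illustrated with a short Python sketch. It is not the Image Manager implementation: the function name `verify_blob_checksum` and the assumption that the blob file is named exactly after its lowercase hex SHA-256 digest are illustrative only.

```python
import hashlib
import os


def verify_blob_checksum(path, chunk_size=1 << 20):
    """Return True if the file's SHA-256 matches the digest in its name.

    Assumption for this sketch: the file name is the hex digest itself.
    The file is read in chunks so large blobs are not loaded into memory.
    """
    expected = os.path.basename(path).lower()
    digest = hashlib.sha256()
    with open(path, "rb") as blob:
        for chunk in iter(lambda: blob.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

A caller would treat a `False` result as the digest mismatch case from the table above and discard the blob.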

Recovery action: The corrupted blob is discarded. The affected Deployable Item is set to eFailed state. The overall update transitions to idle and the failure is reported to AosCloud. On the next update cycle, the Image Manager's CleanupDownloadingItems() removes partially downloaded items before retrying.

Installation Failures (Installing State)

The Installing phase applies configuration changes — Node state transitions and Unit configuration updates. These are non-image operations that prepare the environment for new service instances.

Failure Type	Detection Mechanism	Error Reported
Unit config check failure	`CheckUnitConfig()` returns error	Config check error
Unit config apply failure	`UpdateUnitConfig()` returns error	Config application error
Node state change failure	`PauseNode()` or `ResumeNode()` returns error	Node state transition error

Detection mechanism: The InstallDesiredStatus() method calls CheckUnitConfig() before UpdateUnitConfig(). If either fails, the error is captured. For Node state changes, each PauseNode() or ResumeNode() call is checked individually.

Recovery action: Unlike download failures, installation failures do not abort the entire update cycle. The error is recorded per-Node (via SetUpdateNodeStatus()) or per-config (via SetUpdateUnitConfigStatus() with eFailed state), and the state machine continues to the Launching phase. This allows partial success — services can still be launched even if a Node state change or config update failed. The errors are reported to AosCloud as part of the Unit status.
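
The record-and-continue behavior can be sketched as below. This Python outline is an assumption-laden illustration: `install_desired_status`, the callables passed to it, and the shape of the result are hypothetical, and the ordering of Node changes relative to the config update is chosen arbitrarily. Only the check-before-update order and the per-Node error capture come from the text.

```python
def install_desired_status(node_actions, check_config, update_config):
    """Apply Node state changes and the Unit config, collecting errors.

    node_actions maps a node ID to a hypothetical pause/resume callable.
    No failure aborts the phase; each error is recorded and returned so
    the caller can report it and move on to Launching.
    """
    result = {"nodes": {}, "unit_config": None}

    # Each Node state change is checked individually.
    for node_id, action in node_actions.items():
        try:
            action()
            result["nodes"][node_id] = None
        except Exception as err:  # sketch only
            result["nodes"][node_id] = str(err)

    # The config is checked first; it is applied only if the check passes.
    try:
        check_config()
        update_config()
    except Exception as err:  # sketch only
        result["unit_config"] = str(err)

    return result
```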

Launch Failures (Launching State)

During the Launching phase, the CM instructs Service Managers to run new instances (SOTA) or sends PrepareUpdate/StartUpdate to Update Managers (FOTA).

SOTA Launch Failures

Failure Type	Detection Mechanism	Error Reported
SM unreachable	gRPC connection failure to SM	`eRuntime` — connection error
Instance start failure	SM reports `eFailed` in `RunInstancesStatus`	Per-instance `ErrorInfo` from SM
Image not available on SM	SM cannot find required service image	Image resolution error
Resource limit exceeded	SM resource manager rejects instance	Resource allocation error

Detection mechanism: The LaunchInstances() method calls RunInstances() on the Launcher interface. The returned InstanceStatus array contains per-instance state — any instance with eFailed state is logged with its error. However, individual instance failures do not abort the update cycle.

Recovery action: Failed instances remain in eFailed state and are reported to AosCloud. The update continues to WaitingActive. The SM applies its own restart policy (configurable restart interval and start burst) to attempt recovery of failed instances.

FOTA Launch Failures

Failure Type	Detection Mechanism	Error Reported
UM `PrepareUpdate` failure	UM reports `FAILED` state in `UpdateStatus`	`ErrorInfo` in UpdateStatus.error
UM `StartUpdate` failure	UM reports `FAILED` state after start command	`ErrorInfo` in UpdateStatus.error
UM unreachable	gRPC stream disconnection	Connection loss detected

Detection mechanism: The CM monitors the UpdateStatus stream from each registered UM. When a UM transitions to FAILED state, the error field contains the failure description.

Recovery action: When a UM reports FAILED, the CM sends RevertUpdate to that UM (and to any other UMs that have already reached the UPDATED state), restoring all firmware components to their previous versions. The UM transitions back to IDLE after a successful revert. The failure is reported to AosCloud via UpdateFOTAStatus.

WaitingActive Timeout

After launching, the CM enters the WaitingActive state and monitors all instances and components until they reach their target state.

Failure Type	Detection Mechanism	Error Reported
Instance stuck in activating	Instance remains in `eActivating` state beyond timeout	Timeout error
Instance failed during activation	Instance transitions to `eFailed` during wait	Per-instance error
UM component not reaching INSTALLED	UM component remains in `INSTALLING` state	Timeout error

Detection mechanism: The WaitInstancesActive() method polls instance statuses via GetInstancesStatuses() and waits on a condition variable, bounded by cWaitActiveTimeout (10 minutes). If any instance is still in eActivating state when the timeout expires, the wait returns a timeout error.
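
The bounded wait on a condition variable can be sketched in Python as follows. `InstanceWatcher`, its methods, and the string state names are hypothetical stand-ins. Only the 10-minute bound and the "still activating at timeout" rule come from the text.

```python
import threading

WAIT_ACTIVE_TIMEOUT_S = 10 * 60  # mirrors cWaitActiveTimeout (10 minutes)


class InstanceWatcher:
    """Tracks instance states and lets a caller wait for activation."""

    def __init__(self):
        self._cond = threading.Condition()
        self._states = {}

    def set_state(self, instance_id, state):
        # Called whenever a new instance status arrives.
        with self._cond:
            self._states[instance_id] = state
            self._cond.notify_all()

    def wait_instances_active(self, timeout=WAIT_ACTIVE_TIMEOUT_S):
        """Block until no instance is 'activating', or raise TimeoutError."""
        with self._cond:
            done = self._cond.wait_for(
                lambda: all(s != "activating" for s in self._states.values()),
                timeout=timeout,
            )
        if not done:
            raise TimeoutError("instances still activating")
```

Note that an instance that moves to a failed state also ends the wait in this sketch, since it is no longer activating. Its error would then be reported per instance.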

Recovery action: The timeout error causes the update state machine to transition to eNone (idle) without reaching the Finalizing phase. The Deployable Items are not committed as installed — they remain in their pending state. The failure is reported to AosCloud. A new desired state from the cloud can trigger a fresh update attempt.

Finalization Failures (Finalizing State)

The Finalizing phase commits downloaded items as installed and performs cleanup.

Failure Type	Detection Mechanism	Error Reported
Blob integrity failure	`VerifyBlobsIntegrity()` detects corrupted blob	`eInvalidChecksum`
Storage write failure	Database update fails when setting item to installed	Storage error

Detection mechanism: During InstallUpdateItems() (called in the Finalizing phase), the Image Manager verifies blob integrity one final time before committing items to installed state. Items that fail integrity checks are removed and reported as failed.

Recovery action: Items that fail finalization are set to eFailed state. The update state machine transitions to idle. Successfully finalized items remain installed. The partial failure is reported to AosCloud.

FOTA Rollback Protocol

The FOTA rollback mechanism uses the RevertUpdate command in the UM protocol to restore firmware components to their previous version. Rollback is possible only before ApplyUpdate commits the new firmware permanently.

Rollback Triggers

Trigger	Condition	CM Action
UM reports FAILED after PrepareUpdate	Download or verification error in UM	Send `RevertUpdate` to recover UM to IDLE
UM reports FAILED after StartUpdate	Firmware apply operation failed	Send `RevertUpdate` to restore previous firmware
Multi-UM coordination failure	One UM fails while others succeeded	Send `RevertUpdate` to all UMs that reached UPDATED state
New desired state during FOTA	Cancellation of current update	Send `RevertUpdate` to UMs in UPDATED state

Rollback Sequence

When a FOTA rollback is triggered:

1. CM identifies all UMs that need to revert (those in UPDATED or FAILED state)
2. CM sends RevertUpdate command to each affected UM via the gRPC stream
3. Each UM restores the previous firmware version (e.g., switches boot partition back)
4. Each UM reports UpdateStatus with state IDLE to confirm successful revert
5. CM reports the rollback outcome to AosCloud via UpdateFOTAStatus
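
The selection and revert steps above can be sketched in Python. This is an illustration under stated assumptions, not the CM implementation: `select_ums_to_revert`, `rollback`, and the `send_revert` callable (which stands in for sending RevertUpdate over the gRPC stream and returning the state the UM reports afterwards) are hypothetical.

```python
def select_ums_to_revert(um_states):
    """Return the IDs of UMs in UPDATED or FAILED state, sorted."""
    return sorted(
        um_id
        for um_id, state in um_states.items()
        if state in ("UPDATED", "FAILED")
    )


def rollback(um_states, send_revert):
    """Revert every affected UM and summarize the outcome.

    send_revert(um_id) is a hypothetical callable returning the state
    the UM reports after handling the revert command.
    Returns (all_reverted, per_um_state).
    """
    outcome = {}
    for um_id in select_ums_to_revert(um_states):
        outcome[um_id] = send_revert(um_id)
    all_reverted = all(state == "IDLE" for state in outcome.values())
    return all_reverted, outcome
```

The returned pair corresponds to what would be reported as the rollback outcome: whether every affected UM confirmed IDLE, plus the per-UM result.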

Point of No Return

The ApplyUpdate command marks the point of no return for FOTA updates:

Before ApplyUpdate: The previous firmware is preserved (e.g., on the inactive A/B partition). RevertUpdate can restore it.
After ApplyUpdate: The new firmware is committed as permanent. The previous version is removed. Recovery requires a new update cycle with the desired firmware version.

OEMs implementing custom Update Managers must ensure that StartUpdate preserves the previous firmware in a revertible state until ApplyUpdate is received.
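
One way to picture this requirement is a toy A/B-slot Update Manager, sketched in Python below. The class `ABSlotUpdateManager` and its attributes are hypothetical, PrepareUpdate is omitted for brevity, and the state reported after ApplyUpdate is an assumption. The sketch only shows that the previous version stays available until the apply step discards it.

```python
class ABSlotUpdateManager:
    """Toy UM that keeps the previous firmware until apply_update()."""

    def __init__(self, active_version):
        self.state = "IDLE"
        self.active = active_version
        self.previous = None  # preserved version, if any

    def start_update(self, new_version):
        # Switch to the new version but keep the old one revertible.
        self.previous = self.active
        self.active = new_version
        self.state = "UPDATED"

    def revert_update(self):
        if self.previous is None:
            raise RuntimeError("nothing to revert: update already applied")
        self.active = self.previous
        self.previous = None
        self.state = "IDLE"

    def apply_update(self):
        # Point of no return: the previous version is discarded.
        self.previous = None
        self.state = "IDLE"  # assumed state after commit
```

After `apply_update()`, a revert request can no longer succeed in this sketch, which matches the requirement that recovery then needs a new update cycle.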

SOTA Recovery Behavior

Unlike FOTA, SOTA updates do not have an explicit rollback command. Recovery from SOTA failures relies on the desired-state reconciliation model:

Scenario	Recovery Mechanism
Service instance won't start	SM restart policy (configurable interval and burst)
Service repeatedly crashes	Instance remains in `eFailed` state; reported to cloud
Image corrupted after download	Next update cycle re-downloads and re-verifies
Wrong version deployed	Cloud sends corrected desired state; CM reconciles

The key difference: SOTA operates on a convergence model — the system continuously attempts to match the desired state. If a service fails, the SM retries according to its restart policy. If the desired state itself is wrong, the cloud sends a corrected version and the CM processes it as a new update.
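
The convergence idea can be illustrated with a small diff between desired and running services. This Python sketch is a simplification: `reconcile` and the service-to-version dictionaries are hypothetical and do not reflect the actual desired-state schema.

```python
def reconcile(desired, running):
    """Compute the actions needed to converge on the desired state.

    desired and running map a service ID to a version string.
    Returns (to_start, to_stop): services to (re)start at a version,
    and services to stop because they are no longer desired.
    """
    to_start = {
        svc: version
        for svc, version in desired.items()
        if running.get(svc) != version
    }
    to_stop = sorted(svc for svc in running if svc not in desired)
    return to_start, to_stop
```

When the cloud sends a corrected desired state, running the same comparison again yields the new set of actions. No separate rollback command is involved.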

Crash Recovery

The CM Update Manager persists its state on every transition, enabling recovery from process crashes without losing update progress.

Persisted State

Data	Storage Method	Purpose
Current update state	`StoreUpdateState()`	Resume from correct phase after restart
Desired status	`StoreDesiredStatus()`	Know what target to converge toward

Recovery Sequence

When the CM process restarts after a crash:

1. Start() reads the last persisted update state via GetUpdateState()
2. If the state is not eNone, the handler reads the stored desired status via GetDesiredStatus()
3. The update resumes from the persisted state. For example, if the crash occurred during Launching, the handler re-enters the Launching phase
4. The state machine continues forward from that point
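
The persist-and-resume pattern can be sketched in Python as below. This is not the CM storage layer: `UpdateStateStore` is a hypothetical file-backed stand-in for the storage methods named above, the JSON layout is invented for illustration, and the string "None" stands in for eNone.

```python
import json
import os

STATES = [
    "Downloading", "Pending", "Installing",
    "Launching", "WaitingActive", "Finalizing",
]


class UpdateStateStore:
    """File-backed stand-in for the update state and desired status."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"state": "None", "desired": None}
        with open(self.path) as f:
            return json.load(f)

    def save(self, state, desired):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written state file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"state": state, "desired": desired}, f)
        os.replace(tmp, self.path)


def remaining_states(store):
    """Return (phases still to run, desired status) after a restart."""
    saved = store.load()
    if saved["state"] == "None":
        return [], None
    return STATES[STATES.index(saved["state"]):], saved["desired"]
```

Calling `save()` on every transition is what allows `remaining_states()` to pick up at the interrupted phase instead of starting over.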

Recovery Behavior by State

Crashed In State	Recovery Behavior
Downloading	Re-enters Downloading; Image Manager resumes partial downloads (blobs already on disk are reused)
Pending	Advances to Installing (all downloads were complete)
Installing	Re-applies configuration changes (idempotent operations)
Launching	Re-sends RunInstances to SMs; re-sends PrepareUpdate/StartUpdate to UMs
WaitingActive	Re-enters wait loop; checks current instance statuses
Finalizing	Re-runs finalization (idempotent commit of installed items)

Cancellation During Recovery

If a new desired state arrives while the CM is recovering from a crash:

1. The recovery update is marked for cancellation
2. The new desired state is stored as pending
3. Once the current state action completes (or is cancelled), the new update begins from Downloading
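
The hand-off from a cancelled update to a pending one can be sketched as follows. The Python class `UpdateHandler` and its method names are hypothetical. The sketch only models the three steps above and ignores everything else the real handler does.

```python
class UpdateHandler:
    """Toy model of cancelling an in-progress update for a newer one."""

    def __init__(self):
        self.current = None           # desired state being processed
        self.pending = None           # newer desired state, if any
        self.cancel_requested = False

    def on_desired_state(self, desired):
        if self.current is None:
            self.current = desired
            return
        # An update is already running: mark it for cancellation
        # and remember the new target.
        self.cancel_requested = True
        self.pending = desired

    def on_action_finished(self, next_state):
        """Return the state to enter after the current action ends.

        next_state is where the running update would have gone next.
        A pending desired state overrides it and restarts the cycle.
        """
        if self.pending is not None:
            self.current, self.pending = self.pending, None
            self.cancel_requested = False
            return "Downloading"
        return next_state
```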

Error Reporting to AosCloud

Update failures are reported to AosCloud through two channels:

Unit Status

The UnitStatus message includes per-item and per-instance error information:

Deployable Item status: items that failed download, verification, or installation include their ErrorInfo
Instance status: instances that failed to launch or activate include their error details
Unit config status: configuration apply failures include an error description
Node status: per-Node errors for state transition failures

Update Notifications

Active update progress is streamed via the Update Scheduler API:

UpdateSOTAStatus — includes per-service and per-layer status with errors
UpdateFOTAStatus — includes per-component status with errors and overall update state

These notifications allow external systems (HMI, fleet management) to display real-time update progress and failure information.

Related Pages

Error Handling and Recovery: section overview with error structure, propagation model, and recovery strategies
Error Propagation: how errors flow between components, including the ErrorInfo structure

Service Failure Handling: service instance failure detection and restart policies

Update Flow Overview: end-to-end update sequence showing all phases
Update Handler State Machine: complete FOTA UM state machine with commands and transitions
Rollback and Recovery: deployment-flows perspective on rollback mechanisms

Image Deployment Pipeline: detailed image download and verification flow
Desired State Model: how the desired-state reconciliation model drives convergence
