Update Failure and Rollback
Introduction
This page documents how AosCore detects, reports, and recovers from update failures — covering every stage of the deployment state machine from download through finalization. Each failure scenario has a specific detection mechanism, error reporting path, and recovery action. Understanding these mechanisms is essential for OEMs diagnosing deployment issues and implementing custom Update Managers that handle failures correctly.
The CM Update Manager drives updates through a sequential state machine (Downloading → Pending → Installing → Launching → WaitingActive → Finalizing). A failure at any stage halts forward progress, reports the error to AosCloud, and — depending on the stage — may trigger automatic rollback or require a new update cycle to recover.
Failure Scenarios by Update Stage
Download Failures (Downloading State)
During the Downloading phase, the Image Manager downloads OCI image blobs for all Deployable Items specified in the desired state. Failures can occur at multiple points:
| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Network timeout | HTTP request timeout (10s) | eTimeout — download timed out |
| Connection refused | curl connection error | eRuntime — wraps system errno |
| Partial download / interrupted | Retry exhausted after 3 attempts | eRuntime — last download error |
| Insufficient storage | Space allocator rejects allocation | eNoMemory or space allocation error |
| Download cancelled | Cancel flag set during retry loop | eRuntime — "download cancelled" |
Retry behavior: The Downloader retries each blob download up to 3 times with exponential backoff (1s initial delay, doubling to a maximum of 5s between attempts). If all retries are exhausted, the download is considered failed.
Recovery action: The update state machine transitions to eNone (idle). The system remains on its previous desired
state. The failed Deployable Items are marked with eFailed state and the error is reported to AosCloud via the Unit
status. A subsequent desired state from the cloud (even if identical) triggers a fresh download attempt.
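The retry behavior described above can be sketched as follows. This is a minimal illustration in Python, not the Downloader's actual code: the constants come from the text, while `backoff_delays`, `download_with_retry`, the injected `download_once` callback, and the choice to treat `OSError` as the retryable error are all assumptions made for the example.

```python
import time
from typing import Callable, List

MAX_ATTEMPTS = 3        # from the text: up to 3 attempts per blob
INITIAL_DELAY_S = 1.0   # from the text: 1s initial delay
MAX_DELAY_S = 5.0       # from the text: delay capped at 5s


def backoff_delays(attempts: int = MAX_ATTEMPTS) -> List[float]:
    """Delays slept between consecutive attempts (one fewer than attempts)."""
    delays, delay = [], INITIAL_DELAY_S
    for _ in range(max(attempts - 1, 0)):
        delays.append(delay)
        delay = min(delay * 2, MAX_DELAY_S)
    return delays


def download_with_retry(download_once: Callable[[], None],
                        is_cancelled: Callable[[], bool] = lambda: False,
                        sleep: Callable[[float], None] = time.sleep) -> None:
    """Call download_once until it succeeds or the attempts are used up.

    Raises the last download error after MAX_ATTEMPTS failures, or
    RuntimeError("download cancelled") if the cancel flag is set.
    """
    delays = backoff_delays()
    last_error = None
    for attempt in range(MAX_ATTEMPTS):
        # The cancel flag is checked before every attempt.
        if is_cancelled():
            raise RuntimeError("download cancelled")
        try:
            download_once()
            return
        except OSError as err:  # assumption: I/O errors are retryable
            last_error = err
        # Sleep only if another attempt follows.
        if attempt < len(delays):
            sleep(delays[attempt])
    raise last_error
```

With the documented values, three attempts are separated by delays of 1s and 2s; the 5s cap would only take effect if more attempts were configured.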
Verification Failures (Downloading State)
After downloading each blob, the Image Manager verifies its integrity by comparing the computed SHA-256 digest against the expected digest from the OCI image index.
| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Digest mismatch | SHA-256 of downloaded blob ≠ expected digest | eInvalidChecksum — integrity Verification failed |
| Size mismatch | Downloaded file size ≠ expected size from descriptor | Size Verification error |
| Corrupted OCI index | Index file cannot be parsed as valid OCI spec | Parse error from OCISpecItf |
| Decryption failure | CryptoHelper cannot decrypt encrypted blob | Crypto error from CryptoHelperItf |
Detection mechanism: The Image Manager calls VerifyBlobChecksum() for each downloaded blob, comparing the file's
SHA-256 hash against the digest encoded in the blob's filename (which comes from the OCI content descriptor).
Additionally, VerifyBlobsIntegrity() rechecks all blobs for an item during InstallUpdateItems() in the Finalizing phase.
Recovery action: The corrupted blob is discarded. The affected Deployable Item is set to eFailed state. The
overall update transitions to idle and the failure is reported to AosCloud. On the next update cycle, the Image
Manager's CleanupDownloadingItems() removes partially downloaded items before retrying.
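The digest and size checks described above can be illustrated with a short Python sketch. It assumes the blob is stored under its hex SHA-256 digest as the filename, as the text states; the function names `sha256_of_file` and `verify_blob` and the use of `ValueError` are hypothetical and do not mirror the Image Manager's interfaces.

```python
import hashlib
import os


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as blob:
        for chunk in iter(lambda: blob.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_blob(path: str, expected_size: int) -> None:
    """Check the size first (cheap), then the digest encoded in the filename.

    Assumption: the filename is the expected hex SHA-256 digest.
    """
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        raise ValueError(f"size mismatch: {actual_size} != {expected_size}")
    expected_digest = os.path.basename(path)
    if sha256_of_file(path) != expected_digest:
        raise ValueError("invalid checksum")
```

Hashing in chunks keeps memory use constant regardless of blob size, which matters on storage-constrained units.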
Installation Failures (Installing State)
The Installing phase applies configuration changes — Node state transitions and Unit configuration updates. These are non-image operations that prepare the environment for new service instances.
| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Unit config check failure | CheckUnitConfig() returns error | Config check error |
| Unit config apply failure | UpdateUnitConfig() returns error | Config application error |
| Node state change failure | PauseNode() or ResumeNode() returns error | Node state transition error |
Detection mechanism: The InstallDesiredStatus() method calls CheckUnitConfig() before UpdateUnitConfig(). If
either fails, the error is captured. For Node state changes, each PauseNode() or ResumeNode() call is checked
individually.
Recovery action: Unlike download failures, installation failures do not abort the entire update cycle. The error
is recorded per-Node (via SetUpdateNodeStatus()) or per-config (via SetUpdateUnitConfigStatus() with eFailed
state), and the state machine continues to the Launching phase. This allows partial success — services can still be
launched even if a Node state change or config update failed. The errors are reported to AosCloud as part of the Unit
status.
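The "record the error and continue" pattern described above might look like the following Python sketch. The function name echoes InstallDesiredStatus(), but the signature, the `InstallReport` type, the injected callbacks, and the order of config versus Node operations are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class InstallReport:
    """Errors collected during the Installing phase (hypothetical type)."""
    unit_config_error: Optional[str] = None
    node_errors: Dict[str, str] = field(default_factory=dict)


def install_desired_status(check_config: Callable[[], None],
                           update_config: Callable[[], None],
                           node_actions: Dict[str, Callable[[], None]]) -> InstallReport:
    report = InstallReport()
    try:
        check_config()
        update_config()  # only reached if the check passed
    except Exception as err:
        # Recorded instead of raised, so the update can proceed to Launching.
        report.unit_config_error = str(err)
    for node_id, action in node_actions.items():
        try:
            action()  # e.g. a pause or resume request for this Node
        except Exception as err:
            # Each Node is checked individually; one failure does not stop the rest.
            report.node_errors[node_id] = str(err)
    return report
```

The caller receives a report rather than an exception, which is what lets the state machine move on while still carrying every error into the Unit status.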
Launch Failures (Launching State)
During the Launching phase, the CM instructs Service Managers to run new instances (SOTA) or sends
PrepareUpdate/StartUpdate to Update Managers (FOTA).
SOTA Launch Failures
| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| SM unreachable | gRPC connection failure to SM | eRuntime — connection error |
| Instance start failure | SM reports eFailed in RunInstancesStatus | Per-instance ErrorInfo from SM |
| Image not available on SM | SM cannot find required service image | Image resolution error |
| Resource limit exceeded | SM resource manager rejects instance | Resource allocation error |
Detection mechanism: The LaunchInstances() method calls RunInstances() on the Launcher interface. The returned
InstanceStatus array contains per-instance state — any instance with eFailed state is logged with its error.
However, individual instance failures do not abort the update cycle.
Recovery action: Failed instances remain in eFailed state and are reported to AosCloud. The update continues to
WaitingActive. The SM applies its own restart policy (configurable restart interval and start burst) to attempt recovery
of failed instances.
FOTA Launch Failures
| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| UM PrepareUpdate failure | UM reports FAILED state in UpdateStatus | ErrorInfo in UpdateStatus.error |
| UM StartUpdate failure | UM reports FAILED state after start command | ErrorInfo in UpdateStatus.error |
| UM unreachable | gRPC stream disconnection | Connection loss detected |
Detection mechanism: The CM monitors the UpdateStatus stream from each registered UM. When a UM transitions to
FAILED state, the error field contains the failure description.
Recovery action: When a UM reports FAILED, the CM sends RevertUpdate to that UM (and to any other UMs that have
already reached UPDATED state), restoring all firmware components to their previous versions. The UM transitions back to
IDLE after successful revert. The failure is reported to AosCloud via UpdateFOTAStatus.
WaitingActive Timeout
After launching, the CM enters the WaitingActive state and monitors all instances and components until they reach their target state.
| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Instance stuck in activating | Instance remains in eActivating state beyond timeout | Timeout error |
| Instance failed during activation | Instance transitions to eFailed during wait | Per-instance error |
| UM component not reaching INSTALLED | UM component remains in INSTALLING state | Timeout error |
Detection mechanism: The WaitInstancesActive() method polls instance statuses via GetInstancesStatuses() and
waits on a condition variable for up to cWaitActiveTimeout (10 minutes). If any instance remains
in eActivating state when the timeout expires, the wait returns a timeout error.
Recovery action: The timeout error causes the update state machine to transition to eNone (idle) without reaching
the Finalizing phase. The Deployable Items are not committed as installed — they remain in their pending state. The
failure is reported to AosCloud. A new desired state from the cloud can trigger a fresh update attempt.
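Waiting on a condition variable with a timeout, as described above, can be sketched in Python with `threading.Condition`. The 600-second default reflects the documented 10-minute timeout; the `InstanceTracker` class, its methods, and the lowercase state strings are hypothetical stand-ins for the real interfaces.

```python
import threading
from typing import Dict

WAIT_ACTIVE_TIMEOUT_S = 600.0  # the text gives cWaitActiveTimeout = 10 minutes


class InstanceTracker:
    """Tracks instance states and lets a caller wait until none is activating."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._states: Dict[str, str] = {}

    def set_state(self, instance_id: str, state: str) -> None:
        # Called whenever a status update arrives; wakes any waiter.
        with self._cond:
            self._states[instance_id] = state
            self._cond.notify_all()

    def wait_instances_active(self, timeout: float = WAIT_ACTIVE_TIMEOUT_S) -> None:
        def settled() -> bool:
            # A failed instance also counts as settled; only "activating" blocks.
            return all(s != "activating" for s in self._states.values())

        with self._cond:
            # wait_for re-evaluates the predicate on each wakeup and
            # returns False if the timeout expires first.
            if not self._cond.wait_for(settled, timeout=timeout):
                stuck = sorted(i for i, s in self._states.items() if s == "activating")
                raise TimeoutError(f"instances still activating: {stuck}")
```

Using a predicate with the wait guards against spurious wakeups and against status updates that arrive before the wait begins.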
Finalization Failures (Finalizing State)
The Finalizing phase commits downloaded items as installed and performs cleanup.
| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Blob integrity failure | VerifyBlobsIntegrity() detects corrupted blob | eInvalidChecksum |
| Storage write failure | Database update fails when setting item to installed | Storage error |
Detection mechanism: During InstallUpdateItems() (called in the Finalizing phase), the Image Manager verifies blob
integrity one final time before committing items to installed state. Items that fail integrity checks are removed and
reported as failed.
Recovery action: Items that fail finalization are set to eFailed state. The update state machine transitions to
idle. Successfully finalized items remain installed. The partial failure is reported to AosCloud.
FOTA Rollback Protocol
The FOTA rollback mechanism uses the RevertUpdate command in the UM protocol to restore firmware components to their
previous version. Rollback is possible only before ApplyUpdate commits the new firmware permanently.
Rollback Triggers
| Trigger | Condition | CM Action |
|---|---|---|
| UM reports FAILED after PrepareUpdate | Download or Verification error in UM | Send RevertUpdate to recover UM to IDLE |
| UM reports FAILED after StartUpdate | Firmware apply operation failed | Send RevertUpdate to restore previous firmware |
| Multi-UM coordination failure | One UM fails while others succeeded | Send RevertUpdate to all UMs that reached UPDATED state |
| New desired state during FOTA | Cancellation of current update | Send RevertUpdate to UMs in UPDATED state |
Rollback Sequence
When a FOTA rollback is triggered:
- CM identifies all UMs that need to revert (those in UPDATED or FAILED state)
- CM sends RevertUpdate command to each affected UM via the gRPC stream
- Each UM restores the previous firmware version (e.g., switches boot partition back)
- Each UM reports UpdateStatus with state IDLE to confirm successful revert
- CM reports the rollback outcome to AosCloud via UpdateFOTAStatus
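The rollback sequence above can be sketched as a small Python routine. The state names UPDATED, FAILED, and IDLE come from the text; the functions `ums_to_revert`, `rollback`, and `rollback_succeeded`, and the `send_revert` callback that stands in for the gRPC exchange, are assumptions for illustration.

```python
from typing import Callable, Dict, List

REVERTIBLE_STATES = {"UPDATED", "FAILED"}  # states named in the sequence above


def ums_to_revert(um_states: Dict[str, str]) -> List[str]:
    """Select the UMs that must receive RevertUpdate."""
    return sorted(um for um, state in um_states.items() if state in REVERTIBLE_STATES)


def rollback(um_states: Dict[str, str],
             send_revert: Callable[[str], str]) -> Dict[str, str]:
    """Send a revert to each affected UM and collect the state it reports back.

    send_revert(um_id) is a hypothetical callback returning the UM's
    state after the revert attempt.
    """
    return {um_id: send_revert(um_id) for um_id in ums_to_revert(um_states)}


def rollback_succeeded(outcome: Dict[str, str]) -> bool:
    """The rollback is confirmed only if every reverted UM reports IDLE."""
    return all(state == "IDLE" for state in outcome.values())
```

UMs that are still IDLE are left alone, so a revert is only sent where there is something to undo.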
Point of No Return
The ApplyUpdate command marks the point of no return for FOTA updates:
- Before ApplyUpdate: The previous firmware is preserved (e.g., on the inactive A/B partition). RevertUpdate can restore it.
- After ApplyUpdate: The new firmware is committed as permanent. The previous version is removed. Recovery requires a new update cycle with the desired firmware version.
OEMs implementing custom Update Managers must ensure that StartUpdate preserves the previous firmware in a revertible
state until ApplyUpdate is received.
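To make the revertibility requirement concrete, here is a toy Python model of an Update Manager with two firmware slots. It is a sketch under stated assumptions, not the Aos UM API: the class `ABSlotUpdateManager`, its method names, and the intermediate PREPARED state are introduced for illustration, and real firmware handling (partition switching, reboot) is reduced to swapping version strings.

```python
from typing import Optional


class ABSlotUpdateManager:
    """Toy UM that keeps the previous version until apply_update is called."""

    def __init__(self, version: str) -> None:
        self.state = "IDLE"
        self.active = version                  # version currently in use
        self.previous: Optional[str] = None    # kept only while revert is possible
        self.staged: Optional[str] = None      # downloaded but not yet started

    def _expect(self, state: str) -> None:
        if self.state != state:
            raise RuntimeError(f"expected {state}, got {self.state}")

    def prepare_update(self, version: str) -> None:
        self._expect("IDLE")
        self.staged = version
        self.state = "PREPARED"  # assumption: intermediate state name

    def start_update(self) -> None:
        self._expect("PREPARED")
        # Switch to the new version but keep the old one for a possible revert.
        self.previous, self.active, self.staged = self.active, self.staged, None
        self.state = "UPDATED"

    def apply_update(self) -> None:
        self._expect("UPDATED")
        self.previous = None  # point of no return: old version discarded
        self.state = "IDLE"

    def revert_update(self) -> None:
        # Restores the old version only if it is still preserved.
        if self.previous is not None:
            self.active = self.previous
        self.previous = self.staged = None
        self.state = "IDLE"
```

In this model a revert after `apply_update` leaves the new version active, which mirrors the rule that recovery past the point of no return needs a new update cycle.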
SOTA Recovery Behavior
Unlike FOTA, SOTA updates do not have an explicit rollback command. Recovery from SOTA failures relies on the desired-state reconciliation model:
| Scenario | Recovery Mechanism |
|---|---|
| Service instance won't start | SM restart policy (configurable interval and burst) |
| Service repeatedly crashes | Instance remains in eFailed state; reported to cloud |
| Image corrupted after download | Next update cycle re-downloads and re-verifies |
| Wrong version deployed | Cloud sends corrected desired state; CM reconciles |
The key difference: SOTA operates on a convergence model — the system continuously attempts to match the desired state. If a service fails, the SM retries according to its restart policy. If the desired state itself is wrong, the cloud sends a corrected version and the CM processes it as a new update.
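The convergence model can be illustrated with a minimal reconciliation step in Python. The function `reconcile` and the representation of desired and running state as name-to-version mappings are assumptions for the example; the real desired state carries far more detail.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, str],
              running: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (to_start, to_stop) so that running converges on desired.

    A service whose version differs appears in both lists: the old
    version is stopped and the desired version is started.
    """
    to_stop = sorted(name for name, ver in running.items() if desired.get(name) != ver)
    to_start = sorted(name for name, ver in desired.items() if running.get(name) != ver)
    return to_start, to_stop
```

Because the result depends only on the difference between the two states, sending a corrected desired state is enough to recover from a wrong deployment; no separate rollback command is needed.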
Crash Recovery
The CM Update Manager persists its state on every transition, enabling recovery from process crashes without losing update progress.
Persisted State
| Data | Storage Method | Purpose |
|---|---|---|
| Current update state | StoreUpdateState() | Resume from correct phase after restart |
| Desired status | StoreDesiredStatus() | Know what target to converge toward |
Recovery Sequence
When the CM process restarts after a crash:
- Start() reads the last persisted update state via GetUpdateState()
- If the state is not eNone, the handler reads the stored desired status via GetDesiredStatus()
- The update resumes from the persisted state. For example, if the crash occurred during Launching, the handler re-enters the Launching phase
- The state machine continues forward from that point
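The recovery sequence above can be sketched in Python as persist-then-resume logic. The phase names come from the state machine described earlier; the JSON file format, the `UpdateStorage` class, and `phases_to_run` are hypothetical and only illustrate the idea of resuming from the persisted state.

```python
import json
import os
from typing import List, Optional, Tuple

PHASES = ["Downloading", "Pending", "Installing",
          "Launching", "WaitingActive", "Finalizing"]
IDLE = "None"  # stands in for eNone


class UpdateStorage:
    """Persists the update state and desired status in one JSON file."""

    def __init__(self, path: str) -> None:
        self._path = path

    def store(self, state: str, desired: Optional[dict]) -> None:
        # Write to a temp file and rename, so a crash never leaves a torn file.
        tmp = self._path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"state": state, "desired": desired}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self._path)

    def load(self) -> Tuple[str, Optional[dict]]:
        if not os.path.exists(self._path):
            return IDLE, None
        with open(self._path, encoding="utf-8") as f:
            data = json.load(f)
        return data["state"], data["desired"]


def phases_to_run(storage: UpdateStorage) -> List[str]:
    """On restart, resume at the persisted phase and continue forward."""
    state, _desired = storage.load()
    if state == IDLE:
        return []
    return PHASES[PHASES.index(state):]
```

Resuming by re-entering the persisted phase only works if each phase is safe to repeat, which is why the per-state recovery behavior relies on idempotent operations.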
Recovery Behavior by State
| Crashed In State | Recovery Behavior |
|---|---|
| Downloading | Re-enters Downloading; Image Manager resumes partial downloads (blobs already on disk are reused) |
| Pending | Advances to Installing (all downloads were complete) |
| Installing | Re-applies configuration changes (idempotent operations) |
| Launching | Re-sends RunInstances to SMs; re-sends PrepareUpdate/StartUpdate to UMs |
| WaitingActive | Re-enters wait loop; checks current instance statuses |
| Finalizing | Re-runs finalization (idempotent commit of installed items) |
Cancellation During Recovery
If a new desired state arrives while the CM is recovering from a crash:
- The recovery update is marked for cancellation
- The new desired state is stored as pending
- Once the current state action completes (or is cancelled), the new update begins from Downloading
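The cancellation steps above amount to a small piece of bookkeeping, sketched here in Python. The `UpdateCoordinator` class and its method names are hypothetical; the sketch only shows how a newer desired state replaces an in-flight one without interrupting the current action midway.

```python
from typing import Optional


class UpdateCoordinator:
    """Tracks the in-flight update and at most one pending desired state."""

    def __init__(self) -> None:
        self.current: Optional[str] = None
        self.pending: Optional[str] = None
        self.cancel_requested = False

    def on_desired_state(self, desired: str) -> None:
        if self.current is None:
            # Idle: start immediately.
            self.current = desired
        else:
            # Busy: mark the running update for cancellation and keep
            # only the newest desired state as pending.
            self.cancel_requested = True
            self.pending = desired

    def on_action_finished(self) -> Optional[str]:
        """Called when the current state action completes or is cancelled.

        Returns the desired state that should now begin from
        Downloading, or None if nothing is pending.
        """
        self.current, self.pending = self.pending, None
        self.cancel_requested = False
        return self.current
```

Keeping a single pending slot means that if several desired states arrive during recovery, only the most recent one is processed.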
Error Reporting to AosCloud
Update failures are reported to AosCloud through two channels:
Unit Status
The UnitStatus message includes per-item and per-instance error information:
- Deployable Item status: items that failed download, Verification, or installation include their ErrorInfo
- Instance status: instances that failed to launch or activate include their error details
- Unit config status: configuration apply failures with error description
- Node status: per-Node errors for state transition failures
Update Notifications
Active update progress is streamed via the Update Scheduler API:
- UpdateSOTAStatus: includes per-service and per-layer status with errors
- UpdateFOTAStatus: includes per-component status with errors and overall update state
These notifications allow external systems (HMI, fleet management) to display real-time update progress and failure information.
Related Pages
- Error Handling and Recovery — section overview with error structure, propagation model, and recovery strategies
- Error Propagation — how errors flow between components including the ErrorInfo structure
- Service Failure Handling — service instance failure detection and restart policies
- Update Flow Overview — end-to-end update sequence showing all phases
- Update Handler State Machine — complete FOTA UM state machine with commands and transitions
- Rollback and Recovery — deployment-flows perspective on rollback mechanisms
- Image Deployment Pipeline — detailed image download and Verification flow
- Desired State Model — how the desired-state reconciliation model drives convergence