Version: v1.1

Update Failure and Rollback

Introduction

This page documents how AosCore detects, reports, and recovers from update failures — covering every stage of the deployment state machine from download through finalization. Each failure scenario has a specific detection mechanism, error reporting path, and recovery action. Understanding these mechanisms is essential for OEMs diagnosing deployment issues and implementing custom Update Managers that handle failures correctly.

The CM Update Manager drives updates through a sequential state machine (Downloading → Pending → Installing → Launching → WaitingActive → Finalizing). Every failure is reported to AosCloud. What happens next depends on the stage: download, verification, WaitingActive, and finalization failures return the state machine to idle; installation and SOTA launch failures are recorded while the update continues; FOTA launch failures trigger an automatic rollback.
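
The stage ordering can be pictured as a small state machine. The sketch below is illustrative only: `UpdateState`, `STAGE_ORDER`, and `next_state` are hypothetical names rather than AosCore types, and it assumes the machine returns to idle after a successful Finalizing stage. Stages that record an error and keep going (described later on this page) would simply call `next_state` without `aborted=True`.

```python
from enum import Enum


class UpdateState(Enum):
    NONE = "eNone"  # idle
    DOWNLOADING = "Downloading"
    PENDING = "Pending"
    INSTALLING = "Installing"
    LAUNCHING = "Launching"
    WAITING_ACTIVE = "WaitingActive"
    FINALIZING = "Finalizing"


# Forward order of the stages, as listed in the text above.
STAGE_ORDER = [
    UpdateState.DOWNLOADING,
    UpdateState.PENDING,
    UpdateState.INSTALLING,
    UpdateState.LAUNCHING,
    UpdateState.WAITING_ACTIVE,
    UpdateState.FINALIZING,
]


def next_state(current, aborted=False):
    """Return the state that follows `current`.

    An aborting failure returns the machine to idle. Completing Finalizing
    also returns to idle (an assumption of this sketch).
    """
    if aborted or current is UpdateState.FINALIZING:
        return UpdateState.NONE
    if current is UpdateState.NONE:
        return UpdateState.DOWNLOADING  # a new desired state starts a cycle
    return STAGE_ORDER[STAGE_ORDER.index(current) + 1]
```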

Failure Scenarios by Update Stage

Download Failures (Downloading State)

During the Downloading phase, the Image Manager downloads OCI image blobs for all Deployable Items specified in the desired state. Failures can occur at multiple points:

| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Network timeout | HTTP request timeout (10s) | eTimeout: download timed out |
| Connection refused | curl connection error | eRuntime: wraps system errno |
| Partial download / interrupted | Retry exhausted after 3 attempts | eRuntime: last download error |
| Insufficient storage | Space allocator rejects allocation | eNoMemory or space allocation error |
| Download cancelled | Cancel flag set during retry loop | eRuntime: "download cancelled" |

Retry behavior: The Downloader retries each blob download up to 3 times with exponential backoff (1s initial delay, doubling to a maximum of 5s between attempts). If all retries are exhausted, the download is considered failed.
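
The retry schedule can be worked through in a few lines. This is a minimal sketch under stated assumptions, not the Downloader's implementation: `backoff_delays` and `download_with_retry` are hypothetical helpers, the delay is assumed to be slept between attempts (not before the first one), and transport errors are modeled as Python `OSError`.

```python
import time


def backoff_delays(attempts=3, initial=1.0, maximum=5.0):
    """Delays slept between attempts: one fewer than the number of attempts."""
    delays, delay = [], initial
    for _ in range(attempts - 1):
        delays.append(min(delay, maximum))  # cap each delay at `maximum`
        delay *= 2                          # exponential growth
    return delays


def download_with_retry(fetch, attempts=3, sleep=time.sleep,
                        cancelled=lambda: False):
    """Call `fetch` until it succeeds or the attempts are exhausted."""
    delays = backoff_delays(attempts)
    last_error = None
    for attempt in range(attempts):
        if cancelled():  # cancel flag is checked inside the retry loop
            raise RuntimeError("download cancelled")
        try:
            return fetch()
        except OSError as err:  # timeouts and connection errors
            last_error = err
            if attempt < len(delays):
                sleep(delays[attempt])
    # The last download error is what gets reported.
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

With the documented values, `backoff_delays()` yields a 1s wait and then a 2s wait. The 5s cap only comes into play if the attempt count is raised above four.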

Recovery action: The update state machine transitions to eNone (idle). The system remains on its previous desired state. The failed Deployable Items are marked with eFailed state and the error is reported to AosCloud via the Unit status. A subsequent desired state from the cloud (even if identical) triggers a fresh download attempt.

Verification Failures (Downloading State)

After downloading each blob, the Image Manager verifies its integrity by comparing the computed SHA-256 digest against the expected digest from the OCI image index.

| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Digest mismatch | SHA-256 of downloaded blob ≠ expected digest | eInvalidChecksum: integrity verification failed |
| Size mismatch | Downloaded file size ≠ expected size from descriptor | Size verification error |
| Corrupted OCI index | Index file cannot be parsed as valid OCI spec | Parse error from OCISpecItf |
| Decryption failure | CryptoHelper cannot decrypt encrypted blob | Crypto error from CryptoHelperItf |

Detection mechanism: The Image Manager calls VerifyBlobChecksum() for each downloaded blob, comparing the file's SHA-256 hash against the digest encoded in the blob's filename (which comes from the OCI content descriptor). In addition, VerifyBlobsIntegrity() rechecks all blobs for an item when the item is installed (see Finalization Failures).
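
The digest comparison itself is simple to illustrate. The sketch below is not the Image Manager's code: `verify_blob_checksum` is a hypothetical helper, and it assumes the blob's file name is exactly the lowercase hex SHA-256 digest with no algorithm prefix or extension.

```python
import hashlib
from pathlib import Path


def verify_blob_checksum(blob_path):
    """Return True if the file's SHA-256 matches the digest in its name."""
    expected = Path(blob_path).name.lower()  # assumption: name == hex digest
    sha = hashlib.sha256()
    with open(blob_path, "rb") as blob:
        # Hash in 1 MiB chunks so large blobs are not loaded into memory.
        for chunk in iter(lambda: blob.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest() == expected
```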

Recovery action: The corrupted blob is discarded. The affected Deployable Item is set to eFailed state. The overall update transitions to idle and the failure is reported to AosCloud. On the next update cycle, the Image Manager's CleanupDownloadingItems() removes partially downloaded items before retrying.

Installation Failures (Installing State)

The Installing phase applies configuration changes — Node state transitions and Unit configuration updates. These are non-image operations that prepare the environment for new service instances.

| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Unit config check failure | CheckUnitConfig() returns error | Config check error |
| Unit config apply failure | UpdateUnitConfig() returns error | Config application error |
| Node state change failure | PauseNode() or ResumeNode() returns error | Node state transition error |

Detection mechanism: The InstallDesiredStatus() method calls CheckUnitConfig() before UpdateUnitConfig(). If either fails, the error is captured. For Node state changes, each PauseNode() or ResumeNode() call is checked individually.

Recovery action: Unlike download failures, installation failures do not abort the entire update cycle. The error is recorded per-Node (via SetUpdateNodeStatus()) or per-config (via SetUpdateUnitConfigStatus() with eFailed state), and the state machine continues to the Launching phase. This allows partial success — services can still be launched even if a Node state change or config update failed. The errors are reported to AosCloud as part of the Unit status.
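
The "record the error and keep going" pattern can be sketched as follows. The function name and its callable parameters are hypothetical stand-ins for CheckUnitConfig(), UpdateUnitConfig(), and the per-Node PauseNode()/ResumeNode() calls, and the order (config first, then Nodes) is an assumption of this sketch.

```python
def install_desired_status(check_config, update_config, node_actions):
    """Apply config and Node changes, collecting errors instead of raising.

    Returns (config_error, node_errors) so the caller can report them and
    still proceed to the Launching phase.
    """
    config_error = None
    try:
        check_config()
        update_config()  # only reached if the check passed
    except Exception as err:
        config_error = str(err)

    node_errors = {}
    for node_id, action in node_actions.items():
        try:
            action()  # each Node call is checked individually
        except Exception as err:
            node_errors[node_id] = str(err)
    return config_error, node_errors
```

Because nothing propagates out of the function, one failing Node does not prevent the remaining Nodes from being processed.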

Launch Failures (Launching State)

During the Launching phase, the CM instructs Service Managers to run new instances (SOTA) or sends PrepareUpdate/StartUpdate to Update Managers (FOTA).

SOTA Launch Failures

| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| SM unreachable | gRPC connection failure to SM | eRuntime: connection error |
| Instance start failure | SM reports eFailed in RunInstancesStatus | Per-instance ErrorInfo from SM |
| Image not available on SM | SM cannot find required service image | Image resolution error |
| Resource limit exceeded | SM resource manager rejects instance | Resource allocation error |

Detection mechanism: The LaunchInstances() method calls RunInstances() on the Launcher interface. The returned InstanceStatus array contains per-instance state — any instance with eFailed state is logged with its error. However, individual instance failures do not abort the update cycle.

Recovery action: Failed instances remain in eFailed state and are reported to AosCloud. The update continues to WaitingActive. The SM applies its own restart policy (configurable restart interval and start burst) to attempt recovery of failed instances.

FOTA Launch Failures

| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| UM PrepareUpdate failure | UM reports FAILED state in UpdateStatus | ErrorInfo in UpdateStatus.error |
| UM StartUpdate failure | UM reports FAILED state after start command | ErrorInfo in UpdateStatus.error |
| UM unreachable | gRPC stream disconnection | Connection loss detected |

Detection mechanism: The CM monitors the UpdateStatus stream from each registered UM. When a UM transitions to FAILED state, the error field contains the failure description.

Recovery action: When a UM reports FAILED, the CM sends RevertUpdate to that UM and to any other UMs that have already reached UPDATED state, restoring all firmware components to their previous versions. Each UM transitions back to IDLE after a successful revert. The failure is reported to AosCloud via UpdateFOTAStatus.

WaitingActive Timeout

After launching, the CM enters the WaitingActive state and monitors all instances and components until they reach their target state.

| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Instance stuck in activating | Instance remains in eActivating state beyond timeout | Timeout error |
| Instance failed during activation | Instance transitions to eFailed during wait | Per-instance error |
| UM component not reaching INSTALLED | UM component remains in INSTALLING state | Timeout error |

Detection mechanism: The WaitInstancesActive() method polls instance statuses via GetInstancesStatuses() and waits on a condition variable for up to cWaitActiveTimeout (10 minutes). If any instance is still in eActivating state when the timeout expires, the wait returns a timeout error.
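
A condition-variable wait with a deadline looks roughly like this. The sketch uses Python's `threading.Condition` to illustrate the pattern only. `InstanceWatcher` and its methods are hypothetical, instance states are plain strings, and an instance in any state other than "activating" (including "failed") is treated as settled.

```python
import threading

WAIT_ACTIVE_TIMEOUT_S = 10 * 60  # mirrors the documented 10-minute timeout


class InstanceWatcher:
    def __init__(self):
        self._cond = threading.Condition()
        self._states = {}

    def update(self, instance_id, state):
        """Record a status change and wake any waiter."""
        with self._cond:
            self._states[instance_id] = state
            self._cond.notify_all()

    def wait_instances_active(self, timeout=WAIT_ACTIVE_TIMEOUT_S):
        """Block until no instance is activating, or raise TimeoutError."""
        def settled():
            return all(s != "activating" for s in self._states.values())

        with self._cond:
            # wait_for re-checks the predicate on every notification and
            # returns False if the timeout expires first.
            if not self._cond.wait_for(settled, timeout=timeout):
                raise TimeoutError("instances still activating")
            return dict(self._states)
```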

Recovery action: The timeout error causes the update state machine to transition to eNone (idle) without reaching the Finalizing phase. The Deployable Items are not committed as installed — they remain in their pending state. The failure is reported to AosCloud. A new desired state from the cloud can trigger a fresh update attempt.

Finalization Failures (Finalizing State)

The Finalizing phase commits downloaded items as installed and performs cleanup.

| Failure Type | Detection Mechanism | Error Reported |
|---|---|---|
| Blob integrity failure | VerifyBlobsIntegrity() detects corrupted blob | eInvalidChecksum |
| Storage write failure | Database update fails when setting item to installed | Storage error |

Detection mechanism: During InstallUpdateItems() (called in the Finalizing phase), the Image Manager verifies blob integrity one final time before committing items to installed state. Items that fail integrity checks are removed and reported as failed.

Recovery action: Items that fail finalization are set to eFailed state. The update state machine transitions to idle. Successfully finalized items remain installed. The partial failure is reported to AosCloud.

FOTA Rollback Protocol

The FOTA rollback mechanism uses the RevertUpdate command in the UM protocol to restore firmware components to their previous version. Rollback is possible only before ApplyUpdate commits the new firmware permanently.

Rollback Triggers

| Trigger | Condition | CM Action |
|---|---|---|
| UM reports FAILED after PrepareUpdate | Download or verification error in UM | Send RevertUpdate to recover UM to IDLE |
| UM reports FAILED after StartUpdate | Firmware apply operation failed | Send RevertUpdate to restore previous firmware |
| Multi-UM coordination failure | One UM fails while others succeeded | Send RevertUpdate to all UMs that reached UPDATED state |
| New desired state during FOTA | Cancellation of current update | Send RevertUpdate to UMs in UPDATED state |

Rollback Sequence

When a FOTA rollback is triggered:

  1. CM identifies all UMs that need to revert (those in UPDATED or FAILED state)
  2. CM sends RevertUpdate command to each affected UM via the gRPC stream
  3. Each UM restores the previous firmware version (e.g., switches boot partition back)
  4. Each UM reports UpdateStatus with state IDLE to confirm successful revert
  5. CM reports the rollback outcome to AosCloud via UpdateFOTAStatus
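
The selection and revert steps above can be sketched as follows. This is an illustration, not the CM's code: `plan_revert` and `run_rollback` are hypothetical, UM states are plain strings, and `send_revert` stands in for sending RevertUpdate over the gRPC stream and waiting for the UM's next UpdateStatus.

```python
def plan_revert(um_states):
    """Step 1: pick the UMs that must receive RevertUpdate."""
    return sorted(um for um, state in um_states.items()
                  if state in ("UPDATED", "FAILED"))


def run_rollback(um_states, send_revert):
    """Steps 2-4: revert each affected UM and record whether it reached IDLE.

    `send_revert(um_id)` is assumed to return the state the UM reports
    after handling the command.
    """
    outcome = {}
    for um_id in plan_revert(um_states):
        outcome[um_id] = send_revert(um_id) == "IDLE"
    return outcome  # step 5 would report this outcome to AosCloud
```

UMs that are already IDLE are left alone, so a rollback only touches components that were actually changed or left in a failed state.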

Point of No Return

The ApplyUpdate command marks the point of no return for FOTA updates:

  • Before ApplyUpdate: The previous firmware is preserved (e.g., on the inactive A/B partition). RevertUpdate can restore it.
  • After ApplyUpdate: The new firmware is committed as permanent. The previous version is removed. Recovery requires a new update cycle with the desired firmware version.

OEMs implementing custom Update Managers must ensure that StartUpdate preserves the previous firmware in a revertible state until ApplyUpdate is received.
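
For a custom Update Manager, the contract amounts to keeping the previous version recoverable between StartUpdate and ApplyUpdate. The toy model below shows one way to honor it. Everything in it is hypothetical: the class, the method names, the "PREPARED" intermediate state, and the assumption that the UM returns to IDLE after ApplyUpdate. A real UM would switch boot partitions instead of swapping strings.

```python
class AbSlotUpdateManager:
    """Toy A/B-slot model of a revertible firmware update."""

    def __init__(self, version):
        self.active = version   # version currently booted
        self.previous = None    # kept until ApplyUpdate
        self.staged = None      # written by PrepareUpdate
        self.state = "IDLE"

    def prepare_update(self, version):
        self.staged = version
        self.state = "PREPARED"  # hypothetical intermediate state

    def start_update(self):
        # Switch to the new version but keep the old one revertible.
        self.previous, self.active = self.active, self.staged
        self.staged = None
        self.state = "UPDATED"

    def revert_update(self):
        # Only possible while the previous version is still preserved.
        if self.previous is not None:
            self.active = self.previous
        self.previous = self.staged = None
        self.state = "IDLE"

    def apply_update(self):
        # Point of no return: the previous version is discarded.
        self.previous = None
        self.state = "IDLE"
```

In this model a revert after `apply_update()` leaves the new version active, which is the behavior the point of no return describes.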

SOTA Recovery Behavior

Unlike FOTA, SOTA updates do not have an explicit rollback command. Recovery from SOTA failures relies on the desired-state reconciliation model:

| Scenario | Recovery Mechanism |
|---|---|
| Service instance won't start | SM restart policy (configurable interval and burst) |
| Service repeatedly crashes | Instance remains in eFailed state; reported to cloud |
| Image corrupted after download | Next update cycle re-downloads and re-verifies |
| Wrong version deployed | Cloud sends corrected desired state; CM reconciles |

The key difference: SOTA operates on a convergence model — the system continuously attempts to match the desired state. If a service fails, the SM retries according to its restart policy. If the desired state itself is wrong, the cloud sends a corrected version and the CM processes it as a new update.
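
Convergence can be illustrated as a diff between the desired and the running sets. The sketch below is a generic reconciliation step, not the CM's algorithm: `reconcile` is a hypothetical function, and services are modeled as a plain mapping from service ID to version.

```python
def reconcile(desired, running):
    """Compute what must change for `running` to match `desired`.

    Both arguments map service ID -> version. Returns (to_start, to_stop):
    services to start at the desired version, and IDs to stop because they
    are no longer desired or run at the wrong version.
    """
    to_start = {sid: ver for sid, ver in desired.items()
                if running.get(sid) != ver}
    to_stop = sorted(sid for sid, ver in running.items()
                     if desired.get(sid) != ver)
    return to_start, to_stop
```

When a wrong version was deployed, a corrected desired state simply produces a new diff. No rollback command is needed because the next reconciliation pass replaces the wrong version.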

Crash Recovery

The CM Update Manager persists its state on every transition, enabling recovery from process crashes without losing update progress.

Persisted State

| Data | Storage Method | Purpose |
|---|---|---|
| Current update state | StoreUpdateState() | Resume from correct phase after restart |
| Desired status | StoreDesiredStatus() | Know what target to converge toward |

Recovery Sequence

When the CM process restarts after a crash:

  1. Start() reads the last persisted update state via GetUpdateState()
  2. If the state is not eNone, the handler reads the stored desired status via GetDesiredStatus()
  3. The update resumes from the persisted state — for example, if the crash occurred during Launching, the handler re-enters the Launching phase
  4. The state machine continues forward from that point
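
The persist-then-resume pattern behind this sequence can be sketched with a small file-backed store. This is an assumption-laden illustration: `UpdateStateStore` and `resume_point` are hypothetical, the state and desired status are kept together in one JSON file (the CM stores them through separate calls), and the write-then-rename step is one common way to avoid a half-written record after a crash.

```python
import json
from pathlib import Path


class UpdateStateStore:
    def __init__(self, path):
        self._path = Path(path)

    def store(self, state, desired_status):
        """Persist the state on every transition."""
        tmp = self._path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"state": state, "desired": desired_status}))
        tmp.replace(self._path)  # rename over the old record in one step

    def load(self):
        if not self._path.exists():
            return "eNone", None
        data = json.loads(self._path.read_text())
        return data["state"], data["desired"]


def resume_point(store):
    """Return (state, desired_status) to resume from, or None if idle."""
    state, desired = store.load()
    if state == "eNone":
        return None
    return state, desired
```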

Recovery Behavior by State

| Crashed In State | Recovery Behavior |
|---|---|
| Downloading | Re-enters Downloading; Image Manager resumes partial downloads (blobs already on disk are reused) |
| Pending | Advances to Installing (all downloads were complete) |
| Installing | Re-applies configuration changes (idempotent operations) |
| Launching | Re-sends RunInstances to SMs; re-sends PrepareUpdate/StartUpdate to UMs |
| WaitingActive | Re-enters wait loop; checks current instance statuses |
| Finalizing | Re-runs finalization (idempotent commit of installed items) |

Cancellation During Recovery

If a new desired state arrives while the CM is recovering from a crash:

  1. The recovery update is marked for cancellation
  2. The new desired state is stored as pending
  3. Once the current state action completes (or is cancelled), the new update begins from Downloading
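
The three steps above amount to a cancel flag plus a single pending slot. The sketch below models only that bookkeeping. `UpdateQueue` and its methods are hypothetical, and it assumes that a newer desired state overwrites an older pending one.

```python
class UpdateQueue:
    """Tracks the update in progress and at most one pending desired state."""

    def __init__(self):
        self.current = None
        self.pending = None
        self.cancel_requested = False

    def on_desired_state(self, desired):
        if self.current is None:
            self.current = desired     # idle: start right away
            return
        self.cancel_requested = True   # step 1: mark current for cancellation
        self.pending = desired         # step 2: store the new state as pending

    def on_action_finished(self):
        """Step 3: called when the current state action completes or is cancelled."""
        self.current, self.pending = self.pending, None
        self.cancel_requested = False
        return self.current            # next update starts from Downloading
```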

Error Reporting to AosCloud

Update failures are reported to AosCloud through two channels:

Unit Status

The UnitStatus message includes per-item and per-instance error information:

  • Deployable Item status — items that failed download, verification, or installation include their ErrorInfo
  • Instance status — instances that failed to launch or activate include their error details
  • Unit config status — configuration apply failures with error description
  • Node status — per-Node errors for state transition failures

Update Notifications

Active update progress is streamed via the Update Scheduler API:

  • UpdateSOTAStatus — includes per-service and per-layer status with errors
  • UpdateFOTAStatus — includes per-component status with errors and overall update state

These notifications allow external systems (HMI, fleet management) to display real-time update progress and failure information.