Version: v1.1

Error Handling and Recovery

Introduction

This section documents the error handling and recovery mechanisms in AosCore — how errors are structured, how they propagate between components, how failures are detected and reported to AosCloud, and how the system recovers from faults at various levels. Understanding these mechanisms is essential for OEMs integrating with AosEdge, as they determine how the Unit behaves under failure conditions and what error information is available for diagnostics.

AosCore follows a consistent error handling philosophy across all components: errors are structured with typed codes and human-readable messages, propagated through well-defined interfaces (gRPC between components, JSON over WebSocket to the cloud), and reported as part of the Unit's status. Recovery strategies vary by failure type: automatic retries with exponential backoff for transient failures, resumption from persisted state for interrupted updates, and full rollback for firmware deployment failures.

Error Structure

All AosCore components use a common error type that carries three pieces of information:

Field	Type	Description
AOS code	Integer	Internal error classification code identifying the error category
Exit code	Integer	Process-level exit code (relevant for service instance failures)
Message	String	Human-readable description of the error condition

This structure is defined in the ErrorInfo protobuf message (common.v2.ErrorInfo) and is embedded in status messages across all inter-component APIs — Service Manager instance status, Update Manager component status, IAM operation responses, and cloud-facing Unit status reports.

Internally, the C++ implementation uses a richer Error class with typed enumerations:

Error Type	Meaning
`eNone`	No error — operation succeeded
`eFailed`	Generic failure
`eRuntime`	Runtime error (often wraps system errno)
`eNoMemory`	Memory allocation failure
`eOutOfRange`	Index or value out of valid range
`eNotFound`	Requested resource not found
`eInvalidArgument`	Invalid parameter passed to a function
`eTimeout`	Operation timed out
`eAlreadyExist`	Resource already exists
`eWrongState`	Operation invalid in current state
`eInvalidChecksum`	Integrity verification failed
`eNotSupported`	Operation not supported
`eCanceled`	Operation was canceled

These internal error types are mapped to ErrorInfo when crossing component boundaries via gRPC or when reporting status to the cloud.
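
The mapping from an internal error to the three `ErrorInfo` fields can be illustrated with a minimal sketch. The `Error` and `ErrorInfo` structs and the `ToErrorInfo` helper below are hypothetical stand-ins written for this example, not the AosCore classes or the generated protobuf type, and the numeric value assigned to each category is an assumption.

```cpp
#include <string>

// Internal error categories, named after the table above.
// The numeric values are an assumption made for this sketch.
enum class ErrorType {
    eNone, eFailed, eRuntime, eNoMemory, eOutOfRange, eNotFound, eInvalidArgument,
    eTimeout, eAlreadyExist, eWrongState, eInvalidChecksum, eNotSupported, eCanceled
};

// Hypothetical in-process error: a category plus a description.
struct Error {
    ErrorType   type;
    std::string message;

    Error(ErrorType t = ErrorType::eNone, const std::string& msg = "")
        : type(t), message(msg) {}

    bool IsNone() const { return type == ErrorType::eNone; }
};

// Plain mirror of the three ErrorInfo fields (not the generated protobuf class).
struct ErrorInfo {
    int         aosCode;
    int         exitCode;
    std::string message;
};

// Converts an internal error to the wire-level structure at a component boundary.
// exitCode is only meaningful for service instance failures, so it defaults to 0.
inline ErrorInfo ToErrorInfo(const Error& err, int exitCode = 0)
{
    ErrorInfo info;
    info.aosCode  = static_cast<int>(err.type);
    info.exitCode = exitCode;
    info.message  = err.message;
    return info;
}
```

In this sketch a caller would convert only at the boundary, for example `ToErrorInfo(Error(ErrorType::eTimeout, "download timed out"))`, and keep using the typed `Error` everywhere inside the component.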

Error Propagation Model

Errors in AosCore propagate through a layered architecture:

Layer	Propagation Mechanism	Example
Within a component	Return values (`Error` or `RetWithError<T>`)	SM image download returns error to SM launcher
Between components on the same Node	gRPC status codes + `ErrorInfo` in response messages	SM reports instance failure to CM via `InstanceStatus.error`
Between Nodes	gRPC streams with `ErrorInfo` fields	Secondary Node SM reports to primary Node CM
To the cloud	JSON WebSocket `unitStatus` message with error fields	CM aggregates all errors into `UnitStatus` and sends to AosCloud
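
The first row of the table, propagation by return value inside a component, might look roughly like the sketch below. The shape of `RetWithError<T>` (a value paired with an error) is an assumption based on its name, and `FindImagePath`, `LaunchInstance`, and the `/images/` path are hypothetical names invented for illustration.

```cpp
#include <string>

enum class ErrorType { eNone, eFailed, eNotFound };

// Simplified stand-in for the internal error type.
struct Error {
    ErrorType type;

    Error(ErrorType t = ErrorType::eNone) : type(t) {}

    bool IsNone() const { return type == ErrorType::eNone; }
};

// A value paired with the error that says whether the value is usable.
template <typename T>
struct RetWithError {
    T     mValue;
    Error mError;
};

// Hypothetical lookup that can fail: returns the path or eNotFound.
inline RetWithError<std::string> FindImagePath(const std::string& imageID)
{
    if (imageID.empty()) {
        return {std::string(), Error(ErrorType::eNotFound)};
    }

    return {"/images/" + imageID, Error()};
}

// Hypothetical caller: checks the error and hands it up unchanged.
inline Error LaunchInstance(const std::string& imageID)
{
    RetWithError<std::string> result = FindImagePath(imageID);
    if (!result.mError.IsNone()) {
        return result.mError;
    }

    // ... the instance would be started from result.mValue here ...
    return Error();
}
```

The point of the pattern is that each layer either handles the error or returns it to its caller, so the original category survives until it is converted to `ErrorInfo` at a component boundary.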

Component-to-Component Error Flow

When a failure occurs at the service level, it propagates upward:

Service Manager detects the failure (process exit, resource limit exceeded, health check failure) and sets the instance state to Failed with an ErrorInfo describing the cause
SM reports to CM via the InstanceStatus gRPC message, which includes the error field
CM aggregates the instance status into the Unit-wide status, preserving the error information
CM reports to AosCloud via the unitStatus JSON message, which includes per-instance and per-Deployable-Item error details

For update-related failures, the propagation follows the update orchestration path:

Update Manager (UM) reports component-level errors via UpdateStatus with per-component ErrorInfo
CM Update Manager receives the error, transitions the update state machine to an error state, and includes the error in the FOTA status
CM reports the update failure to AosCloud via UpdateFOTAStatus or UpdateSOTAStatus notifications

Recovery Strategies

AosCore employs different recovery strategies depending on the failure type and severity:

Automatic Retry with Backoff

Transient failures (network timeouts, temporary resource unavailability) are handled with automatic retries using exponential backoff. The common retry utility provides:

Configurable maximum attempts: 3 by default
Exponential backoff: the delay starts at 1 second and doubles on each retry, capped at a configurable maximum (1 minute by default)
Callback notification: callers can be notified on each retry attempt, for example for logging or metrics

This pattern is used for:

Image downloads (Downloader retries with 1s initial delay, 5s max delay, 3 attempts)
gRPC connection establishment (SM Controller reconnects with 10s retry timeout)
Cloud communication (Message Proxy retries with configurable delay and max delay)
IAM certificate operations (retries with 10s timeout, 3 attempts)
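
The retry behavior described in this section can be sketched as follows. This is not the AosCore retry utility: `RetryConfig`, `Retry`, and the injectable `sleep` parameter are hypothetical names chosen for this example, with defaults taken from the values quoted above.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

using Duration = std::chrono::milliseconds;

// Defaults follow the text: 3 attempts, 1 s initial delay, 1 min cap.
struct RetryConfig {
    int      maxAttempts  = 3;
    Duration initialDelay = std::chrono::seconds(1);
    Duration maxDelay     = std::chrono::minutes(1);
};

// Calls op() until it returns true or maxAttempts is reached.
// Between failed attempts it waits, doubling the delay up to maxDelay.
// onRetry (optional) is invoked before each wait; sleep is injectable so the
// backoff schedule can be exercised without real waiting.
inline bool Retry(const std::function<bool()>& op, const RetryConfig& cfg,
    const std::function<void(int attempt, Duration delay)>& onRetry = nullptr,
    const std::function<void(Duration)>& sleep = [](Duration d) { std::this_thread::sleep_for(d); })
{
    Duration delay = cfg.initialDelay;

    for (int attempt = 1; attempt <= cfg.maxAttempts; ++attempt) {
        if (op()) {
            return true;
        }

        if (attempt == cfg.maxAttempts) {
            break; // no wait after the final failed attempt
        }

        if (onRetry) {
            onRetry(attempt, delay);
        }

        sleep(delay);
        delay = std::min(delay * 2, cfg.maxDelay);
    }

    return false;
}
```

With the Downloader values listed above (3 attempts, 1 s initial delay, 5 s maximum), this sketch would wait 1 s and then 2 s between the three attempts; the 5 s cap would only come into play with more attempts.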

Update State Machine Recovery

The CM Update Manager persists its state on every transition. If the CM process crashes during an update:

On restart, the Update Manager reads the last persisted state and desired status from storage
It resumes the update from the last completed state (Downloading, Pending, Installing, Launching, WaitingActive, or Finalizing)
If a new desired state arrives during recovery, the current update is canceled and the new one begins

Because the state is persisted on every transition, an interrupted update resumes from a known point instead of leaving the Unit in an inconsistent, partially deployed state.
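
The persist-then-resume idea can be sketched in a few lines. Everything here is illustrative: `Storage` is an in-memory stand-in for persistent storage, `UpdateStateMachine` is a hypothetical class, and `eNone` is an assumed idle state that is not part of the list above.

```cpp
#include <string>

// States from the list above, plus an assumed idle state (eNone).
enum class UpdateState {
    eNone, eDownloading, ePending, eInstalling, eLaunching, eWaitingActive, eFinalizing
};

// Stand-in for persistent storage; a real one would write to disk or a database.
struct Storage {
    UpdateState savedState = UpdateState::eNone;
    std::string savedDesiredStatus;
};

class UpdateStateMachine {
public:
    // Construction models a process start: the last persisted state is loaded.
    explicit UpdateStateMachine(Storage& storage)
        : mStorage(storage), mState(storage.savedState), mDesired(storage.savedDesiredStatus) {}

    // A new desired status replaces whatever update was in progress.
    void StartUpdate(const std::string& desiredStatus)
    {
        mDesired = desiredStatus;
        Transition(UpdateState::eDownloading);
    }

    // Every transition is written to storage before any further work is done.
    void Transition(UpdateState next)
    {
        mState                      = next;
        mStorage.savedState         = next;
        mStorage.savedDesiredStatus = mDesired;
    }

    UpdateState        State() const { return mState; }
    const std::string& DesiredStatus() const { return mDesired; }

private:
    Storage&    mStorage;
    UpdateState mState;
    std::string mDesired;
};
```

Destroying one `UpdateStateMachine` and constructing another over the same `Storage` models a crash and restart: the new object starts in the state the old one last persisted.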

FOTA Rollback

For firmware updates, the Update Manager protocol supports explicit rollback:

Scenario	Recovery Action
`PrepareUpdate` fails	UM reports error; CM cancels the update cycle
`StartUpdate` fails	CM sends `RevertUpdate`; UM rolls back to previous firmware version
`ApplyUpdate` fails	System remains in revertible state; CM can retry or revert
Post-apply failure detected	Requires new update cycle — `ApplyUpdate` is a point of no return
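
The table above amounts to a lookup from the stage that failed to a recovery action. A minimal sketch, using hypothetical enum and function names that do not come from the AosCore sources:

```cpp
// Stage of the FOTA protocol at which a failure was observed.
enum class FotaStage { ePrepareUpdate, eStartUpdate, eApplyUpdate, ePostApply };

// Recovery actions, one per row of the table.
enum class RecoveryAction { eCancelUpdate, eRevertUpdate, eRetryOrRevert, eNewUpdateCycle };

// Returns the recovery action the table associates with a failed stage.
inline RecoveryAction RecoveryFor(FotaStage failedStage)
{
    switch (failedStage) {
    case FotaStage::ePrepareUpdate:
        return RecoveryAction::eCancelUpdate; // nothing applied yet, just stop the cycle
    case FotaStage::eStartUpdate:
        return RecoveryAction::eRevertUpdate; // CM sends RevertUpdate
    case FotaStage::eApplyUpdate:
        return RecoveryAction::eRetryOrRevert; // still revertible
    case FotaStage::ePostApply:
        return RecoveryAction::eNewUpdateCycle; // past the point of no return
    }

    return RecoveryAction::eCancelUpdate; // not reached for valid enum values
}
```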

Service Instance Recovery

When a service instance fails, the Service Manager handles recovery based on the configured run parameters:

Restart interval — configurable delay before restarting a failed instance
Start burst — maximum number of rapid restarts allowed before backing off
Start interval — minimum time between restart attempts

If an instance repeatedly fails beyond the configured limits, it remains in the Failed state and the error is reported to CM and ultimately to AosCloud for operator intervention.

Connection Recovery

Connections between components, and between the Unit and the cloud, reconnect automatically:

SM Controller reconnects to Service Managers with a 10-second retry timeout
Cloud connection (Message Proxy) reconnects with configurable retry delays
IAM connections reconnect with retry logic when certificate operations fail

When the cloud connection is lost, the Unit continues operating with its last known desired state. Status updates are queued and sent when connectivity is restored.
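
Queuing status updates while offline and flushing them on reconnect could be sketched as below. `StatusSender` and its methods are hypothetical names for this example, the transport is reduced to a callback, and the queue is unbounded here, which a real implementation would likely need to limit.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

class StatusSender {
public:
    using SendFunc = std::function<void(const std::string&)>;

    // send is the function that actually transmits a message to the cloud.
    explicit StatusSender(SendFunc send) : mSend(std::move(send)) {}

    // Sends immediately when connected, otherwise keeps the message for later.
    void Send(const std::string& status)
    {
        if (mConnected) {
            mSend(status);
        } else {
            mPending.push_back(status);
        }
    }

    void OnDisconnected() { mConnected = false; }

    // Flushes queued messages in the order they were produced.
    void OnConnected()
    {
        mConnected = true;

        while (!mPending.empty()) {
            mSend(mPending.front());
            mPending.pop_front();
        }
    }

    std::size_t PendingCount() const { return mPending.size(); }

private:
    SendFunc                mSend;
    bool                    mConnected = false;
    std::deque<std::string> mPending;
};
```

Flushing in arrival order keeps the cloud's view of the Unit consistent with the sequence in which the states actually occurred.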

Cloud Error Reporting

The cloud receives error information through the following channels:

Unit Status Messages

The unitStatus message sent periodically to AosCloud includes error fields at multiple levels:

Per-instance errors — which service instances have failed and why
Per-Deployable-Item errors — which items failed to download, verify, or install
Per-Node errors — Node-level issues (connectivity loss, resource exhaustion)
Unit configuration errors — configuration application failures

Update Notifications

During active updates, the UpdateSchedulerService streams real-time notifications including:

SOTA status with per-item error information
FOTA status with per-component error information and overall update state

Negative Acknowledgments

When the cloud sends a message that cannot be processed, CM responds with a Nack message that includes a retryAfter duration, indicating when the cloud should retry the operation.

Section Contents

This section covers error handling in detail across the following pages:

Error Propagation — detailed documentation of how errors flow between components, including the ErrorInfo structure, gRPC error mapping, and cloud protocol error encoding

Service Failure Handling — how the Service Manager detects, reports, and recovers from service instance failures, including restart policies and resource limit enforcement

Update Failure and Rollback — how update failures are detected at each stage of the deployment state machine, rollback procedures for FOTA, and crash recovery for interrupted updates

Related Pages

Architecture Overview — how CM, SM, IAM, and MP interact at the system level
Deployment Flows — the update orchestration state machine and FOTA protocol
Service Lifecycle — service instance states including the Failed state
Monitoring — how alerts and thresholds relate to error detection
Cloud Communication — the WebSocket JSON protocol used for error reporting

Troubleshooting — practical guidance for diagnosing and resolving common errors
