Error Handling and Recovery
Introduction
This section documents the error handling and recovery mechanisms in AosCore — how errors are structured, how they propagate between components, how failures are detected and reported to AosCloud, and how the system recovers from faults at various levels. Understanding these mechanisms is essential for OEMs integrating with AosEdge, as they determine how the Unit behaves under failure conditions and what error information is available for diagnostics.
AosCore follows a consistent error handling philosophy across all components: errors are structured with typed codes and human-readable messages, propagated through well-defined interfaces (gRPC between components, JSON WebSocket to the cloud), and reported as part of the Unit's status. Recovery strategies vary by failure type — from automatic retries with exponential backoff for transient failures, to state machine resets for update failures, to full rollback for firmware deployment failures.
Error Structure
All AosCore components use a common error type that carries three pieces of information:
| Field | Type | Description |
|---|---|---|
| AOS code | Integer | Internal error classification code identifying the error category |
| Exit code | Integer | Process-level exit code (relevant for service instance failures) |
| Message | String | Human-readable description of the error condition |
This structure is defined in the ErrorInfo protobuf message (common.v2.ErrorInfo) and is embedded in status messages
across all inter-component APIs — Service Manager instance status, Update Manager component status, IAM operation
responses, and cloud-facing Unit status reports.
Internally, the C++ implementation uses a richer Error class with typed enumerations:
| Error Type | Meaning |
|---|---|
| eNone | No error (operation succeeded) |
| eFailed | Generic failure |
| eRuntime | Runtime error (often wraps system errno) |
| eNoMemory | Memory allocation failure |
| eOutOfRange | Index or value out of valid range |
| eNotFound | Requested resource not found |
| eInvalidArgument | Invalid parameter passed to a function |
| eTimeout | Operation timed out |
| eAlreadyExist | Resource already exists |
| eWrongState | Operation invalid in current state |
| eInvalidChecksum | Integrity verification failed |
| eNotSupported | Operation not supported |
| eCanceled | Operation was canceled |
These internal error types are mapped to ErrorInfo when crossing component boundaries via gRPC or when reporting
status to the cloud.
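As a minimal sketch of that mapping (a model for illustration, not the AosCore API), the internal error type can be flattened into the three-field record at a component boundary. Only the error type names and the three fields come from the tables above; the numeric values, the Python names, and the `to_error_info` helper are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class ErrorType(IntEnum):
    """Internal error categories. Numeric values are illustrative assumptions."""
    NONE = 0
    FAILED = 1
    RUNTIME = 2
    NO_MEMORY = 3
    OUT_OF_RANGE = 4
    NOT_FOUND = 5
    INVALID_ARGUMENT = 6
    TIMEOUT = 7
    ALREADY_EXIST = 8
    WRONG_STATE = 9
    INVALID_CHECKSUM = 10
    NOT_SUPPORTED = 11
    CANCELED = 12


@dataclass(frozen=True)
class ErrorInfo:
    """Three-field error record mirroring the Error Structure table."""
    aos_code: int
    exit_code: int
    message: str


def to_error_info(err_type: ErrorType, message: str = "", exit_code: int = 0) -> ErrorInfo:
    """Flatten an internal error into the record that crosses component boundaries."""
    return ErrorInfo(aos_code=int(err_type), exit_code=exit_code, message=message)


def is_error(info: ErrorInfo) -> bool:
    """A record with the 'none' code represents success."""
    return info.aos_code != ErrorType.NONE
```

For example, `to_error_info(ErrorType.TIMEOUT, "download timed out")` yields a record whose message is kept verbatim and whose code identifies the category.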
Error Propagation Model
Errors in AosCore propagate through a layered architecture:
| Layer | Propagation Mechanism | Example |
|---|---|---|
| Within a component | Return values (Error or RetWithError<T>) | SM image download returns error to SM launcher |
| Between components on the same Node | gRPC status codes + ErrorInfo in response messages | SM reports instance failure to CM via InstanceStatus.error |
| Between Nodes | gRPC streams with ErrorInfo fields | Secondary Node SM reports to primary Node CM |
| To the cloud | JSON WebSocket unitStatus message with error fields | CM aggregates all errors into UnitStatus and sends to AosCloud |
Component-to-Component Error Flow
When a failure occurs at the service level, it propagates upward:
- Service Manager detects the failure (process exit, resource limit exceeded, health check failure) and sets the instance state to Failed with an ErrorInfo describing the cause
- SM reports to CM via the InstanceStatus gRPC message, which includes the error field
- CM aggregates the instance status into the Unit-wide status, preserving the error information
- CM reports to AosCloud via the unitStatus JSON message, which includes per-instance and per-Deployable-Item error details
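The upward flow described above can be sketched as three small functions, one per hop. This is a simplified model under stated assumptions: the field names (`instance_id`, `state`, `instances`, `messageType`) and the helper functions are hypothetical, and the real messages are protobuf and JSON schemas defined by AosCore.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ErrorInfo:
    aos_code: int
    exit_code: int
    message: str


@dataclass
class InstanceStatus:
    instance_id: str                    # hypothetical identifier field
    state: str                          # e.g. "active" or "failed"
    error: Optional[ErrorInfo] = None   # populated only on failure


def sm_report_failure(instance_id: str, exit_code: int, reason: str) -> InstanceStatus:
    """SM side: mark the instance failed and attach the cause."""
    return InstanceStatus(instance_id, "failed", ErrorInfo(1, exit_code, reason))


def cm_aggregate(statuses: List[InstanceStatus]) -> dict:
    """CM side: fold per-instance statuses into one Unit-wide structure,
    keeping each error record intact."""
    return {"instances": [asdict(s) for s in statuses]}


def cm_to_cloud(unit_status: dict) -> str:
    """CM side: encode the aggregated status as a JSON message for the cloud."""
    return json.dumps({"messageType": "unitStatus", **unit_status})
```

The point of the sketch is that the error record is carried unchanged through each hop, so the cause recorded by SM is what the cloud eventually sees.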
For update-related failures, the propagation follows the update orchestration path:
- Update Manager (UM) reports component-level errors via UpdateStatus with per-component ErrorInfo
- CM Update Manager receives the error, transitions the update state machine to an error state, and includes the error in the FOTA status
- CM reports the update failure to AosCloud via UpdateFOTAStatus or UpdateSOTAStatus notifications
Recovery Strategies
AosCore employs different recovery strategies depending on the failure type and severity:
Automatic Retry with Backoff
Transient failures (network timeouts, temporary resource unavailability) are handled with automatic retries using exponential backoff. The common retry utility provides:
- Configurable maximum attempts — default 3 retries
- Exponential backoff — initial delay of 1 second, doubling on each retry up to a configurable maximum (default 1 minute)
- Callback notification — callers can be notified on each retry attempt for logging or metrics
This pattern is used for:
- Image downloads (Downloader retries with 1s initial delay, 5s max delay, 3 attempts)
- gRPC connection establishment (SM Controller reconnects with 10s retry timeout)
- Cloud communication (Message Proxy retries with configurable delay and max delay)
- IAM certificate operations (retries with 10s timeout, 3 attempts)
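The retry behavior described above can be sketched as follows. This is not the AosCore retry utility; the function names, the `on_retry` callback signature, and the injectable `sleep` parameter are assumptions, and the sketch treats the default of 3 as the total number of attempts.

```python
import time


def backoff_delays(attempts, initial=1.0, maximum=60.0):
    """Yield the waits between attempts: initial, doubled each time, capped at maximum."""
    delay = initial
    for _ in range(max(attempts - 1, 0)):
        yield min(delay, maximum)
        delay *= 2


def retry(operation, attempts=3, initial=1.0, maximum=60.0,
          on_retry=None, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, initial, maximum)
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last error to the caller
            if on_retry is not None:
                on_retry(attempt, delay, err)  # hook for logging or metrics
            sleep(delay)
```

With the Downloader settings quoted above (1s initial delay, 5s maximum, 3 attempts), `backoff_delays(3, 1.0, 5.0)` produces waits of 1 and 2 seconds; the 5-second cap only takes effect from the fourth attempt onward.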
Update State Machine Recovery
The CM Update Manager persists its state on every transition. If the CM process crashes during an update:
- On restart, the Update Manager reads the last persisted state and desired status from storage
- It resumes the update from the last completed state (Downloading, Pending, Installing, Launching, WaitingActive, or Finalizing)
- If a new desired state arrives during recovery, the current update is canceled and the new one begins
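Because the state is persisted on every transition, an interrupted deployment is either resumed or superseded after a restart rather than left half-applied, which keeps the Unit out of an inconsistent state.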
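A minimal sketch of this persist-and-resume pattern is shown below. The storage format (a JSON file), the write-then-rename step, and the names `UpdateStateStore` and `resume_point` are assumptions for illustration; AosCore's actual storage backend is not described here.

```python
import json
import os
from pathlib import Path

# State names taken from the list above.
STATES = ["Downloading", "Pending", "Installing", "Launching", "WaitingActive", "Finalizing"]


class UpdateStateStore:
    """Persists the update state and desired status on every transition."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, state, desired_status):
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        # Write to a temporary file, then rename, so a crash during the
        # write cannot leave a truncated state file behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"state": state, "desired": desired_status}))
        os.replace(tmp, self.path)

    def load(self):
        if not self.path.exists():
            return None
        return json.loads(self.path.read_text())


def resume_point(store, incoming_desired=None):
    """Decide what to do after a restart."""
    saved = store.load()
    if saved is None:
        return ("idle", None)                  # no update was in progress
    if incoming_desired is not None and incoming_desired != saved["desired"]:
        return ("restart", incoming_desired)   # new desired state supersedes the old update
    return ("resume", saved["state"])          # continue from the last persisted state
```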
FOTA Rollback
For firmware updates, the Update Manager protocol supports explicit rollback:
| Scenario | Recovery Action |
|---|---|
| PrepareUpdate fails | UM reports error; CM cancels the update cycle |
| StartUpdate fails | CM sends RevertUpdate; UM rolls back to previous firmware version |
| ApplyUpdate fails | System remains in revertible state; CM can retry or revert |
| Post-apply failure detected | Requires new update cycle — ApplyUpdate is a point of no return |
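The table above amounts to a lookup from the failed stage to a recovery action, which can be sketched as below. The `Stage` enum, the action labels, and the helper names are hypothetical; only the stage names and the actions they map to come from the table.

```python
from enum import Enum


class Stage(Enum):
    PREPARE = "PrepareUpdate"
    START = "StartUpdate"
    APPLY = "ApplyUpdate"
    POST_APPLY = "PostApply"   # failure detected after ApplyUpdate completed


# Recovery action per failed stage, following the table above.
RECOVERY = {
    Stage.PREPARE: "cancel",               # CM cancels the update cycle
    Stage.START: "revert",                 # CM sends RevertUpdate
    Stage.APPLY: "retry_or_revert",        # still revertible
    Stage.POST_APPLY: "new_update_cycle",  # past the point of no return
}


def recovery_action(failed_stage: Stage) -> str:
    return RECOVERY[failed_stage]


def can_revert(failed_stage: Stage) -> bool:
    """Revert is only meaningful while the previous firmware is still available."""
    return failed_stage in (Stage.START, Stage.APPLY)
```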
Service Instance Recovery
When a service instance fails, the Service Manager handles recovery based on the configured run parameters:
- Restart interval — configurable delay before restarting a failed instance
- Start burst — maximum number of rapid restarts allowed before backing off
- Start interval — minimum time between restart attempts
If an instance repeatedly fails beyond the configured limits, it remains in the Failed state and the error is reported
to CM and ultimately to AosCloud for operator intervention.
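One way to picture how these parameters interact is the sketch below. It makes an assumption that should be checked against the Service Failure Handling page: the start interval is read as the window over which the start burst is counted, which is how systemd's StartLimitIntervalSec and StartLimitBurst behave. The class and method names are hypothetical.

```python
from collections import deque


class RestartPolicy:
    """Decides whether a failed instance may be restarted, and when."""

    def __init__(self, restart_interval, start_burst, start_interval):
        self.restart_interval = restart_interval  # delay before each restart
        self.start_burst = start_burst            # restarts allowed per window
        self.start_interval = start_interval      # window length (assumed meaning)
        self._recent = deque()                    # times of recent restarts

    def on_failure(self, now):
        """Return the time of the next restart, or None if the limit is hit
        and the instance should stay in the Failed state."""
        # Forget restarts that have fallen out of the window.
        while self._recent and now - self._recent[0] >= self.start_interval:
            self._recent.popleft()
        if len(self._recent) >= self.start_burst:
            return None
        restart_at = now + self.restart_interval
        self._recent.append(restart_at)
        return restart_at
```

Under this reading, an instance that crashes a third time within the window, with a burst of 2, is left in the Failed state until an operator or a new desired state intervenes.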
Connection Recovery
Connections between components, and the connection to the cloud, implement automatic reconnection:
- SM Controller reconnects to Service Managers with a 10-second retry timeout
- Cloud connection (Message Proxy) reconnects with configurable retry delays
- IAM connections reconnect with retry logic when certificate operations fail
When the cloud connection is lost, the Unit continues operating with its last known desired state. Status updates are queued and sent when connectivity is restored.
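The queue-and-flush behavior can be sketched as follows. The class name, the bounded queue size, and the policy of dropping the oldest entry when the queue is full are assumptions made for the example, not documented AosCore behavior.

```python
from collections import deque


class StatusQueue:
    """Buffers status updates while offline and sends them in order on reconnect."""

    def __init__(self, send, max_pending=100):
        self._send = send                           # callable that transmits one status
        self._pending = deque(maxlen=max_pending)   # oldest entries are dropped when full
        self.connected = False

    def publish(self, status):
        self._pending.append(status)
        if self.connected:
            self.flush()

    def on_connected(self):
        self.connected = True
        self.flush()

    def on_disconnected(self):
        self.connected = False

    def flush(self):
        while self._pending:
            # Remove an entry only after it was sent, so a failed send keeps it queued.
            self._send(self._pending[0])
            self._pending.popleft()
```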
Cloud Error Reporting
The cloud receives error information through the following channels:
Unit Status Messages
The unitStatus message sent periodically to AosCloud includes error fields at multiple levels:
- Per-instance errors — which service instances have failed and why
- Per-Deployable-Item errors — which items failed to download, verify, or install
- Per-Node errors — Node-level issues (connectivity loss, resource exhaustion)
- Unit configuration errors — configuration application failures
Update Notifications
During active updates, the UpdateSchedulerService streams real-time notifications including:
- SOTA status with per-item error information
- FOTA status with per-component error information and overall update state
Negative Acknowledgments
When the cloud sends a message that cannot be processed, CM responds with a Nack message that includes a retryAfter
duration, indicating when the cloud should retry the operation.
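To illustrate how a receiver might act on such a message, the sketch below builds a Nack and computes the earliest retry time. Only the `retryAfter` field name comes from the text above; the other field names, the use of seconds as the unit, and both helper functions are assumptions, so consult the Cloud Communication page for the actual schema.

```python
import json


def build_nack(correlation_id, retry_after_s):
    """Encode a negative acknowledgment (hypothetical message layout)."""
    return json.dumps({
        "messageType": "nack",            # assumed field
        "correlationId": correlation_id,  # assumed field
        "retryAfter": retry_after_s,      # unit assumed to be seconds
    })


def next_attempt_time(nack_json, received_at):
    """Earliest time at which the sender should retry the rejected message."""
    nack = json.loads(nack_json)
    return received_at + float(nack["retryAfter"])
```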
Section Contents
This section covers error handling in detail across the following pages:
- Error Propagation — detailed documentation of how errors flow between components, including the ErrorInfo structure, gRPC error mapping, and cloud protocol error encoding
- Service Failure Handling — how the Service Manager detects, reports, and recovers from service instance failures, including restart policies and resource limit enforcement
- Update Failure and Rollback — how update failures are detected at each stage of the deployment state machine, rollback procedures for FOTA, and crash recovery for interrupted updates
Related Pages
- Architecture Overview — how CM, SM, IAM, and MP interact at the system level
- Deployment Flows — the update orchestration state machine and FOTA protocol
- Service Lifecycle — service instance states including the Failed state
- Monitoring — how alerts and thresholds relate to error detection
- Cloud Communication — the WebSocket JSON protocol used for error reporting
- Troubleshooting — practical guidance for diagnosing and resolving common errors