Version: v1.1

Error Handling and Recovery

Introduction

This section documents the error handling and recovery mechanisms in AosCore — how errors are structured, how they propagate between components, how failures are detected and reported to AosCloud, and how the system recovers from faults at various levels. Understanding these mechanisms is essential for OEMs integrating with AosEdge, as they determine how the Unit behaves under failure conditions and what error information is available for diagnostics.

AosCore follows a consistent error handling philosophy across all components: errors are structured with typed codes and human-readable messages, propagated through well-defined interfaces (gRPC between components, JSON over WebSocket to the cloud), and reported as part of the Unit's status. Recovery strategies vary by failure type: automatic retries with exponential backoff for transient failures, persisted state machine recovery for interrupted updates, and explicit rollback for firmware deployment failures.

Error Structure

All AosCore components use a common error type that carries three pieces of information:

| Field | Type | Description |
|---|---|---|
| AOS code | Integer | Internal error classification code identifying the error category |
| Exit code | Integer | Process-level exit code (relevant for service instance failures) |
| Message | String | Human-readable description of the error condition |

This structure is defined in the ErrorInfo protobuf message (common.v2.ErrorInfo) and is embedded in status messages across all inter-component APIs — Service Manager instance status, Update Manager component status, IAM operation responses, and cloud-facing Unit status reports.

Internally, the C++ implementation uses a richer Error class with typed enumerations:

| Error Type | Meaning |
|---|---|
| eNone | No error; operation succeeded |
| eFailed | Generic failure |
| eRuntime | Runtime error (often wraps system errno) |
| eNoMemory | Memory allocation failure |
| eOutOfRange | Index or value out of valid range |
| eNotFound | Requested resource not found |
| eInvalidArgument | Invalid parameter passed to a function |
| eTimeout | Operation timed out |
| eAlreadyExist | Resource already exists |
| eWrongState | Operation invalid in current state |
| eInvalidChecksum | Integrity verification failed |
| eNotSupported | Operation not supported |
| eCanceled | Operation was canceled |

These internal error types are mapped to ErrorInfo when crossing component boundaries via gRPC or when reporting status to the cloud.
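
The boundary mapping can be illustrated with a minimal sketch. It is not the AosCore implementation: the `Error` struct, the numeric enum values, the `ToErrorInfo` helper, and the choice to reuse the enum value as the AOS code are all assumptions made for illustration. Only the three `ErrorInfo` fields come from the table above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical subset of the internal error types listed above.
// The numeric values are assumptions for this sketch.
enum class ErrorType : std::int32_t { eNone = 0, eFailed = 1, eRuntime = 2, eNotFound = 3, eTimeout = 4 };

// Hypothetical internal error: a typed code, an optional errno, and a message.
struct Error {
    ErrorType   type;
    int         sysErrno;
    std::string message;
};

// Plain struct mirroring the three documented ErrorInfo fields.
struct ErrorInfo {
    std::int32_t aosCode;
    std::int32_t exitCode;
    std::string  message;
};

// Convert an internal error at a component boundary.
// Reusing the enum value as the AOS code is an assumption of this sketch.
ErrorInfo ToErrorInfo(const Error& err, std::int32_t exitCode)
{
    ErrorInfo info;
    info.aosCode  = static_cast<std::int32_t>(err.type);
    info.exitCode = exitCode;
    info.message  = err.message;

    // Keep the wrapped errno visible to whoever reads the report.
    if (err.type == ErrorType::eRuntime && err.sysErrno != 0) {
        info.message += " (errno " + std::to_string(err.sysErrno) + ")";
    }

    return info;
}
```

Flattening the richer internal type into three plain fields keeps the wire format stable even if the internal enumeration grows.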

Error Propagation Model

Errors in AosCore propagate through a layered architecture:

| Layer | Propagation Mechanism | Example |
|---|---|---|
| Within a component | Return values (Error or RetWithError<T>) | SM image download returns error to SM launcher |
| Between components on the same Node | gRPC status codes + ErrorInfo in response messages | SM reports instance failure to CM via InstanceStatus.error |
| Between Nodes | gRPC streams with ErrorInfo fields | Secondary Node SM reports to primary Node CM |
| To the cloud | JSON WebSocket unitStatus message with error fields | CM aggregates all errors into UnitStatus and sends to AosCloud |
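
For the first layer, return-value propagation might look roughly like the sketch below. The `Error` and `RetWithError<T>` definitions here are simplified stand-ins, not the real AosCore classes, and `DownloadImage`, `LaunchInstance`, the member names, and the file path are hypothetical.

```cpp
#include <string>

// Hypothetical minimal error type for this sketch.
struct Error {
    bool        failed;
    std::string message;

    bool IsNone() const { return !failed; }
};

// Value plus error, in the spirit of RetWithError<T>; member names are assumptions.
template <typename T>
struct RetWithError {
    T     mValue;
    Error mError;
};

// Hypothetical download step: returns a local path or an error.
RetWithError<std::string> DownloadImage(const std::string& url)
{
    if (url.empty()) {
        return {std::string(), Error{true, "empty image URL"}};
    }

    return {"/tmp/image.bin", Error{false, ""}};
}

// Hypothetical launcher step: hands the download error to its caller unchanged.
Error LaunchInstance(const std::string& url)
{
    RetWithError<std::string> image = DownloadImage(url);
    if (!image.mError.IsNone()) {
        return image.mError;
    }

    // A real launcher would start the instance from image.mValue here.
    return Error{false, ""};
}
```

Returning errors as values makes every failure path explicit at the call site, which is why the pattern is common in C++ code that avoids exceptions.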

Component-to-Component Error Flow

When a failure occurs at the service level, it propagates upward:

  1. Service Manager detects the failure (process exit, resource limit exceeded, health check failure) and sets the instance state to Failed with an ErrorInfo describing the cause
  2. SM reports to CM via the InstanceStatus gRPC message, which includes the error field
  3. CM aggregates the instance status into the Unit-wide status, preserving the error information
  4. CM reports to AosCloud via the unitStatus JSON message, which includes per-instance and per-Deployable-Item error details
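
Step 3 above, aggregation that preserves error details, can be sketched as follows. The struct layouts, the string-valued `state`, and the `UpdateInstance` and `CountFailed` helpers are assumptions for illustration. The actual gRPC and cloud message definitions are not reproduced here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified copy of the three documented ErrorInfo fields.
struct ErrorInfo {
    int         aosCode;
    int         exitCode;
    std::string message;
};

// Hypothetical, simplified status reported by a Service Manager.
struct InstanceStatus {
    std::string instanceID;
    std::string state;  // for example "active" or "failed"
    ErrorInfo   error;
};

// Hypothetical Unit-wide view kept by CM.
struct UnitStatus {
    std::vector<InstanceStatus> instances;
};

// Merge one report into the Unit status, keeping the error information intact.
void UpdateInstance(UnitStatus& unit, const InstanceStatus& report)
{
    for (InstanceStatus& existing : unit.instances) {
        if (existing.instanceID == report.instanceID) {
            existing = report;
            return;
        }
    }

    unit.instances.push_back(report);
}

// Count the instances that would be reported as failed.
std::size_t CountFailed(const UnitStatus& unit)
{
    std::size_t failed = 0;

    for (const InstanceStatus& instance : unit.instances) {
        if (instance.state == "failed") {
            ++failed;
        }
    }

    return failed;
}
```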

For update-related failures, the propagation follows the update orchestration path:

  1. Update Manager (UM) reports component-level errors via UpdateStatus with per-component ErrorInfo
  2. CM Update Manager receives the error, transitions the update state machine to an error state, and includes the error in the FOTA status
  3. CM reports the update failure to AosCloud via UpdateFOTAStatus or UpdateSOTAStatus notifications

Recovery Strategies

AosCore employs different recovery strategies depending on the failure type and severity:

Automatic Retry with Backoff

Transient failures (network timeouts, temporary resource unavailability) are handled with automatic retries using exponential backoff. The common retry utility provides:

  • Configurable maximum attempts — default 3 retries
  • Exponential backoff — initial delay of 1 second, doubling on each retry up to a configurable maximum (default 1 minute)
  • Callback notification — callers can be notified on each retry attempt for logging or metrics

This pattern is used for:

  • Image downloads (Downloader retries with 1s initial delay, 5s max delay, 3 attempts)
  • gRPC connection establishment (SM Controller reconnects with 10s retry timeout)
  • Cloud communication (Message Proxy retries with configurable delay and max delay)
  • IAM certificate operations (retries with 10s timeout, 3 attempts)
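
The retry behaviour described above can be sketched as a small helper. This is an illustrative sketch, not the AosCore retry utility: `RetryPolicy`, `Retry`, and the callback signature are hypothetical, and the sketch treats the configured maximum as a total attempt count. Waiting is left to the `onRetry` callback so the logic can be exercised without real delays.

```cpp
#include <chrono>
#include <functional>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical retry policy; defaults follow the values quoted above.
struct RetryPolicy {
    int          maxAttempts = 3;
    Milliseconds initialDelay {1000};
    Milliseconds maxDelay {60000};
};

// Runs op until it succeeds or the attempts run out. Before each retry, onRetry
// is called with the number of the failed attempt and the delay to wait. The
// caller decides how to wait (sleep, timer) and can log or count there.
bool Retry(const std::function<bool()>& op, const RetryPolicy& policy,
    const std::function<void(int, Milliseconds)>& onRetry)
{
    Milliseconds delay = policy.initialDelay;

    for (int attempt = 1; attempt <= policy.maxAttempts; ++attempt) {
        if (op()) {
            return true;
        }

        if (attempt == policy.maxAttempts) {
            break;
        }

        onRetry(attempt, delay);

        // Double the delay, capped at the configured maximum.
        delay *= 2;
        if (delay > policy.maxDelay) {
            delay = policy.maxDelay;
        }
    }

    return false;
}
```

With the Downloader values quoted above (1s initial delay, 5s maximum, 3 attempts), this sketch would wait 1s and then 2s between attempts. The 5s cap only takes effect from the fourth attempt onward.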

Update State Machine Recovery

The CM Update Manager persists its state on every transition. If the CM process crashes during an update:

  1. On restart, the Update Manager reads the last persisted state and desired status from storage
  2. It resumes the update from the last completed state (Downloading, Pending, Installing, Launching, WaitingActive, or Finalizing)
  3. If a new desired state arrives during recovery, the current update is canceled and the new one begins

Because every transition is persisted, an update interrupted by a crash is either resumed or replaced by a newer desired state, rather than being left half-applied.
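
A minimal sketch of persist-on-transition recovery is shown below. The `Storage` struct stands in for real persistent storage, and the class shape, method names, and the `eNone` state are assumptions. Only the six update state names come from the text above.

```cpp
// Update states named in the text, plus eNone for "no update in progress".
enum class UpdateState {
    eNone,
    eDownloading,
    ePending,
    eInstalling,
    eLaunching,
    eWaitingActive,
    eFinalizing
};

// Stand-in for persistent storage; a real implementation would write to disk.
struct Storage {
    UpdateState state = UpdateState::eNone;
};

// Hypothetical, heavily simplified update manager.
class UpdateManager {
public:
    // On start, pick up whatever state was persisted last.
    explicit UpdateManager(Storage& storage)
        : mStorage(storage)
        , mState(storage.state)
    {
    }

    // Persist the new state as part of every transition.
    void Transition(UpdateState next)
    {
        mStorage.state = next;
        mState         = next;
    }

    UpdateState State() const { return mState; }

private:
    Storage&    mStorage;
    UpdateState mState;
};
```

Constructing a second `UpdateManager` over the same `Storage` models a process restart: it starts in the state the previous instance recorded last.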

FOTA Rollback

For firmware updates, the Update Manager protocol supports explicit rollback:

| Scenario | Recovery Action |
|---|---|
| PrepareUpdate fails | UM reports error; CM cancels the update cycle |
| StartUpdate fails | CM sends RevertUpdate; UM rolls back to previous firmware version |
| ApplyUpdate fails | System remains in revertible state; CM can retry or revert |
| Post-apply failure detected | Requires new update cycle; ApplyUpdate is a point of no return |
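
The table can be read as a simple decision function. The sketch below encodes it directly. The enum and function names are hypothetical and do not correspond to identifiers in the Update Manager protocol.

```cpp
// Stage at which the failure was detected (names follow the table above).
enum class FailedStage { ePrepareUpdate, eStartUpdate, eApplyUpdate, ePostApply };

// Hypothetical action names for this sketch.
enum class RecoveryAction { eCancelUpdate, eRevertUpdate, eRetryOrRevert, eNewUpdateCycle };

// Map a failed stage to the recovery action listed in the table.
RecoveryAction SelectRecovery(FailedStage stage)
{
    switch (stage) {
    case FailedStage::ePrepareUpdate:
        return RecoveryAction::eCancelUpdate;
    case FailedStage::eStartUpdate:
        return RecoveryAction::eRevertUpdate;
    case FailedStage::eApplyUpdate:
        return RecoveryAction::eRetryOrRevert;
    case FailedStage::ePostApply:
        return RecoveryAction::eNewUpdateCycle;
    }

    // Not reached for valid enum values; cancelling is the conservative fallback.
    return RecoveryAction::eCancelUpdate;
}
```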

Service Instance Recovery

When a service instance fails, the Service Manager handles recovery based on the configured run parameters:

  • Restart interval — configurable delay before restarting a failed instance
  • Start burst — maximum number of rapid restarts allowed before backing off
  • Start interval — minimum time between restart attempts

If an instance repeatedly fails beyond the configured limits, it remains in the Failed state and the error is reported to CM and ultimately to AosCloud for operator intervention.
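
One way such a limit could be enforced is sketched below. It uses a sliding-window interpretation borrowed from systemd's start-limit settings: at most a given number of starts within a time window. That interpretation, the `RestartLimiter` class, and the use of plain seconds are assumptions, so the actual Service Manager semantics may differ.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical limiter: allows at most startBurst restarts per window.
class RestartLimiter {
public:
    RestartLimiter(std::size_t startBurst, long long windowSec)
        : mStartBurst(startBurst)
        , mWindowSec(windowSec)
    {
    }

    // Returns true if a restart at time nowSec stays within the limit.
    // A false result means the instance should remain in the Failed state.
    bool AllowRestart(long long nowSec)
    {
        // Forget restarts that have fallen out of the window.
        while (!mStarts.empty() && nowSec - mStarts.front() >= mWindowSec) {
            mStarts.pop_front();
        }

        if (mStarts.size() >= mStartBurst) {
            return false;
        }

        mStarts.push_back(nowSec);

        return true;
    }

private:
    std::size_t           mStartBurst;
    long long             mWindowSec;
    std::deque<long long> mStarts;
};
```

Passing the current time in as a parameter, rather than reading a clock inside the class, keeps the policy deterministic and easy to test.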

Connection Recovery

Connections between components and to the cloud implement automatic reconnection:

  • SM Controller reconnects to Service Managers with a 10-second retry timeout
  • Cloud connection (Message Proxy) reconnects with configurable retry delays
  • IAM connections reconnect with retry logic when certificate operations fail

When the cloud connection is lost, the Unit continues operating with its last known desired state. Status updates are queued and sent when connectivity is restored.
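
The queue-and-flush behaviour can be sketched as follows. The `StatusSender` class, its method names, and the use of plain strings for status messages are hypothetical. The sketch also assumes an unbounded in-memory queue, which a real implementation would likely bound or persist.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sender that buffers status messages while offline.
class StatusSender {
public:
    using SendFunc = std::function<void(const std::string&)>;

    explicit StatusSender(SendFunc send)
        : mSend(std::move(send))
    {
    }

    // Send immediately when connected, otherwise queue for later.
    void Send(const std::string& status)
    {
        if (mConnected) {
            mSend(status);
            return;
        }

        mPending.push_back(status);
    }

    // On reconnect, flush queued messages in their original order.
    void SetConnected(bool connected)
    {
        mConnected = connected;

        while (mConnected && !mPending.empty()) {
            mSend(mPending.front());
            mPending.pop_front();
        }
    }

    std::size_t Pending() const { return mPending.size(); }

private:
    SendFunc                mSend;
    bool                    mConnected = false;
    std::deque<std::string> mPending;
};
```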

Cloud Error Reporting

Error information is exchanged with AosCloud through the following channels:

Unit Status Messages

The unitStatus message sent periodically to AosCloud includes error fields at multiple levels:

  • Per-instance errors — which service instances have failed and why
  • Per-Deployable-Item errors — which items failed to download, verify, or install
  • Per-Node errors — Node-level issues (connectivity loss, resource exhaustion)
  • Unit configuration errors — configuration application failures

Update Notifications

During active updates, the UpdateSchedulerService streams real-time notifications including:

  • SOTA status with per-item error information
  • FOTA status with per-component error information and overall update state

Negative Acknowledgments

When the cloud sends a message that cannot be processed, CM responds with a Nack message that includes a retryAfter duration, indicating when the cloud should retry the operation.

Section Contents

This section covers error handling in detail across the following pages:

  • Error Propagation — detailed documentation of how errors flow between components, including the ErrorInfo structure, gRPC error mapping, and cloud protocol error encoding
  • Service Failure Handling — how the Service Manager detects, reports, and recovers from service instance failures, including restart policies and resource limit enforcement
  • Update Failure and Rollback — how update failures are detected at each stage of the deployment state machine, rollback procedures for FOTA, and crash recovery for interrupted updates
  • Troubleshooting — practical guidance for diagnosing and resolving common errors