Version: v1.1

Troubleshooting

Introduction

This section provides practical guidance for diagnosing and resolving common issues encountered when operating AosEdge Units. Rather than repeating the error handling theory covered in the Error Handling and Recovery section, these pages focus on the operator's perspective: what symptoms look like, where to find diagnostic information, and which steps resolve specific problems.

The troubleshooting guides are organized by problem domain: connectivity issues, service deployment failures, and Node/Unit health problems. Each guide follows a consistent structure: symptom identification, diagnostic data collection, root cause analysis, and resolution steps.

Troubleshooting Approach

Effective troubleshooting in AosCore relies on three complementary sources of diagnostic information:

1. System Logs (journald)

All AosCore components log to the systemd journal. Each component runs as a systemd service and can be queried independently:

Component	Service Name	Key Log Topics
Communication Manager	`aos-communicationmanager`	Cloud connection, desired-state processing, update orchestration
Service Manager	`aos-servicemanager`	Image downloads, instance lifecycle, resource monitoring
Identity & Access Manager	`aos-iam`	Certificate operations, provisioning, authentication
Message Proxy	`aos-messageproxy`	Inter-Node message routing, connection management

Use `journalctl -u <service-name>` to view logs for a specific component. Add the `--since` and `--until` options to narrow the time window around a known issue. The `-p err` option restricts output to messages of priority `err` and more severe (`crit`, `alert`, `emerg`).

Service instance logs are also available through the journal, tagged with the `aos-service@<instance-id>` unit name.
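The options above can also be combined from a script. The following is a minimal sketch, assuming Python 3 is available on the host; `build_journalctl_cmd` and `read_logs` are hypothetical helper names and not part of AosCore. It uses only the standard `journalctl` options `-u`, `--since`, `--until`, `-p`, and `--no-pager`.

```python
import subprocess
from typing import List, Optional


def build_journalctl_cmd(
    unit: str,
    since: Optional[str] = None,
    until: Optional[str] = None,
    priority: Optional[str] = None,
) -> List[str]:
    """Build a journalctl argument list for one systemd unit (hypothetical helper)."""
    cmd = ["journalctl", "--no-pager", "-u", unit]
    if since:
        cmd += ["--since", since]
    if until:
        cmd += ["--until", until]
    if priority:
        # "-p err" selects priority err and anything more severe.
        cmd += ["-p", priority]
    return cmd


def read_logs(cmd: List[str]) -> str:
    """Run the command and return its standard output.

    Requires journalctl on the host and permission to read the journal.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Example: Service Manager errors within a known time window.
    cmd = build_journalctl_cmd(
        "aos-servicemanager",
        since="2024-05-01 10:00:00",
        until="2024-05-01 10:30:00",
        priority="err",
    )
    print(" ".join(cmd))
```

Building the argument list separately from running it makes the query easy to log or review before it is executed on a Unit.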

2. Monitoring Data and Alerts

The Monitoring subsystem provides real-time and historical resource usage data:

Resource metrics — CPU, RAM, disk, and network usage at both Node and per-instance levels
Threshold alerts — automatic notifications when resource usage exceeds configured limits
Journal alerts — error-level log entries forwarded as structured alerts to AosCloud

When investigating performance degradation or resource exhaustion, check the monitoring data for the affected Node and time period. Alert history in AosCloud shows when thresholds were first crossed, which often pinpoints the onset of a problem.
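Finding the onset of a problem amounts to locating the first sample that exceeds a limit. The sketch below illustrates the idea on exported metric samples. The sample format (timestamp and percentage pairs), the data, and the `first_crossing` helper are assumptions for illustration and do not reflect an AosCloud export format.

```python
from typing import List, Optional, Tuple

# (ISO 8601 timestamp, usage in percent); assumed format, sorted by time.
Sample = Tuple[str, float]


def first_crossing(samples: List[Sample], threshold: float) -> Optional[str]:
    """Return the timestamp of the first sample above the threshold, or None."""
    for timestamp, value in samples:
        if value > threshold:
            return timestamp
    return None


# Hypothetical RAM usage samples for one Node.
ram_samples: List[Sample] = [
    ("2024-05-01T10:00:00Z", 61.0),
    ("2024-05-01T10:05:00Z", 74.5),
    ("2024-05-01T10:10:00Z", 91.2),
    ("2024-05-01T10:15:00Z", 96.8),
]

onset = first_crossing(ram_samples, threshold=90.0)
print(onset)
```

The returned timestamp is a reasonable starting point for the time window of a journal query.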

3. Unit Status Reports

The Communication Manager continuously reports Unit status to AosCloud, including:

Instance states — which services are running, failed, or pending
Error information — structured ErrorInfo with error codes and messages for each failed component
Update progress — current state of any active SOTA or FOTA deployment
Node connectivity — which Nodes are reachable and reporting

The `unitStatus` message in AosCloud provides a snapshot of the entire Unit's health. When a problem is reported, start with the most recent status to identify which components or instances are in an error state.
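As an illustration of scanning a status snapshot for error states, the sketch below filters failed instances out of a status document. The field names (`instances`, `serviceID`, `state`, `errorInfo`, `aosCode`, `message`) and the sample data are assumptions made for this example; consult the actual `unitStatus` schema before relying on them.

```python
from typing import Any, Dict, List

# Hypothetical snapshot; real unitStatus field names may differ.
unit_status: Dict[str, Any] = {
    "instances": [
        {"serviceID": "svc-a", "instance": 0, "state": "active"},
        {
            "serviceID": "svc-b",
            "instance": 0,
            "state": "failed",
            "errorInfo": {"aosCode": "eTimeout", "message": "start timed out"},
        },
    ]
}


def failed_instances(status: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one summary entry per instance whose state is "failed"."""
    failed = []
    for inst in status.get("instances", []):
        if inst.get("state") == "failed":
            err = inst.get("errorInfo", {})
            failed.append(
                {
                    "serviceID": inst.get("serviceID"),
                    "code": err.get("aosCode"),
                    "message": err.get("message"),
                }
            )
    return failed


for entry in failed_instances(unit_status):
    print(entry["serviceID"], entry["code"], entry["message"])
```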

Diagnostic Workflow

For most issues, follow this general diagnostic sequence:

1. Identify the symptom — What is the observable problem? (service not running, update stuck, Node unreachable, etc.)
2. Check Unit status — Review the latest unitStatus in AosCloud to identify error states and affected components
3. Narrow the scope — Determine which component(s) are involved based on the error codes and affected instances
4. Collect logs — Query the journal for the relevant component(s) around the time the issue began
5. Check monitoring data — Look for resource exhaustion, network anomalies, or threshold alerts that correlate with the issue
6. Identify the root cause — Match the symptoms and log evidence to a known failure pattern
7. Apply resolution — Follow the appropriate resolution steps for the identified root cause
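The scoping and log collection steps of this sequence can be sketched as a small helper that maps a component to its service name and derives a time window around the incident. This is a sketch under stated assumptions: the mapping is copied from the component table earlier on this page, while `log_query` and the default 10-minute margin are hypothetical choices.

```python
from datetime import datetime, timedelta
from typing import List

# Component-to-service mapping, taken from the component table on this page.
SERVICE_UNITS = {
    "Communication Manager": "aos-communicationmanager",
    "Service Manager": "aos-servicemanager",
    "Identity & Access Manager": "aos-iam",
    "Message Proxy": "aos-messageproxy",
}


def log_query(component: str, incident: datetime, margin_minutes: int = 10) -> List[str]:
    """Build a journalctl command covering a window around the incident time."""
    unit = SERVICE_UNITS[component]  # raises KeyError for an unknown component
    fmt = "%Y-%m-%d %H:%M:%S"  # timestamp format accepted by --since/--until
    margin = timedelta(minutes=margin_minutes)
    since = (incident - margin).strftime(fmt)
    until = (incident + margin).strftime(fmt)
    return ["journalctl", "--no-pager", "-u", unit, "--since", since, "--until", until]


if __name__ == "__main__":
    cmd = log_query("Service Manager", datetime(2024, 5, 1, 10, 10, 0))
    print(" ".join(cmd))
```

A symmetric window is a simple default. Widening the margin before the incident is often useful when the root cause precedes the first visible symptom.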

Common Error Codes

AosCore uses typed error codes that appear in status reports and logs. These codes help narrow the diagnosis:

Error Code	Meaning	Typical Cause
`eFailed`	Generic failure	Unclassified error; check logs for specific context
`eTimeout`	Operation timed out	Network issues, overloaded Node, unresponsive component
`eNotFound`	Resource not found	Missing image, deleted configuration, invalid reference
`eInvalidArgument`	Invalid parameter	Configuration error, malformed desired state
`eWrongState`	Invalid state transition	Operation attempted at wrong lifecycle stage
`eInvalidChecksum`	Integrity verification failed	Corrupted download, tampered image, certificate mismatch
`eNotSupported`	Operation not supported	Incompatible runtime, unsupported feature on this Node
`eCanceled`	Operation canceled	Superseded by new desired state, manual cancellation
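For quick triage, the table above can be kept as a lookup structure and matched against log text. The sketch below is illustrative only: the dictionary mirrors the table, while the `codes_in_text` helper and the sample log line are hypothetical and do not represent actual AosCore log output.

```python
import re
from typing import Dict, List, Tuple

# code -> (meaning, typical cause), mirroring the error code table above.
ERROR_CODES: Dict[str, Tuple[str, str]] = {
    "eFailed": ("Generic failure", "Check logs for specific context"),
    "eTimeout": ("Operation timed out", "Network issues, overloaded Node, unresponsive component"),
    "eNotFound": ("Resource not found", "Missing image, deleted configuration, invalid reference"),
    "eInvalidArgument": ("Invalid parameter", "Configuration error, malformed desired state"),
    "eWrongState": ("Invalid state transition", "Operation attempted at wrong lifecycle stage"),
    "eInvalidChecksum": ("Integrity verification failed", "Corrupted download, tampered image, certificate mismatch"),
    "eNotSupported": ("Operation not supported", "Incompatible runtime, unsupported feature on this Node"),
    "eCanceled": ("Operation canceled", "Superseded by new desired state, manual cancellation"),
}


def codes_in_text(text: str) -> List[str]:
    """Return known error codes found in the text, in order of first appearance."""
    found: List[str] = []
    for token in re.findall(r"\be[A-Z][A-Za-z]+\b", text):
        if token in ERROR_CODES and token not in found:
            found.append(token)
    return found


# Hypothetical log line, for illustration only.
line = "instance start failed: eTimeout (caused by eNotFound), retry eTimeout"
for code in codes_in_text(line):
    meaning, cause = ERROR_CODES[code]
    print(f"{code}: {meaning} ({cause})")
```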

In This Section

Connectivity Issues — diagnosing and resolving cloud disconnection, inter-Node communication failures, and network configuration problems

Service Deployment Failures — diagnosing and resolving image download errors, launch failures, resource limit violations, and instance crash loops

Node and Unit Health — diagnosing and resolving Node connectivity loss, resource exhaustion, component crashes, and Unit-level degradation

Related Pages

Error Handling and Recovery — how errors are structured, propagated, and recovered from at the system level
Monitoring and Observability — resource metrics collection, alerting, and log access
Service Lifecycle — service instance states and the desired-state reconciliation model
Deployment Flows — update orchestration and the SOTA/FOTA state machines
Configuration Reference — component configuration that affects timeouts, retry behavior, and resource limits
