Troubleshooting
Introduction
This section provides practical guidance for diagnosing and resolving common issues encountered when operating AosEdge Units. Rather than repeating the error handling theory covered in the Error Handling and Recovery section, these pages focus on the operator's perspective — what symptoms look like, where to find diagnostic information, and what steps to take to resolve specific problems.
The troubleshooting guides are organized by problem domain: connectivity issues, service deployment failures, and Node/Unit health problems. Each guide follows a consistent structure: symptom identification, diagnostic data collection, root cause analysis, and resolution steps.
Troubleshooting Approach
Effective troubleshooting in AosCore relies on three complementary sources of diagnostic information:
1. System Logs (journald)
All AosCore components log to the systemd journal. Each component runs as a systemd service and can be queried independently:
| Component | Service Name | Key Log Topics |
|---|---|---|
| Communication Manager | aos-communicationmanager | Cloud connection, desired-state processing, update orchestration |
| Service Manager | aos-servicemanager | Image downloads, instance lifecycle, resource monitoring |
| Identity & Access Manager | aos-iam | Certificate operations, provisioning, authentication |
| Message Proxy | aos-messageproxy | Inter-Node message routing, connection management |
Use `journalctl -u <service-name>` to view logs for a specific component. Add the `--since` and `--until` flags to narrow the time window around a known issue. The `-p err` flag limits output to messages of priority `err` or more severe.
Service instance logs are also available through the journal, tagged with the `aos-service@<instance-id>` unit name.
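The query options above can also be assembled programmatically. The following is a minimal sketch, not an AosCore tool: the helper names `build_journalctl_cmd` and `read_logs` and the example timestamps are illustrative assumptions. Only standard `journalctl` options (`-u`, `--since`, `--until`, `-p`, `--no-pager`) are used.

```python
import shlex
import subprocess
from typing import List, Optional


def build_journalctl_cmd(
    service: str,
    since: Optional[str] = None,
    until: Optional[str] = None,
    errors_only: bool = False,
) -> List[str]:
    """Build a journalctl argument list for one systemd unit (hypothetical helper)."""
    cmd = ["journalctl", "--no-pager", "-u", service]
    if since:
        cmd += ["--since", since]
    if until:
        cmd += ["--until", until]
    if errors_only:
        cmd += ["-p", "err"]  # priority err and more severe
    return cmd


def read_logs(cmd: List[str]) -> str:
    """Run the command and return its output; needs a host with systemd."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Example: Service Manager errors in a 30-minute window (timestamps are made up).
cmd = build_journalctl_cmd(
    "aos-servicemanager",
    since="2024-01-01 10:00:00",
    until="2024-01-01 10:30:00",
    errors_only=True,
)
print(shlex.join(cmd))
```

Building the argument list separately from running it makes the query easy to review or log before it is executed on a Unit.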
2. Monitoring Data and Alerts
The Monitoring subsystem provides real-time and historical resource usage data:
- Resource metrics — CPU, RAM, disk, and network usage at both Node and per-instance levels
- Threshold alerts — automatic notifications when resource usage exceeds configured limits
- Journal alerts — error-level log entries forwarded as structured alerts to AosCloud
When investigating performance degradation or resource exhaustion, check the monitoring data for the affected Node and time period. Alert history in AosCloud shows when thresholds were first crossed, which often pinpoints the onset of a problem.
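As an illustration of using the first threshold crossing to bound a log search, the sketch below scans a list of usage samples and derives a start time slightly before the onset. The sample values, the 90% threshold, and the five-minute margin are all assumptions made for the example; real data would come from the monitoring history for the affected Node.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp, usage in percent)


def first_crossing(samples: List[Sample], threshold: float) -> Optional[datetime]:
    """Return the timestamp of the earliest sample above the threshold, if any."""
    for ts, value in sorted(samples):
        if value > threshold:
            return ts
    return None


# Hypothetical RAM usage samples for one Node.
samples = [
    (datetime(2024, 1, 1, 10, 0), 62.0),
    (datetime(2024, 1, 1, 10, 5), 71.5),
    (datetime(2024, 1, 1, 10, 10), 91.2),
    (datetime(2024, 1, 1, 10, 15), 95.8),
]

onset = first_crossing(samples, threshold=90.0)
if onset is not None:
    # Start the log search a little before the onset to capture the lead-up.
    search_from = onset - timedelta(minutes=5)
    print("onset:", onset, "search logs from:", search_from)
```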
3. Unit Status Reports
The Communication Manager continuously reports Unit status to AosCloud, including:
- Instance states — which services are running, failed, or pending
- Error information — structured `ErrorInfo` with error codes and messages for each failed component
- Update progress — current state of any active SOTA or FOTA deployment
- Node connectivity — which Nodes are reachable and reporting
The `unitStatus` message in AosCloud provides a snapshot of the entire Unit's health. When a problem is reported, start by examining the most recent status to identify which components or instances are in error states.
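A first pass over a status snapshot can be sketched as follows. The JSON layout and every field name in it (`instances`, `state`, `errorInfo`, `nodes`, `connected`) are simplified assumptions for illustration, not the actual AosCloud schema; adapt them to the real message before use.

```python
import json
from typing import Dict, List

# Hypothetical, simplified snapshot. Field names are assumptions, not the real schema.
SNAPSHOT = """
{
  "instances": [
    {"id": "svc-a.0", "state": "active"},
    {"id": "svc-b.0", "state": "failed",
     "errorInfo": {"code": "eTimeout", "message": "download timed out"}}
  ],
  "nodes": [
    {"id": "main", "connected": true},
    {"id": "secondary", "connected": false}
  ]
}
"""


def summarize_problems(status: Dict) -> List[str]:
    """List failed instances and Nodes that are not reporting."""
    problems = []
    for inst in status.get("instances", []):
        if inst.get("state") == "failed":
            err = inst.get("errorInfo", {})
            code = err.get("code", "unknown")
            message = err.get("message", "")
            problems.append(f"instance {inst['id']}: {code} ({message})")
    for node in status.get("nodes", []):
        if not node.get("connected", False):
            problems.append(f"node {node['id']}: not reporting")
    return problems


problems = summarize_problems(json.loads(SNAPSHOT))
for line in problems:
    print(line)
```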
Diagnostic Workflow
For most issues, follow this general diagnostic sequence:
- Identify the symptom — What is the observable problem? (service not running, update stuck, Node unreachable, etc.)
- Check Unit status — Review the latest `unitStatus` in AosCloud to identify error states and affected components
- Narrow the scope — Determine which component(s) are involved based on the error codes and affected instances
- Collect logs — Query the journal for the relevant component(s) around the time the issue began
- Check monitoring data — Look for resource exhaustion, network anomalies, or threshold alerts that correlate with the issue
- Identify the root cause — Match the symptoms and log evidence to a known failure pattern
- Apply resolution — Follow the appropriate resolution steps for the identified root cause
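The log collection step can be narrowed further by post-processing journal output. The sketch below assumes the logs were exported with `journalctl -o json`, which emits one JSON object per line with standard journal fields such as `__REALTIME_TIMESTAMP` (microseconds since the epoch), `PRIORITY`, `_SYSTEMD_UNIT`, and `MESSAGE`. The helper names and the sample entries are made up for the example.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

ERR_PRIORITY = 3  # syslog "err"; lower numbers are more severe


def errors_in_window(
    lines: Iterable[str], start: datetime, end: datetime
) -> List[Tuple[datetime, str, str]]:
    """Keep entries of priority err or worse whose timestamp lies in [start, end]."""
    hits = []
    for line in lines:
        entry = json.loads(line)
        micros = int(entry["__REALTIME_TIMESTAMP"])
        ts = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
        priority = int(entry.get("PRIORITY", 6))
        if start <= ts <= end and priority <= ERR_PRIORITY:
            hits.append((ts, entry.get("_SYSTEMD_UNIT", "?"), entry.get("MESSAGE", "")))
    return hits


def sample_line(ts: datetime, priority: int, message: str) -> str:
    """Create a fake journal entry in the JSON export layout (test data only)."""
    return json.dumps({
        "__REALTIME_TIMESTAMP": str(int(ts.timestamp() * 1_000_000)),
        "PRIORITY": str(priority),
        "_SYSTEMD_UNIT": "aos-servicemanager.service",
        "MESSAGE": message,
    })


issue_start = datetime(2024, 1, 1, 10, 10, tzinfo=timezone.utc)
lines = [
    sample_line(issue_start - timedelta(minutes=30), 3, "old error outside the window"),
    sample_line(issue_start + timedelta(minutes=1), 6, "informational message"),
    sample_line(issue_start + timedelta(minutes=2), 3, "image download failed"),
]

margin = timedelta(minutes=5)
hits = errors_in_window(lines, issue_start - margin, issue_start + margin)
for ts, unit, message in hits:
    print(ts.isoformat(), unit, message)
```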
Common Error Codes
AosCore uses typed error codes that appear in status reports and logs. These codes help narrow the diagnosis:
| Error Code | Meaning | Typical Cause |
|---|---|---|
| `eFailed` | Generic failure | Check logs for specific context |
| `eTimeout` | Operation timed out | Network issues, overloaded Node, unresponsive component |
| `eNotFound` | Resource not found | Missing image, deleted configuration, invalid reference |
| `eInvalidArgument` | Invalid parameter | Configuration error, malformed desired state |
| `eWrongState` | Invalid state transition | Operation attempted at wrong lifecycle stage |
| `eInvalidChecksum` | Integrity verification failed | Corrupted download, tampered image, certificate mismatch |
| `eNotSupported` | Operation not supported | Incompatible runtime, unsupported feature on this Node |
| `eCanceled` | Operation canceled | Superseded by new desired state, manual cancellation |
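For quick triage, the table can be turned into a lookup that annotates any text containing a known code. This is a sketch only: the dictionary simply mirrors the table above, while the `explain` helper and the sample log line are hypothetical and do not reflect an actual AosCore log format.

```python
import re
from typing import List

# Mirrors the error code table: code -> (meaning, typical cause).
ERROR_CODES = {
    "eFailed": ("Generic failure", "check logs for specific context"),
    "eTimeout": ("Operation timed out", "network issues, overloaded Node, unresponsive component"),
    "eNotFound": ("Resource not found", "missing image, deleted configuration, invalid reference"),
    "eInvalidArgument": ("Invalid parameter", "configuration error, malformed desired state"),
    "eWrongState": ("Invalid state transition", "operation attempted at wrong lifecycle stage"),
    "eInvalidChecksum": ("Integrity verification failed", "corrupted download, tampered image, certificate mismatch"),
    "eNotSupported": ("Operation not supported", "incompatible runtime, unsupported feature on this Node"),
    "eCanceled": ("Operation canceled", "superseded by new desired state, manual cancellation"),
}

# Candidate tokens: a lowercase "e" followed by a capitalized identifier.
CODE_PATTERN = re.compile(r"\be[A-Z][A-Za-z]*\b")


def explain(text: str) -> List[str]:
    """Return one hint per known error code found in the text, in order of appearance."""
    hints = []
    for code in CODE_PATTERN.findall(text):
        if code in ERROR_CODES:
            meaning, cause = ERROR_CODES[code]
            hints.append(f"{code}: {meaning}; typical cause: {cause}")
    return hints


# Hypothetical log line; the real format may differ.
hints = explain("instance svc-b.0 failed: eInvalidChecksum while verifying layer")
for hint in hints:
    print(hint)
```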
In This Section
- Connectivity Issues — diagnosing and resolving cloud disconnection, inter-Node communication failures, and network configuration problems
- Service Deployment Failures — diagnosing and resolving image download errors, launch failures, resource limit violations, and instance crash loops
- Node and Unit Health — diagnosing and resolving Node connectivity loss, resource exhaustion, component crashes, and Unit-level degradation
Related Pages
- Error Handling and Recovery — how errors are structured, propagated, and recovered from at the system level
- Monitoring and Observability — resource metrics collection, alerting, and log access
- Service Lifecycle — service instance states and the desired-state reconciliation model
- Deployment Flows — update orchestration and the SOTA/FOTA state machines
- Configuration Reference — component configuration that affects timeouts, retry behavior, and resource limits