Version: v1.1

Troubleshooting

Introduction

This section provides practical guidance for diagnosing and resolving common issues encountered when operating AosEdge Units. Rather than repeating the error handling theory covered in the Error Handling and Recovery section, these pages focus on the operator's perspective — what symptoms look like, where to find diagnostic information, and what steps to take to resolve specific problems.

The troubleshooting guides are organized by problem domain: connectivity issues, service deployment failures, and Node/Unit health problems. Each guide follows a consistent structure: symptom identification, diagnostic data collection, root cause analysis, and resolution steps.

Troubleshooting Approach

Effective troubleshooting in AosCore relies on three complementary sources of diagnostic information:

1. System Logs (journald)

All AosCore components log to the systemd journal. Each component runs as a systemd service and can be queried independently:

| Component | Service Name | Key Log Topics |
|---|---|---|
| Communication Manager | aos-communicationmanager | Cloud connection, desired-state processing, update orchestration |
| Service Manager | aos-servicemanager | Image downloads, instance lifecycle, resource monitoring |
| Identity & Access Manager | aos-iam | Certificate operations, provisioning, authentication |
| Message Proxy | aos-messageproxy | Inter-Node message routing, connection management |

Use journalctl -u <service-name> to view logs for a specific component. Add the --since and --until flags to narrow the time window around a known issue. The -p err flag limits output to messages at error priority or more severe (err, crit, alert, emerg).

Service instance logs are also available through the journal, tagged with the aos-service@<instance-id> unit name.
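As a minimal sketch of the queries described above, the helper below assembles a journalctl argument list for a component or service instance. The function names and the example instance ID are illustrative assumptions; only the journalctl flags (-u, --since, --until, -p, --no-pager) and the unit names listed in this section come from the tooling itself.

```python
import shlex


def journalctl_cmd(unit, since=None, until=None, priority=None):
    """Assemble a journalctl argument list for one systemd unit."""
    cmd = ["journalctl", "--no-pager", "-u", unit]
    if since:
        cmd += ["--since", since]
    if until:
        cmd += ["--until", until]
    if priority:
        # A single level such as "err" also matches more severe levels.
        cmd += ["-p", priority]
    return cmd


def instance_unit(instance_id):
    """Build the unit name for a service instance (aos-service@<instance-id>)."""
    return f"aos-service@{instance_id}"


if __name__ == "__main__":
    # Errors from the Service Manager within a known time window.
    print(shlex.join(journalctl_cmd(
        "aos-servicemanager",
        since="2024-05-01 10:00:00",
        until="2024-05-01 10:30:00",
        priority="err",
    )))
    # All logs for one service instance; the ID here is a placeholder.
    print(shlex.join(journalctl_cmd(instance_unit("example-instance"))))
```

The resulting list can be passed directly to subprocess.run on the Node, or printed and pasted into a shell as shown.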

2. Monitoring Data and Alerts

The Monitoring subsystem provides real-time and historical resource usage data:

  • Resource metrics — CPU, RAM, disk, and network usage at both Node and per-instance levels
  • Threshold alerts — automatic notifications when resource usage exceeds configured limits
  • Journal alerts — error-level log entries forwarded as structured alerts to AosCloud

When investigating performance degradation or resource exhaustion, check the monitoring data for the affected Node and time period. Alert history in AosCloud shows when thresholds were first crossed, which often pinpoints the onset of a problem.
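To illustrate how a first threshold crossing marks the onset of a problem, the sketch below scans a series of resource samples for the earliest value above a limit. The sample format (timestamp, value pairs) and the function name are assumptions made for this example; they do not reflect how AosCloud stores or exports monitoring data.

```python
from datetime import datetime


def first_crossing(samples, limit):
    """Return the timestamp of the earliest sample above limit, or None."""
    for timestamp, value in sorted(samples):
        if value > limit:
            return timestamp
    return None


if __name__ == "__main__":
    # Hypothetical RAM usage samples for one Node, in percent.
    ram_usage = [
        (datetime(2024, 5, 1, 10, 0), 62.0),
        (datetime(2024, 5, 1, 10, 5), 71.5),
        (datetime(2024, 5, 1, 10, 10), 88.3),
        (datetime(2024, 5, 1, 10, 15), 93.1),
    ]
    print(first_crossing(ram_usage, limit=85.0))
```

The returned timestamp is a reasonable starting point for the time window used when querying logs.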

3. Unit Status Reports

The Communication Manager continuously reports Unit status to AosCloud, including:

  • Instance states — which services are running, failed, or pending
  • Error information — structured ErrorInfo with error codes and messages for each failed component
  • Update progress — current state of any active SOTA or FOTA deployment
  • Node connectivity — which Nodes are reachable and reporting

The unitStatus message in AosCloud provides a snapshot of the entire Unit's health. When a problem is reported, start by examining the most recent status to identify which components or instances are in error states.
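The first pass over a status snapshot can be sketched as a simple filter for instances in an error state. The JSON layout and field names below (instances, id, state, errorInfo, code, message) are hypothetical and chosen only for illustration; consult the actual unitStatus schema before relying on them.

```python
import json

# Hypothetical snapshot: these field names are illustrative assumptions,
# not the actual unitStatus schema.
SAMPLE_UNIT_STATUS = json.loads("""
{
  "instances": [
    {"id": "svc-a:0", "state": "active"},
    {"id": "svc-b:0", "state": "failed",
     "errorInfo": {"code": "eTimeout", "message": "image download timed out"}}
  ]
}
""")


def failed_instances(unit_status):
    """Return (instance id, error code, message) for each instance in error."""
    failures = []
    for instance in unit_status.get("instances", []):
        error = instance.get("errorInfo") or {}
        if instance.get("state") == "failed" or error:
            failures.append(
                (instance.get("id"), error.get("code"), error.get("message"))
            )
    return failures


if __name__ == "__main__":
    for instance_id, code, message in failed_instances(SAMPLE_UNIT_STATUS):
        print(f"{instance_id}: {code} ({message})")
```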

Diagnostic Workflow

For most issues, follow this general diagnostic sequence:

  1. Identify the symptom — What is the observable problem? (service not running, update stuck, Node unreachable, etc.)
  2. Check Unit status — Review the latest unitStatus in AosCloud to identify error states and affected components
  3. Narrow the scope — Determine which component(s) are involved based on the error codes and affected instances
  4. Collect logs — Query the journal for the relevant component(s) around the time the issue began
  5. Check monitoring data — Look for resource exhaustion, network anomalies, or threshold alerts that correlate with the issue
  6. Identify the root cause — Match the symptoms and log evidence to a known failure pattern
  7. Apply resolution — Follow the appropriate resolution steps for the identified root cause
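For step 4, it helps to turn the time an issue began into a bounded log window. The sketch below computes --since and --until values in the "YYYY-MM-DD HH:MM:SS" form that journalctl accepts. The function name and the default margins of 10 minutes before and 5 minutes after are arbitrary assumptions, not recommended values.

```python
from datetime import datetime, timedelta

# Timestamp layout accepted by journalctl --since / --until.
JOURNAL_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def log_window(onset, before=timedelta(minutes=10), after=timedelta(minutes=5)):
    """Return (since, until) strings bracketing the onset of an issue."""
    since = (onset - before).strftime(JOURNAL_TIME_FORMAT)
    until = (onset + after).strftime(JOURNAL_TIME_FORMAT)
    return since, until


if __name__ == "__main__":
    since, until = log_window(datetime(2024, 5, 1, 10, 0, 0))
    print(f'journalctl -u aos-communicationmanager --since "{since}" --until "{until}"')
```

Starting the window before the reported onset keeps the events that led up to the failure in view, which is often where the root cause appears.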

Common Error Codes

AosCore uses typed error codes that appear in status reports and logs. These codes help narrow the diagnosis:

| Error Code | Meaning | Typical Cause |
|---|---|---|
| eFailed | Generic failure | Check logs for specific context |
| eTimeout | Operation timed out | Network issues, overloaded Node, unresponsive component |
| eNotFound | Resource not found | Missing image, deleted configuration, invalid reference |
| eInvalidArgument | Invalid parameter | Configuration error, malformed desired state |
| eWrongState | Invalid state transition | Operation attempted at wrong lifecycle stage |
| eInvalidChecksum | Integrity verification failed | Corrupted download, tampered image, certificate mismatch |
| eNotSupported | Operation not supported | Incompatible runtime, unsupported feature on this Node |
| eCanceled | Operation canceled | Superseded by new desired state, manual cancellation |
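When triaging many status reports, the table above can be kept as a small lookup. The sketch below copies the meanings and typical causes from the table; the dictionary name, function name, and the fallback text for unlisted codes are assumptions made for this example.

```python
# Error codes with (meaning, typical cause), taken from the table above.
ERROR_CODES = {
    "eFailed": ("Generic failure", "Check logs for specific context"),
    "eTimeout": ("Operation timed out",
                 "Network issues, overloaded Node, unresponsive component"),
    "eNotFound": ("Resource not found",
                  "Missing image, deleted configuration, invalid reference"),
    "eInvalidArgument": ("Invalid parameter",
                         "Configuration error, malformed desired state"),
    "eWrongState": ("Invalid state transition",
                    "Operation attempted at wrong lifecycle stage"),
    "eInvalidChecksum": ("Integrity verification failed",
                         "Corrupted download, tampered image, certificate mismatch"),
    "eNotSupported": ("Operation not supported",
                      "Incompatible runtime, unsupported feature on this Node"),
    "eCanceled": ("Operation canceled",
                  "Superseded by new desired state, manual cancellation"),
}


def describe_error(code):
    """Return a one-line triage hint for an error code."""
    meaning, cause = ERROR_CODES.get(
        code, ("Unlisted code", "Check logs for specific context")
    )
    return f"{code}: {meaning}. Typical cause: {cause}"


if __name__ == "__main__":
    print(describe_error("eInvalidChecksum"))
```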

In This Section

  • Connectivity Issues — diagnosing and resolving cloud disconnection, inter-Node communication failures, and network configuration problems
  • Service Deployment Failures — diagnosing and resolving image download errors, launch failures, resource limit violations, and instance crash loops
  • Node and Unit Health — diagnosing and resolving Node connectivity loss, resource exhaustion, component crashes, and Unit-level degradation