Version: v1.1

Node and Unit Health

Introduction

This page provides practical troubleshooting guidance for Node and Unit health problems — situations where a Node becomes unreachable, runs out of resources, experiences component crashes, or fails to apply configuration updates. These issues affect the overall operational health of the Unit and typically require operator intervention to diagnose and resolve.

Each problem category follows a consistent structure: observable symptoms, diagnostic data collection, common root causes, and resolution steps. For background on how Node lifecycle states work, see Node Lifecycle. For details on the monitoring pipeline that detects resource issues, see Monitoring Pipeline.

Node Disconnection

A Node disconnection occurs when the Main Node loses communication with a Secondary Node. The Communication Manager detects this through the SM connection timeout mechanism — if no data is received from a Node's Service Manager within the configured timeout period, the Node is reported as disconnected and eventually transitions to an error state.

Symptoms

- AosCloud dashboard shows a Node with isConnected: false
- The Node's state transitions to error with message "SM connection timeout" after the configured timeout expires
- Services previously running on the disconnected Node continue running locally but cannot receive new deployments or state changes
- The Main Node's CM reports the disconnected Node in the unitStatus message to the cloud

Diagnostic Steps

1. Check the Node's network connectivity:

# From the Main Node, verify network reachability
ping <secondary-node-ip>

# Check if the gRPC port is accessible
nc -zv <secondary-node-ip> 8089

2. Check the IAM registration stream on the Secondary Node:

# On the Secondary Node, check IAM client logs for connection attempts
journalctl -u aos-iam --since "10 minutes ago" | grep -i "connect\|stream\|register"

The IAM client on the Secondary Node maintains the RegisterNode bidirectional gRPC stream to the Main Node. Look for connection errors, TLS handshake failures, or repeated reconnection attempts.

3. Check the SM registration on the Secondary Node:

# On the Secondary Node, check SM logs for CM connection status
journalctl -u aos-sm --since "10 minutes ago" | grep -i "connect\|register\|timeout"

The SM on each Node registers with the CM on the Main Node. If the SM cannot connect, the Main Node's NodeInfoCache will not receive updates and will eventually trigger the SM connection timeout.

4. Check certificate validity:

# On the Secondary Node, check certificate expiration
openssl x509 -in /path/to/node/cert.pem -noout -dates

# Check IAM logs for certificate-related errors
journalctl -u aos-iam --since "1 hour ago" | grep -i "cert\|tls\|x509"
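The raw dates are easier to act on as a count of days remaining. The sketch below converts the `notAfter=` line printed by `openssl x509 -noout -enddate` into whole days. The function name `days_until_expiry` is illustrative and not part of AosCore, and the conversion assumes GNU `date` is available on the Node.

```shell
# days_until_expiry NOT_AFTER_LINE [NOW_EPOCH]
# NOT_AFTER_LINE is the "notAfter=..." line printed by
# `openssl x509 -noout -enddate`. NOW_EPOCH defaults to the current time.
# Prints whole days until expiry, truncated toward zero; 0 or a negative
# number means the certificate has expired or expires within a day.
# Assumption: GNU date (needed for `date -d`).
days_until_expiry() (
    not_after=${1#notAfter=}
    now=${2:-$(date -u +%s)}
    expiry=$(date -u -d "$not_after" +%s) || exit 1
    echo $(( (expiry - now) / 86400 ))
)

# Example (not run here):
#   days_until_expiry "$(openssl x509 -in /path/to/node/cert.pem -noout -enddate)"
```

A small result here, combined with TLS errors in the IAM log, points to the expired-certificate cause rather than a network fault.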

Root Causes

Cause	Evidence	Resolution
Network failure between Nodes	`ping` fails, no route to host	Restore network connectivity; check cables, switches, firewall rules
TLS certificate expired or invalid	IAM logs show TLS handshake errors, `x509: certificate has expired`	Trigger certificate renewal through the provisioning system; see Certificate Architecture
IAM service not running on Secondary Node	`systemctl status aos-iam` shows inactive/failed	Restart the IAM service: `systemctl restart aos-iam`
SM service not running on Secondary Node	`systemctl status aos-sm` shows inactive/failed	Restart the SM service: `systemctl restart aos-sm`
Main Node IAM not accepting connections	Main Node IAM logs show binding errors or resource exhaustion	Check Main Node IAM service health and available file descriptors
DNS resolution failure	IAM logs show hostname resolution errors	Verify DNS configuration and `/etc/hosts` entries for the Main Node address

Resolution

Once the underlying cause is resolved, the Secondary Node's IAM client automatically re-establishes the RegisterNode stream, retrying with exponential backoff. No manual reconnection is needed: the system recovers on its own once connectivity is restored.
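The backoff parameters are not given on this page. To illustrate why recovery may lag slightly behind the fix, the sketch below computes the wait before each retry under assumed values: a 1-second base delay, doubled on every attempt and capped at 60 seconds. The function name and all three parameters are hypothetical and are not taken from the IAM client.

```shell
# backoff_delay ATTEMPT BASE_SECONDS MAX_SECONDS
# Prints the wait before retry number ATTEMPT (0-based):
# BASE_SECONDS * 2^ATTEMPT, capped at MAX_SECONDS.
# Hypothetical helper; the real base delay and cap are not documented here.
backoff_delay() (
    attempt=$1
    delay=$2
    max=$3
    i=0
    # Double the delay once per attempt, stopping early once the cap is reached
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$max" ]; then
        delay=$max
    fi
    echo "$delay"
)
```

With these assumed values the retries are spaced 1, 2, 4, 8, 16, and 32 seconds apart and then every 60 seconds, so a Node that has been offline for a while could take up to about a minute to reconnect after connectivity returns.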

To verify recovery:

# On the Secondary Node, confirm IAM reconnected
journalctl -u aos-iam --since "2 minutes ago" | grep -i "connected"

# On the Main Node, confirm the Node is back
journalctl -u aos-cm --since "2 minutes ago" | grep -i "node info changed"

The Node's isConnected state in the cloud dashboard should return to true once the stream is re-established and the SM reports in.

Resource Exhaustion

Resource exhaustion occurs when a Node's CPU, RAM, disk, or network usage exceeds configured thresholds. The monitoring pipeline detects these conditions and raises alerts (SystemQuotaAlert or InstanceQuotaAlert) that are forwarded to AosCloud.

Symptoms

- SystemQuotaAlert or InstanceQuotaAlert alerts appear in AosCloud for the affected Node
- Services on the Node become slow or unresponsive
- New service deployments fail with resource-related errors
- The Node's monitoring data shows sustained high usage for one or more resources
- In severe cases (disk full), AosCore components may fail to write state and crash

Diagnostic Steps

1. Check monitoring alerts in AosCloud:

Review the alert history for the affected Node. Alerts include the resource type (CPU, RAM, disk partition, network) and whether the alert is in raise, continue, or fall state.

2. Check disk space on the Node:

# Check overall disk usage
df -h

# Check AosCore working directories specifically
du -sh /var/aos/
du -sh /var/aos/sm/
du -sh /var/aos/cm/

# Check for large service images consuming space
du -sh /var/aos/sm/images/
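To spot the full partitions quickly, the `df` output can be filtered by usage. The sketch below is one way to do that; the function name `over_threshold` and the 80 percent limit in the example are assumptions for illustration, and the thresholds that actually raise alerts are the ones in the Unit configuration.

```shell
# over_threshold PERCENT
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for every
# filesystem whose usage is at or above PERCENT.
# Limitation: mount points containing spaces are not handled.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)          # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example (not run here):
#   df -P | over_threshold 80
```

The `-P` flag keeps each filesystem on a single line, which makes the column positions predictable.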

3. Check RAM usage:

# Current memory usage
free -h

# Top memory consumers
ps aux --sort=-%mem | head -20

# Check for OOM killer activity
journalctl -k | grep -i "oom\|out of memory"
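When the kernel log is long, it can help to extract only the names of the processes the OOM killer terminated. The sketch below assumes the usual Linux kernel wording, `Killed process <pid> (<name>)`. The function name `oom_victims` is illustrative and not part of AosCore.

```shell
# oom_victims reads kernel log lines on stdin and prints the name of each
# process reported as killed, one per line.
# Assumption: the kernel message contains "Killed process <pid> (<name>)".
oom_victims() {
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p'
}

# Example (not run here): count kills per process name
#   journalctl -k | oom_victims | sort | uniq -c
```

A name that appears repeatedly is a good candidate for the memory-leak or excessive-consumption causes listed in the table that follows.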

4. Check CPU usage:

# Current CPU usage by process
top -bn1 | head -20

# Check for runaway service instances
systemctl list-units 'aos-service@*' --state=running

5. Check monitoring configuration:

Review the Node's alert thresholds in the Unit configuration to understand what limits are configured. See Unit Configuration for the alert rules schema.

Root Causes

Cause	Evidence	Resolution
Service instance consuming excessive resources	`InstanceQuotaAlert` for specific instance; high CPU/RAM in `ps` output	Review the service's resource requirements; adjust resource ratios in Unit configuration; contact the service developer
Accumulated service images filling disk	`/var/aos/sm/images/` consuming significant space	The Image Manager should garbage-collect unused images; check if old versions are being retained due to rollback policies
Log files consuming disk space	Large files in `/var/log/` or journal storage	Configure journal size limits (`SystemMaxUse` in `journald.conf`); rotate or archive old logs
Memory leak in a service instance	RAM usage grows continuously over time for one instance	Restart the affected service instance; report the leak to the service developer
Insufficient resource allocation for workload	Multiple services competing for limited Node resources	Adjust resource ratios in the Unit configuration; redistribute services across Nodes using scheduling labels and priorities

Resolution

For immediate relief from resource pressure:

# If disk is full, start by shrinking the systemd journal
journalctl --vacuum-size=100M  # Remove archived journal files until they use at most 100 MB

# If a specific service instance is the cause, it can be stopped via cloud
# (send a desired state without that instance) or locally:
systemctl stop aos-service@<instance-id>

For long-term resolution, adjust the Unit configuration through AosCloud to set appropriate alert thresholds and resource ratios that match the Node's hardware capabilities and workload requirements.

Component Crashes

AosCore components (CM, SM, IAM) run as systemd services. When a component crashes, systemd detects the failure and (depending on the unit file configuration) may automatically restart it. The Journal Alerts subsystem monitors the systemd journal for error-level messages from these services and forwards them to AosCloud as CoreAlert messages.

Symptoms

- CoreAlert appears in AosCloud identifying the crashed component (CM, SM, or IAM) and the Node
- The component's systemd service shows as failed or is in a restart loop
- Functionality provided by the crashed component is unavailable:
  - CM crash: Cloud communication lost, no status updates, no desired-state processing
  - SM crash: Service instances continue running but no new deployments, no monitoring data
  - IAM crash: Certificate operations fail, Node registration stream drops, provisioning unavailable

Diagnostic Steps

1. Check the component's service status:

# Check which AosCore services are running
systemctl status aos-cm aos-sm aos-iam

# Check for recent failures
systemctl list-units 'aos-*' --state=failed

2. Examine the journal for crash details:

# For Communication Manager crashes
journalctl -u aos-cm --since "30 minutes ago" -p err

# For Service Manager crashes
journalctl -u aos-sm --since "30 minutes ago" -p err

# For IAM crashes
journalctl -u aos-iam --since "30 minutes ago" -p err

# Check for segfaults or signals (systemd reports core dumps as "dumped core" or "code=dumped")
journalctl -u aos-cm --since "1 hour ago" | grep -i "signal\|segfault\|abort\|dumped\|core dump"

3. Check restart behavior:

# See how many times the service has restarted
systemctl show aos-sm --property=NRestarts

# Check if the service is in a restart loop (rapid restarts)
journalctl -u aos-sm --since "10 minutes ago" | grep -i "start\|stop\|exit"
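A single `NRestarts` reading does not show whether restarts are still happening. One way to tell, sketched below, is to sample the counter twice and compare the readings. The helper names, the 60-second interval, and the threshold of 3 are assumptions chosen for illustration.

```shell
# restart_count reads the "NRestarts=<n>" line produced by
# `systemctl show <unit> --property=NRestarts` on stdin and prints <n>.
restart_count() {
    sed -n 's/^NRestarts=\([0-9][0-9]*\)$/\1/p'
}

# looks_like_restart_loop BEFORE AFTER MIN_DELTA
# Succeeds when the counter grew by at least MIN_DELTA between two samples.
looks_like_restart_loop() {
    [ $(( $2 - $1 )) -ge "$3" ]
}

# Example (not run here):
#   before=$(systemctl show aos-sm --property=NRestarts | restart_count)
#   sleep 60
#   after=$(systemctl show aos-sm --property=NRestarts | restart_count)
#   looks_like_restart_loop "$before" "$after" 3 && echo "aos-sm is restart-looping"
```

If the counter is stable, the crash was probably a one-off and the journal entries from the time of the failure are the place to look.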

4. Check for resource-related crash causes:

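# Check if OOM killer terminated an AosCore process
# (match OOM lines first, then filter on "aos", since the process name can appear before or after the OOM text)
journalctl -k --since "1 hour ago" | grep -iE "oom|out of memory" | grep -i "aos"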

# Check available disk space (components need to write state)
df -h /var/aos/

Root Causes

Cause	Evidence	Resolution
Out-of-memory kill	Kernel log shows OOM killer targeting `aos-cm`, `aos-sm`, or `aos-iam`	Increase available RAM; reduce service instance count; adjust memory limits in systemd unit file
Corrupted state database	Component logs show database errors on startup	Remove the corrupted state file and restart; the component will rebuild state from the cloud or peer components
Configuration error after update	Crash occurs immediately after a Unit configuration change	Revert the configuration change via AosCloud; check JSON validity of the configuration
Disk full preventing state writes	Logs show write errors, `ENOSPC`	Free disk space (see Resource Exhaustion above)
Certificate corruption	IAM crashes with certificate parsing errors	Re-provision the Node's certificates; see Provisioning Workflow

Resolution

Restart the crashed component:

# Restart a specific component
systemctl restart aos-sm

# If the service is in a failed state and won't restart automatically
systemctl reset-failed aos-sm
systemctl start aos-sm

If the component is in a restart loop, the underlying cause must be resolved first. Check the journal for the error that occurs during startup — this is typically a configuration issue, corrupted state, or missing dependency.

Verify recovery:

# Confirm the service is running
systemctl is-active aos-sm

# Check that it's functioning (SM should register with CM)
journalctl -u aos-sm --since "1 minute ago" | grep -i "register\|connect\|init"

After a CM crash and recovery, the CM re-establishes the WebSocket connection to AosCloud and sends a full (non-delta) unitStatus message, bringing the cloud back in sync with the Unit's actual state.

Unit Configuration Failures

Unit configuration failures occur when the Communication Manager receives a new Unit configuration from AosCloud but cannot successfully apply it. The configuration status is reported back to the cloud with a failed state and an error message describing the failure.

Symptoms

- The Unit configuration status in AosCloud shows state failed with an error message
- The Node configuration status shows failed for specific Nodes
- Alert thresholds, resource ratios, or labels are not updated as expected
- The CM logs show configuration processing errors

Diagnostic Steps

1. Check the Unit configuration status in AosCloud:

The unitStatus message includes a unitConfig array with the status of each configuration version. Look for entries with state: "failed" and examine the error field.

2. Check CM logs for configuration processing:

# Check CM logs for unit config handling
journalctl -u aos-cm --since "30 minutes ago" | grep -i "unit.*config\|node.*config"

# Look for JSON parsing errors
journalctl -u aos-cm --since "30 minutes ago" | grep -i "json\|parse\|format"

3. Check Node configuration distribution:

# On the Main Node, check SM controller logs for config distribution
journalctl -u aos-cm --since "30 minutes ago" | grep -i "check.*config\|set.*config"

# On Secondary Nodes, check SM logs for config reception
journalctl -u aos-sm --since "30 minutes ago" | grep -i "config\|version"

4. Verify the configuration JSON:

If you have access to the configuration document, validate its structure:

- Ensure formatVersion matches the expected schema version
- Ensure version is strictly higher than the currently installed version
- Verify that nodes array entries have valid nodeGroupSubject objects
- Check that alert rule values are within valid ranges (percentages 0–100)
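Two of the checks above can be tried locally before pushing a configuration. The sketch below is not the CM's validation logic: it assumes versions are dotted numeric strings compared in version order (using GNU `sort -V`) and that percentages are whole numbers. Both helper names are hypothetical.

```shell
# version_gt NEW CURRENT
# Succeeds only if NEW sorts strictly after CURRENT in version order.
# Assumption: dotted numeric versions; AosCore's own comparison may differ.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# valid_percent VALUE
# Succeeds if VALUE is a whole number between 0 and 100 inclusive.
valid_percent() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty, negative, or non-numeric input
    esac
    [ "$1" -ge 0 ] && [ "$1" -le 100 ]
}

# Example (not run here):
#   version_gt "1.10.0" "1.9.0" && echo "version is higher"
#   valid_percent 85 || echo "threshold out of range"
```

Version order matters here because a plain string comparison would rank 1.9.0 above 1.10.0.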

Root Causes

Cause	Evidence	Resolution
Invalid JSON syntax	CM logs show JSON parse errors	Fix the JSON syntax in the configuration document on the cloud side
Version not higher than current	CM logs show version comparison failure	Ensure the new configuration has a strictly higher `version` string than the currently installed one
Unknown `formatVersion`	CM logs show unsupported format version	Use a `formatVersion` compatible with the installed AosCore version
Node ID mismatch	Node config entry references a Node ID that doesn't exist in the Unit	Verify Node IDs in the configuration match the actual provisioned Node IDs
SM rejects Node configuration	SM logs show config validation errors on the target Node	Check that the Node configuration values are valid for the target Node's capabilities
Network failure during distribution	CM logs show timeout distributing config to Secondary Nodes	Ensure all target Nodes are connected; retry the configuration push

Resolution

Unit configuration failures are non-destructive — the previously installed configuration remains active. To resolve:

1. Identify the error from the unitStatus report or CM logs
2. Fix the configuration on the cloud side (correct JSON, bump version, fix Node references)
3. Push the corrected configuration; the CM will process the new version and report the updated status

The configuration state machine has three states:

- absent: no configuration has been installed
- installed: configuration successfully applied
- failed: configuration could not be applied (error message describes why)

A failed configuration does not prevent subsequent configuration attempts. Pushing a new, corrected configuration with a higher version will be processed normally.

Related Pages

- Node Lifecycle: Node state machine, registration, and disconnection handling
- Monitoring Pipeline: how resource metrics are collected, averaged, and alerts evaluated
- Alerts and Thresholds: alert rule configuration and threshold behavior
- Unit Configuration: Unit configuration JSON schema and version management
- Error Handling and Recovery: system-level error propagation and recovery mechanisms
- Connectivity Issues: cloud connection and inter-Node communication troubleshooting
- Service Deployment Failures: image download, launch, and instance failure troubleshooting
