Node and Unit Health
Introduction
This page provides practical troubleshooting guidance for Node and Unit health problems — situations where a Node becomes unreachable, runs out of resources, experiences component crashes, or fails to apply configuration updates. These issues affect the overall operational health of the Unit and typically require operator intervention to diagnose and resolve.
Each problem category follows a consistent structure: observable symptoms, diagnostic data collection, common root causes, and resolution steps. For background on how Node lifecycle states work, see Node Lifecycle. For details on the monitoring pipeline that detects resource issues, see Monitoring Pipeline.
Node Disconnection
A Node disconnection occurs when the Main Node loses communication with a Secondary Node. The Communication Manager detects this through the SM connection timeout mechanism — if no data is received from a Node's Service Manager within the configured timeout period, the Node is reported as disconnected and eventually transitions to an error state.
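The timeout rule described above reduces to a comparison between the time since the last SM data and the configured timeout. The following is a minimal sketch of that rule, not the CM's actual implementation: the function name `sm_timed_out`, its arguments, and the 30-second timeout in the example are assumptions made for illustration.

```shell
# Hypothetical sketch of the timeout rule: a Node counts as disconnected when
# no SM data has arrived within the configured timeout.
# Arguments: last_seen (epoch seconds), now (epoch seconds), timeout (seconds).
# Returns 0 (true) if the Node should be treated as timed out.
sm_timed_out() (
    last_seen=$1
    now=$2
    timeout=$3
    [ $((now - last_seen)) -gt "$timeout" ]
)

# Example: last data at t=1000, checked at t=1045, assumed 30 s timeout.
if sm_timed_out 1000 1045 30; then
    echo "disconnected"
else
    echo "connected"
fi
```

With these example values the elapsed time (45 s) exceeds the assumed timeout, so the sketch prints `disconnected`.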
Symptoms
- AosCloud dashboard shows a Node with isConnected: false
- The Node's state transitions to error with message "SM connection timeout" after the configured timeout expires
- Services previously running on the disconnected Node continue running locally but cannot receive new deployments or state changes
- The Main Node's CM reports the disconnected Node in the unitStatus message to the cloud
Diagnostic Steps
1. Check the Node's network connectivity:
# From the Main Node, verify network reachability
ping <secondary-node-ip>
# Check if the gRPC port is accessible
nc -zv <secondary-node-ip> 8089
2. Check the IAM registration stream on the Secondary Node:
# On the Secondary Node, check IAM client logs for connection attempts
journalctl -u aos-iam --since "10 minutes ago" | grep -i "connect\|stream\|register"
The IAM client on the Secondary Node maintains the RegisterNode bidirectional gRPC stream to the Main Node. Look for
connection errors, TLS handshake failures, or repeated reconnection attempts.
3. Check the SM registration on the Secondary Node:
# On the Secondary Node, check SM logs for CM connection status
journalctl -u aos-sm --since "10 minutes ago" | grep -i "connect\|register\|timeout"
The SM on each Node registers with the CM on the Main Node. If the SM cannot connect, the Main Node's NodeInfoCache will not receive updates and will eventually trigger the SM connection timeout.
4. Check certificate validity:
# On the Secondary Node, check certificate expiration
openssl x509 -in /path/to/node/cert.pem -noout -dates
# Check IAM logs for certificate-related errors
journalctl -u aos-iam --since "1 hour ago" | grep -i "cert\|tls\|x509"
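The certificate check in step 4 can be turned into a number of days remaining, which is easier to compare against a renewal window. The sketch below is illustrative only: the helper name `cert_days_left` is hypothetical, and it assumes GNU `date` (the `-d` option) to parse the `notAfter` value that `openssl x509 -noout -enddate` prints.

```shell
# Sketch: compute whole days left before a certificate expires, given the
# line printed by `openssl x509 -noout -enddate`
# (for example "notAfter=Jan  2 00:00:00 2030 GMT").
# Assumes GNU date (-d); BusyBox or BSD date may need different flags.
# Arguments: enddate line, current time in epoch seconds.
# Prints a negative number if the certificate has already expired.
cert_days_left() (
    not_after=${1#notAfter=}
    now=$2
    expiry=$(date -u -d "$not_after" +%s) || exit 1
    echo $(( (expiry - now) / 86400 ))
)

# Possible usage on a Node (path is the placeholder used in step 4):
#   cert_days_left "$(openssl x509 -in /path/to/node/cert.pem -noout -enddate)" "$(date +%s)"
```

Passing the current time as an argument keeps the function easy to exercise with fixed values.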
Root Causes
| Cause | Evidence | Resolution |
|---|---|---|
| Network failure between Nodes | ping fails, no route to host | Restore network connectivity; check cables, switches, firewall rules |
| TLS certificate expired or invalid | IAM logs show TLS handshake errors, x509: certificate has expired | Trigger certificate renewal through the provisioning system; see Certificate Architecture |
| IAM service not running on Secondary Node | systemctl status aos-iam shows inactive/failed | Restart the IAM service: systemctl restart aos-iam |
| SM service not running on Secondary Node | systemctl status aos-sm shows inactive/failed | Restart the SM service: systemctl restart aos-sm |
| Main Node IAM not accepting connections | Main Node IAM logs show binding errors or resource exhaustion | Check Main Node IAM service health and available file descriptors |
| DNS resolution failure | IAM logs show hostname resolution errors | Verify DNS configuration and /etc/hosts entries for the Main Node address |
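The evidence column of the table can be applied mechanically to a batch of log lines. The sketch below shows one way to do that. The function name `classify_disconnect` and the match patterns are assumptions for illustration; actual AosCore log wording may differ, so the patterns should be adjusted to what the logs contain.

```shell
# Sketch: map journal lines to likely causes from the table above.
# Patterns are illustrative assumptions, not taken from AosCore source.
# Reads log lines on stdin and prints one label per matching category,
# or "unknown" if nothing matches.
classify_disconnect() {
    awk '
        { line = tolower($0) }
        line ~ /certificate has expired|tls handshake|x509/ { tls = 1 }
        line ~ /no route to host|network is unreachable/    { net = 1 }
        line ~ /no such host|name resolution|resolve/       { dns = 1 }
        END {
            if (tls) print "tls-certificate"
            if (net) print "network"
            if (dns) print "dns"
            if (!tls && !net && !dns) print "unknown"
        }
    '
}

# Possible usage:
#   journalctl -u aos-iam --since "1 hour ago" | classify_disconnect
```

A label is only a hint about where to look first; the service-status causes in the table still need the `systemctl status` checks.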
Resolution
Once the underlying cause is resolved, the Secondary Node's IAM client automatically re-establishes the RegisterNode
stream with exponential backoff. No manual reconnection is needed — the system self-heals once connectivity is restored.
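To give a feel for how long reconnection can take after connectivity returns, the sketch below prints an exponential backoff schedule. The 1-second base delay, the 30-second cap, and the doubling factor are assumptions for illustration; this page does not state the values the IAM client actually uses.

```shell
# Sketch of an exponential backoff schedule with a cap.
# Base delay, cap, and doubling factor are assumed values.
# Arguments: attempt (0-based), base delay in seconds, maximum delay in seconds.
backoff_delay() (
    attempt=$1
    delay=$2
    max=$3
    i=0
    # Double the delay once per attempt, stopping early once the cap is reached.
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$max" ]; then
        delay=$max
    fi
    echo "$delay"
)

for n in 0 1 2 3 4 5 6; do
    printf 'attempt %s: wait %ss\n' "$n" "$(backoff_delay "$n" 1 30)"
done
```

Under these assumed values the waits grow as 1, 2, 4, 8, 16 and then stay at 30 seconds, so a Node that has been offline for a while may take up to the cap to reconnect.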
To verify recovery:
# On the Secondary Node, confirm IAM reconnected
journalctl -u aos-iam --since "2 minutes ago" | grep -i "connected"
# On the Main Node, confirm the Node is back
journalctl -u aos-cm --since "2 minutes ago" | grep -i "node info changed"
The Node's isConnected state in the cloud dashboard should return to true once the stream is re-established and the
SM reports in.
Resource Exhaustion
Resource exhaustion occurs when a Node's CPU, RAM, disk, or network usage exceeds configured thresholds. The monitoring pipeline detects these conditions and raises alerts (SystemQuotaAlert or InstanceQuotaAlert) that are forwarded to AosCloud.
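Alerts carry a status of raise, continue, or fall (see the diagnostic steps below). The sketch that follows shows one plausible way such statuses could be derived from usage samples and a pair of thresholds. The semantics are an assumption, not taken from AosCore source: raise when usage first exceeds the upper threshold, continue while an alert is active and usage stays above the lower threshold, fall when usage drops to the lower threshold or below. Timing windows are ignored, and the function name `alert_states` is hypothetical.

```shell
# Sketch: derive an alert status for each usage sample (percent) on stdin.
# Assumed semantics (see lead-in); real alert rules may also involve timeouts.
# Arguments: lower threshold, upper threshold (both percent).
alert_states() {
    awk -v min="$1" -v max="$2" '
        {
            if (!active && $1 > max)     { active = 1; print $1, "raise" }
            else if (active && $1 > min) { print $1, "continue" }
            else if (active)             { active = 0; print $1, "fall" }
            else                         { print $1, "ok" }
        }
    '
}

printf '%s\n' 70 92 88 95 60 50 | alert_states 80 90
```

Using two thresholds keeps a value that hovers around a single limit from generating a stream of raise and fall transitions.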
Symptoms
- SystemQuotaAlert or InstanceQuotaAlert alerts appear in AosCloud for the affected Node
- Services on the Node become slow or unresponsive
- New service deployments fail with resource-related errors
- The Node's monitoring data shows sustained high usage for one or more resources
- In severe cases (disk full), AosCore components may fail to write state and crash
Diagnostic Steps
1. Check monitoring alerts in AosCloud:
Review the alert history for the affected Node. Alerts include the resource type (CPU, RAM, disk partition, network) and
whether the alert is in raise, continue, or fall state.
2. Check disk space on the Node:
# Check overall disk usage
df -h
# Check AosCore working directories specifically
du -sh /var/aos/
du -sh /var/aos/sm/
du -sh /var/aos/cm/
# Check for large service images consuming space
du -sh /var/aos/sm/images/
3. Check RAM usage:
# Current memory usage
free -h
# Top memory consumers
ps aux --sort=-%mem | head -20
# Check for OOM killer activity
journalctl -k | grep -i "oom\|out of memory"
4. Check CPU usage:
# Current CPU usage by process
top -bn1 | head -20
# Check for runaway service instances
systemctl list-units 'aos-service@*' --state=running
5. Check monitoring configuration:
Review the Node's alert thresholds in the Unit configuration to understand what limits are configured. See Unit Configuration for the alert rules schema.
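The disk check from step 2 can be filtered so that only partitions near their limit are shown. The sketch below reads `df -P` output (POSIX format, where the fifth column is the capacity percentage and the sixth is the mount point). The function name `disk_over` and the default limit of 90 percent are assumptions; use the thresholds from the Unit configuration where they are known.

```shell
# Sketch: list mount points whose usage meets or exceeds a limit.
# Reads `df -P` output on stdin. Mount points containing spaces are not handled.
# Argument: limit in percent (assumed default: 90).
disk_over() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}

# Possible usage:
#   df -P | disk_over 85
```

An empty result means no partition has reached the limit.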
Root Causes
| Cause | Evidence | Resolution |
|---|---|---|
| Service instance consuming excessive resources | InstanceQuotaAlert for specific instance; high CPU/RAM in ps output | Review the service's resource requirements; adjust resource ratios in Unit configuration; contact the service developer |
| Accumulated service images filling disk | /var/aos/sm/images/ consuming significant space | The Image Manager should garbage-collect unused images; check if old versions are being retained due to rollback policies |
| Log files consuming disk space | Large files in /var/log/ or journal storage | Configure journal size limits (SystemMaxUse in journald.conf); rotate or archive old logs |
| Memory leak in a service instance | RAM usage grows continuously over time for one instance | Restart the affected service instance; report the leak to the service developer |
| Insufficient resource allocation for workload | Multiple services competing for limited Node resources | Adjust resource ratios in the Unit configuration; redistribute services across Nodes using scheduling labels and priorities |
Resolution
For immediate relief from resource pressure:
# If disk is full, identify and remove unnecessary files
journalctl --vacuum-size=100M # Remove archived journal files until they use at most 100M (active files are not touched)
# If a specific service instance is the cause, it can be stopped via cloud
# (send a desired state without that instance) or locally:
systemctl stop aos-service@<instance-id>
For long-term resolution, adjust the Unit configuration through AosCloud to set appropriate alert thresholds and resource ratios that match the Node's hardware capabilities and workload requirements.
Component Crashes
AosCore components (CM, SM, IAM) run as systemd services. When a component crashes, systemd detects the failure and
(depending on the unit file configuration) may automatically restart it. The Journal Alerts subsystem monitors the
systemd journal for error-level messages from these services and forwards them to AosCloud as CoreAlert messages.
Symptoms
- CoreAlert appears in AosCloud identifying the crashed component (CM, SM, or IAM) and the Node
- The component's systemd service shows as failed or is in a restart loop
- Functionality provided by the crashed component is unavailable:
  - CM crash: Cloud communication lost, no status updates, no desired-state processing
  - SM crash: Service instances continue running but no new deployments, no monitoring data
  - IAM crash: Certificate operations fail, Node registration stream drops, provisioning unavailable
Diagnostic Steps
1. Check the component's service status:
# Check which AosCore services are running
systemctl status aos-cm aos-sm aos-iam
# Check for recent failures
systemctl list-units 'aos-*' --state=failed
2. Examine the journal for crash details:
# For Communication Manager crashes
journalctl -u aos-cm --since "30 minutes ago" -p err
# For Service Manager crashes
journalctl -u aos-sm --since "30 minutes ago" -p err
# For IAM crashes
journalctl -u aos-iam --since "30 minutes ago" -p err
# Check for segfaults or signals
journalctl -u aos-cm --since "1 hour ago" | grep -i "signal\|segfault\|abort\|core dump"
3. Check restart behavior:
# See how many times the service has restarted
systemctl show aos-sm --property=NRestarts
# Check if the service is in a restart loop (rapid restarts)
journalctl -u aos-sm --since "10 minutes ago" | grep -i "start\|stop\|exit"
4. Check for resource-related crash causes:
# Check if OOM killer terminated the process
journalctl -k --since "1 hour ago" | grep -i "oom.*aos"
# Check available disk space (components need to write state)
df -h /var/aos/
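The restart counter from step 3 can be checked against a limit to flag a probable restart loop. The sketch below parses the single line printed by `systemctl show --property=NRestarts` (for example `NRestarts=7`). The function name `in_restart_loop` and the threshold of 5 are assumptions. Note that the counter is a running total, not a rate, so a high value only suggests a loop and should be confirmed in the journal.

```shell
# Sketch: flag a likely restart loop from the NRestarts property line.
# Reads the line on stdin; returns 0 if the count meets the threshold.
# Argument: threshold (assumed default: 5).
in_restart_loop() (
    threshold=${1:-5}
    count=$(sed -n 's/^NRestarts=\([0-9][0-9]*\)$/\1/p')
    [ -n "$count" ] && [ "$count" -ge "$threshold" ]
)

# Possible usage:
#   systemctl show aos-sm --property=NRestarts | in_restart_loop 5 && echo "possible restart loop"
```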
Root Causes
| Cause | Evidence | Resolution |
|---|---|---|
| Out-of-memory kill | Kernel log shows OOM killer targeting aos-cm, aos-sm, or aos-iam | Increase available RAM; reduce service instance count; adjust memory limits in systemd unit file |
| Corrupted state database | Component logs show database errors on startup | Remove the corrupted state file and restart; the component will rebuild state from the cloud or peer components |
| Configuration error after update | Crash occurs immediately after a Unit configuration change | Revert the configuration change via AosCloud; check JSON validity of the configuration |
| Disk full preventing state writes | Logs show write errors, ENOSPC | Free disk space (see Resource Exhaustion above) |
| Certificate corruption | IAM crashes with certificate parsing errors | Re-provision the Node's certificates; see Provisioning Workflow |
Resolution
Restart the crashed component:
# Restart a specific component
systemctl restart aos-sm
# If the service is in a failed state and won't restart automatically
systemctl reset-failed aos-sm
systemctl start aos-sm
If the component is in a restart loop, the underlying cause must be resolved first. Check the journal for the error that occurs during startup — this is typically a configuration issue, corrupted state, or missing dependency.
Verify recovery:
# Confirm the service is running
systemctl is-active aos-sm
# Check that it's functioning (SM should register with CM)
journalctl -u aos-sm --since "1 minute ago" | grep -i "register\|connect\|init"
After a CM crash and recovery, the CM re-establishes the WebSocket connection to AosCloud and sends a full (non-delta)
unitStatus message, bringing the cloud back in sync with the Unit's actual state.
Unit Configuration Failures
Unit configuration failures occur when the Communication Manager receives a new Unit configuration from AosCloud but
cannot successfully apply it. The configuration status is reported back to the cloud with a failed state and an error
message describing the failure.
Symptoms
- The Unit configuration status in AosCloud shows state failed with an error message
- The Node configuration status shows failed for specific Nodes
- Alert thresholds, resource ratios, or labels are not updated as expected
- The CM logs show configuration processing errors
Diagnostic Steps
1. Check the Unit configuration status in AosCloud:
The unitStatus message includes a unitConfig array with the status of each configuration version. Look for entries
with state: "failed" and examine the error field.
2. Check CM logs for configuration processing:
# Check CM logs for unit config handling
journalctl -u aos-cm --since "30 minutes ago" | grep -i "unit.*config\|node.*config"
# Look for JSON parsing errors
journalctl -u aos-cm --since "30 minutes ago" | grep -i "json\|parse\|format"
3. Check Node configuration distribution:
# On the Main Node, check SM controller logs for config distribution
journalctl -u aos-cm --since "30 minutes ago" | grep -i "check.*config\|set.*config"
# On Secondary Nodes, check SM logs for config reception
journalctl -u aos-sm --since "30 minutes ago" | grep -i "config\|version"
4. Verify the configuration JSON:
If you have access to the configuration document, validate its structure:
- Ensure formatVersion matches the expected schema version
- Ensure version is strictly higher than the currently installed version
- Verify that nodes array entries have valid nodeGroupSubject objects
- Check that alert rule values are within valid ranges (percentages 0–100)
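Two of the checks above, the version ordering and the percentage range, are simple enough to run before pushing a configuration. The sketch below is illustrative: the helper names are hypothetical, `version_gt` assumes dot-separated numeric versions compared with GNU `sort -V`, and `valid_percent` only accepts whole numbers. How AosCore itself compares versions or parses threshold values is not specified on this page.

```shell
# Sketch of two pre-flight checks for a Unit configuration.

# version_gt NEW CURRENT: true if NEW sorts strictly after CURRENT.
# Assumes version-style ordering as implemented by `sort -V`.
version_gt() (
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
)

# valid_percent VALUE: true for whole numbers from 0 to 100 inclusive.
valid_percent() (
    case $1 in
        ''|*[!0-9]*) exit 1 ;;
    esac
    [ "$1" -ge 0 ] && [ "$1" -le 100 ]
)

# Possible usage:
#   version_gt "2.1.0" "2.0.9" || echo "version is not higher than the installed one"
#   valid_percent "85" || echo "threshold out of range"
```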
Root Causes
| Cause | Evidence | Resolution |
|---|---|---|
| Invalid JSON syntax | CM logs show JSON parse errors | Fix the JSON syntax in the configuration document on the cloud side |
| Version not higher than current | CM logs show version comparison failure | Ensure the new configuration has a strictly higher version string than the currently installed one |
| Unknown formatVersion | CM logs show unsupported format version | Use a formatVersion compatible with the installed AosCore version |
| Node ID mismatch | Node config entry references a Node ID that doesn't exist in the Unit | Verify Node IDs in the configuration match the actual provisioned Node IDs |
| SM rejects Node configuration | SM logs show config validation errors on the target Node | Check that the Node configuration values are valid for the target Node's capabilities |
| Network failure during distribution | CM logs show timeout distributing config to Secondary Nodes | Ensure all target Nodes are connected; retry the configuration push |
Resolution
Unit configuration failures are non-destructive — the previously installed configuration remains active. To resolve:
- Identify the error from the unitStatus report or CM logs
- Fix the configuration on the cloud side (correct JSON, bump version, fix Node references)
- Push the corrected configuration — the CM will process the new version and report the updated status
The configuration state machine has three states:
- absent: no configuration has been installed
- installed: configuration successfully applied
- failed: configuration could not be applied (error message describes why)
A failed configuration does not prevent subsequent configuration attempts. Pushing a new, corrected configuration with a higher version will be processed normally.
Related Pages
- Node Lifecycle — Node state machine, registration, and disconnection handling
- Monitoring Pipeline — how resource metrics are collected, averaged, and alerts evaluated
- Alerts and Thresholds — alert rule configuration and threshold behavior
- Unit Configuration — Unit configuration JSON schema and version management
- Error Handling and Recovery — system-level error propagation and recovery mechanisms
- Connectivity Issues — cloud connection and inter-Node communication troubleshooting
- Service Deployment Failures — image download, launch, and instance failure troubleshooting