Version: v1.1

Node and Unit Health

Introduction

This page provides practical troubleshooting guidance for Node and Unit health problems — situations where a Node becomes unreachable, runs out of resources, experiences component crashes, or fails to apply configuration updates. These issues affect the overall operational health of the Unit and typically require operator intervention to diagnose and resolve.

Each problem category follows a consistent structure: observable symptoms, diagnostic data collection, common root causes, and resolution steps. For background on how Node lifecycle states work, see Node Lifecycle. For details on the monitoring pipeline that detects resource issues, see Monitoring Pipeline.

Node Disconnection

A Node disconnection occurs when the Main Node loses communication with a Secondary Node. The Communication Manager detects this through the SM connection timeout mechanism — if no data is received from a Node's Service Manager within the configured timeout period, the Node is reported as disconnected and eventually transitions to an error state.
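
The timeout rule described above can be pictured as a simple age check on the last message received from a Node's SM. The sketch below is illustrative only: the function name, the 30-second timeout, and the timestamps are assumptions, not values or code taken from AosCore.

```shell
# Hypothetical helper (not part of AosCore): returns 0 (true) if the last
# data received from a Node's SM is older than the timeout.
# Usage: sm_timed_out <last_seen_epoch> <now_epoch> <timeout_seconds>
sm_timed_out() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Example with made-up values: last message at t=1000, checked at t=1045,
# with an assumed 30-second timeout.
if sm_timed_out 1000 1045 30; then
    echo "disconnected"
else
    echo "connected"
fi
```

With these example values the age of the last message is 45 seconds, which exceeds the assumed timeout, so the check reports the Node as disconnected.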

Symptoms

  • AosCloud dashboard shows a Node with isConnected: false
  • The Node's state transitions to error with message "SM connection timeout" after the configured timeout expires
  • Services previously running on the disconnected Node continue running locally but cannot receive new deployments or state changes
  • The Main Node's CM reports the disconnected Node in the unitStatus message to the cloud

Diagnostic Steps

1. Check the Node's network connectivity:

# From the Main Node, verify network reachability
ping <secondary-node-ip>

# Check if the gRPC port is accessible
nc -zv <secondary-node-ip> 8089

2. Check the IAM registration stream on the Secondary Node:

# On the Secondary Node, check IAM client logs for connection attempts
journalctl -u aos-iam --since "10 minutes ago" | grep -i "connect\|stream\|register"

The IAM client on the Secondary Node maintains the RegisterNode bidirectional gRPC stream to the Main Node. Look for connection errors, TLS handshake failures, or repeated reconnection attempts.

3. Check the SM registration on the Secondary Node:

# On the Secondary Node, check SM logs for CM connection status
journalctl -u aos-sm --since "10 minutes ago" | grep -i "connect\|register\|timeout"

The SM on each Node registers with the CM on the Main Node. If the SM cannot connect, the Main Node's NodeInfoCache will not receive updates and will eventually trigger the SM connection timeout.

4. Check certificate validity:

# On the Secondary Node, check certificate expiration
openssl x509 -in /path/to/node/cert.pem -noout -dates

# Check IAM logs for certificate-related errors
journalctl -u aos-iam --since "1 hour ago" | grep -i "cert\|tls\|x509"
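
To catch certificates that are about to expire rather than only those that already have, the openssl -checkend option can be wrapped in a small helper. This is a minimal sketch: the function names and the 30-day window are illustrative choices, and the certificate path remains a placeholder as in the commands above.

```shell
# Convert a number of days to seconds for use with -checkend.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Returns 0 if the certificate will still be valid <days> from now,
# non-zero if it will have expired by then (or cannot be read).
# Usage: cert_valid_for <cert.pem> <days>
cert_valid_for() {
    openssl x509 -in "$1" -noout -checkend "$(days_to_seconds "$2")" >/dev/null
}

# Example (placeholder path; 30 days is an arbitrary warning window):
#   cert_valid_for /path/to/node/cert.pem 30 || echo "expires within 30 days"
```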

Root Causes

| Cause | Evidence | Resolution |
| --- | --- | --- |
| Network failure between Nodes | ping fails, no route to host | Restore network connectivity; check cables, switches, firewall rules |
| TLS certificate expired or invalid | IAM logs show TLS handshake errors, x509: certificate has expired | Trigger certificate renewal through the provisioning system; see Certificate Architecture |
| IAM service not running on Secondary Node | systemctl status aos-iam shows inactive/failed | Restart the IAM service: systemctl restart aos-iam |
| SM service not running on Secondary Node | systemctl status aos-sm shows inactive/failed | Restart the SM service: systemctl restart aos-sm |
| Main Node IAM not accepting connections | Main Node IAM logs show binding errors or resource exhaustion | Check Main Node IAM service health and available file descriptors |
| DNS resolution failure | IAM logs show hostname resolution errors | Verify DNS configuration and /etc/hosts entries for the Main Node address |

Resolution

Once the underlying cause is resolved, the Secondary Node's IAM client automatically retries the RegisterNode stream, waiting with exponential backoff between attempts, until the stream is re-established. No manual reconnection is needed; the system recovers on its own once connectivity is restored.
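
Because of the backoff, reconnection may lag behind the moment connectivity returns. The sketch below shows how a generic capped exponential backoff schedule grows. The 1-second base and 60-second cap are assumptions for illustration; the IAM client's actual parameters are not documented on this page, and real implementations often add random jitter, which is omitted here.

```shell
# backoff_delay <attempt> <base_seconds> <max_seconds>
# Prints the wait before the given retry attempt (attempt 0 is the first
# retry). The delay doubles on each attempt and is capped at <max_seconds>.
backoff_delay() {
    delay=$2
    i=0
    while [ "$i" -lt "$1" ] && [ "$delay" -lt "$3" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$3" ]; then
        delay=$3
    fi
    echo "$delay"
}

# Print the schedule for the first eight attempts (assumed base 1 s, cap 60 s).
# Prints: 1 2 4 8 16 32 60 60
for attempt in 0 1 2 3 4 5 6 7; do
    printf '%s ' "$(backoff_delay "$attempt" 1 60)"
done
echo
```

Under these assumed parameters, a Node that has been unreachable for several minutes could take up to a minute to reconnect after the fault is cleared, so allow for that delay before concluding that recovery has failed.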

To verify recovery:

# On the Secondary Node, confirm IAM reconnected
# (-w matches "connected" as a whole word, so "disconnected" lines are excluded)
journalctl -u aos-iam --since "2 minutes ago" | grep -iw "connected"

# On the Main Node, confirm the Node is back
journalctl -u aos-cm --since "2 minutes ago" | grep -i "node info changed"

The Node's isConnected state in the cloud dashboard should return to true once the stream is re-established and the SM reports in.

Resource Exhaustion

Resource exhaustion occurs when a Node's CPU, RAM, disk, or network usage exceeds configured thresholds. The monitoring pipeline detects these conditions and raises alerts (SystemQuotaAlert or InstanceQuotaAlert) that are forwarded to AosCloud.

Symptoms

  • SystemQuotaAlert or InstanceQuotaAlert alerts appear in AosCloud for the affected Node
  • Services on the Node become slow or unresponsive
  • New service deployments fail with resource-related errors
  • The Node's monitoring data shows sustained high usage for one or more resources
  • In severe cases (disk full), AosCore components may fail to write state and crash

Diagnostic Steps

1. Check monitoring alerts in AosCloud:

Review the alert history for the affected Node. Alerts include the resource type (CPU, RAM, disk partition, network) and whether the alert is in raise, continue, or fall state.

2. Check disk space on the Node:

# Check overall disk usage
df -h

# Check AosCore working directories specifically
du -sh /var/aos/
du -sh /var/aos/sm/
du -sh /var/aos/cm/

# Check for large service images consuming space
du -sh /var/aos/sm/images/
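
When several partitions are mounted, it can help to filter the df output down to the ones that are nearly full. The helper below is a minimal sketch: the function name and the 90% cutoff are illustrative and unrelated to the alert thresholds configured for the Unit. It assumes mount points without spaces.

```shell
# over_threshold <percent>
# Reads `df -P` output on stdin and prints "<mountpoint> <use>%" for each
# filesystem whose usage is at or above the given percentage.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)      # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Example (90 is an arbitrary cutoff):
#   df -P | over_threshold 90
```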

3. Check RAM usage:

# Current memory usage
free -h

# Top memory consumers
ps aux --sort=-%mem | head -20

# Check for OOM killer activity
journalctl -k | grep -i "oom\|out of memory"
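
For a single number to compare against a RAM alert threshold, usage can be derived from /proc/meminfo. This sketch computes (MemTotal - MemAvailable) / MemTotal; the function name is hypothetical, and the monitoring pipeline may calculate RAM usage differently, so treat the result as an approximation.

```shell
# mem_used_percent
# Reads /proc/meminfo-formatted text on stdin and prints the percentage of
# RAM in use as a whole number, based on MemTotal and MemAvailable.
mem_used_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total > 0) printf "%d\n", (total - avail) * 100 / total
         }'
}

# Example:
#   mem_used_percent < /proc/meminfo
```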

4. Check CPU usage:

# Current CPU usage by process
top -bn1 | head -20

# Check for runaway service instances
systemctl list-units 'aos-service@*' --state=running

5. Check monitoring configuration:

Review the Node's alert thresholds in the Unit configuration to understand what limits are configured. See Unit Configuration for the alert rules schema.

Root Causes

| Cause | Evidence | Resolution |
| --- | --- | --- |
| Service instance consuming excessive resources | InstanceQuotaAlert for specific instance; high CPU/RAM in ps output | Review the service's resource requirements; adjust resource ratios in Unit configuration; contact the service developer |
| Accumulated service images filling disk | /var/aos/sm/images/ consuming significant space | The Image Manager should garbage-collect unused images; check if old versions are being retained due to rollback policies |
| Log files consuming disk space | Large files in /var/log/ or journal storage | Configure journal size limits (SystemMaxUse in journald.conf); rotate or archive old logs |
| Memory leak in a service instance | RAM usage grows continuously over time for one instance | Restart the affected service instance; report the leak to the service developer |
| Insufficient resource allocation for workload | Multiple services competing for limited Node resources | Adjust resource ratios in the Unit configuration; redistribute services across Nodes using scheduling labels and priorities |

Resolution

For immediate relief from resource pressure:

# If disk is full, identify and remove unnecessary files
journalctl --vacuum-size=100M # Reduce journal to 100MB

# If a specific service instance is the cause, it can be stopped via cloud
# (send a desired state without that instance) or locally:
systemctl stop aos-service@<instance-id>

For long-term resolution, adjust the Unit configuration through AosCloud to set appropriate alert thresholds and resource ratios that match the Node's hardware capabilities and workload requirements.

Component Crashes

AosCore components (CM, SM, IAM) run as systemd services. When a component crashes, systemd detects the failure and (depending on the unit file configuration) may automatically restart it. The Journal Alerts subsystem monitors the systemd journal for error-level messages from these services and forwards them to AosCloud as CoreAlert messages.

Symptoms

  • CoreAlert appears in AosCloud identifying the crashed component (CM, SM, or IAM) and the Node
  • The component's systemd service shows as failed or is in a restart loop
  • Functionality provided by the crashed component is unavailable:
    • CM crash: Cloud communication lost, no status updates, no desired-state processing
    • SM crash: Service instances continue running but no new deployments, no monitoring data
    • IAM crash: Certificate operations fail, Node registration stream drops, provisioning unavailable

Diagnostic Steps

1. Check the component's service status:

# Check which AosCore services are running
systemctl status aos-cm aos-sm aos-iam

# Check for recent failures
systemctl list-units 'aos-*' --state=failed

2. Examine the journal for crash details:

# For Communication Manager crashes
journalctl -u aos-cm --since "30 minutes ago" -p err

# For Service Manager crashes
journalctl -u aos-sm --since "30 minutes ago" -p err

# For IAM crashes
journalctl -u aos-iam --since "30 minutes ago" -p err

# Check for segfaults or signals
journalctl -u aos-cm --since "1 hour ago" | grep -i "signal\|segfault\|abort\|core dump"

3. Check restart behavior:

# See how many times the service has restarted
systemctl show aos-sm --property=NRestarts

# Check if the service is in a restart loop (rapid restarts)
journalctl -u aos-sm --since "10 minutes ago" | grep -i "start\|stop\|exit"
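
The NRestarts value can be extracted and compared against a limit in a script. This is a rough sketch: the function names and the limit of 5 are illustrative. NRestarts counts restarts since the unit was loaded rather than a restart rate, so a high value only suggests a loop and should be confirmed against the journal timestamps.

```shell
# restart_count
# Reads the "NRestarts=<n>" line printed by `systemctl show` on stdin
# and prints just the number.
restart_count() {
    sed -n 's/^NRestarts=\([0-9][0-9]*\)$/\1/p'
}

# looks_like_restart_loop <count> <limit>
# Returns 0 if the restart count exceeds the limit.
looks_like_restart_loop() {
    [ "$1" -gt "$2" ]
}

# Example (the limit of 5 is arbitrary):
#   n=$(systemctl show aos-sm --property=NRestarts | restart_count)
#   looks_like_restart_loop "$n" 5 && echo "aos-sm may be in a restart loop"
```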

4. Check for resource-related crash causes:

# Check if OOM killer terminated the process
journalctl -k --since "1 hour ago" | grep -iE "(oom|out of memory).*aos"

# Check available disk space (components need to write state)
df -h /var/aos/

Root Causes

| Cause | Evidence | Resolution |
| --- | --- | --- |
| Out-of-memory kill | Kernel log shows OOM killer targeting aos-cm, aos-sm, or aos-iam | Increase available RAM; reduce service instance count; adjust memory limits in systemd unit file |
| Corrupted state database | Component logs show database errors on startup | Remove the corrupted state file and restart; the component will rebuild state from the cloud or peer components |
| Configuration error after update | Crash occurs immediately after a Unit configuration change | Revert the configuration change via AosCloud; check JSON validity of the configuration |
| Disk full preventing state writes | Logs show write errors, ENOSPC | Free disk space (see Resource Exhaustion above) |
| Certificate corruption | IAM crashes with certificate parsing errors | Re-provision the Node's certificates; see Provisioning Workflow |

Resolution

Restart the crashed component:

# Restart a specific component
systemctl restart aos-sm

# If the service is in a failed state and won't restart automatically
systemctl reset-failed aos-sm
systemctl start aos-sm

If the component is in a restart loop, the underlying cause must be resolved first. Check the journal for the error that occurs during startup — this is typically a configuration issue, corrupted state, or missing dependency.

Verify recovery:

# Confirm the service is running
systemctl is-active aos-sm

# Check that it's functioning (SM should register with CM)
journalctl -u aos-sm --since "1 minute ago" | grep -i "register\|connect\|init"

After a CM crash and recovery, the CM re-establishes the WebSocket connection to AosCloud and sends a full (non-delta) unitStatus message, bringing the cloud back in sync with the Unit's actual state.

Unit Configuration Failures

Unit configuration failures occur when the Communication Manager receives a new Unit configuration from AosCloud but cannot successfully apply it. The configuration status is reported back to the cloud with a failed state and an error message describing the failure.

Symptoms

  • The Unit configuration status in AosCloud shows state failed with an error message
  • The Node configuration status shows failed for specific Nodes
  • Alert thresholds, resource ratios, or labels are not updated as expected
  • The CM logs show configuration processing errors

Diagnostic Steps

1. Check the Unit configuration status in AosCloud:

The unitStatus message includes a unitConfig array with the status of each configuration version. Look for entries with state: "failed" and examine the error field.

2. Check CM logs for configuration processing:

# Check CM logs for unit config handling
journalctl -u aos-cm --since "30 minutes ago" | grep -i "unit.*config\|node.*config"

# Look for JSON parsing errors
journalctl -u aos-cm --since "30 minutes ago" | grep -i "json\|parse\|format"

3. Check Node configuration distribution:

# On the Main Node, check SM controller logs for config distribution
journalctl -u aos-cm --since "30 minutes ago" | grep -i "check.*config\|set.*config"

# On Secondary Nodes, check SM logs for config reception
journalctl -u aos-sm --since "30 minutes ago" | grep -i "config\|version"

4. Verify the configuration JSON:

If you have access to the configuration document, validate its structure:

  • Ensure formatVersion matches the expected schema version
  • Ensure version is strictly higher than the currently installed version
  • Verify that nodes array entries have valid nodeGroupSubject objects
  • Check that alert rule values are within valid ranges (percentages 0–100)
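
Two of the checks above, the version comparison and the percentage range, can be approximated locally before pushing a configuration. The helpers below are a sketch under stated assumptions: the names are hypothetical, sort -V ordering stands in for whatever comparison the CM actually performs, and only integer percentages are handled.

```shell
# version_gt <new> <current>
# Returns 0 if <new> sorts strictly after <current> under `sort -V`.
# Assumption: this approximates, but may not match, the CM's comparison.
version_gt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# valid_percent <value>
# Returns 0 if the value is an integer between 0 and 100 inclusive.
valid_percent() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -le 100 ]
}

# Example (made-up versions and threshold):
#   version_gt 1.10.0 1.9.0 && echo "version is higher"
#   valid_percent 85 || echo "threshold out of range"
```

A plain string comparison would rank 1.9.0 above 1.10.0, which is why the sketch compares versions component by component with sort -V instead.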

Root Causes

| Cause | Evidence | Resolution |
| --- | --- | --- |
| Invalid JSON syntax | CM logs show JSON parse errors | Fix the JSON syntax in the configuration document on the cloud side |
| Version not higher than current | CM logs show version comparison failure | Ensure the new configuration has a strictly higher version string than the currently installed one |
| Unknown formatVersion | CM logs show unsupported format version | Use a formatVersion compatible with the installed AosCore version |
| Node ID mismatch | Node config entry references a Node ID that doesn't exist in the Unit | Verify Node IDs in the configuration match the actual provisioned Node IDs |
| SM rejects Node configuration | SM logs show config validation errors on the target Node | Check that the Node configuration values are valid for the target Node's capabilities |
| Network failure during distribution | CM logs show timeout distributing config to Secondary Nodes | Ensure all target Nodes are connected; retry the configuration push |

Resolution

Unit configuration failures are non-destructive — the previously installed configuration remains active. To resolve:

  1. Identify the error from the unitStatus report or CM logs
  2. Fix the configuration on the cloud side (correct JSON, bump version, fix Node references)
  3. Push the corrected configuration — the CM will process the new version and report the updated status

The configuration state machine has three states:

  • absent — no configuration has been installed
  • installed — configuration successfully applied
  • failed — configuration could not be applied (error message describes why)

A failed configuration does not prevent subsequent configuration attempts. Pushing a new, corrected configuration with a higher version will be processed normally.