Version: v1.1

Connectivity Issues

Introduction

This page covers the most common connectivity problems encountered in AosEdge Units: loss of cloud connection, inter-Node communication failures, and Service Manager (SM) to Communication Manager (CM) connection issues. Each section describes the symptoms you will observe, the diagnostic steps to identify the root cause, and the resolution actions to restore connectivity.

Connectivity in AosCore involves three distinct communication paths:

Cloud connection — the CM maintains a WebSocket connection to AosCloud via a service discovery and TLS handshake flow
Inter-Node communication — the Message Proxy (MP) connects Nodes within a Unit using either VChan (Xen virtual channels) or TCP sockets
SM-to-CM connection — each SM on a Node registers with the CM via a gRPC stream for receiving desired-state updates and reporting instance status

Cloud WebSocket Disconnection

Symptoms

AosCloud dashboard shows the Unit as offline
No unitStatus messages arriving at the cloud
CM logs show repeated connection attempts or TLS errors
Desired-state updates are not being received by the Unit
Alerts and monitoring data stop flowing to the cloud

Diagnostic Steps

1. Check CM connection status in logs

journalctl -u aos-communicationmanager --since "10 minutes ago" | grep -i "connect\|disconnect\|error"

Look for messages indicating:

Service discovery failures (HTTP errors contacting the discovery endpoint)
TLS handshake failures (certificate errors, expired certificates)
WebSocket connection drops (unexpected close frames, read/write errors)
Reconnection attempts with exponential backoff

2. Verify service discovery endpoint reachability

The CM first contacts the service discovery URL to obtain the WebSocket connection endpoint. Check that the configured URL is reachable:

# Check DNS resolution for the service discovery host
nslookup <service-discovery-hostname>

# Test HTTPS connectivity to the service discovery endpoint
curl -v https://<service-discovery-url>

The service discovery URL is configured in the CM configuration file under serviceDiscoveryURL. An override URL may also be configured via overrideServiceDiscoveryURL.
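The hostname given to nslookup should come from the URL the CM actually uses. The sketch below shows one way to derive it. It assumes the override takes precedence whenever it is set, which this page does not state explicitly; `effective_discovery_url` and `url_host` are hypothetical helper names, not part of AosCore.

```shell
# Hypothetical helpers for illustration only.

# Print the URL the CM is assumed to use: the override when it is
# non-empty, otherwise the regular serviceDiscoveryURL value.
effective_discovery_url() {
    override=$1
    default=$2
    printf '%s\n' "${override:-$default}"
}

# Print the host part of a URL (IPv6 literals are not handled).
url_host() {
    rest=${1#*://}     # drop the scheme, if present
    rest=${rest%%/*}   # drop the path
    rest=${rest##*@}   # drop userinfo, if present
    printf '%s\n' "${rest%%:*}"   # drop the port
}
```

With both values copied from the CM configuration file, `nslookup "$(url_host "$(effective_discovery_url "$override" "$url")")"` resolves the host the CM would contact under that assumption.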

3. Verify certificate validity

The cloud connection uses mTLS with the "online" certificate type. Check that the certificate is present and not expired:

# Check certificate expiry
openssl x509 -in <online-cert-path> -noout -dates

# Verify the certificate chain against the CA
openssl verify -CAfile <ca-cert-path> <online-cert-path>

The CA certificate path is configured in the CM configuration under caCert. The certificate storage location is configured under certStorage.
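To catch a certificate that is still valid but close to expiry, the date check can be turned into a pass/fail test. This is a minimal sketch that assumes the certificate is available as a PEM file; `cert_expires_within` is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 if the certificate at $1 has expired or will
# expire within the next $2 days, exit 1 if it stays valid for that long,
# and exit 2 if the file cannot be read.
cert_expires_within() {
    cert=$1
    days=$2
    [ -r "$cert" ] || return 2
    # openssl's -checkend exits 0 when the certificate is still valid after
    # the given number of seconds, so the result is inverted here.
    # Note: a file that is readable but not a certificate also reports 0.
    if openssl x509 -in "$cert" -noout -checkend $((days * 86400)) >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

For example, `cert_expires_within <online-cert-path> 14 && echo "renewal due"` flags a certificate with less than two weeks of validity left.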

4. Check network connectivity

# Verify DNS resolution is working
cat /etc/resolv.conf
dig <cloud-hostname>

# Check if outbound HTTPS/WSS port is reachable
nc -zv <cloud-hostname> 443

# Check for firewall rules that drop or reject traffic
# (on nftables-based systems, inspect `nft list ruleset` instead)
iptables -L -n | grep -i "drop\|reject"

Root Causes

Cause	Log Evidence	Resolution
Certificate expired	`certificate has expired` or TLS handshake error	Trigger certificate renewal via IAM; check that the renewal notification flow from AosCloud is working
DNS resolution failure	`could not resolve host` or name resolution errors	Verify `/etc/resolv.conf` configuration; check network interface status
Service discovery unreachable	HTTP connection timeout or refused	Verify the `serviceDiscoveryURL` in CM config; check network routing to the cloud endpoint
Service discovery returns error	`RepeatLater` or `Error` in discovery response	Cloud-side issue — the discovery service may be temporarily unavailable; CM will retry with backoff
Service discovery redirect	`Redirect` response code	The Unit is being redirected to a different cloud endpoint — verify the redirect target is reachable
WebSocket connection dropped	Read/write errors on the WebSocket	Network instability between Unit and cloud; CM will automatically reconnect with exponential backoff (starting at 1 second, max 10 minutes)
Firewall blocking outbound traffic	Connection timeout with no response	Configure firewall rules to allow outbound HTTPS (port 443) to the cloud endpoint
Invalid system ID	Authentication failure during service discovery	Verify the Unit is properly provisioned and the system ID matches the cloud registration
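When the journal is long, the log evidence in the table above can be matched mechanically. The sketch below uses only the evidence strings quoted in the table. Actual log wording may differ between versions, the match is case-sensitive, and `classify_cloud_log_line` and its output labels are hypothetical.

```shell
# Hypothetical helper: map one CM log line to a cause from the table above.
# Patterns are the evidence strings quoted on this page; real log text may
# differ, so treat "unknown" as "not classified", not as "healthy".
classify_cloud_log_line() {
    case $1 in
        *"certificate has expired"*) echo "certificate-expired" ;;
        *"could not resolve host"*)  echo "dns-failure" ;;
        *RepeatLater*)               echo "discovery-error" ;;
        *Redirect*)                  echo "discovery-redirect" ;;
        *)                           echo "unknown" ;;
    esac
}
```

Feeding each line of the CM journal through the function and piping the labels into `sort | uniq -c` gives a quick count per suspected cause.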

Resolution Steps

Certificate expiry: If the online certificate has expired, the IAM component must issue a new certificate. Check IAM logs for certificate renewal activity. If automatic renewal failed, investigate the IAM-to-cloud certificate issuance flow.
Network issues: Restore network connectivity by fixing the DNS configuration, network interface status, or firewall rules. Once the network is restored, the CM reconnects automatically; no manual restart is needed.
Service discovery configuration: If the service discovery URL is incorrect, update the CM configuration file and restart the aos-communicationmanager service:
```
systemctl restart aos-communicationmanager
```
Persistent disconnection with valid network: If the network is reachable but the CM cannot establish a WebSocket connection, check the cloud-side service health. The CM retries with exponential backoff up to a maximum of 10 minutes between attempts.
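To judge whether the CM has already reached the maximum retry interval, it helps to know roughly how the delays grow. The sketch below assumes the delay doubles on every attempt; this page only states that the backoff is exponential, starts at 1 second, and is capped at 10 minutes. `backoff_delay` is a hypothetical helper name.

```shell
# Hypothetical helper: print the assumed delay in seconds before retry
# number $1 (0-based). Doubling is an assumption; the 1 s start and the
# 600 s cap come from this page.
backoff_delay() {
    attempt=$1
    delay=1
    max=600
    i=0
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$max" ]; then
        delay=$max
    fi
    echo "$delay"
}
```

Under the doubling assumption the cap is reached at the eleventh attempt (`backoff_delay 10` prints 600), so a Unit that has been offline for more than about 20 minutes may take up to 10 minutes to retry after the network comes back.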

Inter-Node Communication Failures

Symptoms

Secondary Nodes appear as disconnected in the Unit status
Services scheduled on secondary Nodes are not starting
CM cannot send desired-state updates to secondary Nodes
MP logs show transport connection errors
IAM Node registration stream is broken

Diagnostic Steps

1. Check Message Proxy logs

journalctl -u aos-messageproxy --since "10 minutes ago" | grep -i "connect\|error\|transport"

Look for:

Transport connection failures (socket connect errors or VChan initialization failures)
Read/write errors on the transport layer
Reconnection attempts

2. Identify the transport type

The MP uses one of two transport mechanisms depending on the platform:

VChan (Xen virtual channels) — used when Nodes are Xen domains on the same physical host
Socket (TCP) — used when Nodes communicate over a network

Check which transport is compiled in by looking at the MP configuration or build flags. VChan-based systems use libxenvchan for inter-domain communication.

3. Check transport-specific connectivity

For Socket transport:

# Check if the MP socket port is listening
ss -tlnp | grep <mp-port>

# Test connectivity from the secondary Node to the main Node
nc -zv <main-node-ip> <mp-port>

# Check network interface status
ip addr show
ip route show
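A plain grep for a port number also matches longer port numbers and process IDs that contain the same digits. The sketch below is a stricter check that parses `ss -tln` output and compares the port exactly; `port_is_listening` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ss -tln` (or `ss -tlnp`) output on stdin and
# succeed only if a TCP socket is listening on exactly port $1.
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            # Column 4 is the local address; the port follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

On the main Node, `ss -tln | port_is_listening <mp-port> && echo "MP port is listening"` confirms the listener without false positives from similar port numbers.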

For VChan transport:

# Check Xen domain status
xl list

# Verify VChan paths are accessible
ls /dev/xen/

# Check xenstore for VChan configuration
xenstore-ls

4. Check IAM Node registration

Each Node registers with the IAM on the main Node via a gRPC stream (RegisterNode RPC). If this stream is broken, the Node cannot authenticate or receive certificate updates:

journalctl -u aos-iam --since "10 minutes ago" | grep -i "node\|register\|connect"

5. Verify TLS credentials for inter-Node communication

The SM-to-CM gRPC connection and IAM registration streams use mTLS. Verify that the Node's certificates are valid:

# Check the Node's certificate
openssl x509 -in <node-cert-path> -noout -dates -subject

Root Causes

Cause	Log Evidence	Resolution
Socket transport — port unreachable	`Connection refused` on the MP port	Verify the MP service is running on the main Node; check firewall rules
Socket transport — network partition	`Connection timed out`	Check network connectivity between Nodes; verify IP routing
VChan transport — domain not running	VChan initialization error	Verify the target Xen domain is running (`xl list`)
VChan transport — path misconfigured	`Failed connect to transport`	Verify VChan path configuration matches the Xen domain setup
IAM registration stream broken	gRPC stream errors in IAM logs	Check IAM service health; verify TLS certificates for the IAM server
Certificate expired on secondary Node	TLS handshake failure	Trigger certificate renewal on the affected Node via IAM
MP service not running	No MP process found	Start the MP service: `systemctl start aos-messageproxy`

Resolution Steps

Socket connectivity: Verify network configuration between Nodes. Ensure the MP port is not blocked by firewalls. Restart the MP service if the socket is in a bad state:
```
systemctl restart aos-messageproxy
```
VChan connectivity: Verify that both Xen domains are running and the VChan paths are correctly configured. VChan requires both the read and write channels to be established.
IAM registration: If the Node registration stream is broken, restart the IAM service on the main Node and the IAM client on the secondary Node. The Node will re-register automatically:
```
# On the main Node
systemctl restart aos-iam
```
Certificate issues: If inter-Node TLS is failing due to expired certificates, the IAM must issue new certificates for the affected Node. Check the provisioning status of the Node.

SM-to-CM Connection Loss

Symptoms

CM reports a Node as disconnected even though the Node is physically reachable
Instance status updates from the affected Node stop arriving
CM cannot push desired-state changes to the affected Node's SM
SM logs show gRPC connection errors
CM SM controller logs show Node disconnection events

Diagnostic Steps

1. Check CM SM controller logs

journalctl -u aos-communicationmanager --since "10 minutes ago" | grep -i "node\|sm\|disconnect\|grpc"

Look for OnNodeDisconnected events indicating which Node lost its connection.
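Counting these events over a longer window can help separate a one-off disconnection from a connection that keeps flapping. This is a minimal sketch that assumes the event name appears literally in the log text; `count_node_disconnects` is a hypothetical helper name.

```shell
# Hypothetical helper: count the lines on stdin that mention
# OnNodeDisconnected. Prints 0 when there are none.
count_node_disconnects() {
    # grep -c exits non-zero when the count is 0; that is not an error here.
    grep -c "OnNodeDisconnected" || true
}
```

For example, `journalctl -u aos-communicationmanager --since "1 hour ago" | count_node_disconnects` prints the number of matching lines in the last hour.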

2. Check SM logs on the affected Node

journalctl -u aos-servicemanager --since "10 minutes ago" | grep -i "connect\|grpc\|error"

Look for:

gRPC connection failures to the CM server
TLS handshake errors
Stream read/write errors

3. Verify the CM gRPC server is running

The CM exposes a gRPC server (configured via cmServerURL) that SMs connect to via the RegisterSM streaming RPC:

# Check if the CM gRPC port is listening
ss -tlnp | grep <cm-grpc-port>

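# Test gRPC connectivity from the SM Node (expects a host:port address).
# -plaintext works only when the CM server runs in insecure mode, and `list` requires gRPC reflection on the server
grpcurl -plaintext <cm-server-url> list

# With mTLS enabled, pass the CA certificate and the SM client certificate and key instead
# (<sm-key-path> is a placeholder; this applies only if the key is available as a file)
grpcurl -cacert <ca-cert-path> -cert <sm-cert-path> -key <sm-key-path> <cm-server-url> list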

4. Check TLS certificate status

The SM-to-CM gRPC connection uses mTLS (unless configured for insecure mode). Verify certificates on both sides:

# On the SM Node — check the client certificate
openssl x509 -in <sm-cert-path> -noout -dates

# On the CM Node — check the server certificate
openssl x509 -in <cm-cert-path> -noout -dates

5. Check for certificate rotation in progress

When certificates are rotated, the CM gRPC server restarts with new credentials. During this brief window, connected SMs will be disconnected and must reconnect:

journalctl -u aos-communicationmanager | grep -i "cert\|restart\|credential"

After a certificate change, the CM schedules the gRPC server restart using a reconnect retry timeout of 10 seconds (cReconnectRetryTimeout).

Root Causes

Cause	Log Evidence	Resolution
CM gRPC server not listening	`Connection refused` from SM	Verify CM service is running; check `cmServerURL` configuration
TLS certificate mismatch	gRPC TLS handshake failure	Ensure SM and CM certificates are issued by the same CA; check certificate types
Certificate rotation restart	Brief disconnection followed by reconnection	Normal behavior — SM will reconnect automatically within seconds
Network partition between Nodes	gRPC stream timeout	Check inter-Node network connectivity (see Inter-Node section above)
SM service crashed	No SM process on the Node	Check SM logs for crash cause; restart: `systemctl restart aos-servicemanager`
CM server overloaded	gRPC deadline exceeded	Check CM resource usage; investigate if too many concurrent operations are blocking the server
IAM certificate provider failure	CM cannot load server credentials	Check IAM service health; verify certificate storage is accessible

Resolution Steps

SM reconnection: The SM automatically attempts to reconnect to the CM. If the connection was lost due to a transient network issue or certificate rotation, no manual intervention is needed — wait for the automatic reconnection.
Certificate issues: If TLS is failing, verify that both the SM and CM have valid certificates from the same trust chain. Check the IAM certificate handler for any pending or failed certificate operations.
CM server restart: If the CM gRPC server is in a bad state, restart the CM service:
```
systemctl restart aos-communicationmanager
```
All SMs will automatically reconnect after the restart.
SM crash recovery: If the SM has crashed, investigate the crash cause in the journal logs, then restart:
```
journalctl -u aos-servicemanager --since "1 hour ago" -p err
systemctl restart aos-servicemanager
```

General Connectivity Checklist

When facing any connectivity issue, work through this checklist:

Identify which connection path is affected — cloud (CM→AosCloud), inter-Node (MP transport), or SM-to-CM (gRPC)
Check the relevant service is running — systemctl status <service-name>
Check logs for error messages — journalctl -u <service-name> --since "10 minutes ago" -p err
Verify network reachability — DNS resolution, port accessibility, routing
Check certificate validity — expiry dates, CA chain, certificate type
Check for recent configuration changes — service discovery URL, server addresses, certificate storage paths
Verify the reconnection mechanism — most connections auto-recover; check if the retry backoff has reached its maximum
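The first three checklist items can be tied together with a small lookup from connection path to the systemd unit to inspect first. The unit names are the ones used on this page. The path labels and `unit_for_path` are hypothetical, and a given deployment may name its units differently.

```shell
# Hypothetical helper: print the systemd unit to inspect first for a
# connection path. Unit names are taken from this page.
unit_for_path() {
    case $1 in
        cloud)      echo "aos-communicationmanager" ;;
        inter-node) echo "aos-messageproxy" ;;
        sm-cm)      echo "aos-servicemanager" ;;
        *)          echo "unknown path: $1" >&2; return 1 ;;
    esac
}
```

For example, `unit=$(unit_for_path cloud) && systemctl status "$unit"` followed by `journalctl -u "$unit" --since "10 minutes ago" -p err` covers the service and log checks for the cloud path. For the sm-cm path, check the CM unit on the main Node as well.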

Related Pages

Architecture: Communication Manager — CM architecture and cloud communication responsibilities
Architecture: Communication Manager — SM Controller — how the CM manages SM connections
Security Model: Certificate Architecture — certificate types, rotation, and trust chains
Multi-Node: Node Lifecycle — Node registration and connectivity states
Cloud Communication — cloud protocol details and connection management

Troubleshooting: Node and Unit Health — diagnosing Node-level health issues that may cause connectivity symptoms

Configuration Reference — CM, SM, IAM, and MP configuration parameters
