Connectivity Issues
Introduction
This page covers the most common connectivity problems encountered in AosEdge Units: loss of cloud connection, inter-Node communication failures, and Service Manager (SM) to Communication Manager (CM) connection issues. Each section describes the symptoms you will observe, the diagnostic steps to identify the root cause, and the resolution actions to restore connectivity.
Connectivity in AosCore involves three distinct communication paths:
- Cloud connection — the CM maintains a WebSocket connection to AosCloud via a service discovery and TLS handshake flow
- Inter-Node communication — the Message Proxy (MP) connects Nodes within a Unit using either VChan (Xen virtual channels) or TCP sockets
- SM-to-CM connection — each SM on a Node registers with the CM via a gRPC stream for receiving desired-state updates and reporting instance status
Cloud WebSocket Disconnection
Symptoms
- AosCloud dashboard shows the Unit as offline
- No unitStatus messages arriving at the cloud
- CM logs show repeated connection attempts or TLS errors
- Desired-state updates are not being received by the Unit
- Alerts and monitoring data stop flowing to the cloud
Diagnostic Steps
1. Check CM connection status in logs
journalctl -u aos-communicationmanager --since "10 minutes ago" | grep -i "connect\|disconnect\|error"
Look for messages indicating:
- Service discovery failures (HTTP errors contacting the discovery endpoint)
- TLS handshake failures (certificate errors, expired certificates)
- WebSocket connection drops (unexpected close frames, read/write errors)
- Reconnection attempts with exponential backoff
2. Verify service discovery endpoint reachability
The CM first contacts the service discovery URL to obtain the WebSocket connection endpoint. Check that the configured URL is reachable:
# Check DNS resolution for the service discovery host
nslookup <service-discovery-hostname>
# Test HTTPS connectivity to the service discovery endpoint
curl -v https://<service-discovery-url>
The service discovery URL is configured in the CM configuration file under serviceDiscoveryURL. An override URL may
also be configured via overrideServiceDiscoveryURL.
3. Verify certificate validity
The cloud connection uses mTLS with the "online" certificate type. Check that the certificate is present and not expired:
# Check certificate expiry
openssl x509 -in <online-cert-path> -noout -dates
# Verify the certificate chain against the CA
openssl verify -CAfile <ca-cert-path> <online-cert-path>
The CA certificate path is configured in the CM configuration under caCert. The certificate storage location is
configured under certStorage.
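The expiry check above can be scripted so that it also warns before a certificate lapses. The following is a minimal sketch, not part of AosCore: the function name `cert_expires_within` and the 30-day window in the usage line are illustrative assumptions. It relies on `openssl x509 -checkend`, which exits with 0 when the certificate is still valid after the given number of seconds.

```shell
# Hypothetical helper (an assumption for illustration, not an AosCore tool).
# Exit codes: 0 = certificate expires (or has expired) within the window,
#             1 = certificate stays valid beyond the window,
#             2 = file is missing or is not a parseable certificate.
cert_expires_within() {
    # $1 = path to a PEM certificate, $2 = look-ahead window in days
    if ! openssl x509 -in "$1" -noout >/dev/null 2>&1; then
        echo "cannot parse certificate: $1" >&2
        return 2
    fi
    # -checkend N exits 0 if the certificate is still valid N seconds from now.
    if openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Usage (placeholder path as in the commands above):
# cert_expires_within <online-cert-path> 30 && echo "online certificate expires within 30 days"
```

A window of 0 days reduces this to a plain "has it already expired" check.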
4. Check network connectivity
# Verify DNS resolution is working
cat /etc/resolv.conf
dig <cloud-hostname>
# Check if outbound HTTPS/WSS port is reachable
nc -zv <cloud-hostname> 443
# Check for firewall rules blocking outbound traffic
iptables -L -n | grep -i drop
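The port check above needs a host and a port, while the endpoints involved are usually written as URLs. As a small convenience sketch, the hypothetical helper below splits a URL into the two values that `nc -zv` expects. It assumes the default port 443 for `https` and `wss` and 80 for `http` and `ws`, and it does not handle IPv6 literals or user-info parts.

```shell
# Hypothetical helper: prints "<host> <port>" for a URL.
# Default ports are assumed from the scheme when none is given.
url_host_port() {
    url=$1
    scheme=${url%%://*}      # text before "://"
    rest=${url#*://}         # text after "://"
    hostport=${rest%%/*}     # drop the path, if any
    case $hostport in
        *:*)
            host=${hostport%%:*}
            port=${hostport##*:}
            ;;
        *)
            host=$hostport
            case $scheme in
                https|wss) port=443 ;;
                http|ws)   port=80 ;;
                *)         return 1 ;;   # unknown scheme and no explicit port
            esac
            ;;
    esac
    echo "$host $port"
}

# Usage: feed the result to the reachability check.
# set -- $(url_host_port "https://<service-discovery-url>") && nc -zv "$1" "$2"
```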
Root Causes
| Cause | Log Evidence | Resolution |
|---|---|---|
| Certificate expired | certificate has expired or TLS handshake error | Trigger certificate renewal via IAM; check that the renewal notification flow from AosCloud is working |
| DNS resolution failure | could not resolve host or name resolution errors | Verify /etc/resolv.conf configuration; check network interface status |
| Service discovery unreachable | HTTP connection timeout or refused | Verify the serviceDiscoveryURL in CM config; check network routing to the cloud endpoint |
| Service discovery returns error | RepeatLater or Error in discovery response | Cloud-side issue — the discovery service may be temporarily unavailable; CM will retry with backoff |
| Service discovery redirect | Redirect response code | The Unit is being redirected to a different cloud endpoint — verify the redirect target is reachable |
| WebSocket connection dropped | Read/write errors on the WebSocket | Network instability between Unit and cloud; CM will automatically reconnect with exponential backoff (starting at 1 second, max 10 minutes) |
| Firewall blocking outbound traffic | Connection timeout with no response | Configure firewall rules to allow outbound HTTPS (port 443) to the cloud endpoint |
| Invalid system ID | Authentication failure during service discovery | Verify the Unit is properly provisioned and the system ID matches the cloud registration |
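When the journal is long, a first pass that buckets log lines by the table above can show which cause dominates. The sketch below is a hypothetical triage helper. Its patterns are copied from the "Log Evidence" column and are assumptions about the log wording, which may differ between versions, so treat the output as a hint rather than a diagnosis.

```shell
# Hypothetical triage helper: prints a likely cause label for one CM log line.
# Patterns mirror the "Log Evidence" column; actual log wording may differ.
classify_cm_log_line() {
    # Lowercase the line so that matching is case-insensitive.
    line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $line in
        *"certificate has expired"*)                      echo "certificate-expired" ;;
        *"could not resolve host"*|*"name resolution"*)   echo "dns-failure" ;;
        *"repeatlater"*)                                  echo "discovery-error" ;;
        *"redirect"*)                                     echo "discovery-redirect" ;;
        *"tls handshake"*)                                echo "tls-failure" ;;
        *"connection refused"*|*"timeout"*|*"timed out"*) echo "endpoint-unreachable" ;;
        *)                                                echo "unknown" ;;
    esac
}

# Usage: count the labels over the last 10 minutes of CM logs.
# journalctl -u aos-communicationmanager --since "10 minutes ago" -o cat |
#     while IFS= read -r l; do classify_cm_log_line "$l"; done | sort | uniq -c
```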
Resolution Steps
- Certificate expiry: If the online certificate has expired, the IAM component must issue a new certificate. Check IAM logs for certificate renewal activity. If automatic renewal failed, investigate the IAM-to-cloud certificate issuance flow.
- Network issues: Restore network connectivity by fixing DNS configuration, network interface status, or firewall rules. Once the network is restored, the CM reconnects automatically; no manual restart is needed.
- Service discovery configuration: If the service discovery URL is incorrect, update the CM configuration file and restart the aos-communicationmanager service:
    systemctl restart aos-communicationmanager
- Persistent disconnection with valid network: If the network is reachable but the CM cannot establish a WebSocket connection, check the cloud-side service health. The CM retries with exponential backoff up to a maximum of 10 minutes between attempts.
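To judge how long an automatic reconnect may take, it helps to see what a capped exponential backoff schedule looks like. The sketch below uses the 1-second start and 10-minute cap stated on this page. The doubling factor is an assumption, since the actual multiplier (and any jitter) used by the CM is not stated here.

```shell
# Sketch of a capped exponential backoff schedule.
# Start (1 s) and cap (600 s) come from this page; doubling is an assumption.
backoff_delay() {
    # $1 = zero-based attempt number; prints the delay in seconds.
    attempt=$1
    delay=1
    while [ "$attempt" -gt 0 ] && [ "$delay" -lt 600 ]; do
        delay=$((delay * 2))
        attempt=$((attempt - 1))
    done
    if [ "$delay" -gt 600 ]; then
        delay=600
    fi
    echo "$delay"
}

# Usage: print the delay before each of the first 12 attempts.
# i=0; while [ "$i" -lt 12 ]; do echo "attempt $i: $(backoff_delay "$i") s"; i=$((i + 1)); done
```

Under the doubling assumption the cap is reached at attempt index 10, so a Unit that has been offline for a while may need up to 10 minutes to reconnect after the network is restored.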
Inter-Node Communication Failures
Symptoms
- Secondary Nodes appear as disconnected in the Unit status
- Services scheduled on secondary Nodes are not starting
- CM cannot send desired-state updates to secondary Nodes
- MP logs show transport connection errors
- IAM Node registration stream is broken
Diagnostic Steps
1. Check Message Proxy logs
journalctl -u aos-messageproxy --since "10 minutes ago" | grep -i "connect\|error\|transport"
Look for:
- Transport connection failures (socket connect errors or VChan initialization failures)
- Read/write errors on the transport layer
- Reconnection attempts
2. Identify the transport type
The MP uses one of two transport mechanisms depending on the platform:
- VChan (Xen virtual channels) — used when Nodes are Xen domains on the same physical host
- Socket (TCP) — used when Nodes communicate over a network
Check which transport is compiled in by looking at the MP configuration or build flags. VChan-based systems use
libxenvchan for inter-domain communication.
3. Check transport-specific connectivity
For Socket transport:
# Check if the MP socket port is listening
ss -tlnp | grep <mp-port>
# Test connectivity from the secondary Node to the main Node
nc -zv <main-node-ip> <mp-port>
# Check network interface status
ip addr show
ip route show
For VChan transport:
# Check Xen domain status
xl list
# Verify VChan paths are accessible
ls /dev/xen/
# Check xenstore for VChan configuration
xenstore-ls
4. Check IAM Node registration
Each Node registers with the IAM on the main Node via a gRPC stream (RegisterNode RPC). If this stream is broken, the
Node cannot authenticate or receive certificate updates:
journalctl -u aos-iam --since "10 minutes ago" | grep -i "node\|register\|connect"
5. Verify TLS credentials for inter-Node communication
The SM-to-CM gRPC connection and IAM registration streams use mTLS. Verify that the Node's certificates are valid:
# Check the Node's certificate
openssl x509 -in <node-cert-path> -noout -dates -subject
Root Causes
| Cause | Log Evidence | Resolution |
|---|---|---|
| Socket transport — port unreachable | Connection refused on the MP port | Verify the MP service is running on the main Node; check firewall rules |
| Socket transport — network partition | Connection timed out | Check network connectivity between Nodes; verify IP routing |
| VChan transport — domain not running | VChan initialization error | Verify the target Xen domain is running (xl list) |
| VChan transport — path misconfigured | Failed connect to transport | Verify VChan path configuration matches the Xen domain setup |
| IAM registration stream broken | gRPC stream errors in IAM logs | Check IAM service health; verify TLS certificates for the IAM server |
| Certificate expired on secondary Node | TLS handshake failure | Trigger certificate renewal on the affected Node via IAM |
| MP service not running | No MP process found | Start the MP service: systemctl start aos-messageproxy |
Resolution Steps
- Socket connectivity: Verify network configuration between Nodes. Ensure the MP port is not blocked by firewalls. Restart the MP service if the socket is in a bad state:
    systemctl restart aos-messageproxy
- VChan connectivity: Verify that both Xen domains are running and the VChan paths are correctly configured. VChan requires both the read and write channels to be established.
- IAM registration: If the Node registration stream is broken, restart the IAM service on the main Node and the IAM client on the secondary Node. The Node will re-register automatically:
    # On the main Node
    systemctl restart aos-iam
- Certificate issues: If inter-Node TLS is failing due to expired certificates, the IAM must issue new certificates for the affected Node. Check the provisioning status of the Node.
SM-to-CM Connection Loss
Symptoms
- CM reports a Node as disconnected even though the Node is physically reachable
- Instance status updates from the affected Node stop arriving
- CM cannot push desired-state changes to the affected Node's SM
- SM logs show gRPC connection errors
- CM SM controller logs show Node disconnection events
Diagnostic Steps
1. Check CM SM controller logs
journalctl -u aos-communicationmanager --since "10 minutes ago" | grep -i "node\|sm\|disconnect\|grpc"
Look for OnNodeDisconnected events indicating which Node lost its connection.
2. Check SM logs on the affected Node
journalctl -u aos-servicemanager --since "10 minutes ago" | grep -i "connect\|grpc\|error"
Look for:
- gRPC connection failures to the CM server
- TLS handshake errors
- Stream read/write errors
3. Verify the CM gRPC server is running
The CM exposes a gRPC server (configured via cmServerURL) that SMs connect to via the RegisterSM streaming RPC:
# Check if the CM gRPC port is listening
ss -tlnp | grep <cm-grpc-port>
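# Test gRPC connectivity from the SM Node.
# -plaintext only works if the CM server runs in insecure mode; "list" also requires server reflection.
grpcurl -plaintext <cm-server-url> list
# With mTLS enabled, pass the CA and client credentials instead of -plaintext
grpcurl -cacert <ca-cert-path> -cert <sm-cert-path> -key <sm-key-path> <cm-server-url> list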
4. Check TLS certificate status
The SM-to-CM gRPC connection uses mTLS (unless configured for insecure mode). Verify certificates on both sides:
# On the SM Node — check the client certificate
openssl x509 -in <sm-cert-path> -noout -dates
# On the CM Node — check the server certificate
openssl x509 -in <cm-cert-path> -noout -dates
5. Check for certificate rotation in progress
When certificates are rotated, the CM gRPC server restarts with new credentials. During this brief window, connected SMs will be disconnected and must reconnect:
journalctl -u aos-communicationmanager | grep -i "cert\|restart\|credential"
The CM has a reconnect retry timeout of 10 seconds (cReconnectRetryTimeout) for scheduling server restarts after
certificate changes.
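Because a rotation-triggered disconnection is expected to clear on its own, it can be useful to wait for recovery before escalating. The sketch below is a hypothetical polling helper: the function name, the 30-second window, and the 2-second interval in the usage line are illustrative choices, not values taken from AosCore.

```shell
# Hypothetical helper: re-runs a check command until it succeeds or the
# timeout elapses. Returns 0 on success, 1 on timeout.
# The elapsed time counts only the sleeps, not the run time of the check.
retry_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}

# Usage: wait up to 30 s for the CM gRPC port to accept connections again.
# retry_until 30 2 nc -z <cm-host> <cm-grpc-port> || echo "CM gRPC server did not come back"
```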
Root Causes
| Cause | Log Evidence | Resolution |
|---|---|---|
| CM gRPC server not listening | Connection refused from SM | Verify CM service is running; check cmServerURL configuration |
| TLS certificate mismatch | gRPC TLS handshake failure | Ensure SM and CM certificates are issued by the same CA; check certificate types |
| Certificate rotation restart | Brief disconnection followed by reconnection | Normal behavior — SM will reconnect automatically within seconds |
| Network partition between Nodes | gRPC stream timeout | Check inter-Node network connectivity (see Inter-Node section above) |
| SM service crashed | No SM process on the Node | Check SM logs for crash cause; restart: systemctl restart aos-servicemanager |
| CM server overloaded | gRPC deadline exceeded | Check CM resource usage; investigate if too many concurrent operations are blocking the server |
| IAM certificate provider failure | CM cannot load server credentials | Check IAM service health; verify certificate storage is accessible |
Resolution Steps
- SM reconnection: The SM automatically attempts to reconnect to the CM. If the connection was lost due to a transient network issue or certificate rotation, no manual intervention is needed; wait for the automatic reconnection.
- Certificate issues: If TLS is failing, verify that both the SM and CM have valid certificates from the same trust chain. Check the IAM certificate handler for any pending or failed certificate operations.
- CM server restart: If the CM gRPC server is in a bad state, restart the CM service:
    systemctl restart aos-communicationmanager
  All SMs will automatically reconnect after the restart.
- SM crash recovery: If the SM has crashed, investigate the crash cause in the journal logs, then restart:
    journalctl -u aos-servicemanager --since "1 hour ago" -p err
    systemctl restart aos-servicemanager
General Connectivity Checklist
When facing any connectivity issue, work through this checklist:
- Identify which connection path is affected: cloud (CM→AosCloud), inter-Node (MP transport), or SM-to-CM (gRPC)
- Check the relevant service is running: systemctl status <service-name>
- Check logs for error messages: journalctl -u <service-name> --since "10 minutes ago" -p err
- Verify network reachability: DNS resolution, port accessibility, routing
- Check certificate validity: expiry dates, CA chain, certificate type
- Check for recent configuration changes: service discovery URL, server addresses, certificate storage paths
- Verify the reconnection mechanism: most connections auto-recover; check if the retry backoff has reached its maximum
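The first steps of the checklist can be captured in a small lookup that maps a connection path to the service to inspect. This is a sketch only: the path labels (`cloud`, `inter-node`, `sm-cm`, `iam`) are made up for illustration, while the systemd unit names are the ones used on this page. For the SM-to-CM path the CM side (aos-communicationmanager) should be checked as well.

```shell
# Maps a hypothetical connection-path label to the systemd unit used on this page.
service_for_path() {
    case $1 in
        cloud)      echo "aos-communicationmanager" ;;
        inter-node) echo "aos-messageproxy" ;;
        sm-cm)      echo "aos-servicemanager" ;;
        iam)        echo "aos-iam" ;;
        *)          return 1 ;;
    esac
}

# Prints the status and log commands for a path without running them,
# so they can be reviewed or pasted on the affected Node.
first_checks() {
    svc=$(service_for_path "$1") || { echo "unknown path: $1" >&2; return 1; }
    echo "systemctl status $svc"
    echo "journalctl -u $svc --since \"10 minutes ago\" -p err"
}

# Usage:
# first_checks inter-node
```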
Related Pages
- Architecture: Communication Manager — CM architecture and cloud communication responsibilities
- Architecture: Communication Manager — SM Controller — how the CM manages SM connections
- Security Model: Certificate Architecture — certificate types, rotation, and trust chains
- Multi-Node: Node Lifecycle — Node registration and connectivity states
- Cloud Communication — cloud protocol details and connection management
- Troubleshooting: Node and Unit Health — diagnosing Node-level health issues that may cause connectivity symptoms
- Configuration Reference — CM, SM, IAM, and MP configuration parameters