Connection Management
Introduction
This page documents how the Communication Manager (CM) establishes, maintains, and recovers the persistent WebSocket connection to the AosCloud backend. Connection management is a critical subsystem — it determines how quickly a Unit can recover from network disruptions and resume normal cloud communication.
The connection lifecycle involves three distinct phases: service discovery (obtaining the WebSocket URL), WebSocket establishment (authenticating and opening the connection), and steady-state operation (receiving frames, responding to pings, and detecting failures). When the connection drops, an exponential backoff reconnection strategy ensures the Unit recovers without overwhelming the cloud infrastructure.
Connection Lifecycle
The CM maintains a single persistent WebSocket connection to the cloud. The connection handler thread runs continuously while the CM is active, cycling through these states:
┌─────────────────────────────────────────────────────────────────┐
│ Connection Handler Loop │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Service │───▶│ WebSocket │───▶│ Receive Frames │ │
│ │ Discovery │ │ Establishment│ │ (steady state) │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ ▲ │ │
│ │ Reconnect with │ │
│ └─────────── exponential backoff ◀────────┘ │
│ (on failure/disconnect) │
└─────────────────────────────────────────────────────────────────┘
The handler thread (HandleConnection) loops indefinitely:
- Attempt to connect to the cloud using the Retry utility with exponential backoff
- On success, enter the frame-receiving loop (ReceiveFrames)
- When the connection drops (close frame, error, or timeout), disconnect and return to the first step
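The loop above can be sketched as follows. This is a minimal illustration rather than the CM's actual code: `Transport`, `RunHandlerLoop`, and `triesPerCycle` are hypothetical names, and the retry helper is reduced to a plain loop without delays.

```cpp
#include <cassert>
#include <functional>

// Hypothetical callbacks standing in for the CM's connect, receive, and
// disconnect operations. The names are illustrative assumptions.
struct Transport {
    std::function<bool()> connect;        // one connection attempt, true on success
    std::function<void()> receiveFrames;  // returns when the connection drops
    std::function<void()> disconnect;     // releases the connection
};

// Runs the handler loop while isRunning() returns true and reports how many
// connect/receive/disconnect rounds completed.
int RunHandlerLoop(const Transport& transport, const std::function<bool()>& isRunning, int triesPerCycle)
{
    assert(triesPerCycle > 0);

    int sessions = 0;

    while (isRunning()) {
        bool connected = false;

        // One retry cycle. Backoff delays are omitted to keep the sketch short.
        for (int i = 0; i < triesPerCycle && !connected && isRunning(); ++i) {
            connected = transport.connect();
        }

        if (!connected) {
            continue;  // every attempt failed: start a new cycle
        }

        transport.receiveFrames();  // steady state until the connection drops
        transport.disconnect();     // clean up, then go back to connecting
        ++sessions;
    }

    return sessions;
}
```

Passing the operations in as callbacks keeps the control flow visible and lets the loop be exercised without a network.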
Service Discovery
Before establishing a WebSocket connection, the Unit must discover the cloud endpoint URL. Service discovery is a lightweight HTTPS POST request, separate from the WebSocket protocol itself.
Request Format
The Unit sends an HTTP POST to the configured service discovery endpoint (e.g., /sd/v7/) with a JSON body:
{
"version": 7,
"systemId": "<unit-system-id>",
"supportedProtocols": ["wss"]
}
| Field | Type | Description |
|---|---|---|
| version | integer | Protocol version (currently 7) |
| systemId | string | The Unit's unique system identifier |
| supportedProtocols | string[] | List of protocols the Unit supports (typically ["wss"]) |
The service discovery URL is obtained from the Unit's online certificate (via CryptoHelper::GetServiceDiscoveryURLs)
or can be overridden in the CM configuration (mOverrideServiceDiscoveryURL).
Response Format
The cloud responds with connection details:
{
"version": 7,
"systemId": "<unit-system-id>",
"connectionInfo": ["wss://cloud.example.com/ws/v7/"],
"authToken": "<bearer-token>",
"errorCode": 0,
"nextRequestDelay": null
}
| Field | Type | Description |
|---|---|---|
| version | integer | Protocol version echo |
| systemId | string | Unit system ID echo |
| connectionInfo | string[] | List of WebSocket URLs to connect to |
| authToken | string | Bearer token for WebSocket authentication |
| errorCode | integer | Result code (see below) |
| nextRequestDelay | integer or null | Delay in milliseconds before the next discovery request (overrides the default reconnect timeout) |
Error Codes
| Code | Name | Meaning |
|---|---|---|
| 0 | NoError | Success: connectionInfo and authToken are populated |
| 1 | Redirect | connectionInfo contains a new service discovery URL to use instead |
| 2 | RepeatLater | The cloud is temporarily unavailable; retry after nextRequestDelay |
| -1 | Error | General error: the Unit should retry with backoff |
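One way to turn the error code into a follow-up step is a small dispatch function. The sketch below is an assumption-laden illustration: the enum values come from the table, but `DiscoveryAction`, `DecideAction`, and the handling of unknown codes are hypothetical.

```cpp
#include <cassert>

// Numeric values follow the error code table above.
enum class DiscoveryErrorCode { Error = -1, NoError = 0, Redirect = 1, RepeatLater = 2 };

// Hypothetical follow-up actions; these names do not come from the CM sources.
enum class DiscoveryAction {
    Connect,             // use connectionInfo and authToken
    RediscoverAtNewUrl,  // connectionInfo holds a new service discovery URL
    RetryAfterDelay,     // wait for nextRequestDelay, then repeat discovery
    RetryWithBackoff     // general failure
};

// Maps the raw errorCode field of a discovery response to the next step.
DiscoveryAction DecideAction(int errorCode)
{
    switch (static_cast<DiscoveryErrorCode>(errorCode)) {
    case DiscoveryErrorCode::NoError:
        return DiscoveryAction::Connect;
    case DiscoveryErrorCode::Redirect:
        return DiscoveryAction::RediscoverAtNewUrl;
    case DiscoveryErrorCode::RepeatLater:
        return DiscoveryAction::RetryAfterDelay;
    case DiscoveryErrorCode::Error:
        return DiscoveryAction::RetryWithBackoff;
    }

    // Assumption: codes outside the table are treated like a general error.
    return DiscoveryAction::RetryWithBackoff;
}
```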
Discovery Caching
The discovered WebSocket URL and auth token are cached in memory. The CM reuses the cached connection info across reconnections until:
- The WebSocket connection fails with an authorization error (HTTP 401) — this clears the cached discovery response and forces a fresh service discovery request
- All URLs in connectionInfo have been exhausted without a successful connection
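The two invalidation rules can be illustrated with a small in-memory cache. This is a sketch under stated assumptions: `DiscoveryCache` and its methods are hypothetical names, and restarting from the first URL after a successful connection is a guess at reasonable behavior rather than something the page specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache for the last discovery response.
class DiscoveryCache {
public:
    void Store(std::vector<std::string> urls, std::string authToken)
    {
        assert(!urls.empty());
        mUrls = std::move(urls);
        mAuthToken = std::move(authToken);
        mNextUrl = 0;
    }

    bool HasEntry() const { return !mUrls.empty(); }
    const std::string& AuthToken() const { return mAuthToken; }

    // Hands out the next untried URL. Once every URL has been tried (or the
    // cache is empty), the cache is cleared and false is returned, which
    // signals that a fresh discovery request is needed.
    bool NextUrl(std::string& url)
    {
        if (mNextUrl >= mUrls.size()) {
            Clear();
            return false;
        }

        url = mUrls[mNextUrl++];
        return true;
    }

    // Assumption: after a successful connection the next reconnect starts
    // again from the first cached URL.
    void OnConnected() { mNextUrl = 0; }

    // An authorization error (HTTP 401) invalidates the cached response.
    void OnUnauthorized() { Clear(); }

private:
    void Clear()
    {
        mUrls.clear();
        mAuthToken.clear();
        mNextUrl = 0;
    }

    std::vector<std::string> mUrls;
    std::string mAuthToken;
    std::size_t mNextUrl = 0;
};
```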
WebSocket Establishment
Once service discovery provides a WebSocket URL and auth token, the CM establishes the connection:
- Create TLS session: The CM creates an HTTPS client session using the Unit's online certificate for mutual TLS (mTLS) authentication
- Set authorization header: The bearer token from service discovery is added as Authorization: Bearer <authToken>
- WebSocket upgrade: The CM sends an HTTP Upgrade request to transition the HTTPS connection to a WebSocket connection
- Configure socket: Keep-alive is enabled; receive timeout is set to infinite (blocking read)
Authentication
WebSocket connections are authenticated at two levels:
- mTLS (transport level): The Unit presents its online certificate; the cloud validates it against the CA chain
- Bearer token (application level): The auth token from service discovery is sent in the Authorization header during the WebSocket handshake
If the WebSocket handshake returns an unauthorized error, the CM clears the cached discovery response and performs a fresh service discovery on the next connection attempt.
Frame Format
All messages are sent as binary WebSocket frames containing UTF-8 JSON. The CM uses the Poco WebSocket library with
FRAME_OP_BINARY for all outgoing messages.
Reconnection Strategy
When the connection fails or drops, the CM uses an exponential backoff strategy to reconnect without overwhelming the cloud infrastructure.
Backoff Parameters
| Parameter | Value | Source |
|---|---|---|
| Initial delay | 1 second | cReconnectTimeout constant |
| Maximum delay | 10 minutes | cMaxReconnectTimeout constant |
| Retries per cycle | 5 | cReconnectTries constant |
| Backoff multiplier | 2× | Doubling on each retry |
Backoff Sequence
Each reconnection cycle attempts up to 5 connections with exponentially increasing delays:
| Attempt | Delay Before Attempt |
|---|---|
| 1 | 0 (immediate) |
| 2 | 1 second |
| 3 | 2 seconds |
| 4 | 4 seconds |
| 5 | 8 seconds |
If all 5 attempts in a cycle fail, the handler immediately starts a new cycle (the outer while (mIsRunning) loop in
HandleConnection). The delay resets to the initial value at the start of each cycle unless the service discovery
response specified a nextRequestDelay, which overrides the initial delay.
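The delay schedule can be reproduced with a few lines of arithmetic. The sketch below takes the constant names and values from the parameter table; the helper functions `DelayBeforeAttempt` and `InitialDelay`, and the use of `std::chrono::milliseconds`, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Values from the backoff parameter table.
constexpr milliseconds cReconnectTimeout{1000};       // 1 second
constexpr milliseconds cMaxReconnectTimeout{600000};  // 10 minutes
constexpr int cReconnectTries = 5;

// Delay waited before the given 1-based attempt within one cycle: the first
// attempt is immediate, then the delay doubles up to the maximum.
milliseconds DelayBeforeAttempt(int attempt, milliseconds initialDelay = cReconnectTimeout)
{
    assert(attempt >= 1);

    if (attempt == 1) {
        return milliseconds{0};
    }

    milliseconds delay = initialDelay;

    for (int i = 2; i < attempt && delay < cMaxReconnectTimeout; ++i) {
        delay *= 2;
    }

    return std::min(delay, cMaxReconnectTimeout);
}

// Initial delay for the next cycle. A nextRequestDelay of zero or less stands
// for "not provided" in this sketch.
milliseconds InitialDelay(milliseconds nextRequestDelay)
{
    return nextRequestDelay.count() > 0 ? nextRequestDelay : cReconnectTimeout;
}
```

With the default constants, attempts 1 through `cReconnectTries` wait 0, 1, 2, 4, and 8 seconds, which matches the table. The 10 minute cap only comes into play when a large `nextRequestDelay` raises the initial delay.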
Delay Override
The service discovery response can override the reconnect timeout via the nextRequestDelay field. When present and
greater than zero, this value replaces the default initial delay for subsequent reconnection attempts. This allows the
cloud to implement server-side throttling — for example, instructing overloaded Units to back off for a longer period.
Connection Failure Scenarios
| Scenario | Behavior |
|---|---|
| Network unreachable | Retry with exponential backoff |
| DNS resolution failure | Retry with exponential backoff |
| TLS handshake failure | Retry with exponential backoff |
| WebSocket upgrade rejected (401) | Clear discovery cache, re-discover, then retry |
| Cloud sends close frame | Disconnect gracefully, reconnect immediately |
| Receive timeout / connection reset | Disconnect, reconnect with backoff |
Keepalive and Ping/Pong
The WebSocket connection uses the standard WebSocket ping/pong mechanism for keepalive:
- The cloud sends ping frames to verify the Unit is still reachable
- The Unit responds with pong frames immediately upon receiving a ping
- The Unit's socket has keep-alive enabled at the TCP level (setKeepAlive(true))
The CM does not initiate ping frames — it relies on the cloud-side keepalive mechanism. If the cloud stops sending pings and the connection silently drops, the Unit detects the failure when the next send operation fails or when the receive loop returns zero bytes.
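The per-frame decision in the receive loop can be sketched as a pure function. The opcode values are those defined by RFC 6455; `FrameAction` and `ClassifyFrame` are hypothetical names, and treating a zero-byte read without a control opcode as a closed connection is an assumption based on the description above.

```cpp
#include <cassert>

// WebSocket opcodes as defined by RFC 6455.
enum class Opcode : unsigned {
    Continuation = 0x0,
    Text = 0x1,
    Binary = 0x2,
    Close = 0x8,
    Ping = 0x9,
    Pong = 0xA
};

// Hypothetical outcomes of inspecting one received frame.
enum class FrameAction { Dispatch, SendPong, Ignore, Disconnect };

FrameAction ClassifyFrame(int bytesReceived, Opcode opcode)
{
    assert(bytesReceived >= 0);

    if (opcode == Opcode::Close) {
        return FrameAction::Disconnect;
    }

    // Control frames are checked before the byte count because a ping may
    // legitimately carry an empty payload.
    if (opcode == Opcode::Ping) {
        return FrameAction::SendPong;
    }

    if (opcode == Opcode::Pong) {
        return FrameAction::Ignore;
    }

    // Assumption: zero bytes on a non-control frame means the peer is gone.
    if (bytesReceived == 0) {
        return FrameAction::Disconnect;
    }

    return FrameAction::Dispatch;  // hand the payload to the message handlers
}
```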
Connection State Notifications
The CM implements a publish-subscribe pattern for connection state changes. Other CM modules (Update Manager, SM Controller, etc.) subscribe to connection events to coordinate their behavior.
Subscriber Interface
class ConnectionListenerItf {
public:
virtual void OnConnect() = 0;
virtual void OnDisconnect() = 0;
};
Notification Behavior
| Event | Trigger | Subscriber Action |
|---|---|---|
| OnConnect() | WebSocket connection successfully established | Subscribers can begin sending messages (e.g., send pending unit status) |
| OnDisconnect() | Connection closed or lost | Subscribers should stop sending and queue messages for later delivery |
Notifications are delivered synchronously under a mutex lock. All registered subscribers receive the notification before the connection handler proceeds.
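A minimal sketch of synchronous delivery under a mutex might look like this. `ConnectionListenerItf` mirrors the interface shown earlier (the virtual destructor is added here for safe polymorphic use); `ConnectionNotifier` and its methods are hypothetical names, not the CM's own.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <vector>

class ConnectionListenerItf {
public:
    virtual ~ConnectionListenerItf() = default;
    virtual void OnConnect() = 0;
    virtual void OnDisconnect() = 0;
};

// Hypothetical publisher: every listener is called while the lock is held,
// so all of them have been notified before the caller continues.
class ConnectionNotifier {
public:
    void Subscribe(ConnectionListenerItf& listener)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (std::find(mListeners.begin(), mListeners.end(), &listener) == mListeners.end()) {
            mListeners.push_back(&listener);
        }
    }

    void Unsubscribe(ConnectionListenerItf& listener)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mListeners.erase(std::remove(mListeners.begin(), mListeners.end(), &listener), mListeners.end());
    }

    void NotifyConnect()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mConnected = true;
        for (ConnectionListenerItf* listener : mListeners) {
            listener->OnConnect();
        }
    }

    void NotifyDisconnect()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mConnected = false;
        for (ConnectionListenerItf* listener : mListeners) {
            listener->OnDisconnect();
        }
    }

    bool IsConnected() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mConnected;
    }

private:
    mutable std::mutex mMutex;
    std::vector<ConnectionListenerItf*> mListeners;
    bool mConnected = false;
};
```

One consequence of this design is that a listener must not call back into the notifier from inside `OnConnect()` or `OnDisconnect()`: with a non-recursive mutex, as used in this sketch, that would deadlock.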
Connection State Query
Any module can check the current connection state via IsConnected(), which returns true only while the WebSocket
connection is active and no disconnect notification has been sent.
Message Delivery Guarantees
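The connection management layer retransmits important messages until the cloud acknowledges them or the retry limit is reached. This gives at-least-once delivery up to that limit; a message that exhausts its retries is dropped.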
Send Queue
Outgoing messages are placed in a send queue and transmitted by a dedicated sender thread. Messages remain in the queue until successfully written to the WebSocket.
Acknowledgment Tracking
Messages sent with SendPolicy::eExpectAck are tracked after transmission:
- The message is moved to a "sent but unacknowledged" map
- A background thread monitors unacknowledged messages
- If no ack is received within mCloudResponseWaitTimeout, the message is re-enqueued for retransmission
- Each message allows up to 3 retry attempts before being dropped with an error log
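The tracking steps above can be sketched with a map keyed by message ID. Everything here is illustrative: `AckTracker`, `PendingMessage`, and the numeric message IDs are hypothetical, the clock is passed in explicitly so the logic can be checked without waiting, and a retransmission is assumed to happen as soon as the timeout is detected.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

struct PendingMessage {
    std::string payload;
    Clock::time_point sentAt;
    int retriesLeft;
};

// Hypothetical tracker for messages sent with an acknowledgment expected.
class AckTracker {
public:
    explicit AckTracker(Clock::duration timeout, int maxRetries = 3)
        : mTimeout(timeout), mMaxRetries(maxRetries)
    {
        assert(maxRetries >= 0);
    }

    // Called after a message has been written to the WebSocket.
    void Track(std::uint64_t id, std::string payload, Clock::time_point now)
    {
        mUnacked[id] = PendingMessage{std::move(payload), now, mMaxRetries};
    }

    void OnAck(std::uint64_t id) { mUnacked.erase(id); }

    // Returns the payloads that timed out and should be re-enqueued. Messages
    // with no retries left are dropped (a real implementation would log this).
    std::vector<std::string> CollectExpired(Clock::time_point now)
    {
        std::vector<std::string> resend;

        for (auto it = mUnacked.begin(); it != mUnacked.end();) {
            PendingMessage& msg = it->second;

            if (now - msg.sentAt < mTimeout) {
                ++it;
            } else if (msg.retriesLeft == 0) {
                it = mUnacked.erase(it);
            } else {
                --msg.retriesLeft;
                msg.sentAt = now;  // assumption: retransmitted right away
                resend.push_back(msg.payload);
                ++it;
            }
        }

        return resend;
    }

    std::size_t Size() const { return mUnacked.size(); }

private:
    Clock::duration mTimeout;
    int mMaxRetries;
    std::map<std::uint64_t, PendingMessage> mUnacked;
};
```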
Nack Handling
When the cloud sends a nack response:
- The original message is removed from the unacknowledged map
- It is re-enqueued with a future timestamp based on the retryAfter field from the nack
- The message will be resent after the specified delay
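Delayed re-enqueueing can be sketched with a queue ordered by earliest send time. `DelayedSendQueue` and `HandleNack` are hypothetical names, and the unit of `retryAfter` (milliseconds here) is an assumption, since this page does not state it.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical send queue ordered by the earliest time a message may be sent.
class DelayedSendQueue {
public:
    void Enqueue(std::string payload, Clock::time_point notBefore)
    {
        mQueue.emplace(notBefore, std::move(payload));
    }

    // Pops the oldest message whose timestamp has been reached.
    bool PopReady(Clock::time_point now, std::string& payload)
    {
        if (mQueue.empty() || mQueue.begin()->first > now) {
            return false;
        }

        payload = std::move(mQueue.begin()->second);
        mQueue.erase(mQueue.begin());
        return true;
    }

private:
    std::multimap<Clock::time_point, std::string> mQueue;
};

// Moves a nacked message from the unacknowledged map back into the send
// queue, due at now + retryAfter. Returns false for unknown message IDs.
bool HandleNack(std::map<std::uint64_t, std::string>& unacked, DelayedSendQueue& queue, std::uint64_t id,
    std::chrono::milliseconds retryAfter, Clock::time_point now)
{
    auto it = unacked.find(id);
    if (it == unacked.end()) {
        return false;
    }

    queue.Enqueue(std::move(it->second), now + retryAfter);
    unacked.erase(it);
    return true;
}
```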
Certificate Rotation and Reconnection
Certificate rotation can trigger a reconnection cycle:
- The cloud sends a renewCertificatesNotification message over the active WebSocket
- The CM generates new CSRs and sends an issueUnitCertificates request
- The cloud responds with issuedUnitCertificates containing new certificate chains
- The CM installs the new certificates via the IAM Certificate Handler
- If the online certificate (used for mTLS) is rotated, the existing TLS session becomes invalid
- The connection drops, triggering the normal reconnection flow, which now uses the new certificate
The CM installs certificates in a fixed order: secondary node certificates first, then the main node's non-IAM certificates, and the main node's IAM certificate last. This ordering prevents IAM service restarts from interrupting the installation of the remaining certificates.
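The installation order can be expressed as a stable sort on a three-level rank. This is a sketch only: `CertInfo`, `InstallRank`, and `SortForInstall` are hypothetical names, and the certificate type string "iam" is assumed for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one certificate to install.
struct CertInfo {
    std::string nodeId;
    std::string certType;  // for example "iam" or "online" (assumed type names)
};

// 0: secondary node, 1: main node non-IAM, 2: main node IAM (installed last).
int InstallRank(const CertInfo& cert, const std::string& mainNodeId)
{
    if (cert.nodeId != mainNodeId) {
        return 0;
    }

    return cert.certType == "iam" ? 2 : 1;
}

// Reorders the certificates for installation. The sort is stable, so
// certificates with the same rank keep their original relative order.
void SortForInstall(std::vector<CertInfo>& certs, const std::string& mainNodeId)
{
    assert(!mainNodeId.empty());

    std::stable_sort(certs.begin(), certs.end(), [&mainNodeId](const CertInfo& a, const CertInfo& b) {
        return InstallRank(a, mainNodeId) < InstallRank(b, mainNodeId);
    });
}
```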
Thread Architecture
The connection management subsystem uses multiple threads for concurrent operation:
| Thread | Role |
|---|---|
| Connection handler | Manages the connection lifecycle (discovery → connect → receive loop → reconnect) |
| Send queue handler | Dequeues messages and writes them to the WebSocket |
| Unacknowledged message handler | Monitors sent messages for timeout and re-enqueues them |
| Message handler pool (4 threads) | Parses and processes received messages concurrently |
All threads coordinate via a shared mutex and condition variable, and terminate cleanly when Stop() is called.
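The shutdown coordination can be illustrated with a small worker group. This is a simplified sketch, not the CM's thread layout: `WorkerGroup` is a hypothetical class whose workers do nothing but wait for the stop signal, which is enough to show how one mutex and one condition variable let `Stop()` wake and join every thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical group of worker threads sharing one mutex and one condition
// variable for shutdown.
class WorkerGroup {
public:
    explicit WorkerGroup(int count)
    {
        assert(count > 0);
        for (int i = 0; i < count; ++i) {
            mThreads.emplace_back([this]() { Run(); });
        }
    }

    ~WorkerGroup() { Stop(); }

    // Clears the running flag under the lock, wakes every waiting thread,
    // and joins them. Calling Stop() more than once is harmless.
    void Stop()
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mIsRunning = false;
        }

        mCondVar.notify_all();

        for (std::thread& thread : mThreads) {
            if (thread.joinable()) {
                thread.join();
            }
        }
    }

    int Finished() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mFinished;
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(mMutex);
        // A real worker would also wake up for queued work; this one only
        // waits for shutdown.
        mCondVar.wait(lock, [this]() { return !mIsRunning; });
        ++mFinished;
    }

    mutable std::mutex mMutex;
    std::condition_variable mCondVar;
    bool mIsRunning = true;
    int mFinished = 0;
    std::vector<std::thread> mThreads;
};
```

Changing the flag while holding the mutex, and waiting with a predicate, avoids the lost-wakeup race where a thread checks the flag just before `notify_all()` and then sleeps forever.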
Related Pages
- Cloud Communication — overview of the Unit-to-Cloud communication architecture and protocol
- Cloud Protocol Reference — detailed message schemas and field definitions
- Communication Manager — the CM component that implements cloud communication
- CM Cloud Communication — CM's cloud communication submodule overview
- Certificate Architecture — certificate hierarchy and the online certificate used for mTLS
- Connectivity Issues — troubleshooting guide for connection problems