Version: v1.1

Connection Management

Introduction

This page documents how the Communication Manager (CM) establishes, maintains, and recovers the persistent WebSocket connection to the AosCloud backend. Connection management is a critical subsystem — it determines how quickly a Unit can recover from network disruptions and resume normal cloud communication.

The connection lifecycle involves three distinct phases: service discovery (obtaining the WebSocket URL), WebSocket establishment (authenticating and opening the connection), and steady-state operation (receiving frames, responding to pings, and detecting failures). When the connection drops, an exponential backoff reconnection strategy ensures the Unit recovers without overwhelming the cloud infrastructure.

Connection Lifecycle

The CM maintains a single persistent WebSocket connection to the cloud. The connection handler thread runs continuously while the CM is active, cycling through these states:

┌─────────────────────────────────────────────────────────────────┐
│ Connection Handler Loop │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Service │───▶│ WebSocket │───▶│ Receive Frames │ │
│ │ Discovery │ │ Establishment│ │ (steady state) │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ ▲ │ │
│ │ Reconnect with │ │
│ └─────────── exponential backoff ◀────────┘ │
│ (on failure/disconnect) │
└─────────────────────────────────────────────────────────────────┘

The handler thread (HandleConnection) loops for as long as the CM is running:

  1. Attempt to connect to the cloud using the Retry utility with exponential backoff
  2. On success, enter the frame-receiving loop (ReceiveFrames)
  3. When the connection drops (close frame, error, or timeout), disconnect and return to step 1
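The three steps above can be sketched as a small loop. This is only an illustration of the control flow, not the CM's actual code: `ConnectionHooks` and `HandleConnectionSketch` are hypothetical names, and the connect, receive, and disconnect logic is injected as callables so the loop can be read in isolation.

```cpp
#include <functional>

// Hypothetical hooks standing in for the CM's real connect, receive, and
// disconnect logic; none of these names come from the CM source.
struct ConnectionHooks {
    std::function<bool()> isRunning;         // false once Stop() has been requested
    std::function<bool()> connectWithRetry;  // one backoff cycle; true on success
    std::function<void()> receiveFrames;     // blocks until the connection drops
    std::function<void()> disconnect;        // releases the socket
};

inline void HandleConnectionSketch(const ConnectionHooks& hooks)
{
    while (hooks.isRunning()) {
        if (!hooks.connectWithRetry()) {
            continue;  // every attempt in this cycle failed: start a new cycle
        }
        hooks.receiveFrames();  // steady state
        hooks.disconnect();     // connection dropped: clean up, then reconnect
    }
}
```

Injecting the steps as callables keeps the loop itself trivial and makes it possible to exercise the reconnect path with fakes instead of a real socket.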

Service Discovery

Before establishing a WebSocket connection, the Unit must discover the cloud endpoint URL. Service discovery is a lightweight HTTPS POST request, separate from the WebSocket protocol itself.

Request Format

The Unit sends an HTTP POST to the configured service discovery endpoint (e.g., /sd/v7/) with a JSON body:

{
  "version": 7,
  "systemId": "<unit-system-id>",
  "supportedProtocols": ["wss"]
}

| Field | Type | Description |
|---|---|---|
| version | integer | Protocol version (currently 7) |
| systemId | string | The Unit's unique system identifier |
| supportedProtocols | string[] | List of protocols the Unit supports (typically ["wss"]) |
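As a rough illustration of the request body, the sketch below serializes the three fields by hand using only the standard library. `DiscoveryRequest`, `EscapeJson`, and `ToJson` are hypothetical names; the CM most likely uses a JSON library instead, and the escaping here covers only quotes and backslashes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical value type mirroring the three documented request fields.
struct DiscoveryRequest {
    int version = 7;
    std::string systemId;
    std::vector<std::string> supportedProtocols{"wss"};
};

// Escapes quotes and backslashes only. Control characters are not handled;
// a real implementation would rely on a JSON library.
inline std::string EscapeJson(const std::string& in)
{
    std::string out;
    for (char c : in) {
        if (c == '"' || c == '\\') {
            out += '\\';
        }
        out += c;
    }
    return out;
}

inline std::string ToJson(const DiscoveryRequest& req)
{
    std::string json = "{\"version\":" + std::to_string(req.version)
        + ",\"systemId\":\"" + EscapeJson(req.systemId)
        + "\",\"supportedProtocols\":[";
    for (std::size_t i = 0; i < req.supportedProtocols.size(); ++i) {
        if (i > 0) {
            json += ',';
        }
        json += '"' + EscapeJson(req.supportedProtocols[i]) + '"';
    }
    return json + "]}";
}
```

The resulting string would be sent as the POST body to the service discovery endpoint.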

The service discovery URL is obtained from the Unit's online certificate (via CryptoHelper::GetServiceDiscoveryURLs) or can be overridden in the CM configuration (mOverrideServiceDiscoveryURL).

Response Format

The cloud responds with connection details:

{
  "version": 7,
  "systemId": "<unit-system-id>",
  "connectionInfo": ["wss://cloud.example.com/ws/v7/"],
  "authToken": "<bearer-token>",
  "errorCode": 0,
  "nextRequestDelay": null
}

| Field | Type | Description |
|---|---|---|
| version | integer | Protocol version echo |
| systemId | string | Unit system ID echo |
| connectionInfo | string[] | List of WebSocket URLs to connect to |
| authToken | string | Bearer token for WebSocket authentication |
| errorCode | integer | Result code (see below) |
| nextRequestDelay | integer or null | Delay in milliseconds before next discovery request (overrides default reconnect timeout) |

Error Codes

| Code | Name | Meaning |
|---|---|---|
| 0 | NoError | Success: connectionInfo and authToken are populated |
| 1 | Redirect | The connectionInfo contains a new service discovery URL to use instead |
| 2 | RepeatLater | The cloud is temporarily unavailable; retry after nextRequestDelay |
| -1 | Error | General error: the Unit should retry with backoff |
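One way to read the table is as a mapping from errorCode to the next step the Unit takes. The sketch below encodes that mapping; `DiscoveryAction` and `ActionFor` are hypothetical names, and treating unknown codes like the general error is an assumption, not documented behavior.

```cpp
// Hypothetical names; only the numeric codes come from the table above.
enum class DiscoveryAction {
    Connect,             // 0: use connectionInfo and authToken
    RediscoverAtNewURL,  // 1: connectionInfo holds a new discovery URL
    RetryAfterDelay,     // 2: wait for nextRequestDelay, then retry
    RetryWithBackoff     // -1: general error
};

inline DiscoveryAction ActionFor(int errorCode)
{
    switch (errorCode) {
    case 0:
        return DiscoveryAction::Connect;
    case 1:
        return DiscoveryAction::RediscoverAtNewURL;
    case 2:
        return DiscoveryAction::RetryAfterDelay;
    default:
        // Assumption: unknown codes are handled like the general error (-1).
        return DiscoveryAction::RetryWithBackoff;
    }
}
```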

Discovery Caching

The discovered WebSocket URL and auth token are cached in memory. The CM reuses the cached connection info across reconnections until:

  • The WebSocket connection fails with an authorization error (HTTP 401) — this clears the cached discovery response and forces a fresh service discovery request
  • All URLs in connectionInfo have been exhausted without a successful connection
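The two invalidation rules can be modeled with a small in-memory cache. This is a sketch under assumptions: `DiscoveryCache` and its methods are hypothetical, and advancing to the next URL after each failed attempt is one plausible reading of "exhausted", not confirmed behavior.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory cache for the last service discovery response.
class DiscoveryCache {
public:
    void Store(std::vector<std::string> urls, std::string authToken)
    {
        mURLs = std::move(urls);
        mAuthToken = std::move(authToken);
        mCurrent = 0;
    }

    // True when there is no cached URL left to try.
    bool NeedsDiscovery() const { return mCurrent >= mURLs.size(); }

    // Callers must check NeedsDiscovery() first.
    const std::string& CurrentURL() const { return mURLs[mCurrent]; }
    const std::string& AuthToken() const { return mAuthToken; }

    void OnConnected() { mCurrent = 0; }    // keep the cache for later reconnects
    void OnConnectFailed() { ++mCurrent; }  // assumption: move on to the next URL

    // HTTP 401 during the handshake: drop everything and force rediscovery.
    void OnUnauthorized()
    {
        mURLs.clear();
        mAuthToken.clear();
        mCurrent = 0;
    }

private:
    std::vector<std::string> mURLs;
    std::string mAuthToken;
    std::size_t mCurrent = 0;
};
```

A connect routine would call NeedsDiscovery() before each attempt and run service discovery only when it returns true.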

WebSocket Establishment

Once service discovery provides a WebSocket URL and auth token, the CM establishes the connection:

  1. Create TLS session — The CM creates an HTTPS client session using the Unit's online certificate for mutual TLS (mTLS) authentication
  2. Set authorization header — The bearer token from service discovery is added: Authorization: Bearer <authToken>
  3. WebSocket upgrade — The CM sends an HTTP Upgrade request to transition the HTTPS connection to a WebSocket connection
  4. Configure socket — Keep-alive is enabled; receive timeout is set to infinite (blocking read)

Authentication

WebSocket connections are authenticated at two levels:

  • mTLS (transport level) — The Unit presents its online certificate; the cloud validates it against the CA chain
  • Bearer token (application level) — The auth token from service discovery is sent in the Authorization header during the WebSocket handshake

If the WebSocket handshake returns an unauthorized error, the CM clears the cached discovery response and performs a fresh service discovery on the next connection attempt.

Frame Format

All messages are sent as binary WebSocket frames containing UTF-8 JSON. The CM uses the Poco WebSocket library with FRAME_OP_BINARY for all outgoing messages.

Reconnection Strategy

When the connection fails or drops, the CM uses an exponential backoff strategy to reconnect without overwhelming the cloud infrastructure.

Backoff Parameters

| Parameter | Value | Source |
|---|---|---|
| Initial delay | 1 second | cReconnectTimeout constant |
| Maximum delay | 10 minutes | cMaxReconnectTimeout constant |
| Retries per cycle | 5 | cReconnectTries constant |
| Backoff multiplier | Doubling on each retry | |

Backoff Sequence

Each reconnection cycle attempts up to 5 connections with exponentially increasing delays:

| Attempt | Delay Before Attempt |
|---|---|
| 1 | 0 (immediate) |
| 2 | 1 second |
| 3 | 2 seconds |
| 4 | 4 seconds |
| 5 | 8 seconds |

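If all 5 attempts in a cycle fail, the handler immediately starts a new cycle (the outer while (mIsRunning) loop in HandleConnection). The delay resets to the initial value at the start of each cycle, unless the service discovery response specified a nextRequestDelay, which overrides the initial delay. With the default 1-second initial delay, the longest wait within a cycle is 8 seconds, so the 10-minute maximum can only be reached when a larger initial delay is in effect, for example through nextRequestDelay.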
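The delay sequence for one cycle can be reproduced with a few lines of arithmetic. The sketch below is an illustration only: `BackoffDelays` is a hypothetical helper, and clamping the initial delay to the maximum is an assumption about how an oversized starting value would be treated.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Returns the wait before each attempt in one reconnection cycle.
// The first attempt is immediate; later waits double up to maxDelay.
inline std::vector<std::chrono::seconds> BackoffDelays(
    std::chrono::seconds initialDelay, std::chrono::seconds maxDelay, int tries)
{
    std::vector<std::chrono::seconds> delays;
    // Assumption: an initial delay above the cap is clamped to the cap.
    std::chrono::seconds delay = std::min(initialDelay, maxDelay);

    for (int attempt = 0; attempt < tries; ++attempt) {
        if (attempt == 0) {
            delays.push_back(std::chrono::seconds(0));
            continue;
        }
        delays.push_back(delay);
        delay = std::min(delay * 2, maxDelay);
    }
    return delays;
}
```

Called with an initial delay of 1 second, a maximum of 600 seconds, and 5 tries, it yields 0, 1, 2, 4, and 8 seconds, matching the table. Starting from 200 seconds instead gives 0, 200, 400, 600, and 600 seconds, which shows where the cap applies.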

Delay Override

The service discovery response can override the reconnect timeout via the nextRequestDelay field. When present and greater than zero, this value replaces the default initial delay for subsequent reconnection attempts. This allows the cloud to implement server-side throttling — for example, instructing overloaded Units to back off for a longer period.
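The override rule amounts to a single selection. In the sketch below, `InitialReconnectDelay` is a hypothetical helper, and a JSON null is assumed to have been mapped to 0 by the caller.

```cpp
#include <chrono>

// Picks the initial reconnect delay for the next cycle.
// Assumption: the caller passes 0 when nextRequestDelay is null or absent.
inline std::chrono::milliseconds InitialReconnectDelay(
    long long nextRequestDelayMs, std::chrono::milliseconds defaultDelay)
{
    return nextRequestDelayMs > 0
        ? std::chrono::milliseconds(nextRequestDelayMs)
        : defaultDelay;
}
```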

Connection Failure Scenarios

| Scenario | Behavior |
|---|---|
| Network unreachable | Retry with exponential backoff |
| DNS resolution failure | Retry with exponential backoff |
| TLS handshake failure | Retry with exponential backoff |
| WebSocket upgrade rejected (401) | Clear discovery cache, re-discover, then retry |
| Cloud sends close frame | Disconnect gracefully, reconnect immediately |
| Receive timeout / connection reset | Disconnect, reconnect with backoff |

Keepalive and Ping/Pong

The WebSocket connection uses the standard WebSocket ping/pong mechanism for keepalive:

  • The cloud sends ping frames to verify the Unit is still reachable
  • The Unit responds with pong frames immediately upon receiving a ping
  • The Unit's socket has keep-alive enabled at the TCP level (setKeepAlive(true))

The CM does not initiate ping frames — it relies on the cloud-side keepalive mechanism. If the cloud stops sending pings and the connection silently drops, the Unit detects the failure when the next send operation fails or when the receive loop returns zero bytes.
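A receive loop has to decide what to do with each frame it reads. The sketch below shows one way to make that decision from the byte count and frame flags. The opcode values are those defined by RFC 6455; `FrameAction` and `ClassifyFrame` are hypothetical names, and treating an empty non-control read as a lost connection follows the zero-bytes behavior described above.

```cpp
// Opcodes from RFC 6455, section 5.2 (low four bits of the first frame byte).
constexpr int kOpcodeMask = 0x0F;
constexpr int kOpcodeClose = 0x08;
constexpr int kOpcodePing = 0x09;
constexpr int kOpcodePong = 0x0A;

enum class FrameAction { DeliverMessage, SendPong, Ignore, Disconnect };

// Hypothetical classifier for one received frame.
inline FrameAction ClassifyFrame(int bytesReceived, int flags)
{
    const int opcode = flags & kOpcodeMask;

    // Control frames are checked first because a ping may carry no payload.
    if (opcode == kOpcodeClose) {
        return FrameAction::Disconnect;
    }
    if (opcode == kOpcodePing) {
        return FrameAction::SendPong;
    }
    if (opcode == kOpcodePong) {
        return FrameAction::Ignore;
    }

    // A zero-byte read on a non-control frame means the peer went away.
    if (bytesReceived <= 0) {
        return FrameAction::Disconnect;
    }
    return FrameAction::DeliverMessage;
}
```

Checking the opcode before the byte count matters: an empty ping would otherwise be mistaken for a dropped connection.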

Connection State Notifications

The CM implements a publish-subscribe pattern for connection state changes. Other CM modules (Update Manager, SM Controller, etc.) subscribe to connection events to coordinate their behavior.

Subscriber Interface

class ConnectionListenerItf {
public:
    virtual void OnConnect() = 0;
    virtual void OnDisconnect() = 0;
};

Notification Behavior

| Event | Trigger | Subscriber Action |
|---|---|---|
| OnConnect() | WebSocket connection successfully established | Subscribers can begin sending messages (e.g., send pending unit status) |
| OnDisconnect() | Connection closed or lost | Subscribers should stop sending and queue messages for later delivery |

Notifications are delivered synchronously under a mutex lock. All registered subscribers receive the notification before the connection handler proceeds.

Connection State Query

Any module can check the current connection state via IsConnected(), which returns true only while the WebSocket connection is active and no disconnect notification has been sent.
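The notification and query behavior described in this section can be sketched with a mutex-protected subscriber list. `ConnectionNotifier` and its methods are hypothetical names; the interface is repeated from above with a virtual destructor added for the sketch.

```cpp
#include <algorithm>
#include <mutex>
#include <vector>

// Interface from this page; the virtual destructor is added for the sketch.
class ConnectionListenerItf {
public:
    virtual ~ConnectionListenerItf() = default;
    virtual void OnConnect() = 0;
    virtual void OnDisconnect() = 0;
};

// Hypothetical publisher: notifies every subscriber while holding the lock.
class ConnectionNotifier {
public:
    void Subscribe(ConnectionListenerItf& listener)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mListeners.push_back(&listener);
    }

    void Unsubscribe(ConnectionListenerItf& listener)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mListeners.erase(
            std::remove(mListeners.begin(), mListeners.end(), &listener),
            mListeners.end());
    }

    void NotifyConnect()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mConnected = true;
        for (auto* listener : mListeners) {
            listener->OnConnect();
        }
    }

    void NotifyDisconnect()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mConnected = false;
        for (auto* listener : mListeners) {
            listener->OnDisconnect();
        }
    }

    bool IsConnected() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mConnected;
    }

private:
    mutable std::mutex mMutex;
    std::vector<ConnectionListenerItf*> mListeners;
    bool mConnected = false;
};
```

Because callbacks run under the lock, a listener in this sketch must not call back into the notifier from OnConnect() or OnDisconnect(); with a non-recursive mutex that would deadlock.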

Message Delivery Guarantees

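The connection management layer provides at-least-once delivery for messages that expect an acknowledgment, bounded by a retry limit. It relies on a send queue, acknowledgment tracking, and nack handling.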

Send Queue

Outgoing messages are placed in a send queue and transmitted by a dedicated sender thread. Messages remain in the queue until successfully written to the WebSocket.

Acknowledgment Tracking

Messages sent with SendPolicy::eExpectAck are tracked after transmission:

  • The message is moved to a "sent but unacknowledged" map
  • A background thread monitors unacknowledged messages
  • If no ack is received within mCloudResponseWaitTimeout, the message is re-enqueued for retransmission
  • Each message allows up to 3 retry attempts before being dropped with an error log
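The tracking rules above can be sketched as a map of pending messages with a timeout sweep. Everything here is illustrative: `Message`, `UnackedTracker`, and the method names are hypothetical, the message id stands in for whatever correlation key the CM uses, and the clock is passed in so the behavior is deterministic.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical message record; `retries` counts retransmissions so far.
struct Message {
    std::string id;
    std::string payload;
    int retries = 0;
};

class UnackedTracker {
public:
    UnackedTracker(Clock::duration timeout, int maxRetries)
        : mTimeout(timeout), mMaxRetries(maxRetries) {}

    // Called after a message that expects an ack was written to the socket.
    void OnSent(Message msg, Clock::time_point now)
    {
        const std::string id = msg.id;  // copy the key before moving msg
        mPending[id] = Entry{std::move(msg), now};
    }

    // Returns true if the ack matched a pending message.
    bool OnAck(const std::string& id) { return mPending.erase(id) > 0; }

    // Removes timed-out messages and returns those that should be re-enqueued.
    // Messages that already used all retries are dropped and counted.
    std::vector<Message> TakeTimedOut(Clock::time_point now)
    {
        std::vector<Message> resend;
        for (auto it = mPending.begin(); it != mPending.end();) {
            if (now - it->second.sentAt < mTimeout) {
                ++it;
                continue;
            }
            Message msg = std::move(it->second.msg);
            it = mPending.erase(it);
            if (msg.retries < mMaxRetries) {
                ++msg.retries;
                resend.push_back(std::move(msg));
            } else {
                ++mDropped;  // the real CM logs an error here
            }
        }
        return resend;
    }

    int DroppedCount() const { return mDropped; }

private:
    struct Entry {
        Message msg;
        Clock::time_point sentAt;
    };

    Clock::duration mTimeout;
    int mMaxRetries;
    int mDropped = 0;
    std::map<std::string, Entry> mPending;
};
```

A background thread would call TakeTimedOut() periodically and push the returned messages back onto the send queue. With a limit of 3, a message is sent once, retried three times, and dropped on the fourth timeout.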

Nack Handling

When the cloud sends a nack response:

  • The original message is removed from the unacknowledged map
  • It is re-enqueued with a future timestamp based on the retryAfter field from the nack
  • The message will be resent after the specified delay
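Re-enqueueing with a future timestamp suggests a queue ordered by earliest send time. The sketch below shows that idea with a priority queue. `SendQueue`, `ScheduledMessage`, and the method names are hypothetical, the unit of retryAfter is assumed to be milliseconds, and removal from the unacknowledged map is left out.

```cpp
#include <chrono>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical queue entry: a payload and the earliest time it may be sent.
struct ScheduledMessage {
    std::string payload;
    Clock::time_point notBefore;
};

// Orders the priority queue so the earliest notBefore is on top.
struct LaterFirst {
    bool operator()(const ScheduledMessage& a, const ScheduledMessage& b) const
    {
        return a.notBefore > b.notBefore;
    }
};

class SendQueue {
public:
    void Enqueue(std::string payload, Clock::time_point notBefore)
    {
        mQueue.push({std::move(payload), notBefore});
    }

    // Nack: schedule the message again after the delay the cloud asked for.
    // Assumption: retryAfter is expressed in milliseconds.
    void OnNack(std::string payload, Clock::time_point now,
                std::chrono::milliseconds retryAfter)
    {
        Enqueue(std::move(payload), now + retryAfter);
    }

    // Pops the next message whose send time has arrived, if any.
    bool PopReady(Clock::time_point now, std::string& out)
    {
        if (mQueue.empty() || mQueue.top().notBefore > now) {
            return false;
        }
        out = mQueue.top().payload;
        mQueue.pop();
        return true;
    }

private:
    std::priority_queue<ScheduledMessage, std::vector<ScheduledMessage>, LaterFirst> mQueue;
};
```

Messages that are due keep flowing while a nacked message waits for its delay to expire.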

Certificate Rotation and Reconnection

Certificate rotation can trigger a reconnection cycle:

  1. The cloud sends a renewCertificatesNotification message over the active WebSocket
  2. The CM generates new CSRs and sends an issueUnitCertificates request
  3. The cloud responds with issuedUnitCertificates containing new certificate chains
  4. The CM installs the new certificates via the IAM Certificate Handler
  5. If the online certificate (used for mTLS) is rotated, the existing TLS session becomes invalid
  6. The connection drops, triggering the normal reconnection flow — which now uses the new certificate

The CM installs certificates in a fixed order: secondary node certificates first, then the main node's non-IAM certificates, and the main node's IAM certificate last. This ordering prevents IAM service restarts from interrupting the installation of the remaining certificates.
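The installation order can be expressed as a ranking followed by a stable sort. This is a sketch only: `CertToInstall`, `InstallRank`, and `SortForInstall` are hypothetical, and the type string "iam" is an assumed label for the IAM certificate.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical description of one certificate waiting to be installed.
struct CertToInstall {
    std::string nodeId;
    std::string type;  // assumption: "iam" marks the IAM certificate
    bool onMainNode;
};

// 0: secondary nodes, 1: main node non-IAM, 2: main node IAM (installed last).
inline int InstallRank(const CertToInstall& cert)
{
    if (!cert.onMainNode) {
        return 0;
    }
    return cert.type == "iam" ? 2 : 1;
}

// Stable sort keeps the incoming order within each rank.
inline void SortForInstall(std::vector<CertToInstall>& certs)
{
    std::stable_sort(certs.begin(), certs.end(),
        [](const CertToInstall& a, const CertToInstall& b) {
            return InstallRank(a) < InstallRank(b);
        });
}
```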

Thread Architecture

The connection management subsystem uses multiple threads for concurrent operation:

| Thread | Role |
|---|---|
| Connection handler | Manages the connection lifecycle (discovery → connect → receive loop → reconnect) |
| Send queue handler | Dequeues messages and writes them to the WebSocket |
| Unacknowledged message handler | Monitors sent messages for timeout and re-enqueues them |
| Message handler pool (4 threads) | Parses and processes received messages concurrently |

All threads coordinate via a shared mutex and condition variable, and terminate cleanly when Stop() is called.
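Clean termination usually means a sleeping thread can be woken as soon as Stop() is called. The sketch below shows that pattern with a mutex and a condition variable. `StopSignal` and its methods are hypothetical and are not taken from the CM source.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper shared by worker threads.
class StopSignal {
public:
    // Marks the signal as stopped and wakes every waiting thread.
    void Stop()
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mStopped = true;
        }
        mCondVar.notify_all();
    }

    // Sleeps for up to `delay`. Returns true if Stop() was requested,
    // in which case the calling thread should leave its loop.
    bool WaitFor(std::chrono::milliseconds delay)
    {
        std::unique_lock<std::mutex> lock(mMutex);
        return mCondVar.wait_for(lock, delay, [this] { return mStopped; });
    }

private:
    std::mutex mMutex;
    std::condition_variable mCondVar;
    bool mStopped = false;
};
```

A worker that waits between reconnection attempts with WaitFor() instead of a plain sleep returns promptly during shutdown, even in the middle of a long backoff delay.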