Version: v1.1

Cloud Communication

Introduction

The communication module is the CM subcomponent responsible for maintaining the persistent connection between an AosEdge Unit and AosCloud. It implements a WebSocket-based JSON protocol (version 7) that carries all bidirectional traffic: desired-state delivery, status reporting, monitoring data, alerts, logs, certificate lifecycle events, and provisioning operations.

This page documents the connection lifecycle, message envelope format, the complete message type catalog, the acknowledgment mechanism, and the reconnection strategy with exponential backoff.

Connection Lifecycle

The communication module establishes and maintains the cloud connection through a multi-step process:

  1. Service Discovery — CM sends an HTTPS POST to the configured service discovery URL to obtain the WebSocket endpoint and an authentication token
  2. WebSocket Establishment — CM opens a persistent WebSocket connection to the endpoint returned by service discovery, authenticating with the bearer token
  3. Bidirectional Messaging — JSON messages flow in both directions over the WebSocket using binary frames
  4. Reconnection — If the connection drops, CM reconnects using exponential backoff

Service Discovery

Before connecting to the cloud WebSocket, CM must discover the endpoint. The service discovery flow is as follows:

  1. CM obtains the service discovery URL from its configuration (or from the certificate's Subject Alternative Name via the crypto helper)
  2. CM sends an HTTPS POST request with a JSON body containing:
    • version — the protocol version (7)
    • systemId — the Unit's system identifier
    • supportedProtocols — list of supported transport protocols (currently ["wss"])
  3. The service discovery server responds with:
    • version — protocol version
    • systemId — echoed system ID
    • connectionInfo — list of WebSocket endpoint URLs to connect to
    • authToken — bearer token for WebSocket authentication
    • nextRequestDelay — suggested delay before the next discovery request
    • errorCode — one of NoError, Redirect, RepeatLater, or Error
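The request and response bodies listed above can be sketched as follows. This is a minimal illustration, not the CM implementation: the function and class names are hypothetical, errorCode is assumed to be carried as a string, and the unit of nextRequestDelay is not stated on this page, so the value is passed through unchanged.

```python
import json
from dataclasses import dataclass, field

PROTOCOL_VERSION = 7


def build_discovery_request(system_id: str) -> bytes:
    """Serialize the JSON body of the service discovery POST."""
    body = {
        "version": PROTOCOL_VERSION,
        "systemId": system_id,
        "supportedProtocols": ["wss"],
    }
    return json.dumps(body).encode("utf-8")


@dataclass
class DiscoveryResponse:
    version: int
    system_id: str
    connection_info: list = field(default_factory=list)
    auth_token: str = ""
    next_request_delay: object = None  # unit not documented here
    error_code: str = "Error"

    def is_usable(self) -> bool:
        # Assumption: only NoError with at least one endpoint allows a connect.
        return self.error_code == "NoError" and bool(self.connection_info)


def parse_discovery_response(raw: bytes) -> DiscoveryResponse:
    """Map the documented response fields onto a DiscoveryResponse."""
    doc = json.loads(raw)
    return DiscoveryResponse(
        version=doc.get("version"),
        system_id=doc.get("systemId"),
        connection_info=doc.get("connectionInfo") or [],
        auth_token=doc.get("authToken", ""),
        next_request_delay=doc.get("nextRequestDelay"),
        error_code=doc.get("errorCode", "Error"),
    )
```

How CM reacts to Redirect and RepeatLater is not described here, so the sketch only distinguishes a usable response from everything else.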

The connection uses mTLS — CM presents its "online" certificate (obtained from IAM) for client authentication, and validates the server's certificate against the configured CA.

WebSocket Connection

After successful service discovery, CM establishes the WebSocket connection:

  1. Creates an HTTPS session to the first URL in connectionInfo
  2. Sets the Authorization: Bearer <authToken> header
  3. Performs the WebSocket upgrade handshake
  4. Enables keep-alive on the socket
  5. Sets receive timeout to infinite (blocking read)
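The first two steps can be illustrated with a small helper that selects the endpoint and prepares the upgrade header. The function name is hypothetical, and the handshake itself, keep-alive, and timeout settings are left to whichever WebSocket client is used.

```python
def prepare_upgrade(connection_info, auth_token):
    """Pick the endpoint and build the headers for the WebSocket upgrade."""
    if not connection_info:
        raise ValueError("service discovery returned no endpoints")
    url = connection_info[0]  # the first URL in connectionInfo is used
    headers = {"Authorization": f"Bearer {auth_token}"}
    return url, headers
```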

Once connected, CM notifies all registered connection listeners (other CM subcomponents that need to know about connectivity state) via the OnConnect() callback.

Protocol Format

All messages use a consistent JSON envelope with two top-level fields:

{
  "header": {
    "version": 7,
    "systemId": "<unit-system-id>",
    "createdAt": "<ISO-8601-timestamp>",
    "txn": "<uuid-transaction-id>"
  },
  "data": {
    "messageType": "<type-discriminator>",
    "correlationId": "<optional-correlation-id>",
    ...
  }
}

Header Fields

| Field | Type | Description |
|---|---|---|
| version | integer | Protocol version, always 7 |
| systemId | string | The Unit's unique system identifier |
| createdAt | string | ISO 8601 UTC timestamp of message creation |
| txn | string | UUID v4 transaction identifier, unique per message |

Data Fields

| Field | Type | Description |
|---|---|---|
| messageType | string | Discriminator identifying the payload type |
| correlationId | string (optional) | Links a response to its originating request |
| (additional fields) | varies | Payload-specific fields depending on messageType |

Messages are transmitted as WebSocket binary frames. CM uses 4 dedicated handler threads to process received messages concurrently.
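As an illustration of the envelope described above, the sketch below builds a message and encodes it as the byte payload of a binary frame. The helper names are hypothetical, and the exact createdAt formatting expected by AosCloud is an assumption (Python's default ISO 8601 output with a +00:00 offset).

```python
import json
import uuid
from datetime import datetime, timezone

RESERVED_DATA_KEYS = {"messageType", "correlationId"}


def build_envelope(system_id, message_type, payload=None, correlation_id=None):
    """Wrap a payload in the header/data envelope."""
    payload = payload or {}
    clash = RESERVED_DATA_KEYS & payload.keys()
    if clash:
        raise ValueError(f"payload uses reserved keys: {sorted(clash)}")
    data = {"messageType": message_type}
    if correlation_id is not None:
        data["correlationId"] = correlation_id  # optional field
    data.update(payload)
    return {
        "header": {
            "version": 7,
            "systemId": system_id,
            "createdAt": datetime.now(timezone.utc).isoformat(),
            "txn": str(uuid.uuid4()),  # unique per message
        },
        "data": data,
    }


def encode_frame(envelope) -> bytes:
    """Serialize the envelope to UTF-8 JSON bytes for a binary frame."""
    return json.dumps(envelope).encode("utf-8")
```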

Message Types

The protocol defines the following message types, organized by direction and functional category:

Unit → Cloud Messages

| messageType | Purpose |
|---|---|
| unitStatus | Reports current Unit state: node information, installed items, running instances, subjects |
| alerts | Forwards alert events (system alerts, quota alerts, instance alerts, download progress) |
| monitoringData | Sends resource monitoring data (CPU, RAM, disk usage per node and instance) |
| pushLog | Sends requested log data (system logs, instance logs, crash logs) |
| newState | Reports a service instance's new state data |
| stateRequest | Requests state data for a specific service instance |
| overrideEnvVarsStatus | Reports status of environment variable override operations |
| requestBlobUrls | Requests download URLs for image blobs (by OCI digest) |
| issueUnitCertificates | Sends CSRs for certificate issuance |
| installUnitCertificatesConfirmation | Confirms successful certificate installation |
| startProvisioningResponse | Responds to a provisioning start request |
| finishProvisioningResponse | Responds to a provisioning finish request |
| deprovisioningResponse | Responds to a deprovisioning request |
| ack | Acknowledges successful receipt of a cloud message |
| nack | Indicates failure to process a cloud message (includes retryAfter in milliseconds) |

Cloud → Unit Messages

| messageType | Purpose |
|---|---|
| desiredStatus | Delivers the desired state: items to deploy, instances to run, node states, unit config, certificates |
| blobUrls | Provides download URLs and decryption info for requested image blobs |
| requestLog | Requests log data from the Unit (system, instance, or crash logs) |
| stateAcceptance | Accepts or rejects a service instance's new state |
| updateState | Notifies the Unit of an update state change |
| overrideEnvVars | Requests environment variable overrides for service instances |
| renewCertificatesNotification | Notifies that certificates need renewal |
| issuedUnitCertificates | Delivers newly issued certificates |
| startProvisioningRequest | Initiates the provisioning workflow |
| finishProvisioningRequest | Completes the provisioning workflow |
| deprovisioningRequest | Initiates deprovisioning |
| ack | Acknowledges successful receipt of a Unit message |
| nack | Indicates failure to process a Unit message |

Delta Status Reporting

The unitStatus message supports delta reporting via the isDeltaInfo flag. When true, only changed fields are included rather than the full Unit state. This reduces message size for periodic status updates.
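One way to picture delta reporting is a comparison of the previous and current status at the top level. This sketch is illustrative only: the keys in the usage example are hypothetical, and the real granularity of the delta (and how removed fields are signalled) is not described on this page.

```python
def build_unit_status(previous, current):
    """Return a unitStatus data payload; a delta if a previous report exists."""
    if previous is None:
        # First report: send the full state.
        return {"messageType": "unitStatus", "isDeltaInfo": False, **current}
    # Assumption: a field counts as changed when its top-level value differs.
    changed = {
        key: value
        for key, value in current.items()
        if key not in previous or previous[key] != value
    }
    return {"messageType": "unitStatus", "isDeltaInfo": True, **changed}
```

For example, if only the hypothetical "instances" field differs between two reports, the delta payload carries messageType, isDeltaInfo set to true, and "instances" alone.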

Acknowledgment Mechanism

Messages that require delivery confirmation use an ACK/NACK mechanism:

  1. Sender transmits a message with a unique txn and optionally a correlationId
  2. Receiver processes the message and responds with either:
    • ack — message processed successfully (includes correlationId linking to the original)
    • nack — message processing failed (includes retryAfter suggesting when to retry, default 500ms)
  3. Sender tracks unacknowledged messages and retries if no response arrives within the configured timeout
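On the sending side, handling of a reply can be sketched as below. The function name is hypothetical, and it is an assumption that a nack carries a correlationId in the same way an ack does; the 500 ms default applies when retryAfter is absent.

```python
DEFAULT_RETRY_AFTER_MS = 500


def interpret_reply(data):
    """Return (correlation_id, retry_after_ms); retry_after_ms is None for ack."""
    kind = data.get("messageType")
    if kind == "ack":
        return data.get("correlationId"), None
    if kind == "nack":
        retry_after = data.get("retryAfter", DEFAULT_RETRY_AFTER_MS)
        return data.get("correlationId"), retry_after
    raise ValueError(f"not an acknowledgment message: {kind!r}")
```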

Send Policies

CM uses two send policies for outgoing messages:

| Policy | Behavior |
|---|---|
| ExpectAck | Message is tracked after sending; if no ACK arrives within the response wait timeout, it is re-enqueued for retry (up to 3 attempts) |
| SendOnly | Message is sent without tracking; no retry on missing acknowledgment |

Important messages (alerts, logs, env var statuses, certificate operations) use ExpectAck. Non-critical messages (monitoring data, state updates, unit status) use SendOnly.
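The policy assignment can be expressed as a lookup table. The mapping below is one reading of the categories named above onto messageType values and is an assumption; message types that this page does not assign to a policy return None rather than a guess.

```python
from enum import Enum


class SendPolicy(Enum):
    EXPECT_ACK = "ExpectAck"
    SEND_ONLY = "SendOnly"


# Assumed mapping of the named categories to messageType values.
POLICY_BY_TYPE = {
    "alerts": SendPolicy.EXPECT_ACK,
    "pushLog": SendPolicy.EXPECT_ACK,
    "overrideEnvVarsStatus": SendPolicy.EXPECT_ACK,
    "issueUnitCertificates": SendPolicy.EXPECT_ACK,
    "installUnitCertificatesConfirmation": SendPolicy.EXPECT_ACK,
    "monitoringData": SendPolicy.SEND_ONLY,
    "newState": SendPolicy.SEND_ONLY,
    "unitStatus": SendPolicy.SEND_ONLY,
}


def policy_for(message_type):
    """Return the send policy for a message type, or None if not listed."""
    return POLICY_BY_TYPE.get(message_type)
```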

Retry Logic

When a message expecting acknowledgment does not receive a response within the configured cloudResponseWaitTimeout:

  1. The message is re-enqueued to the send queue with an incremented retry counter
  2. Each message is retried at most 3 times
  3. Once all retries are exhausted, the message is dropped and the failure is logged
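The retry bookkeeping can be sketched with a small tracker that is driven by an explicit clock value, which keeps it easy to test. Class and attribute names are hypothetical, and the sketch assumes the limit of 3 counts retries after the initial send.

```python
from collections import deque
from dataclasses import dataclass

MAX_RETRIES = 3


@dataclass
class Pending:
    message: dict
    sent_at: float
    retries: int = 0


class UnackedTracker:
    def __init__(self, wait_timeout: float):
        self.wait_timeout = wait_timeout  # stands in for cloudResponseWaitTimeout
        self.pending = {}                 # key -> Pending
        self.send_queue = deque()         # (key, message, retries) to resend
        self.dropped = []                 # keys given up on; real code would log

    def on_sent(self, key, message, now, retries=0):
        self.pending[key] = Pending(message, now, retries)

    def on_ack(self, key):
        self.pending.pop(key, None)

    def check_timeouts(self, now):
        """Re-enqueue or drop every message whose wait timeout has elapsed."""
        for key in list(self.pending):
            entry = self.pending[key]
            if now - entry.sent_at < self.wait_timeout:
                continue
            del self.pending[key]
            if entry.retries >= MAX_RETRIES:
                self.dropped.append(key)
            else:
                self.send_queue.append((key, entry.message, entry.retries + 1))
```

A sender loop would pop entries from send_queue, transmit them, and call on_sent again with the carried retry count.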

Reconnection Strategy

The communication module uses exponential backoff for reconnection:

| Parameter | Value |
|---|---|
| Initial reconnect delay | 1 second |
| Maximum reconnect delay | 10 minutes |
| Reconnect attempts per cycle | 5 |
| Backoff multiplier | 2× (exponential) |

Reconnection Flow

  1. The connection handler thread runs a continuous loop while CM is active
  2. On connection failure, the Retry utility is invoked with exponential backoff:
    • First retry after 1 second
    • Subsequent retries double the delay (2s, 4s, 8s, ...) up to the 10-minute cap
    • After 5 failed attempts at the current delay, the delay increases
  3. If the WebSocket receives a close frame or an error occurs during frame reception, CM disconnects and re-enters the connection loop
  4. If authorization fails (HTTP 401 during WebSocket upgrade), the cached service discovery response is cleared, forcing a fresh discovery on the next attempt
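The delay schedule from the parameter table can be sketched as a generator. It models only the doubling and the 10-minute cap; the grouping of attempts into cycles of 5 is not modeled, and the function name is hypothetical.

```python
def backoff_delays(initial=1.0, maximum=600.0, multiplier=2.0):
    """Yield reconnect delays in seconds: doubling, capped at the maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * multiplier, maximum)
```

With the default values the sequence starts at 1, 2, 4, 8, 16 seconds and stays at 600 seconds once the cap is reached.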

Connection State Notifications

CM maintains a subscriber list for connection state changes. Other CM subcomponents (update manager, state handler) subscribe to receive:

  • OnConnect() — called when the WebSocket connection is successfully established
  • OnDisconnect() — called when the connection is lost

This allows dependent modules to pause operations during disconnection and resume when connectivity is restored.
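The subscriber list can be pictured as a simple observer pattern. The Python method names on_connect and on_disconnect mirror OnConnect() and OnDisconnect(); the notifier class itself is a hypothetical stand-in.

```python
class ConnectionNotifier:
    """Keeps a subscriber list and fans out connection state changes."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        if listener not in self._listeners:
            self._listeners.append(listener)

    def unsubscribe(self, listener):
        if listener in self._listeners:
            self._listeners.remove(listener)

    def notify_connect(self):
        # Iterate over a copy so listeners may unsubscribe during the callback.
        for listener in list(self._listeners):
            listener.on_connect()

    def notify_disconnect(self):
        for listener in list(self._listeners):
            listener.on_disconnect()
```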

Threading Model

The communication module operates with the following dedicated threads:

| Thread | Responsibility |
|---|---|
| Connection handler | Manages the connection lifecycle (discovery → connect → receive loop → reconnect) |
| Send queue handler | Dequeues messages and transmits them over the WebSocket |
| Unacknowledged message handler | Monitors sent messages awaiting ACK; re-enqueues on timeout |
| Message handler pool (×4) | Processes received messages concurrently (parses JSON, dispatches to handlers) |

All threads coordinate through a shared mutex and condition variable. The send queue and receive queue are protected by the mutex, and threads wake via notify_all() when new work is available.
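The coordination pattern, a queue guarded by a mutex with waiters woken through notify_all(), can be sketched with Python's threading.Condition and a pool of four workers. This illustrates the pattern only; the class and function names are hypothetical and the sketch covers just the receive side.

```python
import threading
from collections import deque


class WorkQueue:
    def __init__(self):
        self._cond = threading.Condition()  # mutex plus condition variable
        self._items = deque()
        self._closed = False

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify_all()  # wake waiting workers

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def get(self):
        """Block until an item is available; return None once closed and empty."""
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()
            return self._items.popleft() if self._items else None


def run_pool(queue, handler, workers=4):
    """Start worker threads that process items until the queue is closed."""
    def loop():
        while True:
            item = queue.get()
            if item is None:
                return
            handler(item)

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for thread in threads:
        thread.start()
    return threads
```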

Security

The cloud connection is secured through multiple layers:

  • mTLS Authentication: CM presents its "online" certificate for client authentication; the server's certificate is validated against the configured CA certificate
  • Certificate Rotation: When IAM notifies CM of a certificate change, CM disconnects and reconnects with the new credentials
  • Bearer Token: The service discovery auth token is included in the WebSocket upgrade request for session-level authentication
  • Message Integrity: All messages include the systemId in the header; CM validates that received messages match its own system ID
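The systemId check on received messages can be sketched as follows. The function name is hypothetical, and returning None for rejected or malformed frames is an assumption about error handling that this page does not specify.

```python
import json


def decode_and_check(frame: bytes, own_system_id: str):
    """Parse a received frame; return the envelope or None if it is rejected."""
    try:
        envelope = json.loads(frame.decode("utf-8"))
    except ValueError:  # covers invalid UTF-8 and invalid JSON
        return None
    if not isinstance(envelope, dict):
        return None
    header = envelope.get("header")
    if not isinstance(header, dict):
        return None
    if header.get("systemId") != own_system_id:
        return None  # message addressed to a different Unit
    return envelope
```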

Message Logging

CM optionally logs all sent and received messages to a file for debugging purposes. When the cloudMessageLog configuration path is set, every transmitted (TX) and received (RX) message is written to the log file with a direction indicator. This is useful for protocol-level debugging but should be disabled in production due to the volume of data.
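A message log of this kind can be sketched as below. The line format (direction, a space, then the raw JSON) and the class name are hypothetical; the path argument stands in for the cloudMessageLog setting, and logging is skipped when it is unset.

```python
class MessageLog:
    def __init__(self, path=None):
        self._path = path  # None or empty disables logging

    def record(self, direction: str, raw: bytes) -> None:
        """Append one message with its direction indicator (TX or RX)."""
        if not self._path:
            return
        if direction not in ("TX", "RX"):
            raise ValueError("direction must be 'TX' or 'RX'")
        text = raw.decode("utf-8", errors="replace")
        with open(self._path, "a", encoding="utf-8") as log_file:
            log_file.write(f"{direction} {text}\n")
```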