Version: v1.1

Monitoring Pipeline

Introduction

This page describes the monitoring data pipeline — the end-to-end flow from raw metric collection on a Node through aggregation, alert evaluation, and transmission to AosCloud. Understanding this pipeline helps operators reason about metric latency, averaging behavior, and how resource data reaches the cloud dashboard.

The pipeline runs continuously on each Node as a timer-driven loop within the Service Manager (SM). Each poll cycle collects fresh samples, updates running averages, evaluates alert thresholds, and sends the result to the Communication Manager (CM) for cloud delivery.

Pipeline Overview

The monitoring pipeline consists of five stages:

Data Sources — OS-level interfaces that expose raw resource metrics
Collection — providers that read raw data into structured monitoring records
Aggregation — averaging and normalization of collected samples
Alert Evaluation — threshold-based detection of resource quota violations
Transmission — delivery of monitoring data and alerts to the cloud

Stage 1: Data Sources

The pipeline reads resource metrics from standard Linux interfaces:

Metric	Source	What is Read
CPU usage	`/proc/stat`	Aggregate CPU time counters (user, system, idle, etc.)
RAM usage	`/proc/meminfo`	MemTotal, MemFree, Buffers, Cached, SReclaimable
Disk usage	`statvfs()` syscall	Block counts per configured partition path
Network traffic	Network Manager	Download and upload byte counters
Instance resources	Instance Info Provider	Per-service-instance CPU, RAM, disk, and network usage

These sources are read at each poll interval. No persistent state is maintained between reads except for the previous CPU time counters needed to compute utilization as a delta.
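The delta-based CPU calculation can be sketched in Python as follows. This is an illustrative sketch, not the SM implementation: the names `parse_cpu_line` and `CpuSampler` are hypothetical, and two choices are assumptions (only the `idle` column counts as idle time, and the first read reports 0 because no delta exists yet).

```python
from typing import Optional, Tuple


def parse_cpu_line(stat_text: str) -> Tuple[int, int]:
    """Return (idle, total) time counters from the aggregate "cpu" line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Column order: user nice system idle iowait irq softirq steal ...
            # Only the first eight are summed; guest time is already in user.
            counters = [int(v) for v in fields[1:9]]
            return counters[3], sum(counters)
    raise ValueError("no aggregate cpu line found")


class CpuSampler:
    """Keeps only the previous counters, the one piece of state between reads."""

    def __init__(self) -> None:
        self._prev: Optional[Tuple[int, int]] = None

    def sample(self, stat_text: str) -> float:
        """Return CPU utilization in percent since the previous call."""
        idle, total = parse_cpu_line(stat_text)
        prev, self._prev = self._prev, (idle, total)
        if prev is None:
            return 0.0  # assumption: nothing to compare against yet
        d_idle, d_total = idle - prev[0], total - prev[1]
        if d_total <= 0:
            return 0.0
        return 100.0 * (d_total - d_idle) / d_total
```

On a Linux host the sampler would be fed the contents of `/proc/stat` once per poll, for example `sampler.sample(open("/proc/stat").read())`.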

Stage 2: Collection

Two providers gather raw metrics into MonitoringData structures:

NodeMonitoringProvider

Located in sm/monitoring/, this provider collects Node-level system metrics:

CPU — Computes utilization as a percentage by comparing idle time delta to total time delta between consecutive reads of /proc/stat. This gives the average CPU usage across all cores since the last sample.
RAM — Calculates used memory as: MemTotal - MemFree - Buffers - Cached - SReclaimable. This matches the "used" calculation in standard tools like free.
Disk — Calls statvfs() on each partition path defined in the Node configuration. Reports used bytes as (total_blocks - free_blocks) * fragment_size.
Network — Queries the SystemTrafficProvider (part of SM's Network Manager) for cumulative download and upload byte counts.
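The RAM and disk formulas above can be sketched in Python. The function names are hypothetical, and mapping "free_blocks" to `f_bfree` (all free blocks, including those reserved for root) rather than `f_bavail` is an assumption.

```python
import os
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; sizes are in kB."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


def used_ram_kb(info: Dict[str, int]) -> int:
    """MemTotal - MemFree - Buffers - Cached - SReclaimable."""
    return (info["MemTotal"] - info["MemFree"] - info["Buffers"]
            - info["Cached"] - info["SReclaimable"])


def used_disk_bytes(path: str) -> int:
    """(total_blocks - free_blocks) * fragment_size for one partition path."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize
```

Block counts from `statvfs()` are expressed in units of the fragment size (`f_frsize`), which is why that field, not `f_bsize`, is the multiplier.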

Instance Info Provider

For each active service instance, the SM collects per-instance resource usage. Instance monitoring starts when an instance enters the Activating or Active state and stops when it becomes Inactive or Failed.

The collected data for each instance includes the same metric types (CPU, RAM, disk partitions, network) scoped to that specific service instance.

Stage 3: Aggregation

The Monitoring module in core/common/monitoring/ orchestrates the aggregation stage. A timer fires at the configured pollPeriod interval, triggering the ProcessMonitoring() cycle.

Averaging

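The Average class implements an exponential moving average whose smoothing is set by a window size. The formulas below need only a running accumulator per metric, not a buffer of past samples: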

Window size = averageWindow / pollPeriod (number of samples in the window)
Update formula — For each metric value: accumulated = accumulated - (accumulated / windowSize) + newSample
Read formula — Average value = accumulated / windowSize

This approach smooths transient spikes while remaining responsive to sustained changes. The first sample initializes the accumulator to newSample * windowSize so the average starts at the actual value rather than ramping up from zero.

Both Node-level and per-instance metrics are averaged independently using the same window parameters.
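The update and read formulas can be sketched in Python as below. This mirrors the formulas on this page only. The method names and the use of floating-point arithmetic are assumptions, and the real Average class may differ in both.

```python
class Average:
    """Running average driven by a single accumulator (illustrative sketch)."""

    def __init__(self, window_size: int) -> None:
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self._window_size = window_size
        self._accumulated = 0.0
        self._initialized = False

    def update(self, new_sample: float) -> None:
        if not self._initialized:
            # Seed the accumulator so the first average equals the first sample.
            self._accumulated = new_sample * self._window_size
            self._initialized = True
            return
        # accumulated = accumulated - (accumulated / windowSize) + newSample
        self._accumulated += new_sample - self._accumulated / self._window_size

    def value(self) -> float:
        # Average value = accumulated / windowSize
        return self._accumulated / self._window_size


avg = Average(window_size=4)
for sample in (100.0, 0.0, 0.0):
    avg.update(sample)
print(avg.value())  # 56.25
```

After a first sample of 100 the average is 100. Each following zero sample removes one quarter of the current average, giving 75 and then 56.25.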

Normalization

After averaging, the pipeline normalizes Node-level totals to ensure consistency:

Node CPU ≥ sum of all instance CPU values
Node RAM ≥ sum of all instance RAM values
Node download ≥ sum of all instance download values
Node upload ≥ sum of all instance upload values
Partition usage = maximum of Node-reported and instance-reported values

This normalization prevents the situation where individual instance metrics sum to more than the reported Node total, which could confuse cloud-side dashboards.
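A minimal Python sketch of these rules follows. The `MonitoringData` fields and the `normalize` function are hypothetical stand-ins for the real structures, and two points are assumptions: a Node total is raised to the instance sum when it is lower, and partition usage is compared against each instance value individually.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MonitoringData:
    cpu: float = 0.0
    ram: int = 0
    download: int = 0
    upload: int = 0
    partitions: Dict[str, int] = field(default_factory=dict)


def normalize(node: MonitoringData, instances: List[MonitoringData]) -> None:
    """Raise Node totals in place so they are never below the instance sums."""
    node.cpu = max(node.cpu, sum(i.cpu for i in instances))
    node.ram = max(node.ram, sum(i.ram for i in instances))
    node.download = max(node.download, sum(i.download for i in instances))
    node.upload = max(node.upload, sum(i.upload for i in instances))
    for inst in instances:
        for name, used in inst.partitions.items():
            # Only partitions the Node already reports are adjusted.
            if name in node.partitions:
                node.partitions[name] = max(node.partitions[name], used)
```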

Stage 4: Alert Evaluation

After aggregation, the pipeline evaluates each metric against configured alert thresholds. Alert processing runs on both Node-level and per-instance data.

AlertProcessor

Each monitored resource (CPU, RAM, each partition, download, upload) has a dedicated AlertProcessor instance configured with:

Parameter	Description
`maxThreshold`	Upper limit — crossing this starts the alert timer
`minThreshold`	Lower limit — dropping below this starts the recovery timer
`minTimeout`	Duration the value must remain beyond a threshold before an alert fires

The timeout mechanism prevents alert flapping from brief spikes. A metric must sustain the threshold violation for the full minTimeout duration before an alert is raised.

Alert State Machine

Each AlertProcessor maintains a two-state machine:

┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Normal State                    Alert Condition        │
│  ┌──────────┐                    ┌──────────────┐      │
│  │          │  value ≥ max for   │              │      │
│  │  Idle    │  ≥ minTimeout      │   Raised     │      │
│  │          │ ──────────────────►│              │      │
│  │          │                    │              │      │
│  │          │  value < min for   │              │      │
│  │          │◄────────────────── │              │      │
│  │          │  ≥ minTimeout      │              │      │
│  └──────────┘                    └──────────────┘      │
│                                                         │
└─────────────────────────────────────────────────────────┘

The AlertProcessor reports one of three alert statuses to the cloud:

State	Meaning
Raise	Value exceeded `maxThreshold` for ≥ `minTimeout` — alert condition begins
Continue	Value remains above `minThreshold` while in alert condition (periodic reminder)
Fall	Value dropped below `minThreshold` for ≥ `minTimeout` — alert condition ends

Continue alerts are sent periodically (every minTimeout interval) while the alert condition persists, providing ongoing visibility into sustained resource pressure.
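The state machine and the three statuses can be sketched in Python as below. This is a hedged illustration rather than the actual AlertProcessor: the `check` method, its explicit `now` argument, and the string return values are hypothetical, and using `>=` for the upper threshold follows the diagram.

```python
from typing import Optional


class AlertProcessor:
    """Two-state threshold checker with a hold-off timeout (sketch)."""

    def __init__(self, min_threshold: float, max_threshold: float,
                 min_timeout: float) -> None:
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.min_timeout = min_timeout
        self._in_alert = False
        self._crossed_at: Optional[float] = None  # start of pending crossing
        self._last_sent = 0.0                     # time of last raise/continue

    def check(self, value: float, now: float) -> Optional[str]:
        """Return "raise", "continue", "fall", or None for this sample."""
        if not self._in_alert:
            if value < self.max_threshold:
                self._crossed_at = None  # spike ended, restart the timer
                return None
            if self._crossed_at is None:
                self._crossed_at = now
            if now - self._crossed_at >= self.min_timeout:
                self._in_alert = True
                self._crossed_at = None
                self._last_sent = now
                return "raise"
            return None

        if value < self.min_threshold:
            if self._crossed_at is None:
                self._crossed_at = now
            if now - self._crossed_at >= self.min_timeout:
                self._in_alert = False
                self._crossed_at = None
                return "fall"
            return None

        self._crossed_at = None  # recovery interrupted, still in alert
        if now - self._last_sent >= self.min_timeout:
            self._last_sent = now
            return "continue"
        return None
```

Passing the timestamp in explicitly keeps the sketch deterministic. A value between the two thresholds neither raises a new alert nor ends an active one, which gives the processor hysteresis.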

Alert Types

Threshold alerts are emitted as one of two types depending on scope:

SystemQuotaAlert — Node-level resource threshold violation (includes Node ID and parameter name)
InstanceQuotaAlert — Instance-level resource threshold violation (includes instance identity and parameter name)

Alert thresholds for Node-level resources are defined as percentages in the Node configuration and converted to absolute values at startup (e.g., 90% CPU on a 1000 DMIPS Node becomes a threshold of 900).
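The percentage conversion amounts to one multiplication, sketched here in Python. The function name is hypothetical, and the figures are the ones from the example above.

```python
def to_absolute_threshold(percent: float, total: float) -> float:
    """Convert a percentage threshold into the resource's absolute units."""
    return total * percent / 100.0


# 90% CPU on a 1000 DMIPS Node
print(to_absolute_threshold(90, 1000))  # 900.0
```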

Stage 5: Transmission

SM to CM

The SM Client (smclient) implements the monitoring::SenderItf interface. After each poll cycle completes aggregation and alert evaluation, the SM sends the complete NodeMonitoringData structure to the CM via the internal SM-to-CM communication channel.

The NodeMonitoringData message contains:

Timestamp of the measurement
Node ID
Node-level MonitoringData (CPU, RAM, partitions, download, upload)
Array of InstanceMonitoringData for each active instance

Alerts are sent separately through the alerts::SenderItf interface as they are detected.

CM to Cloud

The CM's monitoring module (cm/monitoring/) receives data from all Nodes in the Unit:

Caching — Incoming monitoring data is cached per-Node, building a complete picture of the Unit's resource state.
Batching — The CM aggregates data from multiple Nodes into a single Monitoring message containing NodeMonitoringData arrays and InstanceMonitoringData arrays.
Sending — On a configurable timer, the CM sends the batched monitoring message to AosCloud over the WebSocket connection.

This two-stage transmission (SM → CM → Cloud) decouples the per-Node poll frequency from the cloud send frequency, allowing the CM to batch and optimize network usage.
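The cache, batch, and send steps can be sketched in Python as follows. The `MonitoringBatcher` class, its method names, and the dictionary message layout are all hypothetical. Accumulating every received sample per Node and clearing the cache after each send are assumptions as well.

```python
from typing import Callable, Dict, List


class MonitoringBatcher:
    """Collects per-Node data and forwards it in one message per send tick."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send
        self._cache: Dict[str, List[dict]] = {}

    def on_node_data(self, node_id: str, data: dict) -> None:
        # Called whenever an SM delivers a NodeMonitoringData message.
        self._cache.setdefault(node_id, []).append(data)

    def on_send_timer(self) -> None:
        # Called by the CM's own timer, independent of any Node's pollPeriod.
        if not self._cache:
            return
        message = {
            "nodes": [
                {"nodeID": node_id, "items": items}
                for node_id, items in self._cache.items()
            ]
        }
        self._send(message)
        self._cache.clear()
```

Because `on_node_data` and `on_send_timer` are driven by different timers, several poll cycles from several Nodes can end up in one message.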

Configuration Parameters

The monitoring pipeline behavior is controlled by two primary parameters:

Parameter	Default	Description
`pollPeriod`	Compile-time configurable	Interval between metric collection cycles
`averageWindow`	Compile-time configurable	Time window for the sliding average calculation

The ratio averageWindow / pollPeriod determines the number of samples in the averaging window. A larger ratio produces smoother data but increases latency in detecting changes.
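The smoothing versus latency trade-off can be made concrete with a short Python simulation of the update formula. The function names and the 10-second and 60-second values are hypothetical, chosen only for illustration.

```python
def window_size(average_window_s: int, poll_period_s: int) -> int:
    """Number of samples in the averaging window (at least 1)."""
    return max(1, average_window_s // poll_period_s)


def polls_to_settle(window: int, fraction: float = 0.9) -> int:
    """Polls needed after a 0 -> 1 step for the average to reach `fraction`."""
    accumulated = 0.0  # the average starts at 0 before the step
    polls = 0
    while accumulated / window < fraction:
        # accumulated = accumulated - (accumulated / windowSize) + newSample
        accumulated += 1.0 - accumulated / window
        polls += 1
    return polls


size = window_size(average_window_s=60, poll_period_s=10)
print(size, polls_to_settle(size))  # 6 13
```

With these example values a sustained change needs 13 polls (130 seconds) before the reported average covers 90% of it. Doubling the window roughly doubles that delay.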

Alert thresholds are configured per-Node through the Node configuration, which can be updated dynamically. When the Node configuration changes, alert processors are reconfigured without restarting the monitoring pipeline.

Data Flow Summary

Each poll cycle executes the following sequence:

Timer fires (every pollPeriod)
NodeMonitoringProvider.GetNodeMonitoringData() reads OS metrics
InstanceInfoProvider provides per-instance metrics for all active instances
Average.Update() incorporates new samples into the sliding window
AlertProcessor.CheckAlertDetection() evaluates each resource against thresholds
NormalizeMonitoringData() ensures Node totals ≥ instance sums
Sender.SendMonitoringData() transmits the result to CM
CM caches, batches, and forwards to AosCloud on its own schedule

The entire cycle is synchronous within the timer callback, ensuring consistent snapshots where all metrics in a single NodeMonitoringData message correspond to the same point in time.
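The sequence above can be sketched in Python as a single synchronous function. Every parameter is a hypothetical callable standing in for the component named in the corresponding step, and the call order simply follows the list as written.

```python
from typing import Any, Callable, List, Tuple

Snapshot = Tuple[Any, List[Any]]  # (node data, per-instance data)


def process_monitoring(
    read_node: Callable[[], Any],
    read_instances: Callable[[], List[Any]],
    update_average: Callable[[Any, List[Any]], Snapshot],
    check_alerts: Callable[[Any, List[Any]], None],
    normalize: Callable[[Any, List[Any]], None],
    send: Callable[[Any, List[Any]], None],
) -> None:
    """One poll cycle, run start to finish inside the timer callback."""
    node = read_node()                      # OS metrics
    instances = read_instances()            # per-instance metrics
    avg_node, avg_instances = update_average(node, instances)
    check_alerts(avg_node, avg_instances)   # threshold evaluation
    normalize(avg_node, avg_instances)      # Node totals >= instance sums
    send(avg_node, avg_instances)           # hand over to the CM
```

Running all steps in one call, with no step deferred to another thread, is what keeps the values in one message tied to the same sample time.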

Related Pages

Monitoring and Observability — overview of all monitoring subsystems
Alerts and Thresholds — detailed alert rule configuration and behavior

Service Manager — SM component that hosts the monitoring pipeline
Architecture Overview — system component relationships
