Monitoring Pipeline
Introduction
This page describes the monitoring data pipeline — the end-to-end flow from raw metric collection on a Node through aggregation, alert evaluation, and transmission to AosCloud. Understanding this pipeline helps operators reason about metric latency, averaging behavior, and how resource data reaches the cloud dashboard.
The pipeline runs continuously on each Node as a timer-driven loop within the Service Manager (SM). Each poll cycle collects fresh samples, updates running averages, evaluates alert thresholds, and sends the result to the Communication Manager (CM) for cloud delivery.
Pipeline Overview
The monitoring pipeline consists of five stages:
- Data Sources — OS-level interfaces that expose raw resource metrics
- Collection — providers that read raw data into structured monitoring records
- Aggregation — averaging and normalization of collected samples
- Alert Evaluation — threshold-based detection of resource quota violations
- Transmission — delivery of monitoring data and alerts to the cloud
Stage 1: Data Sources
The pipeline reads resource metrics from standard Linux interfaces:
| Metric | Source | What is Read |
|---|---|---|
| CPU usage | /proc/stat | Aggregate CPU time counters (user, system, idle, etc.) |
| RAM usage | /proc/meminfo | MemTotal, MemFree, Buffers, Cached, SReclaimable |
| Disk usage | statvfs() syscall | Block counts per configured partition path |
| Network traffic | Network Manager | Download and upload byte counters |
| Instance resources | Instance Info Provider | Per-service-instance CPU, RAM, disk, and network usage |
These sources are read at each poll interval. No persistent state is maintained between reads except for the previous CPU time counters needed to compute utilization as a delta.
Stage 2: Collection
Two providers gather raw metrics into MonitoringData structures:
NodeMonitoringProvider
Located in sm/monitoring/, this provider collects Node-level system metrics:
- CPU — Computes utilization as a percentage by comparing idle time delta to total time delta between consecutive reads of /proc/stat. This gives the average CPU usage across all cores since the last sample.
- RAM — Calculates used memory as: MemTotal - MemFree - Buffers - Cached - SReclaimable. This matches the "used" calculation in standard tools like free.
- Disk — Calls statvfs() on each partition path defined in the Node configuration. Reports used bytes as (total_blocks - free_blocks) * fragment_size.
- Network — Queries the SystemTrafficProvider (part of SM's Network Manager) for cumulative download and upload byte counts.
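The CPU, RAM, and disk calculations above can be sketched in Python as follows. This is an illustration only, not the SM's implementation: the function names are hypothetical, and treating only the idle column (not iowait) as idle time, as well as summing just the first eight counters, are assumptions.

```python
import os


def parse_cpu_counters(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Column order: user nice system idle iowait irq softirq steal ...
            # Assumption: only 'idle' counts as idle time. The guest columns
            # are skipped because they are already included in user/nice.
            return values[3], sum(values[:8])
    raise ValueError("no aggregate cpu line found")


def cpu_percent(prev, cur):
    """Utilization in percent between two (idle, total) samples."""
    idle_delta = cur[0] - prev[0]
    total_delta = cur[1] - prev[1]
    if total_delta <= 0:
        return 0.0  # no time elapsed between reads
    return 100.0 * (1.0 - idle_delta / total_delta)


def used_ram_kib(meminfo_text):
    """MemTotal - MemFree - Buffers - Cached - SReclaimable, in KiB."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])
    return (fields["MemTotal"] - fields["MemFree"] - fields["Buffers"]
            - fields["Cached"] - fields["SReclaimable"])


def used_disk_bytes(path):
    """(total_blocks - free_blocks) * fragment_size for one partition path."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize
```

The previous (idle, total) pair is the only state that has to survive between polls, which matches the note in Stage 1 about CPU counters being the single piece of persistent state.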
Instance Info Provider
For each active service instance, the SM collects per-instance resource usage. Instance monitoring starts when an
instance enters the Activating or Active state and stops when it becomes Inactive or Failed.
The collected data for each instance includes the same metric types (CPU, RAM, disk partitions, network) scoped to that specific service instance.
Stage 3: Aggregation
The Monitoring module in core/common/monitoring/ orchestrates the aggregation stage. A timer fires at the configured
pollPeriod interval, triggering the ProcessMonitoring() cycle.
Averaging
The Average class implements an exponential moving average whose smoothing factor is derived from a window size. As the formulas show, each update needs only the running accumulator, not a buffer of past samples:
- Window size = averageWindow / pollPeriod (number of samples in the window)
- Update formula — For each metric value: accumulated = accumulated - (accumulated / windowSize) + newSample
- Read formula — Average value = accumulated / windowSize
This approach smooths transient spikes while remaining responsive to sustained changes. The first sample initializes the
accumulator to newSample * windowSize so the average starts at the actual value rather than ramping up from zero.
Both Node-level and per-instance metrics are averaged independently using the same window parameters.
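A minimal Python sketch of the update and read formulas above is shown here. It is not the actual Average class: the method names and the use of floating-point arithmetic are assumptions made for illustration.

```python
class Average:
    """Accumulator-based moving average following the formulas above."""

    def __init__(self, window_size):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.window_size = window_size
        self.accumulated = 0.0
        self.initialized = False

    def update(self, new_sample):
        if not self.initialized:
            # First sample: seed the accumulator so the average starts
            # at the actual value instead of ramping up from zero.
            self.accumulated = new_sample * self.window_size
            self.initialized = True
        else:
            # accumulated = accumulated - accumulated / windowSize + newSample
            self.accumulated += new_sample - self.accumulated / self.window_size

    def value(self):
        # Average value = accumulated / windowSize
        return self.accumulated / self.window_size
```

With a window of 4 samples, feeding 100 and then repeated zeros yields 100, 75, 56.25, and so on: each update removes one window-size share of the old accumulator, so the average decays geometrically toward the new level.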
Normalization
After averaging, the pipeline normalizes Node-level totals to ensure consistency:
- Node CPU ≥ sum of all instance CPU values
- Node RAM ≥ sum of all instance RAM values
- Node download ≥ sum of all instance download values
- Node upload ≥ sum of all instance upload values
- Partition usage = maximum of Node-reported and instance-reported values
This normalization prevents the situation where individual instance metrics sum to more than the reported Node total, which could confuse cloud-side dashboards.
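The rules above could be applied roughly as in the following Python sketch. The Usage type and normalize function are hypothetical names. Two interpretations are assumed rather than taken from the source: a Node total is raised to the instance sum when it is lower, and partition values are compared one instance at a time, only for partitions the Node itself reports.

```python
from dataclasses import dataclass, field


@dataclass
class Usage:
    """Hypothetical container for one set of averaged metrics."""
    cpu: float = 0.0
    ram: int = 0
    download: int = 0
    upload: int = 0
    partitions: dict = field(default_factory=dict)  # name -> used bytes


def normalize(node, instances):
    """Raise Node totals so they are never below the instance figures."""
    for name in ("cpu", "ram", "download", "upload"):
        instance_sum = sum(getattr(inst, name) for inst in instances)
        setattr(node, name, max(getattr(node, name), instance_sum))

    # Assumption: each instance value is compared individually (not summed)
    # and partitions unknown to the Node are ignored.
    for inst in instances:
        for part, used in inst.partitions.items():
            if part in node.partitions:
                node.partitions[part] = max(node.partitions[part], used)
    return node
```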
Stage 4: Alert Evaluation
After aggregation, the pipeline evaluates each metric against configured alert thresholds. Alert processing runs on both Node-level and per-instance data.
AlertProcessor
Each monitored resource (CPU, RAM, each partition, download, upload) has a dedicated AlertProcessor instance
configured with:
| Parameter | Description |
|---|---|
| maxThreshold | Upper limit — crossing this starts the alert timer |
| minThreshold | Lower limit — dropping below this starts the recovery timer |
| minTimeout | Duration the value must remain beyond a threshold before an alert fires |
The timeout mechanism prevents alert flapping from brief spikes. A metric must sustain the threshold violation for the
full minTimeout duration before an alert is raised.
Alert State Machine
Each AlertProcessor maintains a two-state machine:
┌─────────────────────────────────────────────────────────┐
│                                                         │
│   Normal State                      Alert Condition     │
│   ┌──────────┐                      ┌──────────────┐    │
│   │          │ value ≥ max for      │              │    │
│   │   Idle   │ ≥ minTimeout         │    Raised    │    │
│   │          │ ──────────────────►  │              │    │
│   │          │                      │              │    │
│   │          │ value < min for      │              │    │
│   │          │ ◄──────────────────  │              │    │
│   │          │ ≥ minTimeout         │              │    │
│   └──────────┘                      └──────────────┘    │
│                                                         │
└─────────────────────────────────────────────────────────┘
While moving between these two states, the processor reports three alert statuses to the cloud:
| State | Meaning |
|---|---|
| Raise | Value exceeded maxThreshold for ≥ minTimeout — alert condition begins |
| Continue | Value remains above minThreshold while in alert condition (periodic reminder) |
| Fall | Value dropped below minThreshold for ≥ minTimeout — alert condition ends |
Continue alerts are sent periodically (every minTimeout interval) while the alert condition persists, providing
ongoing visibility into sustained resource pressure.
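The two-state machine and its three statuses can be sketched in Python as below. This is a simplified model under stated assumptions, not the real AlertProcessor: the check method, its string return values, and the explicit now argument are hypothetical, and Continue is assumed to be suppressed while the value is below minThreshold and the recovery timer is running.

```python
class AlertProcessor:
    """Two-state (idle/raised) threshold detector with a hold-off timeout."""

    def __init__(self, min_threshold, max_threshold, min_timeout):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.min_timeout = min_timeout
        self.raised = False
        self._since = None      # start of the current threshold excursion
        self._last_sent = None  # time of the last raise/continue

    def check(self, value, now):
        """Return 'raise', 'continue', 'fall', or None for one sample."""
        if not self.raised:
            if value < self.max_threshold:
                self._since = None  # a brief spike ended: restart the timer
                return None
            if self._since is None:
                self._since = now
            if now - self._since >= self.min_timeout:
                self.raised, self._since, self._last_sent = True, None, now
                return "raise"
            return None

        if value < self.min_threshold:
            if self._since is None:
                self._since = now
            if now - self._since >= self.min_timeout:
                self.raised, self._since = False, None
                return "fall"
            return None  # assumption: no reminder while recovery is pending

        self._since = None
        if now - self._last_sent >= self.min_timeout:
            self._last_sent = now
            return "continue"
        return None
```

Passing the time in explicitly keeps the sketch deterministic. With thresholds 70/90 and a 30-second timeout, a value that touches 95 for only 10 seconds produces nothing, while one that stays at 95 for 30 seconds produces a single raise.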
Alert Types
Threshold alerts are emitted as one of two types depending on scope:
- SystemQuotaAlert — Node-level resource threshold violation (includes Node ID and parameter name)
- InstanceQuotaAlert — Instance-level resource threshold violation (includes instance identity and parameter name)
Alert thresholds for Node-level resources are defined as percentages in the Node configuration and converted to absolute values at startup (e.g., 90% CPU on a 1000 DMIPS Node becomes a threshold of 900).
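The percentage-to-absolute conversion in the example above amounts to the following arithmetic. The function name, the dictionary layout, and the use of floor rounding are assumptions; only the 90% of 1000 DMIPS figure comes from the text.

```python
def to_absolute_rule(percent_rule, total):
    """Convert percentage thresholds into absolute units of the resource."""
    return {
        # Assumption: integer floor rounding.
        "minThreshold": total * percent_rule["minThreshold"] // 100,
        "maxThreshold": total * percent_rule["maxThreshold"] // 100,
        # The timeout is a duration, so it is passed through unchanged.
        "minTimeout": percent_rule["minTimeout"],
    }


# 90% CPU on a 1000 DMIPS Node; the 80% lower bound is a made-up value.
cpu_rule = to_absolute_rule(
    {"minThreshold": 80, "maxThreshold": 90, "minTimeout": 30}, total=1000)
```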
Stage 5: Transmission
SM to CM
The SM Client (smclient) implements the monitoring::SenderItf interface. After each poll cycle completes aggregation
and alert evaluation, the SM sends the complete NodeMonitoringData structure to the CM via the internal SM-to-CM
communication channel.
The NodeMonitoringData message contains:
- Timestamp of the measurement
- Node ID
- Node-level MonitoringData (CPU, RAM, partitions, download, upload)
- Array of InstanceMonitoringData for each active instance
Alerts are sent separately through the alerts::SenderItf interface as they are detected.
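As a rough picture of the message shape listed above, the following Python dataclasses mirror its contents. The type names come from the text, but every field name, the field types, and the instance identifier format are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MonitoringData:
    cpu: float
    ram: int
    download: int
    upload: int
    partitions: dict = field(default_factory=dict)  # name -> used bytes


@dataclass
class InstanceMonitoringData:
    instance_id: str  # hypothetical stand-in for the instance identity
    data: MonitoringData


@dataclass
class NodeMonitoringData:
    timestamp: datetime
    node_id: str
    data: MonitoringData
    instances: list = field(default_factory=list)


sample = NodeMonitoringData(
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    node_id="node-1",
    data=MonitoringData(cpu=12.5, ram=1024, download=10, upload=5,
                        partitions={"/var": 4096}),
    instances=[InstanceMonitoringData(
        "svc-a:0", MonitoringData(cpu=2.5, ram=256, download=1, upload=1))],
)
```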
CM to Cloud
The CM's monitoring module (cm/monitoring/) receives data from all Nodes in the Unit:
- Caching — Incoming monitoring data is cached per-Node, building a complete picture of the Unit's resource state.
- Batching — The CM aggregates data from multiple Nodes into a single Monitoring message containing NodeMonitoringData arrays and InstanceMonitoringData arrays.
- Sending — On a configurable timer, the CM sends the batched monitoring message to AosCloud over the WebSocket connection.
This two-stage transmission (SM → CM → Cloud) decouples the per-Node poll frequency from the cloud send frequency, allowing the CM to batch and optimize network usage.
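The caching and batching steps could look roughly like the Python sketch below. The class and method names are hypothetical, messages are modelled as plain dictionaries, and keeping only the most recent report per Node is an assumption, since the text does not say how many samples the cache retains or whether it is cleared after a send.

```python
class MonitoringCache:
    """Per-Node cache that is flattened into one message on a send timer."""

    def __init__(self):
        self._latest = {}  # node_id -> (node_data, [instance_data, ...])

    def on_node_data(self, node_id, node_data, instance_data):
        # Assumption: a newer report from the same Node replaces the older one.
        self._latest[node_id] = (node_data, list(instance_data))

    def build_message(self):
        """Combine cached data from all Nodes into a single batch."""
        nodes, instances = [], []
        for node_id in sorted(self._latest):
            node_data, instance_data = self._latest[node_id]
            nodes.append(node_data)
            instances.extend(instance_data)
        return {"nodes": nodes, "instances": instances}


cache = MonitoringCache()
cache.on_node_data("node-b", {"cpu": 10}, [{"id": "b1"}])
cache.on_node_data("node-a", {"cpu": 20}, [])
cache.on_node_data("node-b", {"cpu": 30}, [{"id": "b1"}, {"id": "b2"}])
message = cache.build_message()  # would be sent when the CM timer fires
```

Because build_message only reads the cache, the send timer can run at any period independent of how often each Node reports.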
Configuration Parameters
The monitoring pipeline behavior is controlled by two primary parameters:
| Parameter | Default | Description |
|---|---|---|
| pollPeriod | Compile-time configurable | Interval between metric collection cycles |
| averageWindow | Compile-time configurable | Time window for the sliding average calculation |
The ratio averageWindow / pollPeriod determines the number of samples in the averaging window. A larger ratio produces
smoother data but increases latency in detecting changes.
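To make the smoothness versus latency trade-off concrete, the Python sketch below computes the window size and counts how many polls the average needs to cover 90% of a sudden step in the input. The helper names, the integer division, and the sample periods are assumptions for illustration; the settling loop relies only on the update formula from the Averaging section, where each poll shrinks the remaining gap by a factor of (1 - 1/windowSize).

```python
def window_size(average_window_s, poll_period_s):
    """Number of samples in the averaging window (assumes integer division)."""
    return max(1, average_window_s // poll_period_s)


def polls_to_settle(size, fraction=0.9):
    """Polls needed for the average to cover `fraction` of a step change."""
    remaining, polls = 1.0, 0
    while remaining > 1.0 - fraction:
        remaining *= 1.0 - 1.0 / size  # one update of the moving average
        polls += 1
    return polls


small = window_size(average_window_s=60, poll_period_s=10)
large = window_size(average_window_s=120, poll_period_s=10)
latency_small = polls_to_settle(small) * 10  # seconds at a 10 s poll period
latency_large = polls_to_settle(large) * 10
```

Doubling the window in this example roughly doubles the time the reported average takes to reflect a sustained change.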
Alert thresholds are configured per-Node through the Node configuration, which can be updated dynamically. When the Node configuration changes, alert processors are reconfigured without restarting the monitoring pipeline.
Data Flow Summary
Each poll cycle executes the following sequence:
- Timer fires (every pollPeriod)
- NodeMonitoringProvider.GetNodeMonitoringData() reads OS metrics
- InstanceInfoProvider provides per-instance metrics for all active instances
- Average.Update() incorporates new samples into the sliding window
- AlertProcessor.CheckAlertDetection() evaluates each resource against thresholds
- NormalizeMonitoringData() ensures Node totals ≥ instance sums
- Sender.SendMonitoringData() transmits the result to CM
- CM caches, batches, and forwards to AosCloud on its own schedule
The entire cycle is synchronous within the timer callback, ensuring consistent snapshots where all metrics in a single
NodeMonitoringData message correspond to the same point in time.
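The synchronous ordering of one cycle can be expressed as a single function that calls each step in turn. In this Python sketch the steps are injected as plain callables with hypothetical names and signatures; it follows the order of the list above and does not model the timer or the CM side.

```python
def poll_cycle(collect_node, collect_instances, update_averages,
               check_alerts, normalize, send):
    """Run one monitoring cycle; every step completes before the next starts."""
    node = collect_node()
    instances = collect_instances()
    averaged = update_averages(node, instances)
    alerts = check_alerts(averaged)
    normalized = normalize(averaged)
    send(normalized)
    return alerts
```

Since nothing in the function yields or runs concurrently, all values passed to send derive from the same pair of collection calls, which is the consistency property described above.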
Related Pages
- Monitoring and Observability — overview of all monitoring subsystems
- Alerts and Thresholds — detailed alert rule configuration and behavior
- Service Manager — SM component that hosts the monitoring pipeline
- Architecture Overview — system component relationships