Version: v1.1

Monitoring Pipeline

Introduction

This page describes the monitoring data pipeline — the end-to-end flow from raw metric collection on a Node through aggregation, alert evaluation, and transmission to AosCloud. Understanding this pipeline helps operators reason about metric latency, averaging behavior, and how resource data reaches the cloud dashboard.

The pipeline runs continuously on each Node as a timer-driven loop within the Service Manager (SM). Each poll cycle collects fresh samples, updates running averages, evaluates alert thresholds, and sends the result to the Communication Manager (CM) for cloud delivery.

Pipeline Overview


The monitoring pipeline consists of five stages:

  1. Data Sources — OS-level interfaces that expose raw resource metrics
  2. Collection — providers that read raw data into structured monitoring records
  3. Aggregation — averaging and normalization of collected samples
  4. Alert Evaluation — threshold-based detection of resource quota violations
  5. Transmission — delivery of monitoring data and alerts to the cloud

Stage 1: Data Sources

The pipeline reads resource metrics from standard Linux interfaces:

| Metric | Source | What is Read |
|---|---|---|
| CPU usage | /proc/stat | Aggregate CPU time counters (user, system, idle, etc.) |
| RAM usage | /proc/meminfo | MemTotal, MemFree, Buffers, Cached, SReclaimable |
| Disk usage | statvfs() syscall | Block counts per configured partition path |
| Network traffic | Network Manager | Download and upload byte counters |
| Instance resources | Instance Info Provider | Per-service-instance CPU, RAM, disk, and network usage |

These sources are read at each poll interval. No persistent state is maintained between reads except for the previous CPU time counters needed to compute utilization as a delta.

Stage 2: Collection

Two providers gather raw metrics into MonitoringData structures:

NodeMonitoringProvider

Located in sm/monitoring/, this provider collects Node-level system metrics:

  • CPU: Computes utilization as a percentage from the idle-time delta and the total-time delta between consecutive reads of /proc/stat: utilization = 100 × (1 − Δidle / Δtotal). This gives the average CPU usage across all cores since the last sample.
  • RAM: Calculates used memory as MemTotal - MemFree - Buffers - Cached - SReclaimable. This matches the "used" figure reported by free in older procps-ng releases (3.3.x); newer releases of free derive it from MemAvailable instead.
  • Disk: Calls statvfs() on each partition path defined in the Node configuration. Reports used bytes as (total_blocks - free_blocks) * fragment_size.
  • Network: Queries the SystemTrafficProvider (part of SM's Network Manager) for cumulative download and upload byte counts.
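The CPU, RAM, and disk calculations above can be sketched in a few lines of Python. This is an illustration only, not the provider's actual code: the function names are hypothetical, and the parsers take the file contents as text so they can be exercised without a live /proc.

```python
import os


def parse_cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Order: user nice system idle iowait irq softirq steal guest guest_nice.
            # Guest time is already included in user/nice, so only the first
            # eight counters are summed.
            return values[3], sum(values[:8])
    raise ValueError("no aggregate cpu line found")


def cpu_utilization(prev, curr):
    """Percentage of non-idle time between two (idle, total) readings."""
    idle_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)


def used_ram_bytes(meminfo_text):
    """MemTotal - MemFree - Buffers - Cached - SReclaimable, in bytes."""
    kib = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            kib[key.strip()] = int(parts[0])
    used_kib = (kib["MemTotal"] - kib["MemFree"] - kib["Buffers"]
                - kib["Cached"] - kib["SReclaimable"])
    return used_kib * 1024  # the fields used here are reported in kB


def disk_used_bytes(path):
    """(total_blocks - free_blocks) * fragment_size for the filesystem at path."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize
```

On a Linux host the parsers could be fed with `open("/proc/stat").read()` and `open("/proc/meminfo").read()`. Only the previous (idle, total) pair has to be kept between polls.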

Instance Info Provider

For each active service instance, the SM collects per-instance resource usage. Instance monitoring starts when an instance enters the Activating or Active state and stops when it becomes Inactive or Failed.

The collected data for each instance includes the same metric types (CPU, RAM, disk partitions, network) scoped to that specific service instance.

Stage 3: Aggregation

The Monitoring module in core/common/monitoring/ orchestrates the aggregation stage. A timer fires at the configured pollPeriod interval, triggering the ProcessMonitoring() cycle.

Averaging

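The Average class implements an exponential moving average whose smoothing factor is set by a window size. It approximates a sliding-window mean without storing the individual samples: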

  • Window size = averageWindow / pollPeriod (number of samples in the window)
  • Update formula — For each metric value: accumulated = accumulated - (accumulated / windowSize) + newSample
  • Read formula — Average value = accumulated / windowSize

This approach smooths transient spikes while remaining responsive to sustained changes. The first sample initializes the accumulator to newSample * windowSize so the average starts at the actual value rather than ramping up from zero.

Both Node-level and per-instance metrics are averaged independently using the same window parameters.
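The update and read formulas above can be sketched as a small Python class. This is a minimal illustration, assuming floating-point arithmetic; the method names are hypothetical and the real Average class may differ (for example by using integer math).

```python
class Average:
    """Exponential moving average driven by a window size (sketch)."""

    def __init__(self, window_size):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.window_size = window_size
        self.accumulated = 0.0
        self.initialized = False

    def update(self, sample):
        if not self.initialized:
            # First sample: start the average at the actual value
            # instead of ramping up from zero.
            self.accumulated = sample * self.window_size
            self.initialized = True
            return
        # accumulated = accumulated - (accumulated / windowSize) + newSample
        self.accumulated += sample - self.accumulated / self.window_size

    def value(self):
        # Average value = accumulated / windowSize
        return self.accumulated / self.window_size
```

Dividing the update formula by the window size N gives avg_new = (1 − 1/N) · avg_old + sample / N, which is an exponential moving average with smoothing factor 1/N.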

Normalization

After averaging, the pipeline normalizes Node-level totals to ensure consistency:

  • Node CPU ≥ sum of all instance CPU values
  • Node RAM ≥ sum of all instance RAM values
  • Node download ≥ sum of all instance download values
  • Node upload ≥ sum of all instance upload values
  • Partition usage = maximum of Node-reported and instance-reported values

This normalization prevents the situation where individual instance metrics sum to more than the reported Node total, which could confuse cloud-side dashboards.
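A minimal sketch of these rules, assuming that a Node total lower than the instance sum is raised to that sum and that partition usage is compared per partition name. The dictionary layout and the function name are hypothetical, chosen only for illustration.

```python
def normalize(node, instances):
    """Return a copy of the Node data adjusted to the rules above.

    node and each entry of instances are dicts with the keys
    'cpu', 'ram', 'download', 'upload' and 'partitions' (name -> used bytes).
    """
    result = dict(node)

    # Node total >= sum of the instance values for the additive metrics.
    for key in ("cpu", "ram", "download", "upload"):
        instance_sum = sum(inst[key] for inst in instances)
        result[key] = max(node[key], instance_sum)

    # Partition usage = maximum of Node-reported and instance-reported values.
    partitions = dict(node["partitions"])
    for inst in instances:
        for name, used in inst["partitions"].items():
            partitions[name] = max(partitions.get(name, 0), used)
    result["partitions"] = partitions

    return result
```

Because the averages for the Node and for each instance are computed independently, small timing differences can make the instance sum drift above the Node figure; taking the maximum keeps the reported totals consistent.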

Stage 4: Alert Evaluation

After aggregation, the pipeline evaluates each metric against configured alert thresholds. Alert processing runs on both Node-level and per-instance data.

AlertProcessor

Each monitored resource (CPU, RAM, each partition, download, upload) has a dedicated AlertProcessor instance configured with:

| Parameter | Description |
|---|---|
| maxThreshold | Upper limit; crossing this starts the alert timer |
| minThreshold | Lower limit; dropping below this starts the recovery timer |
| minTimeout | Duration the value must remain beyond a threshold before an alert fires |

The timeout mechanism prevents alert flapping from brief spikes. A metric must sustain the threshold violation for the full minTimeout duration before an alert is raised.

Alert State Machine

Each AlertProcessor maintains a two-state machine:

  Normal State                          Alert Condition
  ┌──────────┐                          ┌──────────────┐
  │          │    value ≥ max for       │              │
  │          │    ≥ minTimeout          │              │
  │   Idle   │ ───────────────────────► │    Raised    │
  │          │                          │              │
  │          │ ◄─────────────────────── │              │
  │          │    value < min for       │              │
  │          │    ≥ minTimeout          │              │
  └──────────┘                          └──────────────┘

Transitions of this state machine are reported to the cloud as one of three alert statuses:

| State | Meaning |
|---|---|
| Raise | Value stayed at or above maxThreshold for ≥ minTimeout; alert condition begins |
| Continue | Value has not dropped below minThreshold while in alert condition (periodic reminder) |
| Fall | Value stayed below minThreshold for ≥ minTimeout; alert condition ends |

Continue alerts are sent periodically (every minTimeout interval) while the alert condition persists, providing ongoing visibility into sustained resource pressure.
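The state machine and its three statuses can be sketched as follows. This is an illustration under stated assumptions, not the actual AlertProcessor: the method name check, the status strings, and the numeric timestamps are hypothetical, and the sketch assumes that Continue is only emitted while the value is at or above minThreshold.

```python
class AlertProcessor:
    """Two-state (idle / raised) threshold detector with a hold time (sketch)."""

    def __init__(self, min_threshold, max_threshold, min_timeout):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.min_timeout = min_timeout
        self.raised = False
        self.above_since = None  # when the value first reached max while idle
        self.below_since = None  # when the value first fell below min while raised
        self.last_report = None  # time of the last Raise or Continue

    def check(self, value, now):
        """Return 'raise', 'continue', 'fall', or None for this sample."""
        if not self.raised:
            if value < self.max_threshold:
                self.above_since = None  # a brief spike resets the alert timer
                return None
            if self.above_since is None:
                self.above_since = now
            if now - self.above_since >= self.min_timeout:
                self.raised = True
                self.above_since = None
                self.last_report = now
                return "raise"
            return None

        if value < self.min_threshold:
            if self.below_since is None:
                self.below_since = now
            if now - self.below_since >= self.min_timeout:
                self.raised = False
                self.below_since = None
                return "fall"
            return None

        self.below_since = None  # recovery interrupted, restart the recovery timer
        if now - self.last_report >= self.min_timeout:
            self.last_report = now
            return "continue"
        return None
```

Using two thresholds gives hysteresis: a value hovering between minThreshold and maxThreshold neither raises a new alert nor clears an existing one.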

Alert Types

Threshold alerts are emitted as one of two types depending on scope:

  • SystemQuotaAlert — Node-level resource threshold violation (includes Node ID and parameter name)
  • InstanceQuotaAlert — Instance-level resource threshold violation (includes instance identity and parameter name)

Alert thresholds for Node-level resources are defined as percentages in the Node configuration and converted to absolute values at startup (e.g., 90% CPU on a 1000 DMIPS Node becomes a threshold of 900).
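The percentage-to-absolute conversion is simple arithmetic. The sketch below uses a hypothetical helper and a hypothetical rule layout to reproduce the example from the text.

```python
def percent_to_absolute(percent, total):
    """Convert a percentage threshold into the unit of the resource total."""
    return total * percent / 100.0


def convert_rule(rule_percent, total):
    """Convert both thresholds of a rule; min_timeout is kept unchanged."""
    return {
        "min_threshold": percent_to_absolute(rule_percent["min_threshold"], total),
        "max_threshold": percent_to_absolute(rule_percent["max_threshold"], total),
        "min_timeout": rule_percent["min_timeout"],
    }
```

For a 1000 DMIPS Node, `percent_to_absolute(90, 1000)` yields 900, matching the example above.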

Stage 5: Transmission

SM to CM

The SM Client (smclient) implements the monitoring::SenderItf interface. After each poll cycle completes aggregation and alert evaluation, the SM sends the complete NodeMonitoringData structure to the CM via the internal SM-to-CM communication channel.

The NodeMonitoringData message contains:

  • Timestamp of the measurement
  • Node ID
  • Node-level MonitoringData (CPU, RAM, partitions, download, upload)
  • Array of InstanceMonitoringData for each active instance

Alerts are sent separately through the alerts::SenderItf interface as they are detected.
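The message layout listed above can be pictured with a few Python dataclasses. The type names follow the text, but every field name and type here is an assumption made for illustration; the real structures may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MonitoringData:
    """Resource usage for one Node or one instance (field names hypothetical)."""
    cpu: float = 0.0
    ram: int = 0
    partitions: Dict[str, int] = field(default_factory=dict)  # name -> used bytes
    download: int = 0
    upload: int = 0


@dataclass
class InstanceMonitoringData:
    instance_id: str  # stands in for the full instance identity
    data: MonitoringData


@dataclass
class NodeMonitoringData:
    timestamp: float  # time of the measurement
    node_id: str
    data: MonitoringData  # Node-level metrics
    instances: List[InstanceMonitoringData] = field(default_factory=list)
```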

CM to Cloud

The CM's monitoring module (cm/monitoring/) receives data from all Nodes in the Unit:

  1. Caching — Incoming monitoring data is cached per-Node, building a complete picture of the Unit's resource state.
  2. Batching — The CM aggregates data from multiple Nodes into a single Monitoring message containing NodeMonitoringData arrays and InstanceMonitoringData arrays.
  3. Sending — On a configurable timer, the CM sends the batched monitoring message to AosCloud over the WebSocket connection.

This two-stage transmission (SM → CM → Cloud) decouples the per-Node poll frequency from the cloud send frequency, allowing the CM to batch and optimize network usage.
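The cache, batch, and send steps can be sketched as below. The class, its methods, and the message layout are hypothetical. The sketch also assumes that every sample received since the last send is kept and that the cache is cleared after sending, which the text does not specify.

```python
class MonitoringBatcher:
    """Caches per-Node samples and flushes them as one message (sketch)."""

    def __init__(self, send):
        self._send = send  # callable that delivers one batched message
        self._cache = {}   # node_id -> list of samples since the last flush

    def on_node_data(self, node_id, sample):
        """Called whenever an SM delivers a sample; sample is a dict with
        'timestamp', 'data' and 'instances' keys."""
        self._cache.setdefault(node_id, []).append(sample)

    def flush(self):
        """Called from the send timer; returns the message, or None if empty."""
        if not self._cache:
            return None
        message = {"nodes": [], "instances": []}
        for node_id in sorted(self._cache):
            for sample in self._cache[node_id]:
                message["nodes"].append({
                    "node_id": node_id,
                    "timestamp": sample["timestamp"],
                    "data": sample["data"],
                })
                message["instances"].extend(sample["instances"])
        self._cache.clear()
        self._send(message)
        return message
```

Because `flush` runs on its own timer, several poll cycles from several Nodes can end up in a single cloud message.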

Configuration Parameters

The monitoring pipeline behavior is controlled by two primary parameters:

| Parameter | Default | Description |
|---|---|---|
| pollPeriod | Compile-time configurable | Interval between metric collection cycles |
| averageWindow | Compile-time configurable | Time window for the sliding average calculation |

The ratio averageWindow / pollPeriod determines the number of samples in the averaging window. A larger ratio produces smoother data but increases latency in detecting changes.
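To make the smoothing/latency trade-off concrete, the sketch below computes the window size and estimates how many polls the average needs to reflect a given fraction of a sudden, sustained change. Both helper names are hypothetical. The estimate follows from the averaging update rule, under which the remaining gap to the new value shrinks by a factor of (1 − 1/windowSize) per poll.

```python
import math


def window_size(average_window_s, poll_period_s):
    """Number of samples in the averaging window (at least 1)."""
    return max(1, int(average_window_s // poll_period_s))


def polls_to_reach(fraction, size):
    """Polls needed until the average covers `fraction` of a step change."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if size == 1:
        return 1  # with a single-sample window the average equals the last sample
    # After k polls the remaining gap is (1 - 1/size) ** k.
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - 1.0 / size))
```

With a 60 s averageWindow and a 10 s pollPeriod the window holds 6 samples, and the average needs 13 polls (130 s) to reflect 90% of a step change. Doubling the window to 120 s roughly doubles that delay.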

Alert thresholds are configured per-Node through the Node configuration, which can be updated dynamically. When the Node configuration changes, alert processors are reconfigured without restarting the monitoring pipeline.

Data Flow Summary

Each poll cycle executes the following sequence:

  1. Timer fires (every pollPeriod)
  2. NodeMonitoringProvider.GetNodeMonitoringData() reads OS metrics
  3. InstanceInfoProvider provides per-instance metrics for all active instances
  4. Average.Update() incorporates new samples into the sliding window
  5. AlertProcessor.CheckAlertDetection() evaluates each resource against thresholds
  6. NormalizeMonitoringData() ensures Node totals ≥ instance sums
  7. Sender.SendMonitoringData() transmits the result to CM
  8. CM caches, batches, and forwards to AosCloud on its own schedule

The entire cycle is synchronous within the timer callback, ensuring consistent snapshots where all metrics in a single NodeMonitoringData message correspond to the same point in time.
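The ordering of steps 2 to 7 can be sketched as a single synchronous function. All parameter names are hypothetical placeholders for the components named above, and the sketch simply follows the sequence as listed.

```python
def process_monitoring(read_node, read_instances, update_averages,
                       check_alerts, normalize, send):
    """Run one poll cycle; intended to be called from the timer callback."""
    node_sample = read_node()                                   # step 2
    instance_samples = read_instances()                         # step 3
    averaged = update_averages(node_sample, instance_samples)   # step 4
    alerts = check_alerts(averaged)                             # step 5
    normalized = normalize(averaged)                            # step 6
    send(normalized)                                            # step 7
    return alerts
```

Running every step inside one call, with no hand-off to other threads, is what keeps all values in a message tied to the same poll.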