Version: v1.1

Monitoring and Observability

Introduction

This section covers how AosCore monitors the health and resource usage of Nodes and service instances, generates alerts when thresholds are exceeded, and provides access to system and service logs. These capabilities give operators visibility into Unit behavior and enable proactive issue detection.

All monitoring and observability functions are implemented within the Service Manager (SM) and the shared monitoring library. Collected data is forwarded to AosCloud through the Communication Manager (CM) for centralized analysis and dashboarding.

Monitoring Capabilities

AosCore provides three complementary observability subsystems:

Resource Metrics Collection

The monitoring pipeline periodically samples system-level and per-instance resource usage:

| Metric | Source | Scope |
| --- | --- | --- |
| CPU usage | /proc/stat | Node-level and per-instance |
| RAM usage | /proc/meminfo | Node-level and per-instance |
| Disk usage | Filesystem stats per partition | Node-level and per-instance |
| Network traffic | Download and upload byte counters | Node-level and per-instance |
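To illustrate the CPU row, the sketch below derives a utilization percentage from two successive readings of the aggregate `cpu` line in /proc/stat. It is not taken from the AosCore sources: the function names are hypothetical, and counting iowait as idle time is an assumption made for this example.

```python
def parse_cpu_line(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal guest guest_nice
            # Assumption: iowait is treated as idle time.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            # guest and guest_nice are already accounted for in user and nice,
            # so only the first eight counters are summed.
            total = sum(values[:8])
            return idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_usage_percent(prev, curr):
    """Compute CPU usage between two (idle, total) samples."""
    idle_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0  # no time elapsed or counters were reset
    return 100.0 * (total_delta - idle_delta) / total_delta
```

On a real Node, the two samples would come from reading /proc/stat once per poll interval; the counters are cumulative, so only the difference between two readings is meaningful.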

Metrics are collected at a configurable poll interval, averaged over a sliding window, and sent to the cloud as NodeMonitoringData and InstanceMonitoringData messages. The averaging window smooths transient spikes and provides a representative view of resource consumption over time.
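The averaging step can be sketched as a fixed-size window over the most recent samples. This is a minimal illustration that assumes a simple arithmetic mean over the last N polls; the class name `SlidingAverage` is hypothetical, and the averaging formula AosCore actually uses may differ.

```python
from collections import deque


class SlidingAverage:
    """Arithmetic mean over the most recent `window_size` samples."""

    def __init__(self, window_size):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        # deque with maxlen drops the oldest sample automatically
        self._samples = deque(maxlen=window_size)

    def add(self, value):
        """Record one sample and return the updated average."""
        self._samples.append(value)
        return self.value()

    def value(self):
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)
```

With a window of three samples, a single spike to 90 among readings of 20 raises the average to about 43 rather than 90, which is the smoothing effect described above.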

Alerting

AosCore implements a threshold-based alerting system that evaluates collected metrics against configurable rules. Alerts are generated at two levels:

  • System-level alerts — triggered when Node resource usage (CPU, RAM, disk, network) crosses defined thresholds.
  • Instance-level alerts — triggered when a specific service instance exceeds its allocated resource quotas.

Each alert rule defines a minimum threshold, a maximum threshold, and a timeout duration. When a metric stays above the maximum threshold for longer than the configured timeout, a "raise" alert is sent. When it later drops below the minimum threshold, a "fall" alert signals recovery. Because the raise and fall thresholds differ, a metric hovering near a single limit does not toggle the alert repeatedly; this hysteresis prevents alert flapping.
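The raise/fall behavior can be modeled as a small state machine. The sketch below is illustrative only: `ThresholdAlert` and its fields are hypothetical names, timestamps are plain numbers in seconds, and it assumes that the timeout applies only to the raise condition, as the description above states.

```python
class ThresholdAlert:
    """Threshold rule with hysteresis: raise above max, fall below min."""

    def __init__(self, min_threshold, max_threshold, timeout):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.timeout = timeout
        self.raised = False
        self._above_since = None  # time the metric first exceeded max

    def update(self, value, now):
        """Feed one sample; return 'raise', 'fall', or None."""
        if not self.raised:
            if value > self.max_threshold:
                if self._above_since is None:
                    self._above_since = now
                # Raise only once the metric has stayed above max for the full timeout
                if now - self._above_since >= self.timeout:
                    self.raised = True
                    self._above_since = None
                    return "raise"
            else:
                # A dip back to or below max restarts the timeout
                self._above_since = None
            return None
        # Already raised: values between min and max keep the alert active
        if value < self.min_threshold:
            self.raised = False
            return "fall"
        return None
```

For example, with `ThresholdAlert(70, 90, 5)`, a value of 95 held for five seconds produces "raise", a later value of 80 produces nothing because it sits between the two thresholds, and a value of 65 produces "fall".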

In addition to resource quota alerts, the system generates:

  • Journal-based alerts — the SM monitors the systemd journal for error-level messages from AosCore components and service instances, forwarding them as CoreAlert, InstanceAlert, or SystemAlert messages.
  • Resource allocation alerts — generated when resource allocation for a service instance fails.
  • Download progress alerts — track the state of image downloads (started, paused, interrupted, finished).

Logging

The log provider gives operators on-demand access to logs through cloud-initiated requests:

  • Instance logs — journal entries from a specific service instance, filtered by the instance's systemd unit name.
  • Instance crash logs — journal entries surrounding a service crash event, useful for post-mortem analysis.
  • System logs — journal entries from AosCore system components.

Log requests support time-range filtering (from/till timestamps) and are returned as compressed archives split into configurable part sizes for efficient transport over the cloud protocol.
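The request handling described above can be sketched as three steps: filter by time range, compress, and split into parts. This is a simplified model that assumes entries are `(timestamp, message)` tuples and that the whole selection is gzip-compressed once and the resulting byte stream is then cut into parts; the function names are hypothetical, and the real log provider may compress each part separately.

```python
import gzip


def filter_by_time(entries, from_ts=None, till_ts=None):
    """Keep entries whose timestamp lies within [from_ts, till_ts]."""
    return [
        (ts, msg)
        for ts, msg in entries
        if (from_ts is None or ts >= from_ts)
        and (till_ts is None or ts <= till_ts)
    ]


def build_log_parts(entries, max_part_size, from_ts=None, till_ts=None):
    """Return a list of byte chunks, each at most max_part_size bytes."""
    if max_part_size < 1:
        raise ValueError("max_part_size must be at least 1")
    selected = filter_by_time(entries, from_ts, till_ts)
    text = "".join(msg + "\n" for _, msg in selected)
    archive = gzip.compress(text.encode("utf-8"))
    # Cut the compressed stream into fixed-size parts; the last may be shorter
    return [
        archive[i:i + max_part_size]
        for i in range(0, len(archive), max_part_size)
    ]
```

Under this model, the receiver concatenates the parts in order and decompresses the result to recover the log text, so every part must arrive before the archive can be read.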

Architecture

The monitoring and observability subsystems are implemented across these modules:

| Module | Location | Responsibility |
| --- | --- | --- |
| Monitoring (common) | core/common/monitoring/ | Metric averaging, alert threshold processing, data aggregation |
| Node Monitoring Provider | sm/monitoring/ | Collects raw CPU, RAM, disk, and network metrics from the OS |
| Journal Alerts | sm/alerts/ | Monitors systemd journal for error-level entries |
| Log Provider | sm/logprovider/ | Serves log requests from the cloud |

The data flow follows this path:

  1. Collection: NodeMonitoringProvider reads system metrics from procfs and filesystem APIs at each poll interval.
  2. Aggregation: The common Monitoring module averages samples over the configured window and maintains per-Node and per-instance data.
  3. Alert evaluation: AlertProcessor instances compare current values against threshold rules and emit alerts when conditions are met.
  4. Transmission: Aggregated monitoring data and alerts are sent to the cloud via CM.

In This Section