Monitoring and Observability
Introduction
This section covers how AosCore monitors the health and resource usage of Nodes and service instances, generates alerts when thresholds are exceeded, and provides access to system and service logs. These capabilities give operators visibility into Unit behavior and enable proactive issue detection.
All monitoring and observability functions are implemented within the Service Manager (SM) and the shared monitoring library. Collected data is forwarded to AosCloud through the Communication Manager (CM) for centralized analysis and dashboarding.
Monitoring Capabilities
AosCore provides three complementary observability subsystems:
Resource Metrics Collection
The monitoring pipeline periodically samples system-level and per-instance resource usage:
| Metric | Source | Scope |
|---|---|---|
| CPU usage | /proc/stat | Node-level and per-instance |
| RAM usage | /proc/meminfo | Node-level and per-instance |
| Disk usage | Filesystem stats per partition | Node-level and per-instance |
| Network traffic | Download and upload byte counters | Node-level and per-instance |
Metrics are collected at a configurable poll interval, averaged over a sliding window, and sent to the cloud as NodeMonitoringData and InstanceMonitoringData messages. The averaging window smooths transient spikes and provides a representative view of resource consumption over time.
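As an illustration of the sampling and averaging steps, the sketch below derives a Node-level CPU percentage from two /proc/stat snapshots and smooths it with a simple moving average. It is a minimal sketch under assumptions, not the AosCore implementation: the helper names (`parse_cpu_times`, `cpu_usage_percent`, `SlidingAverage`) are hypothetical, and the actual averaging formula used by the monitoring library may differ.

```python
from collections import deque


def parse_cpu_times(proc_stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
            # guest/guest_nice are skipped because they are already part of user/nice.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_usage_percent(prev, curr):
    """CPU utilisation in percent between two (busy, total) snapshots."""
    busy_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    return 100.0 * busy_delta / total_delta if total_delta > 0 else 0.0


class SlidingAverage:
    """Simple moving average over the last `window_size` samples (an assumption)."""

    def __init__(self, window_size):
        self._samples = deque(maxlen=window_size)

    def add(self, value):
        """Record one sample and return the current window average."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)


# Two snapshots taken one poll interval apart (synthetic values).
prev = parse_cpu_times("cpu  100 0 100 800 0 0 0 0 0 0\n")
curr = parse_cpu_times("cpu  150 0 150 900 0 0 0 0 0 0\n")
usage = cpu_usage_percent(prev, curr)

window = SlidingAverage(window_size=3)
averages = [window.add(v) for v in (10.0, 20.0, 30.0, 40.0)]
```

With a window of three samples, the fourth average covers only the last three values, which is how a short spike fades out of the reported figure after a few poll intervals.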
Alerting
AosCore implements a threshold-based alerting system that evaluates collected metrics against configurable rules. Alerts are generated at two levels:
- System-level alerts — triggered when Node resource usage (CPU, RAM, disk, network) crosses defined thresholds.
- Instance-level alerts — triggered when a specific service instance exceeds its allocated resource quotas.
Each alert rule defines minimum and maximum thresholds with a timeout duration. When a metric exceeds the maximum threshold for longer than the configured timeout, a "raise" alert is sent. When it drops below the minimum threshold, a "fall" alert signals recovery. The gap between the two thresholds acts as hysteresis: a metric hovering near a single limit does not cause alerts to flap.
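The raise/fall behavior described above can be sketched as a small state machine. This is a hedged illustration only: the class name `ThresholdAlert` and its method are hypothetical, the "fall" alert is emitted as soon as the value drops below the minimum (the real processor may also apply the timeout on the way down), and timestamps are plain seconds supplied by the caller.

```python
class ThresholdAlert:
    """Threshold rule with hysteresis: raise above max, fall below min."""

    def __init__(self, min_threshold, max_threshold, timeout):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.timeout = timeout
        self._above_since = None  # when the value first went above max
        self._raised = False      # True between a "raise" and the next "fall"

    def update(self, value, now):
        """Feed one sample taken at time `now`; return 'raise', 'fall', or None."""
        if not self._raised:
            if value > self.max_threshold:
                if self._above_since is None:
                    self._above_since = now
                # Raise once the value has stayed above max for the whole timeout.
                if now - self._above_since >= self.timeout:
                    self._raised = True
                    self._above_since = None
                    return "raise"
            else:
                # Any sample at or below max restarts the timeout.
                self._above_since = None
        elif value < self.min_threshold:
            self._raised = False
            return "fall"
        return None


rule = ThresholdAlert(min_threshold=70, max_threshold=90, timeout=10)
samples = [(0, 95), (5, 96), (10, 97), (15, 80), (20, 60), (25, 95), (30, 85)]
events = [rule.update(value, now) for now, value in samples]
```

In this run the sample at 80 produces no event even though it is below the maximum: the alert stays raised until the value falls under the minimum threshold of 70.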
In addition to resource quota alerts, the system generates:
- Journal-based alerts — the SM monitors the systemd journal for error-level messages from AosCore components and service instances, forwarding them as CoreAlert, InstanceAlert, or SystemAlert messages.
- Resource allocation alerts — generated when resource allocation for a service instance fails.
- Download progress alerts — track the state of image downloads (started, paused, interrupted, finished).
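To make the journal-based alerts more concrete, the following sketch filters journal entries in the JSON-lines form produced by `journalctl -o json` and tags error-level ones. The `PRIORITY`, `_SYSTEMD_UNIT`, and `MESSAGE` fields are standard journal fields, and priority 3 corresponds to syslog "err". Everything else is an assumption for illustration: the unit names, the instance unit prefix, and the rule that maps a unit to CoreAlert, InstanceAlert, or SystemAlert are hypothetical and may not match what the SM does.

```python
import json

ERROR_PRIORITY = 3  # syslog "err"; lower numbers are more severe

# Hypothetical unit names, used only for this illustration.
CORE_UNITS = {"aos-servicemanager.service", "aos-communicationmanager.service"}
INSTANCE_UNIT_PREFIX = "aos-service@"


def classify_entry(entry):
    """Return an alert tag for an error-level journal entry, else None."""
    try:
        priority = int(entry.get("PRIORITY", 7))
    except (TypeError, ValueError):
        return None
    if priority > ERROR_PRIORITY:
        return None  # warning, notice, info, debug: not an alert
    unit = entry.get("_SYSTEMD_UNIT", "")
    if unit.startswith(INSTANCE_UNIT_PREFIX):
        return "InstanceAlert"
    if unit in CORE_UNITS:
        return "CoreAlert"
    return "SystemAlert"


def alerts_from_json_lines(lines):
    """Turn journal JSON lines into a list of (tag, message) pairs."""
    alerts = []
    for line in lines:
        entry = json.loads(line)
        tag = classify_entry(entry)
        if tag is not None:
            alerts.append((tag, entry.get("MESSAGE", "")))
    return alerts


journal_lines = [
    '{"PRIORITY": "3", "_SYSTEMD_UNIT": "aos-service@abc.service", "MESSAGE": "crash"}',
    '{"PRIORITY": "6", "_SYSTEMD_UNIT": "aos-servicemanager.service", "MESSAGE": "started"}',
    '{"PRIORITY": "2", "_SYSTEMD_UNIT": "aos-servicemanager.service", "MESSAGE": "fatal"}',
    '{"PRIORITY": "3", "_SYSTEMD_UNIT": "systemd-networkd.service", "MESSAGE": "link down"}',
]
alerts = alerts_from_json_lines(journal_lines)
```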
Logging
The log provider gives operators on-demand access to logs through cloud-initiated requests:
- Instance logs — journal entries from a specific service instance, filtered by the instance's systemd unit name.
- Instance crash logs — journal entries surrounding a service crash event, useful for post-mortem analysis.
- System logs — journal entries from AosCore system components.
Log requests support time-range filtering (from/till timestamps) and are returned as compressed archives split into configurable part sizes for efficient transport over the cloud protocol.
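The request handling described above (time-range filter, compression, splitting into parts) can be sketched as follows. This is an illustrative sketch under assumptions rather than the log provider's code: the function names are hypothetical, both range bounds are treated as inclusive, and the archive is compressed as a whole with gzip and then sliced, whereas the real implementation may compress each part separately or use another format.

```python
import gzip


def filter_by_time(entries, from_ts=None, till_ts=None):
    """Keep (timestamp, message) entries within [from_ts, till_ts]; None means unbounded."""
    return [
        (ts, msg)
        for ts, msg in entries
        if (from_ts is None or ts >= from_ts) and (till_ts is None or ts <= till_ts)
    ]


def build_log_parts(entries, max_part_size):
    """Compress the entries into one gzip archive and split it into byte parts."""
    if max_part_size <= 0:
        raise ValueError("max_part_size must be positive")
    text = "".join(f"{ts} {msg}\n" for ts, msg in entries)
    archive = gzip.compress(text.encode("utf-8"))
    return [archive[i:i + max_part_size] for i in range(0, len(archive), max_part_size)]


# Synthetic journal entries: (unix timestamp, message).
journal = [(100, "boot"), (200, "service started"), (300, "error: timeout"), (400, "shutdown")]
selected = filter_by_time(journal, from_ts=200, till_ts=300)
parts = build_log_parts(selected, max_part_size=16)

# The receiver concatenates the parts in order before decompressing.
restored = gzip.decompress(b"".join(parts)).decode("utf-8")
```

Splitting after compression keeps each part within the transport limit while leaving reassembly on the receiving side as a plain concatenation.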
Architecture
The monitoring and observability subsystems are implemented across these modules:
| Module | Location | Responsibility |
|---|---|---|
| Monitoring (common) | core/common/monitoring/ | Metric averaging, alert threshold processing, data aggregation |
| Node Monitoring Provider | sm/monitoring/ | Collects raw CPU, RAM, disk, and network metrics from the OS |
| Journal Alerts | sm/alerts/ | Monitors systemd journal for error-level entries |
| Log Provider | sm/logprovider/ | Serves log requests from the cloud |
The data flow follows this path:
- Collection — NodeMonitoringProvider reads system metrics from procfs and filesystem APIs at each poll interval.
- Aggregation — The common Monitoring module averages samples over the configured window and maintains per-Node and per-instance data.
- Alert evaluation — AlertProcessor instances compare current values against threshold rules and emit alerts when conditions are met.
- Transmission — Aggregated monitoring data and alerts are sent to the cloud via CM.
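The four stages above can be pictured as one poll cycle that chains pluggable callables. The sketch below only shows the ordering of the stages. The function `run_poll_cycle`, its parameters, and the message layout are hypothetical and do not reflect the actual module interfaces.

```python
def run_poll_cycle(collect, aggregate, evaluate, send):
    """Run one cycle: collection -> aggregation -> alert evaluation -> transmission."""
    raw = collect()                  # raw samples, e.g. {"cpu": 95.0, "ram": 40.0}
    averaged = aggregate(raw)        # smoothed values per metric
    alerts = evaluate(averaged)      # alerts derived from the smoothed values
    send({"monitoring": averaged, "alerts": alerts})
    return averaged, alerts


# Stand-in stages for illustration only.
sent = []
limits = {"cpu": 90.0, "ram": 80.0}

averaged, alerts = run_poll_cycle(
    collect=lambda: {"cpu": 95.0, "ram": 40.0},
    aggregate=lambda raw: dict(raw),  # identity: a single-sample window
    evaluate=lambda data: sorted(k for k, v in data.items() if v > limits[k]),
    send=sent.append,
)
```

Keeping the stages as separate callables mirrors the module split in the table: the collector can be swapped per platform while aggregation and alert evaluation stay shared.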
In This Section
- Monitoring Pipeline — detailed data collection, averaging, and transmission flow
- Alerts and Thresholds — alert rule configuration, threshold behavior, and alert types
- Logging Pipeline — log collection, request handling, and archive delivery
Related Pages
- Architecture Overview — system component relationships
- Service Manager — SM component that hosts monitoring subsystems
- Configuration — monitoring and alerting configuration options