Alerts and Thresholds
Introduction
This page describes the AosCore alerting system in detail — the types of alerts generated, how threshold rules are configured, the hysteresis mechanism that prevents alert flapping, and how journal-based alerts capture error-level log entries from system components and service instances.
Alerts provide operators with real-time visibility into resource pressure, component failures, and download progress. They are generated locally on each Node and forwarded to AosCloud through the Communication Manager (CM) for centralized processing and notification.
Alert Types
AosCore defines seven alert types, each carrying a distinct tag that identifies its category. All alerts include a timestamp indicating when the condition was detected.
SystemQuotaAlert
Raised when a Node-level resource metric (CPU, RAM, disk partition, download, or upload) crosses a configured threshold.
| Field | Description |
|---|---|
| tag | systemQuotaAlert |
| timestamp | Time the alert condition was detected |
| nodeID | Identifier of the Node where the threshold was crossed |
| parameter | Resource name (e.g., cpu, ram, download, upload, or partition name) |
| value | Current resource usage value at the time of the alert |
| state | Alert state: raise, continue, or fall |
SystemQuotaAlerts are generated by the AlertProcessor when Node-level monitoring data exceeds configured thresholds.
The thresholds are defined as percentages in the Node configuration and converted to absolute values at startup.
InstanceQuotaAlert
Raised when a specific service instance exceeds its allocated resource quota.
| Field | Description |
|---|---|
| tag | instanceQuotaAlert |
| timestamp | Time the alert condition was detected |
| serviceID | Service identifier |
| subjectID | Subject (tenant) identifier |
| instance | Instance number |
| parameter | Resource name (e.g., cpu, ram, download, upload, or partition name) |
| value | Current resource usage value |
| state | Alert state: raise, continue, or fall |
InstanceQuotaAlerts use the same threshold mechanism as SystemQuotaAlerts but are scoped to individual service instances. Thresholds are defined per-instance through the service configuration.
ResourceAllocateAlert
Generated when resource allocation for a service instance fails — for example, when the system cannot allocate the requested disk space or network bandwidth.
| Field | Description |
|---|---|
| tag | resourceAllocateAlert |
| timestamp | Time the allocation failure occurred |
| serviceID | Service identifier |
| subjectID | Subject (tenant) identifier |
| instance | Instance number |
| nodeID | Node where allocation was attempted |
| resource | Name of the resource that failed to allocate |
| message | Description of the allocation failure |
SystemAlert
Captures error-level messages from the systemd journal that do not originate from AosCore components or service instances. These represent OS-level or third-party service errors.
| Field | Description |
|---|---|
| tag | systemAlert |
| timestamp | Time the journal entry was recorded |
| nodeID | Node where the entry was logged |
| message | Content of the journal entry |
CoreAlert
Captures error-level journal entries from AosCore core components (CM, SM, IAM).
| Field | Description |
|---|---|
| tag | coreAlert |
| timestamp | Time the journal entry was recorded |
| nodeID | Node where the entry was logged |
| coreComponent | Component that generated the entry: CM, SM, IAM, or MP |
| message | Content of the journal entry |
The journal alerts module identifies core component entries by matching the systemd unit name against known service
names (aos-cm.service, aos-sm.service, aos-iam.service).
InstanceAlert
Captures error-level journal entries from service instances.
| Field | Description |
|---|---|
| tag | updateItemInstanceAlert |
| timestamp | Time the journal entry was recorded |
| serviceID | Service identifier |
| subjectID | Subject (tenant) identifier |
| instance | Instance number |
| version | Service version |
| message | Content of the journal entry |
Instance alerts are identified by matching the systemd unit name pattern aos-service@<instanceID>.service. The
instance ID is resolved to the full instance identity (service ID, subject ID, instance number) through the Instance
Info Provider.
DownloadAlert
Tracks the progress and state of image downloads.
| Field | Description |
|---|---|
| tag | downloadProgressAlert |
| timestamp | Time of the state change |
| digest | Content digest of the image being downloaded |
| url | Download URL |
| downloadedBytes | Bytes downloaded so far |
| totalBytes | Total expected size |
| state | Download state: started, paused, interrupted, or finished |
| reason | Optional reason for state change (e.g., why download was interrupted) |
| error | Error code if the download failed |
Threshold Configuration
Resource alerts (SystemQuotaAlert and InstanceQuotaAlert) are controlled by threshold rules. Each rule defines three parameters:
| Parameter | Type | Description |
|---|---|---|
| minTimeout | Duration (ISO 8601) | How long a threshold must be sustained before an alert fires |
| minThreshold | Number | Lower boundary: dropping below this for minTimeout clears the alert |
| maxThreshold | Number | Upper boundary: exceeding this for minTimeout raises the alert |
Rule Formats
Thresholds are expressed in two formats depending on the resource type:
Percentage-based rules (AlertRulePercents) — Used for CPU, RAM, and disk partitions. Thresholds are specified as
percentages (0–100) of the total resource capacity. At startup, the monitoring system converts these to absolute values
based on the Node's actual resource capacity.
```json
{
  "cpu": {
    "minTimeout": "PT10S",
    "minThreshold": 80,
    "maxThreshold": 90
  }
}
```
In this example, a CPU alert is raised when usage exceeds 90% for at least 10 seconds, and is cleared when usage drops below 80% for at least 10 seconds.
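The startup conversion from percentages to absolute values can be illustrated with a short Python sketch. The `PercentRule` class, the `to_absolute` function, and the capacity figure are hypothetical names chosen for illustration, not AosCore identifiers; the sketch only assumes that a percentage threshold is scaled linearly against the total capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PercentRule:
    """Hypothetical stand-in for a percentage-based rule."""
    min_threshold: float  # percent of total capacity, 0-100
    max_threshold: float  # percent of total capacity, 0-100


def to_absolute(rule: PercentRule, total_capacity: float) -> tuple[float, float]:
    """Scale both thresholds into the same unit as total_capacity."""
    for pct in (rule.min_threshold, rule.max_threshold):
        if not 0 <= pct <= 100:
            raise ValueError(f"percentage out of range: {pct}")
    return (
        total_capacity * rule.min_threshold / 100,
        total_capacity * rule.max_threshold / 100,
    )


# Assumed capacity of 1000 units (for example, DMIPS) with the 80/90 rule.
low, high = to_absolute(PercentRule(80, 90), 1000)
print(low, high)
```

With a capacity of 1000 units, the 80/90 rule becomes a lower boundary of 800 and an upper boundary of 900 in the same unit as the monitored metric.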
Absolute-value rules (AlertRulePoints) — Used for download and upload traffic. Thresholds are specified in bytes
(or bytes per second, depending on the metric).
```json
{
  "download": {
    "minTimeout": "PT30S",
    "minThreshold": 1000,
    "maxThreshold": 2000
  }
}
```
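Both rule formats carry minTimeout as an ISO 8601 duration string such as PT10S or PT30S. The following Python sketch shows one way to turn such a string into a `timedelta`. It is an illustration only: `parse_duration` is a hypothetical helper, and it deliberately handles just the time portion (hours, minutes, seconds) of the ISO 8601 duration syntax.

```python
import re
from datetime import timedelta

# Accepts PT<h>H<m>M<s>S with each component optional; date parts are not handled.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def parse_duration(text: str) -> timedelta:
    """Parse a time-only ISO 8601 duration such as 'PT10S' or 'PT1M30S'."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return timedelta(
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )


print(parse_duration("PT30S").total_seconds())
```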
Node-Level Configuration
Node-level alert rules are defined in the Node configuration (delivered as part of the Unit configuration from the
cloud). The alertRules object supports:
| Resource | Rule Type | Description |
|---|---|---|
| cpu | Percentage | CPU usage threshold as percentage of total DMIPS |
| ram | Percentage | RAM usage threshold as percentage of total memory |
| partitions | Percentage (per partition) | Disk usage threshold per named partition |
| download | Absolute | Download traffic threshold in bytes |
| upload | Absolute | Upload traffic threshold in bytes |
Example Node-level alert rules:
```json
{
  "alertRules": {
    "cpu": {
      "minTimeout": "PT10S",
      "minThreshold": 80,
      "maxThreshold": 90
    },
    "ram": {
      "minTimeout": "PT10S",
      "minThreshold": 70,
      "maxThreshold": 85
    },
    "partitions": [
      {
        "name": "states",
        "minTimeout": "PT30S",
        "minThreshold": 70,
        "maxThreshold": 90
      }
    ],
    "download": {
      "minTimeout": "PT30S",
      "minThreshold": 100000,
      "maxThreshold": 200000
    },
    "upload": {
      "minTimeout": "PT30S",
      "minThreshold": 50000,
      "maxThreshold": 100000
    }
  }
}
```
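A configuration like the one above can be sanity-checked before it is deployed. The Python sketch below is a hypothetical pre-flight check, not part of AosCore: the function names are invented, and the rules it enforces (all three parameters present, minThreshold not greater than maxThreshold, percentages within 0 to 100) are assumptions derived from the parameter descriptions on this page.

```python
import json

PERCENT_KEYS = ("cpu", "ram")
ABSOLUTE_KEYS = ("download", "upload")


def check_rule(name, rule, percent):
    """Return a list of problems found in a single threshold rule."""
    errors = []
    if "minTimeout" not in rule:
        errors.append(f"{name}: missing minTimeout")
    low, high = rule.get("minThreshold"), rule.get("maxThreshold")
    if low is None or high is None:
        errors.append(f"{name}: missing threshold")
        return errors
    if low > high:
        errors.append(f"{name}: minThreshold exceeds maxThreshold")
    if percent and not (0 <= low <= 100 and 0 <= high <= 100):
        errors.append(f"{name}: percentage outside 0-100")
    return errors


def validate_alert_rules(config):
    """Check every rule under the alertRules object."""
    rules = config.get("alertRules", {})
    errors = []
    for key in PERCENT_KEYS:
        if key in rules:
            errors += check_rule(key, rules[key], percent=True)
    for part in rules.get("partitions", []):
        errors += check_rule(f"partition {part.get('name')}", part, percent=True)
    for key in ABSOLUTE_KEYS:
        if key in rules:
            errors += check_rule(key, rules[key], percent=False)
    return errors


sample = json.loads("""
{"alertRules": {
  "cpu": {"minTimeout": "PT10S", "minThreshold": 80, "maxThreshold": 90},
  "partitions": [
    {"name": "states", "minTimeout": "PT30S", "minThreshold": 70, "maxThreshold": 90}
  ],
  "download": {"minTimeout": "PT30S", "minThreshold": 100000, "maxThreshold": 200000}
}}
""")
print(validate_alert_rules(sample))  # an empty list means no problems were found
```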
Instance-Level Configuration
Instance-level alert rules follow the same structure and are defined per service instance through the service configuration. They control when InstanceQuotaAlerts are generated for individual services exceeding their allocated quotas.
Dynamic Reconfiguration
Alert thresholds can be updated dynamically through the Unit configuration without restarting the monitoring pipeline. When a new configuration is received from the cloud, the alert processors are reconfigured with the updated threshold values.
Hysteresis Behavior
The alert system implements hysteresis through a two-state machine with three output states. This prevents rapid alert toggling (flapping) when a metric oscillates around a threshold boundary.
State Machine
Each AlertProcessor maintains an internal boolean state (alertCondition) that tracks whether the resource is
currently in an alert condition:
```
              value ≥ maxThreshold
                for ≥ minTimeout
┌──────────┐ ─────────────────────► ┌──────────────┐
│          │        [Raise]         │              │
│  Normal  │                        │    Alert     │
│          │ ◄───────────────────── │  Condition   │
└──────────┘  value < minThreshold  └──────────────┘
                for ≥ minTimeout           │
                     [Fall]                │
                                           │ value ≥ minThreshold
                                           │ for ≥ minTimeout
                                           │ [Continue]
                                           └──────┐
                                                  │
                                                  ▼
                                         (periodic reminder)
```
Alert States
| State | Trigger | Meaning |
|---|---|---|
| Raise | Value ≥ maxThreshold sustained for ≥ minTimeout | Alert condition begins — resource pressure detected |
| Continue | Value remains ≥ minThreshold while in alert condition, every minTimeout interval | Periodic reminder that alert condition persists |
| Fall | Value < minThreshold sustained for ≥ minTimeout | Alert condition ends — resource pressure resolved |
Detailed Behavior
- Normal → Raise: When the monitored value first crosses maxThreshold, a timer starts. If the value remains at or above maxThreshold for the full minTimeout duration, a Raise alert is sent and the processor enters the alert condition. If the value drops below maxThreshold before the timeout expires, the timer resets.
- Alert Condition → Continue: While in the alert condition, if the value remains at or above minThreshold, a Continue alert is sent every minTimeout interval. This provides ongoing visibility that the resource is still under pressure, even if it has dropped below maxThreshold.
- Alert Condition → Fall: When the value drops below minThreshold, a timer starts. If it remains below minThreshold for the full minTimeout duration, a Fall alert is sent and the processor returns to the normal state. If the value rises back to or above minThreshold before the timeout expires, the timer resets and the alert condition continues.
Example Sequence
Given a rule with minThreshold: 80, maxThreshold: 90, minTimeout: 10s:
| Time | Value | Event |
|---|---|---|
| T+0s | 50% | Normal — no action |
| T+5s | 92% | Exceeds maxThreshold — timer starts |
| T+15s | 91% | Timer expired (10s elapsed) — Raise alert sent |
| T+25s | 85% | Still ≥ minThreshold, timer expired — Continue alert sent |
| T+35s | 82% | Still ≥ minThreshold, timer expired — Continue alert sent |
| T+40s | 75% | Below minThreshold — recovery timer starts |
| T+50s | 72% | Recovery timer expired (10s elapsed) — Fall alert sent |
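The hysteresis behavior and the example sequence can be reproduced with a small Python model. This is a sketch of the behavior as described on this page, not the AosCore implementation: the class below only borrows the AlertProcessor name, its `update` method and attribute names are hypothetical, and details such as how the Continue interval is measured (here, from the last alert sent) are assumptions.

```python
from typing import Optional


class AlertProcessor:
    """Illustrative two-state hysteresis model emitting raise/continue/fall."""

    def __init__(self, min_threshold, max_threshold, min_timeout):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.min_timeout = min_timeout
        self.alert_condition = False
        self._raise_since: Optional[float] = None  # start of sustained >= max
        self._fall_since: Optional[float] = None   # start of sustained < min
        self._last_sent: Optional[float] = None    # time of last raise/continue

    def update(self, now, value):
        """Feed one sample; return 'raise', 'continue', 'fall', or None."""
        if not self.alert_condition:
            if value < self.max_threshold:
                self._raise_since = None  # dropped back: reset the raise timer
                return None
            if self._raise_since is None:
                self._raise_since = now
            if now - self._raise_since >= self.min_timeout:
                self.alert_condition = True
                self._raise_since = None
                self._fall_since = None
                self._last_sent = now
                return "raise"
            return None

        if value >= self.min_threshold:
            self._fall_since = None  # recovered pressure: reset the fall timer
            if now - self._last_sent >= self.min_timeout:
                self._last_sent = now
                return "continue"
            return None

        if self._fall_since is None:
            self._fall_since = now
        if now - self._fall_since >= self.min_timeout:
            self.alert_condition = False
            self._fall_since = None
            return "fall"
        return None


# Replay the example sequence: (seconds, CPU percent).
samples = [(0, 50), (5, 92), (15, 91), (25, 85), (35, 82), (40, 75), (50, 72)]
processor = AlertProcessor(min_threshold=80, max_threshold=90, min_timeout=10)
events = [processor.update(t, v) for t, v in samples]
print(events)
```

Replaying the table's samples yields a raise at T+15s, continue alerts at T+25s and T+35s, and a fall at T+50s, matching the sequence above.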
Journal-Based Alerts
In addition to threshold-based resource alerts, AosCore monitors the systemd journal for error-level entries and forwards them as alerts to the cloud. This provides visibility into component crashes, service failures, and system-level errors without requiring explicit instrumentation.
How It Works
The JournalAlerts module (located in sm/alerts/) runs a dedicated monitoring thread that:
- Opens the systemd journal and seeks to the last stored cursor position
- Adds match filters for entries whose priority value is at or below the configured level (that is, equally or more severe)
- Continuously reads new journal entries as they appear
- Categorizes each entry and sends the appropriate alert type
Entry Categorization
When a new journal entry is read, it is categorized based on the systemd unit that produced it:
| Condition | Alert Type | Identification |
|---|---|---|
| Unit matches aos-service@*.service | InstanceAlert | Instance ID extracted from unit name, resolved to full identity |
| Unit matches aos-cm.service, aos-sm.service, or aos-iam.service | CoreAlert | Core component identified from service name |
| Any other entry | SystemAlert | Generic system-level alert |
For service instances running under cgroup v2, the systemd unit may not be present in the journal entry. In this case,
the instance ID is extracted from the _SYSTEMD_CGROUP field instead.
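The categorization rules can be sketched in Python as follows. The function name, the return shape, and the sample cgroup path are hypothetical; only the unit names and the aos-service@<instanceID>.service pattern come from this page. Resolving the instance ID to a full identity through the Instance Info Provider is left out.

```python
import re
from typing import Optional, Tuple

# Core units listed on this page, mapped to their component names.
CORE_UNITS = {
    "aos-cm.service": "CM",
    "aos-sm.service": "SM",
    "aos-iam.service": "IAM",
}
INSTANCE_UNIT = re.compile(r"aos-service@(?P<id>[^/]+)\.service")


def categorize(unit: Optional[str], cgroup: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (alert type, instance ID or core component) for one journal entry."""
    # Try the unit name first, then fall back to the cgroup path.
    for source in (unit, cgroup):
        if source:
            match = INSTANCE_UNIT.search(source)
            if match:
                return "InstanceAlert", match.group("id")
    if unit in CORE_UNITS:
        return "CoreAlert", CORE_UNITS[unit]
    return "SystemAlert", None


print(categorize("aos-service@1234.service"))
print(categorize(None, "/system.slice/aos-service@abcd.service"))  # assumed path layout
print(categorize("aos-sm.service"))
print(categorize("sshd.service"))
```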
Priority Filtering
The journal alerts module uses two priority levels from the configuration:
| Parameter | Description |
|---|---|
| systemAlertPriority | Maximum journal priority level for system-wide monitoring (e.g., 3 = error and more severe) |
| serviceAlertPriority | Maximum priority level for service instance entries from init.scope |
Journal priority levels follow the standard syslog scale (0 = emergency, 7 = debug). Setting systemAlertPriority to 3
means only entries with priority 0 (emergency), 1 (alert), 2 (critical), or 3 (error) are captured.
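Because lower numbers mean higher severity on the syslog scale, the filter is a simple "less than or equal" comparison. A minimal Python sketch, with `is_captured` as a hypothetical helper name:

```python
# Standard syslog severity names, indexed by priority 0..7.
SYSLOG_LEVELS = ("emerg", "alert", "crit", "err", "warning", "notice", "info", "debug")


def is_captured(entry_priority: int, max_priority: int) -> bool:
    """An entry is captured when it is at least as severe as max_priority."""
    if not 0 <= entry_priority <= 7:
        raise ValueError(f"invalid syslog priority: {entry_priority}")
    return entry_priority <= max_priority


# Levels captured when systemAlertPriority is 3.
captured = [name for level, name in enumerate(SYSLOG_LEVELS) if is_captured(level, 3)]
print(captured)  # ['emerg', 'alert', 'crit', 'err']
```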
Message Filtering
The configuration supports regex-based message filters that suppress specific alert messages. If a journal entry's message matches any configured filter pattern, the entry is silently discarded. This allows operators to suppress known-noisy entries that would otherwise generate excessive alerts.
```json
{
  "journalAlerts": {
    "systemAlertPriority": 3,
    "serviceAlertPriority": 4,
    "filter": [
      "^audit:.*",
      "^systemd\\[1\\]: .* failed\\.$"
    ]
  }
}
```
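The effect of the filter patterns can be checked with Python's `re` module. This is a sketch under assumptions: `is_suppressed` is a hypothetical helper, the sample messages are invented, and the regex dialect and matching mode used by AosCore may differ from Python's `re.search`. Note that the doubled backslashes in the JSON decode to single backslashes in the actual pattern.

```python
import json
import re

# The filter list from the example configuration, kept as raw JSON text.
config = json.loads(r"""
{"journalAlerts": {"filter": ["^audit:.*", "^systemd\\[1\\]: .* failed\\.$"]}}
""")
patterns = [re.compile(p) for p in config["journalAlerts"]["filter"]]


def is_suppressed(message: str) -> bool:
    """True when any configured pattern matches the message."""
    return any(p.search(message) for p in patterns)


for msg in (
    "audit: type=1400 apparmor denied",       # invented sample message
    "systemd[1]: example.service failed.",    # invented sample message
    "kernel: out of memory",                  # invented sample message
):
    print(msg, "->", "suppressed" if is_suppressed(msg) else "forwarded")
```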
Cursor Persistence
The journal alerts module maintains a cursor that tracks the last-read position in the journal. This cursor is:
- Persisted to storage periodically (every 10 seconds by default)
- Restored on startup to avoid re-processing old entries
- Reset on journal read errors to recover from corruption
This ensures that after a restart, the module resumes from where it left off without generating duplicate alerts or missing entries that arrived during downtime.
Error Recovery
If the journal monitoring thread encounters an error while reading entries:
- The current cursor is cleared from storage
- The journal connection is re-established
- The read position is moved to the tail of the journal (most recent entry)
- Monitoring resumes from the current position
A backoff mechanism doubles the wait timeout on consecutive errors (up to 10× the normal timeout) to avoid tight error loops.
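The capped doubling can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that the wait starts at the normal timeout and is doubled after each consecutive error; whether the first retry waits the normal or the doubled timeout is not stated here.

```python
def next_timeout(current: float, normal: float, max_factor: int = 10) -> float:
    """Double the wait, but never exceed max_factor times the normal timeout."""
    return min(current * 2, normal * max_factor)


def backoff_sequence(normal: float, errors: int) -> list:
    """Wait timeouts after each of `errors` consecutive failures."""
    timeout = normal
    waits = []
    for _ in range(errors):
        timeout = next_timeout(timeout, normal)
        waits.append(timeout)
    return waits


print(backoff_sequence(1.0, 5))  # [2.0, 4.0, 8.0, 10.0, 10.0]
```

A successful read would presumably reset the wait to the normal timeout, which the sketch models by starting each sequence from `normal`.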
Alert Delivery
All alerts follow the same delivery path regardless of type:
- Generation: The alert is created by the appropriate subsystem (AlertProcessor for quota alerts, JournalAlerts for journal-based alerts, Downloader for download alerts)
- Local send: The alert is passed to the alerts::SenderItf implementation within the SM
- Forwarding to CM: The SM forwards alerts to the CM through the internal communication channel
- Cloud transmission: The CM batches alerts and sends them to AosCloud over the WebSocket connection as part of the Alerts message
Alerts are sent as an array of AlertVariant items, allowing multiple alerts of different types to be batched in a
single cloud message.
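To illustrate how alerts of different types can share one batch, the Python sketch below builds a mixed list in which each item carries its own tag. Only the tags and field names are taken from the tables on this page; the `make_alert` helper, the `items` envelope key, the node ID, and the sample values are hypothetical and do not describe the actual AosCloud wire format.

```python
import json


def make_alert(tag: str, timestamp: str, **fields) -> dict:
    """Build one tagged alert item; the tag identifies the alert type."""
    return {"tag": tag, "timestamp": timestamp, **fields}


batch = [
    make_alert(
        "systemQuotaAlert", "2024-01-01T00:00:15Z",
        nodeID="node-1", parameter="cpu", value=91, state="raise",
    ),
    make_alert(
        "coreAlert", "2024-01-01T00:00:16Z",
        nodeID="node-1", coreComponent="SM", message="example error entry",
    ),
]

# Hypothetical envelope: the real Alerts message layout is not shown here.
payload = json.dumps({"items": batch})
print(payload)
```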
Related Pages
- Monitoring Pipeline — data collection, averaging, and how metrics reach the alert evaluation stage
- Monitoring and Observability — overview of all monitoring subsystems
- Logging Pipeline — log collection and retrieval (distinct from journal-based alerts)
- Storage and Quota Configuration — quota settings that influence alert thresholds
- Service Manager — SM component that hosts the alerting subsystems