Version: v1.1

Alerts and Thresholds

Introduction

This page describes the AosCore alerting system in detail — the types of alerts generated, how threshold rules are configured, the hysteresis mechanism that prevents alert flapping, and how journal-based alerts capture error-level log entries from system components and service instances.

Alerts provide operators with real-time visibility into resource pressure, component failures, and download progress. They are generated locally on each Node and forwarded to AosCloud through the Communication Manager (CM) for centralized processing and notification.

Alert Types

AosCore defines seven alert types, each carrying a distinct tag that identifies its category. All alerts include a timestamp indicating when the condition was detected.

SystemQuotaAlert

Raised when a Node-level resource metric (CPU, RAM, disk partition, download, or upload) crosses a configured threshold.

Field	Description
`tag`	`systemQuotaAlert`
`timestamp`	Time the alert condition was detected
`nodeID`	Identifier of the Node where the threshold was crossed
`parameter`	Resource name (e.g., `cpu`, `ram`, `download`, `upload`, or partition name)
`value`	Current resource usage value at the time of the alert
`state`	Alert state: `raise`, `continue`, or `fall`

SystemQuotaAlerts are generated by the AlertProcessor when Node-level monitoring data exceeds configured thresholds. The thresholds are defined as percentages in the Node configuration and converted to absolute values at startup.

InstanceQuotaAlert

Raised when a specific service instance exceeds its allocated resource quota.

Field	Description
`tag`	`instanceQuotaAlert`
`timestamp`	Time the alert condition was detected
`serviceID`	Service identifier
`subjectID`	Subject (tenant) identifier
`instance`	Instance number
`parameter`	Resource name (e.g., `cpu`, `ram`, `download`, `upload`, or partition name)
`value`	Current resource usage value
`state`	Alert state: `raise`, `continue`, or `fall`

InstanceQuotaAlerts use the same threshold mechanism as SystemQuotaAlerts but are scoped to individual service instances. Thresholds are defined per-instance through the service configuration.

ResourceAllocateAlert

Generated when resource allocation for a service instance fails — for example, when the system cannot allocate the requested disk space or network bandwidth.

Field	Description
`tag`	`resourceAllocateAlert`
`timestamp`	Time the allocation failure occurred
`serviceID`	Service identifier
`subjectID`	Subject (tenant) identifier
`instance`	Instance number
`nodeID`	Node where allocation was attempted
`resource`	Name of the resource that failed to allocate
`message`	Description of the allocation failure

SystemAlert

Captures error-level messages from the systemd journal that do not originate from AosCore components or service instances. These represent OS-level or third-party service errors.

Field	Description
`tag`	`systemAlert`
`timestamp`	Time the journal entry was recorded
`nodeID`	Node where the entry was logged
`message`	Content of the journal entry

CoreAlert

Captures error-level journal entries from AosCore core components (CM, SM, IAM).

Field	Description
`tag`	`coreAlert`
`timestamp`	Time the journal entry was recorded
`nodeID`	Node where the entry was logged
`coreComponent`	Component that generated the entry: `CM`, `SM`, `IAM`, or `MP`
`message`	Content of the journal entry

The journal alerts module identifies core component entries by matching the systemd unit name against known service names (aos-cm.service, aos-sm.service, aos-iam.service).

InstanceAlert

Captures error-level journal entries from service instances.

Field	Description
`tag`	`updateItemInstanceAlert`
`timestamp`	Time the journal entry was recorded
`serviceID`	Service identifier
`subjectID`	Subject (tenant) identifier
`instance`	Instance number
`version`	Service version
`message`	Content of the journal entry

Instance alerts are identified by matching the systemd unit name pattern aos-service@<instanceID>.service. The instance ID is resolved to the full instance identity (service ID, subject ID, instance number) through the Instance Info Provider.

DownloadAlert

Tracks the progress and state of image downloads.

Field	Description
`tag`	`downloadProgressAlert`
`timestamp`	Time of the state change
`digest`	Content digest of the image being downloaded
`url`	Download URL
`downloadedBytes`	Bytes downloaded so far
`totalBytes`	Total expected size
`state`	Download state: `started`, `paused`, `interrupted`, or `finished`
`reason`	Optional reason for state change (e.g., why download was interrupted)
`error`	Error code if the download failed

Threshold Configuration

Resource alerts (SystemQuotaAlert and InstanceQuotaAlert) are controlled by threshold rules. Each rule defines three parameters:

Parameter	Type	Description
`minTimeout`	Duration (ISO 8601)	How long a threshold condition must hold before the alert is raised or cleared
`minThreshold`	Number	Lower boundary: staying below this for `minTimeout` clears the alert
`maxThreshold`	Number	Upper boundary: staying at or above this for `minTimeout` raises the alert

Rule Formats

Thresholds are expressed in two formats depending on the resource type:

Percentage-based rules (AlertRulePercents) — Used for CPU, RAM, and disk partitions. Thresholds are specified as percentages (0–100) of the total resource capacity. At startup, the monitoring system converts these to absolute values based on the Node's actual resource capacity.

{
  "cpu": {
    "minTimeout": "PT10S",
    "minThreshold": 80,
    "maxThreshold": 90
  }
}

In this example, a CPU alert is raised when usage stays at or above 90% for at least 10 seconds, and is cleared when usage stays below 80% for at least 10 seconds.
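The percentage-to-absolute conversion described above can be sketched as follows. This is a minimal illustration, not the AosCore implementation: the function names and the 8 GiB capacity are assumptions made for the example.

```python
def percent_to_absolute(percent: float, capacity: float) -> float:
    """Convert a percentage threshold (0-100) into the units of `capacity`."""
    if not 0 <= percent <= 100:
        raise ValueError(f"percentage out of range: {percent}")
    return capacity * percent / 100.0


def resolve_rule(rule: dict, capacity: float) -> dict:
    """Return a copy of a percentage rule with absolute thresholds."""
    return {
        "minTimeout": rule["minTimeout"],
        "minThreshold": percent_to_absolute(rule["minThreshold"], capacity),
        "maxThreshold": percent_to_absolute(rule["maxThreshold"], capacity),
    }


# Hypothetical Node with 8 GiB of RAM and a 70%/85% rule.
ram_rule = {"minTimeout": "PT10S", "minThreshold": 70, "maxThreshold": 85}
resolved = resolve_rule(ram_rule, capacity=8 * 1024**3)
```

Resolving the rule once keeps the per-sample comparison a plain numeric check against values in the same unit as the collected metric.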

Absolute-value rules (AlertRulePoints) — Used for download and upload traffic. Thresholds are specified in bytes (or bytes per second, depending on the metric).

{
  "download": {
    "minTimeout": "PT30S",
    "minThreshold": 1000,
    "maxThreshold": 2000
  }
}
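Both rule formats carry `minTimeout` as an ISO 8601 duration string. As an illustration of how such a value could be turned into a time span, here is a small parser. It is a sketch that only accepts the time part of the format (hours, minutes, seconds, as in `PT10S`), and the function name is hypothetical.

```python
import re
from datetime import timedelta

# Accepts only the time portion of ISO 8601 durations, e.g. PT10S or PT1M30S.
_DURATION_RE = re.compile(
    r"^PT(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?$"
)


def parse_min_timeout(text: str) -> timedelta:
    """Parse a duration such as 'PT30S' into a timedelta."""
    match = _DURATION_RE.match(text)
    # Reject strings that do not match, and a bare 'PT' with no components.
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = match.groupdict(default="0")
    return timedelta(
        hours=int(parts["hours"]),
        minutes=int(parts["minutes"]),
        seconds=float(parts["seconds"]),
    )
```

For example, `parse_min_timeout("PT30S")` yields a 30 second span, matching the download rule above.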

Node-Level Configuration

Node-level alert rules are defined in the Node configuration (delivered as part of the Unit configuration from the cloud). The alertRules object supports:

Resource	Rule Type	Description
`cpu`	Percentage	CPU usage threshold as percentage of total DMIPS
`ram`	Percentage	RAM usage threshold as percentage of total memory
`partitions`	Percentage (per partition)	Disk usage threshold per named partition
`download`	Absolute	Download traffic threshold in bytes
`upload`	Absolute	Upload traffic threshold in bytes

Example Node-level alert rules:

{
  "alertRules": {
    "cpu": {
      "minTimeout": "PT10S",
      "minThreshold": 80,
      "maxThreshold": 90
    },
    "ram": {
      "minTimeout": "PT10S",
      "minThreshold": 70,
      "maxThreshold": 85
    },
    "partitions": [
      {
        "name": "states",
        "minTimeout": "PT30S",
        "minThreshold": 70,
        "maxThreshold": 90
      }
    ],
    "download": {
      "minTimeout": "PT30S",
      "minThreshold": 100000,
      "maxThreshold": 200000
    },
    "upload": {
      "minTimeout": "PT30S",
      "minThreshold": 50000,
      "maxThreshold": 100000
    }
  }
}
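Before deploying a configuration like the one above, it can help to sanity-check the rules. The sketch below does this under two assumptions that this page does not state as enforced by AosCore: that `minThreshold` should not exceed `maxThreshold`, and that percentage rules stay within 0 to 100. All function names are hypothetical.

```python
PERCENT_RESOURCES = ("cpu", "ram")
ABSOLUTE_RESOURCES = ("download", "upload")


def check_rule(name: str, rule: dict, percent: bool) -> list:
    """Return a list of human-readable problems found in one rule."""
    problems = []
    low, high = rule["minThreshold"], rule["maxThreshold"]
    if low > high:
        problems.append(f"{name}: minThreshold {low} exceeds maxThreshold {high}")
    if percent and not (0 <= low <= 100 and 0 <= high <= 100):
        problems.append(f"{name}: percentage thresholds must be within 0-100")
    return problems


def check_alert_rules(alert_rules: dict) -> list:
    """Check every rule in an alertRules object; an empty list means no problems."""
    problems = []
    for name in PERCENT_RESOURCES:
        if name in alert_rules:
            problems += check_rule(name, alert_rules[name], percent=True)
    for partition in alert_rules.get("partitions", []):
        label = f"partitions[{partition['name']}]"
        problems += check_rule(label, partition, percent=True)
    for name in ABSOLUTE_RESOURCES:
        if name in alert_rules:
            problems += check_rule(name, alert_rules[name], percent=False)
    return problems
```

Passing the `alertRules` object from the example above returns an empty list.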

Instance-Level Configuration

Instance-level alert rules follow the same structure and are defined per service instance through the service configuration. They control when InstanceQuotaAlerts are generated for individual services exceeding their allocated quotas.

Dynamic Reconfiguration

Alert thresholds can be updated dynamically through the Unit configuration without restarting the monitoring pipeline. When a new configuration is received from the cloud, the alert processors are reconfigured with the updated threshold values.

Hysteresis Behavior

The alert system implements hysteresis through a two-state machine with three output states. This prevents rapid alert toggling (flapping) when a metric oscillates around a threshold boundary.

State Machine

Each AlertProcessor maintains an internal boolean state (alertCondition) that tracks whether the resource is currently in an alert condition:

                    value ≥ maxThreshold
                    for ≥ minTimeout
    ┌──────────┐  ─────────────────────►  ┌──────────────┐
    │          │         [Raise]           │              │
    │  Normal  │                           │    Alert     │
    │          │  ◄─────────────────────   │  Condition   │
    └──────────┘   value < minThreshold    └──────────────┘
                    for ≥ minTimeout               │
                         [Fall]                    │
                                                   │ value ≥ minThreshold
                                                   │ for ≥ minTimeout
                                                   │    [Continue]
                                                   └──────┐
                                                          │
                                                          ▼
                                                   (periodic reminder)

Alert States

State	Trigger	Meaning
Raise	Value ≥ `maxThreshold` sustained for ≥ `minTimeout`	Alert condition begins — resource pressure detected
Continue	Value remains ≥ `minThreshold` while in alert condition, every `minTimeout` interval	Periodic reminder that alert condition persists
Fall	Value < `minThreshold` sustained for ≥ `minTimeout`	Alert condition ends — resource pressure resolved

Detailed Behavior

Normal → Raise: When the monitored value first crosses maxThreshold, a timer starts. If the value remains at or above maxThreshold for the full minTimeout duration, a Raise alert is sent and the processor enters the alert condition. If the value drops below maxThreshold before the timeout expires, the timer resets.
Alert Condition → Continue: While in the alert condition, if the value remains at or above minThreshold, a Continue alert is sent every minTimeout interval. This provides ongoing visibility that the resource is still under pressure, even if it has dropped below maxThreshold.
Alert Condition → Fall: When the value drops below minThreshold, a timer starts. If it remains below minThreshold for the full minTimeout duration, a Fall alert is sent and the processor returns to normal state. If the value rises back to or above minThreshold before the timeout expires, the timer resets and the alert condition continues.

Example Sequence

Given a rule with minThreshold: 80, maxThreshold: 90, minTimeout: 10s:

Time	Value	Event
T+0s	50%	Normal — no action
T+5s	92%	Exceeds maxThreshold — timer starts
T+15s	91%	Timer expired (10s elapsed) — Raise alert sent
T+25s	85%	Still ≥ minThreshold, timer expired — Continue alert sent
T+35s	82%	Still ≥ minThreshold, timer expired — Continue alert sent
T+40s	75%	Below minThreshold — recovery timer starts
T+50s	72%	Recovery timer expired (10s elapsed) — Fall alert sent
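The sequence above can be reproduced with a small state machine. This is a sketch of the behavior described on this page, not the actual AlertProcessor: the class and field names are hypothetical, sample times are supplied by the caller in seconds, and restarting the Continue interval after an aborted recovery is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HysteresisProcessor:
    """Illustrative two-state hysteresis with raise/continue/fall outputs."""

    min_threshold: float
    max_threshold: float
    min_timeout: float                 # seconds
    alert_condition: bool = False
    recovering: bool = False           # value is currently below min_threshold
    since: Optional[float] = None      # start time of the running timer

    def update(self, now: float, value: float) -> Optional[str]:
        """Feed one sample; return 'raise', 'continue', 'fall', or None."""
        if not self.alert_condition:
            if value < self.max_threshold:
                self.since = None              # dropped back: reset the timer
            elif self.since is None:
                self.since = now               # first sample at or above max
            elif now - self.since >= self.min_timeout:
                self.alert_condition, self.since = True, now
                return "raise"
            return None

        if value >= self.min_threshold:
            if self.recovering:                # recovery aborted: restart timer
                self.recovering, self.since = False, now
            elif now - self.since >= self.min_timeout:
                self.since = now
                return "continue"
            return None

        if not self.recovering:                # first sample below min
            self.recovering, self.since = True, now
        elif now - self.since >= self.min_timeout:
            self.alert_condition = self.recovering = False
            self.since = None
            return "fall"
        return None


# Replay the (time, value) pairs from the table above.
samples = [(0, 50), (5, 92), (15, 91), (25, 85), (35, 82), (40, 75), (50, 72)]
processor = HysteresisProcessor(min_threshold=80, max_threshold=90, min_timeout=10)
events = [processor.update(t, v) for t, v in samples]
```

With these samples, `events` contains one `raise`, two `continue`, and one `fall` entry, in the same order as the table.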

Journal-Based Alerts

In addition to threshold-based resource alerts, AosCore monitors the systemd journal for error-level entries and forwards them as alerts to the cloud. This provides visibility into component crashes, service failures, and system-level errors without requiring explicit instrumentation.

How It Works

The JournalAlerts module (located in sm/alerts/) runs a dedicated monitoring thread that:

Opens the systemd journal and seeks to the last stored cursor position
Adds match filters for entries whose priority value is at or below the configured level (that is, equally or more severe)
Continuously reads new journal entries as they appear
Categorizes each entry and sends the appropriate alert type

Entry Categorization

When a new journal entry is read, it is categorized based on the systemd unit that produced it:

Condition	Alert Type	Identification
Unit matches `aos-service@*.service`	InstanceAlert	Instance ID extracted from unit name, resolved to full identity
Unit matches `aos-cm.service`, `aos-sm.service`, or `aos-iam.service`	CoreAlert	Core component identified from service name
Any other entry	SystemAlert	Generic system-level alert

For service instances running under cgroup v2, the systemd unit may not be present in the journal entry. In this case, the instance ID is extracted from the _SYSTEMD_CGROUP field instead.
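The categorization rules above can be sketched as a single function. This is an illustration only: the entry is modelled as a plain dictionary keyed by the journal field names `_SYSTEMD_UNIT` and `_SYSTEMD_CGROUP`, and the assumption that the last component of the cgroup path carries the unit name is made for the example.

```python
import re

INSTANCE_UNIT_RE = re.compile(r"^aos-service@(?P<instance_id>.+)\.service$")
CORE_UNITS = {
    "aos-cm.service": "CM",
    "aos-sm.service": "SM",
    "aos-iam.service": "IAM",
}


def categorize(entry: dict) -> tuple:
    """Return (alert type, instance ID or core component or None)."""
    unit = entry.get("_SYSTEMD_UNIT")
    if unit is None:
        # Fallback: assume the last cgroup path component is the unit name.
        unit = entry.get("_SYSTEMD_CGROUP", "").rsplit("/", 1)[-1]
    match = INSTANCE_UNIT_RE.match(unit)
    if match:
        return ("InstanceAlert", match.group("instance_id"))
    if unit in CORE_UNITS:
        return ("CoreAlert", CORE_UNITS[unit])
    return ("SystemAlert", None)
```

In the real module, the extracted instance ID would then be resolved to the full instance identity through the Instance Info Provider, which is omitted here.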

Priority Filtering

The journal alerts module uses two priority levels from the configuration:

Parameter	Description
`systemAlertPriority`	Maximum journal priority level for system-wide monitoring (e.g., 3 captures error and more severe entries)
`serviceAlertPriority`	Maximum priority level for service instance entries from `init.scope`

Journal priority levels follow the standard syslog scale (0 = emergency, 7 = debug). Setting systemAlertPriority to 3 means only entries with priority 0 (emergency), 1 (alert), 2 (critical), or 3 (error) are captured.
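Because lower numbers mean higher severity, the filter is a simple "less than or equal" comparison. The sketch below shows which syslog levels pass for a given setting; the function name is hypothetical.

```python
# Standard syslog severity names, indexed by priority value 0..7.
SYSLOG_LEVELS = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "info", "debug",
]


def is_captured(priority: int, max_priority: int) -> bool:
    """True if an entry with this priority passes the configured maximum."""
    return 0 <= priority <= max_priority


captured = [name for level, name in enumerate(SYSLOG_LEVELS) if is_captured(level, 3)]
```

Here `captured` holds the four levels from emergency through error, so warnings and less severe entries are ignored.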

Message Filtering

The configuration supports regex-based message filters that suppress specific alert messages. If a journal entry's message matches any configured filter pattern, the entry is silently discarded. This allows operators to suppress known-noisy entries that would otherwise generate excessive alerts.

{
  "journalAlerts": {
    "systemAlertPriority": 3,
    "serviceAlertPriority": 4,
    "filter": [
      "^audit:.*",
      "^systemd\\[1\\]: .* failed\\.$"
    ]
  }
}
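To check what a filter list would suppress, the patterns can be tried against sample messages. The sketch below loads the configuration shown above and applies each pattern with Python's `re` module. The real module may use a different regex dialect or matching mode, so treat the result as an approximation; the sample messages are made up.

```python
import json
import re

# Raw string so the JSON escapes (\\[ and \\.) reach the JSON parser intact.
CONFIG_TEXT = r'''
{
  "journalAlerts": {
    "systemAlertPriority": 3,
    "serviceAlertPriority": 4,
    "filter": ["^audit:.*", "^systemd\\[1\\]: .* failed\\.$"]
  }
}
'''

config = json.loads(CONFIG_TEXT)["journalAlerts"]
patterns = [re.compile(pattern) for pattern in config["filter"]]


def is_suppressed(message: str) -> bool:
    """True if any configured filter pattern matches the message."""
    return any(pattern.search(message) for pattern in patterns)
```

Note that the doubled backslashes exist only at the JSON level; after parsing, the second pattern is `^systemd\[1\]: .* failed\.$`.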

Cursor Persistence

The journal alerts module maintains a cursor that tracks the last-read position in the journal. This cursor is:

Persisted to storage periodically (every 10 seconds by default)
Restored on startup to avoid re-processing old entries
Reset on journal read errors to recover from corruption

This lets the module resume from where it left off after a restart, avoiding duplicate alerts for entries it has already processed and picking up entries that arrived during downtime.
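The persistence behavior can be sketched with a small helper that saves the cursor at most once per interval. This is an illustration only: the class, the dictionary standing in for storage, and the `journalCursor` key are all hypothetical, and time is passed in by the caller so the logic is easy to test.

```python
class CursorKeeper:
    """Tracks the last-read journal cursor and persists it periodically."""

    def __init__(self, store: dict, now: float, interval: float = 10.0):
        self._store = store                     # stand-in for persistent storage
        self._interval = interval               # seconds between saves
        self._last_save = now
        self._dirty = False
        self.cursor = store.get("journalCursor")  # restored on startup

    def advance(self, cursor: str) -> None:
        """Record the cursor of the entry that was just processed."""
        self.cursor = cursor
        self._dirty = True

    def tick(self, now: float) -> bool:
        """Persist the cursor if it changed and the interval has elapsed."""
        if not self._dirty or now - self._last_save < self._interval:
            return False
        self._store["journalCursor"] = self.cursor
        self._last_save = now
        self._dirty = False
        return True

    def reset(self) -> None:
        """Drop the stored cursor, as done after a journal read error."""
        self._store.pop("journalCursor", None)
        self.cursor = None
        self._dirty = False
```

Saving on an interval rather than per entry trades a little precision for far fewer storage writes.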

Error Recovery

If the journal monitoring thread encounters an error while reading entries:

The current cursor is cleared from storage
The journal connection is re-established
The journal is positioned at the tail (most recent entry)
Monitoring resumes from the current position

A backoff mechanism doubles the wait timeout on consecutive errors (up to 10× the normal timeout) to avoid tight error loops.
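The capped doubling can be illustrated with a few lines of arithmetic. This sketch assumes the wait is doubled starting with the first error; the function names and the 1 second base timeout are placeholders.

```python
def next_wait(current: float, normal: float, cap_factor: int = 10) -> float:
    """Double the wait, but never exceed cap_factor times the normal timeout."""
    return min(current * 2, normal * cap_factor)


def wait_sequence(normal: float, errors: int) -> list:
    """Waits applied after each of `errors` consecutive failures."""
    waits, wait = [], normal
    for _ in range(errors):
        wait = next_wait(wait, normal)
        waits.append(wait)
    return waits


waits = wait_sequence(normal=1.0, errors=5)
```

With a 1 second base timeout the waits grow as 2, 4, 8 and then stay at 10 seconds, which bounds the retry rate without delaying recovery indefinitely.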

Alert Delivery

All alerts follow the same delivery path regardless of type:

Generation — The alert is created by the appropriate subsystem (AlertProcessor for quota alerts, JournalAlerts for journal-based alerts, Downloader for download alerts)
Local send — The alert is passed to the alerts::SenderItf implementation within the SM
Forwarding to CM — The SM forwards alerts to the CM through the internal communication channel
Cloud transmission — The CM batches alerts and sends them to AosCloud over the WebSocket connection as part of the Alerts message

Alerts are sent as an array of AlertVariant items, allowing multiple alerts of different types to be batched in a single cloud message.
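A batch of mixed alert types might look like the following. This is a sketch for orientation only: the `tag` values and field names come from the tables on this page, but the `items` envelope key, the sample values, and the helper functions are assumptions, and timestamps are omitted for brevity.

```python
import json

# Two alerts of different types, distinguished by their tag.
alerts = [
    {
        "tag": "systemQuotaAlert",
        "nodeID": "node0",
        "parameter": "cpu",
        "value": 92,
        "state": "raise",
    },
    {
        "tag": "coreAlert",
        "nodeID": "node0",
        "coreComponent": "SM",
        "message": "example error",
    },
]


def encode_batch(items: list) -> str:
    """Serialize several alerts into one message body (hypothetical envelope)."""
    return json.dumps({"items": items})


def tags_in_batch(payload: str) -> list:
    """Read back the tag of every alert in an encoded batch."""
    return [item["tag"] for item in json.loads(payload)["items"]]


payload = encode_batch(alerts)
```

Because every item carries its own `tag`, a receiver can dispatch each alert to the right handler without any per-type message framing.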

Related Pages

Monitoring Pipeline: data collection, averaging, and how metrics reach the alert evaluation stage
Monitoring and Observability: overview of all monitoring subsystems
Logging Pipeline: log collection and retrieval (distinct from journal-based alerts)
Storage and Quota Configuration: quota settings that influence alert thresholds
Service Manager: SM component that hosts the alerting subsystems
