Version: v1.1

Alerts and Thresholds

Introduction

This page describes the AosCore alerting system in detail — the types of alerts generated, how threshold rules are configured, the hysteresis mechanism that prevents alert flapping, and how journal-based alerts capture error-level log entries from system components and service instances.

Alerts provide operators with real-time visibility into resource pressure, component failures, and download progress. They are generated locally on each Node and forwarded to AosCloud through the Communication Manager (CM) for centralized processing and notification.

Alert Types

AosCore defines seven alert types, each carrying a distinct tag that identifies its category. All alerts include a timestamp indicating when the condition was detected.

SystemQuotaAlert

Raised when a Node-level resource metric (CPU, RAM, disk partition, download, or upload) crosses a configured threshold.

| Field | Description |
| --- | --- |
| tag | systemQuotaAlert |
| timestamp | Time the alert condition was detected |
| nodeID | Identifier of the Node where the threshold was crossed |
| parameter | Resource name (e.g., cpu, ram, download, upload, or partition name) |
| value | Current resource usage value at the time of the alert |
| state | Alert state: raise, continue, or fall |

SystemQuotaAlerts are generated by the AlertProcessor when Node-level monitoring data exceeds configured thresholds. The thresholds are defined as percentages in the Node configuration and converted to absolute values at startup.

InstanceQuotaAlert

Raised when a specific service instance exceeds its allocated resource quota.

| Field | Description |
| --- | --- |
| tag | instanceQuotaAlert |
| timestamp | Time the alert condition was detected |
| serviceID | Service identifier |
| subjectID | Subject (tenant) identifier |
| instance | Instance number |
| parameter | Resource name (e.g., cpu, ram, download, upload, or partition name) |
| value | Current resource usage value |
| state | Alert state: raise, continue, or fall |

InstanceQuotaAlerts use the same threshold mechanism as SystemQuotaAlerts but are scoped to individual service instances. Thresholds are defined per-instance through the service configuration.

ResourceAllocateAlert

Generated when resource allocation for a service instance fails — for example, when the system cannot allocate the requested disk space or network bandwidth.

| Field | Description |
| --- | --- |
| tag | resourceAllocateAlert |
| timestamp | Time the allocation failure occurred |
| serviceID | Service identifier |
| subjectID | Subject (tenant) identifier |
| instance | Instance number |
| nodeID | Node where allocation was attempted |
| resource | Name of the resource that failed to allocate |
| message | Description of the allocation failure |

SystemAlert

Captures error-level messages from the systemd journal that do not originate from AosCore components or service instances. These represent OS-level or third-party service errors.

| Field | Description |
| --- | --- |
| tag | systemAlert |
| timestamp | Time the journal entry was recorded |
| nodeID | Node where the entry was logged |
| message | Content of the journal entry |

CoreAlert

Captures error-level journal entries from AosCore core components (CM, SM, IAM).

| Field | Description |
| --- | --- |
| tag | coreAlert |
| timestamp | Time the journal entry was recorded |
| nodeID | Node where the entry was logged |
| coreComponent | Component that generated the entry: CM, SM, IAM, or MP |
| message | Content of the journal entry |

The journal alerts module identifies core component entries by matching the systemd unit name against known service names (aos-cm.service, aos-sm.service, aos-iam.service).

InstanceAlert

Captures error-level journal entries from service instances.

| Field | Description |
| --- | --- |
| tag | updateItemInstanceAlert |
| timestamp | Time the journal entry was recorded |
| serviceID | Service identifier |
| subjectID | Subject (tenant) identifier |
| instance | Instance number |
| version | Service version |
| message | Content of the journal entry |

Instance alerts are identified by matching the systemd unit name pattern aos-service@<instanceID>.service. The instance ID is resolved to the full instance identity (service ID, subject ID, instance number) through the Instance Info Provider.

DownloadAlert

Tracks the progress and state of image downloads.

| Field | Description |
| --- | --- |
| tag | downloadProgressAlert |
| timestamp | Time of the state change |
| digest | Content digest of the image being downloaded |
| url | Download URL |
| downloadedBytes | Bytes downloaded so far |
| totalBytes | Total expected size |
| state | Download state: started, paused, interrupted, or finished |
| reason | Optional reason for state change (e.g., why download was interrupted) |
| error | Error code if the download failed |

Threshold Configuration

Resource alerts (SystemQuotaAlert and InstanceQuotaAlert) are controlled by threshold rules. Each rule defines three parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| minTimeout | Duration (ISO 8601) | How long a threshold condition must be sustained before an alert fires |
| minThreshold | Number | Lower boundary: staying below this for minTimeout clears the alert |
| maxThreshold | Number | Upper boundary: staying at or above this for minTimeout raises the alert |

Rule Formats

Thresholds are expressed in two formats depending on the resource type:

Percentage-based rules (AlertRulePercents) — Used for CPU, RAM, and disk partitions. Thresholds are specified as percentages (0–100) of the total resource capacity. At startup, the monitoring system converts these to absolute values based on the Node's actual resource capacity.

{
  "cpu": {
    "minTimeout": "PT10S",
    "minThreshold": 80,
    "maxThreshold": 90
  }
}

In this example, a CPU alert is raised when usage stays at or above 90% for at least 10 seconds, and clears when usage stays below 80% for at least 10 seconds.
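
The percentage-to-absolute conversion described above can be sketched as follows. This is a minimal illustration, not the AosCore implementation: the rule type names come from this page, but the field names, the `to_points` helper, and the 10,000 DMIPS capacity are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRulePercents:
    min_timeout_s: float   # minTimeout, already parsed into seconds
    min_threshold: float   # percent of total capacity, 0-100
    max_threshold: float   # percent of total capacity, 0-100


@dataclass(frozen=True)
class AlertRulePoints:
    min_timeout_s: float
    min_threshold: float   # absolute units of the resource
    max_threshold: float   # absolute units of the resource


def to_points(rule: AlertRulePercents, total_capacity: float) -> AlertRulePoints:
    """Scale percentage thresholds by the total capacity of the resource."""
    if not 0 <= rule.min_threshold <= rule.max_threshold <= 100:
        raise ValueError("expected 0 <= minThreshold <= maxThreshold <= 100")
    return AlertRulePoints(
        min_timeout_s=rule.min_timeout_s,
        min_threshold=total_capacity * rule.min_threshold / 100.0,
        max_threshold=total_capacity * rule.max_threshold / 100.0,
    )


# Hypothetical Node with 10,000 DMIPS of CPU capacity and the 80/90 rule above.
cpu_rule = to_points(AlertRulePercents(10.0, 80, 90), total_capacity=10_000)
print(cpu_rule.min_threshold, cpu_rule.max_threshold)
```

Converting once up front keeps the per-sample comparison a plain numeric check against absolute values.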

Absolute-value rules (AlertRulePoints) — Used for download and upload traffic. Thresholds are specified in bytes (or bytes per second, depending on the metric).

{
  "download": {
    "minTimeout": "PT30S",
    "minThreshold": 1000,
    "maxThreshold": 2000
  }
}

Node-Level Configuration

Node-level alert rules are defined in the Node configuration (delivered as part of the Unit configuration from the cloud). The alertRules object supports:

| Resource | Rule Type | Description |
| --- | --- | --- |
| cpu | Percentage | CPU usage threshold as percentage of total DMIPS |
| ram | Percentage | RAM usage threshold as percentage of total memory |
| partitions | Percentage (per partition) | Disk usage threshold per named partition |
| download | Absolute | Download traffic threshold in bytes |
| upload | Absolute | Upload traffic threshold in bytes |

Example Node-level alert rules:

{
  "alertRules": {
    "cpu": {
      "minTimeout": "PT10S",
      "minThreshold": 80,
      "maxThreshold": 90
    },
    "ram": {
      "minTimeout": "PT10S",
      "minThreshold": 70,
      "maxThreshold": 85
    },
    "partitions": [
      {
        "name": "states",
        "minTimeout": "PT30S",
        "minThreshold": 70,
        "maxThreshold": 90
      }
    ],
    "download": {
      "minTimeout": "PT30S",
      "minThreshold": 100000,
      "maxThreshold": 200000
    },
    "upload": {
      "minTimeout": "PT30S",
      "minThreshold": 50000,
      "maxThreshold": 100000
    }
  }
}
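
Before deploying a configuration like the one above, it can help to sanity-check it offline. The following Python sketch parses the time-only ISO 8601 durations used on this page and checks that each rule has both thresholds in the right order. The helper names (`parse_duration`, `check_rule`, `check_alert_rules`) are hypothetical, and the checks are assumptions about what a valid rule looks like rather than the validation AosCore itself performs.

```python
import json
import re

# Handles time-only durations such as PT10S, PT30S or PT1M30S.
_DURATION_RE = re.compile(
    r"^PT(?:(?P<h>\d+(?:\.\d+)?)H)?(?:(?P<m>\d+(?:\.\d+)?)M)?(?:(?P<s>\d+(?:\.\d+)?)S)?$"
)


def parse_duration(text: str) -> float:
    """Return the duration in seconds, or raise ValueError if unsupported."""
    match = _DURATION_RE.match(text)
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {k: float(v) if v else 0.0 for k, v in match.groupdict().items()}
    return parts["h"] * 3600 + parts["m"] * 60 + parts["s"]


def check_rule(name: str, rule: dict) -> list:
    """Collect human-readable problems found in a single threshold rule."""
    problems = []
    try:
        parse_duration(rule["minTimeout"])
    except (KeyError, ValueError) as err:
        problems.append(f"{name}: bad minTimeout ({err})")
    low, high = rule.get("minThreshold"), rule.get("maxThreshold")
    if low is None or high is None:
        problems.append(f"{name}: both thresholds are required")
    elif low > high:
        problems.append(f"{name}: minThreshold exceeds maxThreshold")
    return problems


def check_alert_rules(alert_rules: dict) -> list:
    """Check every rule in an alertRules object shaped like the example."""
    problems = []
    for name in ("cpu", "ram", "download", "upload"):
        if name in alert_rules:
            problems += check_rule(name, alert_rules[name])
    for partition in alert_rules.get("partitions", []):
        problems += check_rule(f"partition {partition.get('name')}", partition)
    return problems


config = json.loads("""
{"alertRules": {
  "cpu": {"minTimeout": "PT10S", "minThreshold": 80, "maxThreshold": 90},
  "partitions": [
    {"name": "states", "minTimeout": "PT30S", "minThreshold": 95, "maxThreshold": 90}
  ]
}}
""")
for problem in check_alert_rules(config["alertRules"]):
    print(problem)
```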

Instance-Level Configuration

Instance-level alert rules follow the same structure and are defined per service instance through the service configuration. They control when InstanceQuotaAlerts are generated for individual services exceeding their allocated quotas.

Dynamic Reconfiguration

Alert thresholds can be updated dynamically through the Unit configuration without restarting the monitoring pipeline. When a new configuration is received from the cloud, the alert processors are reconfigured with the updated threshold values.

Hysteresis Behavior

The alert system implements hysteresis through a two-state machine with three output states. This prevents rapid alert toggling (flapping) when a metric oscillates around a threshold boundary.

State Machine

Each AlertProcessor maintains an internal boolean state (alertCondition) that tracks whether the resource is currently in an alert condition:

              value ≥ maxThreshold
                for ≥ minTimeout
┌──────────┐ ─────────────────────► ┌──────────────┐
│          │        [Raise]         │              │
│  Normal  │                        │    Alert     │ ◄────┐
│          │ ◄───────────────────── │  Condition   │      │ value ≥ minThreshold
└──────────┘  value < minThreshold  └──────────────┘      │ for ≥ minTimeout
                for ≥ minTimeout            │             │ [Continue]
                     [Fall]                 └─────────────┘ (periodic reminder)

Alert States

| State | Trigger | Meaning |
| --- | --- | --- |
| Raise | Value ≥ maxThreshold sustained for ≥ minTimeout | Alert condition begins: resource pressure detected |
| Continue | Value remains ≥ minThreshold while in alert condition, every minTimeout interval | Periodic reminder that alert condition persists |
| Fall | Value < minThreshold sustained for ≥ minTimeout | Alert condition ends: resource pressure resolved |

Detailed Behavior

  1. Normal → Raise: When the monitored value first crosses maxThreshold, a timer starts. If the value remains at or above maxThreshold for the full minTimeout duration, a Raise alert is sent and the processor enters the alert condition. If the value drops below maxThreshold before the timeout expires, the timer resets.

  2. Alert Condition → Continue: While in the alert condition, if the value remains at or above minThreshold, a Continue alert is sent every minTimeout interval. This provides ongoing visibility that the resource is still under pressure, even if it has dropped below maxThreshold.

  3. Alert Condition → Fall: When the value drops below minThreshold, a timer starts. If it remains below minThreshold for the full minTimeout duration, a Fall alert is sent and the processor returns to normal state. If the value rises back to or above minThreshold before the timeout expires, the timer resets and the alert condition continues.

Example Sequence

Given a rule with minThreshold: 80, maxThreshold: 90, minTimeout: 10s:

| Time | Value | Event |
| --- | --- | --- |
| T+0s | 50% | Normal; no action |
| T+5s | 92% | Exceeds maxThreshold; timer starts |
| T+15s | 91% | Timer expired (10s elapsed); Raise alert sent |
| T+25s | 85% | Still ≥ minThreshold, timer expired; Continue alert sent |
| T+35s | 82% | Still ≥ minThreshold, timer expired; Continue alert sent |
| T+40s | 75% | Below minThreshold; recovery timer starts |
| T+50s | 72% | Recovery timer expired (10s elapsed); Fall alert sent |
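
The hysteresis behavior and the example sequence above can be replayed with a short Python sketch. The class and method names (`HysteresisProcessor`, `update`) are hypothetical and do not mirror the AosCore AlertProcessor API. Time is passed in explicitly as seconds, and samples are assumed to arrive exactly at the times listed in the table.

```python
from typing import Optional


class HysteresisProcessor:
    """Two-state machine (normal / alert condition) with raise, continue, fall outputs."""

    def __init__(self, min_threshold: float, max_threshold: float, min_timeout: float):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.min_timeout = min_timeout
        self.alert_condition = False              # internal boolean state
        self._raise_since: Optional[float] = None  # when value first reached maxThreshold
        self._fall_since: Optional[float] = None   # when value first dropped below minThreshold
        self._last_sent = 0.0                      # time of the last raise/continue

    def update(self, value: float, now: float) -> Optional[str]:
        """Feed one sample; return 'raise', 'continue', 'fall', or None."""
        if not self.alert_condition:
            if value < self.max_threshold:
                self._raise_since = None           # dropped back: reset the timer
                return None
            if self._raise_since is None:
                self._raise_since = now
            if now - self._raise_since >= self.min_timeout:
                self.alert_condition = True
                self._raise_since = None
                self._fall_since = None
                self._last_sent = now
                return "raise"
            return None

        if value < self.min_threshold:
            if self._fall_since is None:
                self._fall_since = now
            if now - self._fall_since >= self.min_timeout:
                self.alert_condition = False
                self._fall_since = None
                return "fall"
            return None

        self._fall_since = None                    # back at or above minThreshold
        if now - self._last_sent >= self.min_timeout:
            self._last_sent = now
            return "continue"
        return None


# Replay the example sequence: (time in seconds, usage in percent).
samples = [(0, 50), (5, 92), (15, 91), (25, 85), (35, 82), (40, 75), (50, 72)]
processor = HysteresisProcessor(min_threshold=80, max_threshold=90, min_timeout=10)
events = [processor.update(value, now) for now, value in samples]
for (now, value), event in zip(samples, events):
    print(f"T+{now}s {value}% -> {event}")
```

Between 80% and 90% the processor keeps whatever state it is already in, which is what stops a metric hovering around a single boundary from toggling the alert.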

Journal-Based Alerts

In addition to threshold-based resource alerts, AosCore monitors the systemd journal for error-level entries and forwards them as alerts to the cloud. This provides visibility into component crashes, service failures, and system-level errors without requiring explicit instrumentation.

How It Works

The JournalAlerts module (located in sm/alerts/) runs a dedicated monitoring thread that:

  1. Opens the systemd journal and seeks to the last stored cursor position
  2. Adds match filters so that only entries at the configured priority level or more severe (numerically at or below it) are read
  3. Continuously reads new journal entries as they appear
  4. Categorizes each entry and sends the appropriate alert type

Entry Categorization

When a new journal entry is read, it is categorized based on the systemd unit that produced it:

| Condition | Alert Type | Identification |
| --- | --- | --- |
| Unit matches aos-service@*.service | InstanceAlert | Instance ID extracted from unit name, resolved to full identity |
| Unit matches aos-cm.service, aos-sm.service, or aos-iam.service | CoreAlert | Core component identified from service name |
| Any other entry | SystemAlert | Generic system-level alert |

For service instances running under cgroup v2, the systemd unit may not be present in the journal entry. In this case, the instance ID is extracted from the _SYSTEMD_CGROUP field instead.
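
The categorization rules above can be sketched in Python as a lookup over journal fields. `_SYSTEMD_UNIT` and `_SYSTEMD_CGROUP` are standard journal field names. The function names and the exact cgroup path layout are assumptions for illustration, and resolving the instance ID to a full identity through the Instance Info Provider is left out.

```python
import re
from typing import Optional, Tuple

# Unit names listed on this page, mapped to the core component they identify.
CORE_UNITS = {"aos-cm.service": "CM", "aos-sm.service": "SM", "aos-iam.service": "IAM"}
INSTANCE_UNIT_RE = re.compile(r"aos-service@(?P<id>[^/]+)\.service")


def instance_id(entry: dict) -> Optional[str]:
    """Extract the instance ID from the unit name, falling back to the cgroup path."""
    match = INSTANCE_UNIT_RE.fullmatch(entry.get("_SYSTEMD_UNIT", ""))
    if match is None:
        # Assumption: the cgroup path contains the instance unit name as a component.
        match = INSTANCE_UNIT_RE.search(entry.get("_SYSTEMD_CGROUP", ""))
    return match.group("id") if match else None


def categorize(entry: dict) -> Tuple[str, Optional[str]]:
    """Return (alert type, instance ID or core component name)."""
    inst = instance_id(entry)
    if inst is not None:
        return "InstanceAlert", inst
    component = CORE_UNITS.get(entry.get("_SYSTEMD_UNIT", ""))
    if component is not None:
        return "CoreAlert", component
    return "SystemAlert", None


print(categorize({"_SYSTEMD_UNIT": "aos-service@1234.service"}))
print(categorize({"_SYSTEMD_UNIT": "aos-sm.service"}))
print(categorize({"_SYSTEMD_UNIT": "sshd.service"}))
```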

Priority Filtering

The journal alerts module uses two priority levels from the configuration:

| Parameter | Description |
| --- | --- |
| systemAlertPriority | Maximum journal priority level for system-wide monitoring (e.g., 3 = error and more severe) |
| serviceAlertPriority | Maximum priority level for service instance entries from init.scope |
Journal priority levels follow the standard syslog scale (0 = emergency, 7 = debug). Setting systemAlertPriority to 3 means only entries with priority 0 (emergency), 1 (alert), 2 (critical), or 3 (error) are captured.

Message Filtering

The configuration supports regex-based message filters that suppress specific alert messages. If a journal entry's message matches any configured filter pattern, the entry is silently discarded. This allows operators to suppress known-noisy entries that would otherwise generate excessive alerts.

{
  "journalAlerts": {
    "systemAlertPriority": 3,
    "serviceAlertPriority": 4,
    "filter": [
      "^audit:.*",
      "^systemd\\[1\\]: .* failed\\.$"
    ]
  }
}
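
As a rough illustration of how priority and message filtering combine, the Python sketch below applies the two filter patterns from the configuration above. The `should_forward` helper is hypothetical, and using `re.search` (so each pattern supplies its own anchors) is an assumption about the matching semantics.

```python
import re
from typing import Iterable, List


def compile_filters(patterns: Iterable[str]) -> List["re.Pattern"]:
    """Compile the configured filter strings once, up front."""
    return [re.compile(pattern) for pattern in patterns]


def should_forward(priority: int, message: str, max_priority: int, filters) -> bool:
    """True if the entry is severe enough and no filter suppresses its message."""
    if priority > max_priority:   # syslog scale: a larger number is less severe
        return False
    return not any(f.search(message) for f in filters)


# The same patterns as in the JSON above, after JSON unescaping.
filters = compile_filters([r"^audit:.*", r"^systemd\[1\]: .* failed\.$"])

print(should_forward(3, "kernel: I/O error on sda", 3, filters))
print(should_forward(3, "audit: type=1400 denied", 3, filters))
print(should_forward(4, "low disk space warning", 3, filters))
```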

Cursor Persistence

The journal alerts module maintains a cursor that tracks the last-read position in the journal. This cursor is:

  • Persisted to storage periodically (every 10 seconds by default)
  • Restored on startup to avoid re-processing old entries
  • Reset on journal read errors to recover from corruption

This ensures that after a restart, the module resumes from where it left off without generating duplicate alerts or missing entries that arrived during downtime.

Error Recovery

If the journal monitoring thread encounters an error while reading entries:

  1. The current cursor is cleared from storage
  2. The journal connection is re-established
  3. The read position is moved to the tail of the journal (most recent entry)
  4. Monitoring resumes from the current position

A backoff mechanism doubles the wait timeout on consecutive errors (up to 10× the normal timeout) to avoid tight error loops.
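
The backoff rule can be expressed in a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and whether the very first error already waits twice the normal timeout is a guess, since the text only says the wait doubles on consecutive errors and is capped at 10× the normal timeout.

```python
from typing import List


def next_wait(current: float, normal: float, cap_factor: int = 10) -> float:
    """Double the current wait, but never exceed cap_factor times the normal timeout."""
    return min(current * 2, normal * cap_factor)


def wait_sequence(normal: float, consecutive_errors: int) -> List[float]:
    """Waits applied after each of a run of consecutive read errors."""
    waits, current = [], normal
    for _ in range(consecutive_errors):
        current = next_wait(current, normal)
        waits.append(current)
    return waits


print(wait_sequence(1.0, 5))
```

After a successful read the wait would return to the normal timeout, so only an unbroken run of failures reaches the cap.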

Alert Delivery

All alerts follow the same delivery path regardless of type:

  1. Generation — The alert is created by the appropriate subsystem (AlertProcessor for quota alerts, JournalAlerts for journal-based alerts, Downloader for download alerts)
  2. Local send — The alert is passed to the alerts::SenderItf implementation within the SM
  3. Forwarding to CM — The SM forwards alerts to the CM through the internal communication channel
  4. Cloud transmission — The CM batches alerts and sends them to AosCloud over the WebSocket connection as part of the Alerts message

Alerts are sent as an array of AlertVariant items, allowing multiple alerts of different types to be batched in a single cloud message.
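
To make the batching idea concrete, the Python sketch below builds two alerts of different types and serializes them as one array. The per-alert field names follow the tables on this page. The envelope keys (`messageType`, `items`) and the timestamp format are assumptions made for illustration only and are not taken from the AosCloud protocol.

```python
import json
from datetime import datetime, timezone


def system_quota_alert(node_id: str, parameter: str, value: int, state: str, when: datetime) -> dict:
    """Build a SystemQuotaAlert using the field names from the table on this page."""
    return {
        "tag": "systemQuotaAlert",
        "timestamp": when.isoformat(),
        "nodeID": node_id,
        "parameter": parameter,
        "value": value,
        "state": state,
    }


def core_alert(node_id: str, component: str, message: str, when: datetime) -> dict:
    """Build a CoreAlert using the field names from the table on this page."""
    return {
        "tag": "coreAlert",
        "timestamp": when.isoformat(),
        "nodeID": node_id,
        "coreComponent": component,
        "message": message,
    }


def alerts_message(items: list) -> str:
    """Serialize a mixed batch of alerts; the envelope keys are hypothetical."""
    return json.dumps({"messageType": "alerts", "items": items})


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
batch = [
    system_quota_alert("node-1", "cpu", 9200, "raise", now),
    core_alert("node-1", "SM", "failed to start instance", now),
]
payload = alerts_message(batch)
print(payload)
```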