Service Failure Handling
Introduction
This page documents how AosCore detects and handles service instance failures at the Service Manager (SM) level. When a service instance crashes, exceeds resource limits, or becomes unreachable due to connectivity loss, SM must detect the failure, report it through the system, and apply the configured recovery policy. Understanding these mechanisms is essential for OEMs configuring service restart behavior and diagnosing production failures.
Service failure handling involves three phases: detection (identifying that a failure occurred), reporting (propagating the failure status to CM and ultimately to the cloud), and recovery (applying restart policies to restore the service).
Failure Detection Mechanisms
The SM Launcher's container runtime detects service failures through its Runner module, which continuously monitors the systemd units that host service instances.
Process Exit Detection via Systemd Unit Monitoring
Each container service instance runs as a systemd transient unit (aos-service@<instance-id>.service). The Runner
module monitors these units by polling their state every second via the systemd D-Bus connection:
| Systemd Unit State | Mapped Instance State | Meaning |
|---|---|---|
| active | Active | Process is running normally |
| inactive | Inactive | Process has stopped (graceful shutdown) |
| failed | Failed | Process exited with an error or was killed |
| activating | (monitored during startup) | Process is starting up |
| deactivating | Failed | Process is shutting down unexpectedly |
| Any other state | Failed | Unexpected state treated as failure |
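The mapping in the table can be sketched as a lookup with a failure fallback. This is a minimal illustration, not the Runner's actual code: the InstanceState enum and map_unit_state function are hypothetical names, and representing the startup phase as a distinct ACTIVATING value is an assumption.

```python
from enum import Enum


class InstanceState(Enum):
    """Hypothetical instance states, named after the table above."""
    ACTIVE = "active"
    INACTIVE = "inactive"
    FAILED = "failed"
    ACTIVATING = "activating"  # assumption: startup is tracked as its own state


# Known systemd unit states and the instance state each one maps to.
_UNIT_TO_INSTANCE = {
    "active": InstanceState.ACTIVE,
    "inactive": InstanceState.INACTIVE,
    "failed": InstanceState.FAILED,
    "activating": InstanceState.ACTIVATING,
    "deactivating": InstanceState.FAILED,
}


def map_unit_state(unit_state: str) -> InstanceState:
    """Map a systemd unit state string to an instance state.

    Any state that is not listed is treated as a failure.
    """
    return _UNIT_TO_INSTANCE.get(unit_state, InstanceState.FAILED)
```

The fallback in the last line is what implements the "Any other state" row: an unrecognized unit state is never silently ignored.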
The Runner maintains two tracking maps:
- Starting units: units that are in the process of being launched. If a starting unit enters the failed state, the Runner immediately notifies the waiting start operation.
- Running units: units that have successfully started. The monitoring thread detects state changes and reports them to the container runtime via the RunStatusReceiverItf callback.
When a running unit's state changes (e.g., from active to failed), the Runner collects the exit code from systemd
and reports a RunStatus containing the instance ID, new state, and error information.
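One polling cycle for running units might look like the sketch below: compare the previous snapshot of unit states with the current one and emit a status entry for every change. The RunStatus dataclass and diff_unit_states function are illustrative assumptions, and the exit codes are passed in as a plain dictionary rather than queried from systemd.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunStatus:
    """Hypothetical stand-in for the RunStatus reported by the Runner."""
    instance_id: str
    state: str
    exit_code: Optional[int] = None


def diff_unit_states(
    previous: Dict[str, str],
    current: Dict[str, str],
    exit_codes: Dict[str, int],
) -> List[RunStatus]:
    """Return one RunStatus for every unit whose state changed since the last poll."""
    changed = []
    for instance_id, state in current.items():
        if previous.get(instance_id) != state:
            # Only failed units carry an exit code in this sketch.
            code = exit_codes.get(instance_id) if state == "failed" else None
            changed.append(RunStatus(instance_id, state, code))
    return changed
```

Units whose state is unchanged produce no entry, so a healthy system generates no status traffic between polls.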
Offline TTL Expiration
Each service instance can have an offline TTL (Time-To-Live) configured in its OCI image configuration. This mechanism stops instances that have been running without cloud connectivity for too long:
- When the cloud connection drops, the Launcher records the disconnect timestamp
- A timer is started based on the shortest TTL among all Active/Activating instances
- When the timer fires, the Launcher checks each instance:
  - If time_since_disconnect > instance.offlineTTL, the instance is stopped
  - The instance transitions to the Failed state
- If connectivity is restored before any TTL expires, the disconnect timestamp is cleared and the timer is cancelled
Instances without a configured offline TTL (duration of zero) are not affected by connectivity loss and continue running indefinitely.
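The TTL logic above can be sketched with two small functions. This is an illustration under stated assumptions: next_ttl_timeout and expired_instances are hypothetical names, and the ttls dictionary is assumed to contain only Active/Activating instances keyed by instance ID.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def next_ttl_timeout(ttls: Dict[str, timedelta]) -> Optional[timedelta]:
    """Return the shortest configured TTL, or None if no instance has one.

    A zero duration means "no offline TTL" and is ignored.
    """
    configured = [ttl for ttl in ttls.values() if ttl > timedelta(0)]
    return min(configured) if configured else None


def expired_instances(
    ttls: Dict[str, timedelta],
    disconnected_at: datetime,
    now: datetime,
) -> List[str]:
    """Return IDs of instances whose offline TTL has been exceeded."""
    offline_for = now - disconnected_at
    return sorted(
        instance_id
        for instance_id, ttl in ttls.items()
        if ttl > timedelta(0) and offline_for > ttl
    )
```

For example, with TTLs of 10 minutes, zero, and 1 hour, the timer would be armed for 10 minutes, and 11 minutes after the disconnect only the first instance would be stopped.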
Resource Limit Violations
Container instances run with Linux cgroup resource limits enforced by the container runtime:
- CPU — quota/period-based limiting
- Memory (RAM) — hard memory limit in bytes
- PID count — maximum number of processes
When a container exceeds its memory limit, the Linux kernel's OOM killer terminates the process. This causes the systemd
unit to enter the failed state, which the Runner detects during its next polling cycle. The exit code from the OOM
kill is captured and included in the error report.
CPU quota violations do not terminate the process — they throttle it. PID limit violations prevent new process creation within the container but do not kill existing processes.
Resource usage is also monitored by the SM monitoring subsystem, which generates InstanceQuotaAlert notifications when usage exceeds configured thresholds. These alerts are informational and do not directly trigger instance termination — they serve as early warnings before hard limits are hit.
The Failed Instance State
When a failure is detected, the instance transitions to the eFailed state. The InstanceStatus structure carries:
| Field | Description |
|---|---|
| mState | Set to eFailed |
| mError | An Error object containing the exit code (from systemd) and a descriptive message |
| Instance identity fields | mItemID, mSubjectID, mInstance (identify which instance failed) |
| mRuntimeID | Which runtime was hosting the instance |
| mNodeID | Which Node the instance was running on |
The error information is derived from the systemd unit's exit code. If the process exited with a non-zero code, that
code is preserved. If the failure was due to a systemd-level issue (unit could not start), a generic eFailed error is
reported with a descriptive message.
Restart Policies
The container runtime applies restart policies through systemd's built-in restart mechanism. Each service instance's
systemd unit is configured with Restart=always, meaning systemd automatically restarts the process when it exits. The
restart behavior is controlled by three parameters from the RunParameters configuration:
Configuration Parameters
| Parameter | Field | Default | Description |
|---|---|---|---|
| Start Interval | mStartInterval | 5 seconds | The time window within which StartBurst restarts are allowed. Maps to systemd's StartLimitIntervalSec. |
| Start Burst | mStartBurst | 3 | Maximum number of start attempts allowed within the Start Interval. Maps to systemd's StartLimitBurst. |
| Restart Interval | mRestartInterval | 1 second | Delay between a process exit and the next restart attempt. Maps to systemd's RestartSec. |
These parameters are set per-instance in the service's item configuration (delivered as part of the desired state from the cloud). If not specified, the defaults above are used.
Systemd Drop-In Configuration
When starting an instance, the Runner creates a systemd drop-in file at
/run/systemd/system/aos-service@<instance-id>.service.d/parameters.conf with the following content:
[Unit]
StartLimitIntervalSec=<startInterval in seconds>
StartLimitBurst=<startBurst>
[Service]
RestartSec=<restartInterval in seconds>
This configures systemd to:
- Restart the process after RestartSec seconds when it exits
- Allow up to StartLimitBurst restarts within StartLimitIntervalSec
- If the burst limit is exceeded, stop attempting restarts and leave the unit in the failed state
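Generating the drop-in from the run parameters can be sketched as follows. The RunParameters dataclass, render_drop_in, and drop_in_path are hypothetical helpers written for illustration; only the file path pattern, the three systemd keys, and the default values come from this page.

```python
from dataclasses import dataclass


@dataclass
class RunParameters:
    """Hypothetical container for the three restart parameters (defaults from the table)."""
    start_interval_sec: int = 5
    start_burst: int = 3
    restart_interval_sec: int = 1


def render_drop_in(params: RunParameters) -> str:
    """Render the parameters.conf content for one service instance."""
    return (
        "[Unit]\n"
        f"StartLimitIntervalSec={params.start_interval_sec}\n"
        f"StartLimitBurst={params.start_burst}\n"
        "[Service]\n"
        f"RestartSec={params.restart_interval_sec}\n"
    )


def drop_in_path(instance_id: str) -> str:
    """Build the drop-in file path for the given instance ID."""
    return (
        f"/run/systemd/system/aos-service@{instance_id}.service.d/parameters.conf"
    )
```

Calling render_drop_in(RunParameters()) yields the default configuration: a 5-second start window, a burst of 3, and a 1-second restart delay.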
Restart Behavior Example
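With default parameters (startInterval=5s, startBurst=3, restartInterval=1s), and assuming the instance had already been running for longer than the start interval before its first crash and then crashes immediately after every restart: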
- Instance crashes → systemd waits 1 second → restarts (attempt 1)
- Instance crashes again → waits 1 second → restarts (attempt 2)
- Instance crashes again → waits 1 second → restarts (attempt 3)
- Instance crashes again → a fourth start within 5 seconds would exceed the burst limit of 3 → systemd stops restarting and the unit stays in the failed state
At this point, the Runner's monitoring thread detects the failed state and reports it up through the system. The
instance remains in the Failed state until a new UpdateInstances command is received (typically triggered by a new
desired-state update from the cloud).
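The interaction between the three parameters can be explored with a small simulation. This is a simplified sliding-window model written for illustration, not systemd's actual accounting, which may differ in edge cases. It assumes the initial start lies outside the window and that the process crashes immediately after every restart; simulate_crash_loop is a hypothetical name.

```python
from collections import deque
from typing import List


def simulate_crash_loop(
    start_interval: int,
    start_burst: int,
    restart_interval: int,
    max_steps: int = 100,
) -> List[int]:
    """Return the times (seconds after the first crash) of permitted restarts.

    The loop ends when a restart is refused by the burst limit, or after
    max_steps restarts if the limit is never reached.
    """
    recent = deque()  # start times still inside the rate-limit window
    allowed = []
    now = 0  # time of the first crash
    for _ in range(max_steps):
        now += restart_interval  # wait RestartSec before the next attempt
        while recent and now - recent[0] >= start_interval:
            recent.popleft()  # forget starts that fell out of the window
        if len(recent) >= start_burst:
            break  # burst limit reached: the unit stays failed
        recent.append(now)
        allowed.append(now)  # restart permitted; the process crashes again at once
    return allowed
```

With the defaults, simulate_crash_loop(5, 3, 1) permits restarts at seconds 1, 2, and 3 and refuses the fourth. With a restart interval of 2 seconds, old starts leave the window before the burst is reached, so in this model the limit is never hit and the unit keeps restarting.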
Start Interval Multiplier
When starting an instance, the Runner applies a 1.2× multiplier to the configured start interval to determine the
timeout for the initial start operation. This provides a small buffer beyond the configured interval to account for
system load during startup. If the unit does not reach the active state within this timeout, the start is considered
failed.
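The timeout calculation is simple enough to show directly. The start_timeout function is a hypothetical helper; the 1.2× factor is written as 6/5 so that the arithmetic stays exact.

```python
from datetime import timedelta


def start_timeout(start_interval: timedelta) -> timedelta:
    """Return the start-operation timeout: 1.2 times the start interval."""
    return start_interval * 6 / 5  # 6/5 == 1.2, without float rounding
```

With the default start interval of 5 seconds, the start operation would time out after 6 seconds.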
Failure Reporting Flow
When a service instance fails, the information propagates through the system in a well-defined path:
1. Runner → Container Runtime
The Runner's monitoring thread detects the state change and calls UpdateRunStatus() on the container runtime with a
vector of RunStatus entries for all changed instances. Each entry contains the instance ID, new state (eFailed), and
the error (including exit code).
2. Container Runtime → Launcher
The container runtime's UpdateRunStatus() implementation updates the internal Instance object's run status and calls
OnInstancesStatusesReceived() on the Launcher (which implements InstanceStatusReceiverItf).
3. Launcher → SM Client (gRPC to CM)
The Launcher calls SendUpdateInstancesStatuses() on the SM client, which serializes the InstanceStatus array into
the gRPC UpdateInstancesStatus message and sends it to CM over the SM registration stream.
4. CM → Cloud
CM's SM Controller receives the instance status update and forwards it to the CM Launcher, which aggregates instance
statuses across all Nodes. The aggregated status is included in the unitStatus JSON message sent to AosCloud via the
WebSocket connection, containing per-instance error details.
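The first three hops of this path form a chain of callbacks, which can be sketched in miniature as below. The classes are stand-ins whose method names mirror UpdateRunStatus(), OnInstancesStatusesReceived(), and SendUpdateInstancesStatuses(). Everything else is an assumption: statuses are plain dictionaries, and the gRPC transport is replaced by an in-memory list.

```python
from typing import Dict, List


class SMClient:
    """Stand-in for the SM client: records what would be sent to CM."""

    def __init__(self) -> None:
        self.sent: List[List[Dict]] = []

    def send_update_instances_statuses(self, statuses: List[Dict]) -> None:
        self.sent.append(list(statuses))


class Launcher:
    """Stand-in for the Launcher: forwards statuses to the SM client."""

    def __init__(self, client: SMClient) -> None:
        self._client = client

    def on_instances_statuses_received(self, statuses: List[Dict]) -> None:
        self._client.send_update_instances_statuses(statuses)


class ContainerRuntime:
    """Stand-in for the container runtime: stores run status, then notifies."""

    def __init__(self, receiver: Launcher) -> None:
        self._receiver = receiver
        self.run_status: Dict[str, Dict] = {}

    def update_run_status(self, run_statuses: List[Dict]) -> None:
        for status in run_statuses:
            # Update the per-instance record before notifying the Launcher.
            self.run_status[status["instance_id"]] = status
        self._receiver.on_instances_statuses_received(run_statuses)
```

Each layer only knows the interface of the next one, so a failure detected by the Runner reaches the SM client without the Runner knowing anything about gRPC or CM.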
Proto Message Structure
The InstanceStatus message in the SM v5 protocol carries failure information:
message InstanceStatus {
  common.v2.InstanceIdent instance = 1;
  string version = 2;
  string runtime_id = 3;
  string manifest_digest = 4;
  repeated EnvVarStatus env_vars = 5;
  string state = 6; // "failed"
  common.v2.ErrorInfo error = 7; // failure details
}
The ErrorInfo contains:
- aos_code: internal error classification
- exit_code: the process exit code from systemd
- message: human-readable description of the failure
Instance Alerts
In addition to status reporting, the SM generates alerts for service instance events. These alerts are sent to CM as
InstanceAlert messages and forwarded to the cloud:
| Alert Type | Trigger | Content |
|---|---|---|
| Instance Alert | Service process logs an error-level message | The log message content, instance identity, and severity |
| Instance Quota Alert | Resource usage exceeds a configured threshold | The resource parameter name, current value, and alert state (raised/cleared) |
Instance alerts provide real-time visibility into service health without waiting for a full failure. They are generated by the SM's journal alert monitor, which watches the systemd journal for log entries from service units.
Recovery After Failure
Once an instance is in the Failed state and systemd has exhausted its restart attempts, recovery requires external intervention:
Automatic Recovery via Desired State
The most common recovery path is a new desired-state update from the cloud:
- The cloud receives the failure report via unitStatus
- An operator or automated system issues a new desired state (possibly identical to the current one)
- CM processes the new desired state and sends UpdateInstances to SM
- SM's Launcher stops the failed instance (resets the systemd unit's failed state) and starts it fresh
What Happens During Recovery
When the Launcher receives an UpdateInstances command that includes a previously-failed instance:
- The instance appears in the stop list: the Launcher calls StopInstance() on the runtime
- StopInstance() calls ResetFailedUnit() on systemd to clear the failed state
- The instance then appears in the start list: the Launcher creates a fresh instance with new run parameters
- The runtime sets up the environment and starts the systemd unit again
This ensures a clean restart — the previous failed state is fully cleared before the new attempt.
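How a previously failed instance ends up in both the stop list and the start list can be sketched as below. This is a hypothetical planning function for illustration only: this page does not say which component computes the lists, and leaving healthy instances untouched is an assumption.

```python
from typing import Dict, List, Tuple


def plan_update(
    current: Dict[str, str],
    desired: List[str],
) -> Tuple[List[str], List[str]]:
    """Compute (stop_list, start_list) from current states and desired instance IDs.

    A failed instance that is still desired is stopped first, which clears
    its failed state, and then started again.
    """
    wanted = set(desired)
    stop = sorted(
        instance_id
        for instance_id, state in current.items()
        if instance_id not in wanted or state == "failed"
    )
    start = sorted(
        instance_id
        for instance_id in wanted
        if instance_id not in current or current[instance_id] == "failed"
    )
    return stop, start
```

For a node running a (active), b (failed), and c (active) with a desired set of a, b, and d, the plan stops b and c and starts b and d, while a is left alone.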
Cloud Connectivity Restoration
For instances stopped due to offline TTL expiration:
- When cloud connectivity is restored, the Launcher's OnConnect() callback clears the offline timestamp
- The Launcher reports current instance statuses to CM
- CM includes the stopped instances in the next desired-state reconciliation
- SM receives a new UpdateInstances and restarts the affected instances
Configuration Summary
| Configuration Source | Parameter | Effect |
|---|---|---|
| Item config (per-service) | runParameters.startInterval | Time window for burst detection |
| Item config (per-service) | runParameters.startBurst | Max restarts within the interval |
| Item config (per-service) | runParameters.restartInterval | Delay between restart attempts |
| Item config (per-service) | offlineTTL | Duration before stopping on connectivity loss |
| Cgroup limits (per-service) | CPU quota, RAM limit, PID limit | Resource boundaries enforced by kernel |
| Monitoring config | Alert thresholds | When to generate quota alerts (informational) |
Related Pages
- Error Handling and Recovery — overview of AosCore error handling philosophy and recovery strategies
- Error Propagation — detailed documentation of how errors flow between components
- Update Failure and Rollback — update-specific failure handling and rollback procedures
- SM Launcher — the Launcher module architecture and runtime dispatch
- Service Instance States — complete state machine including the Failed state
- Monitoring Pipeline — how resource metrics and alerts are collected
- Alerts and Thresholds — alert configuration including instance quota alerts