Version: v1.1

Service Failure Handling

Introduction

This page documents how AosCore detects and handles service instance failures at the Service Manager (SM) level. When a service instance crashes, exceeds resource limits, or becomes unreachable due to connectivity loss, SM must detect the failure, report it through the system, and apply the configured recovery policy. Understanding these mechanisms is essential for OEMs configuring service restart behavior and diagnosing production failures.

Service failure handling involves three phases: detection (identifying that a failure occurred), reporting (propagating the failure status to CM and ultimately to the cloud), and recovery (applying restart policies to restore the service).

Failure Detection Mechanisms

The SM Launcher's container runtime detects service failures through its Runner module, which continuously monitors the systemd units that host service instances.

Process Exit Detection via Systemd Unit Monitoring

Each container service instance runs as a systemd transient unit (aos-service@<instance-id>.service). The Runner module monitors these units by polling their state every second via the systemd D-Bus connection:

Systemd Unit State	Mapped Instance State	Meaning
`active`	Active	Process is running normally
`inactive`	Inactive	Process has stopped (graceful shutdown)
`failed`	Failed	Process exited with an error or was killed
`activating`	(monitored during startup)	Process is starting up
`deactivating`	Failed	Process is shutting down unexpectedly
Any other state	Failed	Unexpected state treated as failure
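
The mapping in the table can be sketched as a lookup with a failure fallback. This is an illustrative Python sketch, not the AosCore implementation: the InstanceState enum and the map_unit_state function are hypothetical names, and treating activating as its own Activating state is an assumption based on the Active/Activating wording used in the offline TTL section.

```python
from enum import Enum


class InstanceState(Enum):
    # Hypothetical enum for illustration; names follow the table above.
    ACTIVATING = "activating"
    ACTIVE = "active"
    INACTIVE = "inactive"
    FAILED = "failed"


# systemd unit state string -> instance state, as listed in the table.
_UNIT_STATE_MAP = {
    "active": InstanceState.ACTIVE,
    "inactive": InstanceState.INACTIVE,
    "failed": InstanceState.FAILED,
    "activating": InstanceState.ACTIVATING,  # assumption: tracked as its own state
    "deactivating": InstanceState.FAILED,
}


def map_unit_state(unit_state: str) -> InstanceState:
    """Map a systemd unit state to an instance state; unknown states count as failures."""
    return _UNIT_STATE_MAP.get(unit_state, InstanceState.FAILED)
```

Using a default in the lookup keeps the "any other state" rule in one place, so a state string that systemd adds in the future is reported as a failure rather than silently ignored.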

The Runner maintains two tracking maps:

Starting units — units that are in the process of being launched. If a starting unit enters the failed state, the Runner immediately notifies the waiting start operation.
Running units — units that have successfully started. The monitoring thread detects state changes and reports them to the container runtime via the RunStatusReceiverItf callback.

When a running unit's state changes (e.g., from active to failed), the Runner collects the exit code from systemd and reports a RunStatus containing the instance ID, new state, and error information.
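
The change detection performed on each polling cycle can be pictured as a comparison between the last known states and a fresh poll. The sketch below is a simplified Python model under stated assumptions: the RunStatus shape and the detect_changes helper are hypothetical, and the real Runner reports through the RunStatusReceiverItf callback rather than returning a list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunStatus:
    # Hypothetical shape: the text names an instance ID, a state, and error info.
    instance_id: str
    state: str
    exit_code: Optional[int] = None


def detect_changes(running, polled):
    """Compare the last known unit states with a fresh poll.

    running: dict instance_id -> last known state (updated in place)
    polled:  dict instance_id -> (state, exit_code) from the current poll
    Returns a RunStatus for every running unit whose state changed.
    """
    changed = []
    for instance_id, (state, exit_code) in polled.items():
        if instance_id in running and running[instance_id] != state:
            running[instance_id] = state
            changed.append(RunStatus(instance_id, state, exit_code))
    return changed
```

Only units already present in the running map are reported, which mirrors the split between starting units and running units described above.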

Offline TTL Expiration

Each service instance can have an offline TTL (Time-To-Live) configured in its OCI image configuration. This mechanism stops instances that have been running without cloud connectivity for longer than their configured TTL:

When the cloud connection drops, the Launcher records the disconnect timestamp
A timer is started based on the shortest TTL among all Active/Activating instances
When the timer fires, the Launcher checks each instance:
- If time_since_disconnect > instance.offlineTTL, the instance is stopped
- The instance transitions to the Failed state
If connectivity is restored before any TTL expires, the disconnect timestamp is cleared and the timer is cancelled

Instances without a configured offline TTL (duration of zero) are not affected by connectivity loss and continue running indefinitely.
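
The timer selection and the expiry check can be sketched as two small functions. This is a minimal Python sketch, assuming the ttls dictionary already contains only Active/Activating instances; the function names and the dictionary layout are hypothetical and not taken from the Launcher code.

```python
from datetime import datetime, timedelta


def next_timer_delay(ttls):
    """Return the shortest non-zero offline TTL, or None if no instance has one."""
    configured = [ttl for ttl in ttls.values() if ttl > timedelta(0)]
    return min(configured) if configured else None


def expired_instances(ttls, disconnected_at, now):
    """Return IDs of instances whose offline TTL has elapsed since the disconnect."""
    offline_for = now - disconnected_at
    return sorted(
        instance_id
        for instance_id, ttl in ttls.items()
        # A zero TTL means "not configured": such instances are never stopped.
        if ttl > timedelta(0) and offline_for > ttl
    )
```

For example, with TTLs of 10 minutes, zero, and 1 hour, the timer is armed for 10 minutes, and only the first instance is stopped when it fires 11 minutes after the disconnect.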

Resource Limit Violations

Container instances run with Linux cgroup resource limits enforced by the container runtime:

CPU — quota/period-based limiting
Memory (RAM) — hard memory limit in bytes
PID count — maximum number of processes

When a container exceeds its memory limit, the Linux kernel's OOM killer terminates the process. This causes the systemd unit to enter the failed state, which the Runner detects during its next polling cycle. The exit code from the OOM kill is captured and included in the error report.

CPU quota violations do not terminate the process — they throttle it. PID limit violations prevent new process creation within the container but do not kill existing processes.
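
To illustrate why a CPU quota throttles rather than kills, the quota/period relationship can be written out. This is a simplified Python sketch: the 100000 microsecond period is the usual Linux default and is assumed here, the function names are hypothetical, and the actual values AosCore writes are not shown on this page.

```python
def cpu_quota_us(cpu_fraction: float, period_us: int = 100_000) -> int:
    """CPU time (microseconds) a cgroup may use per period for a given CPU fraction."""
    if cpu_fraction <= 0:
        raise ValueError("cpu_fraction must be positive")
    return int(round(cpu_fraction * period_us))


def throttled_time_us(demand_us: int, quota_us: int) -> int:
    """CPU time the group wanted in one period but was not granted.

    The process is not terminated; it simply waits for the next period.
    """
    return max(0, demand_us - quota_us)
```

A service limited to half a CPU gets 50000 microseconds per 100000 microsecond period; if it wants 80000, the remaining 30000 are deferred, which shows up as latency rather than as a failure.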

Resource usage is also monitored by the SM monitoring subsystem, which generates InstanceQuotaAlert notifications when usage exceeds configured thresholds. These alerts are informational and do not directly trigger instance termination — they serve as early warnings before hard limits are hit.

The Failed Instance State

When a failure is detected, the instance transitions to the eFailed state. The InstanceStatus structure carries:

Field	Description
`mState`	Set to `eFailed`
`mError`	An `Error` object containing the exit code (from systemd) and a descriptive message
Instance identity fields	`mItemID`, `mSubjectID`, `mInstance` — identifying which instance failed
`mRuntimeID`	Which runtime was hosting the instance
`mNodeID`	Which Node the instance was running on

The error information is derived from the systemd unit's exit code. If the process exited with a non-zero code, that code is preserved. If the failure was due to a systemd-level issue (unit could not start), a generic eFailed error is reported with a descriptive message.
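
The rule in the previous paragraph can be sketched as follows. This Python sketch is illustrative only: the Error class is a hypothetical stand-in for the Error object in the table, and the message strings are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Error:
    # Hypothetical stand-in for the Error object carried in mError.
    exit_code: int
    message: str


def error_from_unit(exit_code: Optional[int], unit_message: str = "") -> Error:
    """Build the error for a failed instance.

    A non-zero exit code is preserved. Otherwise a generic failure is reported;
    exit_code 0 here only means that no process exit code is available.
    """
    if exit_code:
        return Error(exit_code, f"process exited with code {exit_code}")
    return Error(0, unit_message or "unit failed to start")
```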

Restart Policies

The container runtime applies restart policies through systemd's built-in restart mechanism. Each service instance's systemd unit is configured with Restart=always, meaning systemd automatically restarts the process when it exits. The restart behavior is controlled by three parameters from the RunParameters configuration:

Configuration Parameters

Parameter	Field	Default	Description
Start Interval	`mStartInterval`	5 seconds	The time window within which `StartBurst` restarts are allowed. Maps to systemd's `StartLimitIntervalSec`.
Start Burst	`mStartBurst`	3	Maximum number of start attempts allowed within the Start Interval. Maps to systemd's `StartLimitBurst`.
Restart Interval	`mRestartInterval`	1 second	Delay between a process exit and the next restart attempt. Maps to systemd's `RestartSec`.

These parameters are set per-instance in the service's item configuration (delivered as part of the desired state from the cloud). If not specified, the defaults above are used.

Systemd Drop-In Configuration

When starting an instance, the Runner creates a systemd drop-in file at /run/systemd/system/aos-service@<instance-id>.service.d/parameters.conf with the following content:

[Unit]
StartLimitIntervalSec=<startInterval in seconds>
StartLimitBurst=<startBurst>

[Service]
RestartSec=<restartInterval in seconds>

This configures systemd to:

Restart the process after RestartSec seconds when it exits
Allow up to StartLimitBurst starts within StartLimitIntervalSec (systemd counts every start of the unit in the window, not only automatic restarts)
If the burst limit is exceeded, stop attempting restarts and leave the unit in failed state
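
Generating the drop-in content from the run parameters can be sketched as below. This is a hedged Python sketch rather than the Runner's code: the function names are hypothetical, and truncating durations to whole seconds is an assumption based on the "in seconds" placeholders in the template.

```python
from datetime import timedelta


def render_parameters_conf(start_interval: timedelta, start_burst: int,
                           restart_interval: timedelta) -> str:
    """Render the drop-in content shown above, with durations in whole seconds."""
    return (
        "[Unit]\n"
        f"StartLimitIntervalSec={int(start_interval.total_seconds())}\n"
        f"StartLimitBurst={start_burst}\n"
        "\n"
        "[Service]\n"
        f"RestartSec={int(restart_interval.total_seconds())}\n"
    )


def drop_in_path(instance_id: str) -> str:
    """Path of the drop-in file for one instance, following the pattern in the text."""
    return f"/run/systemd/system/aos-service@{instance_id}.service.d/parameters.conf"
```

With the default parameters this yields StartLimitIntervalSec=5, StartLimitBurst=3, and RestartSec=1.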

Restart Behavior Example

With default parameters (startInterval=5s, startBurst=3, restartInterval=1s):

Instance crashes → systemd waits 1 second → restarts (attempt 1)
Instance crashes again → waits 1 second → restarts (attempt 2)
Instance crashes again → waits 1 second → restarts (attempt 3)
Instance crashes again → a fourth start within 5 seconds would exceed the burst limit of 3 → systemd stops restarting and the unit stays in the failed state

At this point, the Runner's monitoring thread detects the failed state and reports it up through the system. The instance remains in the Failed state until a new UpdateInstances command is received (typically triggered by a new desired-state update from the cloud).
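
The interaction between the three parameters can be explored with a small simulation. The Python sketch below uses a simplified model of systemd's start rate limiting (a fixed window that opens at the first counted start) and assumes a unit that crashes immediately after every start, with its original start lying outside the window. The class and function names are hypothetical.

```python
from typing import Optional


class StartRateLimit:
    """Simplified model of systemd start rate limiting (fixed window from first start)."""

    def __init__(self, interval: float, burst: int):
        self.interval = interval
        self.burst = burst
        self.window_start: Optional[float] = None
        self.count = 0

    def allow(self, now: float) -> bool:
        # A new window opens on the first start or once the old window has elapsed.
        if self.window_start is None or now - self.window_start > self.interval:
            self.window_start = now
            self.count = 1
            return True
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # limit hit: the unit is left in the failed state


def restarts_before_failed(start_interval, start_burst, restart_interval, max_attempts=100):
    """Count restarts of a unit that crashes right after every start.

    Returns the number of restarts performed before the limit is hit,
    or None if the limit is never hit within max_attempts.
    """
    limiter = StartRateLimit(start_interval, start_burst)
    now = 0.0
    for attempt in range(max_attempts):
        now += restart_interval  # RestartSec delay before each restart
        if not limiter.allow(now):
            return attempt
    return None
```

With the defaults (5, 3, 1) the model performs three restarts and refuses the fourth. With a restart interval of 3 seconds the window expires before a third start is counted, so in this model the unit keeps restarting and never settles in the failed state; this is worth checking when tuning the parameters.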

Start Interval Multiplier

When starting an instance, the Runner applies a 1.2× multiplier to the configured start interval to determine the timeout for the initial start operation. This provides a small buffer beyond the configured interval to account for system load during startup. If the unit does not reach the active state within this timeout, the start is considered failed.
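
As a worked example of the multiplier, a 5-second start interval gives a 6-second start timeout. The Python sketch below expresses this; the constant and function names are hypothetical.

```python
from datetime import timedelta

START_TIMEOUT_MULTIPLIER = 1.2  # multiplier stated in the text


def start_timeout(start_interval: timedelta) -> timedelta:
    """Timeout for the initial start operation."""
    return start_interval * START_TIMEOUT_MULTIPLIER


def start_succeeded(time_to_active: timedelta, start_interval: timedelta) -> bool:
    """True if the unit became active before the start timeout elapsed."""
    return time_to_active <= start_timeout(start_interval)
```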

Failure Reporting Flow

When a service instance fails, the failure information propagates through the system along the following path:

1. Runner → Container Runtime

The Runner's monitoring thread detects the state change and calls UpdateRunStatus() on the container runtime with a vector of RunStatus entries for all changed instances. Each entry contains the instance ID, new state (eFailed), and the error (including exit code).

2. Container Runtime → Launcher

The container runtime's UpdateRunStatus() implementation updates the internal Instance object's run status and calls OnInstancesStatusesReceived() on the Launcher (which implements InstanceStatusReceiverItf).

3. Launcher → SM Client (gRPC to CM)

The Launcher calls SendUpdateInstancesStatuses() on the SM client, which serializes the InstanceStatus array into the gRPC UpdateInstancesStatus message and sends it to CM over the SM registration stream.

4. CM → Cloud

CM's SM Controller receives the instance status update and forwards it to the CM Launcher, which aggregates instance statuses across all Nodes. The aggregated status is included in the unitStatus JSON message sent to AosCloud via the WebSocket connection, containing per-instance error details.
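
Steps 1 to 3 form a chain of callbacks inside SM, which can be modelled with three small classes. This Python sketch only mirrors the call order described above: the method names are snake_case renderings of UpdateRunStatus(), OnInstancesStatusesReceived(), and SendUpdateInstancesStatuses(), the status dictionaries are a hypothetical shape, and the gRPC stream is replaced by an in-memory list.

```python
class SMClient:
    def __init__(self):
        self.sent = []  # stands in for the gRPC stream to CM

    def send_update_instances_statuses(self, statuses):
        self.sent.append(list(statuses))


class Launcher:
    def __init__(self, client):
        self.client = client

    def on_instances_statuses_received(self, statuses):
        # Step 3: forward the statuses to the SM client.
        self.client.send_update_instances_statuses(statuses)


class ContainerRuntime:
    def __init__(self, receiver):
        self.receiver = receiver
        self.instances = {}  # instance_id -> latest run status

    def update_run_status(self, run_statuses):
        # Step 2: record the new run status, then notify the Launcher.
        for status in run_statuses:
            self.instances[status["instance_id"]] = status
        self.receiver.on_instances_statuses_received(run_statuses)
```

In this model the Runner's role (step 1) is simply the caller of update_run_status with the list of changed instances.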

Proto Message Structure

The InstanceStatus message in the SM v5 protocol carries failure information:

message InstanceStatus {
    common.v2.InstanceIdent instance        = 1;
    string                  version         = 2;
    string                  runtime_id      = 3;
    string                  manifest_digest = 4;
    repeated EnvVarStatus   env_vars        = 5;
    string                  state           = 6;  // "failed"
    common.v2.ErrorInfo     error           = 7;  // failure details
}

The ErrorInfo contains:

aos_code — internal error classification
exit_code — the process exit code from systemd
message — human-readable description of the failure

Instance Alerts

In addition to status reporting, the SM generates alerts for service instance events. These alerts are sent to CM as InstanceAlert messages and forwarded to the cloud:

Alert Type	Trigger	Content
Instance Alert	Service process logs an error-level message	The log message content, instance identity, and severity
Instance Quota Alert	Resource usage exceeds a configured threshold	The resource parameter name, current value, and alert state (raised/cleared)

Instance alerts provide real-time visibility into service health without waiting for a full failure. They are generated by the SM's journal alert monitor, which watches the systemd journal for log entries from service units.
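
The raised/cleared alert state of a quota alert can be illustrated with a simple threshold-crossing detector. This Python sketch is a deliberate simplification and an assumption about the behavior: it uses a single threshold and no time-based filtering, whereas the actual monitoring configuration is documented separately. The function name is hypothetical.

```python
def quota_alert_transitions(samples, threshold):
    """Return (sample_index, state) pairs for each crossing of the threshold."""
    raised = False
    events = []
    for index, value in enumerate(samples):
        if not raised and value > threshold:
            raised = True
            events.append((index, "raised"))
        elif raised and value <= threshold:
            raised = False
            events.append((index, "cleared"))
    return events
```

Reporting only transitions, rather than every sample above the threshold, keeps the alert stream small while still telling the cloud when a condition starts and ends.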

Recovery After Failure

Once an instance is in the Failed state and systemd has exhausted its restart attempts, recovery requires external intervention:

Automatic Recovery via Desired State

The most common recovery path is a new desired-state update from the cloud:

The cloud receives the failure report via unitStatus
An operator or automated system issues a new desired state (possibly identical to the current one)
CM processes the new desired state and sends UpdateInstances to SM
SM's Launcher stops the failed instance (resets the systemd unit's failed state) and starts it fresh

What Happens During Recovery

When the Launcher receives an UpdateInstances command that includes a previously-failed instance:

The instance appears in the stop list — the Launcher calls StopInstance() on the runtime
StopInstance() calls ResetFailedUnit() on systemd to clear the failed state
The instance then appears in the start list — the Launcher creates a fresh instance with new run parameters
The runtime sets up the environment and starts the systemd unit again

This ensures a clean restart — the previous failed state is fully cleared before the new attempt.
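
The stop-then-start sequence can be sketched as a planning step followed by an apply step. This Python sketch makes several assumptions: instances are tracked as a plain dictionary, the log entries stand in for calls to the runtime and systemd, and the failed-state reset is shown as part of every stop. All names are hypothetical.

```python
def plan_update(current, desired):
    """Split an update into stop and start lists.

    current: dict instance_id -> state reported by the runtime
    desired: set of instance IDs requested by UpdateInstances
    A failed instance that is still desired appears in both lists.
    """
    stop = sorted(i for i, s in current.items() if i not in desired or s == "failed")
    start = sorted(i for i in desired if i not in current or current[i] == "failed")
    return stop, start


def apply_update(current, desired, log):
    """Apply the plan: stop (and reset) first, then start fresh instances."""
    stop, start = plan_update(current, desired)
    for instance_id in stop:
        log.append(("stop", instance_id))
        log.append(("reset-failed", instance_id))  # clears the systemd failed state
        del current[instance_id]
    for instance_id in start:
        log.append(("start", instance_id))
        current[instance_id] = "active"
    return log
```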

Cloud Connectivity Restoration

For instances stopped due to offline TTL expiration:

When cloud connectivity is restored, the Launcher's OnConnect() callback clears the offline timestamp
The Launcher reports current instance statuses to CM
CM includes the stopped instances in the next desired-state reconciliation
SM receives a new UpdateInstances and restarts the affected instances

Configuration Summary

Configuration Source	Parameter	Effect
Item config (per-service)	`runParameters.startInterval`	Time window for burst detection
Item config (per-service)	`runParameters.startBurst`	Max restarts within the interval
Item config (per-service)	`runParameters.restartInterval`	Delay between restart attempts
Item config (per-service)	`offlineTTL`	Duration before stopping on connectivity loss
Cgroup limits (per-service)	CPU quota, RAM limit, PID limit	Resource boundaries enforced by kernel
Monitoring config	Alert thresholds	When to generate quota alerts (informational)

Related Pages

Error Handling and Recovery: overview of AosCore error handling philosophy and recovery strategies
Error Propagation: detailed documentation of how errors flow between components
Update Failure and Rollback: update-specific failure handling and rollback procedures
SM Launcher: the Launcher module architecture and runtime dispatch
Service Instance States: complete state machine including the Failed state
Monitoring Pipeline: how resource metrics and alerts are collected
Alerts and Thresholds: alert configuration including instance quota alerts
