Version: v1.1

Service Failure Handling

Introduction

This page documents how AosCore detects and handles service instance failures at the Service Manager (SM) level. When a service instance crashes, exceeds resource limits, or becomes unreachable due to connectivity loss, SM must detect the failure, report it through the system, and apply the configured recovery policy. Understanding these mechanisms is essential for OEMs configuring service restart behavior and diagnosing production failures.

Service failure handling involves three phases: detection (identifying that a failure occurred), reporting (propagating the failure status to CM and ultimately to the cloud), and recovery (applying restart policies to restore the service).

Failure Detection Mechanisms

The SM Launcher's container runtime detects service failures through its Runner module, which continuously monitors the systemd units that host service instances.

Process Exit Detection via Systemd Unit Monitoring

Each container service instance runs as a systemd transient unit (aos-service@<instance-id>.service). The Runner module monitors these units by polling their state every second via the systemd D-Bus connection:

| Systemd Unit State | Mapped Instance State | Meaning |
| --- | --- | --- |
| active | Active | Process is running normally |
| inactive | Inactive | Process has stopped (graceful shutdown) |
| failed | Failed | Process exited with an error or was killed |
| activating | (monitored during startup) | Process is starting up |
| deactivating | Failed | Process is shutting down unexpectedly |
| Any other state | Failed | Unexpected state treated as failure |

The Runner maintains two tracking maps:

  • Starting units — units that are in the process of being launched. If a starting unit enters the failed state, the Runner immediately notifies the waiting start operation.
  • Running units — units that have successfully started. The monitoring thread detects state changes and reports them to the container runtime via the RunStatusReceiverItf callback.

When a running unit's state changes (e.g., from active to failed), the Runner collects the exit code from systemd and reports a RunStatus containing the instance ID, new state, and error information.
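The mapping and change detection described above can be sketched as follows. This is a minimal illustration, not the Runner's actual code: `InstanceState`, `map_unit_state`, and `detect_changes` are hypothetical names, and treating `activating` as its own instance state is an assumption based on the Active/Activating wording used elsewhere on this page.

```python
from enum import Enum


class InstanceState(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    ACTIVATING = "activating"  # assumption: tracked as a distinct state during startup
    FAILED = "failed"


# Explicit mappings from the state table; every other systemd state maps to FAILED.
_UNIT_TO_INSTANCE = {
    "active": InstanceState.ACTIVE,
    "inactive": InstanceState.INACTIVE,
    "activating": InstanceState.ACTIVATING,
}


def map_unit_state(unit_state: str) -> InstanceState:
    """Map a systemd unit state string to an instance state."""
    return _UNIT_TO_INSTANCE.get(unit_state, InstanceState.FAILED)


def detect_changes(previous: dict, current: dict) -> dict:
    """Compare the last known instance states with a fresh poll result.

    previous: unit name -> InstanceState seen on the last poll
    current:  unit name -> raw systemd state string from this poll
    Returns only the units whose mapped state changed.
    """
    changes = {}
    for unit, unit_state in current.items():
        new_state = map_unit_state(unit_state)
        if previous.get(unit) != new_state:
            changes[unit] = new_state
    return changes
```

In a polling loop, the result of `detect_changes` would be the set of entries reported through the status callback, and `previous` would then be updated with the new states.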

Offline TTL Expiration

Each service instance can have an offline TTL (Time-To-Live) configured in its OCI image configuration. This mechanism stops instances that have been running without cloud connectivity for too long:

  1. When the cloud connection drops, the Launcher records the disconnect timestamp
  2. A timer is started based on the shortest TTL among all Active/Activating instances
  3. When the timer fires, the Launcher checks each instance:
    • If time_since_disconnect > instance.offlineTTL, the instance is stopped
    • The instance transitions to the Failed state
  4. If connectivity is restored before any TTL expires, the disconnect timestamp is cleared and the timer is cancelled

Instances without a configured offline TTL (duration of zero) are not affected by connectivity loss and continue running indefinitely.
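The timer selection and expiry check can be sketched like this. The function names and the dictionary layout are illustrative assumptions; only the rules (shortest TTL drives the timer, zero TTL means unlimited, an instance is stopped once the time since disconnect exceeds its TTL) come from the description above.

```python
from datetime import datetime, timedelta

NO_TTL = timedelta(0)  # a zero duration means "no offline TTL configured"


def next_timer_delay(ttls: dict):
    """Return the shortest configured TTL, or None if no instance has one."""
    configured = [ttl for ttl in ttls.values() if ttl > NO_TTL]
    return min(configured) if configured else None


def expired_instances(ttls: dict, disconnected_at: datetime, now: datetime) -> list:
    """Return the IDs of instances whose offline TTL has been exceeded."""
    elapsed = now - disconnected_at
    return sorted(
        instance_id
        for instance_id, ttl in ttls.items()
        if ttl > NO_TTL and elapsed > ttl
    )
```

For example, with TTLs of 5 minutes, zero, and 30 minutes, the timer would be armed for 5 minutes, and six minutes after the disconnect only the first instance would be stopped.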

Resource Limit Violations

Container instances run with Linux cgroup resource limits enforced by the container runtime:

  • CPU — quota/period-based limiting
  • Memory (RAM) — hard memory limit in bytes
  • PID count — maximum number of processes

When a container exceeds its memory limit, the Linux kernel's OOM killer terminates the process. This causes the systemd unit to enter the failed state, which the Runner detects during its next polling cycle. The exit code from the OOM kill is captured and included in the error report.

CPU quota violations do not terminate the process — they throttle it. PID limit violations prevent new process creation within the container but do not kill existing processes.

Resource usage is also monitored by the SM monitoring subsystem, which generates InstanceQuotaAlert notifications when usage exceeds configured thresholds. These alerts are informational and do not directly trigger instance termination — they serve as early warnings before hard limits are hit.

The Failed Instance State

When a failure is detected, the instance transitions to the eFailed state. The InstanceStatus structure carries:

| Field | Description |
| --- | --- |
| mState | Set to eFailed |
| mError | An Error object containing the exit code (from systemd) and a descriptive message |
| Instance identity fields | mItemID, mSubjectID, and mInstance, which identify the instance that failed |
| mRuntimeID | Which runtime was hosting the instance |
| mNodeID | Which Node the instance was running on |

The error information is derived from the systemd unit's exit code. If the process exited with a non-zero code, that code is preserved. If the failure was due to a systemd-level issue (unit could not start), a generic eFailed error is reported with a descriptive message.
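A minimal sketch of that derivation is shown below. The `Error` dataclass and `error_from_unit` are hypothetical stand-ins for the real Error object, and the message strings are invented for illustration; passing `None` as the exit code models the case where systemd never managed to run the process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Error:
    exit_code: int
    message: str


def error_from_unit(exit_code: Optional[int]) -> Error:
    """Build the error attached to a failed instance status.

    exit_code is the process exit code reported by systemd, or None when
    the unit failed before a process exit code was available.
    """
    if exit_code:  # non-zero exit code: preserve it
        return Error(exit_code, f"service process exited with code {exit_code}")
    # systemd-level failure: generic error with a descriptive message
    return Error(0, "eFailed: systemd unit could not be started")
```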

Restart Policies

The container runtime applies restart policies through systemd's built-in restart mechanism. Each service instance's systemd unit is configured with Restart=always, meaning systemd automatically restarts the process when it exits. The restart behavior is controlled by three parameters from the RunParameters configuration:

Configuration Parameters

| Parameter | Field | Default | Description |
| --- | --- | --- | --- |
| Start Interval | mStartInterval | 5 seconds | The time window within which StartBurst restarts are allowed. Maps to systemd's StartLimitIntervalSec. |
| Start Burst | mStartBurst | 3 | Maximum number of start attempts allowed within the Start Interval. Maps to systemd's StartLimitBurst. |
| Restart Interval | mRestartInterval | 1 second | Delay between a process exit and the next restart attempt. Maps to systemd's RestartSec. |

These parameters are set per-instance in the service's item configuration (delivered as part of the desired state from the cloud). If not specified, the defaults above are used.

Systemd Drop-In Configuration

When starting an instance, the Runner creates a systemd drop-in file at /run/systemd/system/aos-service@<instance-id>.service.d/parameters.conf with the following content:

[Unit]
StartLimitIntervalSec=<startInterval in seconds>
StartLimitBurst=<startBurst>

[Service]
RestartSec=<restartInterval in seconds>

This configures systemd to:

  1. Restart the process RestartSec seconds after it exits
  2. Allow at most StartLimitBurst starts of the unit within a StartLimitIntervalSec window (systemd counts every start, including the initial one)
  3. Once the burst limit is exceeded, stop attempting restarts and leave the unit in the failed state
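As an illustration, the drop-in path and content could be generated as follows. The helper names are hypothetical and the defaults simply mirror the documented ones; the actual Runner code may build the file differently.

```python
from pathlib import PurePosixPath

DROP_IN_ROOT = PurePosixPath("/run/systemd/system")


def drop_in_path(instance_id: str) -> PurePosixPath:
    """Path of the parameters.conf drop-in for one service instance."""
    return DROP_IN_ROOT / f"aos-service@{instance_id}.service.d" / "parameters.conf"


def render_drop_in(start_interval_s: int = 5,
                   start_burst: int = 3,
                   restart_interval_s: int = 1) -> str:
    """Render the drop-in content from the three run parameters."""
    return (
        "[Unit]\n"
        f"StartLimitIntervalSec={start_interval_s}\n"
        f"StartLimitBurst={start_burst}\n"
        "\n"
        "[Service]\n"
        f"RestartSec={restart_interval_s}\n"
    )
```

Keeping the rendering separate from the file write makes it easy to check the generated configuration without touching /run.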

Restart Behavior Example

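With default parameters (startInterval=5s, startBurst=3, restartInterval=1s), and assuming the process crashes almost immediately after each start:

  1. Instance starts (start 1) and crashes → systemd waits 1 second → restarts (start 2)
  2. Instance crashes again → waits 1 second → restarts (start 3)
  3. Instance crashes again → waits 1 second → a fourth start within 5 seconds would exceed the burst limit, so systemd refuses it and the unit enters the failed state

systemd counts every start of the unit toward StartLimitBurst, including the initial one, so a burst of 3 allows the initial start plus two automatic restarts inside the interval. If the crashes are spaced far enough apart that the interval elapses, the counter starts over and the limit is not hit.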

At this point, the Runner's monitoring thread detects the failed state and reports it up through the system. The instance remains in the Failed state until a new UpdateInstances command is received (typically triggered by a new desired-state update from the cloud).
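The interaction between the three parameters can be explored with a small simulation. This is a simplified model of systemd's start rate limiting (a fixed window that opens at the first counted start, with every start counted, including the initial one), and it assumes the process crashes immediately after each start. The function name and its return shape are illustrative only.

```python
def simulate_starts(start_interval: float = 5.0,
                    start_burst: int = 3,
                    restart_interval: float = 1.0,
                    max_attempts: int = 10):
    """Simulate an instance that crashes immediately after every start.

    Returns (allowed_start_times, refused_at), where refused_at is the time
    of the first refused start, or None if no start was refused.
    """
    allowed = []
    window_begin = None  # time of the first start counted in the current window
    count = 0            # starts counted in the current window
    now = 0.0
    for _ in range(max_attempts):
        if window_begin is None or now - window_begin > start_interval:
            # The previous window has elapsed: open a new one with this start.
            window_begin, count = now, 1
        elif count < start_burst:
            count += 1
        else:
            return allowed, now  # burst limit exceeded, unit stays failed
        allowed.append(now)
        now += restart_interval  # immediate crash, then the RestartSec delay
    return allowed, None
```

With the defaults, the model allows starts at 0, 1, and 2 seconds and refuses the one at 3 seconds. Raising the restart interval to 3 seconds spreads the starts out enough that the limit is never reached.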

Start Interval Multiplier

When starting an instance, the Runner applies a 1.2× multiplier to the configured start interval to determine the timeout for the initial start operation. This provides a small buffer beyond the configured interval to account for system load during startup. If the unit does not reach the active state within this timeout, the start is considered failed.
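The timeout calculation is simple enough to state directly. The constant and function names below are hypothetical; only the 1.2× factor and the 5-second default come from this page.

```python
from datetime import timedelta

START_TIMEOUT_MULTIPLIER = 1.2


def start_timeout(start_interval: timedelta = timedelta(seconds=5)) -> timedelta:
    """Time allowed for a unit to reach the active state during startup."""
    return start_interval * START_TIMEOUT_MULTIPLIER
```

With the default 5-second start interval this gives a 6-second start timeout.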

Failure Reporting Flow

When a service instance fails, the information propagates through the system in a well-defined path:

1. Runner → Container Runtime

The Runner's monitoring thread detects the state change and calls UpdateRunStatus() on the container runtime with a vector of RunStatus entries for all changed instances. Each entry contains the instance ID, new state (eFailed), and the error (including exit code).

2. Container Runtime → Launcher

The container runtime's UpdateRunStatus() implementation updates the internal Instance object's run status and calls OnInstancesStatusesReceived() on the Launcher (which implements InstanceStatusReceiverItf).

3. Launcher → SM Client (gRPC to CM)

The Launcher calls SendUpdateInstancesStatuses() on the SM client, which serializes the InstanceStatus array into the gRPC UpdateInstancesStatus message and sends it to CM over the SM registration stream.

4. CM → Cloud

CM's SM Controller receives the instance status update and forwards it to the CM Launcher, which aggregates instance statuses across all Nodes. The aggregated status is included in the unitStatus JSON message sent to AosCloud via the WebSocket connection, containing per-instance error details.
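The first three hops form a chain of callbacks inside SM. The sketch below models that chain with plain Python classes. The class and method names echo the ones used above, but the bodies are illustrative assumptions, and the gRPC transport is replaced by an in-memory list.

```python
class SMClient:
    """Stands in for the gRPC client; records what would be sent to CM."""

    def __init__(self):
        self.sent = []

    def send_update_instances_statuses(self, statuses):
        self.sent.append(list(statuses))


class Launcher:
    """Receives instance statuses from the runtime and forwards them to CM."""

    def __init__(self, client: SMClient):
        self._client = client

    def on_instances_statuses_received(self, statuses):
        self._client.send_update_instances_statuses(statuses)


class ContainerRuntime:
    """Receives run statuses from the Runner and notifies the Launcher."""

    def __init__(self, receiver: Launcher):
        self._receiver = receiver
        self.run_status = {}  # instance ID -> last reported run status

    def update_run_status(self, run_statuses):
        for status in run_statuses:
            self.run_status[status["instance_id"]] = status
        self._receiver.on_instances_statuses_received(run_statuses)


# Wire the chain and push one failure through it, as the Runner would.
client = SMClient()
runtime = ContainerRuntime(Launcher(client))
runtime.update_run_status(
    [{"instance_id": "inst-1", "state": "failed", "exit_code": 1}]
)
```

After the call, `client.sent` holds one batch containing the failed status, which is the point where the real SM client would serialize it into the gRPC message.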

Proto Message Structure

The InstanceStatus message in the SM v5 protocol carries failure information:

message InstanceStatus {
    common.v2.InstanceIdent instance = 1;
    string version = 2;
    string runtime_id = 3;
    string manifest_digest = 4;
    repeated EnvVarStatus env_vars = 5;
    string state = 6;                // "failed"
    common.v2.ErrorInfo error = 7;   // failure details
}

The ErrorInfo contains:

  • aos_code — internal error classification
  • exit_code — the process exit code from systemd
  • message — human-readable description of the failure

Instance Alerts

In addition to status reporting, the SM generates alerts for service instance events. These alerts are sent to CM as InstanceAlert messages and forwarded to the cloud:

| Alert Type | Trigger | Content |
| --- | --- | --- |
| Instance Alert | Service process logs an error-level message | The log message content, instance identity, and severity |
| Instance Quota Alert | Resource usage exceeds a configured threshold | The resource parameter name, current value, and alert state (raised/cleared) |

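Both alert types provide real-time visibility into service health without waiting for a full failure. Instance alerts are generated by the SM's journal alert monitor, which watches the systemd journal for log entries from service units. Instance quota alerts are generated by the SM monitoring subsystem described under Resource Limit Violations.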

Recovery After Failure

Once an instance is in the Failed state and systemd has exhausted its restart attempts, recovery requires external intervention:

Automatic Recovery via Desired State

The most common recovery path is a new desired-state update from the cloud:

  1. The cloud receives the failure report via unitStatus
  2. An operator or automated system issues a new desired state (possibly identical to the current one)
  3. CM processes the new desired state and sends UpdateInstances to SM
  4. SM's Launcher stops the failed instance (resets the systemd unit's failed state) and starts it fresh

What Happens During Recovery

When the Launcher receives an UpdateInstances command that includes a previously failed instance:

  1. The instance appears in the stop list — the Launcher calls StopInstance() on the runtime
  2. StopInstance() calls ResetFailedUnit() on systemd to clear the failed state
  3. The instance then appears in the start list — the Launcher creates a fresh instance with new run parameters
  4. The runtime sets up the environment and starts the systemd unit again

This ensures a clean restart — the previous failed state is fully cleared before the new attempt.
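The ordering matters: the failed state is reset as part of the stop, before the unit is started again. The sketch below records the sequence of systemd operations using a fake manager. `FakeSystemd` and the function names are hypothetical, and the environment setup performed by the runtime is omitted.

```python
class FakeSystemd:
    """Records the operations a real systemd connection would perform."""

    def __init__(self):
        self.calls = []

    def stop_unit(self, unit):
        self.calls.append(("stop", unit))

    def reset_failed_unit(self, unit):
        self.calls.append(("reset-failed", unit))

    def start_unit(self, unit):
        self.calls.append(("start", unit))


def unit_name(instance_id: str) -> str:
    return f"aos-service@{instance_id}.service"


def stop_instance(systemd, instance_id: str) -> None:
    """Stop the unit and clear its failed state."""
    unit = unit_name(instance_id)
    systemd.stop_unit(unit)
    systemd.reset_failed_unit(unit)


def start_instance(systemd, instance_id: str) -> None:
    """Start the unit again (environment setup omitted in this sketch)."""
    systemd.start_unit(unit_name(instance_id))


def update_instances(systemd, stop_ids, start_ids) -> None:
    """Process the stop list first, then the start list."""
    for instance_id in stop_ids:
        stop_instance(systemd, instance_id)
    for instance_id in start_ids:
        start_instance(systemd, instance_id)
```

Calling `update_instances(systemd, ["inst-1"], ["inst-1"])` on a fresh `FakeSystemd` records stop, reset-failed, and start for the same unit, in that order.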

Cloud Connectivity Restoration

For instances stopped due to offline TTL expiration:

  1. When cloud connectivity is restored, the Launcher's OnConnect() callback clears the offline timestamp
  2. The Launcher reports current instance statuses to CM
  3. CM includes the stopped instances in the next desired-state reconciliation
  4. SM receives a new UpdateInstances and restarts the affected instances

Configuration Summary

| Configuration Source | Parameter | Effect |
| --- | --- | --- |
| Item config (per-service) | runParameters.startInterval | Time window for burst detection |
| Item config (per-service) | runParameters.startBurst | Max restarts within the interval |
| Item config (per-service) | runParameters.restartInterval | Delay between restart attempts |
| Item config (per-service) | offlineTTL | Duration before stopping on connectivity loss |
| Cgroup limits (per-service) | CPU quota, RAM limit, PID limit | Resource boundaries enforced by kernel |
| Monitoring config | Alert thresholds | When to generate quota alerts (informational) |