Version: v1.1

Service Manager

Introduction

The Service Manager (SM) is the AosCore component responsible for the complete lifecycle of service instances on a Node. It handles everything from acquiring and storing OCI service images, through launching containers or native processes, to monitoring resource usage and collecting logs.

Every Node in a Unit runs its own SM instance. The SM connects to the Communication Manager (CM) on the Main Node via gRPC, receives deployment commands, and reports back instance status, monitoring data, alerts, and logs.

Process: aos_servicemanager

Role Within the Architecture

The SM operates as a managed executor — it does not make autonomous decisions about which services to run. Instead, it receives instructions from CM (via the smcontroller interface) and carries them out:

  1. Receives instance update commands from CM specifying which services to start or stop
  2. Acquires required service images (downloading and verifying OCI layers)
  3. Launches service instances using the appropriate runtime (container or native)
  4. Enforces resource limits (CPU, RAM, storage, network bandwidth)
  5. Monitors Node and instance resource consumption
  6. Reports status, metrics, alerts, and logs back to CM

This separation of concerns means the SM has no direct cloud connectivity — all communication with AosCloud flows through CM.
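
The executor role described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that an update command carries a list of instances to stop and a list to start, and that instances are identified by plain strings. The function name `apply_update` and the status strings are hypothetical and are not part of the SM's API.

```python
def apply_update(running, stop, start):
    """Apply one update command to the set of running instances.

    running: set of instance IDs currently running (mutated in place).
    stop, start: instance IDs the command asks to stop and to start.
    Returns a list of (instance_id, status) tuples to report upstream.
    """
    report = []
    for instance_id in stop:
        # Only instances that are actually running can be stopped.
        if instance_id in running:
            running.remove(instance_id)
            report.append((instance_id, "stopped"))
    for instance_id in start:
        # Starting an already running instance is treated as a no-op.
        if instance_id not in running:
            running.add(instance_id)
        report.append((instance_id, "active"))
    return report
```

The point of the sketch is that the function decides nothing on its own: the caller supplies both lists, and the function only executes them and reports the outcome.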

Subcomponents

The SM is composed of the following internal modules:

| Module | Responsibility |
| --- | --- |
| imagemanager | Manages OCI service images: coordinates image download, layer unpacking, storage, and integrity verification |
| launcher | Starts and stops service instances; supports multiple runtime types (container, boot, rootfs) through a pluggable runtime interface |
| resourcemanager | Provides resource information: tracks available host resources (devices, shared resources) for service allocation |
| networkmanager | Configures per-service networking: CNI plugin execution for network setup/teardown, traffic monitoring via iptables |
| monitoring | Collects resource metrics (CPU, RAM, disk usage, and network traffic) for the Node and for individual service instances |
| alerts | Generates alert events from the systemd journal: watches journal entries for configured patterns and threshold violations |
| logprovider | Collects and forwards logs on demand: system logs, per-instance logs, and crash logs |
| smclient | gRPC client connecting to CM: registers the SM, receives commands, sends status and telemetry upstream |
| iamclient | Client interface to IAM: obtains TLS certificates and Node identity, and manages service permissions |
| database | Local SQLite storage with schema migration: persists service image metadata, instance state, network configuration, and traffic data |

Key Interfaces

CM Interface (outbound gRPC client)

The SM connects to CM's smcontroller as a gRPC client using the servicemanager/v5 protocol. On connection, the SM:

  1. Registers itself by sending SMInfo — reporting its Node ID, available runtimes (with capabilities like DMIPS, RAM, OS/architecture info), and host resources
  2. Receives commands from CM:
    • UpdateInstances — start or stop specific service instances
    • UpdateNetworks — configure network parameters for services
    • GetNodeConfigStatus / CheckNodeConfig / SetNodeConfig — Node configuration management
    • SystemLogRequest / InstanceLogRequest / InstanceCrashLogRequest — log retrieval
    • GetAverageMonitoring — request averaged monitoring data
    • ConnectionStatus — cloud connection state notifications
  3. Reports back to CM:
    • UpdateInstancesStatus / NodeInstancesStatus — instance lifecycle state changes
    • InstantMonitoring / AverageMonitoring — resource usage metrics
    • Alert — threshold violations and system events
    • LogData — requested log content

The connection is secured with mutual TLS using certificates obtained from the local IAM instance.
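
One way to picture the command side of this interface is a dispatcher that routes each named command to a handler. The sketch below is an assumption-laden illustration: the command names come from the list above, but the `CommandDispatcher` class, the payload fields (`start`, `log_id`), and the choice of which module handles which command are all hypothetical.

```python
class CommandDispatcher:
    """Routes named CM commands to registered handler callables."""

    def __init__(self):
        self._handlers = {}

    def register(self, command, handler):
        self._handlers[command] = handler

    def dispatch(self, command, payload):
        handler = self._handlers.get(command)
        if handler is None:
            # Unknown commands are rejected rather than silently dropped.
            raise KeyError(f"no handler registered for {command!r}")
        return handler(payload)


dispatcher = CommandDispatcher()
# Assumed routing: the module that handles each command is a guess.
dispatcher.register(
    "UpdateInstances", lambda p: f"launcher: {len(p['start'])} to start"
)
dispatcher.register(
    "SystemLogRequest", lambda p: f"logprovider: log {p['log_id']}"
)
```

The real SM exchanges these messages over a gRPC connection; the dictionary payloads here only stand in for the protocol messages.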

IAM Interface (outbound gRPC client)

The SM's iamclient module connects to the local IAM instance to:

  • Obtain TLS certificates for securing the gRPC connection to CM
  • Obtain Node identity information (Node ID, Node type) used during SM registration
  • Register and unregister instance-level service permissions

The IAM client automatically reconnects when certificates are renewed.

Proxy Interface (v5)

In the v5 protocol, the SM also interacts with a ProxyService. The proxy handles:

  • DownloadRequest / CancelDownload — image content retrieval
  • ClockSync — time synchronization between Nodes

This interface is used in multi-Node deployments where the Secondary Node's SM downloads content through the Message Proxy rather than directly from the network.

Runtime Types

The SM launcher supports multiple runtime types through a pluggable architecture:

| Runtime | Description | Use Case |
| --- | --- | --- |
| Container | OCI-compatible container execution using systemd units (aos-service@.service) | Standard service deployment: isolated execution with cgroups, namespaces, and overlay filesystem |
| Boot | EFI boot-based deployment with partition management | Firmware-level components that require boot-time execution |
| Rootfs | Root filesystem deployment | System-level services that run directly on the host filesystem |

Each runtime reports its capabilities (DMIPS, RAM limits, OS info, architecture) to CM during registration, enabling the cloud to make informed scheduling decisions.
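
A pluggable runtime architecture of this kind can be sketched as an abstract interface plus a registry keyed by runtime type. The names `Runtime`, `FakeRuntime`, and `Launcher` below are hypothetical and do not reflect the SM's actual interface; the fake runtime only records calls so that the example runs without launching anything.

```python
from abc import ABC, abstractmethod


class Runtime(ABC):
    """Hypothetical runtime plug-in interface."""

    runtime_type = ""

    @abstractmethod
    def start(self, instance_id): ...

    @abstractmethod
    def stop(self, instance_id): ...


class FakeRuntime(Runtime):
    """Stand-in that records calls instead of launching anything."""

    def __init__(self, runtime_type):
        self.runtime_type = runtime_type
        self.calls = []

    def start(self, instance_id):
        self.calls.append(("start", instance_id))

    def stop(self, instance_id):
        self.calls.append(("stop", instance_id))


class Launcher:
    """Selects the registered runtime that matches the requested type."""

    def __init__(self):
        self._runtimes = {}

    def register(self, runtime):
        self._runtimes[runtime.runtime_type] = runtime

    def start_instance(self, runtime_type, instance_id):
        # Fail loudly when no plug-in is registered for the requested type.
        if runtime_type not in self._runtimes:
            raise ValueError(f"unsupported runtime type: {runtime_type}")
        self._runtimes[runtime_type].start(instance_id)
```

With this shape, supporting a new runtime type means registering one more `Runtime` implementation; the launcher logic itself stays unchanged.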

Data Persistence

The SM database (servicemanager.db) persists:

  • Service image metadata — item IDs, versions, manifest digests, states, timestamps
  • Instance information — full instance configuration including runtime assignment, network parameters, monitoring parameters, environment variables
  • Network state — network configurations, instance-to-network mappings, VLAN assignments
  • Traffic data — cumulative traffic counters per iptables chain for bandwidth monitoring
  • Journal cursor — position in systemd journal for alert processing continuity across restarts

The database supports schema migration for upgrades between SM versions.
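
The source does not describe how the migration is implemented. As a general illustration of schema migration on SQLite, the sketch below uses the `PRAGMA user_version` counter to track which migration steps have been applied. This mechanism, the `instances` table, and its columns are assumptions made for the example and are not taken from the SM schema.

```python
import sqlite3

# Ordered migration steps; step N upgrades the schema from version N to N+1.
# The table and columns are hypothetical.
MIGRATIONS = [
    "CREATE TABLE instances (instance_id TEXT PRIMARY KEY, runtime TEXT NOT NULL)",
    "ALTER TABLE instances ADD COLUMN state TEXT NOT NULL DEFAULT 'unknown'",
]


def migrate(conn):
    """Apply all pending migration steps and return the resulting version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        conn.execute(MIGRATIONS[version])
        # PRAGMA does not accept bound parameters; the value is a trusted int.
        conn.execute(f"PRAGMA user_version = {version + 1}")
        conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Calling `migrate` on an already up-to-date database applies nothing, so it can run unconditionally at startup.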

Monitoring and Alerting

The SM provides two complementary observability mechanisms:

Resource Monitoring

The monitoring module collects metrics at configurable intervals:

  • Node-level: CPU usage, RAM usage, disk partition usage, network traffic (upload/download)
  • Instance-level: Per-service CPU, RAM, disk, and network usage (tracked via cgroups and iptables)

Metrics are reported to CM as both instant snapshots and time-averaged values.
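
The difference between an instant snapshot and a time-averaged value can be shown with a small sliding-window sketch. The averaging method used by the SM is not stated in the source; a plain mean over the last N samples is assumed here, and `MetricWindow` is a hypothetical name.

```python
from collections import deque


class MetricWindow:
    """Keeps the most recent samples of one metric (for example CPU usage)."""

    def __init__(self, size):
        # deque with maxlen drops the oldest sample automatically.
        self._samples = deque(maxlen=size)

    def add(self, value):
        self._samples.append(value)

    def instant(self):
        """Most recent sample, or None if nothing was collected yet."""
        return self._samples[-1] if self._samples else None

    def average(self):
        """Mean of the samples in the window, or None if it is empty."""
        if not self._samples:
            return None
        return sum(self._samples) / len(self._samples)
```

A short spike shows up immediately in `instant()` but is damped in `average()`, which is why reporting both is useful.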

Journal-Based Alerts

The alerts module watches the systemd journal and generates the following types of alerts:

  • System quota alerts — Node resource usage exceeds configured thresholds
  • Instance quota alerts — individual service exceeds its resource allocation
  • Resource allocation alerts — resource assignment failures
  • System alerts — critical system-level journal messages
  • Core alerts — AosCore component errors (SM, CM, IAM)
  • Instance alerts — service-specific error messages

Alert rules support configurable thresholds with duration-based evaluation (percentage-based for CPU/RAM, point-based for traffic).
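
Duration-based evaluation can be sketched as a rule that fires only after a value has stayed above its threshold for a minimum time. The class below is a hypothetical illustration, not the SM's rule format: the field names, the use of seconds, and the choice to fire once per violation are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThresholdRule:
    threshold: float        # e.g. 80.0 for 80 % CPU
    min_duration: float     # seconds the value must stay above the threshold
    _above_since: Optional[float] = field(default=None, init=False)
    _raised: bool = field(default=False, init=False)

    def update(self, value, now):
        """Feed one sample; return True once per sustained violation."""
        if value <= self.threshold:
            # Dropping back under the threshold resets the rule.
            self._above_since = None
            self._raised = False
            return False
        if self._above_since is None:
            self._above_since = now
        if not self._raised and now - self._above_since >= self.min_duration:
            self._raised = True
            return True
        return False
```

For example, with a threshold of 80 and a minimum duration of 10 seconds, samples above 80 at t=0 and t=5 raise nothing, a further sample above 80 at t=10 raises the alert, and later samples stay silent until the value has dropped below the threshold and exceeded it again for another 10 seconds.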