Version: v1.1

Scheduling and Placement

Introduction

When a new desired state arrives from AosCloud, the Communication Manager (CM) must decide where each Instance should run. This decision — scheduling and placement — is handled by the CM's Launcher module and its internal Balancer component. The Balancer implements a multi-stage filtering pipeline that evaluates every connected Node against each Instance's requirements, selecting the optimal Node-runtime combination based on priority, labels, resource capacity, and platform compatibility.

This page documents the scheduling algorithm, the filtering stages, resource accounting, rebalancing triggers, and how the system handles placement failures. Understanding this mechanism is important for OEMs because it determines how workloads are distributed across Nodes in a Unit, and how Node configuration (labels, priorities, resource limits) influences placement decisions.

Scheduling Architecture

Scheduling is a CM-level responsibility. The SM Launcher on each Node only executes start/stop commands — it does not make placement decisions. The CM Launcher coordinates the full scheduling lifecycle:

┌────────────────────────────────────────────────────────────────┐
│                          CM Launcher                           │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │  Instance    │  │    Node      │  │      Balancer        │  │
│  │  Manager     │  │   Manager    │  │                      │  │
│  │              │  │              │  │  • Priority sort     │  │
│  │  • Tracks    │  │  • Tracks    │  │  • Policy balancing  │  │
│  │    active    │  │    connected │  │  • Node filtering    │  │
│  │    instances │  │    nodes     │  │  • Runtime selection │  │
│  │  • Manages   │  │  • Resource  │  │  • Best-fit pick     │  │
│  │    lifecycle │  │    tracking  │  │                      │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
│                                                                │
│            ┌──────────────────────────────────────┐            │
│            │         Network Manager              │            │
│            │  • DNS, connectivity, exposed ports  │            │
│            └──────────────────────────────────────┘            │
└────────────────────────────────────────────────────────────────┘
         │                              │
         ▼                              ▼
   ┌───────────┐                 ┌───────────┐
   │  Node A   │                 │  Node B   │
   │  (SM)     │                 │  (SM)     │
   └───────────┘                 └───────────┘

Key Components

Component	Role
Launcher	Entry point — receives `RunInstances` requests from the Update Manager and coordinates the scheduling process
Balancer	Core scheduling logic — implements the multi-phase algorithm that assigns Instances to Nodes
Instance Manager	Tracks all active and scheduled Instances, manages their lifecycle states
Node Manager	Maintains the set of connected Nodes, their capabilities, and available resources
Network Manager	Configures networking (DNS, provider networks) for scheduled service Instances

Scheduling Input

The Launcher receives a `RunInstances` call from the Update Manager containing an array of `RunInstanceRequest` entries. Each request specifies:

Field	Description
Item ID	Identifies the Deployable Item (service image or component)
Item Type	Whether this is a service or component Instance
Version	Target version of the Deployable Item
Owner ID	The owner (provider) of the service
Subject Info	Identity subject under which the Instance runs
Priority	Scheduling priority — higher values are scheduled first
Num Instances	How many replicas of this Instance to schedule
Labels	Placement constraint labels that must match Node labels

The Launcher expands each request into individual Instance objects (one per replica) and passes them to the Balancer for placement.
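
The expansion step can be pictured with a minimal Python sketch. The class and field names below are illustrative assumptions modeled on the table above, not the actual CM data types, and the per-replica `index` field is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunInstanceRequest:
    # Hypothetical field names mirroring the request table; not the real API.
    item_id: str
    item_type: str
    version: str
    owner_id: str
    subject_id: str
    priority: int
    num_instances: int
    labels: tuple = ()


@dataclass(frozen=True)
class Instance:
    item_id: str
    subject_id: str
    index: int        # replica index within the request (assumed)
    priority: int
    labels: tuple


def expand_requests(requests):
    """Create one Instance per requested replica, preserving request order."""
    return [
        Instance(r.item_id, r.subject_id, i, r.priority, r.labels)
        for r in requests
        for i in range(r.num_instances)
    ]


requests = [
    RunInstanceRequest("svc-a", "service", "1.0.0", "owner-1", "subj-1", 100, 2, ("zone=edge",)),
    RunInstanceRequest("svc-b", "service", "2.1.0", "owner-1", "subj-1", 50, 1),
]
instances = expand_requests(requests)
print([(inst.item_id, inst.index) for inst in instances])
```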

Multi-Phase Scheduling Algorithm

The Balancer executes scheduling in five sequential phases:

Phase 1: Instance Prioritization

Before placement begins, Instances are sorted by priority (descending). Higher-priority Instances are scheduled first, ensuring they get first pick of available resources. When priorities are equal, Instances are ordered by Item ID for deterministic behavior.

This ordering applies uniformly to both service Instances (container processes) and component Instances (firmware, rootfs).
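
As a rough illustration of this ordering, the sketch below sorts by descending priority and breaks ties by Item ID. The dictionary keys are assumptions chosen for the example, and ascending Item ID order for ties is also assumed, since the text only states that the tie-break is deterministic.

```python
def sort_for_scheduling(instances):
    """Highest priority first; equal priorities ordered by item ID (ascending, assumed)."""
    return sorted(instances, key=lambda inst: (-inst["priority"], inst["item_id"]))


pending = [
    {"item_id": "svc-logger", "priority": 10},
    {"item_id": "svc-brake", "priority": 100},
    {"item_id": "svc-audio", "priority": 10},
]
ordered = sort_for_scheduling(pending)
print([inst["item_id"] for inst in ordered])
```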

Phase 2: Policy-Based Rebalancing

During a rebalance operation (not initial placement), the Balancer first handles Instances with a disabled balancing policy. These Instances are pinned to their current Node — they are re-scheduled to the same Node and runtime they were previously running on, bypassing the normal filtering pipeline.

Balancing Policy	Behavior
Enabled (default)	Instance participates in normal scheduling and can be moved between Nodes during rebalancing
Disabled	Instance is pinned to its current Node — rebalancing will not move it

The balancing policy is specified in the Deployable Item's configuration (the OCI item config).
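
One way to picture the pinning step is as a partition of the Instance list before the filtering pipeline runs. This is a sketch under assumptions: the keys `balancing` and `node_id` are hypothetical, and Instances without a previous Node are treated as movable because they have nothing to be pinned to.

```python
def split_for_rebalance(instances, rebalancing):
    """Separate Instances pinned to their current Node from those that go through filtering."""
    pinned, movable = [], []
    for inst in instances:
        is_pinned = (
            rebalancing
            and inst["balancing"] == "disabled"
            and inst.get("node_id") is not None
        )
        (pinned if is_pinned else movable).append(inst)
    return pinned, movable


instances = [
    {"id": "svc-a:0", "balancing": "enabled", "node_id": "node-a"},
    {"id": "svc-b:0", "balancing": "disabled", "node_id": "node-b"},
    {"id": "svc-c:0", "balancing": "disabled", "node_id": None},
]
pinned, movable = split_for_rebalance(instances, rebalancing=True)
print([i["id"] for i in pinned], [i["id"] for i in movable])
```

With `rebalancing=False` (initial placement) every Instance lands in the movable list, matching the statement that pinning only applies during a rebalance.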

Phase 3: Node Filtering Pipeline

For each Instance that needs placement, the Balancer runs a multi-stage filtering pipeline. The pipeline starts with all connected Nodes and progressively eliminates those that cannot host the Instance.

Stage 1: Node ID Filter

If the Instance has a previous Node assignment (from a prior scheduling round), the system checks whether that Node is still valid. This stage also handles explicit Node ID constraints if specified.

Stage 2: Label Matching

Nodes are filtered based on the Instance's required labels. A Node must possess all labels specified in the Instance's label set to be eligible. Labels are string key-value pairs configured on each Node.

Example: If an Instance requires labels ["gpu=true", "zone=edge"], only Nodes that have both of these labels pass this filter.
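
The "must possess all labels" rule is a subset check. A minimal sketch, assuming labels are compared as plain strings and Nodes are represented as a mapping from Node ID to label set (both representations are assumptions for illustration):

```python
def has_all_labels(node_labels, required_labels):
    """A Node is eligible only if every required label is present on it."""
    return set(required_labels).issubset(node_labels)


def filter_by_labels(nodes, required_labels):
    """Return the IDs of Nodes that carry all required labels."""
    return [
        node_id
        for node_id, labels in nodes.items()
        if has_all_labels(labels, required_labels)
    ]


nodes = {
    "node-a": {"gpu=true", "zone=edge"},
    "node-b": {"zone=edge"},
    "node-c": {"gpu=true", "zone=edge", "ssd=true"},
}
eligible = filter_by_labels(nodes, ["gpu=true", "zone=edge"])
print(eligible)
```

An Instance with an empty label set passes every Node, since the empty set is a subset of any label set.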

Stage 3: Shared Resource Availability

Nodes are filtered based on required shared resources (hardware devices, partitions, or other named resources). The Instance's item configuration declares which named resources it needs, and only Nodes that advertise those resources remain eligible.

Stage 4: Runtime Selection

After Node-level filtering, the Balancer selects the best runtime on the remaining Nodes. Each Node can have multiple runtimes (e.g., a container runtime and a boot runtime). The runtime selection applies its own filtering stages:

Runtime Filter	Criteria
Runtime type	The Instance's required runtime type must match (e.g., `crun`, `boot`, `rootfs`)
Platform	The runtime's architecture and OS must match the Instance's image platform
CPU capacity	The runtime must have sufficient available CPU (DMIPS) for the Instance's requirements
RAM capacity	The runtime must have sufficient available RAM for the Instance's requirements
Instance count	The runtime must not have reached its maximum Instance capacity
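
The runtime filters in the table can be read as a chain of checks where the first failing check gives the rejection reason. The sketch below is illustrative only: the `Runtime` and `Requirements` types, their field names, and the check order are assumptions, and the reason strings simply echo the failure reasons listed later on this page.

```python
from dataclasses import dataclass


@dataclass
class Runtime:
    runtime_type: str      # e.g. "crun", "boot", "rootfs"
    arch: str
    os: str
    available_cpu: int     # DMIPS
    available_ram: int     # bytes
    instance_count: int
    max_instances: int


@dataclass
class Requirements:
    runtime_type: str
    arch: str
    os: str
    cpu: int               # DMIPS
    ram: int               # bytes


def reject_reason(rt, req):
    """Return why `rt` cannot host `req`, or None if it passes every filter."""
    if rt.runtime_type != req.runtime_type:
        return "no matching runtime type"
    if (rt.arch, rt.os) != (req.arch, req.os):
        return "no matching platform"
    if rt.available_cpu < req.cpu:
        return "insufficient CPU"
    if rt.available_ram < req.ram:
        return "insufficient RAM"
    if rt.instance_count >= rt.max_instances:
        return "instance limit reached"
    return None


req = Requirements("crun", "arm64", "linux", cpu=500, ram=64 * 1024 * 1024)
ok = Runtime("crun", "arm64", "linux", 2000, 512 * 1024 * 1024, 3, 10)
full = Runtime("crun", "arm64", "linux", 2000, 512 * 1024 * 1024, 10, 10)
print(reject_reason(ok, req), reject_reason(full, req))
```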

Stage 5: Priority Selection and Best Fit

From the remaining Node-runtime candidates:

1. Top priority filter: only Nodes with the highest priority value are retained.
2. Best-fit selection: among equal-priority Nodes, the system selects the Node with the most available resources (CPU first, then RAM as tiebreaker, then Node ID for determinism).

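This "most available resources" strategy spreads workloads to avoid hotspots by preferring the Node with the most headroom. Despite the "best fit" name, it is the opposite of classic bin-packing best fit, which would pick the tightest match.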
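
The two-step selection can be sketched as follows. The `Candidate` type is hypothetical, and choosing the lowest Node ID as the final tie-break is an assumption, since the text only says the Node ID makes the result deterministic.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    node_id: str
    priority: int
    available_cpu: int   # DMIPS
    available_ram: int   # bytes


def pick_candidate(candidates):
    """Keep only top-priority Nodes, then prefer most CPU, then most RAM, then lowest ID."""
    if not candidates:
        return None
    top = max(c.priority for c in candidates)
    finalists = [c for c in candidates if c.priority == top]
    return min(finalists, key=lambda c: (-c.available_cpu, -c.available_ram, c.node_id))


candidates = [
    Candidate("node-a", 10, 1000, 512),
    Candidate("node-b", 10, 1000, 1024),
    Candidate("node-c", 5, 9000, 9000),
]
chosen = pick_candidate(candidates)
print(chosen.node_id)
```

Here `node-c` has the most headroom but is dropped by the priority filter, and `node-b` wins over `node-a` on RAM because their CPU headroom is equal.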

Phase 4: Network Configuration

After all Instances are scheduled, the Balancer updates network configurations:

1. Removes network parameters for Instances that are no longer scheduled
2. Sets up provider networks for newly scheduled service Instances on each Node
3. Configures network parameters, first for Instances with exposed ports, then for all others
4. Restarts the DNS server to reflect the new Instance topology

Phase 5: Instance Submission

The scheduling process concludes by:

1. Submitting scheduled Instances to the Instance Manager (transitioning them from "scheduled" to "active")
2. Sending the updated Instance lists to each Node's Service Manager via the SM Controller

The SM on each Node then executes the actual start/stop operations.

Resource Accounting

The Balancer tracks resource availability per Node and per runtime to make informed placement decisions.

Resource Types

Resource	Unit	Applies To	Description
CPU	DMIPS	Service Instances	Processing capacity available on the Node/runtime
RAM	Bytes	Service Instances	Memory available on the Node/runtime
Storage	Bytes	Service Instances	Persistent storage partition space
State	Bytes	Service Instances	State partition space
Shared Resources	Named	Both	Hardware devices, partitions, or other named resources

How Resource Requirements Are Determined

For service Instances, resource requirements come from two sources:

- Requested resources: explicit CPU and RAM values declared in the Deployable Item's configuration
- Node configuration ratios: if no explicit request is specified, the system calculates requirements using configurable ratios from the Node configuration (a percentage of total Node capacity)

The system also considers actual monitoring data during rebalancing — if a Node is under heavy load, the Balancer uses real consumption data rather than declared requirements to make better placement decisions.
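
A hedged sketch of how these sources might combine is shown below. The function names, the integer percentage representation of the ratio, and the exact rule for switching to monitored data are all assumptions made for illustration; the text does not specify them.

```python
def declared_requirement(requested, total_capacity, ratio_percent):
    """Use the explicit request if present, else a percentage of total Node capacity."""
    if requested is not None:
        return requested
    return total_capacity * ratio_percent // 100


def effective_requirement(declared, monitored_usage, node_overloaded):
    """During rebalancing on a loaded Node, prefer measured consumption when available."""
    if node_overloaded and monitored_usage is not None:
        return monitored_usage
    return declared


# Explicit request of 800 DMIPS wins over the ratio.
explicit = declared_requirement(800, total_capacity=10000, ratio_percent=5)
# No request: fall back to 5% of 10000 DMIPS.
fallback = declared_requirement(None, total_capacity=10000, ratio_percent=5)
# Under load, measured consumption replaces the declared value.
under_load = effective_requirement(fallback, monitored_usage=1300, node_overloaded=True)
print(explicit, fallback, under_load)
```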

Available Resource Calculation

Each Node tracks:

- Total capacity: the Node's declared CPU (DMIPS) and RAM
- System usage: CPU and RAM consumed by non-AosEdge processes (derived from monitoring data minus Instance consumption)
- Instance reservations: resources reserved by already-scheduled Instances

Available resources = Total capacity − System usage − Instance reservations

Per-runtime tracking ensures that Instances are only placed on runtimes that have sufficient capacity, even when a Node has multiple runtimes sharing the same hardware.
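
The calculation can be worked through with concrete numbers. In this sketch the function names are illustrative, and clamping system usage at zero is an assumption added to guard against measurement jitter; it is not stated in the text.

```python
def system_usage(monitored_total, instance_usage):
    """Usage by non-AosEdge processes: total monitored usage minus Instance consumption."""
    return max(0, monitored_total - sum(instance_usage))


def available(total_capacity, system, reservations):
    """Available resources = Total capacity - System usage - Instance reservations."""
    return total_capacity - system - sum(reservations)


# CPU example in DMIPS for a single Node.
total_cpu = 10000
sys_cpu = system_usage(monitored_total=4000, instance_usage=[1000, 1500])
free_cpu = available(total_cpu, sys_cpu, reservations=[1200, 2000])
print(sys_cpu, free_cpu)
```

With these inputs the system usage is 4000 − 2500 = 1500 DMIPS, and the available CPU is 10000 − 1500 − 3200 = 5300 DMIPS.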

Rebalancing

Rebalancing redistributes Instances across Nodes when resource usage becomes uneven. It is triggered by the monitoring system when resource thresholds are exceeded.

Threshold Configuration

Each Node can have resource usage thresholds configured:

Threshold	Purpose
Max threshold	Upper limit — when usage exceeds this percentage for longer than the minimum timeout, rebalancing is triggered
Min threshold	Lower limit — rebalancing is considered complete when usage drops below this percentage for longer than the minimum timeout
Min timeout	Duration that usage must continuously exceed/fall below a threshold before action is taken

Thresholds are configured as percentages of total Node capacity. System-wide defaults apply unless overridden by per-Node configuration.

Rebalancing Flow

1. Monitoring detects that a Node's resource usage has exceeded the max threshold for the configured min timeout
2. The system triggers a Rebalance call on the Launcher
3. The Balancer runs the full scheduling algorithm with `rebalancing=true`:
   - Instances with disabled balancing policy are pinned to their current Nodes
   - All other Instances are re-evaluated for optimal placement
   - Monitoring data is used for resource calculations (actual usage, not just declared requirements)
4. Instances that should move are stopped on their current Node and started on the new target Node
5. Rebalancing is considered complete when usage on all Nodes has stayed below the min threshold for the min timeout

Preventing Rebalancing Oscillation

The dual-threshold design (max/min) with minimum timeouts prevents oscillation:

- A brief CPU spike does not trigger rebalancing (must exceed max threshold for the full timeout duration)
- Once triggered, rebalancing continues until usage stabilizes below the min threshold (not just below max)
- Short-term drops below min threshold do not prematurely end rebalancing (must remain below for the full timeout)
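
The dual-threshold behavior is a form of hysteresis with a hold time. The sketch below models it for a single Node; the class name, the sample-driven `update` interface, and the use of "at least the timeout" as the comparison are assumptions for illustration rather than the monitoring system's actual design.

```python
class RebalanceTrigger:
    """Tracks whether rebalancing is active for one Node, using max/min thresholds."""

    def __init__(self, max_threshold, min_threshold, min_timeout):
        self.max_threshold = max_threshold   # percent
        self.min_threshold = min_threshold   # percent
        self.min_timeout = min_timeout       # seconds
        self.active = False                  # True while rebalancing is in progress
        self._since = None                   # when the current threshold crossing began

    def update(self, usage, now):
        """Feed one usage sample (percent) taken at time `now` (seconds)."""
        if self.active:
            crossed = usage < self.min_threshold
        else:
            crossed = usage > self.max_threshold
        if not crossed:
            self._since = None               # crossing interrupted: restart the clock
        elif self._since is None:
            self._since = now                # crossing just started
        elif now - self._since >= self.min_timeout:
            self.active = not self.active    # held long enough: flip state
            self._since = None
        return self.active


trigger = RebalanceTrigger(max_threshold=80, min_threshold=60, min_timeout=10)
samples = [(0, 90), (5, 70), (6, 85), (16, 85), (20, 70), (21, 50), (25, 65), (26, 50), (36, 55)]
states = [trigger.update(usage, now) for now, usage in samples]
print(states)
```

In this trace the spike at t=0 is ignored because usage drops before the timeout, rebalancing starts at t=16, stays active at 70% (below max but not below min), and ends only at t=36 after usage has remained under 60% for the full timeout.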

Placement Failures

When the Balancer cannot find a suitable Node for an Instance, it records an error on that Instance and continues scheduling the remaining Instances. Placement can fail for several reasons:

Failure Reason	Description
No connected Nodes	No Nodes are currently connected to the CM
No matching Node ID	The Instance requires a specific Node that is not available
No matching labels	No connected Node has all required labels
No matching resources	No Node has the required shared resources
No matching runtime type	No Node has a runtime of the required type
No matching platform	No runtime matches the Instance's architecture/OS
Insufficient CPU	All candidate runtimes lack sufficient CPU capacity
Insufficient RAM	All candidate runtimes lack sufficient RAM capacity
Instance limit reached	All candidate runtimes have reached their maximum Instance count

Failed Instances are reported to AosCloud with an error status, enabling operators to diagnose and resolve placement issues (e.g., by adding capacity, adjusting labels, or reducing Instance requirements).
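
The "record the error and continue" behavior can be sketched as a loop that isolates each placement attempt. `PlacementError`, `schedule_all`, and the toy `place` callback are hypothetical names used only for this example.

```python
class PlacementError(Exception):
    """Raised by a placement function when no Node-runtime candidate remains."""


def schedule_all(instance_ids, place):
    """Try to place every Instance; a failure is recorded and does not stop the loop."""
    assignments, errors = {}, {}
    for instance_id in instance_ids:
        try:
            assignments[instance_id] = place(instance_id)
        except PlacementError as err:
            errors[instance_id] = str(err)
    return assignments, errors


def toy_place(instance_id):
    # Stand-in for the filtering pipeline: one Instance has labels no Node satisfies.
    if instance_id == "svc-gpu:0":
        raise PlacementError("no matching labels")
    return "node-a"


assignments, errors = schedule_all(["svc-a:0", "svc-gpu:0", "svc-b:0"], toy_place)
print(assignments, errors)
```

The Instance after the failing one is still placed, which is the property the section describes.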

Instance Types and Scheduling Differences

The scheduling algorithm handles two Instance types with slightly different resource semantics:

Aspect	Service Instance	Component Instance
Runtime	Container (crun/runc)	Boot or Rootfs
CPU/RAM filtering	Uses requested resources or Node config ratios	Does not consume CPU/RAM quotas (system-level)
Shared resources	Optional (network devices, GPUs)	Required (boot partitions, rootfs devices)
Max instances per runtime	Configurable (typically many)	Typically 1 per runtime
Network configuration	Full (DNS, exposed ports, provider network)	None
Balancing policy	Configurable (enabled/disabled)	Typically disabled

Scheduling Sequence

The following sequence shows the end-to-end flow from desired-state reception through Instance placement:

Cloud → CM Update Manager → CM Launcher.RunInstances()
                                    │
                                    ▼
                            Balancer.RunInstances()
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
            PrepareForBalancing  PolicyBalancing  NodeBalancing
            (update monitoring,  (pin disabled-  (filter pipeline
             reset state)        rebalance       for each Instance)
                                 instances)
                                                    │
                                                    ▼
                                            UpdateNetwork
                                            (DNS, providers)
                                                    │
                                                    ▼
                                        SubmitScheduledInstances
                                                    │
                                                    ▼
                                    NodeManager.SendScheduledInstances()
                                                    │
                              ┌─────────────────────┼──────────────────┐
                              ▼                     ▼                  ▼
                        SM Controller         SM Controller       SM Controller
                        → Node A              → Node B            → Node C
                        (start/stop)          (start/stop)        (start/stop)


Related Pages

- Service Lifecycle: overview of the complete service lifecycle
- Desired State Model: how the desired state arrives and triggers reconciliation
- SM Launcher: the Node-level Launcher that executes start/stop commands after placement
- Runtime Types: detailed comparison of container, boot, and rootfs runtimes
- Service Instance States: the Instance state machine after placement
- Image Deployment Pipeline: how images are downloaded and prepared before Instances can be scheduled
- Key Concepts: terminology including Node, Unit, Deployable Item
