Balancing Overview

This document explains the balancing mechanism, its triggers, rules, and the rebalancing process for services and instances within the system.

Key Points

Initial Balancing: Occurs when a unit receives a desired state that differs from its previous state. This typically happens when services or instances are added or removed, triggering the system to rebalance.
Rebalancing Behavior: Services or instances that were moved to other nodes by runtime rebalancing are rescheduled according to the initial balancing rules the next time balancing is triggered.

Balancing applies the following rules:

Instance Policies: Instances with the balancingPolicy set to disabled are excluded from rebalancing.
Instance Sorting: Instances are sorted in descending order by priority. If instances have the same priority, they are further sorted by their service_id in ascending order.
Node Sorting: Nodes are also sorted by priority in descending order. Nodes with the same priority are sorted by their node_id in ascending order.
Node Eligibility: Only nodes in a provisioned state participate in balancing; unprovisioned or paused nodes are excluded.
Filters: Instance runners, labels, and resources are used to filter eligible nodes for balancing.
Node Exclusion: If a service instance has already been rebalanced from Node X to Node Y, Node X is excluded from further balancing considerations for that specific instance.
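
The sorting and filtering rules above can be illustrated with a short Python sketch. It is not the actual implementation: the Instance and Node classes, the "enabled" policy value, the runner names, and the excluded_nodes field are assumptions made for this example. Only the "disabled" balancing policy, priority, service_id, node_id, and the provisioned state are taken from the rules themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    # Hypothetical model of a service instance, for illustration only.
    service_id: str
    priority: int
    balancing_policy: str = "enabled"   # "disabled" excludes the instance
    runner: str = "runc"                # assumed runner name
    labels: frozenset = frozenset()     # labels the node must provide
    resources: frozenset = frozenset()  # named resources the node must provide
    excluded_nodes: set = field(default_factory=set)  # nodes already migrated from


@dataclass
class Node:
    # Hypothetical model of a node, for illustration only.
    node_id: str
    priority: int
    state: str = "provisioned"
    runners: frozenset = frozenset({"runc"})
    labels: frozenset = frozenset()
    resources: frozenset = frozenset()


def sort_instances(instances):
    """Drop instances with balancing disabled, then sort by priority
    (descending) and service_id (ascending)."""
    candidates = [i for i in instances if i.balancing_policy != "disabled"]
    return sorted(candidates, key=lambda i: (-i.priority, i.service_id))


def eligible_nodes(instance, nodes):
    """Keep provisioned nodes that match the instance's runner, labels, and
    resources and that the instance has not been rebalanced away from,
    sorted by priority (descending) and node_id (ascending)."""
    eligible = [
        n for n in nodes
        if n.state == "provisioned"
        and instance.runner in n.runners
        and instance.labels <= n.labels
        and instance.resources <= n.resources
        and n.node_id not in instance.excluded_nodes
    ]
    return sorted(eligible, key=lambda n: (-n.priority, n.node_id))
```

In this sketch an instance is considered in the order returned by sort_instances, and the first entry of eligible_nodes is its most preferred candidate node.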

Balancing decisions are influenced by the following factors:

The priority of the service and its subject.
Service quotas and requested resources.
Current resource consumption of the services.
Node priority.
Available node resources.
Service annotations may include rules that prevent balancing or impose specific conditions for balancing to occur.

The rebalancing process works as follows:

Lower Priority Migration: When resources are limited, services with lower priority are migrated to other nodes to free up resources.
Update Scheduler: If an update is scheduled, balancing will occur according to the defined schedule.
Migration Process: Service instances are migrated by shutting them down and restarting them on another node. The requested resources for the instance are considered during the migration.
Priority-Based Balancing: The system prefers to migrate instances to nodes with lower priority. If no such node is available, it balances onto any available node, including nodes with higher priority. Once a service instance migrates away from a node, that node is excluded as a migration target for that instance until balancing is triggered again, which prevents "ping-pong" behavior.
Resource Allocation: If a device resource required by a new service instance is currently allocated to a lower-priority service instance, the lower-priority instance is migrated.
Instance Distribution: Different instances of the same service can run on different nodes based on the rebalancing parameters.
Resource Thresholds: After balancing, the current resource consumption on each node should fall below the configured thresholds if sufficient resources are available on other nodes.
Resources Considered: Currently, balancing takes into account CPU, memory, and storage. Other resources, such as GPUs, may be added in the future.
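
The target selection described under Priority-Based Balancing can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual scheduler: NodeInfo, pick_target, and the free_cpu and free_ram fields are hypothetical names, "lower priority" is interpreted as lower than the priority of the node the instance currently runs on, and only CPU and memory are checked to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    # Hypothetical view of a node's priority and free resources.
    node_id: str
    priority: int
    free_cpu: float
    free_ram: int


def pick_target(current, nodes, req_cpu, req_ram, excluded):
    """Pick a migration target for an instance running on `current`.

    `excluded` holds node ids the instance has already migrated away from;
    skipping them is what prevents ping-pong migrations.
    Returns None when no node can host the requested resources.
    """
    fits = [
        n for n in nodes
        if n.node_id != current.node_id
        and n.node_id not in excluded
        and n.free_cpu >= req_cpu
        and n.free_ram >= req_ram
    ]
    # Prefer nodes with lower priority than the current one (assumption);
    # fall back to any fitting node, even one with higher priority.
    lower = [n for n in fits if n.priority < current.priority]
    pool = lower or fits
    if not pool:
        return None
    # Within the pool, follow the node sorting rule:
    # priority descending, then node_id ascending.
    return min(pool, key=lambda n: (-n.priority, n.node_id))
```

After a successful migration, the caller would add current.node_id to the instance's excluded set and clear that set when a new balancing run is triggered.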

Service config: see the schema at https://github.com/aosedge/aos_protocols/blob/main/unit-cloud/aos-unit-messages.schema.json for details.
Node config: see NodeConfig in the schema at https://github.com/aosedge/aos_protocols/blob/main/unit-cloud/aos-unit-messages.schema.json for details.