Balancing Overview
This document explains the balancing mechanism, its triggers, rules, and the rebalancing process for services and instances within the system.
Key Points
- Initial Balancing: Occurs when a unit receives a desired state that differs from its previous state. This typically happens when services or instances are added or removed, triggering the system to rebalance.
- Rebalancing Behavior: If runtime rebalancing has moved services or instances to other nodes, they are rescheduled according to the initial balancing rules the next time balancing is triggered.
When is Balancing Triggered?
Balancing is triggered by the following events:
- Changing the number of running service instances.
- Updating the unit configuration.
- Adding or removing computing nodes.
- Resource usage alerts.
Balancing Rules
- Instance Policies: Instances with the balancingPolicy set to disabled are excluded from rebalancing.
- Instance Sorting: Instances are sorted in descending order by priority. If instances have the same priority, they are further sorted by their service_id in ascending order.
- Node Sorting: Nodes are also sorted by priority in descending order. Nodes with the same priority are sorted by their node_id in ascending order.
- Node Eligibility: Only nodes in a provisioned state participate in balancing; unprovisioned or paused nodes are excluded.
- Filters: Instance runners, labels, and resources are used to filter eligible nodes for balancing.
- Node Exclusion: If a service instance has already been rebalanced from Node X to Node Y, Node X is excluded from further balancing considerations for that specific instance.
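The sorting and eligibility rules above can be sketched in a few lines of Python. This is a minimal illustration, not the scheduler's actual code: the Instance and Node classes, their field names (balancing_policy, previous_nodes, state, labels), and both helper functions are hypothetical. Only the label filter is shown; runner and resource filters would follow the same pattern.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    service_id: str
    priority: int
    balancing_policy: str = "enabled"  # hypothetical values: "enabled" / "disabled"
    labels: set = field(default_factory=set)  # labels the instance requires
    previous_nodes: set = field(default_factory=set)  # nodes it was already moved away from


@dataclass
class Node:
    node_id: str
    priority: int
    state: str = "provisioned"  # assumed states: "provisioned", "unprovisioned", "paused"
    labels: set = field(default_factory=set)


def sort_instances(instances):
    """Drop instances excluded by policy, then sort by priority (descending)
    and service_id (ascending)."""
    movable = [i for i in instances if i.balancing_policy != "disabled"]
    return sorted(movable, key=lambda i: (-i.priority, i.service_id))


def candidate_nodes(instance, nodes):
    """Keep provisioned nodes that carry every required label and that the
    instance has not already been moved away from, then sort by priority
    (descending) and node_id (ascending)."""
    eligible = [
        n for n in nodes
        if n.state == "provisioned"
        and instance.labels <= n.labels
        and n.node_id not in instance.previous_nodes
    ]
    return sorted(eligible, key=lambda n: (-n.priority, n.node_id))


instances = [
    Instance("svc-b", priority=5),
    Instance("svc-a", priority=5),
    Instance("svc-c", priority=9),
    Instance("svc-d", priority=9, balancing_policy="disabled"),
]
print([i.service_id for i in sort_instances(instances)])
# prints ['svc-c', 'svc-a', 'svc-b']
```

Negating the priority in the sort key gives descending priority while keeping the identifier tie-break ascending, which matches both sorting rules with a single sort call.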
Factors Affecting Service Balancing Between Nodes
Balancing decisions are influenced by the following factors:
- The priority of the service and its subject.
- Service quotas and requested resources.
- Current resource consumption of the services.
- Node priority.
- Available node resources.
- Service annotations, which may include rules that prevent balancing or impose specific conditions for it to occur.
Migration and Resource Management
- Lower Priority Migration: When resources are limited, services with lower priority are migrated to other nodes to free up resources.
- Update Scheduler: If an update is scheduled, balancing will occur according to the defined schedule.
- Migration Process: Service instances are migrated by shutting them down and restarting them on another node. The requested resources for the instance are considered during the migration.
- Priority-Based Balancing: The system prefers to balance onto nodes with lower priority. If no such node is available, it balances onto any available node, even one with higher priority. Once a service migrates away from a node, that node is temporarily excluded from further migration options, to avoid "ping-pong" behavior, until balancing is triggered again.
- Resource Allocation: If a device resource required by a new service instance is currently allocated to a lower-priority service instance, the lower-priority instance is migrated.
- Instance Distribution: Different instances of the same service can run on different nodes based on the rebalancing parameters.
- Resource Thresholds: After balancing, the current resource consumption on each node should fall below the configured thresholds if sufficient resources are available on other nodes.
- Resources Considered: Currently, balancing takes into account CPU, memory, and storage. Other resources, such as GPUs, may be added in the future.
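The migration rules above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the system's implementation: the class and function names are hypothetical, "lower priority" is interpreted as lower than the priority of the instance's current node, and ties between candidate nodes reuse the node sorting rule (priority descending, node_id ascending).

```python
from dataclasses import dataclass, field

RESOURCES = ("cpu", "memory", "storage")  # resources currently considered


@dataclass
class Node:
    node_id: str
    priority: int
    free: dict  # available amount per resource name


@dataclass
class Instance:
    service_id: str
    priority: int
    current_node: str
    requested: dict  # requested amount per resource name
    left_nodes: set = field(default_factory=set)  # nodes already migrated away from


def fits(instance, node):
    """True if the node can hold the instance's requested resources."""
    return all(node.free.get(r, 0) >= instance.requested.get(r, 0) for r in RESOURCES)


def pick_victim(instances_on_node):
    """Lowest-priority instance on a constrained node (assumed tie-break: service_id)."""
    return min(instances_on_node, key=lambda i: (i.priority, i.service_id), default=None)


def pick_target(instance, nodes_by_id):
    """Prefer nodes with lower priority than the current one; otherwise fall
    back to any node that fits. Nodes the instance already left are skipped
    to avoid ping-pong migrations."""
    current = nodes_by_id[instance.current_node]
    candidates = [
        n for n in nodes_by_id.values()
        if n.node_id != current.node_id
        and n.node_id not in instance.left_nodes
        and fits(instance, n)
    ]
    lower = [n for n in candidates if n.priority < current.priority]
    pool = lower or candidates
    return min(pool, key=lambda n: (-n.priority, n.node_id), default=None)


def migrate(instance, target, nodes_by_id):
    """Model a stop-and-restart migration by moving the requested resources
    from the source node's budget to the target's."""
    source = nodes_by_id[instance.current_node]
    for r in RESOURCES:
        amount = instance.requested.get(r, 0)
        source.free[r] = source.free.get(r, 0) + amount
        target.free[r] = target.free.get(r, 0) - amount
    instance.left_nodes.add(source.node_id)
    instance.current_node = target.node_id
```

In this model, left_nodes would be cleared whenever a new balancing run is triggered, which corresponds to the temporary nature of the exclusion described above.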
JSON Schema
- Service config: visit https://github.com/aosedge/aos_protocols/blob/main/unit-cloud/aos-unit-messages.schema.json for details.
- Node config: see NodeConfig in the schema at https://github.com/aosedge/aos_protocols/blob/main/unit-cloud/aos-unit-messages.schema.json for details.