Campaign Management component
This component provides a flexible infrastructure for installing and updating services on units. Because several update scenarios are possible, the infrastructure should cover as many of them as possible. Note that service instances are linked to claims, so an update targets claims as well (i.e. an update notifies the particular claims that should install a new version).
Update scenarios
The following update scenarios were identified:
Critical update
This scenario covers the situation where an urgent security update must be installed for all claims as soon as possible. Online units receive the new version first; after that, whenever a unit with any claim comes online, the new version of the service is installed automatically.
The main goal of the critical update is to roll out the update across the unit fleet as soon as possible. Two circumstances affect this:
- Internet connection speed on units. Units may be downloading many other updates, so critical updates should have the highest priority. Current 3G/4G/WiFi speeds already allow a lot of data to be transferred in a short time, and connection speeds are expected to increase.
- A deep server-side queue of updates and other commands waiting to be processed for units.
So, the critical update may be implemented either as a regular update with queue reordering on the server side or as a high-priority queue on the server side. All other details should be the same as for a regular update.
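The high-priority queue option can be illustrated with a minimal Python sketch. The names `CommandQueue`, `QueuedCommand`, `CRITICAL` and `REGULAR` are hypothetical and not part of the existing backend; the sketch only shows that a critical update overtakes commands queued earlier while regular commands keep their FIFO order.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority levels: the lower number is served first.
CRITICAL, REGULAR = 0, 1


@dataclass(order=True)
class QueuedCommand:
    priority: int
    sequence: int                     # tie-breaker, keeps FIFO order within a priority
    name: str = field(compare=False)  # payload, ignored when ordering


class CommandQueue:
    """Server-side queue of pending commands for units (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, name, priority=REGULAR):
        heapq.heappush(self._heap, QueuedCommand(priority, next(self._counter), name))

    def pop(self):
        return heapq.heappop(self._heap).name

    def __len__(self):
        return len(self._heap)


queue = CommandQueue()
queue.push("update navigation to v2")
queue.push("update telemetry to v5")
queue.push("security fix for gateway", priority=CRITICAL)

# The critical update is processed first although it was queued last.
order = [queue.pop() for _ in range(len(queue))]
```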
Regular update
This scenario provides an automatic update of services for claims. Updating a service for all claims at once risks making the service unavailable for all claims for any reason (misconfiguration, bugs, etc.). To handle this, the canary release approach is suggested:
- Install the new version on 20% of all claims that use the service, starting with online claims and continuing with offline claims as they connect
- Wait for a predefined period (24 hours)
- Install the new version for all claims as they come online
If something goes wrong, it is suggested to run the rollback scenario.
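The first canary step can be sketched as follows. This is only an illustration under assumptions: the `Claim` class and `pick_canary_claims` function are hypothetical, the group size is rounded up, and offline claims that end up in the selection would receive the update only once they connect.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str
    online: bool


def pick_canary_claims(claims, percent=20, rng=None):
    """Pick the canary share of claims, preferring claims that are online."""
    rng = rng or random.Random()
    # Assumption: the canary size is rounded up to a whole claim.
    target = math.ceil(len(claims) * percent / 100)
    online = [c for c in claims if c.online]
    offline = [c for c in claims if not c.online]
    rng.shuffle(online)
    rng.shuffle(offline)
    # Online claims fill the group first; offline claims fill the remainder.
    return (online + offline)[:target]


claims = [Claim(f"claim-{i}", online=(i < 3)) for i in range(10)]
canary = pick_canary_claims(claims, rng=random.Random(0))
```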
Semi-manual update
It is suggested to introduce a tagging concept to associate a claim with a service. One tag per claim/service pair is allowed. Examples of tags: "DHL Mercedes Drivers", "Insurance Private Cars", etc. The update is propagated to each tagged group manually, i.e. a responsible person selects a tagged group and clicks 'Update'. The update is propagated to all online claims of the group immediately and to offline claims as soon as they come online.
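A minimal sketch of the "one tag per claim/service" rule, assuming the tag is keyed by the (claim, service) pair. `TagRegistry` and its methods are hypothetical names used only for illustration.

```python
class TagRegistry:
    """Keeps at most one tag per (claim, service) pair (illustrative only)."""

    def __init__(self):
        self._tags = {}  # (claim_id, service_id) -> tag

    def set_tag(self, claim_id, service_id, tag):
        # Setting a tag again replaces the previous one, so a pair never has two tags.
        self._tags[(claim_id, service_id)] = tag

    def claims_in_group(self, service_id, tag):
        """Return the claims that form the tagged group for a service."""
        return sorted(
            claim
            for (claim, service), value in self._tags.items()
            if service == service_id and value == tag
        )


registry = TagRegistry()
registry.set_tag("claim-1", "navigation", "DHL Mercedes Drivers")
registry.set_tag("claim-2", "navigation", "Insurance Private Cars")
registry.set_tag("claim-3", "navigation", "DHL Mercedes Drivers")
registry.set_tag("claim-2", "navigation", "DHL Mercedes Drivers")  # replaces the old tag

group = registry.claims_in_group("navigation", "DHL Mercedes Drivers")
```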
Manual update
Similar to the semi-manual update. The main difference is that the update group is created on the fly, without being limited to tags, etc.
Rollback
It is suggested not to provide a dedicated rollback scenario. Instead, the Critical Update scenario should be used.
Install new service to a unit
Installing a new service is the same as an 'Update to Version 1', and any of the scenarios above may be applied. The selection of units SHOULD be done using the Aos API.
Remove a service from a unit
Technically, this just removes the association between the service and the claim. Associations are removed using the Aos API.
Infrastructure for campaign management
Presentation layer
It is suggested to provide the following functionality through both the UI and the API:
- Add a 'Critical Update' button to a service page that will trigger a critical update to the latest version
- Add a 'Regular Update' button to a service page that will trigger a regular update to the latest version
- Add an 'Update Tagged Claims' button to a service page that allows selecting a tag and a service version and triggers a semi-manual update
- Add an 'Update Custom Claims' button to a service page that allows selecting a custom group and a service version and triggers a manual update
- Create a 'Claims' page with a list of all claims and a 'Claim Detail' page with claim details
- Extend the 'Claim Detail' page with a 'Tag' editor that links a service and a tag to a claim
- Create a 'Tagged Grid' page that allows reviewing, filtering and sorting tagged claims, and adding a claim to or removing it from a tagged group
- Create a 'Custom Update Service Grid' page that allows reviewing, filtering and sorting claims, and adding a claim to or removing it from a custom group
- Create a 'Service Statistic' page that displays aggregated information: the number of units per installed service version (e.g. 42 units use v.2, 15 units use v.1), the claims that are waiting for installation, and the status (success, error, etc.)
Data layer
The Operation database from the Storage component should keep all the needed information. The ServiceUpdateGroup table links a claim to a service version. It is the main table responsible for updates: all types of updates operate on this table and on DesiredServiceInstance. The Tags table stores the links between tags, services and units. The major business rules are:
- On triggering a Critical Update
  - A new row is created in the ServiceUpdateGroup table with UpdateEnabled set to true and UpdateType → Critical
  - All DesiredServiceInstances for the given service (i.e. all units with the service installed) are moved to the created update group
  - All online claims are notified about the new version
  - Offline claims install the new version according to the general flow once they come online
- Regular Update
  - A new row is created in the ServiceUpdateGroup table with UpdateEnabled set to true, UpdateType → Regular, MaxUnitsInGroup → 20% of all claims (canary group)
  - A new row is created in the ServiceUpdateGroup table with UpdateEnabled set to false, UpdateType → Regular (regular group)
  - All online claims are moved to the canary group and receive the update
  - Random online units are included in the canary group until MaxUnitsInGroup is reached. This is the only scenario that uses the MaxUnitsInGroup field.
  - A task triggered by a timer moves all units that still have an outdated version to the regular group and deactivates the canary group
- Semi-Manual Update
  - A new row is created in the ServiceUpdateGroup table with UpdateEnabled set to true and UpdateType → SemiManual
  - All tagged claims are moved to this group
  - All online tagged claims are notified about the new version
  - Offline tagged claims install the new version according to the general flow once they come online
- Manual Update
  - A new row is created in the ServiceUpdateGroup table with UpdateEnabled set to false and UpdateType → Manual
  - The group is filled with claims from the UI grid
  - Once the update is triggered, UpdateEnabled is set to true
  - All online claims in the group are notified about the new version
  - Offline claims in the group install the new version according to the general flow once they come online
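The Regular Update rules above can be sketched in Python as follows. The field names mirror UpdateEnabled, UpdateType and MaxUnitsInGroup from the text; everything else (the dataclass layout, the function names, rounding the 20% down, and enabling the regular group when the canary group is deactivated) is an assumption made for illustration, not the actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class UpdateType(Enum):
    CRITICAL = "Critical"
    REGULAR = "Regular"
    SEMI_MANUAL = "SemiManual"
    MANUAL = "Manual"


@dataclass
class ServiceUpdateGroup:
    service_version: str
    update_type: UpdateType
    update_enabled: bool
    max_units_in_group: Optional[int] = None  # used by the canary group only
    claim_ids: set = field(default_factory=set)


def create_regular_update(service_version, total_claims):
    """Create the enabled canary group and the still-disabled regular group."""
    canary = ServiceUpdateGroup(
        service_version, UpdateType.REGULAR, update_enabled=True,
        max_units_in_group=total_claims * 20 // 100,  # assumption: rounded down
    )
    regular = ServiceUpdateGroup(service_version, UpdateType.REGULAR, update_enabled=False)
    return canary, regular


def add_to_canary(canary, claim_id):
    """Add a claim to the canary group unless MaxUnitsInGroup is reached."""
    if len(canary.claim_ids) >= canary.max_units_in_group:
        return False
    canary.claim_ids.add(claim_id)
    return True


def promote(canary, regular, outdated_claim_ids):
    """Timer task: move outdated claims to the regular group, deactivate the canary."""
    regular.claim_ids.update(outdated_claim_ids)
    regular.update_enabled = True  # assumption: the regular group is enabled at this point
    canary.update_enabled = False


canary, regular = create_regular_update("2.0", total_claims=10)
results = [add_to_canary(canary, claim) for claim in ("c1", "c2", "c3")]
promote(canary, regular, {"c3", "c4"})
```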
Security considerations
It was agreed that the most relevant attacks are:
- Fast Forward Attack
- Rollback Attack
- Wrong Software Installation
All these attacks require control of the backend infrastructure. However, an intruder with such control would be able to install any service on any unit anyway. So it is suggested to follow the already defined security rules and not to invent new ones.
Components
Service blacklist
Each OEM SHOULD have a service blacklist. This component lets the OEM exclude services from installation or update for any reason (misconfiguration, support issues, etc.).
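As a minimal sketch, assuming the blacklist is a set of service identifiers kept per OEM, the check before an install or update could look like this. The function name and the service names are hypothetical.

```python
def installable_services(requested, blacklist):
    """Drop blacklisted services from an install/update request, keeping the order."""
    return [service for service in requested if service not in blacklist]


# Hypothetical OEM blacklist and request.
oem_blacklist = {"legacy-diagnostics"}
allowed = installable_services(
    ["navigation", "legacy-diagnostics", "telemetry"], oem_blacklist
)
```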
Campaign template
Most campaigns will share the same scenarios and parameters, so a campaign template should be used. A campaign template belongs to a service provider. Only the service provider can create, modify or delete the campaign template.
Campaign template SHOULD contain:
- the owner (service provider)
- unique identifier (UUID)
- default mark (only one template per service provider can have default=true)
- title (unique for the owner)
- disable mark (a campaign template can be deleted only once all of its campaigns have finished, but the service provider might want to disable the template for future campaigns before that)
- number of stages (minimum allowed - 2)
- stage info:
  - stage number
  - percent of total units used for the stage
  - allowed percentage of installation failures
  - allowed percentage of run failures
  - minimum number of days for gathering statistics (report)
  - minimum percentage of units that MUST receive the update before the stage is finished
  - list of claim and unit filters (if null, the system should choose random units and claims)
The unit percentages of all stages MUST add up to 100%. This should be checked when the template is created or updated. Also, when a service provider is created, a default campaign template should be created for it (named 'canary', with 2 stages: 20% for the first stage and 80% for the second).
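The template checks described above can be sketched as follows. The class and function names are hypothetical, and only the fields needed for the validation are modelled; the remaining stage parameters are omitted for brevity.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    number: int
    percent_of_units: int


@dataclass
class CampaignTemplate:
    owner: str                 # the service provider
    title: str
    stages: List[Stage]
    is_default: bool = False
    disabled: bool = False
    template_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def validate_template(template):
    """Return a list of rule violations; an empty list means the template is valid."""
    errors = []
    if len(template.stages) < 2:
        errors.append("at least 2 stages are required")
    if sum(stage.percent_of_units for stage in template.stages) != 100:
        errors.append("stage percentages must add up to 100")
    return errors


def default_canary_template(owner):
    """Default template created together with a service provider."""
    return CampaignTemplate(owner, "canary", [Stage(1, 20), Stage(2, 80)], is_default=True)


canary_template = default_canary_template("provider-a")
broken_template = CampaignTemplate("provider-a", "broken", [Stage(1, 90)])
```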
Campaigns
We have agreed that campaign logic will be applied to every update. For the installation of new services, however, it will be applied only for business users.
Service installation/update API receives:
- service package
- campaign template UUID (optional; if not specified, the service provider's default template is used)
- start date (optional, if not specified - current date should be used)
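The handling of the optional parameters can be sketched like this. `InstallRequest` and `resolve_request` are hypothetical names, and the use of UTC for the default start date is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class InstallRequest:
    service_package: bytes
    template_uuid: Optional[str] = None
    start_date: Optional[datetime] = None


def resolve_request(request, default_template_uuid, now=None):
    """Fill in the optional fields: default template and current date."""
    now = now or datetime.now(timezone.utc)  # assumption: dates are kept in UTC
    return (
        request.template_uuid or default_template_uuid,
        request.start_date or now,
    )


fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
template_uuid, start_date = resolve_request(
    InstallRequest(service_package=b"..."), "default-template-uuid", now=fixed_now
)
```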
Campaign info contains:
- the owner (service provider)
- template reference
- state (new, started, canceled, finished, etc)
- schedule date (by default - starts immediately)
- stages info (number of stages depends on the template):
  - total number of units for install
  - total number of installed/updated units or claims
  - number of installations with install error
  - number of installations with run error
All procedures for filtering units and claims, starting, stopping, canceling and finishing the campaign, sending messages, etc. should be performed by Celery jobs.
Campaign monitoring
Monitoring alert messages are used to collect statistics on service installation and execution errors. AosEdge should avoid counting failures multiple times for the same unit.
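One simple way to avoid counting a failure several times for the same unit is to store the failing unit identifiers in a set per failure kind. The sketch below assumes each alert carries a unit identifier and a failure kind ("install" or "run"); `FailureCounter` is a hypothetical name.

```python
from collections import defaultdict


class FailureCounter:
    """Counts each failing unit once per failure kind (illustrative only)."""

    def __init__(self):
        self._units_by_kind = defaultdict(set)

    def record(self, unit_id, kind):
        # A set ignores repeated alerts from the same unit.
        self._units_by_kind[kind].add(unit_id)

    def count(self, kind):
        return len(self._units_by_kind.get(kind, ()))


counter = FailureCounter()
counter.record("unit-1", "install")
counter.record("unit-1", "install")  # repeated alert, not counted again
counter.record("unit-2", "install")
counter.record("unit-1", "run")
```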
Celery jobs should check these statistics at predefined times and decide, based on the campaign parameters, whether to switch to the next campaign stage, finish the campaign or cancel it.
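The decision made by such a periodic job can be sketched as a pure function, independent of Celery itself. The names below are hypothetical, and the sketch assumes that failure and coverage percentages are calculated against the total number of units in the stage.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    WAIT = "wait"
    NEXT_STAGE = "next_stage"
    FINISH = "finish"
    CANCEL = "cancel"


@dataclass
class StageLimits:            # taken from the campaign template
    max_install_fail_percent: float
    max_run_fail_percent: float
    min_days: int
    min_updated_percent: float


@dataclass
class StageStats:             # collected from monitoring alerts
    total_units: int
    updated_units: int
    install_errors: int
    run_errors: int
    days_elapsed: int


def decide(limits, stats, is_last_stage):
    """Decide what the periodic job should do with the current stage."""
    def percent(part):
        return 100.0 * part / stats.total_units if stats.total_units else 0.0

    # Too many failures: cancel the campaign.
    if (percent(stats.install_errors) > limits.max_install_fail_percent
            or percent(stats.run_errors) > limits.max_run_fail_percent):
        return Decision.CANCEL
    # Not enough time or coverage yet: keep gathering statistics.
    if (stats.days_elapsed < limits.min_days
            or percent(stats.updated_units) < limits.min_updated_percent):
        return Decision.WAIT
    return Decision.FINISH if is_last_stage else Decision.NEXT_STAGE


limits = StageLimits(max_install_fail_percent=5, max_run_fail_percent=5,
                     min_days=1, min_updated_percent=90)
healthy = StageStats(total_units=100, updated_units=95, install_errors=2,
                     run_errors=1, days_elapsed=2)
decision = decide(limits, healthy, is_last_stage=False)
```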
Time-based updates
Like any other update, the FOTA process consists of two stages:
- Receive information about an available update, then download, decrypt and verify the update bundle
- Install the update bundle on the device
The first stage can be done at any time and does not interrupt or interfere with other processes on the device. An update bundle might be very large, so the download might take some time. Download speed limits and the choice of an appropriate internet connection should be described separately and do not belong to the campaign management component.
Installing an update bundle may interrupt system work, because some services, or even the whole system, have to be restarted. Such an interruption is unacceptable in many cases. FOTA/SOTA and service updates should therefore be installable on a schedule (outside working hours). The time-based mechanism makes this possible.
The common flow is: download and prepare an update bundle and then start the update install process on schedule.
The schedules (non-working or allowed hours) are created on AosEdge Cloud by the OEM. Each schedule can be applied to a unit or to a business user (which represents a group of units). The OEM can create and apply different schedules for different campaigns, business users or units. The applied schedule must be sent to the unit with each update if such information is available. The service manager and update manager on the unit must apply the update according to the timetable.
All schedules are set up on AosEdge Cloud in the UTC time zone, and units receive schedule information in UTC as well. The unit must apply the right time zone in one of two ways: by using the unit's local time zone if it is available, or by retrieving this information from the cloud (from geo IP information or from the settings in the cloud unit DB).
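A minimal sketch of how a unit could work with such a schedule, under one possible reading of the text: the allowed window is delivered as UTC start and end times, the check against the current time is done in UTC, and the time zone (local if known, otherwise the one provided by the cloud) is used to express the window in unit-local time. All function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone


def pick_timezone(unit_local_tz, cloud_tz):
    """Prefer the unit's own time zone; fall back to the one provided by the cloud."""
    return unit_local_tz if unit_local_tz is not None else cloud_tz


def install_allowed(now, start_utc, end_utc):
    """Check whether `now` falls into the allowed window given in UTC."""
    moment = now.astimezone(timezone.utc).time()
    if start_utc <= end_utc:
        return start_utc <= moment < end_utc
    # The window crosses midnight, e.g. 22:00 to 04:00.
    return moment >= start_utc or moment < end_utc


def to_local(utc_time, on_date, tz):
    """Express a UTC schedule time in the unit's time zone for a given date."""
    return datetime.combine(on_date, utc_time, tzinfo=timezone.utc).astimezone(tz)


# Hypothetical schedule: installs allowed from 22:00 to 04:00 UTC.
window_start, window_end = time(22, 0), time(4, 0)
unit_tz = pick_timezone(None, timezone(timedelta(hours=2)))  # no local zone, use cloud value

late_evening = datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)
midday = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
local_start = to_local(window_start, date(2024, 1, 1), unit_tz)
```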