Service Deployment Failures
Introduction
This page provides practical troubleshooting guidance for service deployment failures — situations where a Deployable Item cannot be successfully downloaded, verified, unpacked, or launched on a Node. These failures manifest as instances stuck in the Activating state or transitioning directly to the Failed state after a desired-state update is received.
Service deployment failures are among the most common operational issues because they involve multiple components (Communication Manager, Service Manager, Downloader, Image Manager, Launcher) and depend on external factors like network connectivity, disk space, and image integrity. This guide walks through each failure category with specific diagnostic steps and resolution actions.
Quick Diagnosis
When a service deployment fails, start by identifying which stage of the Image Deployment Pipeline encountered the error:
| Symptom | Likely Stage | First Check |
|---|---|---|
| Instance stuck in Activating, no download progress | Blob URL resolution | CM logs for GetBlobsInfo errors |
| Download progress starts then stops | Download (Stage 2) | SM logs for HTTP errors, network connectivity |
| Download completes but instance fails immediately | Verification (Stage 3) | SM logs for eInvalidChecksum errors |
| Instance fails after "unpacking" log entries | Layer unpacking (Stage 4) | SM logs for filesystem errors, disk space |
| Instance fails after "PrepareRootFS" log entries | Rootfs assembly (Stage 5) | SM logs for mount errors, missing layers |
| Instance reaches Active then immediately fails | Container start (Stage 5) | Service instance logs for process crash |
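The table above can be turned into a rough first-pass triage helper. The sketch below is a heuristic, not part of the product: the function names `has_marker` and `guess_stage` are hypothetical, and the marker strings are simply the ones quoted in this guide, so the exact log wording on a given release may differ.

```shell
# Heuristic sketch: reads combined CM/SM log text on stdin and prints the
# furthest pipeline stage whose marker string appears in that text.

# has_marker <text>: case-insensitive fixed-string search in the captured log.
has_marker() {
    printf '%s\n' "$gs_log" | grep -qiF -- "$1"
}

guess_stage() {
    gs_log="$(cat)"
    # Later stages are checked first: their markers imply earlier stages passed.
    if has_marker "PrepareRootFS"; then
        echo "rootfs assembly"
    elif has_marker "unpacking"; then
        echo "layer unpacking"
    elif has_marker "eInvalidChecksum"; then
        echo "verification"
    elif has_marker "HTTP error"; then
        echo "download"
    elif has_marker "GetBlobsInfo"; then
        echo "blob URL resolution"
    else
        echo "unknown"
    fi
}
```

For example, `journalctl -u aos-servicemanager --since "10 minutes ago" | guess_stage` prints one stage name. Treat the result as a pointer to the right section of this guide rather than a diagnosis.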
Image Download Failures
Download failures occur when the Downloader module cannot retrieve image blobs from the provided URL. The Downloader uses libcurl with automatic retry (3 attempts, with an exponential backoff delay that starts at 1 s and is bounded at 5 s) and resume support via HTTP range requests.
Symptoms
- Instance remains in Activating state for an extended period
- unitStatus shows no progress on the affected instance
- Download alerts in AosCloud show state eInterrupted
Diagnostic Steps
1. Check Service Manager logs for download errors:
journalctl -u aos-servicemanager --since "10 minutes ago" | grep -i "download\|failed\|error"
Look for messages like:
- "Failed to download" with a retry count: indicates repeated download failures
- "HTTP error" with an HTTP status code: server-side issue
- "failed to open file": local filesystem issue
- "download cancelled": superseded by a new desired state
2. Check network connectivity to the download server:
# Test connectivity to the cloud endpoint
curl -I <blob-url>
If the URL is not directly accessible, check whether the Node has internet connectivity and whether any proxy or firewall rules are blocking outbound HTTPS traffic.
3. Check disk space on the download partition:
df -h /var/aos/sm/
The Space Allocator reserves space before starting a download. If the partition is full, the allocation fails and the download never starts.
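To make the disk space check scriptable, the usage percentage can be parsed from POSIX-format `df -P` output. This is a minimal sketch: the function names are hypothetical, and the 90% default threshold is an arbitrary illustrative value, not the Space Allocator's actual limit.

```shell
# Reads `df -P <path>` output on stdin and prints the use percentage
# of the first listed filesystem as a bare number (e.g. 95).
df_use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds (exit 0) when usage is at or above the threshold given as $1.
# The default of 90 is an assumption chosen only for illustration.
partition_nearly_full() {
    pnf_used="$(df_use_percent)"
    [ "${pnf_used:-0}" -ge "${1:-90}" ]
}
```

Usage: `df -P /var/aos/sm/ | partition_nearly_full 90 && echo "download partition is nearly full"`.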
4. Check for concurrent download conflicts:
journalctl -u aos-servicemanager --since "10 minutes ago" | grep "already in progress"
If the same blob digest is already being downloaded (e.g., a shared layer between two services), the second request waits for the first. This is normal behavior but can appear as a stall if the first download is slow.
Root Causes and Resolutions
| Root Cause | Evidence | Resolution |
|---|---|---|
| Network unreachable | curl connection timeout, CURLE_COULDNT_CONNECT in logs | Restore network connectivity; check DNS resolution and firewall rules |
| HTTP 404 (blob not found) | HTTP_CODE: 404 in SM logs | Blob URL has expired or image was removed from cloud storage; trigger a new desired-state update |
| HTTP 403 (forbidden) | HTTP_CODE: 403 in SM logs | Authentication token expired or access revoked; check cloud credentials |
| HTTP 5xx (server error) | HTTP_CODE: 500/502/503 in SM logs | Cloud storage service issue; retry will occur automatically (3 attempts); wait or escalate to cloud provider |
| Disk space exhausted | Space allocation failure in logs, df shows full partition | Free space by removing outdated images or expanding the partition; the Space Allocator automatically removes outdated items when under pressure |
| Connection timeout | CURLE_OPERATION_TIMEDOUT after 10 seconds | Slow or unstable network; check bandwidth and latency to the download server |
| Download interrupted (partial) | Download starts but fails mid-transfer | The Downloader resumes automatically on retry if the server supports range requests; if all 3 retries fail, check for intermittent network issues |
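The HTTP rows of the table above lend themselves to a small lookup helper. The sketch below assumes the status code appears in SM logs as `HTTP_CODE: <code>`, as quoted in the Evidence column; the function names are hypothetical and the hints only restate the table.

```shell
# Extracts the three-digit status from log lines containing "HTTP_CODE: NNN".
# Prints one code per matching line read from stdin.
extract_http_code() {
    sed -n 's/.*HTTP_CODE: \([0-9][0-9][0-9]\).*/\1/p'
}

# Maps a status code to the resolution listed in the table above.
download_hint() {
    case "$1" in
        404) echo "blob URL expired or image removed; trigger a new desired-state update" ;;
        403) echo "token expired or access revoked; check cloud credentials" ;;
        5??) echo "cloud storage issue; automatic retry applies, then wait or escalate" ;;
        *)   echo "not covered by the table; inspect SM logs" ;;
    esac
}
```

A possible pipeline is `journalctl -u aos-servicemanager --since "10 minutes ago" | extract_http_code | sort -u`, feeding each resulting code to `download_hint`.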
Download Retry Behavior
The Downloader retries failed downloads with exponential backoff:
| Attempt | Delay Before Retry | Behavior |
|---|---|---|
| 1st attempt | — | Initial download attempt |
| 2nd attempt | 1 second | Retry with resume (if server supports range requests) |
| 3rd attempt | 2 seconds | Final retry attempt |
| All failed | — | Error propagated; instance marked as Failed |
If the server supports Accept-Ranges: bytes, the Downloader resumes from where it left off. Otherwise, it restarts the download from the beginning on each retry.
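The retry schedule can be modeled with a few lines of shell. This sketch assumes the delay doubles after each failed attempt and is bounded by a maximum, which is consistent with the 1 s and 2 s delays in the table but is not taken from the Downloader source; `retry_delays` is a hypothetical name.

```shell
# Prints the delay in seconds before each retry, one per line.
# Arguments: total attempts (default 3), initial delay (default 1),
# maximum delay (default 5). Doubling is an assumption of this sketch.
retry_delays() {
    rd_attempts="${1:-3}"
    rd_delay="${2:-1}"
    rd_max="${3:-5}"
    rd_n=1
    while [ "$rd_n" -lt "$rd_attempts" ]; do
        echo "$rd_delay"
        rd_delay=$((rd_delay * 2))
        # Clamp the next delay to the configured maximum.
        if [ "$rd_delay" -gt "$rd_max" ]; then
            rd_delay="$rd_max"
        fi
        rd_n=$((rd_n + 1))
    done
}
```

With the defaults, `retry_delays` prints `1` and `2`, so three failed attempts add roughly three seconds of waiting on top of the transfer time before the instance is marked as Failed.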
Verification Failures
Verification failures occur when a downloaded blob's SHA-256 hash does not match the expected digest declared in the OCI manifest. This indicates data corruption during transfer or a tampered image.
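The same comparison can be reproduced by hand on a blob file with the standard `sha256sum` tool. The sketch below assumes the expected digest is written in the OCI `sha256:<hex>` form; `verify_blob` is a hypothetical helper, and the location of blob files on the Node is not specified here.

```shell
# Compares a file's SHA-256 with an expected digest ("sha256:<hex>" or bare hex).
# Prints "ok" on a match; prints both values and returns 1 on a mismatch.
verify_blob() {
    vb_expected="${2#sha256:}"
    vb_actual="$(sha256sum "$1" | awk '{ print $1 }')"
    if [ "$vb_actual" = "$vb_expected" ]; then
        echo "ok"
    else
        echo "mismatch: expected $vb_expected, got $vb_actual"
        return 1
    fi
}
```

Usage: `verify_blob <path-to-blob> sha256:<digest-from-manifest>`. A mismatch here on a file that was just downloaded points at transfer or storage corruption rather than at the verification logic.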
Symptoms
- Instance transitions to Failed immediately after download completes
- SM logs show eInvalidChecksum error
- The same image may have deployed successfully on other Nodes (ruling out a bad image at the source)
Diagnostic Steps
1. Check for checksum errors in SM logs:
journalctl -u aos-servicemanager --since "30 minutes ago" | grep -i "checksum\|InvalidChecksum\|digest"
Look for:
- "eInvalidChecksum": blob content does not match expected digest
- "wrong diff digest": unpacked layer content does not match the declared diff ID
- "wrong layer checksum": layer verification failed after unpacking
2. Identify which blob failed verification:
The log entry preceding the checksum error typically shows the digest being validated. Note the digest value — it identifies whether the failure is in the manifest, config, or a specific layer.
3. Check for disk corruption:
# Check filesystem health
dmesg | grep -i "error\|corrupt\|i/o"
Storage media errors can corrupt downloaded data between write and verification.
4. Check if the issue is reproducible:
If the same image fails verification repeatedly on the same Node but succeeds on others, suspect local storage issues. If it fails on all Nodes, suspect a corrupted image at the source.
Root Causes and Resolutions
| Root Cause | Evidence | Resolution |
|---|---|---|
| Network corruption | Intermittent eInvalidChecksum; succeeds on retry | Transient issue; a new desired-state update triggers re-download. Check for network equipment issues if recurring |
| Corrupted storage media | Repeated failures on same Node; dmesg shows I/O errors | Replace storage media; check filesystem integrity with fsck |
| Corrupted image at source | All Nodes fail verification for the same image | Re-upload the image to the cloud; verify the image digest matches what was built |
| Incomplete download treated as complete | Checksum mismatch on large blobs | Possible issue with range request handling; the Downloader deletes the corrupted blob and the next deployment attempt re-downloads from scratch |
Background Integrity Checks
The Image Manager also performs periodic integrity verification (every 24 hours) on stored blobs. If a previously-installed image fails this background check:
- The corrupted item is removed from storage
- Space is reclaimed
- The instance continues running (it is already loaded in memory)
- The next desired-state reconciliation will re-download the image
Check for background verification failures:
journalctl -u aos-servicemanager --since "24 hours ago" | grep -i "integrity\|verify.*blob"
Launch Failures
Launch failures occur after the image is successfully downloaded and verified, but the container cannot be started. These failures happen during rootfs assembly, OCI runtime config generation, or the actual container process startup.
Symptoms
- Instance transitions to Failed shortly after download completes (no extended Activating period)
- SM logs show errors in PrepareRootFS, LoadConfigs, or StartInstance
- No service instance logs exist (the container process never started)
Diagnostic Steps
1. Check Launcher logs for the specific failure:
journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "PrepareRootFS\|LoadConfigs\|StartInstance\|failed to start"
2. Check for missing layers or configuration:
journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "not found\|missing\|eNotFound"
If a layer path cannot be resolved (the Image Manager's GetLayerPath returns an error), the rootfs assembly fails.
3. Check for OverlayFS mount errors:
dmesg | grep -i "overlay"
journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "mount\|overlay"
OverlayFS mount failures can occur due to filesystem incompatibilities, missing kernel modules, or path length limits.
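One quick way to rule out a missing kernel module is to look for `overlay` in `/proc/filesystems`, which lists a filesystem type only once it is built in or its module is loaded. A minimal sketch, with `overlay_supported` as a hypothetical helper name:

```shell
# Succeeds (exit 0) if "overlay" is listed as a registered filesystem type.
# $1 optionally overrides the file to inspect (defaults to /proc/filesystems).
overlay_supported() {
    grep -qw "overlay" "${1:-/proc/filesystems}"
}
```

Usage: `overlay_supported || echo "overlay not registered; try: modprobe overlay"`. A positive result does not rule out the other causes named above, such as path length limits or an incompatible backing filesystem.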
4. Check systemd unit creation:
systemctl status aos-service@<instance-id>.service
journalctl -u aos-service@<instance-id>.service --since "5 minutes ago"
If the systemd transient unit cannot be created, the container never starts.
5. Check resource availability:
# Check available memory
free -h
# Check PID headroom: compare the kernel PID limit with the number of running processes
cat /proc/sys/kernel/pid_max
ls /proc | grep -c '^[0-9]'
Root Causes and Resolutions
| Root Cause | Evidence | Resolution |
|---|---|---|
| Missing layer data | eNotFound when resolving layer path | Image storage may be corrupted; trigger re-download by issuing a new desired-state update |
| OverlayFS mount failure | Mount errors in dmesg or SM logs | Check kernel supports OverlayFS (modprobe overlay); verify filesystem supports extended attributes |
| Invalid image configuration | LoadConfigs error; malformed manifest or config | Image was built incorrectly; rebuild and re-publish the image |
| Insufficient system resources | OOM during container creation; PID limit reached | Free memory or increase resource limits; check for other services consuming excessive resources |
| Systemd unit creation failure | StartInstance error with systemd-related message | Check systemd health (systemctl --failed); verify D-Bus connection is functional |
| Invalid entrypoint | Container starts but exits immediately with code 127 | The configured entrypoint binary does not exist in the container image; fix the image build |
| Missing shared libraries | Container exits with code 127 or linker errors in instance logs | Required libraries not included in the image layers; rebuild image with correct dependencies |
| Permission denied | Container exits with code 126 | Entrypoint binary is not executable; fix file permissions in the image build |
Instance Crash Loops
Crash loops occur when a service instance starts successfully but repeatedly crashes, exhausting the systemd restart policy. After the configured burst limit is reached, the instance enters the permanent Failed state.
Symptoms
- Instance briefly reaches Active state then transitions to Failed
- Pattern repeats multiple times in quick succession (default: 3 times within 5 seconds)
- After burst limit is exhausted, instance remains in Failed state
- Service instance logs show application-level errors or crashes
Diagnostic Steps
1. Check service instance logs for crash details:
journalctl -u aos-service@<instance-id>.service --since "5 minutes ago"
This shows the stdout/stderr output from the service process, which typically contains the crash reason (unhandled exception, segfault, assertion failure, etc.).
2. Check the exit code:
journalctl -u aos-servicemanager --since "5 minutes ago" | grep "<instance-id>" | grep -i "exit\|failed\|state"
Common exit codes:
- 1 — generic application error
- 126 — permission denied (cannot execute)
- 127 — command not found (missing binary or library)
- 137 — killed by SIGKILL (OOM killer)
- 139 — segmentation fault (SIGSEGV)
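Exit codes above 128 follow the usual shell convention of 128 plus the signal number, which is why 137 corresponds to SIGKILL (9) and 139 to SIGSEGV (11). The helper below is a small sketch of that decoding; `explain_exit_code` is a hypothetical name and the descriptions restate the list above.

```shell
# Prints a short interpretation of a process exit code.
explain_exit_code() {
    case "$1" in
        1)   echo "generic application error" ;;
        126) echo "permission denied (cannot execute)" ;;
        127) echo "command not found (missing binary or library)" ;;
        *)
            # Codes above 128 conventionally encode 128 + signal number.
            if [ "$1" -gt 128 ] 2>/dev/null; then
                echo "terminated by signal $(( $1 - 128 ))"
            else
                echo "application-specific exit code"
            fi
            ;;
    esac
}
```

For example, `explain_exit_code 137` prints `terminated by signal 9`, and `explain_exit_code 139` prints `terminated by signal 11`.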
3. Check resource consumption before crash:
journalctl -u aos-servicemanager --since "10 minutes ago" | grep -i "quota\|alert\|resource"
Look for InstanceQuotaAlert entries that indicate the service was approaching its resource limits before crashing.
4. Check restart policy parameters:
The restart behavior is controlled by the service's item configuration (delivered via desired state). Default values:
| Parameter | Default | Effect |
|---|---|---|
| Start Interval | 5 seconds | Time window for counting restart attempts |
| Start Burst | 3 | Maximum restarts allowed within the interval |
| Restart Interval | 1 second | Delay between crash and restart attempt |
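To reason about whether a given crash pattern would exhaust the burst limit, the interaction of the interval and burst parameters can be modeled as a sliding window. This is a simplified sketch of rate limiting in general, not the actual systemd or SM implementation; `start_limit_hit` is a hypothetical name and the timestamps are plain seconds.

```shell
# Arguments: interval burst t1 t2 ... (start times in seconds, ascending).
# Succeeds (exit 0) if more than `burst` starts fall within any window
# shorter than `interval` seconds, i.e. the limit would be exceeded.
start_limit_hit() {
    slh_interval="$1"
    slh_burst="$2"
    shift 2
    printf '%s\n' "$@" | awk -v win="$slh_interval" -v burst="$slh_burst" '
        { t[NR] = $1 }
        # Compare each start with the one `burst` positions earlier.
        NR > burst && (t[NR] - t[NR - burst]) < win { hit = 1 }
        END { exit (hit ? 0 : 1) }
    '
}
```

With the defaults from the table, `start_limit_hit 5 3 0 1 2 3` succeeds because four starts occur within five seconds, while `start_limit_hit 5 3 0 10 20 30` fails because the starts are spread out.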
5. Check if the crash is environment-dependent:
Compare the failing Node's environment with Nodes where the same service runs successfully:
- Available memory and CPU
- Mounted devices and hardware access
- Network configuration
- Environment variables injected by the runtime
Root Causes and Resolutions
| Root Cause | Evidence | Resolution |
|---|---|---|
| OOM kill | Exit code 137; dmesg shows OOM killer invocation | Increase the service's memory limit in the item configuration, or optimize the service's memory usage |
| Missing runtime dependency | Exit code 127; "not found" in instance logs | Add missing libraries or binaries to the service image |
| Configuration error | Application-specific error in instance logs | Fix the service configuration; check environment variables and mounted config files |
| Hardware access failure | Permission denied or device not found in instance logs | Verify the Resource Manager has granted the required device access; check device node existence |
| Network dependency unavailable | Connection refused or timeout in instance logs | Ensure required network services are reachable from the container's network namespace |
| Segmentation fault | Exit code 139; no useful output in logs | Debug the service binary; check for architecture mismatch (ARM vs x86) or corrupted binary |
| Resource limit too restrictive | Service works with higher limits; quota alerts before crash | Increase CPU quota, memory limit, or PID limit in the item configuration |
Recovery from Crash Loops
Once the restart burst limit is exhausted, the instance remains in the Failed state until a new UpdateInstances command is received. Recovery options:
- Fix and redeploy: Fix the root cause in the service image or configuration, then publish a new desired state from the cloud
- Increase restart limits: If the crash is transient (e.g., a dependency that becomes available after a delay), increase startBurst or startInterval in the item configuration
- Force restart via desired state: Issue an identical desired state from the cloud; CM will send a new UpdateInstances to SM, which resets the systemd unit's failed state and starts fresh
Deployment Failure Reporting
All deployment failures are reported to AosCloud through the standard status reporting chain:
- The SM Launcher detects the failure and sets the instance state to eFailed with an Error containing the failure details
- SM sends UpdateInstancesStatus to CM via gRPC
- CM includes the failed instance status in the next unitStatus message to AosCloud
The ErrorInfo in the status report contains:
- Error code: classifies the failure type (e.g., eInvalidChecksum, eFailed, eNotFound)
- Exit code: for container crashes, the process exit code
- Message: human-readable description of what went wrong
Operators monitoring AosCloud can use these error codes to quickly categorize failures and apply the appropriate resolution from this guide.
Related Pages
- Image Deployment Pipeline — the end-to-end deployment flow that these failures interrupt
- Service Failure Handling — how SM detects, reports, and recovers from service failures
- Service Instance States — the Activating and Failed states referenced in this guide
- Downloader — the HTTP download module with retry and resume logic
- Image Manager — blob storage, verification, and layer management
- Launcher — rootfs assembly and container lifecycle management
- Monitoring Pipeline — resource metrics and alert generation
- Troubleshooting Index — overview of the diagnostic approach and common error codes