Version: v1.1

Service Deployment Failures

Introduction

This page provides practical troubleshooting guidance for service deployment failures — situations where a Deployable Item cannot be successfully downloaded, verified, unpacked, or launched on a Node. These failures manifest as instances stuck in the Activating state or transitioning directly to the Failed state after a desired-state update is received.

Service deployment failures are among the most common operational issues because they involve multiple components (Communication Manager, Service Manager, Downloader, Image Manager, Launcher) and depend on external factors like network connectivity, disk space, and image integrity. This guide walks through each failure category with specific diagnostic steps and resolution actions.

Quick Diagnosis

When a service deployment fails, start by identifying which stage of the Image Deployment Pipeline encountered the error:

Symptom	Likely Stage	First Check
Instance stuck in Activating, no download progress	Blob URL resolution	CM logs for `GetBlobsInfo` errors
Download progress starts then stops	Download (Stage 2)	SM logs for HTTP errors, network connectivity
Download completes but instance fails immediately	Verification (Stage 3)	SM logs for `eInvalidChecksum` errors
Instance fails after "unpacking" log entries	Layer unpacking (Stage 4)	SM logs for filesystem errors, disk space
Instance fails after "PrepareRootFS" log entries	Rootfs assembly (Stage 5)	SM logs for mount errors, missing layers
Instance reaches Active then immediately fails	Container start (Stage 5)	Service instance logs for process crash
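
The table above can be turned into a rough triage helper. The following is a minimal sketch, not part of the product: the function name `classify_stage` and the stage labels it prints are hypothetical, and the only patterns it knows are the strings named in the table, so real log wording may differ.

```shell
# Hypothetical triage helper: reads log text on stdin and prints the
# pipeline stage suggested by the Quick Diagnosis table.
# It reports the LAST marker seen, on the assumption that the most recent
# marker is the one closest to the failure.
classify_stage() (
    stage="unknown"
    while IFS= read -r line; do
        case "$line" in
            *GetBlobsInfo*)     stage="blob-url-resolution" ;;
            *eInvalidChecksum*) stage="verification" ;;
            *unpacking*)        stage="layer-unpacking" ;;
            *PrepareRootFS*)    stage="rootfs-assembly" ;;
        esac
    done
    printf '%s\n' "$stage"
)
```

A possible invocation is `journalctl -u aos-servicemanager --since "10 minutes ago" | classify_stage`. An `unknown` result only means none of the four markers appeared in the window.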

Image Download Failures

Download failures occur when the Downloader module cannot retrieve image blobs from the provided URL. The Downloader uses libcurl with automatic retry (3 attempts, exponential backoff from 1s to 5s) and resume support via HTTP range requests.

Symptoms

Instance remains in Activating state for an extended period
unitStatus shows no progress on the affected instance
Download alerts in AosCloud show state eInterrupted

Diagnostic Steps

1. Check Service Manager logs for download errors:

journalctl -u aos-servicemanager --since "10 minutes ago" | grep -i "download\|failed\|error"

Look for messages like:

"Failed to download" with retry count — indicates repeated download failures
"HTTP error" with HTTP status code — server-side issue
"failed to open file" — local filesystem issue
"download cancelled" — superseded by a new desired state

2. Check network connectivity to the download server:

# Test connectivity to the download server (quote the URL so the shell does not interpret characters such as & or ?)
curl -I "<blob-url>"

If the URL is not directly accessible, check whether the Node has internet connectivity and whether any proxy or firewall rules are blocking outbound HTTPS traffic.

3. Check disk space on the download partition:

df -h /var/aos/sm/

The Space Allocator reserves space before starting a download. If the partition is full, the allocation fails and the download never starts.
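
The free-space check can be scripted so that it returns a pass or fail status. This is a sketch under assumptions: the helper names `avail_kb_from_df` and `has_free_space` are hypothetical, and the threshold is chosen by the operator because this guide does not state how much space the Space Allocator reserves.

```shell
# Extracts the "Available" column (in KiB) from `df -Pk` output on stdin.
# -P forces the POSIX single-line format, so the value is always field 4
# of the second line.
avail_kb_from_df() {
    awk 'NR == 2 { print $4 }'
}

# Succeeds if the filesystem holding PATH ($1) has at least NEEDED_KB ($2)
# kilobytes available.
has_free_space() (
    avail_kb=$(df -Pk "$1" | avail_kb_from_df)
    [ "$avail_kb" -ge "$2" ]
)
```

For example, `has_free_space /var/aos/sm/ 524288 || echo "less than 512 MiB free"`, where 512 MiB is an arbitrary illustrative threshold.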

4. Check for concurrent download conflicts:

journalctl -u aos-servicemanager --since "10 minutes ago" | grep "already in progress"

If the same blob digest is already being downloaded (e.g., a shared layer between two services), the second request waits for the first. This is normal behavior but can appear as a stall if the first download is slow.
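
To see at a glance which of the messages from step 1 dominate, the log can be summarized per message. This is a minimal sketch: `count_download_errors` is a hypothetical name, and the four strings are the ones quoted in step 1, matched case-insensitively as fixed text.

```shell
# Hypothetical helper: reads log text on stdin and prints, for each message
# listed in step 1, the number of log LINES that contain it.
count_download_errors() (
    log=$(cat)
    for pattern in "failed to download" "http error" \
                   "failed to open file" "download cancelled"; do
        # grep -c exits non-zero when the count is 0, hence "|| true".
        n=$(printf '%s\n' "$log" | grep -ciF -- "$pattern" || true)
        printf '%s: %s\n' "$pattern" "$n"
    done
)
```

A possible invocation is `journalctl -u aos-servicemanager --since "10 minutes ago" | count_download_errors`.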

Root Causes and Resolutions

Root Cause	Evidence	Resolution
Network unreachable	curl connection timeout, `CURLE_COULDNT_CONNECT` in logs	Restore network connectivity; check DNS resolution and firewall rules
HTTP 404 (blob not found)	`HTTP_CODE: 404` in SM logs	Blob URL has expired or image was removed from cloud storage; trigger a new desired-state update
HTTP 403 (forbidden)	`HTTP_CODE: 403` in SM logs	Authentication token expired or access revoked; check cloud credentials
HTTP 5xx (server error)	`HTTP_CODE: 500/502/503` in SM logs	Cloud storage service issue; retry will occur automatically (3 attempts); wait or escalate to cloud provider
Disk space exhausted	Space allocation failure in logs, `df` shows full partition	Free space by removing outdated images or expanding the partition; the Space Allocator automatically removes outdated items when under pressure
Connection timeout	`CURLE_OPERATION_TIMEDOUT` after 10 seconds	Slow or unstable network; check bandwidth and latency to the download server
Download interrupted (partial)	Download starts but fails mid-transfer	The Downloader resumes automatically on retry if the server supports range requests; if all 3 attempts fail, check for intermittent network issues
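
The HTTP rows of the table can be condensed into a small lookup. This is an illustrative sketch only: `http_code_hint` is a hypothetical name, and the hints are shortened from the Resolution column above.

```shell
# Hypothetical helper: maps an HTTP status code seen in the SM logs to the
# resolution listed in the table above.
http_code_hint() {
    case "$1" in
        404) echo "blob URL expired or image removed: trigger a new desired-state update" ;;
        403) echo "token expired or access revoked: check cloud credentials" ;;
        5??) echo "cloud storage issue: automatic retry, then wait or escalate" ;;
        *)   echo "not covered by this table" ;;
    esac
}
```

When testing a URL by hand, `curl -s -o /dev/null -I -w '%{http_code}' "<blob-url>"` prints only the status code, which can be passed to the helper.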

Download Retry Behavior

The Downloader retries failed downloads with exponential backoff:

Attempt	Delay Before Retry	Behavior
1st attempt	—	Initial download attempt
2nd attempt	1 second	Retry with resume (if server supports range requests)
3rd attempt	2 seconds	Final retry attempt
All failed	—	Error propagated; instance marked as Failed
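
The delay schedule can be reproduced with a few lines of shell. This sketch assumes the delay doubles on each retry and is capped at 5 seconds, which is consistent with the 1 second and 2 second rows above and with the stated 1s to 5s range; the function name `retry_delay` is hypothetical and the actual Downloader logic may differ.

```shell
# Delay in seconds before retry number N (N=1 is the first retry, that is,
# the 2nd attempt). Assumed behavior: start at 1s, double each time, cap at 5s.
retry_delay() (
    retry=$1
    delay=1
    i=1
    while [ "$i" -lt "$retry" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt 5 ]; then
        delay=5
    fi
    printf '%s\n' "$delay"
)
```

With the default of 3 attempts only `retry_delay 1` and `retry_delay 2` are ever used, so the cap would matter only if the attempt count were raised.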

If the server supports Accept-Ranges: bytes, the Downloader resumes from where it left off. Otherwise, it restarts the download from the beginning on each retry.
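
Whether a server advertises range support can be checked from its response headers. The helper below is a sketch with a hypothetical name, `supports_byte_ranges`; it only inspects the `Accept-Ranges` header and does not prove that a resumed transfer will succeed.

```shell
# Hypothetical helper: reads HTTP response headers on stdin (for example the
# output of `curl -sI "<blob-url>"`) and succeeds if the server advertises
# byte-range support. Header names are matched case-insensitively and the
# trailing carriage returns of HTTP headers are stripped first.
supports_byte_ranges() {
    tr -d '\r' | grep -i '^accept-ranges:[[:space:]]*bytes[[:space:]]*$' >/dev/null
}
```

For example, `curl -sI "<blob-url>" | supports_byte_ranges && echo "resume possible"`.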

Verification Failures

Verification failures occur when the SHA-256 hash of a downloaded blob does not match the expected digest declared in the OCI manifest. A mismatch points to data corruption, either in transit or on local storage, or to a tampered image.
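
The same comparison can be repeated by hand on any file whose expected digest is known. This is a minimal sketch: `verify_blob` is a hypothetical name, it relies on `sha256sum` from GNU coreutils, and where blobs are stored on the Node is not covered here, so the file path is left to the operator.

```shell
# Hypothetical helper: succeeds if the SHA-256 of FILE ($1) matches DIGEST ($2).
# DIGEST may be given in the OCI form "sha256:<hex>" or as bare hex.
verify_blob() (
    file=$1
    expected=${2#sha256:}
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
)
```

For example, `verify_blob <blob-file> <expected-digest> || echo "digest mismatch"`.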

Symptoms

Instance transitions to Failed immediately after download completes
SM logs show eInvalidChecksum error
The same image may have deployed successfully on other Nodes (ruling out a bad image at the source)

Diagnostic Steps

1. Check for checksum errors in SM logs:

journalctl -u aos-servicemanager --since "30 minutes ago" | grep -i "checksum\|InvalidChecksum\|digest"

Look for:

"eInvalidChecksum" — blob content does not match expected digest
"wrong diff digest" — unpacked layer content does not match the declared diff ID
"wrong layer checksum" — layer verification failed after unpacking

2. Identify which blob failed verification:

The log entry preceding the checksum error typically shows the digest being validated. Note the digest value — it identifies whether the failure is in the manifest, config, or a specific layer.

3. Check for disk corruption:

# Check the kernel log for storage and filesystem errors
dmesg | grep -i "error\|corrupt\|i/o"

Storage media errors can corrupt downloaded data between write and verification.

4. Check if the issue is reproducible:

If the same image fails verification repeatedly on the same Node but succeeds on others, suspect local storage issues. If it fails on all Nodes, suspect a corrupted image at the source.

Root Causes and Resolutions

Root Cause	Evidence	Resolution
Network corruption	Intermittent `eInvalidChecksum`; succeeds on retry	Transient issue; a new desired-state update triggers re-download. Check for network equipment issues if recurring
Corrupted storage media	Repeated failures on same Node; `dmesg` shows I/O errors	Replace storage media; check filesystem integrity with `fsck`
Corrupted image at source	All Nodes fail verification for the same image	Re-upload the image to the cloud; verify the image digest matches what was built
Incomplete download treated as complete	Checksum mismatch on large blobs	Possible issue with range request handling; the Downloader deletes the corrupted blob and the next deployment attempt re-downloads from scratch

Background Integrity Checks

The Image Manager also performs periodic integrity verification (every 24 hours) on stored blobs. If a previously installed image fails this background check:

The corrupted item is removed from storage
Space is reclaimed
The instance continues running (it is already loaded in memory)
The next desired-state reconciliation will re-download the image

Check for background verification failures:

journalctl -u aos-servicemanager --since "24 hours ago" | grep -i "integrity\|verify.*blob"

Launch Failures

Launch failures occur after the image is successfully downloaded and verified, but the container cannot be started. These failures happen during rootfs assembly, OCI runtime config generation, or the actual container process startup.

Symptoms

Instance transitions to Failed shortly after download completes (no extended Activating period)
SM logs show errors in PrepareRootFS, LoadConfigs, or StartInstance
No service instance logs exist (the container process never started)

Diagnostic Steps

1. Check Launcher logs for the specific failure:

journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "PrepareRootFS\|LoadConfigs\|StartInstance\|failed to start"

2. Check for missing layers or configuration:

journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "not found\|missing\|eNotFound"

If a layer path cannot be resolved (the Image Manager's GetLayerPath returns an error), the rootfs assembly fails.

3. Check for OverlayFS mount errors:

dmesg | grep -i "overlay"
journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "mount\|overlay"

OverlayFS mount failures can occur due to filesystem incompatibilities, missing kernel modules, or path length limits.

4. Check systemd unit creation:

systemctl status aos-service@<instance-id>.service
journalctl -u aos-service@<instance-id>.service --since "5 minutes ago"

If the systemd transient unit cannot be created, the container never starts.

5. Check resource availability:

# Check available memory
free -h

# Compare the system-wide PID limit with the number of running processes
cat /proc/sys/kernel/pid_max
ls /proc | grep -c '^[0-9][0-9]*$'

Root Causes and Resolutions

Root Cause	Evidence	Resolution
Missing layer data	`eNotFound` when resolving layer path	Image storage may be corrupted; trigger re-download by issuing a new desired-state update
OverlayFS mount failure	Mount errors in `dmesg` or SM logs	Check that the kernel supports OverlayFS (`modprobe overlay`); verify that the underlying filesystem supports extended attributes
Invalid image configuration	`LoadConfigs` error; malformed manifest or config	Image was built incorrectly; rebuild and re-publish the image
Insufficient system resources	OOM during container creation; PID limit reached	Free memory or increase resource limits; check for other services consuming excessive resources
Systemd unit creation failure	`StartInstance` error with systemd-related message	Check systemd health (`systemctl --failed`); verify D-Bus connection is functional
Invalid entrypoint	Container starts but exits immediately with code 127	The configured entrypoint binary does not exist in the container image; fix the image build
Missing shared libraries	Container exits with code 127 or linker errors in instance logs	Required libraries not included in the image layers; rebuild image with correct dependencies
Permission denied	Container exits with code 126	Entrypoint binary is not executable; fix file permissions in the image build

Instance Crash Loops

Crash loops occur when a service instance starts successfully but then crashes repeatedly until it exhausts the systemd restart policy. Once the configured burst limit is reached, the instance stays in the Failed state until a new UpdateInstances command is received.

Symptoms

Instance briefly reaches Active state then transitions to Failed
Pattern repeats multiple times in quick succession (default: 3 times within 5 seconds)
After burst limit is exhausted, instance remains in Failed state
Service instance logs show application-level errors or crashes

Diagnostic Steps

1. Check service instance logs for crash details:

journalctl -u aos-service@<instance-id>.service --since "5 minutes ago"

This shows the stdout/stderr output from the service process, which typically contains the crash reason (unhandled exception, segfault, assertion failure, etc.).

2. Check the exit code:

journalctl -u aos-servicemanager --since "5 minutes ago" | grep "<instance-id>" | grep -i "exit\|failed\|state"

Common exit codes:

1 — generic application error
126 — permission denied (cannot execute)
127 — command not found (missing binary or library)
137 — killed by SIGKILL (OOM killer)
139 — segmentation fault (SIGSEGV)
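
The list above can be wrapped in a small decoder. This is an illustrative sketch with a hypothetical function name, `describe_exit_code`; it also applies the common shell convention that an exit code above 128 means the process was terminated by signal number (code - 128), which is why 137 corresponds to signal 9 and 139 to signal 11.

```shell
# Hypothetical helper: explains a container exit code using the list above.
describe_exit_code() (
    code=$1
    case "$code" in
        1)   echo "generic application error" ;;
        126) echo "permission denied (cannot execute)" ;;
        127) echo "command not found (missing binary or library)" ;;
        137) echo "killed by SIGKILL (possibly the OOM killer)" ;;
        139) echo "segmentation fault (SIGSEGV)" ;;
        *)
            if [ "$code" -gt 128 ]; then
                echo "terminated by signal $((code - 128))"
            else
                echo "application-specific exit code"
            fi
            ;;
    esac
)
```

A SIGKILL does not always come from the OOM killer, so an exit code of 137 is worth confirming against `dmesg` before raising memory limits.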

3. Check resource consumption before crash:

journalctl -u aos-servicemanager --since "10 minutes ago" | grep -i "quota\|alert\|resource"

Look for InstanceQuotaAlert entries that indicate the service was approaching its resource limits before crashing.

4. Check restart policy parameters:

The restart behavior is controlled by the service's item configuration (delivered via desired state). Default values:

Parameter	Default	Effect
Start Interval	5 seconds	Time window for counting restart attempts
Start Burst	3	Maximum restarts allowed within the interval
Restart Interval	1 second	Delay between crash and restart attempt
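
To reason about whether a given crash pattern would trip the limit, the check can be simulated. This sketch assumes that Start Interval and Start Burst map onto systemd's start rate limiting, where a unit started more than the burst count within one interval is refused further starts; that mapping and the function name `burst_exceeded` are assumptions, not something this guide confirms.

```shell
# Hypothetical helper: burst_exceeded INTERVAL BURST T1 T2 ...
# T1, T2, ... are start timestamps in whole seconds. Succeeds if more than
# BURST starts fall inside any window [t, t + INTERVAL) that begins at one
# of the given timestamps.
burst_exceeded() (
    interval=$1
    burst=$2
    shift 2
    for t in "$@"; do
        count=0
        for u in "$@"; do
            if [ "$u" -ge "$t" ] && [ "$u" -lt $((t + interval)) ]; then
                count=$((count + 1))
            fi
        done
        if [ "$count" -gt "$burst" ]; then
            exit 0
        fi
    done
    exit 1
)
```

With the defaults, `burst_exceeded 5 3 0 1 2 3` succeeds (four starts within 5 seconds), while `burst_exceeded 5 3 0 2 4 6` does not, because no 5-second window holds more than three starts.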

5. Check if the crash is environment-dependent:

Compare the failing Node's environment with Nodes where the same service runs successfully:

Available memory and CPU
Mounted devices and hardware access
Network configuration
Environment variables injected by the runtime

Root Causes and Resolutions

Root Cause	Evidence	Resolution
OOM kill	Exit code 137; `dmesg` shows OOM killer invocation	Increase the service's memory limit in the item configuration, or optimize the service's memory usage
Missing runtime dependency	Exit code 127; "not found" in instance logs	Add missing libraries or binaries to the service image
Configuration error	Application-specific error in instance logs	Fix the service configuration; check environment variables and mounted config files
Hardware access failure	Permission denied or device not found in instance logs	Verify the Resource Manager has granted the required device access; check device node existence
Network dependency unavailable	Connection refused or timeout in instance logs	Ensure required network services are reachable from the container's network namespace
Segmentation fault	Exit code 139; no useful output in logs	Debug the service binary; check for architecture mismatch (ARM vs x86) or corrupted binary
Resource limit too restrictive	Service works with higher limits; quota alerts before crash	Increase CPU quota, memory limit, or PID limit in the item configuration

Recovery from Crash Loops

Once the restart burst limit is exhausted, the instance remains in the Failed state until a new UpdateInstances command is received. Recovery options:

Fix and redeploy — Fix the root cause in the service image or configuration, then publish a new desired state from the cloud
Increase restart limits — If the crash is transient (e.g., a dependency that becomes available after a delay), increase startBurst or startInterval in the item configuration
Force restart via desired state — Issue an identical desired state from the cloud; CM will send a new UpdateInstances to SM, which resets the systemd unit's failed state and starts fresh

Deployment Failure Reporting

All deployment failures are reported to AosCloud through the standard status reporting chain:

The SM Launcher detects the failure and sets the instance state to eFailed with an Error containing the failure details
SM sends UpdateInstancesStatus to CM via gRPC
CM includes the failed instance status in the next unitStatus message to AosCloud

The ErrorInfo in the status report contains:

Error code — classifies the failure type (e.g., eInvalidChecksum, eFailed, eNotFound)
Exit code — for container crashes, the process exit code
Message — human-readable description of what went wrong

Operators monitoring AosCloud can use these error codes to quickly categorize failures and apply the appropriate resolution from this guide.

Related Pages

Image Deployment Pipeline — the end-to-end deployment flow that these failures interrupt
Service Failure Handling — how SM detects, reports, and recovers from service failures
Service Instance States — the Activating and Failed states referenced in this guide
Downloader — the HTTP download module with retry and resume logic
Image Manager — blob storage, verification, and layer management
Launcher — rootfs assembly and container lifecycle management
Monitoring Pipeline — resource metrics and alert generation
Troubleshooting Index — overview of the diagnostic approach and common error codes
