Version: v1.1

Service Deployment Failures

Introduction

This page provides practical troubleshooting guidance for service deployment failures — situations where a Deployable Item cannot be successfully downloaded, verified, unpacked, or launched on a Node. These failures manifest as instances stuck in the Activating state or transitioning directly to the Failed state after a desired-state update is received.

Service deployment failures are among the most common operational issues because they involve multiple components (Communication Manager, Service Manager, Downloader, Image Manager, Launcher) and depend on external factors like network connectivity, disk space, and image integrity. This guide walks through each failure category with specific diagnostic steps and resolution actions.

Quick Diagnosis

When a service deployment fails, start by identifying which stage of the Image Deployment Pipeline encountered the error:

| Symptom | Likely Stage | First Check |
| --- | --- | --- |
| Instance stuck in Activating, no download progress | Blob URL resolution | CM logs for GetBlobsInfo errors |
| Download progress starts then stops | Download (Stage 2) | SM logs for HTTP errors, network connectivity |
| Download completes but instance fails immediately | Verification (Stage 3) | SM logs for eInvalidChecksum errors |
| Instance fails after "unpacking" log entries | Layer unpacking (Stage 4) | SM logs for filesystem errors, disk space |
| Instance fails after "PrepareRootFS" log entries | Rootfs assembly (Stage 5) | SM logs for mount errors, missing layers |
| Instance reaches Active then immediately fails | Container start (Stage 5) | Service instance logs for process crash |
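The symptom-to-stage mapping above can be approximated with a small log filter. This is a minimal sketch, not an Aos tool: `classify_stage` is a hypothetical helper, and the match strings are taken from this table, so the actual log wording on a given release may differ. Later stages are checked first because a failed deployment usually also contains log entries from the stages that succeeded before it.

```shell
# Hypothetical helper: reads log text on stdin and prints the latest
# pipeline stage whose marker string appears in it.
# The marker strings are assumptions based on the Quick Diagnosis table.
classify_stage() {
    log=$(cat)
    case "$log" in
        *PrepareRootFS*)    echo "rootfs assembly" ;;
        *unpacking*)        echo "layer unpacking" ;;
        *eInvalidChecksum*) echo "verification" ;;
        *"HTTP error"*)     echo "download" ;;
        *GetBlobsInfo*)     echo "blob URL resolution" ;;
        *)                  echo "unknown" ;;
    esac
}

# Example usage on a Node (not executed here):
#   journalctl -u aos-servicemanager --since "10 minutes ago" | classify_stage
```

The result is only a starting point for the stage-specific sections below; an "unknown" result means none of the marker strings appeared in the selected time window.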

Image Download Failures

Download failures occur when the Downloader module cannot retrieve image blobs from the provided URL. The Downloader uses libcurl with automatic retry (3 attempts, exponential backoff from 1s to 5s) and resume support via HTTP range requests.

Symptoms

  • Instance remains in Activating state for an extended period
  • unitStatus shows no progress on the affected instance
  • Download alerts in AosCloud show state eInterrupted

Diagnostic Steps

1. Check Service Manager logs for download errors:

journalctl -u aos-servicemanager --since "10 minutes ago" | grep -i "download\|failed\|error"

Look for messages like:

  • "Failed to download" with retry count — indicates repeated download failures
  • "HTTP error" with HTTP status code — server-side issue
  • "failed to open file" — local filesystem issue
  • "download cancelled" — superseded by a new desired state

2. Check network connectivity to the download server:

# Test connectivity to the cloud endpoint
curl -I <blob-url>

If the URL is not directly accessible, check whether the Node has internet connectivity and whether any proxy or firewall rules are blocking outbound HTTPS traffic.

3. Check disk space on the download partition:

df -h /var/aos/sm/

The Space Allocator reserves space before starting a download. If the partition is full, the allocation fails and the download never starts.
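To confirm by hand whether a blob of a given size would fit, a check along these lines can help. It is a rough sketch that only compares the available space reported by `df` against a required size; `has_free_space` is a hypothetical helper, and the Space Allocator's own accounting (reservations, outdated-item cleanup) is not modeled.

```shell
# has_free_space PATH REQUIRED_KB
# Succeeds when the filesystem holding PATH reports at least REQUIRED_KB
# KiB available. Fails if PATH does not exist or df output cannot be parsed.
has_free_space() {
    # -P forces the portable one-line-per-filesystem format, -k reports KiB
    avail_kb=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$2" ] 2>/dev/null
}

# Example usage on a Node (not executed here), checking for ~500 MiB:
#   has_free_space /var/aos/sm/ 512000 && echo "enough space" || echo "too little space"
```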

4. Check for concurrent download conflicts:

journalctl -u aos-servicemanager --since "10 minutes ago" | grep "already in progress"

If the same blob digest is already being downloaded (e.g., a shared layer between two services), the second request waits for the first. This is normal behavior but can appear as a stall if the first download is slow.

Root Causes and Resolutions

| Root Cause | Evidence | Resolution |
| --- | --- | --- |
| Network unreachable | curl connection timeout, CURLE_COULDNT_CONNECT in logs | Restore network connectivity; check DNS resolution and firewall rules |
| HTTP 404 (blob not found) | HTTP_CODE: 404 in SM logs | Blob URL has expired or image was removed from cloud storage; trigger a new desired-state update |
| HTTP 403 (forbidden) | HTTP_CODE: 403 in SM logs | Authentication token expired or access revoked; check cloud credentials |
| HTTP 5xx (server error) | HTTP_CODE: 500/502/503 in SM logs | Cloud storage service issue; retry will occur automatically (3 attempts); wait or escalate to cloud provider |
| Disk space exhausted | Space allocation failure in logs, df shows full partition | Free space by removing outdated images or expanding the partition; the Space Allocator automatically removes outdated items when under pressure |
| Connection timeout | CURLE_OPERATION_TIMEDOUT after 10 seconds | Slow or unstable network; check bandwidth and latency to the download server |
| Download interrupted (partial) | Download starts but fails mid-transfer | The Downloader resumes automatically on retry if the server supports range requests; if all 3 retries fail, check for intermittent network issues |
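When triaging many Nodes, the HTTP rows of the table above can be encoded as a lookup. The following is an illustrative sketch only: `classify_http_status` is a hypothetical helper and its output strings simply restate the table.

```shell
# classify_http_status CODE
# Maps an HTTP status code to the root cause and first action from the table.
classify_http_status() {
    case "$1" in
        404)         echo "blob not found: trigger a new desired-state update" ;;
        403)         echo "forbidden: check cloud credentials" ;;
        5[0-9][0-9]) echo "server error: wait for automatic retry or escalate" ;;
        2[0-9][0-9]) echo "ok: server reachable, look at later stages" ;;
        *)           echo "unclassified: inspect SM logs" ;;
    esac
}

# Example usage on a Node (not executed here); -w '%{http_code}' makes curl
# print only the status code of the HEAD request:
#   classify_http_status "$(curl -s -o /dev/null -I -w '%{http_code}' "<blob-url>")"
```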

Download Retry Behavior

The Downloader retries failed downloads with exponential backoff:

| Attempt | Delay Before Retry | Behavior |
| --- | --- | --- |
| 1st attempt | None | Initial download attempt |
| 2nd attempt | 1 second | Retry with resume (if server supports range requests) |
| 3rd attempt | 2 seconds | Final retry attempt |
| All failed | None | Error propagated; instance marked as Failed |

If the server advertises range support (Accept-Ranges: bytes), the Downloader resumes from the last byte it received. Otherwise, it restarts the download from the beginning on each retry.
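The delay schedule can be reproduced with a few lines of arithmetic. This sketch assumes the delay starts at 1 second, doubles on each retry, and is capped at 5 seconds (one reading of the "1s to 5s" range stated earlier); with three attempts the cap is never reached. `retry_delay` is a hypothetical name, not a Downloader function.

```shell
# retry_delay RETRY_NUMBER
# Prints the delay in seconds before the given retry (1 = first retry).
# Assumption: initial delay 1 s, doubling per retry, capped at 5 s.
retry_delay() {
    retry=$1
    delay=1
    n=1
    while [ "$n" -lt "$retry" ]; do
        delay=$((delay * 2))
        n=$((n + 1))
    done
    if [ "$delay" -gt 5 ]; then
        delay=5
    fi
    echo "$delay"
}

# Delays before the 2nd and 3rd attempts, matching the table above
retry_delay 1
retry_delay 2
```

Under these assumptions the worst-case time added by backoff alone is 3 seconds, so a download that stays in Activating much longer is more likely waiting on transfer timeouts than on retry delays.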

Verification Failures

Verification failures occur when a downloaded blob's SHA-256 hash does not match the expected digest declared in the OCI manifest. A mismatch usually indicates data corruption during transfer or on local storage, or a tampered image.
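The same comparison can be repeated manually on a blob file to confirm a suspected mismatch. This is a minimal sketch assuming the digest is written in the usual OCI form `sha256:<hex>` and that `sha256sum` is available on the Node; `verify_blob` is a hypothetical helper and the blob path is whatever file you want to check.

```shell
# verify_blob FILE DIGEST
# Succeeds when the SHA-256 of FILE equals DIGEST ("sha256:<hex>" or bare hex).
verify_blob() {
    expected=${2#sha256:}                                    # drop the algorithm prefix if present
    actual=$(sha256sum "$1" 2>/dev/null | awk '{ print $1 }') # first field is the hex digest
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example usage (not executed here):
#   verify_blob <path-to-blob> sha256:<expected-hex> && echo "digest OK" || echo "digest mismatch"
```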

Symptoms

  • Instance transitions to Failed immediately after download completes
  • SM logs show eInvalidChecksum error
  • The same image may have deployed successfully on other Nodes (ruling out a bad image at the source)

Diagnostic Steps

1. Check for checksum errors in SM logs:

journalctl -u aos-servicemanager --since "30 minutes ago" | grep -i "checksum\|InvalidChecksum\|digest"

Look for:

  • "eInvalidChecksum" — blob content does not match expected digest
  • "wrong diff digest" — unpacked layer content does not match the declared diff ID
  • "wrong layer checksum" — layer verification failed after unpacking

2. Identify which blob failed verification:

The log entry preceding the checksum error typically shows the digest being validated. Note the digest value — it identifies whether the failure is in the manifest, config, or a specific layer.

3. Check for disk corruption:

# Check filesystem health
dmesg | grep -i "error\|corrupt\|i/o"

Storage media errors can corrupt downloaded data between write and verification.

4. Check if the issue is reproducible:

If the same image fails verification repeatedly on the same Node but succeeds on others, suspect local storage issues. If it fails on all Nodes, suspect a corrupted image at the source.

Root Causes and Resolutions

| Root Cause | Evidence | Resolution |
| --- | --- | --- |
| Network corruption | Intermittent eInvalidChecksum; succeeds on retry | Transient issue; a new desired-state update triggers re-download. Check for network equipment issues if recurring |
| Corrupted storage media | Repeated failures on same Node; dmesg shows I/O errors | Replace storage media; check filesystem integrity with fsck |
| Corrupted image at source | All Nodes fail verification for the same image | Re-upload the image to the cloud; verify the image digest matches what was built |
| Incomplete download treated as complete | Checksum mismatch on large blobs | Possible issue with range request handling; the Downloader deletes the corrupted blob and the next deployment attempt re-downloads from scratch |

Background Integrity Checks

The Image Manager also performs periodic integrity verification (every 24 hours) on stored blobs. If a previously-installed image fails this background check:

  • The corrupted item is removed from storage
  • Space is reclaimed
  • The instance continues running (it is already loaded in memory)
  • The next desired-state reconciliation will re-download the image

Check for background verification failures:

journalctl -u aos-servicemanager --since "24 hours ago" | grep -i "integrity\|verify.*blob"

Launch Failures

Launch failures occur after the image is successfully downloaded and verified, but the container cannot be started. These failures happen during rootfs assembly, OCI runtime config generation, or the actual container process startup.

Symptoms

  • Instance transitions to Failed shortly after download completes (no extended Activating period)
  • SM logs show errors in PrepareRootFS, LoadConfigs, or StartInstance
  • No service instance logs exist (the container process never started)

Diagnostic Steps

1. Check Launcher logs for the specific failure:

journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "PrepareRootFS\|LoadConfigs\|StartInstance\|failed to start"

2. Check for missing layers or configuration:

journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "not found\|missing\|eNotFound"

If a layer path cannot be resolved (the Image Manager's GetLayerPath returns an error), the rootfs assembly fails.

3. Check for OverlayFS mount errors:

dmesg | grep -i "overlay"
journalctl -u aos-servicemanager --since "5 minutes ago" | grep -i "mount\|overlay"

OverlayFS mount failures can occur due to filesystem incompatibilities, missing kernel modules, or path length limits.
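To rule out the missing-kernel-support case quickly, you can check whether the kernel currently lists the overlay filesystem. This is a sketch with a hypothetical helper name; note that a modular overlay driver only appears in /proc/filesystems once the module is loaded, so a negative result may just mean `modprobe overlay` has not been run yet.

```shell
# overlay_supported [FILE]
# Succeeds when "overlay" is listed as a filesystem type in FILE
# (defaults to /proc/filesystems; the parameter exists to ease testing).
overlay_supported() {
    grep -qw overlay "${1:-/proc/filesystems}"
}

# Example usage on a Node (not executed here):
#   overlay_supported && echo "overlay available" || echo "overlay not registered"
```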

4. Check systemd unit creation:

systemctl status aos-service@<instance-id>.service
journalctl -u aos-service@<instance-id>.service --since "5 minutes ago"

If the systemd transient unit cannot be created, the container never starts.

5. Check resource availability:

# Check available memory
free -h

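# Check PID headroom: system-wide PID limit, then the number of running processes
# (threads also consume PIDs and are not counted by the second command)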
cat /proc/sys/kernel/pid_max
ls /proc | grep -c '^[0-9]'

Root Causes and Resolutions

| Root Cause | Evidence | Resolution |
| --- | --- | --- |
| Missing layer data | eNotFound when resolving layer path | Image storage may be corrupted; trigger re-download by issuing a new desired-state update |
| OverlayFS mount failure | Mount errors in dmesg or SM logs | Check kernel supports OverlayFS (modprobe overlay); verify filesystem supports extended attributes |
| Invalid image configuration | LoadConfigs error; malformed manifest or config | Image was built incorrectly; rebuild and re-publish the image |
| Insufficient system resources | OOM during container creation; PID limit reached | Free memory or increase resource limits; check for other services consuming excessive resources |
| Systemd unit creation failure | StartInstance error with systemd-related message | Check systemd health (systemctl --failed); verify D-Bus connection is functional |
| Invalid entrypoint | Container starts but exits immediately with code 127 | The configured entrypoint binary does not exist in the container image; fix the image build |
| Missing shared libraries | Container exits with code 127 or linker errors in instance logs | Required libraries not included in the image layers; rebuild image with correct dependencies |
| Permission denied | Container exits with code 126 | Entrypoint binary is not executable; fix file permissions in the image build |

Instance Crash Loops

Crash loops occur when a service instance starts successfully but repeatedly crashes, exhausting the systemd restart policy. After the configured burst limit is reached, the instance enters the permanent Failed state.

Symptoms

  • Instance briefly reaches Active state then transitions to Failed
  • Pattern repeats multiple times in quick succession (default: 3 times within 5 seconds)
  • After burst limit is exhausted, instance remains in Failed state
  • Service instance logs show application-level errors or crashes

Diagnostic Steps

1. Check service instance logs for crash details:

journalctl -u aos-service@<instance-id>.service --since "5 minutes ago"

This shows the stdout/stderr output from the service process, which typically contains the crash reason (unhandled exception, segfault, assertion failure, etc.).

2. Check the exit code:

journalctl -u aos-servicemanager --since "5 minutes ago" | grep "<instance-id>" | grep -i "exit\|failed\|state"

Common exit codes:

  • 1 — generic application error
  • 126 — permission denied (cannot execute)
  • 127 — command not found (missing binary or library)
  • 137 — killed by SIGKILL (commonly the OOM killer, but any SIGKILL produces this code)
  • 139 — segmentation fault (SIGSEGV)
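Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137 corresponds to signal 9 and 139 to signal 11. The sketch below decodes a code along those lines; `explain_exit_code` is a hypothetical helper and its descriptions restate the list above.

```shell
# explain_exit_code CODE
# Prints a short interpretation of a process exit code.
explain_exit_code() {
    code=$1
    case "$code" in
        0)   echo "success" ;;
        1)   echo "generic application error" ;;
        126) echo "permission denied (cannot execute)" ;;
        127) echo "command not found (missing binary or library)" ;;
        137) echo "killed by signal 9 (SIGKILL, commonly the OOM killer)" ;;
        139) echo "killed by signal 11 (SIGSEGV)" ;;
        *)
            # Codes above 128 encode a terminating signal as 128 + signal number
            if [ "$code" -gt 128 ] 2>/dev/null; then
                echo "killed by signal $((code - 128))"
            else
                echo "application-defined exit code"
            fi
            ;;
    esac
}

explain_exit_code 137
explain_exit_code 143
```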

3. Check resource consumption before crash:

journalctl -u aos-servicemanager --since "10 minutes ago" | grep -i "quota\|alert\|resource"

Look for InstanceQuotaAlert entries that indicate the service was approaching its resource limits before crashing.

4. Check restart policy parameters:

The restart behavior is controlled by the service's item configuration (delivered via desired state). Default values:

| Parameter | Default | Effect |
| --- | --- | --- |
| Start Interval | 5 seconds | Time window for counting restart attempts |
| Start Burst | 3 | Maximum restarts allowed within the interval |
| Restart Interval | 1 second | Delay between crash and restart attempt |
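To judge whether a series of observed start times would exhaust these limits, a simplified model can be useful. The sketch below assumes the limit behaves as a count-within-window rule: the burst is exceeded once more than Start Burst starts fall within any Start Interval window. This is an approximation for reasoning about logs, not a reimplementation of the systemd rate limiter, and `starts_exceed_burst` is a hypothetical helper.

```shell
# starts_exceed_burst INTERVAL BURST T1 [T2 ...]
# Succeeds when more than BURST of the given start timestamps (in seconds)
# fall within a window shorter than INTERVAL seconds.
starts_exceed_burst() {
    interval=$1
    burst=$2
    shift 2
    printf '%s\n' "$@" | sort -n | awk -v i="$interval" -v b="$burst" '
        { t[NR] = $1 }
        END {
            found = 0
            # Compare each start with the one BURST positions later
            for (s = 1; s + b <= NR; s++)
                if (t[s + b] - t[s] < i) { found = 1; break }
            exit (found ? 0 : 1)
        }'
}

# Four starts within 3 seconds exceed the default limit (3 per 5 seconds)
starts_exceed_burst 5 3 0 1 2 3 && echo "burst limit exhausted"
```

With the defaults, a service that crashes about once per second trips the limit within a few seconds, while one that crashes every two seconds or slower does not under this model.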

5. Check if the crash is environment-dependent:

Compare the failing Node's environment with Nodes where the same service runs successfully:

  • Available memory and CPU
  • Mounted devices and hardware access
  • Network configuration
  • Environment variables injected by the runtime

Root Causes and Resolutions

| Root Cause | Evidence | Resolution |
| --- | --- | --- |
| OOM kill | Exit code 137; dmesg shows OOM killer invocation | Increase the service's memory limit in the item configuration, or optimize the service's memory usage |
| Missing runtime dependency | Exit code 127; "not found" in instance logs | Add missing libraries or binaries to the service image |
| Configuration error | Application-specific error in instance logs | Fix the service configuration; check environment variables and mounted config files |
| Hardware access failure | Permission denied or device not found in instance logs | Verify the Resource Manager has granted the required device access; check device node existence |
| Network dependency unavailable | Connection refused or timeout in instance logs | Ensure required network services are reachable from the container's network namespace |
| Segmentation fault | Exit code 139; no useful output in logs | Debug the service binary; check for architecture mismatch (ARM vs x86) or corrupted binary |
| Resource limit too restrictive | Service works with higher limits; quota alerts before crash | Increase CPU quota, memory limit, or PID limit in the item configuration |

Recovery from Crash Loops

Once the restart burst limit is exhausted, the instance remains in the Failed state until a new UpdateInstances command is received. Recovery options:

  1. Fix and redeploy — Fix the root cause in the service image or configuration, then publish a new desired state from the cloud
  2. Increase restart limits — If the crash is transient (e.g., a dependency that becomes available after a delay), increase startBurst or startInterval in the item configuration
  3. Force restart via desired state — Issue an identical desired state from the cloud; CM will send a new UpdateInstances to SM, which resets the systemd unit's failed state and starts fresh

Deployment Failure Reporting

All deployment failures are reported to AosCloud through the standard status reporting chain:

  1. The SM Launcher detects the failure and sets the instance state to eFailed with an Error containing the failure details
  2. SM sends UpdateInstancesStatus to CM via gRPC
  3. CM includes the failed instance status in the next unitStatus message to AosCloud

The ErrorInfo in the status report contains:

  • Error code — classifies the failure type (e.g., eInvalidChecksum, eFailed, eNotFound)
  • Exit code — for container crashes, the process exit code
  • Message — human-readable description of what went wrong

Operators monitoring AosCloud can use these error codes to quickly categorize failures and apply the appropriate resolution from this guide.