Tutorial 2: Mastering the Manifest System

What You Will Learn

By the end of this tutorial you will:

  • Understand every section of a scalable.yaml manifest in depth.

  • Use environment variable expansion for portable, credential-free manifests.

  • Define multiple targets for local development, HPC, and cloud.

  • Configure components with images, mounts, environment variables, and tags.

  • Apply overlays to customize resources per deployment environment.

  • Validate manifests programmatically and interpret error codes.

Prerequisites

Scenario

You are building an energy modeling pipeline with two stages: a computationally expensive disaggregation step (Demeter) and a lighter post-processing step (NetCDF aggregation). The pipeline must run locally during development, on an HPC cluster for production, and eventually in the cloud. The manifest system lets you describe all three targets in a single file.

Step 1: Manifest Schema Overview

Every manifest has this top-level structure:

version: 1
project: { ... }
targets: { ... }
components: { ... }
tasks: { ... }
overlays: { ... }    # optional

The parser (scalable.manifest.parser) enforces:

  • version and project are required.

  • Unknown top-level keys are rejected (defense against typos).

  • Unknown keys inside a target block are passed through to the provider (forward compatibility for provider-specific options).

  • Unknown keys inside components are rejected (strict schema).
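The three strictness rules above can be pictured with a small stand-alone check. This is a minimal sketch, not the actual scalable.manifest.parser code: the function name check_structure is hypothetical, the allowed component keys are assumed to be exactly the ones this tutorial documents, and the error codes are borrowed from the validation table in Step 9.

```python
from __future__ import annotations

# Illustrative only: not the real scalable.manifest.parser implementation.
TOP_LEVEL_KEYS = {"version", "project", "targets", "components", "tasks", "overlays"}
REQUIRED_KEYS = ("version", "project")
# Assumption: the component keys described in this tutorial are the full set.
COMPONENT_KEYS = {
    "image", "runtime", "cpus", "memory",
    "mounts", "env", "tags", "preload_script",
}


def check_structure(manifest: dict) -> list[tuple[str, str]]:
    """Return (code, path) pairs for each structural problem found."""
    issues = []
    for key in REQUIRED_KEYS:
        if key not in manifest:
            issues.append(("E_MISSING_KEY", key))
    for key in sorted(set(manifest) - TOP_LEVEL_KEYS):
        issues.append(("E_UNKNOWN_TOP_KEY", key))
    for name, spec in (manifest.get("components") or {}).items():
        for key in sorted(set(spec or {}) - COMPONENT_KEYS):
            issues.append(("E_UNKNOWN_COMPONENT_KEY", f"components.{name}.{key}"))
    # Target blocks are deliberately not checked here: unknown keys in a
    # target pass through to the provider.
    return issues
```

Note that a misspelled top-level key such as targts is reported, while an unfamiliar option inside a target block is not.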

Step 2: The Project Block

project:
  name: demeter-lulcc
  default_storage: ./outputs
  local_cache: ./cache

name

Identifies the project in telemetry run IDs (e.g., run-20260520T...-demeter-lulcc-a1b2c3d4). Use lowercase with hyphens.

default_storage

Base URI for artifact output. Can be a local path, S3 URI (s3://bucket/prefix/), or GCS URI (gs://bucket/prefix/). Providers that support remote storage will use this as the destination for task outputs.

local_cache

Overrides SCALABLE_CACHE_DIR. The manifest value takes precedence over the environment variable, which in turn takes precedence over the compiled default (./cache).
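The precedence chain for local_cache can be written out as a short function. This is a sketch of the rule as described above, not the library's code; resolve_cache_dir is a hypothetical name, and treating an empty value as "not set" is an assumption.

```python
import os

COMPILED_DEFAULT = "./cache"  # the default stated in this tutorial


def resolve_cache_dir(manifest_value=None, env=None):
    """Pick the cache directory: manifest, then SCALABLE_CACHE_DIR, then default."""
    env = os.environ if env is None else env
    if manifest_value:  # assumption: empty/None means "not set in the manifest"
        return manifest_value
    return env.get("SCALABLE_CACHE_DIR") or COMPILED_DEFAULT
```

For example, resolve_cache_dir("./cache-a", {"SCALABLE_CACHE_DIR": "/tmp/c"}) returns "./cache-a", because the manifest value wins over the environment.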

Step 3: Defining Targets

Targets are named execution environments. You can define as many as you need:

targets:
  local:
    provider: local
    max_workers: 4
    threads_per_worker: 2
    processes: false
    containers: none

  hpc:
    provider: slurm
    queue: batch
    account: GCIMS
    walltime: "04:00:00"
    interface: ib0

  aws:
    provider: aws
    region: us-east-1
    cluster_type: fargate
    instance_type: m5.xlarge
    worker_cpu: 4096
    worker_mem: 16384
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
    adaptive:
      minimum: 1
      maximum: 10

Each target has one required key — provider — that maps to a registered provider class. All other keys are provider-specific options:

Provider      Key Options
local         max_workers, threads_per_worker, processes, containers
slurm         queue, account, walltime, interface
aws           region, cluster_type, instance_type, worker_cpu, worker_mem, image, adaptive, subnets, security_groups
kubernetes    namespace, image, adaptive, overlay

Why multiple targets? A single manifest can describe your entire promotion path: develop locally → validate on HPC → deploy to cloud. The --target flag (or SCALABLE_TARGET env var) selects which environment to activate.

Step 4: Components in Detail

Components are resource profiles for your workloads:

components:
  demeter:
    image: ghcr.io/jgcri/demeter:2.0.1
    runtime: apptainer
    cpus: 8
    memory: 32G
    mounts:
      ./demeter_data: /data
      /shared/outputs: /outputs
    env:
      DEMETER_DATA: /data
    tags: [lulcc, downscaling, gcam]
    preload_script: ./scripts/demeter_preload.sh

  postprocess:
    cpus: 2
    memory: 4G
    tags: [analysis]

Let’s break down each key:

image

Container image URI. Used by providers that support containerized workers (Slurm with Apptainer, Kubernetes, cloud). Omit for bare-metal local runs.

runtime

Container runtime hint (apptainer, docker). Providers use this to determine how to pull and launch the image.

cpus

CPU count allocated per worker in this component group. Maps to Dask worker resource annotations and scheduler affinity.

memory

Memory allocation string (e.g., 32G, 512M). Parsed by dask.utils.parse_bytes.

mounts

Volume mount mappings (host path → container path). Only meaningful for containerized providers.

env

Environment variables injected into the worker process. Useful for configuring model data paths, API keys (prefer ${VAR} references over literals), etc.

tags

Arbitrary labels for grouping and filtering. Tags propagate to telemetry events and can be used by the resource advisor for per-tag recommendations.

preload_script

Shell script executed before the Dask worker process starts. Useful for activating conda environments, loading modules, or mounting FUSE filesystems.

Step 5: Task Bindings

Tasks bind named work units to components:

tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir

  aggregate_results:
    component: postprocess
    cache: true

component

Must reference a key in the components map. This determines which workers can execute the task and what resources are reserved.

cache

When true, results of functions submitted under this task are eligible for the cacheable() disk cache. Cache hits skip execution entirely on subsequent runs.

outputs

Declares expected output artifacts and their types (file or dir). The artifact store can persist these to remote storage when project.default_storage is configured.
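The rule that every task must point at an existing component is easy to check by hand. The sketch below assumes the manifest has already been loaded into a plain dict; check_task_bindings is a hypothetical helper, not part of the scalable API, and the E_TASK_COMPONENT_REF code is the one listed in Step 9.

```python
from __future__ import annotations


def check_task_bindings(manifest: dict) -> list[str]:
    """List tasks whose `component` does not name a key under `components`."""
    components = manifest.get("components") or {}
    problems = []
    for task_name, task in (manifest.get("tasks") or {}).items():
        ref = (task or {}).get("component")
        if ref not in components:
            problems.append(
                f"E_TASK_COMPONENT_REF tasks.{task_name}.component: {ref!r}"
            )
    return problems
```

A task bound to postprocess in a manifest that only defines demeter would be reported; a manifest with consistent bindings yields an empty list.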

Step 6: Environment Variable Expansion

Manifests support ${VAR} and ${VAR:-default} syntax for portability:

project:
  name: ${PROJECT_NAME:-energy-demo}
  default_storage: ${ARTIFACT_BUCKET:-./outputs}

targets:
  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    execution_role_arn: ${EXECUTION_ROLE_ARN}

Expansion rules:

  • ${VAR} — replaced by the value of the environment variable. If unset, the parser raises ManifestParseError.

  • ${VAR:-default} — replaced by the variable if set, otherwise uses the literal default value.

  • Bare $HOME-style references are not expanded (to avoid ambiguity in mount paths). Always use curly braces.

This means you can commit scalable.yaml to version control without embedding secrets or machine-specific paths:

export AWS_REGION=us-west-2
export EXECUTION_ROLE_ARN=arn:aws:iam::123456789:role/myRole
scalable validate ./scalable.yaml
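The expansion rules above can be sketched with a single regular expression. This is an illustration under stated assumptions, not the parser's real implementation: expand is a hypothetical function, the ManifestParseError class here is a local stand-in for the library's exception of the same name, and a variable that is set but empty is treated as set (the text above does not say how the parser handles that case).

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}. Bare $VAR has no braces, so it never matches.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


class ManifestParseError(ValueError):
    """Local stand-in for the library's exception of the same name."""


def expand(text, env=None):
    """Expand ${VAR} and ${VAR:-default} references in a string."""
    env = os.environ if env is None else env

    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in env:  # assumption: set-but-empty counts as set
            return env[name]
        if default is not None:
            return default
        raise ManifestParseError(f"unresolved variable ${{{name}}}")

    return _PATTERN.sub(substitute, text)
```

With only AWS_REGION=us-west-2 in the environment, expand("${AWS_REGION:-us-east-1}") yields "us-west-2", expand("${ARTIFACT_BUCKET:-./outputs}") yields "./outputs", and expand("$HOME/data") is returned unchanged.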

Step 7: Overlays for Environment-Specific Tuning

Overlays let you define named configuration deltas that are merged onto the base manifest when a target references them:

targets:
  hpc:
    provider: slurm
    queue: batch
    walltime: "04:00:00"
    overlay: hpc-large

components:
  demeter:
    cpus: 4
    memory: 16G

overlays:
  hpc-large:
    components:
      demeter:
        cpus: 16
        memory: 64G

  hpc-debug:
    components:
      demeter:
        cpus: 2
        memory: 8G

When target hpc is selected, the hpc-large overlay is merged: demeter.cpus becomes 16 and demeter.memory becomes 64G. The base values serve as defaults for targets that don’t reference an overlay.

Design rationale: Overlays avoid manifest duplication. Instead of maintaining separate YAML files per environment, you express deltas declaratively. The merge is shallow per-component-key (not deep recursive), keeping behavior predictable.
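To make "shallow per-component-key" concrete, here is a minimal sketch of such a merge over plain dicts. It is one reading of the rule, not the library's code: apply_overlay is a hypothetical name, and only the components section of an overlay is handled.

```python
import copy


def apply_overlay(manifest: dict, target_name: str) -> dict:
    """Return a copy of the manifest with the selected target's overlay merged in."""
    merged = copy.deepcopy(manifest)  # never mutate the caller's manifest
    overlay_name = merged["targets"][target_name].get("overlay")
    if overlay_name is None:
        return merged  # no overlay: base values act as the defaults
    overlay = merged["overlays"][overlay_name]  # KeyError if the name is misspelled
    for name, delta in overlay.get("components", {}).items():
        # dict.update replaces each listed key wholesale. A nested value such
        # as `env` would be swapped out entirely, not merged key by key.
        merged["components"].setdefault(name, {}).update(delta)
    return merged
```

Keys the overlay does not mention (for example tags or env on the base component) are left untouched, which is what keeps the result easy to predict.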

You can also override target options via overlays:

overlays:
  cloud-dev:
    targets:
      aws:
        adaptive:
          minimum: 1
          maximum: 3
    components:
      demeter:
        cpus: 4
        memory: 16G

Step 8: Multi-Target Workflow Selection

At runtime you select a target via:

CLI:

scalable run ./scalable.yaml --target hpc --workflow workflow.py

Python:

session = ScalableSession.from_yaml("./scalable.yaml", target="hpc")

Environment variable:

export SCALABLE_TARGET=hpc
python workflow.py   # Session auto-detects from env

The resolution order is: explicit target= argument → SCALABLE_TARGET env var → error (no implicit default target).
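The resolution order can be expressed in a few lines. This is a sketch of the documented behavior only; resolve_target is a hypothetical helper, and the exception type the real session raises is not stated here, so ValueError is an assumption.

```python
import os


def resolve_target(explicit=None, env=None):
    """Explicit argument first, then SCALABLE_TARGET, otherwise fail."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    from_env = env.get("SCALABLE_TARGET")
    if from_env:
        return from_env
    # Assumption: the real library may raise a different exception type.
    raise ValueError("no target selected: pass target= or set SCALABLE_TARGET")
```

Because there is no implicit default, a CI job that forgets both the flag and the variable fails fast instead of silently running against the wrong environment.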

Step 9: Programmatic Validation

You can validate manifests from Python for CI/CD integration:

from scalable import ScalableSession

session = ScalableSession.from_yaml("./scalable.yaml", target="local")
report = session.validate()

if report.ok:
    print("Manifest is valid")
else:
    for issue in report.errors:
        print(f"ERROR [{issue.code}] {issue.path}: {issue.message}")
    for issue in report.warnings:
        print(f"WARN  [{issue.code}] {issue.path}: {issue.message}")

Common error codes:

Code                       Meaning
E_MISSING_KEY              A required key (version, project) is absent.
E_BAD_VERSION              version is not a supported schema version.
E_UNKNOWN_TOP_KEY          Unrecognized top-level key (probable typo).
E_UNKNOWN_COMPONENT_KEY    Unrecognized key inside a component definition.
E_TASK_COMPONENT_REF       A task references a component that does not exist.
E_UNKNOWN_PROVIDER         The target’s provider is not installed or registered.
E_BAD_MAX_WORKERS          max_workers is not a positive integer.
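In a CI job you typically want the report turned into an exit code. The sketch below relies only on the attributes used in the snippet above (ok, errors, warnings, and each issue's code, path, and message); summarize is a hypothetical helper, and the report here is a hand-built stand-in so the example runs without the library installed.

```python
from types import SimpleNamespace


def summarize(report):
    """Return (exit_code, lines) for a validation report-like object."""
    lines = [f"ERROR [{i.code}] {i.path}: {i.message}" for i in report.errors]
    lines += [f"WARN  [{i.code}] {i.path}: {i.message}" for i in report.warnings]
    return (0 if report.ok else 1), lines


# Stand-in report for demonstration; a real one would come from session.validate().
fake_report = SimpleNamespace(
    ok=False,
    errors=[SimpleNamespace(
        code="E_TASK_COMPONENT_REF",
        path="tasks.aggregate.component",
        message="unknown component 'postproc'",
    )],
    warnings=[],
)
exit_code, lines = summarize(fake_report)
```

A CI script could print the lines and pass exit_code to sys.exit so that any error fails the build while warnings alone do not.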

Step 10: Complete Multi-Target Manifest

Here is a production-ready manifest combining all concepts:

version: 1
project:
  name: demeter-lulcc
  default_storage: ${ARTIFACT_STORAGE:-./outputs}

targets:
  local:
    provider: local
    max_workers: 4
    threads_per_worker: 2
    processes: false
    containers: none

  hpc:
    provider: slurm
    queue: batch
    account: ${SLURM_ACCOUNT}
    walltime: "08:00:00"
    interface: ib0
    overlay: hpc-prod

  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    cluster_type: fargate
    worker_cpu: 4096
    worker_mem: 16384
    image: ${ECR_IMAGE}
    execution_role_arn: ${EXECUTION_ROLE_ARN}
    task_role_arn: ${TASK_ROLE_ARN}
    subnets: ["${SUBNET_A}", "${SUBNET_B}"]
    security_groups: ["${SG_ID}"]
    adaptive:
      minimum: 2
      maximum: 20

components:
  demeter:
    image: ghcr.io/jgcri/demeter:2.0.1
    cpus: 4
    memory: 16G
    tags: [lulcc, downscaling, gcam]
    env:
      DEMETER_DATA: /data

  postprocess:
    cpus: 2
    memory: 8G
    tags: [analysis]

tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir

  aggregate:
    component: postprocess
    cache: true

overlays:
  hpc-prod:
    components:
      demeter:
        cpus: 16
        memory: 64G
      postprocess:
        cpus: 8
        memory: 32G

  hpc-debug:
    components:
      demeter:
        cpus: 2
        memory: 4G
      postprocess:
        cpus: 1
        memory: 2G

Troubleshooting

“ManifestParseError: unresolved variable ${VAR}”

You used ${VAR} without a default and the variable is not set in the environment. Either export it or use ${VAR:-fallback}.

“ManifestSchemaError: unknown component key ‘gpu’”

Only recognized component keys are allowed. GPU scheduling is expressed via the provider-specific target options, not component definitions.

Overlay changes not taking effect

Ensure the target block includes overlay: <name> and that the overlay name exactly matches a key under overlays:. Overlay merging only applies to the selected target.

“version: 2” rejected

Only schema version 1 is currently supported. The version field exists for future-proofing.

Next Steps