Tutorial 2: Mastering the Manifest System

What You Will Learn

By the end of this tutorial you will:

Understand every section of a scalable.yaml manifest in depth.
Use environment variable expansion for portable, credential-free manifests.
Define multiple targets for local development, HPC, and cloud.
Configure components with images, mounts, environment variables, and tags.
Apply overlays to customize resources per deployment environment.
Validate manifests programmatically and interpret error codes.

Prerequisites

Completed Tutorial 1: Getting Started with Scalable.
Scalable installed (pip install scalable).
A text editor and terminal.

Scenario

You are building an energy modeling pipeline with two stages: a computationally expensive disaggregation step (Demeter) and a lighter post-processing step (NetCDF aggregation). The pipeline must run locally during development, on an HPC cluster for production, and eventually in the cloud. The manifest system lets you describe all three targets in a single file.

Step 1: Manifest Schema Overview

Every manifest has this top-level structure:

version: 1
project: { ... }
targets: { ... }
components: { ... }
tasks: { ... }
overlays: { ... }    # optional

The parser (scalable.manifest.parser) applies these rules:

version and project are required.
Unknown top-level keys are rejected (defense against typos).
Unknown keys inside a target block are passed through to the provider (forward compatibility for provider-specific options).
Unknown keys inside components are rejected (strict schema).
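The rules above can be sketched as a small check over an already-loaded manifest dictionary. This is an illustrative stand-in, not the code in scalable.manifest.parser: the function name check_keys is hypothetical, and the set of allowed component keys is an assumption taken from the examples in this tutorial.

```python
# Hypothetical sketch of the key rules; not the real scalable parser.
ALLOWED_TOP_LEVEL = {"version", "project", "targets", "components", "tasks", "overlays"}
REQUIRED_TOP_LEVEL = {"version", "project"}
# Assumption: component keys as they appear in this tutorial's examples.
ALLOWED_COMPONENT_KEYS = {
    "image", "runtime", "cpus", "memory", "mounts", "env", "tags", "preload_script",
}


def check_keys(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the keys are fine."""
    problems = []
    for key in sorted(REQUIRED_TOP_LEVEL - set(manifest)):
        problems.append(f"missing required key: {key}")
    for key in sorted(set(manifest) - ALLOWED_TOP_LEVEL):
        problems.append(f"unknown top-level key: {key}")
    for name, component in manifest.get("components", {}).items():
        for key in sorted(set(component) - ALLOWED_COMPONENT_KEYS):
            problems.append(f"unknown key in component {name}: {key}")
    # Target blocks are deliberately not checked: unknown keys pass through
    # to the provider.
    return problems
```

A misspelled top-level key such as targts would be reported by this sketch, while an unfamiliar option inside a target block would not.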

Step 2: The Project Block

project:
  name: demeter-lulcc
  default_storage: ./outputs
  local_cache: ./cache

name: Identifies the project in telemetry run IDs (e.g., run-20260520T...-demeter-lulcc-a1b2c3d4). Use lowercase with hyphens.
default_storage: Base URI for artifact output. Can be a local path, S3 URI (s3://bucket/prefix/), or GCS URI (gs://bucket/prefix/). Providers that support remote storage will use this as the destination for task outputs.
local_cache: Override for SCALABLE_CACHE_DIR. The manifest value takes precedence over the environment variable, which itself takes precedence over the compiled default (./cache).
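The precedence described for local_cache can be written out as a short function. This is only a sketch of the stated order (manifest value, then SCALABLE_CACHE_DIR, then ./cache); the function name resolve_cache_dir is hypothetical and is not part of the Scalable API.

```python
import os

DEFAULT_CACHE_DIR = "./cache"  # the compiled default named in the text


def resolve_cache_dir(project: dict, environ=None) -> str:
    """Pick the cache directory: manifest, then environment, then default."""
    environ = os.environ if environ is None else environ
    if project.get("local_cache"):
        return project["local_cache"]
    return environ.get("SCALABLE_CACHE_DIR") or DEFAULT_CACHE_DIR
```

Passing the environment as an argument keeps the sketch easy to test without touching the real process environment.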

Step 3: Defining Targets

Targets are named execution environments. You can define as many as you need:

targets:
  local:
    provider: local
    max_workers: 4
    threads_per_worker: 2
    processes: false
    containers: none

  hpc:
    provider: slurm
    queue: batch
    account: GCIMS
    walltime: "04:00:00"
    interface: ib0

  aws:
    provider: aws
    region: us-east-1
    cluster_type: fargate
    instance_type: m5.xlarge
    worker_cpu: 4096
    worker_mem: 16384
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
    adaptive:
      minimum: 1
      maximum: 10

Each target has one required key — provider — that maps to a registered provider class. All other keys are provider-specific options:

Provider	Key Options
`local`	`max_workers`, `threads_per_worker`, `processes`, `containers`
`slurm`	`queue`, `account`, `walltime`, `interface`
`aws`	`region`, `cluster_type`, `instance_type`, `worker_cpu`, `worker_mem`, `image`, `adaptive`, `subnets`, `security_groups`
`kubernetes`	`namespace`, `image`, `adaptive`, `overlay`

Why multiple targets? A single manifest can describe your entire promotion path: develop locally → validate on HPC → deploy to cloud. The --target flag (or SCALABLE_TARGET env var) selects which environment to activate.

Step 4: Components in Detail

Components are resource profiles for your workloads:

components:
  demeter:
    image: ghcr.io/jgcri/demeter:2.0.1
    runtime: apptainer
    cpus: 8
    memory: 32G
    mounts:
      ./demeter_data: /data
      /shared/outputs: /outputs
    env:
      DEMETER_DATA: /data
    tags: [lulcc, downscaling, gcam]
    preload_script: ./scripts/demeter_preload.sh

  postprocess:
    cpus: 2
    memory: 4G
    tags: [analysis]

Let’s break down each key:

image: Container image URI. Used by providers that support containerized workers (Slurm with Apptainer, Kubernetes, cloud). Omit for bare-metal local runs.
runtime: Container runtime hint (apptainer, docker). Providers use this to determine how to pull and launch the image.
cpus: CPU count allocated per worker in this component group. Maps to Dask worker resource annotations and scheduler affinity.
memory: Memory allocation string (e.g., 32G, 512M). Parsed by dask.utils.parse_bytes.
mounts: Volume mount mappings (host path → container path). Only meaningful for containerized providers.
env: Environment variables injected into the worker process. Useful for configuring model data paths, API keys (prefer ${VAR} references over literals), etc.
tags: Arbitrary labels for grouping and filtering. Tags propagate to telemetry events and can be used by the resource advisor for per-tag recommendations.
preload_script: Shell script executed before the Dask worker process starts. Useful for activating conda environments, loading modules, or mounting FUSE filesystems.
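To get a feel for what a memory string such as 32G means in bytes, the sketch below converts the single-letter suffixes used in this tutorial. It is a simplified stand-in, not dask.utils.parse_bytes itself, and it assumes the decimal interpretation that the Dask documentation shows for single-letter suffixes (for example, 100M is 100000000 bytes). The function name memory_to_bytes is hypothetical.

```python
# Simplified stand-in for dask.utils.parse_bytes, covering only
# single-letter decimal suffixes (assumption: G means 10**9, not 2**30).
_DECIMAL_SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def memory_to_bytes(value: str) -> int:
    """Convert strings like '32G' or '512M' to a byte count."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in _DECIMAL_SUFFIXES:
        return int(float(value[:-1]) * _DECIMAL_SUFFIXES[suffix])
    return int(value)  # a bare number is taken as bytes


print(memory_to_bytes("32G"))
print(memory_to_bytes("512M"))
```

In a real workflow, prefer calling dask.utils.parse_bytes directly, since it handles many more unit spellings than this sketch.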

Step 5: Task Bindings

Tasks bind named work units to components:

tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir

  aggregate_results:
    component: postprocess
    cache: true

component: Must reference a key in the components map. This determines which workers can execute the task and what resources are reserved.
cache: When true, results of functions submitted under this task are eligible for the cacheable() disk cache. Cache hits skip execution entirely on subsequent runs.
outputs: Declares expected output artifacts and their types (file or dir). The artifact store can persist these to remote storage when project.default_storage is configured.
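The requirement that every task's component names a key under components is easy to check by hand. The sketch below works on a loaded manifest dictionary; the function name find_dangling_tasks is hypothetical, and the real validator reports this condition as E_TASK_COMPONENT_REF (see Step 9).

```python
def find_dangling_tasks(manifest: dict) -> list[tuple[str, str]]:
    """Return (task_name, component_ref) pairs whose component is undefined."""
    components = manifest.get("components", {})
    dangling = []
    for task_name, task in manifest.get("tasks", {}).items():
        ref = task.get("component")
        if ref not in components:
            dangling.append((task_name, ref))
    return dangling


manifest = {
    "components": {"demeter": {"cpus": 8}, "postprocess": {"cpus": 2}},
    "tasks": {
        "run_demeter_scenario": {"component": "demeter", "cache": True},
        "aggregate_results": {"component": "postproces"},  # deliberate typo
    },
}
print(find_dangling_tasks(manifest))
```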

Step 6: Environment Variable Expansion

Manifests support ${VAR} and ${VAR:-default} syntax for portability:

project:
  name: ${PROJECT_NAME:-energy-demo}
  default_storage: ${ARTIFACT_BUCKET:-./outputs}

targets:
  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    execution_role_arn: ${EXECUTION_ROLE_ARN}

Expansion rules:

${VAR} — replaced by the value of the environment variable. If unset, the parser raises ManifestParseError.
${VAR:-default} — replaced by the variable if set, otherwise uses the literal default value.
Bare $HOME-style references are not expanded (to avoid ambiguity in mount paths). Always use curly braces.

This means you can commit scalable.yaml to version control without embedding secrets or machine-specific paths:

export AWS_REGION=us-west-2
export EXECUTION_ROLE_ARN=arn:aws:iam::123456789:role/myRole
scalable validate ./scalable.yaml
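The expansion rules above can be approximated with one regular expression. This is a minimal sketch under stated assumptions: the names expand and UnresolvedVariable are hypothetical (the library itself raises ManifestParseError), and the real parser may differ in details such as how it treats empty values.

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}; bare $VAR is intentionally not matched.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


class UnresolvedVariable(ValueError):
    """Stand-in for the library's ManifestParseError (hypothetical)."""


def expand(text: str, environ=None) -> str:
    """Expand ${VAR} and ${VAR:-default} references in a string."""
    environ = os.environ if environ is None else environ

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in environ:
            return environ[name]
        if default is not None:
            return default
        raise UnresolvedVariable(f"unresolved variable ${{{name}}}")

    return _VAR.sub(replace, text)
```

With an empty environment, expand("${AWS_REGION:-us-east-1}", {}) returns the default, while expand("$HOME/data", {"HOME": "/root"}) returns the string unchanged because the reference has no curly braces.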

Step 7: Overlays for Environment-Specific Tuning

Overlays let you define named configuration deltas that are merged onto the base manifest when a target references them:

targets:
  hpc:
    provider: slurm
    queue: batch
    walltime: "04:00:00"
    overlay: hpc-large

components:
  demeter:
    cpus: 4
    memory: 16G

overlays:
  hpc-large:
    components:
      demeter:
        cpus: 16
        memory: 64G

  hpc-debug:
    components:
      demeter:
        cpus: 2
        memory: 8G

When target hpc is selected, the hpc-large overlay is merged: demeter.cpus becomes 16 and demeter.memory becomes 64G. The base values serve as defaults for targets that don’t reference an overlay.

Design rationale: Overlays avoid manifest duplication. Instead of maintaining separate YAML files per environment, you express deltas declaratively. The merge is shallow per-component-key (not deep recursive), keeping behavior predictable.
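A shallow per-component-key merge can be sketched in a few lines. This is an illustration of the behavior described above, not the library's implementation; the function name apply_overlay is hypothetical, and it handles only the components part of an overlay.

```python
import copy


def apply_overlay(manifest: dict, target_name: str) -> dict:
    """Return the components map with the selected target's overlay applied."""
    components = copy.deepcopy(manifest.get("components", {}))
    overlay_name = manifest["targets"][target_name].get("overlay")
    if overlay_name is None:
        return components  # no overlay: base values are used as-is
    overlay = manifest["overlays"][overlay_name]
    for name, overrides in overlay.get("components", {}).items():
        # Shallow merge: each overlay key replaces the base value wholesale.
        components.setdefault(name, {}).update(overrides)
    return components


manifest = {
    "targets": {
        "local": {"provider": "local"},
        "hpc": {"provider": "slurm", "overlay": "hpc-large"},
    },
    "components": {"demeter": {"cpus": 4, "memory": "16G", "tags": ["lulcc"]}},
    "overlays": {"hpc-large": {"components": {"demeter": {"cpus": 16, "memory": "64G"}}}},
}
print(apply_overlay(manifest, "hpc"))
print(apply_overlay(manifest, "local"))
```

Keys the overlay does not mention, such as tags here, keep their base values. Because the merge is shallow, an overlay that sets a nested key like env would replace the whole env mapping rather than merge into it.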

You can also override target options via overlays:

overlays:
  cloud-dev:
    targets:
      aws:
        adaptive:
          minimum: 1
          maximum: 3
    components:
      demeter:
        cpus: 4
        memory: 16G

Step 8: Multi-Target Workflow Selection

At runtime you select a target via:

CLI:

scalable run ./scalable.yaml --target hpc --workflow workflow.py

Python:

from scalable import ScalableSession

session = ScalableSession.from_yaml("./scalable.yaml", target="hpc")

Environment variable:

export SCALABLE_TARGET=hpc
python workflow.py   # Session auto-detects from env

The resolution order is: explicit target= argument → SCALABLE_TARGET env var → error (no implicit default target).
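The resolution order can be expressed directly in code. The sketch below mirrors the stated order only; resolve_target is a hypothetical helper, and the exception type the library raises when no target is selected is not specified in this tutorial, so ValueError is an assumption.

```python
import os


def resolve_target(explicit=None, environ=None) -> str:
    """Explicit argument first, then SCALABLE_TARGET, otherwise an error."""
    environ = os.environ if environ is None else environ
    if explicit:
        return explicit
    from_env = environ.get("SCALABLE_TARGET")
    if from_env:
        return from_env
    # There is no implicit default target.
    raise ValueError("no target selected: pass target= or set SCALABLE_TARGET")
```

An explicit argument always wins, so resolve_target("hpc", {"SCALABLE_TARGET": "local"}) returns "hpc".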

Step 9: Programmatic Validation

You can validate manifests from Python for CI/CD integration:

from scalable import ScalableSession

session = ScalableSession.from_yaml("./scalable.yaml", target="local")
report = session.validate()

if report.ok:
    print("Manifest is valid")
else:
    for issue in report.errors:
        print(f"ERROR [{issue.code}] {issue.path}: {issue.message}")
    for issue in report.warnings:
        print(f"WARN  [{issue.code}] {issue.path}: {issue.message}")

Common error codes:

Code	Meaning
`E_MISSING_KEY`	A required key (`version`, `project`) is absent.
`E_BAD_VERSION`	`version` is not a supported schema version.
`E_UNKNOWN_TOP_KEY`	Unrecognized top-level key (probable typo).
`E_UNKNOWN_COMPONENT_KEY`	Unrecognized key inside a component definition.
`E_TASK_COMPONENT_REF`	A task references a component that does not exist.
`E_UNKNOWN_PROVIDER`	The target’s provider is not installed or registered.
`E_BAD_MAX_WORKERS`	`max_workers` is not a positive integer.
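For CI, the validation report usually needs to become a process exit code. The helper below is a sketch that relies only on the report attributes used earlier in this step (ok, errors, and each issue's code, path, and message); the name exit_code_for is hypothetical, and the example report is a hand-built stand-in whose path format is an assumption.

```python
import sys
from types import SimpleNamespace


def exit_code_for(report) -> int:
    """Print errors to stderr and return 0 for a valid manifest, 1 otherwise."""
    for issue in report.errors:
        print(f"ERROR [{issue.code}] {issue.path}: {issue.message}", file=sys.stderr)
    return 0 if report.ok else 1


# Stand-in report for demonstration; a real one comes from session.validate().
example_report = SimpleNamespace(
    ok=False,
    errors=[
        SimpleNamespace(
            code="E_TASK_COMPONENT_REF",
            path="tasks.aggregate.component",  # assumed path format
            message="component 'postproces' is not defined",
        )
    ],
)
print(exit_code_for(example_report))
```

In a CI script the last line would become sys.exit(exit_code_for(session.validate())), so that a failing manifest fails the job.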

Step 10: Complete Multi-Target Manifest

Here is a production-ready manifest combining all concepts:

version: 1
project:
  name: demeter-lulcc
  default_storage: ${ARTIFACT_STORAGE:-./outputs}

targets:
  local:
    provider: local
    max_workers: 4
    threads_per_worker: 2
    processes: false
    containers: none

  hpc:
    provider: slurm
    queue: batch
    account: ${SLURM_ACCOUNT}
    walltime: "08:00:00"
    interface: ib0
    overlay: hpc-prod

  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    cluster_type: fargate
    worker_cpu: 4096
    worker_mem: 16384
    image: ${ECR_IMAGE}
    execution_role_arn: ${EXECUTION_ROLE_ARN}
    task_role_arn: ${TASK_ROLE_ARN}
    subnets: ["${SUBNET_A}", "${SUBNET_B}"]
    security_groups: ["${SG_ID}"]
    adaptive:
      minimum: 2
      maximum: 20

components:
  demeter:
    image: ghcr.io/jgcri/demeter:2.0.1
    cpus: 4
    memory: 16G
    tags: [lulcc, downscaling, gcam]
    env:
      DEMETER_DATA: /data

  postprocess:
    cpus: 2
    memory: 8G
    tags: [analysis]

tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir

  aggregate:
    component: postprocess
    cache: true

overlays:
  hpc-prod:
    components:
      demeter:
        cpus: 16
        memory: 64G
      postprocess:
        cpus: 8
        memory: 32G

  hpc-debug:
    components:
      demeter:
        cpus: 2
        memory: 4G
      postprocess:
        cpus: 1
        memory: 2G

Troubleshooting

“ManifestParseError: unresolved variable ${VAR}”: You used ${VAR} without a default and the variable is not set in the environment. Either export it or use ${VAR:-fallback}.
“ManifestSchemaError: unknown component key ‘gpu’”: Only recognized component keys are allowed. GPU scheduling is expressed via the provider-specific target options, not component definitions.
Overlay changes not taking effect: Ensure the target block includes overlay: <name> and that the overlay name exactly matches a key under overlays:. Overlay merging only applies to the selected target.
“version: 2” rejected: Only schema version 1 is currently supported. The version field exists for future-proofing.

Next Steps

Tutorial 3: Scaling Strategies with Providers — Learn how different providers scale workers and how to choose between them.
Tutorial 4: Performance Optimization and Caching — Cache expensive computations to accelerate iterative development.
Tutorial 5: Cloud Integration with AWS and GCP — Configure AWS and GCP targets with real credentials and IAM roles.