Tutorial 2: Mastering the Manifest System
What You Will Learn
By the end of this tutorial you will:
- Understand every section of a scalable.yaml manifest in depth.
- Use environment variable expansion for portable, credential-free manifests.
- Define multiple targets for local development, HPC, and cloud.
- Configure components with images, mounts, environment variables, and tags.
- Apply overlays to customize resources per deployment environment.
- Validate manifests programmatically and interpret error codes.
Prerequisites
- Completed Tutorial 1: Getting Started with Scalable.
- Scalable installed (pip install scalable).
- A text editor and terminal.
Scenario
You are building an energy modeling pipeline with two stages: a computationally expensive disaggregation step (Demeter) and a lighter post-processing step (NetCDF aggregation). The pipeline must run locally during development, on an HPC cluster for production, and eventually in the cloud. The manifest system lets you describe all three targets in a single file.
Step 1: Manifest Schema Overview
Every manifest has this top-level structure:
version: 1
project: { ... }
targets: { ... }
components: { ... }
tasks: { ... }
overlays: { ... } # optional
The parser (scalable.manifest.parser) enforces:
- version and project are required.
- Unknown top-level keys are rejected (defense against typos).
- Unknown keys inside a target block are passed through to the provider (forward compatibility for provider-specific options).
- Unknown keys inside components are rejected (strict schema).
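The rules above can be illustrated with a small hand-written check. This is a sketch under stated assumptions, not the code in scalable.manifest.parser: the function name check_structure is hypothetical, and the set of allowed component keys is taken from the examples in this tutorial and may be incomplete.

```python
# Illustrative only: mirrors the parser rules described in this step.
TOP_LEVEL_KEYS = {"version", "project", "targets", "components", "tasks", "overlays"}
REQUIRED_KEYS = {"version", "project"}
# Assumption: component keys collected from the examples in this tutorial.
COMPONENT_KEYS = {
    "image", "runtime", "cpus", "memory",
    "mounts", "env", "tags", "preload_script",
}


def check_structure(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(manifest)):
        problems.append(f"missing required key: {key}")
    for key in sorted(set(manifest) - TOP_LEVEL_KEYS):
        problems.append(f"unknown top-level key: {key}")
    # Target blocks are deliberately not checked: unknown keys there are
    # passed through to the provider.
    for name, component in manifest.get("components", {}).items():
        for key in sorted(set(component) - COMPONENT_KEYS):
            problems.append(f"component {name}: unknown key: {key}")
    return problems
```

With this sketch, a misspelled top-level key such as taskz is reported, while an extra key inside a target block is ignored and left for the provider to interpret.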
Step 2: The Project Block
project:
  name: demeter-lulcc
  default_storage: ./outputs
  local_cache: ./cache
- name
  Identifies the project in telemetry run IDs (e.g., run-20260520T...-demeter-lulcc-a1b2c3d4). Use lowercase with hyphens.
- default_storage
  Base URI for artifact output. Can be a local path, S3 URI (s3://bucket/prefix/), or GCS URI (gs://bucket/prefix/). Providers that support remote storage will use this as the destination for task outputs.
- local_cache
  Override for SCALABLE_CACHE_DIR. The manifest value takes precedence over the environment variable, which itself takes precedence over the compiled default (./cache).
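The precedence described for local_cache (manifest value, then SCALABLE_CACHE_DIR, then the compiled default) can be written out as a short lookup. This is a minimal sketch; the helper name resolve_cache_dir is hypothetical and the library may implement the lookup differently.

```python
import os

COMPILED_DEFAULT = "./cache"


def resolve_cache_dir(manifest: dict, environ=os.environ) -> str:
    """Pick the cache directory: manifest > environment > default."""
    project = manifest.get("project", {})
    if project.get("local_cache"):
        return project["local_cache"]
    if environ.get("SCALABLE_CACHE_DIR"):
        return environ["SCALABLE_CACHE_DIR"]
    return COMPILED_DEFAULT
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.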
Step 3: Defining Targets
Targets are named execution environments. You can define as many as you need:
targets:
  local:
    provider: local
    max_workers: 4
    threads_per_worker: 2
    processes: false
    containers: none
  hpc:
    provider: slurm
    queue: batch
    account: GCIMS
    walltime: "04:00:00"
    interface: ib0
  aws:
    provider: aws
    region: us-east-1
    cluster_type: fargate
    instance_type: m5.xlarge
    worker_cpu: 4096
    worker_mem: 16384
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
    adaptive:
      minimum: 1
      maximum: 10
Each target has one required key — provider — that maps to a registered
provider class. All other keys are provider-specific options:
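| Provider | Key Options (as used in the example above) |
|---|---|
| local | max_workers, threads_per_worker, processes, containers |
| slurm | queue, account, walltime, interface |
| aws | region, cluster_type, instance_type, worker_cpu, worker_mem, image, adaptive |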
Why multiple targets? A single manifest can describe your entire promotion
path: develop locally → validate on HPC → deploy to cloud. The --target
flag (or SCALABLE_TARGET env var) selects which environment to activate.
Step 4: Components in Detail
Components are resource profiles for your workloads:
components:
  demeter:
    image: ghcr.io/jgcri/demeter:2.0.1
    runtime: apptainer
    cpus: 8
    memory: 32G
    mounts:
      ./demeter_data: /data
      /shared/outputs: /outputs
    env:
      DEMETER_DATA: /data
    tags: [lulcc, downscaling, gcam]
    preload_script: ./scripts/demeter_preload.sh
  postprocess:
    cpus: 2
    memory: 4G
    tags: [analysis]
Let’s break down each key:
- image
  Container image URI. Used by providers that support containerized workers (Slurm with Apptainer, Kubernetes, cloud). Omit for bare-metal local runs.
- runtime
  Container runtime hint (apptainer, docker). Providers use this to determine how to pull and launch the image.
- cpus
  CPU count allocated per worker in this component group. Maps to Dask worker resource annotations and scheduler affinity.
- memory
  Memory allocation string (e.g., 32G, 512M). Parsed by dask.utils.parse_bytes.
- mounts
  Volume mount mappings (host path → container path). Only meaningful for containerized providers.
- env
  Environment variables injected into the worker process. Useful for configuring model data paths, API keys (prefer ${VAR} references over literals), etc.
- tags
  Arbitrary labels for grouping and filtering. Tags propagate to telemetry events and can be used by the resource advisor for per-tag recommendations.
- preload_script
  Shell script executed before the Dask worker process starts. Useful for activating conda environments, loading modules, or mounting FUSE filesystems.
Step 5: Task Bindings
Tasks bind named work units to components:
tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir
  aggregate_results:
    component: postprocess
    cache: true
- component
  Must reference a key in the components map. This determines which workers can execute the task and what resources are reserved.
- cache
  When true, results of functions submitted under this task are eligible for the cacheable() disk cache. Cache hits skip execution entirely on subsequent runs.
- outputs
  Declares expected output artifacts and their types (file or dir). The artifact store can persist these to remote storage when project.default_storage is configured.
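The requirement that every task reference an existing component can be checked with a few lines of Python over the parsed manifest. The helper name dangling_task_refs is hypothetical, and the real validator may report this condition differently.

```python
def dangling_task_refs(manifest: dict) -> dict:
    """Map each task whose component is undefined to the name it asked for."""
    components = manifest.get("components", {})
    return {
        task: spec.get("component")
        for task, spec in manifest.get("tasks", {}).items()
        if spec.get("component") not in components
    }


manifest = {
    "components": {"demeter": {"cpus": 8}, "postprocess": {"cpus": 2}},
    "tasks": {
        "run_demeter_scenario": {"component": "demeter", "cache": True},
        # Typo on purpose: "postproces" is not a defined component.
        "aggregate_results": {"component": "postproces", "cache": True},
    },
}
broken = dangling_task_refs(manifest)
```

Here broken contains only the aggregate_results task, which makes the typo easy to spot before a run is submitted.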
Step 6: Environment Variable Expansion
Manifests support ${VAR} and ${VAR:-default} syntax for portability:
project:
  name: ${PROJECT_NAME:-energy-demo}
  default_storage: ${ARTIFACT_BUCKET:-./outputs}
targets:
  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    execution_role_arn: ${EXECUTION_ROLE_ARN}
Expansion rules:
- ${VAR}: replaced by the value of the environment variable. If unset, the parser raises ManifestParseError.
- ${VAR:-default}: replaced by the variable if set, otherwise uses the literal default value.
- Bare $HOME-style references are not expanded (to avoid ambiguity in mount paths). Always use curly braces.
This means you can commit scalable.yaml to version control without
embedding secrets or machine-specific paths:
export AWS_REGION=us-west-2
export EXECUTION_ROLE_ARN=arn:aws:iam::123456789:role/myRole
scalable validate ./scalable.yaml
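The expansion rules from this step can be sketched with a regular expression. This is an illustration under stated assumptions rather than the parser's actual code: the function name expand is hypothetical, and UnresolvedVariable stands in for ManifestParseError so the example runs without Scalable installed.

```python
import os
import re

# Matches ${NAME} or ${NAME:-default}. A bare $NAME never matches because
# the opening brace is mandatory.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


class UnresolvedVariable(ValueError):
    """Hypothetical stand-in for the parser's ManifestParseError."""


def expand(text: str, environ=os.environ) -> str:
    """Expand ${VAR} and ${VAR:-default} references in text."""

    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in environ:
            return environ[name]
        if default is not None:
            return default
        raise UnresolvedVariable(f"unresolved variable ${{{name}}}")

    return _VAR.sub(substitute, text)
```

For example, expanding region: ${AWS_REGION:-us-east-1} with AWS_REGION unset yields region: us-east-1, while a string containing $HOME/data is returned unchanged.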
Step 7: Overlays for Environment-Specific Tuning
Overlays let you define named configuration deltas that are merged onto the base manifest when a target references them:
targets:
  hpc:
    provider: slurm
    queue: batch
    walltime: "04:00:00"
    overlay: hpc-large
components:
  demeter:
    cpus: 4
    memory: 16G
overlays:
  hpc-large:
    components:
      demeter:
        cpus: 16
        memory: 64G
  hpc-debug:
    components:
      demeter:
        cpus: 2
        memory: 8G
When target hpc is selected, the hpc-large overlay is merged:
demeter.cpus becomes 16 and demeter.memory becomes 64G. The base values
serve as defaults for targets that don’t reference an overlay.
Design rationale: Overlays avoid manifest duplication. Instead of maintaining separate YAML files per environment, you express deltas declaratively. The merge is shallow per-component-key (not deep recursive), keeping behavior predictable.
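The shallow per-key merge can be pictured as a dictionary update on each component. The sketch below is one reading of that behavior, not Scalable's implementation: the function name apply_overlay is hypothetical, and it assumes that a key present in the overlay replaces the base key wholesale, including nested maps such as env.

```python
import copy


def apply_overlay(manifest: dict, target: str) -> dict:
    """Return a copy of manifest with the selected target's overlay merged in."""
    merged = copy.deepcopy(manifest)
    overlay_name = merged["targets"][target].get("overlay")
    if overlay_name is None:
        return merged  # no overlay: base values act as defaults
    overlay = merged["overlays"][overlay_name]
    for name, delta in overlay.get("components", {}).items():
        # Shallow merge: each key in the delta replaces the base key.
        merged["components"].setdefault(name, {}).update(delta)
    return merged


base = {
    "targets": {
        "hpc": {"provider": "slurm", "overlay": "hpc-large"},
        "local": {"provider": "local"},
    },
    "components": {"demeter": {"cpus": 4, "memory": "16G", "tags": ["lulcc"]}},
    "overlays": {
        "hpc-large": {"components": {"demeter": {"cpus": 16, "memory": "64G"}}},
    },
}
hpc = apply_overlay(base, "hpc")
local = apply_overlay(base, "local")
```

In this sketch hpc sees 16 CPUs and 64G for demeter while keeping the base tags, and local keeps the base values because its target names no overlay.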
You can also override target options via overlays:
overlays:
  cloud-dev:
    targets:
      aws:
        adaptive:
          minimum: 1
          maximum: 3
    components:
      demeter:
        cpus: 4
        memory: 16G
Step 8: Multi-Target Workflow Selection
At runtime you select a target via:
CLI:
scalable run ./scalable.yaml --target hpc --workflow workflow.py
Python:
from scalable import ScalableSession
session = ScalableSession.from_yaml("./scalable.yaml", target="hpc")
Environment variable:
export SCALABLE_TARGET=hpc
python workflow.py # Session auto-detects from env
The resolution order is: explicit target= argument → SCALABLE_TARGET
env var → error (no implicit default target).
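The resolution order above amounts to a two-step lookup followed by an error. A minimal sketch, assuming a hypothetical helper named resolve_target (the library's own function and exception type may differ):

```python
import os


def resolve_target(explicit=None, environ=os.environ) -> str:
    """Explicit argument first, then SCALABLE_TARGET, otherwise fail."""
    if explicit:
        return explicit
    from_env = environ.get("SCALABLE_TARGET")
    if from_env:
        return from_env
    # No implicit default target: refuse to guess.
    raise ValueError("no target selected: pass target= or set SCALABLE_TARGET")
```

Failing loudly when nothing is selected avoids accidentally running a production workload on whichever target happens to be listed first.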
Step 9: Programmatic Validation
You can validate manifests from Python for CI/CD integration:
from scalable import ScalableSession

session = ScalableSession.from_yaml("./scalable.yaml", target="local")
report = session.validate()
if report.ok:
    print("Manifest is valid")
else:
    for issue in report.errors:
        print(f"ERROR [{issue.code}] {issue.path}: {issue.message}")
    for issue in report.warnings:
        print(f"WARN [{issue.code}] {issue.path}: {issue.message}")
Common error codes:
| Code | Meaning |
|---|---|
| | A required key is missing. |
| | Unrecognized top-level key (probable typo). |
| | Unrecognized key inside a component definition. |
| | A task references a component that does not exist. |
| | The target’s provider is not installed or registered. |
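For CI/CD use, the validation report usually needs to become a process exit code. The sketch below assumes only the report shape used in the snippet in this step (ok, errors, warnings, and issues with code, path, and message); the helper name ci_exit_code is hypothetical, and the stand-in report uses a placeholder code rather than a real Scalable error code.

```python
import sys
from types import SimpleNamespace


def ci_exit_code(report) -> int:
    """Print every issue to stderr; return 0 for a clean report, else 1."""
    for level, issues in (("ERROR", report.errors), ("WARN", report.warnings)):
        for issue in issues:
            print(f"{level} [{issue.code}] {issue.path}: {issue.message}",
                  file=sys.stderr)
    return 0 if report.ok else 1


# Stand-in report for demonstration. "EXAMPLE" is a placeholder, not a
# real Scalable error code.
demo_report = SimpleNamespace(
    ok=False,
    errors=[SimpleNamespace(code="EXAMPLE", path="tasks.aggregate.component",
                            message="unknown component 'postproces'")],
    warnings=[],
)
exit_code = ci_exit_code(demo_report)
```

In a pipeline script, the same helper could be called on the object returned by session.validate() and its result passed to sys.exit so that a failing manifest fails the build.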
Step 10: Complete Multi-Target Manifest
Here is a production-ready manifest combining all concepts:
version: 1
project:
  name: demeter-lulcc
  default_storage: ${ARTIFACT_STORAGE:-./outputs}
targets:
  local:
    provider: local
    max_workers: 4
    threads_per_worker: 2
    processes: false
    containers: none
  hpc:
    provider: slurm
    queue: batch
    account: ${SLURM_ACCOUNT}
    walltime: "08:00:00"
    interface: ib0
    overlay: hpc-prod
  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    cluster_type: fargate
    worker_cpu: 4096
    worker_mem: 16384
    image: ${ECR_IMAGE}
    execution_role_arn: ${EXECUTION_ROLE_ARN}
    task_role_arn: ${TASK_ROLE_ARN}
    # Quoted so the braces are valid inside a YAML flow sequence.
    subnets: ["${SUBNET_A}", "${SUBNET_B}"]
    security_groups: ["${SG_ID}"]
    adaptive:
      minimum: 2
      maximum: 20
components:
  demeter:
    image: ghcr.io/jgcri/demeter:2.0.1
    cpus: 4
    memory: 16G
    tags: [lulcc, downscaling, gcam]
    env:
      DEMETER_DATA: /data
  postprocess:
    cpus: 2
    memory: 8G
    tags: [analysis]
tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir
  aggregate:
    component: postprocess
    cache: true
overlays:
  hpc-prod:
    components:
      demeter:
        cpus: 16
        memory: 64G
      postprocess:
        cpus: 8
        memory: 32G
  hpc-debug:
    components:
      demeter:
        cpus: 2
        memory: 4G
      postprocess:
        cpus: 1
        memory: 2G
Troubleshooting
- “ManifestParseError: unresolved variable ${VAR}”
  You used ${VAR} without a default and the variable is not set in the environment. Either export it or use ${VAR:-fallback}.
- “ManifestSchemaError: unknown component key ‘gpu’”
  Only recognized component keys are allowed. GPU scheduling is expressed via the provider-specific target options, not component definitions.
- Overlay changes not taking effect
  Ensure the target block includes overlay: <name> and that the overlay name exactly matches a key under overlays:. Overlay merging only applies to the selected target.
- “version: 2” rejected
  Only schema version 1 is currently supported. The version field exists for future-proofing.
Next Steps
Tutorial 3: Scaling Strategies with Providers — Learn how different providers scale workers and how to choose between them.
Tutorial 4: Performance Optimization and Caching — Cache expensive computations to accelerate iterative development.
Tutorial 5: Cloud Integration with AWS and GCP — Configure AWS and GCP targets with real credentials and IAM roles.