Beginner Tutorial 2: Understanding the Manifest System¶
The Big Picture¶
In the previous tutorial, you wrote a simple scalable.yaml file. But what
is a manifest, really? Why does Scalable use one? And what’s this
“declarative programming” idea all about?
This tutorial takes you deep into the manifest system — not just the syntax, but the philosophy behind it. You’ll understand why configuration-as-code exists, how YAML works, what schemas enforce, and how overlays let you customize behavior across different environments.
💡 Key Concept: Configuration as Code
Configuration as code means storing your system’s settings in version- controlled text files rather than clicking through GUIs or typing ad-hoc commands.
Benefits:
Reproducibility — anyone can recreate your exact setup
History — Git shows who changed what and when
Review — teammates can review config changes like code changes
Automation — CI/CD pipelines can validate and deploy configs
Scalable’s manifest is configuration as code: your entire workflow setup lives in a single YAML file that you check into version control.
What You Will Learn¶
By the end of this tutorial you will:
Understand declarative programming deeply and why it matters.
Read and write YAML confidently (indentation, data types, references).
Know every section of a
scalable.yamlmanifest and its purpose.Use environment variables in manifests for portability.
Define multiple targets for different environments.
Apply overlays to customize settings per deployment.
Validate manifests and interpret error messages.
Prerequisites¶
Completed Beginner Tutorial 1: Your First Workflow.
Scalable installed (
pip install scalable).A text editor and terminal.
Key Concepts Explained¶
💡 Key Concept: Declarative Programming (Deep Dive)
In Beginner Tutorial 1: Your First Workflow, we introduced declarative vs. imperative. Let’s go deeper with a real example.
Imperative approach to setting up 4 workers:
# Pseudocode: imperative style
for i in range(4):
worker = start_process()
worker.set_memory("4G")
worker.set_cpus(2)
worker.connect_to_scheduler(scheduler_address)
if not worker.is_healthy():
worker.restart()
Declarative approach (what Scalable uses):
targets:
local:
provider: local
max_workers: 4
components:
analysis:
cpus: 2
memory: 4G
The declarative version doesn’t say how to start workers — it says what state you want. Scalable’s runtime figures out the “how.”
Why is declarative better here?
Portability — The same declaration works on your laptop or a 1000-node cluster. The “how” differs, but the “what” doesn’t.
Idempotency — You can apply the same manifest repeatedly; the system converges to the desired state without duplicating resources.
Separation of concerns — You (the scientist) declare what you need; the platform (Scalable) handles infrastructure details.
💡 Key Concept: YAML Syntax
YAML is a data serialization format designed to be human-readable. Here are the essential rules:
Indentation matters (use spaces, NEVER tabs):
parent:
child: value # 2-space indent = child of "parent"
another: value2
Data types are inferred:
string_value: hello # String
number_value: 42 # Integer
float_value: 3.14 # Float
boolean_value: true # Boolean (true/false)
quoted_string: "04:00:00" # Quoted to prevent time interpretation
null_value: null # Null/None
Lists use dashes:
fruits:
- apple
- banana
- cherry
Nested maps:
targets:
local:
provider: local
max_workers: 2
Comments start with #.
Common mistakes:
Using tabs instead of spaces (causes parse errors)
Inconsistent indentation (2 spaces is conventional)
Forgetting to quote strings that look like other types (
version: 1is a number,version: "1"is a string)
💡 Key Concept: Schema
A schema defines the valid structure for data. Think of it like a form with labeled fields — some fields are required, some are optional, and each has rules about what values are acceptable.
For Scalable’s manifest:
versionis required and must be an integerproject.nameis required and must be a stringtargetsmust be a map where each value has aproviderkeycomponentsmust havecpusandmemorykeys
The schema catches errors before you run (fail fast), saving you from discovering problems 30 minutes into an expensive cloud run.
💡 Key Concept: Environment Variables
Environment variables are system-level settings available to all programs. They store configuration that varies between machines or users:
# Setting an environment variable
export AWS_REGION=us-east-1
# Reading it in a program
echo $AWS_REGION # Prints: us-east-1
In Scalable manifests, you can reference them with ${VAR_NAME}
syntax. This keeps secrets (API keys, passwords) out of your config
files and makes manifests portable across environments.
💡 Key Concept: Single Source of Truth
The single source of truth (SSOT) principle means there’s exactly one authoritative place where a piece of information lives. If you need to change something, you change it in one place, and everything else picks up the change.
The manifest is Scalable’s SSOT for workflow configuration. You don’t need to remember “I set max_workers in the CLI, memory in an env var, and the image in a script.” It’s all in one file.
Step 1: The Complete Manifest Structure¶
Every scalable.yaml manifest has this top-level structure:
version: 1 # Required: schema version
project: { ... } # Required: project metadata
targets: { ... } # Required: where code runs
components: { ... } # Required: resource profiles
tasks: { ... } # Required: work unit definitions
overlays: { ... } # Optional: environment-specific overrides
Let’s explore each section in depth.
Step 2: The Project Block¶
project:
name: demeter-lulcc
default_storage: ./outputs
local_cache: ./cache
What each key does:
nameA human-readable identifier for your project. It appears in:
Telemetry run IDs (e.g.,
run-20260520T...-demeter-lulcc-a1b2c3d4)Log messages
Artifact storage paths
Use lowercase with hyphens (
my-project, notMy Project).default_storageWhere output artifacts are saved. Can be:
A local path:
./outputsAn S3 URI:
s3://my-bucket/scalable-runs/A GCS URI:
gs://my-bucket/scalable-runs/
local_cacheWhere cached results are stored locally. Defaults to
./cache. Can also be set via theSCALABLE_CACHE_DIRenvironment variable (the manifest value takes precedence).
Step 3: Defining Targets¶
Targets answer the question: “Where does my code run?”
targets:
local:
provider: local
max_workers: 4
threads_per_worker: 2
processes: false
containers: none
hpc:
provider: slurm
queue: batch
account: GCIMS
walltime: "04:00:00"
interface: ib0
aws:
provider: aws
region: us-east-1
cluster_type: fargate
instance_type: m5.xlarge
worker_cpu: 4096
worker_mem: 16384
image: 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
adaptive:
minimum: 1
maximum: 10
💡 Key Concept: Provider Pattern
A provider is an abstraction over an execution backend. It’s like an electrical outlet standard — you can plug any appliance into any outlet because they share a common interface.
Scalable’s providers share a common interface but work differently internally:
local— spawns workers on your machineslurm— submits jobs to an HPC scheduleraws— launches containers on AWS Fargate/EC2kubernetes— creates pods in a K8s cluster
Why multiple targets in one file? A single manifest can describe your entire promotion path:
Develop locally (
--target local)Validate on HPC (
--target hpc)Deploy to cloud (
--target aws)
The --target flag (or SCALABLE_TARGET env var) selects which
environment to activate.
Key options by provider:
Provider |
Key Options |
|---|---|
|
|
|
|
|
|
|
|
Step 4: Components — Resource Profiles¶
Components define how much computational resources each piece of work needs:
components:
demeter:
image: ghcr.io/jgcri/demeter:2.0.1
runtime: apptainer
cpus: 8
memory: 32G
mounts:
./demeter_data: /data
/shared/outputs: /outputs
env:
DEMETER_DATA: /data
tags: [lulcc, downscaling, gcam]
postprocess:
cpus: 2
memory: 4G
tags: [analysis]
Why not just specify resources per task directly?
Separating components from tasks follows the DRY principle (Don’t Repeat Yourself). If 20 tasks all need the same resources, you define the component once and reference it 20 times. Change the resource allocation in one place → all 20 tasks update.
Component keys explained:
cpusNumber of CPU cores allocated per worker. Maps to Dask worker resource annotations.
memoryMemory allocation (e.g.,
32G,512M,2T). Parsed using standard byte suffixes.image(optional)Container image URI for containerized providers. Ignored for bare-metal local runs.
runtime(optional)Container runtime hint:
apptainer(HPC) ordocker(cloud/local).mounts(optional)Volume mappings (host path → container path). Only meaningful for containerized runs.
env(optional)Environment variables injected into the worker process. Useful for model paths or configuration.
tags(optional)Labels for grouping and filtering. Appear in telemetry and can inform resource recommendations.
Step 5: Task Bindings¶
Tasks connect your Python functions to resource profiles:
tasks:
run_demeter_scenario:
component: demeter
aggregate_demeter_outputs:
component: postprocess
When you write Python code like:
client.submit(my_function, args, tag="demeter")
Scalable looks up the run_demeter_scenario task, finds it uses the demeter
component, and schedules it on a worker with 4 CPUs and 16G memory.
💡 Key Concept: Binding
Binding means creating a connection between two things. Here, we bind:
Task name → component (resource profile)
Python function → task name (at submit time)
This indirection lets you change resource allocations without touching your Python code, and vice versa.
Step 6: Environment Variable Expansion¶
Manifests support ${VAR} syntax for environment variables:
project:
name: demeter-lulcc
default_storage: s3://${S3_BUCKET}/scalable-runs/
targets:
aws:
provider: aws
region: ${AWS_REGION:-us-east-1}
The ${AWS_REGION:-us-east-1} syntax means “use the AWS_REGION
environment variable if set, otherwise default to us-east-1.”
Why use environment variables instead of hardcoding?
Security — Keep secrets (API keys, bucket names) out of Git
Portability — Same manifest works across team members and CI/CD
12-Factor compliance — Configuration should come from the environment (a best practice from the Twelve-Factor App methodology)
Step 7: Overlays — Environment-Specific Customization¶
💡 Key Concept: Overlays
An overlay is a set of patches applied on top of a base configuration. Think of it like Photoshop layers — you have a base image (your manifest) and layers that add or modify specific parts.
Why overlays? You might want:
Development: 2 workers, 1G memory, local storage
Production: 64 workers, 32G memory, S3 storage
CI testing: 1 worker, minimal memory, ephemeral storage
Rather than maintaining 3 separate manifests (which drift apart over time), you maintain ONE base manifest + overlays for differences.
# In the manifest itself
overlays:
production:
targets:
hpc:
max_workers: 64
components:
demeter:
memory: 64G
ci:
targets:
local:
max_workers: 1
components:
demeter:
memory: 2G
cpus: 1
To apply an overlay:
scalable run ./scalable.yaml --target hpc --overlay production
The overlay merges on top of the base configuration — only the keys specified in the overlay are changed; everything else stays the same.
💡 Key Concept: Deep Merge
Deep merge means overlays are applied recursively. If your overlay
specifies components.demeter.memory: 64G, it only changes that one
field — all other demeter settings (cpus, image, mounts)
remain as defined in the base manifest.
This is different from a shallow merge where replacing any key in a section would replace the entire section.
Step 8: Programmatic Validation¶
You’ve used scalable validate from the CLI. You can also validate from
Python:
from scalable.manifest.parser import load_manifest
from scalable.manifest.validate import validate_manifest
# Parse the YAML into a structured object
manifest = load_manifest("./scalable.yaml")
# Validate returns a list of errors (empty = valid)
report = validate_manifest(manifest)
if not report.ok:
for issue in report.errors:
print(f"ERROR: {err}")
else:
print("✓ Manifest is valid")
💡 Key Concept: Parse vs. Validate
These are two distinct steps:
Parsing = reading the YAML text and converting it to a Python data structure (dict). This catches syntax errors (bad indentation, invalid YAML).
Validating = checking that the parsed data meets the schema rules. This catches semantic errors (missing required fields, invalid references, type mismatches).
You need both: a YAML file can be syntactically valid but semantically wrong (like a grammatically correct sentence that makes no sense).
Common Questions¶
Q: Can I split my manifest into multiple files?
Not directly — the manifest is a single source of truth. But overlays let you customize per environment, and environment variables let you inject external values. This keeps the manifest self-contained and auditable.
Q: What if I make a typo in a component key?
The validator catches it. Unknown keys inside components are rejected
(strict schema). Unknown keys inside targets are passed through to the
provider (forward compatibility), but invalid provider-specific keys will
fail at runtime with a clear error message.
Q: YAML vs. JSON vs. TOML — why YAML?
JSON — No comments, verbose (lots of brackets/braces), hard to edit by hand
TOML — Good for flat config, awkward for deeply nested structures
YAML — Human-readable, supports comments, good for nested data, widely used in DevOps (Docker Compose, Kubernetes, GitHub Actions)
The downside of YAML (indentation sensitivity) is mitigated by validation.
Q: What’s the difference between ``project.default_storage`` and ``project.local_cache``?
default_storage= where outputs go (can be remote: S3, GCS)local_cache= where cached intermediate results are stored (always local, for speed)
What You Learned¶
Term |
Definition |
|---|---|
Declarative Programming |
Describing what you want rather than how to achieve it |
YAML |
Human-readable data serialization format using indentation |
Schema |
Rules defining valid structure for data |
Environment Variables |
System-level key-value settings available to programs |
Single Source of Truth |
One authoritative location for configuration |
Provider |
Abstraction over an execution backend |
Overlay |
Patches applied on top of base configuration |
Deep Merge |
Recursive combination where only specified keys are overridden |
Binding |
Connecting a task name to a component (resource profile) |
Parsing |
Converting text (YAML) into structured data (Python dict) |
Validation |
Checking that structured data meets schema rules |
Configuration as Code |
Storing settings in version-controlled text files |
Next Steps¶
You now understand how Scalable’s manifest system works and the philosophy behind declarative configuration.
Next beginner tutorial: Beginner Tutorial 3: How Distributed Computing Works — how distributed computing actually works
Standard tutorial: Tutorial 2: Mastering the Manifest System — advanced manifest patterns and production deployment
Try it: Add a second target (copy the
localtarget, name itdev, and changemax_workersto 1). Validate it. Try adding an overlay that doubles the memory for production.