.. _beginner_manifest_system:

======================================================
Beginner Tutorial 2: Understanding the Manifest System
======================================================

The Big Picture
----------------

In the previous tutorial, you wrote a simple ``scalable.yaml`` file. But what
*is* a manifest, really? Why does Scalable use one? And what's this
"declarative programming" idea all about?

This tutorial takes you deep into the manifest system — not just the syntax,
but the *philosophy* behind it. You'll understand why configuration-as-code
exists, how YAML works, what schemas enforce, and how overlays let you
customize behavior across different environments.

.. admonition:: 💡 Key Concept: Configuration as Code
   :class: tip

   **Configuration as code** means storing your system's settings in
   version-controlled text files rather than clicking through GUIs or typing
   ad-hoc commands. Benefits:

   * **Reproducibility** — anyone can recreate your exact setup
   * **History** — Git shows who changed what and when
   * **Review** — teammates can review config changes like code changes
   * **Automation** — CI/CD pipelines can validate and deploy configs

   Scalable's manifest is configuration as code: your entire workflow setup
   lives in a single YAML file that you check into version control.

What You Will Learn
--------------------

By the end of this tutorial you will:

* Understand declarative programming deeply and why it matters.
* Read and write YAML confidently (indentation, data types, references).
* Know every section of a ``scalable.yaml`` manifest and its purpose.
* Use environment variables in manifests for portability.
* Define multiple targets for different environments.
* Apply overlays to customize settings per deployment.
* Validate manifests and interpret error messages.

Prerequisites
--------------

* Completed :ref:`beginner_getting_started`.
* Scalable installed (``pip install scalable``).
* A text editor and terminal.

Key Concepts Explained
-----------------------

.. admonition:: 💡 Key Concept: Declarative Programming (Deep Dive)
   :class: tip

   In :ref:`beginner_getting_started`, we introduced declarative vs.
   imperative. Let's go deeper with a real example.

   **Imperative approach** to setting up 4 workers:

   .. code-block:: python

      # Pseudocode: imperative style
      for i in range(4):
          worker = start_process()
          worker.set_memory("4G")
          worker.set_cpus(2)
          worker.connect_to_scheduler(scheduler_address)
          if not worker.is_healthy():
              worker.restart()

   **Declarative approach** (what Scalable uses):

   .. code-block:: yaml

      targets:
        local:
          provider: local
          max_workers: 4
      components:
        analysis:
          cpus: 2
          memory: 4G

   The declarative version doesn't say *how* to start workers — it says
   *what state you want*. Scalable's runtime figures out the "how."

   **Why is declarative better here?**

   1. **Portability** — The same declaration works on your laptop or a
      1000-node cluster. The "how" differs, but the "what" doesn't.
   2. **Idempotency** — You can apply the same manifest repeatedly; the
      system converges to the desired state without duplicating resources.
   3. **Separation of concerns** — You (the scientist) declare what you
      need; the platform (Scalable) handles infrastructure details.

.. admonition:: 💡 Key Concept: YAML Syntax
   :class: tip

   YAML is a data serialization format designed to be human-readable. Here
   are the essential rules:

   **Indentation matters** (use spaces, NEVER tabs):

   .. code-block:: yaml

      parent:
        child: value        # 2-space indent = child of "parent"
        another: value2

   **Data types** are inferred:

   .. code-block:: yaml

      string_value: hello          # String
      number_value: 42             # Integer
      float_value: 3.14            # Float
      boolean_value: true          # Boolean (true/false)
      quoted_string: "04:00:00"    # Quoted to prevent time interpretation
      null_value: null             # Null/None

   **Lists** use dashes:

   .. code-block:: yaml

      fruits:
        - apple
        - banana
        - cherry

   **Nested maps**:

   .. code-block:: yaml

      targets:
        local:
          provider: local
          max_workers: 2

   **Comments** start with ``#``.

   **Common mistakes:**

   * Using tabs instead of spaces (causes parse errors)
   * Inconsistent indentation (2 spaces is conventional)
   * Forgetting to quote strings that look like other types
     (``version: 1`` is a number, ``version: "1"`` is a string)

.. admonition:: 💡 Key Concept: Schema
   :class: tip

   A **schema** defines the valid structure for data. Think of it like a
   form with labeled fields — some fields are required, some are optional,
   and each has rules about what values are acceptable.

   For Scalable's manifest:

   * ``version`` is required and must be an integer
   * ``project.name`` is required and must be a string
   * ``targets`` must be a map where each value has a ``provider`` key
   * ``components`` must have ``cpus`` and ``memory`` keys

   The schema catches errors *before* you run (fail fast), saving you from
   discovering problems 30 minutes into an expensive cloud run.

.. admonition:: 💡 Key Concept: Environment Variables
   :class: tip

   **Environment variables** are system-level settings available to all
   programs. They store configuration that varies between machines or
   users:

   .. code-block:: bash

      # Setting an environment variable
      export AWS_REGION=us-east-1

      # Reading it in a program
      echo $AWS_REGION  # Prints: us-east-1

   In Scalable manifests, you can reference them with ``${VAR_NAME}``
   syntax. This keeps secrets (API keys, passwords) out of your config
   files and makes manifests portable across environments.

.. admonition:: 💡 Key Concept: Single Source of Truth
   :class: tip

   The **single source of truth** (SSOT) principle means there's exactly
   one authoritative place where a piece of information lives. If you need
   to change something, you change it in one place, and everything else
   picks up the change.

   The manifest is Scalable's SSOT for workflow configuration. You don't
   need to remember "I set max_workers in the CLI, memory in an env var,
   and the image in a script."
   It's all in one file.

Step 1: The Complete Manifest Structure
-----------------------------------------

Every ``scalable.yaml`` manifest has this top-level structure:

.. code-block:: yaml

   version: 1              # Required: schema version
   project: { ... }        # Required: project metadata
   targets: { ... }        # Required: where code runs
   components: { ... }     # Required: resource profiles
   tasks: { ... }          # Required: work unit definitions
   overlays: { ... }       # Optional: environment-specific overrides

Let's explore each section in depth.

Step 2: The Project Block
---------------------------

.. code-block:: yaml

   project:
     name: demeter-lulcc
     default_storage: ./outputs
     local_cache: ./cache

**What each key does:**

``name``
   A human-readable identifier for your project. It appears in:

   * Telemetry run IDs (e.g., ``run-20260520T...-demeter-lulcc-a1b2c3d4``)
   * Log messages
   * Artifact storage paths

   Use lowercase with hyphens (``my-project``, not ``My Project``).

``default_storage``
   Where output artifacts are saved. Can be:

   * A local path: ``./outputs``
   * An S3 URI: ``s3://my-bucket/scalable-runs/``
   * A GCS URI: ``gs://my-bucket/scalable-runs/``

``local_cache``
   Where cached results are stored locally. Defaults to ``./cache``. Can
   also be set via the ``SCALABLE_CACHE_DIR`` environment variable (the
   manifest value takes precedence).

Step 3: Defining Targets
--------------------------

Targets answer the question: **"Where does my code run?"**

.. code-block:: yaml

   targets:
     local:
       provider: local
       max_workers: 4
       threads_per_worker: 2
       processes: false
       containers: none

     hpc:
       provider: slurm
       queue: batch
       account: GCIMS
       walltime: "04:00:00"
       interface: ib0

     aws:
       provider: aws
       region: us-east-1
       cluster_type: fargate
       instance_type: m5.xlarge
       worker_cpu: 4096
       worker_mem: 16384
       image: 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
       adaptive:
         minimum: 1
         maximum: 10

.. admonition:: 💡 Key Concept: Provider Pattern
   :class: tip

   A **provider** is an abstraction over an execution backend.
   It's like an electrical outlet standard — you can plug any appliance
   into any outlet because they share a common interface.

   Scalable's providers share a common interface but work differently
   internally:

   * ``local`` — spawns workers on your machine
   * ``slurm`` — submits jobs to an HPC scheduler
   * ``aws`` — launches containers on AWS Fargate/EC2
   * ``kubernetes`` — creates pods in a K8s cluster

**Why multiple targets in one file?**

A single manifest can describe your entire promotion path:

1. Develop locally (``--target local``)
2. Validate on HPC (``--target hpc``)
3. Deploy to cloud (``--target aws``)

The ``--target`` flag (or ``SCALABLE_TARGET`` env var) selects which
environment to activate.

**Key options by provider:**

.. list-table::
   :header-rows: 1
   :widths: 15 85

   * - Provider
     - Key Options
   * - ``local``
     - ``max_workers``, ``threads_per_worker``, ``processes``, ``containers``
   * - ``slurm``
     - ``queue``, ``account``, ``walltime``, ``interface``
   * - ``aws``
     - ``region``, ``cluster_type``, ``instance_type``, ``worker_cpu``,
       ``worker_mem``, ``image``, ``adaptive``
   * - ``kubernetes``
     - ``namespace``, ``image``, ``adaptive``, ``overlay``

Step 4: Components — Resource Profiles
----------------------------------------

Components define the computational resources each piece of work needs:

.. code-block:: yaml

   components:
     demeter:
       image: ghcr.io/jgcri/demeter:2.0.1
       runtime: apptainer
       cpus: 8
       memory: 32G
       mounts:
         ./demeter_data: /data
         /shared/outputs: /outputs
       env:
         DEMETER_DATA: /data
       tags: [lulcc, downscaling, gcam]

     postprocess:
       cpus: 2
       memory: 4G
       tags: [analysis]

.. admonition:: Why not just specify resources per task directly?
   :class: hint

   Separating components from tasks follows the **DRY principle** (Don't
   Repeat Yourself). If 20 tasks all need the same resources, you define
   the component once and reference it 20 times. Change the resource
   allocation in one place → all 20 tasks update.
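
The indirection described in the hint above can be sketched with plain Python
dictionaries. This is only an illustration, not Scalable's implementation:
``resolve_resources`` is a hypothetical helper, and the ``tasks`` mapping uses
placeholder task names that point at the two components defined in this step.

.. code-block:: python

   # Illustrative only: plain dicts standing in for a parsed manifest.
   components = {
       "demeter": {"cpus": 8, "memory": "32G"},
       "postprocess": {"cpus": 2, "memory": "4G"},
   }

   # Placeholder task names; each task refers to a component by name.
   tasks = {
       "run_demeter_scenario": {"component": "demeter"},
       "aggregate_demeter_outputs": {"component": "postprocess"},
   }


   def resolve_resources(task_name):
       """Hypothetical helper: follow a task's component reference."""
       component_name = tasks[task_name]["component"]
       return components[component_name]


   print(resolve_resources("run_demeter_scenario"))

   # Editing the profile once changes what every referencing task resolves to.
   components["demeter"]["memory"] = "64G"
   print(resolve_resources("run_demeter_scenario")["memory"])

Because tasks store only a component name, the resource numbers live in one
place, which is the property the DRY argument relies on.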

**Component keys explained:**

``cpus``
   Number of CPU cores allocated per worker. Maps to Dask worker resource
   annotations.

``memory``
   Memory allocation (e.g., ``32G``, ``512M``, ``2T``). Parsed using
   standard byte suffixes.

``image`` (optional)
   Container image URI for containerized providers. Ignored for bare-metal
   local runs.

``runtime`` (optional)
   Container runtime hint: ``apptainer`` (HPC) or ``docker`` (cloud/local).

``mounts`` (optional)
   Volume mappings (host path → container path). Only meaningful for
   containerized runs.

``env`` (optional)
   Environment variables injected into the worker process. Useful for
   model paths or configuration.

``tags`` (optional)
   Labels for grouping and filtering. Appear in telemetry and can inform
   resource recommendations.

Step 5: Task Bindings
-----------------------

Tasks connect your Python functions to resource profiles:

.. code-block:: yaml

   tasks:
     run_demeter_scenario:
       component: demeter
     aggregate_demeter_outputs:
       component: postprocess

When you write Python code like:

.. code-block:: python

   client.submit(my_function, args, tag="demeter")

Scalable looks up the ``run_demeter_scenario`` task, finds it uses the
``demeter`` component, and schedules it on a worker with 8 CPUs and 32G
memory (the values declared for ``demeter`` in Step 4).

.. admonition:: 💡 Key Concept: Binding
   :class: tip

   **Binding** means creating a connection between two things. Here, we
   bind:

   * Task name → component (resource profile)
   * Python function → task name (at submit time)

   This indirection lets you change resource allocations without touching
   your Python code, and vice versa.

Step 6: Environment Variable Expansion
----------------------------------------

Manifests support ``${VAR}`` syntax for environment variables:

.. code-block:: yaml

   project:
     name: demeter-lulcc
     default_storage: s3://${S3_BUCKET}/scalable-runs/

   targets:
     aws:
       provider: aws
       region: ${AWS_REGION:-us-east-1}

The ``${AWS_REGION:-us-east-1}`` syntax means "use the ``AWS_REGION``
environment variable if set, otherwise default to ``us-east-1``."

.. admonition:: Why use environment variables instead of hardcoding?
   :class: hint

   * **Security** — Keep secrets (API keys, bucket names) out of Git
   * **Portability** — Same manifest works across team members and CI/CD
   * **12-Factor compliance** — Configuration should come from the
     environment (a best practice from the Twelve-Factor App methodology)

Step 7: Overlays — Environment-Specific Customization
------------------------------------------------------

.. admonition:: 💡 Key Concept: Overlays
   :class: tip

   An **overlay** is a set of patches applied on top of a base
   configuration. Think of it like Photoshop layers — you have a base
   image (your manifest) and layers that add or modify specific parts.

   **Why overlays?** You might want:

   * Development: 2 workers, 1G memory, local storage
   * Production: 64 workers, 32G memory, S3 storage
   * CI testing: 1 worker, minimal memory, ephemeral storage

   Rather than maintaining 3 separate manifests (which drift apart over
   time), you maintain ONE base manifest + overlays for differences.

.. code-block:: yaml

   # In the manifest itself
   overlays:
     production:
       targets:
         hpc:
           max_workers: 64
       components:
         demeter:
           memory: 64G

     ci:
       targets:
         local:
           max_workers: 1
       components:
         demeter:
           memory: 2G
           cpus: 1

To apply an overlay:

.. code-block:: bash

   scalable run ./scalable.yaml --target hpc --overlay production

The overlay merges on top of the base configuration — only the keys
specified in the overlay are changed; everything else stays the same.

.. admonition:: 💡 Key Concept: Deep Merge
   :class: tip

   **Deep merge** means overlays are applied recursively.
   If your overlay specifies ``components.demeter.memory: 64G``, it only
   changes that one field — all other ``demeter`` settings (``cpus``,
   ``image``, ``mounts``) remain as defined in the base manifest.

   This is different from a **shallow merge** where replacing any key in a
   section would replace the entire section.

Step 8: Programmatic Validation
---------------------------------

You've used ``scalable validate`` from the CLI. You can also validate from
Python:

.. code-block:: python

   from scalable.manifest.parser import load_manifest
   from scalable.manifest.validate import validate_manifest

   # Parse the YAML into a structured object
   manifest = load_manifest("./scalable.yaml")

   # Validate returns a report; report.ok is true when no errors were found
   report = validate_manifest(manifest)

   if not report.ok:
       for issue in report.errors:
           print(f"ERROR: {issue}")
   else:
       print("✓ Manifest is valid")

.. admonition:: 💡 Key Concept: Parse vs. Validate
   :class: tip

   These are two distinct steps:

   1. **Parsing** = reading the YAML text and converting it to a Python
      data structure (dict). This catches syntax errors (bad indentation,
      invalid YAML).
   2. **Validating** = checking that the parsed data meets the schema
      rules. This catches semantic errors (missing required fields,
      invalid references, type mismatches).

   You need both: a YAML file can be syntactically valid but semantically
   wrong (like a grammatically correct sentence that makes no sense).

Common Questions
-----------------

**Q: Can I split my manifest into multiple files?**

Not directly — the manifest is a single source of truth. But overlays let
you customize per environment, and environment variables let you inject
external values. This keeps the manifest self-contained and auditable.

**Q: What if I make a typo in a component key?**

The validator catches it. Unknown keys inside ``components`` are rejected
(strict schema).
Unknown keys inside ``targets`` are passed through to the provider (forward
compatibility), but invalid provider-specific keys will fail at runtime with
a clear error message.

**Q: YAML vs. JSON vs. TOML — why YAML?**

* **JSON** — No comments, verbose (lots of brackets/braces), hard to edit
  by hand
* **TOML** — Good for flat config, awkward for deeply nested structures
* **YAML** — Human-readable, supports comments, good for nested data,
  widely used in DevOps (Docker Compose, Kubernetes, GitHub Actions)

The downside of YAML (indentation sensitivity) is mitigated by validation.

**Q: What's the difference between ``project.default_storage`` and ``project.local_cache``?**

* ``default_storage`` = where **outputs** go (can be remote: S3, GCS)
* ``local_cache`` = where **cached intermediate results** are stored
  (always local, for speed)

What You Learned
-----------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Term
     - Definition
   * - Declarative Programming
     - Describing *what* you want rather than *how* to achieve it
   * - YAML
     - Human-readable data serialization format using indentation
   * - Schema
     - Rules defining valid structure for data
   * - Environment Variables
     - System-level key-value settings available to programs
   * - Single Source of Truth
     - One authoritative location for configuration
   * - Provider
     - Abstraction over an execution backend
   * - Overlay
     - Patches applied on top of base configuration
   * - Deep Merge
     - Recursive combination where only specified keys are overridden
   * - Binding
     - Connecting a task name to a component (resource profile)
   * - Parsing
     - Converting text (YAML) into structured data (Python dict)
   * - Validation
     - Checking that structured data meets schema rules
   * - Configuration as Code
     - Storing settings in version-controlled text files

Next Steps
-----------

You now understand how Scalable's manifest system works and the philosophy
behind declarative configuration.

* **Next beginner tutorial:** :ref:`beginner_scaling_strategies` — how
  distributed computing actually works
* **Standard tutorial:** :ref:`tutorial_manifest_system` — advanced
  manifest patterns and production deployment
* **Try it:** Add a second target (copy the ``local`` target, name it
  ``dev``, and change ``max_workers`` to 1). Validate it. Try adding an
  overlay that doubles the memory for production.
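
Before trying the overlay exercise, it may help to see the deep-merge idea from
Step 7 in miniature. The sketch below is a generic recursive dictionary merge
written for illustration; ``deep_merge`` is a hypothetical function, not
Scalable's actual merge code, which may handle lists or null values
differently.

.. code-block:: python

   import copy


   def deep_merge(base, overlay):
       """Return a new dict where overlay values win and nested dicts merge."""
       merged = copy.deepcopy(base)
       for key, value in overlay.items():
           if isinstance(value, dict) and isinstance(merged.get(key), dict):
               # Both sides are maps: recurse so sibling keys are preserved.
               merged[key] = deep_merge(merged[key], value)
           else:
               # Scalars, lists, or new keys: the overlay value replaces the base.
               merged[key] = copy.deepcopy(value)
       return merged


   base = {
       "components": {
           "demeter": {"cpus": 8, "memory": "32G", "runtime": "apptainer"},
       },
   }
   production = {"components": {"demeter": {"memory": "64G"}}}

   merged = deep_merge(base, production)
   print(merged["components"]["demeter"])
   # {'cpus': 8, 'memory': '64G', 'runtime': 'apptainer'}

Only ``memory`` changes; ``cpus`` and ``runtime`` survive because the merge
recurses into ``components.demeter`` instead of replacing it, and ``base``
itself is left untouched.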