.. _tutorial_manifest_system:

======================================================
Tutorial 2: Mastering the Manifest System
======================================================

What You Will Learn
-------------------

By the end of this tutorial you will:

* Understand every section of a ``scalable.yaml`` manifest in depth.
* Use environment variable expansion for portable, credential-free manifests.
* Define multiple targets for local development, HPC, and cloud.
* Configure components with images, mounts, environment variables, and tags.
* Apply overlays to customize resources per deployment environment.
* Validate manifests programmatically and interpret error codes.

Prerequisites
-------------

* Completed :ref:`tutorial_getting_started`.
* Scalable installed (``pip install scalable``).
* A text editor and terminal.

Scenario
--------

You are building an energy modeling pipeline with two stages: a
computationally expensive disaggregation step (Demeter) and a lighter
post-processing step (NetCDF aggregation). The pipeline must run locally
during development, on an HPC cluster for production, and eventually in the
cloud. The manifest system lets you describe all three targets in a single
file.

Step 1: Manifest Schema Overview
---------------------------------

Every manifest has this top-level structure:

.. code-block:: yaml

   version: 1
   project: { ... }
   targets: { ... }
   components: { ... }
   tasks: { ... }
   overlays: { ... }  # optional

The parser (:mod:`scalable.manifest.parser`) enforces:

* ``version`` and ``project`` are **required**.
* Unknown top-level keys are rejected (defense against typos).
* Unknown keys *inside* a target block are passed through to the provider
  (forward compatibility for provider-specific options).
* Unknown keys inside ``components`` are rejected (strict schema).

Step 2: The Project Block
--------------------------

.. code-block:: yaml

   project:
     name: demeter-lulcc
     default_storage: ./outputs
     local_cache: ./cache

``name``
    Identifies the project in telemetry run IDs (e.g.,
    ``run-20260520T...-demeter-lulcc-a1b2c3d4``). Use lowercase with hyphens.

``default_storage``
    Base URI for artifact output. Can be a local path, an S3 URI
    (``s3://bucket/prefix/``), or a GCS URI (``gs://bucket/prefix/``).
    Providers that support remote storage use this as the destination for
    task outputs.

``local_cache``
    Override for ``SCALABLE_CACHE_DIR``. The manifest value takes precedence
    over the environment variable, which in turn takes precedence over the
    built-in default (``./cache``).

Step 3: Defining Targets
-------------------------

Targets are named execution environments. You can define as many as you need:

.. code-block:: yaml

   targets:
     local:
       provider: local
       max_workers: 4
       threads_per_worker: 2
       processes: false
       containers: none
     hpc:
       provider: slurm
       queue: batch
       account: GCIMS
       walltime: "04:00:00"
       interface: ib0
     aws:
       provider: aws
       region: us-east-1
       cluster_type: fargate
       instance_type: m5.xlarge
       worker_cpu: 4096
       worker_mem: 16384
       image: 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
       adaptive:
         minimum: 1
         maximum: 10

Each target has one required key, ``provider``, that maps to a registered
provider class. All other keys are provider-specific options:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Provider
     - Key Options
   * - ``local``
     - ``max_workers``, ``threads_per_worker``, ``processes``, ``containers``
   * - ``slurm``
     - ``queue``, ``account``, ``walltime``, ``interface``
   * - ``aws``
     - ``region``, ``cluster_type``, ``instance_type``, ``worker_cpu``,
       ``worker_mem``, ``image``, ``adaptive``, ``subnets``,
       ``security_groups``
   * - ``kubernetes``
     - ``namespace``, ``image``, ``adaptive``, ``overlay``

**Why multiple targets?** A single manifest can describe your entire
promotion path: develop locally → validate on HPC → deploy to cloud.
The ``--target`` flag (or ``SCALABLE_TARGET`` env var) selects which
environment to activate.

Step 4: Components in Detail
------------------------------

Components are resource profiles for your workloads:

.. code-block:: yaml

   components:
     demeter:
       image: ghcr.io/jgcri/demeter:2.0.1
       runtime: apptainer
       cpus: 8
       memory: 32G
       mounts:
         ./demeter_data: /data
         /shared/outputs: /outputs
       env:
         DEMETER_DATA: /data
       tags: [lulcc, downscaling, gcam]
       preload_script: ./scripts/demeter_preload.sh
     postprocess:
       cpus: 2
       memory: 4G
       tags: [analysis]

Let's break down each key:

``image``
    Container image URI. Used by providers that support containerized workers
    (Slurm with Apptainer, Kubernetes, cloud). Omit for bare-metal local runs.

``runtime``
    Container runtime hint (``apptainer``, ``docker``). Providers use this to
    determine how to pull and launch the image.

``cpus``
    CPU count allocated per worker in this component group. Maps to Dask
    worker resource annotations and scheduler affinity.

``memory``
    Memory allocation string (e.g., ``32G``, ``512M``). Parsed by
    ``dask.utils.parse_bytes``.

``mounts``
    Volume mount mappings (host path → container path). Only meaningful for
    containerized providers.

``env``
    Environment variables injected into the worker process. Useful for
    configuring model data paths, API keys (prefer ``${VAR}`` references over
    literals), etc.

``tags``
    Arbitrary labels for grouping and filtering. Tags propagate to telemetry
    events and can be used by the resource advisor for per-tag
    recommendations.

``preload_script``
    Shell script executed before the Dask worker process starts. Useful for
    activating conda environments, loading modules, or mounting FUSE
    filesystems.

Step 5: Task Bindings
----------------------

Tasks bind named work units to components:

.. code-block:: yaml

   tasks:
     run_demeter_scenario:
       component: demeter
       cache: true
       outputs:
         database: dir
     aggregate_results:
       component: postprocess
       cache: true

``component``
    Must reference a key in the ``components`` map. This determines which
    workers can execute the task and what resources are reserved.

``cache``
    When ``true``, results of functions submitted under this task are
    eligible for the :func:`~scalable.caching.cacheable` disk cache. Cache
    hits skip execution entirely on subsequent runs.

``outputs``
    Declares expected output artifacts and their types (``file`` or ``dir``).
    The artifact store can persist these to remote storage when
    ``project.default_storage`` is configured.

Step 6: Environment Variable Expansion
----------------------------------------

Manifests support ``${VAR}`` and ``${VAR:-default}`` syntax for portability:

.. code-block:: yaml

   project:
     name: ${PROJECT_NAME:-energy-demo}
     default_storage: ${ARTIFACT_BUCKET:-./outputs}
   targets:
     aws:
       provider: aws
       region: ${AWS_REGION:-us-east-1}
       execution_role_arn: ${EXECUTION_ROLE_ARN}

Expansion rules:

* ``${VAR}``: replaced by the value of the environment variable. If the
  variable is unset, the parser raises
  :class:`~scalable.manifest.errors.ManifestParseError`.
* ``${VAR:-default}``: replaced by the value of the variable if it is set,
  otherwise by the literal default value.
* Bare ``$HOME``-style references are **not** expanded (to avoid ambiguity in
  mount paths). Always use curly braces.

This means you can commit ``scalable.yaml`` to version control without
embedding secrets or machine-specific paths:

.. code-block:: bash

   export AWS_REGION=us-west-2
   export EXECUTION_ROLE_ARN=arn:aws:iam::123456789:role/myRole
   scalable validate ./scalable.yaml

Step 7: Overlays for Environment-Specific Tuning
--------------------------------------------------

Overlays let you define named configuration deltas that are merged onto the
base manifest when a target references them:

.. code-block:: yaml

   targets:
     hpc:
       provider: slurm
       queue: batch
       walltime: "04:00:00"
       overlay: hpc-large
   components:
     demeter:
       cpus: 4
       memory: 16G
   overlays:
     hpc-large:
       components:
         demeter:
           cpus: 16
           memory: 64G
     hpc-debug:
       components:
         demeter:
           cpus: 2
           memory: 8G

When target ``hpc`` is selected, the ``hpc-large`` overlay is merged:
``demeter.cpus`` becomes 16 and ``demeter.memory`` becomes ``64G``. The base
values serve as defaults for targets that don't reference an overlay.

**Design rationale:** Overlays avoid manifest duplication. Instead of
maintaining separate YAML files per environment, you express deltas
declaratively. The merge is shallow per-component-key (not deep recursive),
keeping behavior predictable.

You can also override target options via overlays:

.. code-block:: yaml

   overlays:
     cloud-dev:
       targets:
         aws:
           adaptive:
             minimum: 1
             maximum: 3
       components:
         demeter:
           cpus: 4
           memory: 16G

Step 8: Multi-Target Workflow Selection
----------------------------------------

At runtime you select a target via:

**CLI:**

.. code-block:: bash

   scalable run ./scalable.yaml --target hpc --workflow workflow.py

**Python:**

.. code-block:: python

   from scalable import ScalableSession

   session = ScalableSession.from_yaml("./scalable.yaml", target="hpc")

**Environment variable:**

.. code-block:: bash

   export SCALABLE_TARGET=hpc
   python workflow.py  # Session auto-detects from env

The resolution order is: explicit ``target=`` argument → ``SCALABLE_TARGET``
env var → error (no implicit default target).

Step 9: Programmatic Validation
--------------------------------

You can validate manifests from Python for CI/CD integration:

.. code-block:: python

   from scalable import ScalableSession

   session = ScalableSession.from_yaml("./scalable.yaml", target="local")
   report = session.validate()

   if report.ok:
       print("Manifest is valid")
   else:
       # Report blocking errors first, then non-blocking warnings.
       for issue in report.errors:
           print(f"ERROR [{issue.code}] {issue.path}: {issue.message}")
       for issue in report.warnings:
           print(f"WARN  [{issue.code}] {issue.path}: {issue.message}")

Common error codes:

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Code
     - Meaning
   * - ``E_MISSING_KEY``
     - A required key (``version``, ``project``) is absent.
   * - ``E_BAD_VERSION``
     - ``version`` is not a supported schema version.
   * - ``E_UNKNOWN_TOP_KEY``
     - Unrecognized top-level key (probable typo).
   * - ``E_UNKNOWN_COMPONENT_KEY``
     - Unrecognized key inside a component definition.
   * - ``E_TASK_COMPONENT_REF``
     - A task references a component that does not exist.
   * - ``E_UNKNOWN_PROVIDER``
     - The target's provider is not installed or registered.
   * - ``E_BAD_MAX_WORKERS``
     - ``max_workers`` is not a positive integer.

Step 10: Complete Multi-Target Manifest
----------------------------------------

Here is a production-ready manifest combining all concepts:

.. code-block:: yaml

   version: 1

   project:
     name: demeter-lulcc
     default_storage: ${ARTIFACT_STORAGE:-./outputs}

   targets:
     local:
       provider: local
       max_workers: 4
       threads_per_worker: 2
       processes: false
       containers: none
     hpc:
       provider: slurm
       queue: batch
       account: ${SLURM_ACCOUNT}
       walltime: "08:00:00"
       interface: ib0
       overlay: hpc-prod
     aws:
       provider: aws
       region: ${AWS_REGION:-us-east-1}
       cluster_type: fargate
       worker_cpu: 4096
       worker_mem: 16384
       image: ${ECR_IMAGE}
       execution_role_arn: ${EXECUTION_ROLE_ARN}
       task_role_arn: ${TASK_ROLE_ARN}
       # Quoted: unquoted braces inside [...] are parsed as YAML flow mappings.
       subnets: ["${SUBNET_A}", "${SUBNET_B}"]
       security_groups: ["${SG_ID}"]
       adaptive:
         minimum: 2
         maximum: 20

   components:
     demeter:
       image: ghcr.io/jgcri/demeter:2.0.1
       cpus: 4
       memory: 16G
       tags: [lulcc, downscaling, gcam]
       env:
         DEMETER_DATA: /data
     postprocess:
       cpus: 2
       memory: 8G
       tags: [analysis]

   tasks:
     run_demeter_scenario:
       component: demeter
       cache: true
       outputs:
         database: dir
     aggregate:
       component: postprocess
       cache: true

   overlays:
     hpc-prod:
       components:
         demeter:
           cpus: 16
           memory: 64G
         postprocess:
           cpus: 8
           memory: 32G
     hpc-debug:
       components:
         demeter:
           cpus: 2
           memory: 4G
         postprocess:
           cpus: 1
           memory: 2G

Troubleshooting
---------------

**"ManifestParseError: unresolved variable ${VAR}"**
    You used ``${VAR}`` without a default and the variable is not set in the
    environment. Either export it or use ``${VAR:-fallback}``.

**"ManifestSchemaError: unknown component key 'gpu'"**
    Only recognized component keys are allowed. GPU scheduling is expressed
    via the provider-specific target options, not component definitions.

**Overlay changes not taking effect**
    Ensure the target block includes ``overlay: <name>`` and that the overlay
    name exactly matches a key under ``overlays:``. Overlay merging only
    applies to the selected target.

**"version: 2" rejected**
    Only schema version ``1`` is currently supported. The ``version`` field
    exists for future-proofing.

Next Steps
----------

* :ref:`tutorial_scaling_strategies`: Learn how different providers scale
  workers and how to choose between them.
* :ref:`tutorial_caching`: Cache expensive computations to accelerate
  iterative development.
* :ref:`tutorial_cloud_integration`: Configure AWS and GCP targets with real
  credentials and IAM roles.
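
As a supplement to Step 6, the three expansion rules can be illustrated with
a short standalone sketch. This is not Scalable's parser: ``expand_vars``,
``UnresolvedVariableError``, and the regular expression are hypothetical names
chosen for illustration, and treating a set-but-empty variable as "set" is an
assumption rather than documented behavior.

.. code-block:: python

   import re

   # Matches ${VAR} and ${VAR:-default}; bare $VAR is deliberately not matched.
   _VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


   class UnresolvedVariableError(ValueError):
       """Hypothetical stand-in for the library's ManifestParseError."""


   def expand_vars(text, env):
       """Expand ${VAR} and ${VAR:-default} references in text using env."""

       def _replace(match):
           name, default = match.group(1), match.group(2)
           if name in env:
               # Assumption: a variable that is set (even to "") wins.
               return env[name]
           if default is not None:
               return default
           raise UnresolvedVariableError(f"unresolved variable ${{{name}}}")

       return _VAR_PATTERN.sub(_replace, text)


   env = {"AWS_REGION": "us-west-2"}
   print(expand_vars("region: ${AWS_REGION:-us-east-1}", env))
   print(expand_vars("default_storage: ${ARTIFACT_BUCKET:-./outputs}", env))
   print(expand_vars("mount: $HOME/data", env))  # bare $HOME is left untouched

Passing the environment as a plain mapping (``os.environ`` in real use) keeps
the sketch easy to exercise in tests without modifying the process
environment.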