Beginner Tutorial 10: AI-Assisted Workflow Development¶

The Big Picture¶

Writing configuration files, diagnosing errors, and composing workflows requires expertise — you need to know Scalable’s manifest schema, provider options, component settings, and best practices. What if an AI assistant could help with these tasks?

Scalable includes AI-powered assistants that can onboard new model components, diagnose run failures, explain execution plans, compose workflows from descriptions, and migrate between providers. These assistants work in two modes: a fast deterministic mode (heuristics) and an intelligent LLM-powered mode.

This tutorial explains what LLMs are, how Scalable uses them, and how to leverage AI assistance in your workflow development.

What You Will Learn¶

By the end of this tutorial you will:

Understand what Large Language Models (LLMs) are at a high level.
Know the difference between heuristic and LLM-powered modes.
Use scalable init-component to onboard new models.
Use scalable diagnose to analyze failures.
Use scalable explain to understand execution plans.
Use scalable compose to generate workflows from descriptions.
Use scalable migrate to convert between providers.
Understand when to trust (and verify) AI-generated output.

Prerequisites¶

Completed Beginner Tutorial 1: Your First Workflow and Beginner Tutorial 2: Understanding the Manifest System.
pip install scalable[ai] (installs jinja2, rich).
Tutorial Setup: Run the Demeter Example End-to-End — the running example throughout this tutorial onboards the real Demeter model that lives in capabilities/demeter.
For LLM mode (optional): an API key for OpenAI, or a running Ollama instance.
Heuristic mode works without any AI setup.

Key Concepts Explained¶

💡 Key Concept: What is a Large Language Model (LLM)?

A Large Language Model is an AI system trained on massive amounts of text data that can generate human-like text, answer questions, and perform reasoning tasks.

How LLMs work (simplified):

Trained on billions of words from the internet (books, code, documentation)
Learns patterns: “given this input text, what text is likely to come next?”
At inference time: given your prompt (question), generates a response word by word, each word chosen based on what’s most likely to follow

Examples: ChatGPT (OpenAI), Claude (Anthropic), Llama (Meta), Gemini (Google)

Key properties:

Can generate configuration files, code, explanations
Not deterministic — same input may give slightly different outputs
Can be wrong (hallucination) — always verify output
Requires API access (cloud) or local hardware (Ollama)

💡 Key Concept: Heuristic vs. AI-Powered

Scalable’s assistants work in two modes:

Heuristic mode (rules-based):

Uses predefined rules, templates, and pattern matching
Deterministic: same input → always same output
Works offline (no API calls)
Fast and free
Best for: CI/CD pipelines, reproducible outputs, no AI budget

LLM-enhanced mode (AI-powered):

Uses an LLM for intelligent generation and reasoning
Non-deterministic: may give slightly different outputs
Requires API access (and costs money per call)
Slower but more flexible
Best for: creative composition, complex diagnosis, migration

Why both? Heuristic mode ensures Scalable works without external dependencies. LLM mode adds intelligence for complex tasks. The system gracefully degrades: if the LLM is unavailable, it falls back to heuristics.

💡 Key Concept: Templates

A template is a pre-structured document with placeholders that get filled in with specific values. Think of it like a form letter:

Dear {{ name }},
Your order of {{ item }} will arrive on {{ date }}.

In Scalable’s AI assistants:

Heuristic mode uses templates extensively (predictable, fast)
LLM mode uses templates as “prompts” — instructions to the AI about what to generate

Templates use Jinja2 syntax ({{ variable }}, {% if %}) which is the most popular Python templating language.

💡 Key Concept: Prompt Engineering

Prompt engineering is the art of crafting inputs to LLMs to get desired outputs. LLMs are sensitive to how you ask:

Bad prompt:: “Make me a manifest”
Good prompt:: “Generate a Scalable manifest for an energy modeling workflow with: - 2 targets: local (4 workers) and AWS Fargate - 1 component: demeter (4 CPUs, 16GB RAM, Apptainer container) - 1 task: run_demeter_scenario bound to demeter”

Scalable’s AI assistants handle prompt engineering internally — they construct detailed prompts from your high-level commands.

💡 Key Concept: Code Generation

Code generation is using AI to automatically write code or configuration. In Scalable’s context:

Generate manifest YAML from descriptions
Generate component definitions from model documentation
Generate migration plans between providers

Trust but verify: AI-generated code should always be reviewed by a human. It might be syntactically correct but semantically wrong (e.g., reasonable-looking but incorrect resource allocations).

💡 Key Concept: Deterministic vs. Non-Deterministic

Deterministic: Same input always produces the same output.: 2 + 2 = 4 (always). Heuristic mode is deterministic.
Non-deterministic: Same input may produce different outputs.: LLMs generate different text each time (due to random sampling in the generation process). LLM mode is non-deterministic.

Why this matters:

For CI/CD and testing → use heuristic mode (reproducible)
For creative tasks → LLM mode is fine (you review the output anyway)

💡 Key Concept: API (Application Programming Interface)

An API is a standardized way for programs to communicate. When Scalable uses OpenAI’s LLM, it sends a request to OpenAI’s API (over the internet) and receives the LLM’s response.

Your computer                     OpenAI servers
┌──────────┐    HTTP request     ┌──────────────┐
│ Scalable │───────────────────▶│  GPT-4 model │
│          │◀───────────────────│              │
└──────────┘    JSON response    └──────────────┘

API keys authenticate you (prove you’re allowed to use the service). Each API call costs money (typically fractions of a cent).

Step 1: Choosing Your Mode¶

Configure the AI backend via environment variable or .env file:

# Heuristic mode (default, no AI required)
export SCALABLE_AI_BACKEND=none

# OpenAI mode (requires API key)
export SCALABLE_AI_BACKEND=openai
export AI_API_KEY=sk-your-key-here

# Ollama mode (local LLM, no cloud dependency)
export SCALABLE_AI_BACKEND=ollama
# (requires Ollama running locally with a model loaded)

For this tutorial, all examples work in heuristic mode (no API key needed). LLM mode enhances the output quality but isn’t required.

Step 2: Onboarding the Demeter Component¶

You’re adding a real model — the Demeter land-use / land-cover disaggregation model — to your pipeline. Instead of writing the component definition manually, let the assistant analyze the cloned repository for you:

scalable init-component ./capabilities/demeter --name demeter --no-ai

Output (heuristic mode):

# Generated component definition
components:
  demeter:
    image: ghcr.io/jgcri/demeter:2.0.1
    cpus: 4
    memory: 16G
    tags: [lulcc, downscaling, gcam]
    mounts:
      ./demeter_data: /data
    env:
      DEMETER_DATA: /data

tasks:
  run_demeter_scenario:
    component: demeter
    cache: true

What happened here

The assistant:

Read setup.py and requirements.txt to determine that this is a Python 3.9+ package
Detected Dockerfile.scalable and proposed a matching image tag
Inferred tags from the README (“downscaling”, “GCAM”, “land-use”)
Generated matching task bindings with caching enabled (Demeter runs are deterministic per-config, so caching is safe by default)
Suggested a mount for the example data directory created by demeter.get_package_data(...)

In LLM mode, it could also read the module docstrings to suggest optimal resource allocations per spatial resolution, and generate a preload script that warms the constraint files into memory before the first task executes.

Step 3: Diagnosing Failures¶

When a run fails, the diagnostic assistant helps identify root causes:

scalable diagnose --run run-20260520T...-demeter-lulcc-abc123

Output:

═══════════════════════════════════════
Diagnosis Report
═══════════════════════════════════════

Failures: 3 of 50 tasks

Root Cause Analysis:
────────────────────
1. MEMORY_EXHAUSTION (2 tasks)
   Tasks: run_demeter_scenario(ssp1_0p05),
          run_demeter_scenario(ssp5_0p05)
   Evidence: MemoryError raised inside ProcessStep, peak memory
   15.8GB exceeds the 16GB limit. Both scenarios use
   ``spatial_resolution = 0.05``.
   Recommendation: Apply the ``k8s-fine-resolution`` overlay (which
   bumps ``demeter.memory`` to 64G) for fine-resolution scenarios.

2. INVALID_INPUT (1 task)
   Task: run_demeter_scenario(reference_v3)
   Evidence: IOError raised in 0.1s (fast fail pattern):
   ``constraints/soil_quality.csv not found``.
   Recommendation: Add ``constraints/`` to the demeter component's
   ``mounts:`` block, or copy the file into ``demeter_data/`` before
   fan-out.

Suggested Fixes:
────────────────
• Apply overlay to increase memory:
  overlays:
    fix-oom:
      components:
        demeter:
          memory: 24G

💡 Key Concept: Root Cause Analysis

Root cause analysis means identifying the underlying reason for a failure, not just the symptom.

Symptom: “Task failed with MemoryError”
Root cause: “Component memory (16G) is insufficient for Demeter scenarios at 0.05° resolution, which expand the projected-LU CSV to 500k+ grid cells and need ~20GB during the kernel-density step”

The diagnostic assistant uses patterns in telemetry (failure timing, error types, resource usage) to infer root causes.

Step 4: Explaining Execution Plans¶

Get a human-readable explanation of what a plan will do:

scalable explain ./docs/examples/scalable.demeter.yaml --target aws

Output:

Plan Explanation
═══════════════

This execution plan will:

1. Deploy the demeter-lulcc project to AWS Fargate in us-east-1 region
2. Start with 1 demeter worker, scaling up to 10 based on the scenario
   backlog
3. Each demeter worker has 4 vCPUs and 16GB RAM
4. Workers run the ghcr.io/jgcri/demeter:2.0.1 container
5. Per-scenario outputs stored to s3://${ARTIFACT_STORAGE}/demeter-lulcc/

Estimated cost: $4.82 for a 50-scenario run (≈ 2.5 hours of Fargate
compute + S3 storage)

Key decisions:
• Adaptive scaling chosen (min=1, max=10) — cost-efficient because
  scenario count is variable
• Fargate selected — no server management overhead
• S3 storage — durable, accessible from any future Demeter run for
  comparison

This is especially useful for:

Reviewing a plan before running in production
Explaining to stakeholders what a workflow does
Documenting deployment decisions for team members

Step 5: Composing Workflows from Descriptions¶

The most powerful assistant — generate manifests from natural language:

scalable compose \
  --description "Demeter LULCC pipeline that downscales GCAM \
  scenarios in parallel (4 CPUs, 16GB RAM, containerized) followed by \
  NetCDF aggregation (2 CPUs, 8GB). Needs local and AWS targets with \
  adaptive scaling."

Output:

# Generated by scalable compose
version: 1
project:
  name: demeter-lulcc

targets:
  local:
    provider: local
    max_workers: 4
    threads_per_worker: 1
    processes: true
    containers: none

  aws:
    provider: aws
    region: us-east-1
    cluster_type: fargate
    worker_cpu: 4096
    worker_mem: 16384
    image: ${ECR_DEMETER_IMAGE}
    adaptive:
      minimum: 1
      maximum: 10

components:
  demeter:
    cpus: 4
    memory: 16G
    image: ghcr.io/jgcri/demeter:2.0.1
    tags: [lulcc, downscaling, gcam]

  postprocess:
    cpus: 2
    memory: 8G
    tags: [lulcc, aggregation]

tasks:
  run_demeter_scenario:
    component: demeter

  aggregate_demeter_outputs:
    component: postprocess

Heuristic vs. LLM composition

Heuristic mode: Parses your description for keywords (CPUs, memory, provider names) and fills templates. Works well for straightforward requests.

LLM mode: Understands context and nuance. Can handle complex descriptions like “similar to our reference Demeter pipeline but for the SSP1-5 ensemble, with the k8s-fine-resolution overlay applied for spatial_resolution <= 0.1° scenarios.” Generates more tailored output.

Step 6: Migrating Between Providers¶

Moving a workflow from one provider to another:

scalable migrate ./docs/examples/scalable.demeter.yaml \
  --from slurm --to kubernetes

Output:

# Migration: slurm → kubernetes
# Changes applied:

targets:
  k8s:  # Replaces 'hpc' target
    provider: kubernetes
    namespace: demeter-prod
    image: ghcr.io/jgcri/demeter:2.0.1   # already on the demeter component
    adaptive:
      minimum: 2
      maximum: 20                    # Mapped from Slurm max_workers

# Migration notes:
# - Slurm 'queue: short' → K8s namespace 'demeter-prod'
# - Slurm 'walltime' → K8s pod activeDeadlineSeconds (no direct equivalent)
# - Slurm 'interface: ib0' → removed (K8s uses pod networking)
# - Apptainer mount './demeter_data:/data' → PVC 'demeter-data-pvc'
# - The hpc-large overlay (demeter.memory: 64G) is preserved as
#   k8s-fine-resolution so it can be re-applied per-target.

Why migration is complex

Providers have different capabilities and concepts:

Slurm has queues, walltimes, accounts → no direct K8s equivalent
K8s has namespaces, pod specs, operators → no Slurm equivalent
Cloud has regions, instance types, VPCs → not applicable to HPC

The migration assistant maps concepts where possible and flags differences that require human decision.

Step 7: Human-in-the-Loop Verification¶

💡 Key Concept: Human-in-the-Loop

Human-in-the-loop means AI generates suggestions but a human makes the final decision. This is important because:

AI can generate plausible-looking but incorrect configuration
Resource allocations affect cost and correctness
Provider-specific nuances may be missed
Security implications (IAM roles, network access) need human review

Scalable’s approach: AI generates → human reviews → human applies. All generated output requires explicit confirmation before being used.

Best practices for verifying AI-generated output:

Always validate: Run scalable validate on generated manifests
Dry-run first: Use --dry-run to see effects without committing
Check resource allocations: Are they sensible for your workload?
Review security: Are IAM roles, images, and network settings correct?
Test locally first: Use --target local before deploying to cloud

Common Questions¶

Q: Do I need to pay for an LLM API to use the AI features?

No! Heuristic mode works without any API key and handles most common cases. LLM mode is an enhancement for complex or creative tasks.

Q: Is the AI generating code that could be insecure?

The AI generates configuration (YAML), not executable code. Always review generated manifests before running, especially for:

Container image sources (trust the registry?)
IAM/permission settings
Network exposure (public vs. private subnets)
Resource allocations (could generate expensive configurations)

Q: How much does LLM mode cost?

Typically $0.01–$0.10 per AI assistant call (depending on the model and prompt length). The explain command is cheapest (short output). The compose command is most expensive (longer generation).

Q: Can I use a local LLM instead of OpenAI?

Yes! Set SCALABLE_AI_BACKEND=ollama and run an Ollama instance locally. This is free (no API costs) but requires a machine with enough RAM for the model (8–32GB depending on model size).

Q: What if the AI gives a wrong answer?

That’s why validation exists. Generated manifests go through the same validation as hand-written ones. scalable validate catches structural errors. Semantic errors (wrong but valid resource allocations) require human judgment.

Q: Are heuristic outputs always correct?

Heuristic mode is deterministic and template-based, so it’s predictable. But it may not handle edge cases as well as LLM mode. For standard workflows, heuristics work great. For unusual configurations, LLM mode provides better results.

What You Learned¶

Term	Definition
Large Language Model (LLM)	AI trained on text that can generate human-like responses
Heuristic Mode	Rule-based, deterministic processing (no AI required)
LLM-Enhanced Mode	AI-powered processing with richer understanding
Template	Pre-structured document with fill-in-the-blank placeholders
Prompt Engineering	Crafting inputs to LLMs to get desired outputs
Code Generation	Using AI to automatically write code or configuration
Deterministic	Same input always produces the same output
Non-Deterministic	Same input may produce different outputs (LLM behavior)
API	Standardized interface for programs to communicate
Human-in-the-Loop	AI suggests, human decides and validates
Root Cause Analysis	Identifying the underlying reason for a failure
Graceful Degradation	Falling back to simpler mode when advanced features unavailable

Next Steps¶

Tutorial Setup: Run the Demeter Example End-to-End — One-time setup (clone, install, demeter.get_package_data, optional Docker image build) for the examples in this tutorial.

You’ve completed all 10 beginner tutorials! You now have a solid foundation in:

Distributed computing and workflow orchestration
Declarative configuration with manifests
Scaling strategies and provider architecture
Caching and performance optimization
Cloud computing and container technology
Telemetry and observability
Error handling and fault tolerance
Kubernetes and container orchestration
Machine learning for workflow optimization
AI-assisted development

Where to go from here:

Standard tutorials: Work through Tutorials for deeper technical content and production patterns
API documentation: Explore the API for detailed reference
Real project: Apply what you’ve learned to your own workflow!
Community: Contribute improvements via How to Contribute

🎉 Congratulations!

You’ve gone from “what is distributed computing?” to understanding ML optimization and AI-assisted development. The beginner tutorials gave you the conceptual foundation — the standard tutorials and real-world practice will build expertise on top of it.