Beginner Tutorial 1: Your First Workflow

The Big Picture

Imagine you have a Python script that processes data — maybe it analyzes energy scenarios, runs simulations, or trains models. When the data grows, running everything on your laptop becomes painfully slow. You need a way to split the work across multiple processors (or multiple computers) without rewriting your entire program.

That’s what Scalable does. It takes your Python functions and orchestrates them across multiple workers — whether those workers are threads on your laptop, processes on an HPC cluster, or containers in the cloud. And it does this through a simple configuration file rather than requiring you to write complex parallel programming code.

This tutorial walks you through your very first Scalable workflow, explaining every concept along the way.

Where this is going

This tutorial uses a deliberately trivial hello-scalable project so you can verify your install in a few minutes. Tutorials 2–10 then graduate to a realistic running example: running Demeter land-use / land-cover downscaling across many GCAM scenarios in parallel. When you’re ready to actually execute that pipeline, Tutorial Setup: Run the Demeter Example End-to-End walks through the one-time setup.

💡 Key Concept: What is a Workflow?

A workflow is a sequence of computational steps that transforms inputs into outputs. Think of it like a recipe: you have ingredients (data), steps (functions), and a final dish (results).

In Scalable, a workflow consists of:

A manifest (configuration file) describing what resources you need
Python functions that do the actual work
A target (where the work runs — your laptop, a cluster, the cloud)

What You Will Learn

By the end of this tutorial you will:

Understand what Scalable is and why it exists.
Know what Dask is and why Scalable uses it under the hood.
Create and activate a Python virtual environment.
Install Scalable and use its command-line interface (CLI).
Write your first manifest file (scalable.yaml).
Validate, plan, and run a workflow end-to-end.
Read the telemetry output to see what happened.

Prerequisites

Python 3.11 or later installed on your computer.
A terminal (Terminal on macOS/Linux, PowerShell or Command Prompt on Windows).
Basic Python knowledge: you can write functions, use import, and know what pip is (even if you don’t use it daily).

No HPC cluster, Docker, or cloud account is needed — everything runs locally.

Key Concepts Explained

Before we write any code, let’s define the foundational ideas you’ll encounter.

💡 Key Concept: Distributed Computing

Distributed computing means splitting work across multiple processors or computers that work together. Instead of one CPU doing all 1000 tasks sequentially (one after another), you might have 10 CPUs each handling 100 tasks simultaneously.

Analogy: Imagine stuffing 1000 envelopes. Doing it alone takes hours. With 10 friends helping, each person stuffs 100 envelopes and you finish roughly 10× faster. Distributed computing is getting those friends organized.
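
The envelope analogy can be sketched with nothing but Python's standard library. This is not Scalable's API; the names stuff_envelope and stuff_all are made up for illustration, and a thread pool stands in for the ten friends:

```python
from concurrent.futures import ThreadPoolExecutor


def stuff_envelope(envelope_id: int) -> str:
    """Stand-in for one small unit of work (hypothetical)."""
    return f"envelope-{envelope_id}-sealed"


def stuff_all(n_envelopes: int, n_friends: int) -> list[str]:
    """Hand the envelopes to a pool of workers and collect the results."""
    with ThreadPoolExecutor(max_workers=n_friends) as pool:
        # map() spreads the calls over the pool but returns results
        # in the same order as the inputs.
        return list(pool.map(stuff_envelope, range(n_envelopes)))


sealed = stuff_all(n_envelopes=1000, n_friends=10)
print(len(sealed), sealed[0], sealed[-1])
```

How close you get to a 10× speedup in practice depends on the kind of work and on coordination overhead, which is part of what a scheduler manages for you.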

💡 Key Concept: What is Dask?

Dask is a Python library for parallel and distributed computing. It’s the “engine” that Scalable uses under the hood to actually run your functions on multiple workers.

Think of Dask as the engine in a car — you don’t need to understand every piston to drive, but knowing it’s there helps you understand what’s happening.

Why Dask? Scalable chose Dask because it:

Integrates natively with Python’s scientific ecosystem (NumPy, pandas)
Scales from a single laptop to thousands of nodes
Has a mature scheduler that handles task dependencies
Supports dynamic scaling (adding/removing workers at runtime)
Is widely adopted in the scientific computing community

Alternatives like Ray (more ML-focused) or Celery (more web-focused) exist, but Dask’s strength is scientific workflows — exactly what Scalable targets.

💡 Key Concept: Command-Line Interface (CLI)

A CLI is a text-based way to interact with a program. Instead of clicking buttons in a graphical interface, you type commands like scalable run ./scalable.yaml.

CLIs are preferred for:

Automation — easy to script and repeat
Remote work — works over SSH where GUIs don’t
Reproducibility — commands can be saved and re-run exactly

💡 Key Concept: Virtual Environment

A virtual environment is an isolated Python installation. It has its own copy of pip and installed packages, separate from your system Python.

Why bother? Without virtual environments, installing a package for Project A might break Project B (if they need different versions of the same library). Virtual environments keep projects isolated.

Analogy: Virtual environments are like separate kitchen pantries for each recipe — what you put in one doesn’t affect the others.

Step 1: Set Up Your Environment

Let’s create an isolated Python environment for this tutorial.

Open your terminal and run:

# Create a new virtual environment named ".venv"
python -m venv .venv

# Activate it (this changes your terminal's Python to use the isolated one)
source .venv/bin/activate   # macOS/Linux
# On Windows: .venv\Scripts\activate

What just happened?

python -m venv .venv created a folder called .venv containing a fresh Python installation. source .venv/bin/activate tells your terminal “use this Python instead of the system one.” You’ll see your prompt change (often showing (.venv) at the beginning).
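
If you want to confirm the activation from inside Python, the short check below may help. It relies only on the standard library: inside a virtual environment, sys.prefix and sys.base_prefix differ. The helper name in_virtual_environment is ours, chosen for illustration:

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

Run it with the environment activated and the interpreter path should point inside your .venv folder.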

Now install Scalable:

pip install scalable

Verify it worked:

scalable --help

You should see output like:

usage: scalable [-h] {validate,plan,run,report,advise,...} ...

Scalable CLI — orchestrate distributed workflows.

positional arguments:
  {validate,plan,run,report,advise,...}

Under the Hood

When you ran pip install scalable, Python downloaded Scalable and all its dependencies (including Dask). The scalable command is a CLI entry point: a small script that pip created in your virtual environment’s bin/ directory (Scripts\ on Windows) and that launches Scalable’s command handler.
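
To make the idea of a command handler less abstract, here is a toy sketch built with the standard library's argparse. It is not Scalable's implementation: the program name toycli and the functions build_parser and main are invented, and only two subcommands are modeled, using the flags that appear in this tutorial:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with subcommands, similar in shape to the help text above."""
    parser = argparse.ArgumentParser(prog="toycli", description="Toy CLI sketch.")
    subcommands = parser.add_subparsers(dest="command", required=True)

    validate = subcommands.add_parser("validate", help="check a manifest")
    validate.add_argument("manifest")

    plan = subcommands.add_parser("plan", help="show what would run")
    plan.add_argument("manifest")
    plan.add_argument("--target", default="local")
    plan.add_argument("--dry-run", action="store_true")
    return parser


def main(argv: list[str]) -> str:
    """Parse argv and return a description of what would be dispatched."""
    args = build_parser().parse_args(argv)
    if args.command == "validate":
        return f"validate {args.manifest}"
    return f"plan {args.manifest} on {args.target} (dry run: {args.dry_run})"


# A real entry point would call main(sys.argv[1:]); here we pass arguments directly.
print(main(["plan", "./scalable.yaml", "--target", "local", "--dry-run"]))
```

An entry point script does little more than import a function like main and call it with the arguments you typed.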

Step 2: Create a Project Directory

Scalable expects your workflow to live in a dedicated directory:

mkdir my-first-workflow && cd my-first-workflow

The minimal layout is:

my-first-workflow/
├── scalable.yaml       # The manifest (configuration)
└── workflow.py          # Your Python code

💡 Key Concept: Project Structure

Keeping configuration (scalable.yaml) and code (workflow.py) in a dedicated directory makes your workflow:

Portable — zip it up and it works elsewhere
Version-controllable — put it in Git
Self-documenting — everything needed is in one place

Step 3: Write Your First Manifest

💡 Key Concept: What is a Manifest?

A manifest is a configuration file that declares the desired state of your system. In Scalable, the manifest (scalable.yaml) answers:

What is this project?
Where should it run? (local machine? cloud? HPC cluster?)
How many resources does each piece need? (CPU, memory)
What are the work units?

The manifest is declarative — more on this below.

💡 Key Concept: Declarative vs. Imperative Programming

This is a fundamental programming paradigm distinction:

Imperative (how to do it): “SSH into server. Run this command. Check the output. If it failed, retry. Allocate 4GB of RAM by calling this API…”
Declarative (what you want): “I need 2 workers with 1 CPU and 1GB RAM each.”

The manifest is declarative — you describe your desired state and Scalable figures out how to achieve it. This is the same philosophy behind:

SQL (SELECT name FROM users — you say what data, not how to fetch it)
HTML (<h1>Title</h1> — you say what it is, not how to render it)
Kubernetes YAML (you describe desired state, K8s makes it happen)

Why declarative? It separates intent from implementation. Your manifest works whether you’re running locally, on an HPC cluster, or in AWS — only the “target” section changes.
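
The contrast can be sketched in a few lines of plain Python. Nothing here is Scalable code: imperative_setup and reconcile are hypothetical helpers, and a worker is modeled as a small dict. The imperative function spells out each step, while the declarative one receives a desired state and works out what is missing:

```python
def imperative_setup() -> list[dict]:
    """Imperative style: spell out each step yourself."""
    workers = []
    workers.append({"cpus": 1, "memory": "1G"})
    workers.append({"cpus": 1, "memory": "1G"})
    return workers


def reconcile(desired: dict, current: list[dict]) -> list[dict]:
    """Declarative style: given a desired state, work out what to change."""
    spec = {"cpus": desired["cpus"], "memory": desired["memory"]}
    # Keep only the workers that already match the spec, up to the requested count.
    matching = [w for w in current if w == spec][: desired["workers"]]
    # Add workers until the requested count is reached.
    missing = desired["workers"] - len(matching)
    return matching + [dict(spec) for _ in range(missing)]


desired_state = {"workers": 2, "cpus": 1, "memory": "1G"}
print(reconcile(desired_state, current=[]) == imperative_setup())
```

Because reconcile starts from whatever already exists, the same desired state can be applied repeatedly, or against a different starting point, without rewriting the steps.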

Create the file scalable.yaml:

# scalable.yaml — Your first Scalable manifest
version: 1
project:
  name: hello-scalable

targets:
  local:
    provider: local
    max_workers: 2
    threads_per_worker: 1
    processes: false
    containers: none

components:
  analysis:
    cpus: 1
    memory: 1G

tasks:
  run_analysis:
    component: analysis

Let’s break this down, starting with the file format itself:

💡 Key Concept: YAML

YAML (YAML Ain’t Markup Language) is a human-readable data format. It uses indentation (spaces, not tabs!) to show structure:

# This is a comment
key: value              # A simple key-value pair
nested:
  child_key: child_val  # Indented = nested inside "nested"
list:
  - item1               # Lists use dashes
  - item2

YAML was chosen over JSON (harder to read/write by hand) and TOML (less expressive for nested structures).
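
To see what that structure means to a program, here is the small YAML snippet written out by hand as the Python data a YAML parser would typically produce (mappings become dicts, dash lists become lists). No YAML library is used; the variable name parsed is ours. The last line prints the same data as JSON for comparison:

```python
import json

# The YAML snippet above, written out as the Python structure a YAML
# parser would typically produce.
parsed = {
    "key": "value",
    "nested": {"child_key": "child_val"},
    "list": ["item1", "item2"],
}

# Nested values are reached by chaining keys, mirroring the indentation.
print(parsed["nested"]["child_key"])
print(parsed["list"][1])

# The same data as JSON: every string is quoted and nesting uses braces.
print(json.dumps(parsed, indent=2))
```

Note that JSON also has no comment syntax, which matters for a file you are expected to annotate by hand.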

Section-by-section explanation:

version: 1

The schema version. This tells Scalable which format rules to apply when reading your manifest. Currently 1 is the only version.

project: { name: hello-scalable }

Metadata about your project. The name appears in logs, telemetry data, and artifact paths so you can identify which project a run belongs to.

targets:

Targets are where your code runs. You can have multiple targets (local, HPC, cloud) in one manifest and switch between them. Here we define one target called local:

provider: local — Use the built-in local provider (runs on your machine)
max_workers: 2 — Create up to 2 workers (parallel executors)
threads_per_worker: 1 — Each worker uses 1 thread
processes: false — Workers run as threads (not separate processes)
containers: none — No containerization (bare metal)

components:

Components define resource profiles — how much CPU and memory a piece of work needs. The analysis component requests 1 CPU and 1 gigabyte of RAM.

tasks:

Tasks are named work units that bind to a component. When you submit a function to Scalable, you associate it with a task name, which tells the system what resources it needs.

Why separate targets, components, and tasks?

This separation follows the separation of concerns principle:

Targets = where (infrastructure)
Components = how much (resources)
Tasks = what (work units)

You can change where you run (swap the target) without changing what you run (tasks and components stay the same). This is what makes Scalable truly portable.
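
One way to picture this separation is as three small lookup tables. The sketch below is a hypothetical in-memory model, not Scalable's internal data structures: the Target and Component classes, the resolve helper, and the second target named cluster (with provider hpc) are all invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    provider: str       # where: infrastructure
    max_workers: int


@dataclass(frozen=True)
class Component:
    cpus: int           # how much: resources
    memory: str


# Hypothetical mirror of the manifest from this tutorial,
# plus an invented "cluster" target for comparison.
targets = {
    "local": Target(provider="local", max_workers=2),
    "cluster": Target(provider="hpc", max_workers=64),  # invented example
}
components = {"analysis": Component(cpus=1, memory="1G")}
tasks = {"run_analysis": "analysis"}  # what: task name -> component name


def resolve(task_name: str, target_name: str) -> tuple[Target, Component]:
    """Look up where a task runs and what resources it asks for."""
    return targets[target_name], components[tasks[task_name]]


for name in targets:
    target, component = resolve("run_analysis", name)
    print(f"{name}: {target.provider}, {component.cpus} cpu, {component.memory}")
```

Switching the target name changes only the first half of the answer; the component the task asks for stays the same.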

Step 4: Validate Your Manifest

Before running anything, check that your manifest is correctly written:

scalable validate ./scalable.yaml

Expected output:

✓ Manifest is valid (0 errors, 0 warnings)

💡 Key Concept: Validation

Validation means checking that something meets expected rules before using it. It’s like spell-check for your configuration.

Scalable’s validator checks:

Required sections exist (version, project)
Key names are spelled correctly (catches typos like providr)
References are valid (a task’s component actually exists)
Values are the right type (max_workers must be a positive number)

Why validate first? It’s much faster and cheaper to catch errors in a config file than to discover them 30 minutes into a cloud run that’s costing you money.
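
The four kinds of checks listed above can be sketched as a small function over a manifest that has already been loaded into a Python dict. This is an illustration only, not Scalable's validator: the function name validate, the set of known keys, and the wording of the error messages are assumptions made for this example:

```python
def validate(manifest: dict) -> list[str]:
    """Return a list of error messages; an empty list means the manifest passed."""
    errors = []

    # 1. Required sections exist.
    for section in ("version", "project"):
        if section not in manifest:
            errors.append(f"missing required section '{section}'")

    # 2. Key names are known, and 4. values have the right type.
    known_target_keys = {"provider", "max_workers", "threads_per_worker",
                         "processes", "containers"}
    for name, target in manifest.get("targets", {}).items():
        for key in target:
            if key not in known_target_keys:
                errors.append(f"targets.{name}: unknown key '{key}'")
        workers = target.get("max_workers", 1)
        if not isinstance(workers, int) or workers < 1:
            errors.append(f"targets.{name}: max_workers must be a positive integer")

    # 3. References resolve: every task must point at a defined component.
    components = manifest.get("components", {})
    for name, task in manifest.get("tasks", {}).items():
        if task.get("component") not in components:
            errors.append(f"tasks.{name}: unknown component '{task.get('component')}'")

    return errors


good = {
    "version": 1,
    "project": {"name": "hello-scalable"},
    "targets": {"local": {"provider": "local", "max_workers": 2}},
    "components": {"analysis": {"cpus": 1, "memory": "1G"}},
    "tasks": {"run_analysis": {"component": "analysis"}},
}
bad = {
    "version": 1,
    "targets": {"local": {"providr": "local", "max_workers": 0}},
    "components": {},
    "tasks": {"run_analysis": {"component": "analysis"}},
}

print(validate(good))
for message in validate(bad):
    print("ERROR", message)
```

Collecting every problem into a list, rather than stopping at the first one, lets you fix a manifest in a single pass.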

Try introducing a deliberate error to see what happens:

# Change "provider" to "providr" (typo) and validate again
targets:
  local:
    providr: local   # <-- typo!

Validation now fails with an error along these lines:

ERROR targets.local: unknown provider 'providr'

Step 5: Plan the Execution

Planning shows you what would happen without actually doing it:

scalable plan ./scalable.yaml --target local --dry-run

Expected output:

Plan created for target 'local' (provider: local)
Workers: 2 × analysis (1 cpu, 1G memory)
Manifest lock: sha256:a3b8f1...

💡 Key Concept: Dry Run

A dry run simulates an operation without executing it. It answers “what would happen if I ran this?” without consuming real resources.

This is valuable because:

You can verify your configuration before spending time/money
You can review the plan and catch mistakes
In cloud environments, you can see estimated costs before committing

The dry-run idea is common across many tools, although the spelling varies (terraform plan, kubectl --dry-run, rsync --dry-run).

💡 Key Concept: Manifest Lock (Hash)

The sha256:a3b8f1... is a hash — a fingerprint of your manifest’s contents. If you change anything in the manifest, the hash changes. This enables:

Reproducibility — you can verify that a run used the exact same configuration as a previous run
Caching — Scalable knows if the manifest changed since last run
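
The fingerprint idea is easy to try with Python's hashlib. This sketch hashes the manifest text directly; whether Scalable hashes the raw file or a normalized form of it is not stated here, so treat the fingerprint helper as an assumption-laden illustration rather than the tool's actual algorithm:

```python
import hashlib


def fingerprint(manifest_text: str) -> str:
    """Return a 'sha256:<hex>' fingerprint of the given text."""
    digest = hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


original = "version: 1\nproject:\n  name: hello-scalable\n"
edited = original.replace("hello-scalable", "hello-scalable-2")

print(fingerprint(original)[:19] + "...")
print(fingerprint(original) == fingerprint(original))  # same input, same hash
print(fingerprint(original) == fingerprint(edited))    # any change, different hash
```

Comparing two short strings is much cheaper than comparing two configurations field by field, which is what makes a hash convenient for both reproducibility checks and cache lookups.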

Step 6: Write Your Workflow Code

Now let’s write the Python function that does actual work. Create workflow.py:

"""My first Scalable workflow."""
import time
from scalable import ScalableSession


def analyze_scenario(scenario_id: int) -> dict:
    """Simulate an analysis task.

    In a real workflow this might run an energy model, process
    satellite data, or train a machine learning model. Here we
    just simulate work with a sleep.
    """
    time.sleep(0.5)  # Simulate 0.5 seconds of computation
    return {
        "scenario_id": scenario_id,
        "result": scenario_id * 42,
        "status": "complete",
    }


def main():
    """Run the workflow using a ScalableSession."""
    # Create a session from our manifest
    session = ScalableSession.from_yaml(
        "./scalable.yaml",
        target="local",
    )
    plan = session.plan()
    client = session.start(plan)

    # Submit 6 tasks to be executed in parallel
    futures = []
    for i in range(6):
        future = client.submit(analyze_scenario, i, tag="analysis")
        futures.append(future)

    # Gather results (blocks until all tasks complete)
    results = client.gather(futures)

    print(f"Completed {len(results)} scenarios!")
    for r in results:
        print(f"  Scenario {r['scenario_id']}: result = {r['result']}")

    # Clean up
    session.close()


if __name__ == "__main__":
    main()

Let’s understand what this code does:

Under the Hood: What happens when you call client.submit()

Your function (analyze_scenario) and its arguments (scenario_id) are serialized (converted to bytes that can be sent over a network).
The serialized task is sent to Dask’s scheduler.
The scheduler finds an available worker and assigns the task.
The worker deserializes the function, executes it, and sends the result back.
Meanwhile, you already hold a future: client.submit() returned it immediately as a placeholder for the result that will be available later.

With max_workers: 2, Scalable runs 2 tasks at a time. Since we submitted 6 tasks, they execute in 3 batches of 2 (total ~1.5 seconds instead of 3 seconds sequentially).

💡 Key Concept: Futures

A future is a promise of a result that hasn’t been computed yet. When you call client.submit(), the task starts running in the background and you immediately get back a future object.

Later, when you call client.gather(futures), the call blocks until all the futures have their results ready, then returns them.

Analogy: Ordering food at a counter — you get a receipt number (future) immediately. The food is being prepared in the background. When you hear your number called, you pick up your food (gather the result).
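
Futures are not unique to Scalable or Dask. The standard library's concurrent.futures module has the same concept, so you can experiment with it without any cluster at all. The sketch below uses that module rather than Scalable's client; slow_square is a made-up stand-in for real work:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x: int) -> int:
    """Pretend to work for a moment, then return a result."""
    time.sleep(0.1)
    return x * x


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns immediately with a Future, before the work is finished.
    futures = [pool.submit(slow_square, i) for i in range(4)]
    print("Futures in hand; all done yet?", all(f.done() for f in futures))

    # result() blocks until that particular future has its value.
    results = [f.result() for f in futures]

print(results)
```

Collecting the results in a loop plays the same role here that client.gather(futures) plays in the workflow above.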

Step 7: Run the Workflow

Execute your workflow:

python workflow.py

Expected output:

Completed 6 scenarios!
  Scenario 0: result = 0
  Scenario 1: result = 42
  Scenario 2: result = 84
  Scenario 3: result = 126
  Scenario 4: result = 168
  Scenario 5: result = 210

You can also run workflows via the CLI (for manifests that define entry points), but the Python API gives you the most control.

🤔 Think About It

With 6 tasks and 2 workers, how long should this take?

Sequential (no parallelism): 6 × 0.5s = 3.0 seconds
Parallel with 2 workers: 3 batches × 0.5s = ~1.5 seconds

The speedup is approximately 2× with 2 workers. This is the fundamental value of distributed computing — trading more hardware for less time.
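
The arithmetic generalizes to other worker counts. The helper below, named estimated_wall_time for this example, is an idealized estimate that assumes equally long tasks and ignores scheduling overhead:

```python
import math


def estimated_wall_time(n_tasks: int, n_workers: int, seconds_per_task: float) -> float:
    """Idealized estimate: tasks run in waves of at most n_workers at a time."""
    waves = math.ceil(n_tasks / n_workers)
    return waves * seconds_per_task


for workers in (1, 2, 4, 6):
    print(workers, "worker(s):", estimated_wall_time(6, workers, 0.5), "seconds")
```

Because the number of waves is rounded up, adding workers only helps when it reduces the wave count, and real runs will come in slightly slower than the estimate.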

Step 8: Inspect Telemetry

💡 Key Concept: Telemetry

Telemetry is automated data collection about what happened during execution. Think of it like a flight recorder (black box) for your workflow — it records events so you can understand what happened after the fact.

After your run completes, Scalable has recorded telemetry data. Generate a report:

scalable report --last

This shows a summary of your most recent run: how many tasks succeeded, how long they took, and resource utilization.

Common Questions

Q: Do I always need a manifest file?

Yes — the manifest is the single source of truth for your workflow’s resource requirements. This is by design: it makes workflows reproducible and portable.

Q: Why not just use Python’s multiprocessing module?

Python’s multiprocessing works for simple parallelism on one machine. But it can’t:

Scale to multiple machines (HPC clusters, cloud)
Manage heterogeneous resources (different CPU/memory per task type)
Cache results between runs
Provide telemetry and observability
Handle worker failures gracefully

Scalable (via Dask) provides all of these.

Q: What’s the difference between threads and processes?

Threads share memory (fast communication, but Python’s GIL limits true CPU parallelism).
Processes have separate memory (true parallelism, but higher overhead to start and communicate).

For I/O-bound work (network calls, file reading), threads work well. For CPU-bound work (heavy math), processes are better. The processes: false setting in our manifest uses threads for simplicity.
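
The "threads share memory" point can be seen directly with the standard library's threading module. In this sketch (independent of Scalable; add_many and counter are illustrative names), four threads update one shared object, with a lock protecting each update:

```python
import threading

counter = {"value": 0}          # one object, visible to every thread
lock = threading.Lock()


def add_many(n: int) -> None:
    """Increment the shared counter n times."""
    for _ in range(n):
        # The lock makes each read-modify-write step atomic.
        with lock:
            counter["value"] += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])
```

Separate processes would each start with their own copy of counter, so the same result would require sending data between them explicitly. That extra communication is the overhead mentioned above.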

Q: What is the GIL?

The Global Interpreter Lock (GIL) is an implementation detail of CPython, the standard Python interpreter, that prevents multiple threads in one process from executing Python bytecode at the same time. It simplifies memory management inside the interpreter, but it means CPU-bound threads don’t truly run in parallel. This is why processes: true is better for computation-heavy tasks.

What You Learned

Term	Definition
Workflow	A sequence of computational steps transforming inputs to outputs
Distributed Computing	Splitting work across multiple processors/computers
Dask	Python library for parallel computing (Scalable’s engine)
CLI	Text-based interface for running commands
Virtual Environment	Isolated Python installation for dependency management
Manifest	Declarative configuration file describing desired state
Declarative Programming	Describing what you want, not how to achieve it
Provider	Abstraction over an execution backend (local, HPC, cloud)
Worker	A process/thread that executes tasks
Future	A placeholder for a result being computed asynchronously
Validation	Checking correctness before execution
Dry Run	Simulating an operation without executing it
Telemetry	Automated recording of execution data

Next Steps

You’ve run your first Scalable workflow! You now understand the fundamental concepts that everything else builds on.

Next beginner tutorial: Beginner Tutorial 2: Understanding the Manifest System — deep dive into declarative configuration and YAML
Standard tutorial: Tutorial 1: Getting Started with Scalable — same topic with less explanation, more advanced patterns
Try modifying: Change max_workers to 4 and re-run. Is it faster? Why or why not? (Hint: you only have 6 tasks.)