Beginner Tutorial 1: Your First Workflow
The Big Picture
Imagine you have a Python script that processes data — maybe it analyzes energy scenarios, runs simulations, or trains models. When the data grows, running everything on your laptop becomes painfully slow. You need a way to split the work across multiple processors (or multiple computers) without rewriting your entire program.
That’s what Scalable does. It takes your Python functions and orchestrates them across multiple workers — whether those workers are threads on your laptop, processes on an HPC cluster, or containers in the cloud. And it does this through a simple configuration file rather than requiring you to write complex parallel programming code.
This tutorial walks you through your very first Scalable workflow, explaining every concept along the way.
Where this is going
This tutorial uses a deliberately trivial hello-scalable project so
you can verify your install in a few minutes. Tutorials 2–10 then graduate
to a realistic running example: downscaling
Demeter land-use / land-cover
projections across many GCAM scenarios in parallel. When you’re ready to
actually execute that pipeline, Tutorial Setup: Run the Demeter Example End-to-End walks
through the one-time setup.
💡 Key Concept: What is a Workflow?
A workflow is a sequence of computational steps that transforms inputs into outputs. Think of it like a recipe: you have ingredients (data), steps (functions), and a final dish (results).
In Scalable, a workflow consists of:
A manifest (configuration file) describing what resources you need
Python functions that do the actual work
A target (where the work runs — your laptop, a cluster, the cloud)
What You Will Learn
By the end of this tutorial you will:
Understand what Scalable is and why it exists.
Know what Dask is and why Scalable uses it under the hood.
Create and activate a Python virtual environment.
Install Scalable and use its command-line interface (CLI).
Write your first manifest file (scalable.yaml).
Validate, plan, and run a workflow end-to-end.
Read the telemetry output to see what happened.
Prerequisites
Python 3.11 or later installed on your computer.
A terminal (Terminal on macOS/Linux, PowerShell or Command Prompt on Windows).
Basic Python knowledge: you can write functions, use import, and know what pip is (even if you don’t use it daily).
No HPC cluster, Docker, or cloud account is needed — everything runs locally.
Key Concepts Explained
Before we write any code, let’s define the foundational ideas you’ll encounter.
💡 Key Concept: Distributed Computing
Distributed computing means splitting work across multiple processors or computers that work together. Instead of one CPU doing all 1000 tasks sequentially (one after another), you might have 10 CPUs each handling 100 tasks simultaneously.
Analogy: Imagine stuffing 1000 envelopes. Doing it alone takes hours. With 10 friends helping, each person stuffs 100 envelopes and you finish 10× faster. Distributed computing is getting those friends organized.
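The envelope analogy can be sketched with Python's standard library alone. This is only an illustration of the idea, not how Scalable or Dask distribute work: the function name stuff_envelope and the use of ThreadPoolExecutor are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def stuff_envelope(envelope_id: int) -> str:
    """Hypothetical unit of work: 'stuff' one envelope."""
    return f"envelope-{envelope_id}-stuffed"


def stuff_all(n_envelopes: int, n_friends: int) -> list[str]:
    """Share n_envelopes units of work among n_friends workers."""
    with ThreadPoolExecutor(max_workers=n_friends) as pool:
        # map() hands each envelope to whichever worker is free and
        # returns the results in submission order.
        return list(pool.map(stuff_envelope, range(n_envelopes)))


results = stuff_all(n_envelopes=1000, n_friends=10)
print(len(results))
```

The work here is trivial, so there is no measurable speedup. The point is the shape: one pool of workers, many independent tasks.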
💡 Key Concept: What is Dask?
Dask is a Python library for parallel and distributed computing. It’s the “engine” that Scalable uses under the hood to actually run your functions on multiple workers.
Think of Dask as the engine in a car — you don’t need to understand every piston to drive, but knowing it’s there helps you understand what’s happening.
Why Dask? Scalable chose Dask because it:
Integrates natively with Python’s scientific ecosystem (NumPy, pandas)
Scales from a single laptop to thousands of nodes
Has a mature scheduler that handles task dependencies
Supports dynamic scaling (adding/removing workers at runtime)
Is widely adopted in the scientific computing community
Alternatives like Ray (more ML-focused) or Celery (more web-focused) exist, but Dask’s strength is scientific workflows — exactly what Scalable targets.
💡 Key Concept: Command-Line Interface (CLI)
A CLI is a text-based way to interact with a program. Instead of
clicking buttons in a graphical interface, you type commands like
scalable run ./scalable.yaml.
CLIs are preferred for:
Automation — easy to script and repeat
Remote work — works over SSH where GUIs don’t
Reproducibility — commands can be saved and re-run exactly
💡 Key Concept: Virtual Environment
A virtual environment is an isolated Python installation. It has its
own copy of pip and installed packages, separate from your system
Python.
Why bother? Without virtual environments, installing a package for Project A might break Project B (if they need different versions of the same library). Virtual environments keep projects isolated.
Analogy: Virtual environments are like separate kitchen pantries for each recipe — what you put in one doesn’t affect the others.
Step 1: Set Up Your Environment
Let’s create an isolated Python environment for this tutorial.
Open your terminal and run:
# Create a new virtual environment named ".venv"
python -m venv .venv
# Activate it (this changes your terminal's Python to use the isolated one)
source .venv/bin/activate # macOS/Linux
# On Windows: .venv\Scripts\activate
What just happened?
python -m venv .venv created a folder called .venv containing a
fresh Python installation. source .venv/bin/activate tells your terminal
“use this Python instead of the system one.” You’ll see your prompt change
(often showing (.venv) at the beginning).
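If the prompt does not change, activation can also be checked from inside Python. The sketch below relies on standard interpreter behavior: inside a virtual environment, sys.prefix differs from sys.base_prefix. The helper name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the installation it was created from.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

Run it with the same python command you will use for the tutorial. The interpreter path should point inside .venv when the environment is active.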
Now install Scalable:
pip install scalable
Verify it worked:
scalable --help
You should see output like:
usage: scalable [-h] {validate,plan,run,report,advise,...} ...
Scalable CLI — orchestrate distributed workflows.
positional arguments:
{validate,plan,run,report,advise,...}
Under the Hood
When you ran pip install scalable, pip downloaded Scalable and all
its dependencies (including Dask). The scalable command is a CLI entry
point: a small launcher script that pip created in your virtual environment’s
bin/ directory (Scripts\ on Windows) and that starts Scalable’s command handler.
Step 2: Create a Project Directory
Scalable expects your workflow to live in a dedicated directory:
mkdir my-first-workflow && cd my-first-workflow
The minimal layout is:
my-first-workflow/
├── scalable.yaml # The manifest (configuration)
└── workflow.py # Your Python code
💡 Key Concept: Project Structure
Keeping configuration (scalable.yaml) and code (workflow.py) in a
dedicated directory makes your workflow:
Portable — zip it up and it works elsewhere
Version-controllable — put it in Git
Self-documenting — everything needed is in one place
Step 3: Write Your First Manifest
💡 Key Concept: What is a Manifest?
A manifest is a configuration file that declares the desired state of
your system. In Scalable, the manifest (scalable.yaml) answers:
What is this project?
Where should it run? (local machine? cloud? HPC cluster?)
How many resources does each piece need? (CPU, memory)
What are the work units?
The manifest is declarative — more on this below.
💡 Key Concept: Declarative vs. Imperative Programming
This is a fundamental programming paradigm distinction:
- Imperative (how to do it):
“SSH into server. Run this command. Check the output. If it failed, retry. Allocate 4GB of RAM by calling this API…”
- Declarative (what you want):
“I need 2 workers with 1 CPU and 1GB RAM each.”
The manifest is declarative — you describe your desired state and Scalable figures out how to achieve it. This is the same philosophy behind:
SQL (SELECT name FROM users: you say what data, not how to fetch it)
HTML (<h1>Title</h1>: you say what it is, not how to render it)
Kubernetes YAML (you describe desired state, K8s makes it happen)
Why declarative? It separates intent from implementation. Your manifest works whether you’re running locally, on an HPC cluster, or in AWS — only the “target” section changes.
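The difference can be shown in a few lines of Python. In this sketch the caller states only the desired worker count, and a reconcile function (a hypothetical name, not part of Scalable) works out which actions would get there from the current state.

```python
def reconcile(desired_workers: int, running: list[str]) -> list[str]:
    """Return the actions needed to move from the current state to the desired one."""
    actions = []
    # Start workers until the desired count is reached.
    for i in range(len(running), desired_workers):
        actions.append(f"start worker-{i}")
    # Stop any surplus workers, newest first.
    for name in reversed(running[desired_workers:]):
        actions.append(f"stop {name}")
    return actions


# Declarative input: only the desired end state is stated.
desired = {"workers": 2}

print(reconcile(desired["workers"], running=[]))
print(reconcile(desired["workers"], running=["worker-0", "worker-1", "worker-2"]))
```

The same declaration yields different actions depending on the starting point. That is the work an imperative script would have to spell out by hand.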
Create the file scalable.yaml:
# scalable.yaml: your first Scalable manifest
version: 1

project:
  name: hello-scalable

targets:
  local:
    provider: local
    max_workers: 2
    threads_per_worker: 1
    processes: false
    containers: none

components:
  analysis:
    cpus: 1
    memory: 1G

tasks:
  run_analysis:
    component: analysis
Let’s break this down line by line:
💡 Key Concept: YAML
YAML (YAML Ain’t Markup Language) is a human-readable data format. It uses indentation (spaces, not tabs!) to show structure:
# This is a comment
key: value                # A simple key-value pair
nested:
  child_key: child_val    # Indented = nested inside "nested"
list:
  - item1                 # Lists use dashes
  - item2
YAML was chosen over JSON (harder to read/write by hand) and TOML (less expressive for nested structures).
Section-by-section explanation:
version: 1
The schema version. This tells Scalable which format rules to apply when reading your manifest. Currently 1 is the only version.
project: { name: hello-scalable }
Metadata about your project. The name appears in logs, telemetry data, and artifact paths so you can identify which project a run belongs to.
targets:
Targets are where your code runs. You can have multiple targets (local, HPC, cloud) in one manifest and switch between them. Here we define one target called local:
- provider: local uses the built-in local provider (runs on your machine)
- max_workers: 2 creates up to 2 workers (parallel executors)
- threads_per_worker: 1 gives each worker 1 thread
- processes: false runs workers as threads (not separate processes)
- containers: none means no containerization (bare metal)
components:
Components define resource profiles: how much CPU and memory a piece of work needs. The analysis component requests 1 CPU and 1 gigabyte of RAM.
tasks:
Tasks are named work units that bind to a component. When you submit a function to Scalable, you associate it with a task name, which tells the system what resources it needs.
Why separate targets, components, and tasks?
This separation follows the separation of concerns principle:
Targets = where (infrastructure)
Components = how much (resources)
Tasks = what (work units)
You can change where you run (swap the target) without changing what you run (tasks and components stay the same). This is what makes Scalable truly portable.
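The sketch below shows that separation with plain dictionaries. It uses a trimmed version of the Step 3 manifest as the nested dict a YAML parser would typically produce. The hpc target, its slurm provider name, and the resolve helper are hypothetical additions for illustration, not Scalable's internals.

```python
manifest = {
    "targets": {
        "local": {"provider": "local", "max_workers": 2},
        "hpc": {"provider": "slurm", "max_workers": 64},  # hypothetical target
    },
    "components": {"analysis": {"cpus": 1, "memory": "1G"}},
    "tasks": {"run_analysis": {"component": "analysis"}},
}


def resolve(manifest: dict, task: str, target: str) -> dict:
    """Combine the 'what', 'how much', and 'where' for one task."""
    component = manifest["tasks"][task]["component"]
    return {
        "task": task,
        "resources": manifest["components"][component],
        "provider": manifest["targets"][target]["provider"],
    }


local_plan = resolve(manifest, "run_analysis", "local")
hpc_plan = resolve(manifest, "run_analysis", "hpc")
print(local_plan)
print(hpc_plan)
```

Only the provider differs between the two results. The task and its resource profile are untouched by the change of target.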
Step 4: Validate Your Manifest
Before running anything, check that your manifest is correctly written:
scalable validate ./scalable.yaml
Expected output:
✓ Manifest is valid (0 errors, 0 warnings)
💡 Key Concept: Validation
Validation means checking that something meets expected rules before using it. It’s like spell-check for your configuration.
Scalable’s validator checks:
Required sections exist (version, project)
Key names are spelled correctly (catches typos like providr)
References are valid (a task’s component actually exists)
Values are the right type (max_workers must be a positive number)
Why validate first? It’s much faster and cheaper to catch errors in a config file than to discover them 30 minutes into a cloud run that’s costing you money.
Try introducing a deliberate error to see what happens:
# Change "provider" to "providr" (typo) and validate again
targets:
  local:
    providr: local  # <-- typo!
ERROR targets.local: unknown provider 'providr'
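The four kinds of check listed above can be sketched as a small function over an already-parsed manifest. This is not Scalable's validator: the allowed key set is copied from the tutorial manifest, and the error messages are the sketch's own wording.

```python
ALLOWED_TARGET_KEYS = {
    "provider", "max_workers", "threads_per_worker", "processes", "containers",
}


def validate(manifest: dict) -> list[str]:
    """Return a list of error messages; an empty list means the manifest passed."""
    errors = []
    # 1. Required top-level sections.
    for section in ("version", "project"):
        if section not in manifest:
            errors.append(f"missing required section '{section}'")
    # 2. Unknown keys and value types inside each target.
    for name, target in manifest.get("targets", {}).items():
        for key in target:
            if key not in ALLOWED_TARGET_KEYS:
                errors.append(f"targets.{name}: unknown key '{key}'")
        workers = target.get("max_workers", 1)
        if not isinstance(workers, int) or isinstance(workers, bool) or workers < 1:
            errors.append(f"targets.{name}: max_workers must be a positive integer")
    # 3. Every task must reference a component that exists.
    components = manifest.get("components", {})
    for name, task in manifest.get("tasks", {}).items():
        if task.get("component") not in components:
            errors.append(f"tasks.{name}: unknown component '{task.get('component')}'")
    return errors


good = {
    "version": 1,
    "project": {"name": "hello-scalable"},
    "targets": {"local": {"provider": "local", "max_workers": 2}},
    "components": {"analysis": {"cpus": 1, "memory": "1G"}},
    "tasks": {"run_analysis": {"component": "analysis"}},
}
bad = {
    "version": 1,
    "project": {"name": "hello-scalable"},
    "targets": {"local": {"providr": "local", "max_workers": 0}},
    "components": {},
    "tasks": {"run_analysis": {"component": "analysis"}},
}

print(validate(good))
for message in validate(bad):
    print(message)
```

Collecting every error before reporting, rather than stopping at the first, lets you fix a manifest in one pass.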
Step 5: Plan the Execution
Planning shows you what would happen without actually doing it:
scalable plan ./scalable.yaml --target local --dry-run
Plan created for target 'local' (provider: local)
Workers: 2 × analysis (1 cpu, 1G memory)
Manifest lock: sha256:a3b8f1...
💡 Key Concept: Dry Run
A dry run simulates an operation without executing it. It answers “what would happen if I ran this?” without consuming real resources.
This is valuable because:
You can verify your configuration before spending time/money
You can review the plan and catch mistakes
In cloud environments, you can see estimated costs before committing
The --dry-run flag is common across many tools (terraform plan,
kubectl --dry-run, rsync --dry-run).
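The pattern behind a dry-run flag is simple: build the plan first, then decide whether to act on it. The provision function below is a hypothetical sketch of that pattern, not Scalable's planner.

```python
def provision(worker_count: int, dry_run: bool = False) -> list[str]:
    """Plan worker start-up; only perform it when dry_run is False."""
    planned = [f"start worker-{i}" for i in range(worker_count)]
    if dry_run:
        # Report the plan and stop: nothing is started.
        return [f"[dry-run] would {step}" for step in planned]
    started = []
    for step in planned:
        # A real tool would launch a worker here; the sketch only records it.
        started.append(f"done: {step}")
    return started


print(provision(2, dry_run=True))
print(provision(2))
```

Because both paths share the same planning code, the dry run shows the steps a real run would take.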
💡 Key Concept: Manifest Lock (Hash)
The sha256:a3b8f1... is a hash — a fingerprint of your manifest’s
contents. If you change anything in the manifest, the hash changes. This
enables:
Reproducibility — you can verify that a run used the exact same configuration as a previous run
Caching — Scalable knows if the manifest changed since last run
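A fingerprint of this kind can be computed with Python's hashlib. The sketch hashes the raw manifest text. Whether Scalable normalizes the manifest before hashing is not covered here, so treat manifest_lock as an illustrative helper rather than the tool's actual algorithm.

```python
import hashlib


def manifest_lock(manifest_text: str) -> str:
    """Return a 'sha256:<hex>' fingerprint of the manifest text."""
    digest = hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


original = "version: 1\nproject:\n  name: hello-scalable\n"
edited = original.replace("hello-scalable", "hello-scalable-2")

# Identical text always produces the identical lock.
print(manifest_lock(original) == manifest_lock(original))
# Any change to the text produces a different lock.
print(manifest_lock(original) == manifest_lock(edited))
```

Comparing two locks is enough to tell whether two runs used byte-for-byte identical configuration.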
Step 6: Write Your Workflow Code
Now let’s write the Python function that does actual work. Create
workflow.py:
"""My first Scalable workflow."""
import time

from scalable import ScalableSession


def analyze_scenario(scenario_id: int) -> dict:
    """Simulate an analysis task.

    In a real workflow this might run an energy model, process
    satellite data, or train a machine learning model. Here we
    just simulate work with a sleep.
    """
    time.sleep(0.5)  # Simulate 0.5 seconds of computation
    return {
        "scenario_id": scenario_id,
        "result": scenario_id * 42,
        "status": "complete",
    }


def main():
    """Run the workflow using a ScalableSession."""
    # Create a session from our manifest
    session = ScalableSession.from_yaml(
        "./scalable.yaml",
        target="local",
    )
    plan = session.plan()
    client = session.start(plan)

    # Submit 6 tasks to be executed in parallel
    futures = []
    for i in range(6):
        future = client.submit(analyze_scenario, i, tag="analysis")
        futures.append(future)

    # Gather results (blocks until all tasks complete)
    results = client.gather(futures)

    print(f"Completed {len(results)} scenarios!")
    for r in results:
        print(f"  Scenario {r['scenario_id']}: result = {r['result']}")

    # Clean up
    session.close()


if __name__ == "__main__":
    main()
Let’s understand what this code does:
Under the Hood: What happens when you call client.submit()
Your function (analyze_scenario) and its arguments (scenario_id) are serialized (converted to bytes that can be sent over a network).
The serialized task is sent to Dask’s scheduler.
The scheduler finds an available worker and assigns the task.
The worker deserializes the function, executes it, and sends the result back.
You get a future — a placeholder for the result that will be available later.
With max_workers: 2, Scalable runs 2 tasks at a time. Since we
submitted 6 tasks, they execute in 3 batches of 2 (total ~1.5 seconds
instead of 3 seconds sequentially).
💡 Key Concept: Futures
A future is a promise of a result that hasn’t been computed yet. When
you call client.submit(), the task starts running in the background
and you immediately get back a future object.
Later, when you call client.gather(futures), the call blocks until all
the futures have their results ready, then returns those results.
Analogy: Ordering food at a counter — you get a receipt number (future) immediately. The food is being prepared in the background. When you hear your number called, you pick up your food (gather the result).
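The same submit-then-gather pattern exists in Python's standard library, which makes it easy to try without Scalable installed. This sketch uses concurrent.futures rather than Scalable's client, and shortens the sleep to 0.05 seconds. The names analyze and run_all are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analyze(scenario_id: int) -> int:
    """Stand-in for real work: sleep briefly, then return a value."""
    time.sleep(0.05)
    return scenario_id * 42


def run_all(n_tasks: int, n_workers: int) -> tuple[list[int], float]:
    """Submit n_tasks to n_workers threads; return results and elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # submit() returns immediately with a Future for each task.
        futures = [pool.submit(analyze, i) for i in range(n_tasks)]
        # result() blocks until that particular task has finished.
        results = [f.result() for f in futures]
    return results, time.perf_counter() - start


results, elapsed = run_all(n_tasks=6, n_workers=2)
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

With 2 workers, the 6 tasks need at least three rounds of sleeping. The elapsed time should therefore land near 0.15 seconds, not the 0.30 seconds a sequential loop would take.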
Step 7: Run the Workflow
Execute your workflow:
python workflow.py
Expected output:
Completed 6 scenarios!
Scenario 0: result = 0
Scenario 1: result = 42
Scenario 2: result = 84
Scenario 3: result = 126
Scenario 4: result = 168
Scenario 5: result = 210
You can also run workflows via the CLI (for manifests that define entry points), but the Python API gives you the most control.
🤔 Think About It
With 6 tasks and 2 workers, how long should this take?
Sequential (no parallelism): 6 × 0.5s = 3.0 seconds
Parallel with 2 workers: 3 batches × 0.5s = ~1.5 seconds
The speedup is approximately 2× with 2 workers. This is the fundamental value of distributed computing — trading more hardware for less time.
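The arithmetic above generalizes to any task and worker count. The helper below is an idealized estimate that assumes equal-length tasks and ignores scheduling overhead. estimated_runtime is a name chosen for this sketch.

```python
import math


def estimated_runtime(n_tasks: int, n_workers: int, seconds_per_task: float) -> float:
    """Idealized wall-clock time: tasks run in waves of n_workers, ignoring overhead."""
    waves = math.ceil(n_tasks / n_workers)
    return waves * seconds_per_task


for workers in (1, 2, 4, 6):
    print(workers, "workers:", estimated_runtime(6, workers, 0.5), "s")
```

Note that 4 workers still need two waves for 6 tasks, so the estimate is 1.0 second, not 0.75. Extra workers only help when there are enough tasks to keep them busy.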
Step 8: Inspect Telemetry
💡 Key Concept: Telemetry
Telemetry is automated data collection about what happened during execution. Think of it like a flight recorder (black box) for your workflow — it records events so you can understand what happened after the fact.
After your run completes, Scalable has recorded telemetry data. Generate a report:
scalable report --last
This shows a summary of your most recent run: how many tasks succeeded, how long they took, and resource utilization.
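To give a feel for what telemetry involves, the sketch below records one event per task and reduces the events to a summary. The event fields and function names are hypothetical. Scalable's actual telemetry format is not shown in this tutorial.

```python
import time


def run_with_telemetry(tasks: dict) -> list[dict]:
    """Run each named callable and record one event per task."""
    events = []
    for name, func in tasks.items():
        start = time.perf_counter()
        try:
            func()
            status = "succeeded"
        except Exception as exc:  # record the failure instead of stopping
            status = f"failed: {exc}"
        events.append({"task": name, "status": status,
                       "seconds": time.perf_counter() - start})
    return events


def summarize(events: list[dict]) -> dict:
    """Reduce raw events to the kind of totals a report might show."""
    ok = sum(1 for e in events if e["status"] == "succeeded")
    return {"tasks": len(events), "succeeded": ok, "failed": len(events) - ok,
            "total_seconds": sum(e["seconds"] for e in events)}


def failing_task() -> None:
    """Hypothetical task that always fails."""
    raise ValueError("bad input")


events = run_with_telemetry({"ok_task": lambda: None, "bad_task": failing_task})
summary = summarize(events)
print(summary["tasks"], summary["succeeded"], summary["failed"])
```

Keeping raw events separate from the summary means new reports can be computed later without re-running the workflow.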
Common Questions
Q: Do I always need a manifest file?
Yes — the manifest is the single source of truth for your workflow’s resource requirements. This is by design: it makes workflows reproducible and portable.
Q: Why not just use Python’s multiprocessing module?
Python’s multiprocessing works for simple parallelism on one machine. But
it can’t:
Scale to multiple machines (HPC clusters, cloud)
Manage heterogeneous resources (different CPU/memory per task type)
Cache results between runs
Provide telemetry and observability
Handle worker failures gracefully
Scalable (via Dask) provides all of these.
Q: What’s the difference between threads and processes?
Threads share memory (fast communication, but Python’s GIL limits true CPU parallelism).
Processes have separate memory (true parallelism, but higher overhead to start and communicate).
For I/O-bound work (network calls, file reading), threads work well. For
CPU-bound work (heavy math), processes are better. The processes: false
setting in our manifest uses threads for simplicity.
Q: What is the GIL?
The Global Interpreter Lock (GIL) is a detail of CPython, the standard
Python interpreter, that prevents multiple threads from executing Python
bytecode at the same time. It exists to keep the interpreter’s memory
management safe, but it means CPU-bound threads don’t truly run in
parallel. This is why processes: true is better for computation-heavy
tasks.
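The threads-versus-processes choice can be sketched with the standard library's two executor classes. The make_executor helper is hypothetical and only mirrors the meaning of the processes setting. It says nothing about how Scalable or Dask create workers.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


def make_executor(processes: bool, max_workers: int) -> Executor:
    """Pick a thread pool or a process pool from a processes-style flag."""
    if processes:
        # Separate interpreters: sidesteps the GIL, costs more to start.
        return ProcessPoolExecutor(max_workers=max_workers)
    # Shared memory, cheap to start: good when tasks mostly wait on I/O.
    return ThreadPoolExecutor(max_workers=max_workers)


def square(x: int) -> int:
    """Tiny stand-in for a unit of work."""
    return x * x


# processes=False mirrors the tutorial manifest. With processes=True the
# submitted function must be importable by the worker processes, so a script
# should do this work under an `if __name__ == "__main__":` guard.
with make_executor(processes=False, max_workers=2) as pool:
    squares = list(pool.map(square, range(5)))
print(squares)
```

Both executors share one interface, so switching between threads and processes is a one-flag change, much like editing processes in the manifest.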
What You Learned
| Term | Definition |
|---|---|
| Workflow | A sequence of computational steps transforming inputs to outputs |
| Distributed Computing | Splitting work across multiple processors/computers |
| Dask | Python library for parallel computing (Scalable’s engine) |
| CLI | Text-based interface for running commands |
| Virtual Environment | Isolated Python installation for dependency management |
| Manifest | Declarative configuration file describing desired state |
| Declarative Programming | Describing what you want, not how to achieve it |
| Provider | Abstraction over an execution backend (local, HPC, cloud) |
| Worker | A process/thread that executes tasks |
| Future | A placeholder for a result being computed asynchronously |
| Validation | Checking correctness before execution |
| Dry Run | Simulating an operation without executing it |
| Telemetry | Automated recording of execution data |
Next Steps
You’ve run your first Scalable workflow! You now understand the fundamental concepts that everything else builds on.
Next beginner tutorial: Beginner Tutorial 2: Understanding the Manifest System — deep dive into declarative configuration and YAML
Standard tutorial: Tutorial 1: Getting Started with Scalable — same topic with less explanation, more advanced patterns
Try modifying: Change max_workers to 4 and re-run. Is it faster? Why or why not? (Hint: you only have 6 tasks.)