Tutorial 5: Cloud Integration with AWS and GCP

What You Will Learn

By the end of this tutorial you will:

Configure AWS Fargate and EC2-backed Dask clusters via Scalable.
Set up GCP Cloud Run / GKE-based execution.
Use the artifact store for persistent cloud storage (S3, GCS).
Estimate costs before running with dry-run planning.
Deploy multi-target manifests that promote from local to cloud.
Manage IAM roles, networking, and container registries.

Prerequisites

Completed Tutorial 1: Getting Started with Scalable and Tutorial 2: Mastering the Manifest System.
pip install "scalable[cloud]" (installs s3fs, gcsfs, dask-cloudprovider, fsspec; the quotes keep shells such as zsh from expanding the brackets).
AWS credentials configured (~/.aws/credentials or environment variables).
(For GCP) gcloud CLI authenticated or GOOGLE_APPLICATION_CREDENTIALS set.

Scenario

Your energy forecasting pipeline works locally but needs to scale to 50+ concurrent scenarios for a production run. Your organization uses AWS for burst compute and GCS for long-term data storage. You need to deploy the same workflow to cloud infrastructure with cost visibility.

Step 1: AWS Target Configuration

The AWS provider uses dask-cloudprovider to launch Dask workers on Fargate (serverless containers) or EC2 instances:

# scalable.yaml
version: 1
project:
  name: demeter-lulcc-aws
  default_storage: s3://${S3_BUCKET}/scalable-runs/

targets:
  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    cluster_type: fargate
    instance_type: m5.xlarge       # Used only when cluster_type is ec2
    worker_cpu: 4096               # Fargate CPU units (1024 = 1 vCPU)
    worker_mem: 16384              # Fargate memory in MiB
    image: ${ECR_IMAGE}
    execution_role_arn: ${EXECUTION_ROLE_ARN}
    task_role_arn: ${TASK_ROLE_ARN}
    subnets:
      - ${SUBNET_A}
      - ${SUBNET_B}
    security_groups:
      - ${SG_ID}
    adaptive:
      minimum: 2
      maximum: 20

components:
  demeter:
    image: ${ECR_DEMETER_IMAGE}
    cpus: 4
    memory: 16G
    tags: [lulcc, downscaling, gcam]

  postprocess:
    cpus: 2
    memory: 8G
    tags: [analysis]

tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir

  aggregate:
    component: postprocess
    cache: true

Key configuration explained:

cluster_type

fargate for serverless (no EC2 management) or ec2 for instance-backed clusters (lower cost at scale, more control over instance types).

worker_cpu / worker_mem

Fargate task sizing. CPU is given in CPU units (1024 units = 1 vCPU) and memory in MiB. Fargate accepts only certain CPU/memory pairings. Common configurations:

CPU (units)	Memory (MiB)	Use Case
1024	4096	Light tasks, I/O-bound
4096	16384	Standard compute tasks
16384	65536	Memory-intensive models

execution_role_arn

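IAM role assumed by the ECS agent to pull images and write logs. Needs ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, logs:CreateLogStream, and logs:PutLogEvents. The AWS managed policy AmazonECSTaskExecutionRolePolicy grants all of these.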

task_role_arn

IAM role assumed by the running task. Needs S3 read/write permissions for artifacts. Network access for Dask scheduler communication is controlled by subnets and security groups (see Step 2), not by IAM.
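Fargate rejects task definitions whose CPU and memory values do not form a supported pairing, so it can help to check worker_cpu and worker_mem before launching. The sketch below is not part of Scalable; the function name is hypothetical, and the table reflects the Fargate task sizes documented by AWS at the time of writing, so verify it against the current AWS documentation.

```python
# Supported Fargate pairings: CPU units -> allowed memory values in MiB.
# Assumption: taken from the AWS Fargate task-size documentation; may change.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(worker_cpu: int, worker_mem: int) -> bool:
    """Return True if (CPU units, memory in MiB) is a pairing Fargate accepts."""
    return worker_mem in FARGATE_MEMORY_MIB.get(worker_cpu, [])


# The three rows of the sizing table above are all supported pairings.
for cpu, mem in [(1024, 4096), (4096, 16384), (16384, 65536)]:
    print(cpu, mem, is_valid_fargate_size(cpu, mem))
```

A pairing such as 1024 CPU units with 16384 MiB would fail this check, because 1 vCPU tasks are limited to 8 GiB of memory.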

Step 2: Set Up AWS Infrastructure

Before running, ensure these AWS resources exist:

1. ECR Repository (Container Registry):

# Create the repository in the region your target runs in
aws ecr create-repository --repository-name demeter --region us-east-1
# Build and tag the image
docker build -t demeter:2.0.1 .
docker tag demeter:2.0.1 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
# Authenticate Docker to ECR, then push
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1

2. VPC + Subnets:

Workers need outbound internet access (for Dask scheduler communication and for pulling the image from ECR) and access to S3. Use a VPC with a NAT Gateway or VPC endpoints.

3. Security Group:

# Allow Dask traffic (8786 scheduler, 8787 dashboard) between members of this group;
# outbound traffic is allowed by default in a new security group
aws ec2 create-security-group \
  --group-name scalable-workers \
  --description "Scalable Dask workers"
aws ec2 authorize-security-group-ingress \
  --group-id sg-xyz789 \
  --protocol tcp --port 8786-8787 \
  --source-group sg-xyz789

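4. IAM Roles:

Create the execution and task roles referenced in the manifest. The task role needs an S3 policy such as the following (replace my-bucket with your bucket name):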

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
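If you manage several buckets or environments, generating the policy document is less error-prone than editing JSON by hand. The helper below is a hypothetical sketch (the function name is not part of Scalable or AWS tooling) that reproduces the policy above for any bucket name.

```python
import json


def s3_artifact_policy(bucket: str) -> dict:
    """Build the S3 read/write policy shown above for a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # bucket ARN, used by s3:ListBucket
                    f"arn:aws:s3:::{bucket}/*",    # object ARNs, used by Get/PutObject
                ],
            }
        ],
    }


print(json.dumps(s3_artifact_policy("my-bucket"), indent=2))
```

Both resource entries are needed: s3:ListBucket applies to the bucket itself, while s3:GetObject and s3:PutObject apply to the objects under it.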

Step 3: Dry-Run Cost Estimation

Before launching real cloud resources, estimate costs:

scalable run ./scalable.yaml --target aws --dry-run

Dry-run plan for target 'aws' (provider: aws):
  Workers: 10 × demeter (4 vCPU, 16 GiB)
           5 × postprocess (2 vCPU, 8 GiB)
  Estimated duration: 2.5 hours
  Estimated cost: $4.82
    Fargate compute: $3.90
    Data transfer: $0.12
    S3 storage: $0.80

Programmatic cost access:

from scalable import ScalableSession

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
plan = session.plan(dry_run=True)

if plan.cost_estimate:
    print(f"Estimated cost: ${plan.cost_estimate.total:.2f}")
    print(f"  Compute: ${plan.cost_estimate.compute:.2f}")
    print(f"  Storage: ${plan.cost_estimate.storage:.2f}")
    print(f"  Transfer: ${plan.cost_estimate.transfer:.2f}")

How cost estimation works: Scalable uses the scalable.providers.cloud.cost_tables module, which contains region-specific pricing for Fargate vCPU-hours, memory-hours, and S3 operations. Each estimate combines the planned worker count, the predicted task duration (taken from telemetry history when available), and the declared storage outputs.
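To make the arithmetic concrete, here is a minimal sketch of how a Fargate compute figure can be derived from worker counts, sizes, and runtime. It is not Scalable's implementation: the WorkerGroup class, the function name, and the hourly rates are assumptions for illustration, and the rates should be replaced with current pricing for your region.

```python
from dataclasses import dataclass

# Illustrative on-demand rates in USD. These are assumptions for the example;
# check current AWS Fargate pricing for your region before relying on them.
VCPU_HOUR_USD = 0.04048
GB_HOUR_USD = 0.004445


@dataclass
class WorkerGroup:
    count: int          # number of workers in the group
    vcpus: float        # vCPUs per worker
    memory_gib: float   # memory per worker
    hours: float        # billed runtime per worker


def fargate_compute_cost(groups) -> float:
    """Sum per-worker hourly cost (vCPU + memory) over all groups."""
    total = 0.0
    for g in groups:
        hourly = g.vcpus * VCPU_HOUR_USD + g.memory_gib * GB_HOUR_USD
        total += g.count * g.hours * hourly
    return total


# Ten 4 vCPU / 16 GiB workers and five 2 vCPU / 8 GiB workers, one hour each.
groups = [WorkerGroup(10, 4, 16, 1.0), WorkerGroup(5, 2, 8, 1.0)]
print(f"Estimated compute: ${fargate_compute_cost(groups):.2f}")
```

Storage and data-transfer charges would be added on top of this compute figure, as in the dry-run breakdown above.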

Step 4: GCP Target Configuration

For Google Cloud Platform, use GCS for storage and either Cloud Run or GKE for compute:

targets:
  gcp:
    provider: gcp
    region: us-central1
    project_id: ${GCP_PROJECT_ID}
    cluster_type: cloud_run
    worker_cpu: 4
    worker_mem: 16Gi
    image: gcr.io/${GCP_PROJECT_ID}/demeter:2.0.1
    service_account: ${GCP_SERVICE_ACCOUNT}
    adaptive:
      minimum: 1
      maximum: 15

project:
  default_storage: gs://${GCS_BUCKET}/scalable-runs/

GCP-specific setup:

# Authenticate
gcloud auth application-default login

# Push image to GCR
gcloud builds submit --tag gcr.io/my-project/demeter:2.0.1 .

# Create GCS bucket for artifacts
gsutil mb -l us-central1 gs://my-bucket/

Step 5: Artifact Store — Cloud Storage

The artifact store provides a unified interface for persisting outputs across storage backends:

from scalable.artifacts import build_artifact_store

# Local storage (default)
local_store = build_artifact_store("./artifacts")

# S3 storage
s3_store = build_artifact_store("s3://my-bucket/artifacts/")

# GCS storage
gcs_store = build_artifact_store("gs://my-bucket/artifacts/")

# Store a file
ref = s3_store.put("local/output.csv", "runs/run-001/output.csv")
print(ref)
# ArtifactRef(uri='s3://my-bucket/artifacts/runs/run-001/output.csv')

# Retrieve a file
local_path = s3_store.get("runs/run-001/output.csv", "./downloads/output.csv")

The store is protocol-aware via fsspec: it detects the URI scheme and uses the appropriate backend (s3fs for S3, gcsfs for GCS, local filesystem for paths).
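The scheme detection described above can be pictured with a few lines of standard-library Python. This is a simplified sketch, not the code inside build_artifact_store: the BACKENDS mapping and the detect_backend name are hypothetical, and the real store delegates to fsspec rather than returning a label.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URI scheme to the backend that would handle it.
BACKENDS = {
    "s3": "s3fs",
    "gs": "gcsfs",
    "file": "local",
    "": "local",   # plain paths such as ./artifacts have no scheme
}


def detect_backend(location: str) -> str:
    """Return the backend label for a storage location, based on its URI scheme."""
    scheme = urlsplit(location).scheme
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported storage scheme {scheme!r} in {location!r}")
    return BACKENDS[scheme]


for location in ["./artifacts", "s3://my-bucket/artifacts/", "gs://my-bucket/artifacts/"]:
    print(location, "->", detect_backend(location))
```

Because the backend is chosen from the URI alone, switching a workflow from local to cloud storage only requires changing the location string.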

Integration with workflow output:

from scalable import ScalableSession
from scalable.artifacts import build_artifact_store

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
client = session.start()

# Run simulation (run_demeter_scenario and scenario_params come from your workflow code)
result = client.submit(run_demeter_scenario, scenario_params, tag="demeter").result()

# Persist output artifact to configured storage
store = build_artifact_store(session.manifest.project.default_storage)
ref = store.put(
    result["output_path"],
    f"runs/{session._telemetry.run_id}/gcam-output.tar.gz",
)
print(f"Artifact persisted: {ref.uri}")

Step 6: Multi-Region Deployment

For global workflows, define targets in multiple regions:

targets:
  aws-east:
    provider: aws
    region: us-east-1
    # ... config ...
    adaptive:
      minimum: 5
      maximum: 50

  aws-west:
    provider: aws
    region: us-west-2
    # ... config ...
    adaptive:
      minimum: 2
      maximum: 20

  gcp-europe:
    provider: gcp
    region: europe-west1
    # ... config ...

Select at runtime:

# Heavy production run in us-east-1
scalable run ./scalable.yaml --target aws-east --workflow pipeline.py

# Quick validation in us-west-2
scalable run ./scalable.yaml --target aws-west --workflow pipeline.py --dry-run

Step 7: Cloud + Cache Integration

Combine cloud execution with remote caching so repeated runs across different machines share results:

export SCALABLE_CACHE_REMOTE=s3://my-bucket/scalable-cache/

# scalable.yaml
project:
  name: demeter-lulcc
  default_storage: s3://my-bucket/outputs/

Now:

First cloud run computes all scenarios and caches results to S3.
Subsequent runs (from any machine) hit the shared cache.
Only modified scenarios recompute.

This is particularly useful for CI/CD: a PR validation pipeline reuses results cached by earlier runs and recomputes only the scenarios that changed.
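The reason only modified scenarios recompute is that a cache entry is looked up by a key derived from the task and its inputs. The sketch below illustrates that idea with a content hash; it is an assumption about the general technique, not Scalable's actual key scheme, and the cache_key function is hypothetical.

```python
import hashlib
import json


def cache_key(task_name: str, params: dict) -> str:
    """Derive a deterministic key from the task name and its parameters."""
    # sort_keys makes the key independent of dictionary ordering
    payload = json.dumps({"task": task_name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


baseline = cache_key("run_demeter_scenario", {"scenario": "ssp2", "year": 2050})
reordered = cache_key("run_demeter_scenario", {"year": 2050, "scenario": "ssp2"})
modified = cache_key("run_demeter_scenario", {"scenario": "ssp2", "year": 2060})

print(baseline == reordered)  # same inputs, same key: cache hit
print(baseline == modified)   # changed input, different key: recompute
```

Any machine that computes the same key can fetch the stored result from the shared S3 location instead of rerunning the task.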

Step 8: Environment Variable Template

For production deployments, maintain a .env template:

# .env.cloud (do not commit secrets — use secrets manager)
AWS_REGION=us-east-1
S3_BUCKET=demeter-prod-artifacts
ECR_IMAGE=123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
ECR_DEMETER_IMAGE=123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
EXECUTION_ROLE_ARN=arn:aws:iam::123456789:role/ecsTaskExecutionRole
TASK_ROLE_ARN=arn:aws:iam::123456789:role/scalableTaskRole
SUBNET_A=subnet-abc123
SUBNET_B=subnet-def456
SG_ID=sg-xyz789
SCALABLE_CACHE_REMOTE=s3://demeter-prod-artifacts/cache/

Load before running:

set -a && source .env.cloud && set +a
scalable run ./scalable.yaml --target aws --workflow pipeline.py
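A missing variable is easier to diagnose before launch than from a failed cloud run. The following pre-flight check is a sketch under assumptions: it treats ${NAME} and ${NAME:-default} with shell-style semantics, which may differ from how Scalable resolves placeholders, and missing_variables is a hypothetical helper rather than a Scalable API.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}; group 2 is None when no default is given.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def missing_variables(manifest_text: str, env=None) -> list:
    """List placeholders that have no default and no non-empty value in env."""
    env = os.environ if env is None else env
    missing = set()
    for match in _VAR.finditer(manifest_text):
        name, default = match.group(1), match.group(2)
        if default is None and not env.get(name):
            missing.add(name)
    return sorted(missing)


manifest = """
region: ${AWS_REGION:-us-east-1}
image: ${ECR_IMAGE}
security_groups:
  - ${SG_ID}
"""
print(missing_variables(manifest, env={"SG_ID": "sg-xyz789"}))
```

In practice you would pass the contents of scalable.yaml and omit the env argument so the check runs against the variables loaded from .env.cloud.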

Troubleshooting

“botocore.exceptions.NoCredentialsError”: AWS credentials are not configured. Run aws configure or set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. For EC2/ECS, ensure the instance profile or task role has necessary permissions.
Fargate task fails with “CannotPullContainerError”: The execution role lacks ECR permissions, the image URI is wrong, or the image doesn’t exist in the specified region. Verify with: aws ecr describe-images --repository-name demeter.
Workers can’t connect to scheduler: Security group must allow inbound TCP on the Dask scheduler port (8786) from the worker security group. Subnets must have a route to the scheduler host (typically your local machine or a bastion).
GCS “403 Forbidden”: The service account lacks storage.objects.create permission on the bucket. Grant the roles/storage.objectAdmin role.
Cost estimate shows $0.00: Cost tables may not have pricing for your specific region or instance type. Check that scalable.providers.cloud.cost_tables includes your region.
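For the CannotPullContainerError case, it helps to check the exact image, tag, and region encoded in the image URI. The sketch below parses a tagged ECR URI and builds the matching aws ecr describe-images call; the describe_images_command helper is hypothetical, and digest-based URIs (with @sha256:...) are deliberately not handled.

```python
import re

# Shape of a tagged ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+):(?P<tag>[^@]+)$"
)


def describe_images_command(image_uri: str) -> list:
    """Build the `aws ecr describe-images` arguments that check one tagged image."""
    match = _ECR_URI.match(image_uri)
    if match is None:
        raise ValueError(f"not a tagged ECR image URI: {image_uri!r}")
    return [
        "aws", "ecr", "describe-images",
        "--region", match["region"],
        "--repository-name", match["repository"],
        "--image-ids", f"imageTag={match['tag']}",
    ]


print(" ".join(describe_images_command(
    "123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1"
)))
```

If the printed command reports that the image does not exist, the tag or region in ECR_IMAGE is the likely cause rather than the execution role.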

Next Steps

Tutorial 6: Monitoring and Observability with Telemetry — Monitor cloud run costs and performance through telemetry.
Tutorial 8: Deployment Workflows with Kubernetes — Deploy to Kubernetes for container-native orchestration.
Tutorial 7: Error Handling and Resilience Patterns — Handle cloud-specific transient failures (timeouts, preemption).