Tutorial 5: Cloud Integration with AWS and GCP

What You Will Learn

By the end of this tutorial you will:

  • Configure AWS Fargate and EC2-backed Dask clusters via Scalable.

  • Set up GCP Cloud Run / GKE-based execution.

  • Use the artifact store for persistent cloud storage (S3, GCS).

  • Estimate costs before running with dry-run planning.

  • Deploy multi-target manifests that promote from local to cloud.

  • Manage IAM roles, networking, and container registries.

Prerequisites

Scenario

Your energy forecasting pipeline works locally but needs to scale to 50+ concurrent scenarios for a production run. Your organization uses AWS for burst compute and GCS for long-term data storage. You need to deploy the same workflow to cloud infrastructure with cost visibility.

Step 1: AWS Target Configuration

The AWS provider uses dask-cloudprovider to launch Dask workers on Fargate (serverless containers) or EC2 instances:

# scalable.yaml
version: 1
project:
  name: demeter-lulcc-aws
  default_storage: s3://${S3_BUCKET}/scalable-runs/

targets:
  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    cluster_type: fargate
    instance_type: m5.xlarge       # For EC2-backed mode
    worker_cpu: 4096               # Fargate CPU units (1024 = 1 vCPU)
    worker_mem: 16384              # Fargate memory in MiB
    image: ${ECR_IMAGE}
    execution_role_arn: ${EXECUTION_ROLE_ARN}
    task_role_arn: ${TASK_ROLE_ARN}
    subnets:
      - ${SUBNET_A}
      - ${SUBNET_B}
    security_groups:
      - ${SG_ID}
    adaptive:
      minimum: 2
      maximum: 20

components:
  demeter:
    image: ${ECR_DEMETER_IMAGE}
    cpus: 4
    memory: 16G
    tags: [lulcc, downscaling, gcam]

  postprocess:
    cpus: 2
    memory: 8G
    tags: [analysis]

tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir

  aggregate:
    component: postprocess
    cache: true

Key configuration explained:

cluster_type

fargate for serverless (no EC2 management) or ec2 for instance-backed clusters (lower cost at scale, more control over instance types).

worker_cpu / worker_mem

Fargate task sizing. CPU is in units of 1024 (= 1 vCPU). Common configurations:

CPU (units)   Memory (MiB)   Use Case
1024          4096           Light tasks, I/O-bound
4096          16384          Standard compute tasks
16384         65536          Memory-intensive models
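
Fargate accepts only certain CPU/memory pairs, so it can help to check worker_cpu and worker_mem before launching. The sketch below is not part of Scalable; check_fargate_size is a hypothetical helper, and the table reflects the Fargate task sizes documented by AWS at the time of writing, so verify it against the current AWS documentation.

```python
# Valid Fargate task sizes: CPU units -> allowed memory values in MiB.
# Assumption: mirrors the AWS-documented size table; re-check before relying on it.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def check_fargate_size(worker_cpu: int, worker_mem: int) -> None:
    """Raise ValueError if the CPU/memory pair is not a valid Fargate size."""
    allowed = FARGATE_MEMORY_MIB.get(worker_cpu)
    if allowed is None:
        raise ValueError(
            f"worker_cpu={worker_cpu} is not a Fargate CPU size; "
            f"expected one of {sorted(FARGATE_MEMORY_MIB)}"
        )
    if worker_mem not in allowed:
        raise ValueError(
            f"worker_mem={worker_mem} MiB is not valid for worker_cpu={worker_cpu}; "
            f"allowed range is {allowed[0]}-{allowed[-1]} MiB"
        )


if __name__ == "__main__":
    # The three configurations from the table above all pass the check.
    for cpu, mem in [(1024, 4096), (4096, 16384), (16384, 65536)]:
        check_fargate_size(cpu, mem)
    print("all table configurations are valid Fargate sizes")
```

Running a check like this in CI catches an invalid pair before ECS rejects the task definition at launch time.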

execution_role_arn

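IAM role assumed by the ECS agent to pull images and write logs. It needs ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, logs:CreateLogStream, and logs:PutLogEvents. The AWS-managed AmazonECSTaskExecutionRolePolicy grants all of these.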

task_role_arn

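IAM role assumed by the running task. It needs S3 read/write permissions for artifacts. Network access for Dask scheduler communication is not granted by this role; it is controlled by the subnets and security groups configured above.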

Step 2: Set Up AWS Infrastructure

Before running, ensure these AWS resources exist:

1. ECR Repository (Container Registry):

aws ecr create-repository --repository-name demeter
# Push your image
docker build -t demeter:2.0.1 .
docker tag demeter:2.0.1 123456789012.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1

2. VPC + Subnets:

Workers need outbound internet access (for Dask scheduler communication) and access to S3. Use a VPC with NAT Gateway or VPC endpoints.

3. Security Group:

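# Create the group (new security groups allow all outbound traffic by default)
aws ec2 create-security-group \
  --group-name scalable-workers \
  --description "Scalable Dask workers"
# Allow Dask scheduler (8786) and dashboard (8787) traffic between group members.
# Replace sg-xyz789 with the GroupId returned by the command above.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xyz789 \
  --protocol tcp --port 8786-8787 \
  --source-group sg-xyz789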

4. IAM Roles (example S3 policy to attach to the task role):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
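
If you manage several buckets or prefixes, generating the policy document avoids copy-paste mistakes in the ARNs. The following is a minimal sketch, not a Scalable feature: task_role_s3_policy is a hypothetical helper, and it splits the permissions into a bucket-level statement (s3:ListBucket) and an object-level statement (s3:GetObject, s3:PutObject), which is slightly tighter than the single combined statement above.

```python
import json


def task_role_s3_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document granting artifact read/write on one bucket.

    `prefix` optionally restricts object access to keys under that prefix.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    object_arn = f"{bucket_arn}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # GetObject/PutObject are object-level actions.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [object_arn],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(task_role_s3_policy("my-bucket", "scalable-runs/"), indent=2))
```

The printed JSON can be saved to a file and attached to the task role with your usual IAM tooling.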

Step 3: Dry-Run Cost Estimation

Before launching real cloud resources, estimate costs:

scalable run ./scalable.yaml --target aws --dry-run

Dry-run plan for target 'aws' (provider: aws):
  Workers: 10 × demeter (4 vCPU, 16 GiB)
           5 × postprocess (2 vCPU, 8 GiB)
  Estimated duration: 2.5 hours
  Estimated cost: $4.82
    Fargate compute: $3.90
    Data transfer: $0.12
    S3 storage: $0.80

Programmatic cost access:

from scalable import ScalableSession

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
plan = session.plan(dry_run=True)

if plan.cost_estimate:
    print(f"Estimated cost: ${plan.cost_estimate.total:.2f}")
    print(f"  Compute: ${plan.cost_estimate.compute:.2f}")
    print(f"  Storage: ${plan.cost_estimate.storage:.2f}")
    print(f"  Transfer: ${plan.cost_estimate.transfer:.2f}")

How cost estimation works: Scalable uses the scalable.providers.cloud.cost_tables module, which contains region-specific pricing for Fargate vCPU-hours, memory-hours, and S3 operations. Estimates are based on the planned worker count, the predicted task duration (from telemetry history, if available), and the declared storage outputs.
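
To make the compute part of that arithmetic concrete, here is a rough sketch of a per-hour Fargate estimate. It is not the code in scalable.providers.cloud.cost_tables: WorkerGroup and estimate_compute_cost are hypothetical names, and the two rates are placeholders rather than current AWS prices, so the result will not match the sample dry-run output above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WorkerGroup:
    """A set of identical workers planned for the run."""
    count: int
    vcpus: float
    memory_gib: float


# Placeholder rates in USD (assumption); look up real values for your region.
VCPU_HOUR_USD = 0.04
GIB_HOUR_USD = 0.004


def estimate_compute_cost(
    groups: Iterable[WorkerGroup],
    hours: float,
    vcpu_rate: float = VCPU_HOUR_USD,
    gib_rate: float = GIB_HOUR_USD,
) -> float:
    """Return the compute cost if every worker runs for the full duration."""
    total = 0.0
    for group in groups:
        # Fargate bills vCPU and memory separately, per hour, per task.
        per_worker_hour = group.vcpus * vcpu_rate + group.memory_gib * gib_rate
        total += group.count * per_worker_hour * hours
    return total


if __name__ == "__main__":
    plan = [WorkerGroup(10, 4, 16), WorkerGroup(5, 2, 8)]
    print(f"Upper-bound compute estimate: ${estimate_compute_cost(plan, 2.5):.2f}")
```

Treat the result as an upper bound: with adaptive scaling, workers rarely all run for the entire duration.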

Step 4: GCP Target Configuration

For Google Cloud Platform, use GCS for storage and either Cloud Run or GKE for compute:

targets:
  gcp:
    provider: gcp
    region: us-central1
    project_id: ${GCP_PROJECT_ID}
    cluster_type: cloud_run
    worker_cpu: 4
    worker_mem: 16Gi
    image: gcr.io/${GCP_PROJECT_ID}/demeter:2.0.1
    service_account: ${GCP_SERVICE_ACCOUNT}
    adaptive:
      minimum: 1
      maximum: 15

project:
  default_storage: gs://${GCS_BUCKET}/scalable-runs/

GCP-specific setup:

# Authenticate
gcloud auth application-default login

# Push image to GCR
gcloud builds submit --tag gcr.io/my-project/demeter:2.0.1 .

# Create GCS bucket for artifacts
gsutil mb -l us-central1 gs://my-bucket/

Step 5: Artifact Store — Cloud Storage

The artifact store provides a unified interface for persisting outputs across storage backends:

from scalable.artifacts import build_artifact_store

# Local storage (default)
local_store = build_artifact_store("./artifacts")

# S3 storage
s3_store = build_artifact_store("s3://my-bucket/artifacts/")

# GCS storage
gcs_store = build_artifact_store("gs://my-bucket/artifacts/")

# Store a file
ref = s3_store.put("local/output.csv", "runs/run-001/output.csv")
print(ref)
# ArtifactRef(uri='s3://my-bucket/artifacts/runs/run-001/output.csv')

# Retrieve a file
local_path = s3_store.get("runs/run-001/output.csv", "./downloads/output.csv")

The store is protocol-aware via fsspec: it detects the URI scheme and uses the appropriate backend (s3fs for S3, gcsfs for GCS, local filesystem for paths).
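
The scheme detection can be illustrated with the standard library alone. This is a sketch of the idea, not Scalable's or fsspec's implementation: detect_backend and the BACKENDS mapping are hypothetical, and the mapping only names the fsspec backend package you would expect for each scheme.

```python
from urllib.parse import urlsplit

# URI scheme -> name of the storage backend expected to handle it (illustrative).
BACKENDS = {
    "s3": "s3fs",
    "gs": "gcsfs",
    "gcs": "gcsfs",
    "file": "local",
}


def detect_backend(uri: str) -> str:
    """Return the backend name for a storage URI or a plain filesystem path."""
    scheme = urlsplit(uri).scheme.lower()
    # No scheme means a plain path; a one-letter scheme is a Windows drive (C:\...).
    if len(scheme) <= 1:
        return "local"
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"unsupported storage scheme {scheme!r} in {uri!r}") from None


if __name__ == "__main__":
    for uri in ("./artifacts", "s3://my-bucket/artifacts/", "gs://my-bucket/artifacts/"):
        print(uri, "->", detect_backend(uri))
```

A check like this is useful in preflight scripts, because a missing backend package (s3fs or gcsfs) otherwise surfaces only when the first artifact is written.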

Integration with workflow output:

from scalable import ScalableSession
from scalable.artifacts import build_artifact_store

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
client = session.start()

# Run simulation
result = client.submit(run_demeter_scenario, scenario_params, tag="demeter").result()

# Persist output artifact to configured storage
store = build_artifact_store(session.manifest.project.default_storage)
ref = store.put(
    result["output_path"],
    f"runs/{session._telemetry.run_id}/gcam-output.tar.gz",
)
print(f"Artifact persisted: {ref.uri}")

Step 6: Multi-Region Deployment

For global workflows, define targets in multiple regions:

targets:
  aws-east:
    provider: aws
    region: us-east-1
    # ... config ...
    adaptive:
      minimum: 5
      maximum: 50

  aws-west:
    provider: aws
    region: us-west-2
    # ... config ...
    adaptive:
      minimum: 2
      maximum: 20

  gcp-europe:
    provider: gcp
    region: europe-west1
    # ... config ...

Select at runtime:

# Heavy production run in us-east-1
scalable run ./scalable.yaml --target aws-east --workflow pipeline.py

# Quick validation in us-west-2
scalable run ./scalable.yaml --target aws-west --workflow pipeline.py --dry-run

Step 7: Cloud + Cache Integration

Combine cloud execution with remote caching so repeated runs across different machines share results:

# Shell: point the cache at a shared S3 location
export SCALABLE_CACHE_REMOTE=s3://my-bucket/scalable-cache/

# scalable.yaml
project:
  name: demeter-lulcc
  default_storage: s3://my-bucket/outputs/

Now:

  1. First cloud run computes all scenarios and caches results to S3.

  2. Subsequent runs (from any machine) hit the shared cache.

  3. Only modified scenarios recompute.

This is particularly powerful for CI/CD: your PR validation pipeline benefits from the cache populated by previous runs.
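
A shared cache only works if every machine derives the same key for the same work. The sketch below shows one common way to do that, by hashing a canonical JSON encoding of the task name and its parameters. How Scalable actually computes cache keys is not described in this tutorial, so cache_key and its code_version argument are assumptions for illustration.

```python
import hashlib
import json


def cache_key(task_name: str, params: dict, code_version: str = "") -> str:
    """Derive a deterministic key from the task identity and its inputs."""
    payload = json.dumps(
        {"task": task_name, "params": params, "version": code_version},
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # fixed separators avoid whitespace differences
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    base = cache_key("run_demeter_scenario", {"scenario": "ssp2", "year": 2050})
    same = cache_key("run_demeter_scenario", {"year": 2050, "scenario": "ssp2"})
    changed = cache_key("run_demeter_scenario", {"scenario": "ssp2", "year": 2060})
    print("same inputs share a key:", base == same)
    print("modified scenario gets a new key:", base != changed)
```

Because the key depends only on the inputs, an unchanged scenario resolves to an existing cache entry on any machine, while a modified scenario produces a new key and is recomputed.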

Step 8: Environment Variable Template

For production deployments, maintain a .env template:

# .env.cloud (do not commit secrets — use secrets manager)
AWS_REGION=us-east-1
S3_BUCKET=demeter-prod-artifacts
ECR_IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
ECR_DEMETER_IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
EXECUTION_ROLE_ARN=arn:aws:iam::123456789012:role/ecsTaskExecutionRole
TASK_ROLE_ARN=arn:aws:iam::123456789012:role/scalableTaskRole
SUBNET_A=subnet-abc123
SUBNET_B=subnet-def456
SG_ID=sg-xyz789
SCALABLE_CACHE_REMOTE=s3://demeter-prod-artifacts/cache/

Load before running:

set -a && source .env.cloud && set +a
scalable run ./scalable.yaml --target aws --workflow pipeline.py
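
A forgotten variable usually shows up late, as an invalid ARN or an empty subnet list. The following preflight sketch expands ${VAR} and ${VAR:-default} placeholders the way a POSIX shell would and reports anything unset. It is an assumption about how you might validate a manifest yourself; expand_placeholders is a hypothetical helper, not Scalable's own interpolation code.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} and ${NAME:-default}.
_PLACEHOLDER = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def expand_placeholders(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Expand placeholders in `text`; raise ValueError listing unset variables."""
    env = os.environ if env is None else env
    missing = []

    def replace(match: "re.Match[str]") -> str:
        name, default = match.group("name"), match.group("default")
        value = env.get(name)
        if default is not None:
            # ${NAME:-default}: the default applies when NAME is unset or empty.
            return value if value else default
        if value is None:
            missing.append(name)
            return match.group(0)
        return value

    expanded = _PLACEHOLDER.sub(replace, text)
    if missing:
        raise ValueError(f"unset variables: {', '.join(sorted(set(missing)))}")
    return expanded


if __name__ == "__main__":
    line = "region: ${AWS_REGION:-us-east-1}"
    print(expand_placeholders(line, env={}))
```

Running this over scalable.yaml after sourcing .env.cloud gives a single error that names every missing variable.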

Troubleshooting

“botocore.exceptions.NoCredentialsError”

AWS credentials are not configured. Run aws configure or set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. For EC2/ECS, ensure the instance profile or task role has necessary permissions.

Fargate task fails with “CannotPullContainerError”

The execution role lacks ECR permissions, the image URI is wrong, or the image doesn’t exist in the specified region. Verify with: aws ecr describe-images --repository-name demeter.

Workers can’t connect to scheduler

Security group must allow inbound TCP on the Dask scheduler port (8786) from the worker security group. Subnets must have a route to the scheduler host (typically your local machine or a bastion).
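
A quick way to separate a networking problem from a Dask problem is to test the TCP path directly from a host in the worker subnet. This is a generic standard-library sketch; can_reach is a hypothetical helper, and the host you pass in is whatever address your scheduler is listening on.

```python
import socket


def can_reach(host: str, port: int = 8786, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    state = "reachable" if can_reach(target) else "NOT reachable"
    print(f"scheduler port 8786 on {target} is {state}")
```

A timeout usually points to a security group or routing issue, while an immediate refusal suggests the scheduler is not listening on that address.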

GCS “403 Forbidden”

The service account lacks storage.objects.create permission on the bucket. Grant the roles/storage.objectAdmin role.

Cost estimate shows $0.00

Cost tables may not have pricing for your specific region or instance type. Check that scalable.providers.cloud.cost_tables includes your region.

Next Steps