Tutorial 5: Cloud Integration with AWS and GCP
What You Will Learn
By the end of this tutorial you will:
- Configure AWS Fargate and EC2-backed Dask clusters via Scalable.
- Set up GCP Cloud Run / GKE-based execution.
- Use the artifact store for persistent cloud storage (S3, GCS).
- Estimate costs before running with dry-run planning.
- Deploy multi-target manifests that promote from local to cloud.
- Manage IAM roles, networking, and container registries.
Prerequisites
- Completed Tutorial 1: Getting Started with Scalable and Tutorial 2: Mastering the Manifest System.
- pip install scalable[cloud] (installs s3fs, gcsfs, dask-cloudprovider, fsspec).
- AWS credentials configured (~/.aws/credentials or environment variables).
- (For GCP) gcloud CLI authenticated or GOOGLE_APPLICATION_CREDENTIALS set.
Scenario
Your energy forecasting pipeline works locally but needs to scale to 50+ concurrent scenarios for a production run. Your organization uses AWS for burst compute and GCS for long-term data storage. You need to deploy the same workflow to cloud infrastructure with cost visibility.
Step 1: AWS Target Configuration
The AWS provider uses dask-cloudprovider to launch Dask workers on Fargate
(serverless containers) or EC2 instances:
# scalable.yaml
version: 1
project:
  name: demeter-lulcc-aws
  default_storage: s3://${S3_BUCKET}/scalable-runs/
targets:
  aws:
    provider: aws
    region: ${AWS_REGION:-us-east-1}
    cluster_type: fargate
    instance_type: m5.xlarge  # For EC2-backed mode
    worker_cpu: 4096          # Fargate CPU units (1024 = 1 vCPU)
    worker_mem: 16384         # Fargate memory in MiB
    image: ${ECR_IMAGE}
    execution_role_arn: ${EXECUTION_ROLE_ARN}
    task_role_arn: ${TASK_ROLE_ARN}
    subnets:
      - ${SUBNET_A}
      - ${SUBNET_B}
    security_groups:
      - ${SG_ID}
    adaptive:
      minimum: 2
      maximum: 20
components:
  demeter:
    image: ${ECR_DEMETER_IMAGE}
    cpus: 4
    memory: 16G
    tags: [lulcc, downscaling, gcam]
  postprocess:
    cpus: 2
    memory: 8G
    tags: [analysis]
tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir
  aggregate:
    component: postprocess
    cache: true
Key configuration explained:
cluster_type
    fargate for serverless (no EC2 management) or ec2 for instance-backed clusters (lower cost at scale, more control over instance types).
worker_cpu / worker_mem
    Fargate task sizing. CPU is in units of 1024 (= 1 vCPU); memory is in MiB. Common configurations:

    CPU (units)   Memory (MiB)   Use case
    1024          4096           Light tasks, I/O-bound
    4096          16384          Standard compute tasks
    16384         65536          Memory-intensive models

execution_role_arn
    IAM role assumed by the ECS agent to pull images and write logs. Needs ecr:GetAuthorizationToken, ecr:BatchGetImage, and logs:CreateLogStream permissions. The AWS-managed AmazonECSTaskExecutionRolePolicy grants these along with the other ECR and CloudWatch Logs actions the agent uses.
task_role_arn
    IAM role assumed by the running task. Needs S3 read/write for artifacts. The task also needs network access for Dask scheduler communication, which is governed by the subnets and security groups rather than by IAM.
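Fargate accepts only certain CPU and memory pairings, so a mistyped worker_cpu / worker_mem combination fails when the task is registered. The sketch below checks a pair before launch. The helper name check_fargate_size is hypothetical (it is not part of Scalable), and the size table reflects the Fargate task-size limits as commonly documented; confirm it against the current AWS documentation before relying on it.

```python
# Allowed Fargate memory values (MiB) for each CPU value (CPU units).
# Assumption: mirrors the published Fargate task-size table; verify against AWS docs.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def check_fargate_size(worker_cpu: int, worker_mem: int) -> None:
    """Raise ValueError if the CPU/memory pair is not a known Fargate size."""
    allowed = FARGATE_MEMORY_MIB.get(worker_cpu)
    if allowed is None:
        raise ValueError(
            f"worker_cpu={worker_cpu} is not a Fargate CPU size; "
            f"expected one of {sorted(FARGATE_MEMORY_MIB)}"
        )
    if worker_mem not in allowed:
        raise ValueError(
            f"worker_mem={worker_mem} MiB is not valid with worker_cpu={worker_cpu}; "
            f"allowed range is {allowed[0]}..{allowed[-1]} MiB"
        )


if __name__ == "__main__":
    # The three rows from the table above all pass.
    for cpu, mem in [(1024, 4096), (4096, 16384), (16384, 65536)]:
        check_fargate_size(cpu, mem)
    print("all sizes valid")
```

Running a check like this in CI catches sizing mistakes before any cloud resources are requested.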
Step 2: Set Up AWS Infrastructure
Before running, ensure these AWS resources exist:
1. ECR Repository (Container Registry):
aws ecr create-repository --repository-name demeter
# Push your image
docker build -t demeter:2.0.1 .
docker tag demeter:2.0.1 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
aws ecr get-login-password | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
2. VPC + Subnets:
Workers need outbound internet access (for Dask scheduler communication) and access to S3. Use a VPC with NAT Gateway or VPC endpoints.
3. Security Group:
# Allow inbound from scheduler, outbound to internet
aws ec2 create-security-group \
  --group-name scalable-workers \
  --description "Scalable Dask workers"
aws ec2 authorize-security-group-ingress \
  --group-id sg-xyz789 \
  --protocol tcp --port 8786-8787 \
  --source-group sg-xyz789
4. IAM Roles:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
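The policy above lists two resources because s3:ListBucket applies to the bucket ARN itself, while s3:GetObject and s3:PutObject apply to object ARNs (the /* form). A minimal sketch for generating the same document for any bucket name follows; the function name s3_artifact_policy is hypothetical and only the standard library is used.

```python
import json


def s3_artifact_policy(bucket: str) -> dict:
    """Build the task-role policy shown above for a given bucket name."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # bucket ARN, used by ListBucket
                    f"arn:aws:s3:::{bucket}/*",  # object ARNs, used by Get/PutObject
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Write the JSON to stdout so it can be saved and attached to the task role.
    print(json.dumps(s3_artifact_policy("my-bucket"), indent=2))
```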
Step 3: Dry-Run Cost Estimation
Before launching real cloud resources, estimate costs:
scalable run ./scalable.yaml --target aws --dry-run
Dry-run plan for target 'aws' (provider: aws):
  Workers: 10 × gcam (4 vCPU, 16 GiB)
           5 × postprocess (2 vCPU, 8 GiB)
  Estimated duration: 2.5 hours
  Estimated cost: $4.82
    Fargate compute: $3.90
    Data transfer: $0.12
    S3 storage: $0.80
Programmatic cost access:
from scalable import ScalableSession

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
plan = session.plan(dry_run=True)
if plan.cost_estimate:
    print(f"Estimated cost: ${plan.cost_estimate.total:.2f}")
    print(f"  Compute: ${plan.cost_estimate.compute:.2f}")
    print(f"  Storage: ${plan.cost_estimate.storage:.2f}")
    print(f"  Transfer: ${plan.cost_estimate.transfer:.2f}")
How cost estimation works: Scalable uses the
scalable.providers.cloud.cost_tables module, which contains region-specific
pricing for Fargate vCPU-hours, memory-hours, and S3 operations. Estimates are
based on the planned worker count, the predicted task duration (from telemetry
history if available), and the declared storage outputs.
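The compute part of such an estimate reduces to simple arithmetic: workers × hours × (vCPU × vCPU-hour rate + memory × memory-hour rate). The sketch below illustrates that formula only. WorkerGroup and fargate_compute_cost are hypothetical names, and the rates in the usage example are placeholders rather than values from cost_tables or real AWS prices, so the result will not match the sample dry-run output.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WorkerGroup:
    """A set of identically sized workers (hypothetical helper type)."""
    count: int
    vcpu: float
    memory_gib: float


def fargate_compute_cost(
    groups: Iterable[WorkerGroup],
    hours: float,
    vcpu_hour_rate: float,
    memory_hour_rate: float,
) -> float:
    """Sum per-group cost: count * hours * (vcpu * rate + memory * rate)."""
    total = 0.0
    for group in groups:
        per_worker_hour = (
            group.vcpu * vcpu_hour_rate + group.memory_gib * memory_hour_rate
        )
        total += group.count * hours * per_worker_hour
    return total


if __name__ == "__main__":
    plan = [WorkerGroup(10, 4, 16), WorkerGroup(5, 2, 8)]
    # Placeholder rates for illustration only; look up current regional pricing.
    cost = fargate_compute_cost(plan, hours=2.5, vcpu_hour_rate=0.04, memory_hour_rate=0.004)
    print(f"Illustrative compute cost: ${cost:.2f}")
```

Keeping the rates as parameters makes it easy to compare regions or to re-run the arithmetic when prices change.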
Step 4: GCP Target Configuration
For Google Cloud Platform, use GCS for storage and either Cloud Run or GKE for compute:
targets:
  gcp:
    provider: gcp
    region: us-central1
    project_id: ${GCP_PROJECT_ID}
    cluster_type: cloud_run
    worker_cpu: 4
    worker_mem: 16Gi
    image: gcr.io/${GCP_PROJECT_ID}/demeter:2.0.1
    service_account: ${GCP_SERVICE_ACCOUNT}
    adaptive:
      minimum: 1
      maximum: 15
project:
  default_storage: gs://${GCS_BUCKET}/scalable-runs/
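Note the unit difference between the two targets: the AWS target sizes workers in Fargate CPU units and MiB (4096 / 16384), while the GCP target above uses whole vCPUs and a Gi-suffixed string (4 / 16Gi). A small conversion sketch follows, assuming whole-vCPU and whole-GiB sizes; fargate_to_gcp_units is a hypothetical helper, not a Scalable function.

```python
def fargate_to_gcp_units(worker_cpu: int, worker_mem_mib: int) -> tuple[int, str]:
    """Convert Fargate CPU units / MiB to (vCPU count, 'NGi' memory string)."""
    if worker_cpu % 1024 != 0:
        raise ValueError("worker_cpu must be a multiple of 1024 (1024 = 1 vCPU)")
    if worker_mem_mib % 1024 != 0:
        raise ValueError("worker_mem_mib must be a whole number of GiB")
    return worker_cpu // 1024, f"{worker_mem_mib // 1024}Gi"


if __name__ == "__main__":
    cpu, mem = fargate_to_gcp_units(4096, 16384)
    print(f"worker_cpu: {cpu}")
    print(f"worker_mem: {mem}")
```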
GCP-specific setup:
# Authenticate
gcloud auth application-default login
# Push image to GCR
gcloud builds submit --tag gcr.io/my-project/demeter:2.0.1 .
# Create GCS bucket for artifacts
gsutil mb -l us-central1 gs://my-bucket/
Step 5: Artifact Store — Cloud Storage
The artifact store provides a unified interface for persisting outputs across storage backends:
from scalable.artifacts import build_artifact_store
# Local storage (default)
local_store = build_artifact_store("./artifacts")
# S3 storage
s3_store = build_artifact_store("s3://my-bucket/artifacts/")
# GCS storage
gcs_store = build_artifact_store("gs://my-bucket/artifacts/")
# Store a file
ref = s3_store.put("local/output.csv", "runs/run-001/output.csv")
print(ref)
# ArtifactRef(uri='s3://my-bucket/artifacts/runs/run-001/output.csv')
# Retrieve a file
local_path = s3_store.get("runs/run-001/output.csv", "./downloads/output.csv")
The store is protocol-aware via fsspec: it detects the URI scheme and uses
the appropriate backend (s3fs for S3, gcsfs for GCS, local filesystem
for paths).
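The scheme detection described above can be pictured as a lookup on the URI prefix. The sketch below uses only the standard library and returns the backend name as a string; backend_for and the BACKENDS table are hypothetical illustrations, not the actual logic inside build_artifact_store.

```python
from urllib.parse import urlsplit

# Assumed mapping from URI scheme to backend name; an empty scheme means a plain path.
BACKENDS = {
    "s3": "s3fs",
    "gs": "gcsfs",
    "file": "local",
    "": "local",
}


def backend_for(uri: str) -> str:
    """Return the backend name implied by the URI scheme."""
    scheme = urlsplit(uri).scheme.lower()
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"unsupported storage scheme: {scheme!r}") from None


if __name__ == "__main__":
    for uri in ("./artifacts", "s3://my-bucket/artifacts/", "gs://my-bucket/artifacts/"):
        print(uri, "->", backend_for(uri))
```

In practice fsspec performs this resolution itself (for example through fsspec.core.url_to_fs), so a hand-written table like this is useful mainly for validating configuration early. Windows drive-letter paths would need extra handling, since the drive letter parses as a scheme.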
Integration with workflow output:
from scalable import ScalableSession
from scalable.artifacts import build_artifact_store

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
client = session.start()

# Run simulation (run_demeter_scenario and scenario_params come from your workflow code)
result = client.submit(run_demeter_scenario, scenario_params, tag="demeter").result()

# Persist output artifact to configured storage
store = build_artifact_store(session.manifest.project.default_storage)
ref = store.put(
    result["output_path"],
    f"runs/{session._telemetry.run_id}/gcam-output.tar.gz",
)
print(f"Artifact persisted: {ref.uri}")
Step 6: Multi-Region Deployment
For global workflows, define targets in multiple regions:
targets:
  aws-east:
    provider: aws
    region: us-east-1
    # ... config ...
    adaptive:
      minimum: 5
      maximum: 50
  aws-west:
    provider: aws
    region: us-west-2
    # ... config ...
    adaptive:
      minimum: 2
      maximum: 20
  gcp-europe:
    provider: gcp
    region: europe-west1
    # ... config ...
Select at runtime:
# Heavy production run in us-east-1
scalable run ./scalable.yaml --target aws-east --workflow pipeline.py
# Quick validation in us-west-2
scalable run ./scalable.yaml --target aws-west --workflow pipeline.py --dry-run
Step 7: Cloud + Cache Integration
Combine cloud execution with remote caching so repeated runs across different machines share results:
export SCALABLE_CACHE_REMOTE=s3://my-bucket/scalable-cache/
project:
  name: demeter-lulcc
  default_storage: s3://my-bucket/outputs/
Now:
- The first cloud run computes all scenarios and caches results to S3.
- Subsequent runs (from any machine) hit the shared cache.
- Only modified scenarios recompute.
This is particularly powerful for CI/CD: your PR validation pipeline benefits from the cache populated by previous runs.
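Why only modified scenarios recompute can be illustrated with a content-derived cache key: identical inputs hash to the same key on any machine, and any changed parameter produces a new key. This tutorial does not describe how Scalable derives its keys, so the sketch below is a generic illustration of the idea; cache_key and its inputs are hypothetical.

```python
import hashlib
import json


def cache_key(task_name: str, params: dict, image: str) -> str:
    """Derive a deterministic key from the task name, its parameters, and the image."""
    payload = json.dumps(
        {"task": task_name, "params": params, "image": image},
        sort_keys=True,           # key order must not affect the hash
        separators=(",", ":"),    # fixed separators keep the encoding stable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    base = cache_key("run_demeter_scenario", {"scenario": "ssp2", "year": 2050}, "demeter:2.0.1")
    same = cache_key("run_demeter_scenario", {"year": 2050, "scenario": "ssp2"}, "demeter:2.0.1")
    changed = cache_key("run_demeter_scenario", {"scenario": "ssp2", "year": 2060}, "demeter:2.0.1")
    print("reordered params hit the cache:", base == same)
    print("changed params miss the cache:", base != changed)
```

Including the image tag in the key is one way to force recomputation when the model container changes.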
Step 8: Environment Variable Template
For production deployments, maintain a .env template:
# .env.cloud (do not commit secrets — use secrets manager)
AWS_REGION=us-east-1
S3_BUCKET=demeter-prod-artifacts
ECR_IMAGE=123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
ECR_DEMETER_IMAGE=123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
EXECUTION_ROLE_ARN=arn:aws:iam::123456789:role/ecsTaskExecutionRole
TASK_ROLE_ARN=arn:aws:iam::123456789:role/scalableTaskRole
SUBNET_A=subnet-abc123
SUBNET_B=subnet-def456
SG_ID=sg-xyz789
SCALABLE_CACHE_REMOTE=s3://demeter-prod-artifacts/cache/
Load before running:
set -a && source .env.cloud && set +a
scalable run ./scalable.yaml --target aws --workflow pipeline.py
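A forgotten variable in .env.cloud usually surfaces late, as an empty image URI or subnet ID. The sketch below scans manifest text for ${VAR} and ${VAR:-default} placeholders and reports the variables that have no default and no value in the environment. It assumes shell-style placeholder syntax as used in this tutorial's manifest; missing_variables is a hypothetical helper and does not reproduce Scalable's own interpolation.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} and ${NAME:-default}; group 2 is non-empty when a default is given.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(:-[^}]*)?\}")


def missing_variables(
    manifest_text: str, env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return sorted names of placeholders with no default and no value in env."""
    env = os.environ if env is None else env
    missing = set()
    for name, default in _PLACEHOLDER.findall(manifest_text):
        if not default and not env.get(name):
            missing.add(name)
    return sorted(missing)


if __name__ == "__main__":
    sample = (
        "region: ${AWS_REGION:-us-east-1}\n"
        "image: ${ECR_IMAGE}\n"
        "subnets: [${SUBNET_A}, ${SUBNET_B}]\n"
    )
    # Pass an explicit mapping here; omit it to check the real environment.
    print(missing_variables(sample, {"ECR_IMAGE": "example/image:tag"}))
```

Run it against the manifest file contents after sourcing .env.cloud and fail the deployment script if the returned list is non-empty.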
Troubleshooting
- “botocore.exceptions.NoCredentialsError”
  AWS credentials are not configured. Run aws configure or set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. For EC2/ECS, ensure the instance profile or task role has the necessary permissions.
- Fargate task fails with “CannotPullContainerError”
  The execution role lacks ECR permissions, the image URI is wrong, or the image doesn’t exist in the specified region. Verify with: aws ecr describe-images --repository-name demeter
- Workers can’t connect to scheduler
  The security group must allow inbound TCP on the Dask scheduler port (8786) from the worker security group. Subnets must have a route to the scheduler host (typically your local machine or a bastion).
- GCS “403 Forbidden”
  The service account lacks the storage.objects.create permission on the bucket. Grant the roles/storage.objectAdmin role.
- Cost estimate shows $0.00
  Cost tables may not have pricing for your specific region or instance type. Check that scalable.providers.cloud.cost_tables includes your region.
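For the "workers can't connect to scheduler" case, a plain TCP probe run from inside the worker subnet (for example on a bastion host) separates network problems from Dask problems. The sketch below uses only the standard library; can_connect is a hypothetical helper and the host in the usage example is a placeholder address.

```python
import socket


def can_connect(host: str, port: int = 8786, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    scheduler_host = "10.0.0.10"  # placeholder; use your scheduler's address
    reachable = can_connect(scheduler_host, 8786)
    print(f"scheduler reachable on 8786: {reachable}")
```

If the probe fails, check the security group ingress rule and subnet routing before looking at the Dask configuration.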
Next Steps
Tutorial 6: Monitoring and Observability with Telemetry — Monitor cloud run costs and performance through telemetry.
Tutorial 8: Deployment Workflows with Kubernetes — Deploy to Kubernetes for container-native orchestration.
Tutorial 7: Error Handling and Resilience Patterns — Handle cloud-specific transient failures (timeouts, preemption).