.. _tutorial_cloud_integration:

======================================================
Tutorial 5: Cloud Integration with AWS and GCP
======================================================

What You Will Learn
-------------------

By the end of this tutorial you will:

* Configure AWS Fargate and EC2-backed Dask clusters via Scalable.
* Set up GCP Cloud Run / GKE-based execution.
* Use the artifact store for persistent cloud storage (S3, GCS).
* Estimate costs before running with dry-run planning.
* Deploy multi-target manifests that promote from local to cloud.
* Manage IAM roles, networking, and container registries.

Prerequisites
-------------

* Completed :ref:`tutorial_getting_started` and :ref:`tutorial_manifest_system`.
* ``pip install scalable[cloud]`` (installs ``s3fs``, ``gcsfs``,
  ``dask-cloudprovider``, ``fsspec``).
* AWS credentials configured (``~/.aws/credentials`` or environment variables).
* (For GCP) ``gcloud`` CLI authenticated or ``GOOGLE_APPLICATION_CREDENTIALS`` set.

Scenario
--------

Your energy forecasting pipeline works locally but needs to scale to 50+
concurrent scenarios for a production run. Your organization uses AWS for burst
compute and GCS for long-term data storage. You need to deploy the same
workflow to cloud infrastructure with cost visibility.

Step 1: AWS Target Configuration
----------------------------------

The AWS provider uses ``dask-cloudprovider`` to launch Dask workers on Fargate
(serverless containers) or EC2 instances:

.. code-block:: yaml

   # scalable.yaml
   version: 1
   project:
     name: demeter-lulcc-aws
     default_storage: s3://${S3_BUCKET}/scalable-runs/

   targets:
     aws:
       provider: aws
       region: ${AWS_REGION:-us-east-1}
       cluster_type: fargate
       instance_type: m5.xlarge       # For EC2-backed mode
       worker_cpu: 4096               # Fargate CPU units (1024 = 1 vCPU)
       worker_mem: 16384              # Fargate memory in MiB
       image: ${ECR_IMAGE}
       execution_role_arn: ${EXECUTION_ROLE_ARN}
       task_role_arn: ${TASK_ROLE_ARN}
       subnets:
         - ${SUBNET_A}
         - ${SUBNET_B}
       security_groups:
         - ${SG_ID}
       adaptive:
         minimum: 2
         maximum: 20

   components:
     demeter:
       image: ${ECR_DEMETER_IMAGE}
       cpus: 4
       memory: 16G
       tags: [lulcc, downscaling, gcam]

     postprocess:
       cpus: 2
       memory: 8G
       tags: [analysis]

   tasks:
     run_demeter_scenario:
       component: demeter
       cache: true
       outputs:
         database: dir

     aggregate:
       component: postprocess
       cache: true

**Key configuration explained:**

``cluster_type``
  ``fargate`` for serverless (no EC2 management) or ``ec2`` for instance-backed
  clusters (lower cost at scale, more control over instance types).

``worker_cpu`` / ``worker_mem``
  Fargate task sizing. CPU is in units of 1024 (= 1 vCPU). Common
  configurations:

  .. list-table::
     :header-rows: 1
     :widths: 20 20 60

     * - CPU
       - Memory
       - Use Case
     * - 1024
       - 4096
       - Light tasks, I/O-bound
     * - 4096
       - 16384
       - Standard compute tasks
     * - 16384
       - 65536
       - Memory-intensive models

``execution_role_arn``
  IAM role assumed by the ECS agent to pull images and write logs. Needs
  ``ecr:GetAuthorizationToken``, ``ecr:BatchGetImage``, ``logs:CreateLogStream``
  permissions.

``task_role_arn``
  IAM role assumed by the running task. Needs S3 read/write for artifacts,
  network access for Dask scheduler communication.

Step 2: Set Up AWS Infrastructure
-----------------------------------

Before running, ensure these AWS resources exist:

**1. ECR Repository (Container Registry):**

.. code-block:: bash

   aws ecr create-repository --repository-name demeter
   # Push your image
   docker build -t demeter:2.0.1 .
   docker tag demeter:2.0.1 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
   aws ecr get-login-password | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
   docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1

**2. VPC + Subnets:**

Workers need outbound internet access (for Dask scheduler communication) and
access to S3. Use a VPC with NAT Gateway or VPC endpoints.

**3. Security Group:**

.. code-block:: bash

   # Allow inbound from scheduler, outbound to internet
   aws ec2 create-security-group \
     --group-name scalable-workers \
     --description "Scalable Dask workers"
   aws ec2 authorize-security-group-ingress \
     --group-id sg-xyz789 \
     --protocol tcp --port 8786-8787 \
     --source-group sg-xyz789

**4. IAM Roles:**

.. code-block:: json

   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "s3:GetObject",
           "s3:PutObject",
           "s3:ListBucket"
         ],
         "Resource": [
           "arn:aws:s3:::my-bucket",
           "arn:aws:s3:::my-bucket/*"
         ]
       }
     ]
   }

Step 3: Dry-Run Cost Estimation
--------------------------------

Before launching real cloud resources, estimate costs:

.. code-block:: bash

   scalable run ./scalable.yaml --target aws --dry-run

.. code-block:: text

   Dry-run plan for target 'aws' (provider: aws):
     Workers: 10 × gcam (4 vCPU, 16 GiB)
              5 × postprocess (2 vCPU, 8 GiB)
     Estimated duration: 2.5 hours
     Estimated cost: $4.82
       Fargate compute: $3.90
       Data transfer: $0.12
       S3 storage: $0.80

Programmatic cost access:

.. code-block:: python

   from scalable import ScalableSession

   session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
   plan = session.plan(dry_run=True)

   if plan.cost_estimate:
       print(f"Estimated cost: ${plan.cost_estimate.total:.2f}")
       print(f"  Compute: ${plan.cost_estimate.compute:.2f}")
       print(f"  Storage: ${plan.cost_estimate.storage:.2f}")
       print(f"  Transfer: ${plan.cost_estimate.transfer:.2f}")

**How cost estimation works:** Scalable uses the
:mod:`scalable.providers.cloud.cost_tables` module which contains region-specific
pricing for Fargate vCPU-hours, memory-hours, and S3 operations. Estimates are
based on the planned worker count, predicted task duration (from telemetry
history if available), and declared storage outputs.

Step 4: GCP Target Configuration
----------------------------------

For Google Cloud Platform, use GCS for storage and either Cloud Run or GKE for
compute:

.. code-block:: yaml

   targets:
     gcp:
       provider: gcp
       region: us-central1
       project_id: ${GCP_PROJECT_ID}
       cluster_type: cloud_run
       worker_cpu: 4
       worker_mem: 16Gi
       image: gcr.io/${GCP_PROJECT_ID}/demeter:2.0.1
       service_account: ${GCP_SERVICE_ACCOUNT}
       adaptive:
         minimum: 1
         maximum: 15

   project:
     default_storage: gs://${GCS_BUCKET}/scalable-runs/

GCP-specific setup:

.. code-block:: bash

   # Authenticate
   gcloud auth application-default login

   # Push image to GCR
   gcloud builds submit --tag gcr.io/my-project/demeter:2.0.1 .

   # Create GCS bucket for artifacts
   gsutil mb -l us-central1 gs://my-bucket/

Step 5: Artifact Store — Cloud Storage
----------------------------------------

The artifact store provides a unified interface for persisting outputs across
storage backends:

.. code-block:: python

   from scalable.artifacts import build_artifact_store

   # Local storage (default)
   local_store = build_artifact_store("./artifacts")

   # S3 storage
   s3_store = build_artifact_store("s3://my-bucket/artifacts/")

   # GCS storage
   gcs_store = build_artifact_store("gs://my-bucket/artifacts/")

   # Store a file
   ref = s3_store.put("local/output.csv", "runs/run-001/output.csv")
   print(ref)
   # ArtifactRef(uri='s3://my-bucket/artifacts/runs/run-001/output.csv')

   # Retrieve a file
   local_path = s3_store.get("runs/run-001/output.csv", "./downloads/output.csv")

The store is protocol-aware via ``fsspec``: it detects the URI scheme and uses
the appropriate backend (``s3fs`` for S3, ``gcsfs`` for GCS, local filesystem
for paths).

**Integration with workflow output:**

.. code-block:: python

   from scalable import ScalableSession
   from scalable.artifacts import build_artifact_store

   session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
   client = session.start()

   # Run simulation
   result = client.submit(run_demeter_scenario, scenario_params, tag="demeter").result()

   # Persist output artifact to configured storage
   store = build_artifact_store(session.manifest.project.default_storage)
   ref = store.put(
       result["output_path"],
       f"runs/{session._telemetry.run_id}/gcam-output.tar.gz",
   )
   print(f"Artifact persisted: {ref.uri}")

Step 6: Multi-Region Deployment
---------------------------------

For global workflows, define targets in multiple regions:

.. code-block:: yaml

   targets:
     aws-east:
       provider: aws
       region: us-east-1
       # ... config ...
       adaptive:
         minimum: 5
         maximum: 50

     aws-west:
       provider: aws
       region: us-west-2
       # ... config ...
       adaptive:
         minimum: 2
         maximum: 20

     gcp-europe:
       provider: gcp
       region: europe-west1
       # ... config ...

Select at runtime:

.. code-block:: bash

   # Heavy production run in us-east-1
   scalable run ./scalable.yaml --target aws-east --workflow pipeline.py

   # Quick validation in us-west-2
   scalable run ./scalable.yaml --target aws-west --workflow pipeline.py --dry-run

Step 7: Cloud + Cache Integration
-----------------------------------

Combine cloud execution with remote caching so repeated runs across different
machines share results:

.. code-block:: bash

   export SCALABLE_CACHE_REMOTE=s3://my-bucket/scalable-cache/

.. code-block:: yaml

   project:
     name: demeter-lulcc
     default_storage: s3://my-bucket/outputs/

Now:

1. First cloud run computes all scenarios and caches results to S3.
2. Subsequent runs (from any machine) hit the shared cache.
3. Only modified scenarios recompute.

This is particularly powerful for CI/CD: your PR validation pipeline benefits
from the cache populated by previous runs.

Step 8: Environment Variable Template
---------------------------------------

For production deployments, maintain a ``.env`` template:

.. code-block:: bash

   # .env.cloud (do not commit secrets — use secrets manager)
   AWS_REGION=us-east-1
   S3_BUCKET=demeter-prod-artifacts
   ECR_IMAGE=123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
   ECR_DEMETER_IMAGE=123456789.dkr.ecr.us-east-1.amazonaws.com/demeter:2.0.1
   EXECUTION_ROLE_ARN=arn:aws:iam::123456789:role/ecsTaskExecutionRole
   TASK_ROLE_ARN=arn:aws:iam::123456789:role/scalableTaskRole
   SUBNET_A=subnet-abc123
   SUBNET_B=subnet-def456
   SG_ID=sg-xyz789
   SCALABLE_CACHE_REMOTE=s3://demeter-prod-artifacts/cache/

Load before running:

.. code-block:: bash

   set -a && source .env.cloud && set +a
   scalable run ./scalable.yaml --target aws --workflow pipeline.py

Troubleshooting
---------------

**"botocore.exceptions.NoCredentialsError"**
  AWS credentials are not configured. Run ``aws configure`` or set
  ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY`` environment variables.
  For EC2/ECS, ensure the instance profile or task role has necessary
  permissions.

**Fargate task fails with "CannotPullContainerError"**
  The execution role lacks ECR permissions, the image URI is wrong, or the
  image doesn't exist in the specified region. Verify with:
  ``aws ecr describe-images --repository-name demeter``.

**Workers can't connect to scheduler**
  Security group must allow inbound TCP on the Dask scheduler port (8786)
  from the worker security group. Subnets must have a route to the scheduler
  host (typically your local machine or a bastion).

**GCS "403 Forbidden"**
  The service account lacks ``storage.objects.create`` permission on the
  bucket. Grant the ``roles/storage.objectAdmin`` role.

**Cost estimate shows $0.00**
  Cost tables may not have pricing for your specific region or instance type.
  Check that ``scalable.providers.cloud.cost_tables`` includes your region.

Next Steps
----------

* :ref:`tutorial_telemetry` — Monitor cloud run costs and performance
  through telemetry.
* :ref:`tutorial_kubernetes` — Deploy to Kubernetes for container-native
  orchestration.
* :ref:`tutorial_error_handling` — Handle cloud-specific transient failures
  (timeouts, preemption).