Tutorial 8: Deployment Workflows with Kubernetes

What You Will Learn

By the end of this tutorial you will:

  • Deploy Scalable workflows on Kubernetes using the Dask Kubernetes Operator.

  • Configure namespace isolation, resource quotas, and pod specifications.

  • Use the Kubernetes provider with adaptive scaling.

  • Manage container images and pull secrets.

  • Combine Kubernetes with overlays for multi-environment deployments.

  • Handle pod evictions and node failures.

Prerequisites

Scenario

Your organization runs a shared Kubernetes cluster for all scientific workloads. You need to deploy the energy forecasting pipeline as a Dask cluster within your team’s namespace, with resource quotas enforced by platform engineering. The deployment must support both development (small, fast iterations) and production (large-scale, fault-tolerant) modes.

Step 1: Install the Dask Kubernetes Operator

The Dask Kubernetes Operator manages DaskCluster custom resources in your cluster:

# Install the operator (cluster-admin required, one-time setup)
helm repo add dask https://helm.dask.org
helm repo update
helm install dask-operator dask/dask-kubernetes-operator \
  --namespace dask-operator --create-namespace

Verify the operator is running:

kubectl get pods -n dask-operator
NAME                             READY   STATUS    RESTARTS   AGE
dask-operator-7f8b6d5c4-x2j9k   1/1     Running   0          2m

Step 2: Configure the Kubernetes Target

# scalable.yaml
version: 1
project:
  name: demeter-lulcc-k8s
  default_storage: gs://${GCS_BUCKET}/scalable-runs/

targets:
  k8s-dev:
    provider: kubernetes
    namespace: demeter-dev
    image: gcr.io/${GCP_PROJECT}/demeter:${IMAGE_TAG:-2.0.1}
    adaptive:
      minimum: 1
      maximum: 5
    overlay: k8s-dev-resources

  k8s-prod:
    provider: kubernetes
    namespace: demeter-prod
    image: gcr.io/${GCP_PROJECT}/demeter:${IMAGE_TAG}
    adaptive:
      minimum: 4
      maximum: 40
    overlay: k8s-prod-resources

components:
  demeter:
    image: gcr.io/${GCP_PROJECT}/demeter:2.0.1
    cpus: 8
    memory: 32G
    tags: [lulcc, downscaling, gcam]
    env:
      DEMETER_DATA: /data

  postprocess:
    image: gcr.io/${GCP_PROJECT}/postprocess:latest
    cpus: 4
    memory: 16G
    tags: [analysis]

tasks:
  run_demeter_scenario:
    component: demeter
    cache: true
    outputs:
      database: dir

  aggregate:
    component: postprocess
    cache: true

overlays:
  k8s-dev-resources:
    components:
      demeter:
        cpus: 2
        memory: 8G
      postprocess:
        cpus: 1
        memory: 4G

  k8s-prod-resources:
    components:
      demeter:
        cpus: 16
        memory: 64G
      postprocess:
        cpus: 8
        memory: 32G

Step 3: Namespace Setup

Create isolated namespaces for development and production:

# Development namespace
kubectl create namespace demeter-dev
kubectl label namespace demeter-dev team=energy env=dev

# Production namespace
kubectl create namespace demeter-prod
kubectl label namespace demeter-prod team=energy env=prod

Apply resource quotas to prevent runaway usage:

# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: demeter-lulcc-quota
  namespace: demeter-prod
spec:
  hard:
    requests.cpu: "160"
    requests.memory: "640Gi"
    limits.cpu: "200"
    limits.memory: "800Gi"
    pods: "50"
kubectl apply -f resource-quota.yaml

Step 4: Image Pull Secrets

If your container registry requires authentication:

# For GCR (Google Container Registry)
kubectl create secret docker-registry gcr-secret \
  --docker-server=gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat service-account-key.json)" \
  --namespace demeter-prod

# For ECR (AWS Elastic Container Registry)
kubectl create secret docker-registry ecr-secret \
  --docker-server=123456789.dkr.ecr.us-east-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password)" \
  --namespace demeter-prod

The Kubernetes provider automatically attaches these secrets to worker pods when the image URI matches the registry.

Step 5: Run a Development Workflow

export GCP_PROJECT=my-gcp-project
export GCS_BUCKET=demeter-artifacts
export IMAGE_TAG=dev-$(git rev-parse --short HEAD)

# Validate
scalable validate ./scalable.yaml

# Plan (shows pod resource requests)
scalable plan ./scalable.yaml --target k8s-dev --dry-run
Plan created for target 'k8s-dev' (provider: kubernetes)
Namespace: demeter-dev
Workers:
  demeter: 2 pods (2 cpu, 8G memory)
  postprocess: 1 pod (1 cpu, 4G memory)
Adaptive: min=1, max=5

Run the workflow:

from scalable import ScalableSession

session = ScalableSession.from_yaml("./scalable.yaml", target="k8s-dev")
plan = session.plan(dry_run=True)
client = session.start(plan)

# Submit tasks — they run in Kubernetes pods
futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in range(5)]
results = client.gather(futures)

session.close()

What happens under the hood:

  1. The KubernetesProvider creates a DaskCluster custom resource in the demeter-dev namespace.

  2. The Dask Kubernetes Operator provisions scheduler and worker pods.

  3. Worker pods are labeled with component tags for affinity scheduling.

  4. The adaptive scaler monitors task backlog and scales pods up/down within the configured bounds.

  5. On session.close(), the DaskCluster resource is deleted, cleaning up all pods.

Step 6: Monitor Pods and Scaling

Watch Kubernetes events in real-time:

# Watch pods in the namespace
kubectl get pods -n demeter-dev -w
NAME                                READY   STATUS    RESTARTS   AGE
dask-scheduler-demeter-dev-0         1/1     Running   0          30s
dask-worker-demeter-0              1/1     Running   0          25s
dask-worker-demeter-1              1/1     Running   0          25s
dask-worker-postprocess-0           1/1     Running   0          25s

# Scale-up event
dask-worker-demeter-2              0/1     Pending   0          0s
dask-worker-demeter-2              1/1     Running   0          15s

Check the Dask dashboard (port-forward the scheduler):

kubectl port-forward -n demeter-dev svc/dask-scheduler-demeter-dev 8787:8787
# Open http://localhost:8787 in your browser

Step 7: Production Deployment

For production, ensure high availability and fault tolerance:

export IMAGE_TAG=v2.1.0  # Pinned release tag
scalable run ./scalable.yaml --target k8s-prod --workflow pipeline.py

Production considerations:

Pod disruption budgets — Prevent too many workers from being evicted simultaneously:

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dask-workers-pdb
  namespace: demeter-prod
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: dask-worker
kubectl apply -f pdb.yaml

Priority classes — Ensure your workload gets scheduled before lower-priority jobs:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: demeter-production
value: 1000
globalDefault: false
description: "Priority for production energy forecasting runs"

Step 8: Handling Pod Evictions

Kubernetes may evict pods due to resource pressure or node maintenance. Scalable’s error handling (see Tutorial 7: Error Handling and Resilience Patterns) catches these as KilledWorker exceptions:

from distributed import as_completed

session = ScalableSession.from_yaml("./scalable.yaml", target="k8s-prod")
client = session.start()

futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in range(200)]

results = []
retry_queue = []

for future in as_completed(futures):
    try:
        results.append(future.result())
    except Exception as e:
        if "KilledWorker" in str(type(e).__name__):
            # Pod was evicted — retry
            scenario_id = future.key.split("-")[-1]
            retry_queue.append(scenario_id)
        else:
            print(f"Permanent failure: {e}")

# Retry evicted tasks
if retry_queue:
    print(f"Retrying {len(retry_queue)} evicted tasks...")
    retry_futures = [
        client.submit(run_demeter_scenario, s, tag="demeter") for s in retry_queue
    ]
    retry_results = client.gather(retry_futures)
    results.extend(retry_results)

session.close()

Step 9: CI/CD Integration

Automate Kubernetes deployments from your CI pipeline:

# .github/workflows/scalable-prod.yaml
name: Production Pipeline Run
on:
  workflow_dispatch:
    inputs:
      scenarios:
        description: "Number of scenarios"
        default: "200"

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - uses: google-github-actions/get-gke-credentials@v2
        with:
          cluster_name: energy-cluster
          location: us-central1

      - name: Install Scalable
        run: pip install scalable[kubernetes,cloud]

      - name: Run Pipeline
        env:
          GCP_PROJECT: ${{ vars.GCP_PROJECT }}
          GCS_BUCKET: ${{ vars.GCS_BUCKET }}
          IMAGE_TAG: ${{ github.sha }}
          SCALABLE_TARGET: k8s-prod
        run: |
          scalable validate ./scalable.yaml
          scalable run ./scalable.yaml --target k8s-prod --workflow pipeline.py

Step 10: Local Development with minikube

For local Kubernetes development without a cloud cluster:

# Start minikube
minikube start --cpus=4 --memory=8192

# Install Dask operator
helm install dask-operator dask/dask-kubernetes-operator

# Build and load image locally
docker build -t demeter:local -f capabilities/demeter/Dockerfile.scalable capabilities/demeter
minikube image load demeter:local

# Use local image in manifest
export IMAGE_TAG=local
scalable run ./scalable.yaml --target k8s-dev --workflow workflow.py

This gives you a realistic Kubernetes environment for testing pod scheduling, resource limits, and failure modes before deploying to production.

Troubleshooting

Pods stuck in “Pending” state

Check resource availability: kubectl describe pod <pod-name> -n <ns>. Common causes: insufficient cluster capacity, resource quota exceeded, or node selector constraints not met.

“ImagePullBackOff” error

The image URI is wrong or the pull secret is missing/expired. Verify: kubectl get secret -n <ns> and check image URI spelling.

Workers fail to connect to scheduler

Ensure network policies allow pod-to-pod communication within the namespace. The scheduler service must be reachable on port 8786.

Adaptive scaling not working

Verify the Dask Kubernetes Operator is running and the DaskCluster resource has adaptive section configured. Check operator logs: kubectl logs -n dask-operator deployment/dask-operator.

Resource quota prevents scaling

If adaptive.maximum exceeds what the quota allows, pods will stay pending. Set maximum to a value within your quota limits.

Next Steps