Tutorial 8: Deployment Workflows with Kubernetes¶
What You Will Learn¶
By the end of this tutorial you will:
Deploy Scalable workflows on Kubernetes using the Dask Kubernetes Operator.
Configure namespace isolation, resource quotas, and pod specifications.
Use the Kubernetes provider with adaptive scaling.
Manage container images and pull secrets.
Combine Kubernetes with overlays for multi-environment deployments.
Handle pod evictions and node failures.
Prerequisites¶
Completed Tutorial 1: Getting Started with Scalable, Tutorial 2: Mastering the Manifest System, and Tutorial 3: Scaling Strategies with Providers.
pip install scalable[kubernetes](installsdask-kubernetes,kubernetes).Access to a Kubernetes cluster (local
minikube/kindfor development, or a managed cluster like GKE/EKS/AKS for production).kubectlconfigured with cluster access.
Scenario¶
Your organization runs a shared Kubernetes cluster for all scientific workloads. You need to deploy the energy forecasting pipeline as a Dask cluster within your team’s namespace, with resource quotas enforced by platform engineering. The deployment must support both development (small, fast iterations) and production (large-scale, fault-tolerant) modes.
Step 1: Install the Dask Kubernetes Operator¶
The Dask Kubernetes Operator manages DaskCluster custom resources in your cluster:
# Install the operator (cluster-admin required, one-time setup)
helm repo add dask https://helm.dask.org
helm repo update
helm install dask-operator dask/dask-kubernetes-operator \
--namespace dask-operator --create-namespace
Verify the operator is running:
kubectl get pods -n dask-operator
NAME READY STATUS RESTARTS AGE
dask-operator-7f8b6d5c4-x2j9k 1/1 Running 0 2m
Step 2: Configure the Kubernetes Target¶
# scalable.yaml
version: 1
project:
name: demeter-lulcc-k8s
default_storage: gs://${GCS_BUCKET}/scalable-runs/
targets:
k8s-dev:
provider: kubernetes
namespace: demeter-dev
image: gcr.io/${GCP_PROJECT}/demeter:${IMAGE_TAG:-2.0.1}
adaptive:
minimum: 1
maximum: 5
overlay: k8s-dev-resources
k8s-prod:
provider: kubernetes
namespace: demeter-prod
image: gcr.io/${GCP_PROJECT}/demeter:${IMAGE_TAG}
adaptive:
minimum: 4
maximum: 40
overlay: k8s-prod-resources
components:
demeter:
image: gcr.io/${GCP_PROJECT}/demeter:2.0.1
cpus: 8
memory: 32G
tags: [lulcc, downscaling, gcam]
env:
DEMETER_DATA: /data
postprocess:
image: gcr.io/${GCP_PROJECT}/postprocess:latest
cpus: 4
memory: 16G
tags: [analysis]
tasks:
run_demeter_scenario:
component: demeter
cache: true
outputs:
database: dir
aggregate:
component: postprocess
cache: true
overlays:
k8s-dev-resources:
components:
demeter:
cpus: 2
memory: 8G
postprocess:
cpus: 1
memory: 4G
k8s-prod-resources:
components:
demeter:
cpus: 16
memory: 64G
postprocess:
cpus: 8
memory: 32G
Step 3: Namespace Setup¶
Create isolated namespaces for development and production:
# Development namespace
kubectl create namespace demeter-dev
kubectl label namespace demeter-dev team=energy env=dev
# Production namespace
kubectl create namespace demeter-prod
kubectl label namespace demeter-prod team=energy env=prod
Apply resource quotas to prevent runaway usage:
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: demeter-lulcc-quota
namespace: demeter-prod
spec:
hard:
requests.cpu: "160"
requests.memory: "640Gi"
limits.cpu: "200"
limits.memory: "800Gi"
pods: "50"
kubectl apply -f resource-quota.yaml
Step 4: Image Pull Secrets¶
If your container registry requires authentication:
# For GCR (Google Container Registry)
kubectl create secret docker-registry gcr-secret \
--docker-server=gcr.io \
--docker-username=_json_key \
--docker-password="$(cat service-account-key.json)" \
--namespace demeter-prod
# For ECR (AWS Elastic Container Registry)
kubectl create secret docker-registry ecr-secret \
--docker-server=123456789.dkr.ecr.us-east-1.amazonaws.com \
--docker-username=AWS \
--docker-password="$(aws ecr get-login-password)" \
--namespace demeter-prod
The Kubernetes provider automatically attaches these secrets to worker pods when the image URI matches the registry.
Step 5: Run a Development Workflow¶
export GCP_PROJECT=my-gcp-project
export GCS_BUCKET=demeter-artifacts
export IMAGE_TAG=dev-$(git rev-parse --short HEAD)
# Validate
scalable validate ./scalable.yaml
# Plan (shows pod resource requests)
scalable plan ./scalable.yaml --target k8s-dev --dry-run
Plan created for target 'k8s-dev' (provider: kubernetes)
Namespace: demeter-dev
Workers:
demeter: 2 pods (2 cpu, 8G memory)
postprocess: 1 pod (1 cpu, 4G memory)
Adaptive: min=1, max=5
Run the workflow:
from scalable import ScalableSession
session = ScalableSession.from_yaml("./scalable.yaml", target="k8s-dev")
plan = session.plan(dry_run=True)
client = session.start(plan)
# Submit tasks — they run in Kubernetes pods
futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in range(5)]
results = client.gather(futures)
session.close()
What happens under the hood:
The
KubernetesProvidercreates aDaskClustercustom resource in thedemeter-devnamespace.The Dask Kubernetes Operator provisions scheduler and worker pods.
Worker pods are labeled with component tags for affinity scheduling.
The adaptive scaler monitors task backlog and scales pods up/down within the configured bounds.
On
session.close(), theDaskClusterresource is deleted, cleaning up all pods.
Step 6: Monitor Pods and Scaling¶
Watch Kubernetes events in real-time:
# Watch pods in the namespace
kubectl get pods -n demeter-dev -w
NAME READY STATUS RESTARTS AGE
dask-scheduler-demeter-dev-0 1/1 Running 0 30s
dask-worker-demeter-0 1/1 Running 0 25s
dask-worker-demeter-1 1/1 Running 0 25s
dask-worker-postprocess-0 1/1 Running 0 25s
# Scale-up event
dask-worker-demeter-2 0/1 Pending 0 0s
dask-worker-demeter-2 1/1 Running 0 15s
Check the Dask dashboard (port-forward the scheduler):
kubectl port-forward -n demeter-dev svc/dask-scheduler-demeter-dev 8787:8787
# Open http://localhost:8787 in your browser
Step 7: Production Deployment¶
For production, ensure high availability and fault tolerance:
export IMAGE_TAG=v2.1.0 # Pinned release tag
scalable run ./scalable.yaml --target k8s-prod --workflow pipeline.py
Production considerations:
Pod disruption budgets — Prevent too many workers from being evicted simultaneously:
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: dask-workers-pdb
namespace: demeter-prod
spec:
minAvailable: "50%"
selector:
matchLabels:
app: dask-worker
kubectl apply -f pdb.yaml
Priority classes — Ensure your workload gets scheduled before lower-priority jobs:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: demeter-production
value: 1000
globalDefault: false
description: "Priority for production energy forecasting runs"
Step 8: Handling Pod Evictions¶
Kubernetes may evict pods due to resource pressure or node maintenance.
Scalable’s error handling (see Tutorial 7: Error Handling and Resilience Patterns) catches these
as KilledWorker exceptions:
from distributed import as_completed
session = ScalableSession.from_yaml("./scalable.yaml", target="k8s-prod")
client = session.start()
futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in range(200)]
results = []
retry_queue = []
for future in as_completed(futures):
try:
results.append(future.result())
except Exception as e:
if "KilledWorker" in str(type(e).__name__):
# Pod was evicted — retry
scenario_id = future.key.split("-")[-1]
retry_queue.append(scenario_id)
else:
print(f"Permanent failure: {e}")
# Retry evicted tasks
if retry_queue:
print(f"Retrying {len(retry_queue)} evicted tasks...")
retry_futures = [
client.submit(run_demeter_scenario, s, tag="demeter") for s in retry_queue
]
retry_results = client.gather(retry_futures)
results.extend(retry_results)
session.close()
Step 9: CI/CD Integration¶
Automate Kubernetes deployments from your CI pipeline:
# .github/workflows/scalable-prod.yaml
name: Production Pipeline Run
on:
workflow_dispatch:
inputs:
scenarios:
description: "Number of scenarios"
default: "200"
jobs:
run:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: google-github-actions/auth@v2
with:
credentials_json: ${{ secrets.GCP_SA_KEY }}
- uses: google-github-actions/get-gke-credentials@v2
with:
cluster_name: energy-cluster
location: us-central1
- name: Install Scalable
run: pip install scalable[kubernetes,cloud]
- name: Run Pipeline
env:
GCP_PROJECT: ${{ vars.GCP_PROJECT }}
GCS_BUCKET: ${{ vars.GCS_BUCKET }}
IMAGE_TAG: ${{ github.sha }}
SCALABLE_TARGET: k8s-prod
run: |
scalable validate ./scalable.yaml
scalable run ./scalable.yaml --target k8s-prod --workflow pipeline.py
Step 10: Local Development with minikube¶
For local Kubernetes development without a cloud cluster:
# Start minikube
minikube start --cpus=4 --memory=8192
# Install Dask operator
helm install dask-operator dask/dask-kubernetes-operator
# Build and load image locally
docker build -t demeter:local -f capabilities/demeter/Dockerfile.scalable capabilities/demeter
minikube image load demeter:local
# Use local image in manifest
export IMAGE_TAG=local
scalable run ./scalable.yaml --target k8s-dev --workflow workflow.py
This gives you a realistic Kubernetes environment for testing pod scheduling, resource limits, and failure modes before deploying to production.
Troubleshooting¶
- Pods stuck in “Pending” state
Check resource availability:
kubectl describe pod <pod-name> -n <ns>. Common causes: insufficient cluster capacity, resource quota exceeded, or node selector constraints not met.- “ImagePullBackOff” error
The image URI is wrong or the pull secret is missing/expired. Verify:
kubectl get secret -n <ns>and check image URI spelling.- Workers fail to connect to scheduler
Ensure network policies allow pod-to-pod communication within the namespace. The scheduler service must be reachable on port 8786.
- Adaptive scaling not working
Verify the Dask Kubernetes Operator is running and the
DaskClusterresource hasadaptivesection configured. Check operator logs:kubectl logs -n dask-operator deployment/dask-operator.- Resource quota prevents scaling
If
adaptive.maximumexceeds what the quota allows, pods will stay pending. Set maximum to a value within your quota limits.
Next Steps¶
Tutorial 9: ML-Driven Resource Advising and Scaling — Use ML predictions to pre-size Kubernetes pods based on historical resource usage.
Tutorial 7: Error Handling and Resilience Patterns — Build resilient pipelines that handle pod evictions gracefully.
Tutorial 10: AI-Assisted Workflow Composition — Auto-generate Kubernetes manifests from natural language workflow descriptions.