Tutorial 9: ML-Driven Resource Advising and Scaling

What You Will Learn

By the end of this tutorial you will:

Train and use the LearnedAdvisor for ML-backed resource predictions.
Compare ML-backed predictions against the deterministic ResourceAdvisor.
Choose between linear, random forest, and gradient boosting models.
Configure the AdaptiveScaler with ML-informed decisions.
Tune model hyperparameters with cross-validation.

Prerequisites

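Completed Tutorial 1: Getting Started with Scalable, Tutorial 3: Scaling Strategies with Providers, and Tutorial 6: Monitoring and Observability with Telemetry.
pip install "scalable[ml]" (installs scikit-learn, dask-ml, joblib; the quotes stop shells such as zsh from treating the brackets as a glob pattern).
At least 5 completed telemetry runs (more history yields better predictions; see Troubleshooting for how many runs the ML advisor needs before it outperforms the baseline).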

Scenario

Your pipeline has been running for weeks and has accumulated telemetry data. You want to use this history to predict resource allocations for new runs and to drive adaptive scaling decisions in real time. Because ML-backed advising adapts to the characteristics of each workload, it can reduce wasted resources and improve throughput.

Step 1: The ResourceAdvisor (Baseline)

Before ML, Scalable provides a deterministic, quantile-based advisor:

from scalable import ResourceAdvisor

advisor = ResourceAdvisor.from_history("./.scalable/runs")
recommendation = advisor.recommend(
    task="run_demeter_scenario",
    target="local",
    confidence=0.95,
)

print(f"Recommended workers: {recommendation.workers}")
print(f"Resources: {recommendation.resources}")
print(f"Confidence: {recommendation.confidence}")
print(f"Evidence: {recommendation.evidence}")

Expected output:

Recommended workers: {'demeter': 4}
Resources: {'demeter': {'cpus': 8, 'memory': '32G', 'walltime': '02:30:00'}}
Confidence: 0.95
Evidence: {'runs_analyzed': 12, 'method': 'quantile', 'percentile': 95}

The deterministic advisor uses simple quantile statistics (P95 of historical duration and resource usage). It’s reliable but doesn’t adapt to input characteristics — it treats all invocations of run_demeter_scenario identically.
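
To make the quantile approach concrete, here is a minimal sketch of a P95 recommendation computed from past runs. It is not Scalable's implementation: the record fields (duration_s, peak_memory_gb), the helper names, and the nearest-rank percentile rule are all assumptions chosen for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def quantile_recommendation(runs, pct=95):
    """Summarise historical runs (hypothetical dict records) into one recommendation."""
    return {
        "walltime_s": percentile([r["duration_s"] for r in runs], pct),
        "memory_gb": percentile([r["peak_memory_gb"] for r in runs], pct),
        "runs_analyzed": len(runs),
    }

# Hypothetical history: 20 runs with durations 100..2000 s and peak memory 1..20 GB.
history = [{"duration_s": 100 * i, "peak_memory_gb": float(i)} for i in range(1, 21)]
baseline = quantile_recommendation(history)
print(baseline)
```

The result depends only on the history, not on the inputs of the upcoming run, which is exactly the limitation described above.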

Step 2: The LearnedAdvisor (ML-Enhanced)

The LearnedAdvisor trains a machine learning model on your telemetry to predict resource requirements based on task features:

from scalable import LearnedAdvisor

# Train from telemetry history
advisor = LearnedAdvisor.from_history(
    "./.scalable/runs",
    model_type="gradient_boosting",  # or "random_forest", "linear"
)

# Predict resources for a specific task with input features
recommendation = advisor.recommend(
    task="run_demeter_scenario",
    target="hpc",
    features={
        "num_scenarios": 50,
        "input_size_mb": 2048,
        "time_horizon": 2100,
    },
)

print(f"Predicted workers: {recommendation.workers}")
print(f"Predicted resources: {recommendation.resources}")
print(f"Model confidence: {recommendation.confidence:.2f}")

Expected output:

Predicted workers: {'demeter': 8}
Predicted resources: {'demeter': {'cpus': 16, 'memory': '48G', 'walltime': '03:15:00'}}
Model confidence: 0.87

How it works:

The advisor scans telemetry run directories for completed tasks.
It extracts features: task name, input sizes, component resources, target type, historical duration, peak memory.
A model of the chosen model_type (gradient boosting, random forest, or linear) is trained to predict resource allocation from those features.
Each prediction carries a confidence score; a low score triggers fallback to the deterministic advisor.
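
The train, predict, and fall-back flow can be sketched in plain Python. This is a hedged illustration, not the LearnedAdvisor's code: it fits a one-feature least-squares line rather than gradient boosting, uses R² on the training data as a stand-in for the confidence score, and the names fit_line and recommend_walltime as well as the 0.5 threshold are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R^2 on the training data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # No variation in the feature or the target: the fit carries no information.
        return 0.0, mean_y, 0.0
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

def recommend_walltime(runs, input_size_mb, fallback_s, min_confidence=0.5):
    """Predict walltime from input size; use the baseline value when the fit is weak."""
    xs = [r["input_size_mb"] for r in runs]
    ys = [r["duration_s"] for r in runs]
    slope, intercept, confidence = fit_line(xs, ys)
    if confidence < min_confidence:
        return {"walltime_s": fallback_s, "method": "fallback", "confidence": confidence}
    predicted = slope * input_size_mb + intercept
    return {"walltime_s": predicted, "method": "learned", "confidence": confidence}

# Hypothetical telemetry where duration grows linearly with input size.
runs = [{"input_size_mb": mb, "duration_s": 60 + 0.5 * mb} for mb in (256, 512, 1024, 2048)]
learned = recommend_walltime(runs, input_size_mb=4096, fallback_s=1900)

# Hypothetical telemetry where duration is unrelated to input size.
noisy = [{"input_size_mb": mb, "duration_s": d} for mb, d in ((1, 10), (2, 30), (3, 30), (4, 10))]
fallback = recommend_walltime(noisy, input_size_mb=4096, fallback_s=1900)
print(learned["method"], fallback["method"])
```

In the first case the fit explains the data well, so the learned prediction is used; in the second the fit explains nothing, so the baseline value is returned instead.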

Step 3: Model Types and Trade-Offs

Model Type	Accuracy	Training Speed	When to Use
`linear`	Low	Fast (<1s)	Few runs, simple patterns
`random_forest`	Medium	Moderate (5–30s)	Moderate history, non-linear patterns
`gradient_boosting`	High	Slow (30–120s)	Rich history (50+ runs), complex patterns
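
One way to apply the table is to pick the model type from the amount of history available. The helper below is a hypothetical sketch, not part of Scalable: the 50-run cutoff comes from the table, while the 20-run boundary between linear and random_forest is an assumption.

```python
def choose_model_type(n_runs, rich_history=50, moderate_history=20):
    """Pick a model_type string from the number of completed telemetry runs.

    rich_history=50 follows the table; moderate_history=20 is an assumed cutoff.
    """
    if n_runs >= rich_history:
        return "gradient_boosting"
    if n_runs >= moderate_history:
        return "random_forest"
    return "linear"

for n in (8, 30, 120):
    print(n, choose_model_type(n))
```

The returned string can then be passed as model_type when training the advisor.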

Choose via CLI:

# Use the ML advisor from CLI
scalable advise --task run_demeter_scenario --model-type gradient_boosting --format json

Expected output:

{
  "task": "run_demeter_scenario",
  "workers": {"demeter": 8},
  "resources": {"demeter": {"cpus": 16, "memory": "48G", "walltime": "03:15:00"}},
  "confidence": 0.87,
  "model_type": "gradient_boosting"
}
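
The JSON form is convenient for scripting. As a small sketch, assuming the output has exactly the shape shown above, it can be parsed with the standard library. The helper walltime_to_seconds is hypothetical and not part of Scalable.

```python
import json

# Output copied from the example above; in practice this would be read from the CLI's stdout.
cli_output = """
{
  "task": "run_demeter_scenario",
  "workers": {"demeter": 8},
  "resources": {"demeter": {"cpus": 16, "memory": "48G", "walltime": "03:15:00"}},
  "confidence": 0.87,
  "model_type": "gradient_boosting"
}
"""

def walltime_to_seconds(walltime):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds

advice = json.loads(cli_output)

# Flatten the per-component advice into one record per component.
summary = {
    component: {
        "workers": advice["workers"][component],
        "cpus": spec["cpus"],
        "walltime_s": walltime_to_seconds(spec["walltime"]),
    }
    for component, spec in advice["resources"].items()
}
print(summary)
```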

Step 4: AdaptiveScaler with ML Predictions

Combine the LearnedAdvisor with real-time scaling:

from scalable import AdaptiveScaler, LearnedAdvisor, ScalableSession

# Train advisor
advisor = LearnedAdvisor.from_history("./.scalable/runs", model_type="gradient_boosting")

# Create adaptive scaler backed by ML predictions
scaler = AdaptiveScaler(
    advisor=advisor,
    min_workers={"demeter": 2, "postprocess": 1},
    max_workers={"demeter": 30, "postprocess": 10},
    scale_up_threshold=0.7,
    scale_down_threshold=0.3,
    cooldown_seconds=90,
)

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
client = session.start()

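# Submit work in batches and let the scaler decide.
# scenario_batches (an iterable of scenario lists) and run_demeter_scenario
# are assumed to be defined elsewhere in your workflow.
for batch in scenario_batches:
    futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in batch]

    decision = scaler.evaluate(
        # Tags must match the tag passed to client.submit() and the keys in min/max_workers.
        pending_tasks=[{"tag": "demeter", "features": {"input_size_mb": s.size}} for s in batch],
        active_workers={"demeter": 10},
        recent_completions=[{"tag": "demeter", "duration_s": 180}],
    )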

    if decision.has_changes:
        print(f"ML-informed scaling: {decision.reasoning}")
        print(f"  Confidence: {decision.confidence:.2f}")
        print(f"  Predicted completion: {decision.predicted_completion_time:.0f}s")

session.close()

The ML-backed scaler considers:

Current queue depth and worker utilization.
Predicted task duration from the learned model.
Historical scaling patterns (what worked before).
Cost constraints (from the max_workers ceiling).
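
How these inputs might combine into a single decision can be sketched as follows. This is a hypothetical rule, not the AdaptiveScaler's actual logic: the function name, the window_s parameter, and the one-worker-at-a-time scale-down are assumptions, while the threshold, cooldown, and min/max parameters mirror the constructor arguments shown above.

```python
import math

def decide_workers(current, utilization, pending_tasks, predicted_duration_s, *,
                   min_workers, max_workers, window_s=720,
                   scale_up_threshold=0.7, scale_down_threshold=0.3,
                   seconds_since_last_change=None, cooldown_seconds=90):
    """Return a target worker count for one component."""
    # Respect the cooldown: hold steady if the last change was too recent.
    if seconds_since_last_change is not None and seconds_since_last_change < cooldown_seconds:
        return current
    if utilization > scale_up_threshold:
        # Tasks one worker can finish within the window, given the predicted duration.
        tasks_per_worker = max(1, int(window_s // predicted_duration_s))
        target = max(current + 1, math.ceil(pending_tasks / tasks_per_worker))
    elif utilization < scale_down_threshold:
        target = current - 1
    else:
        target = current
    # Clamp to the configured floor and ceiling (the cost constraint).
    return max(min_workers, min(max_workers, target))

limits = {"min_workers": 2, "max_workers": 30}
busy = decide_workers(10, 0.9, pending_tasks=80, predicted_duration_s=180, **limits)
capped = decide_workers(10, 0.9, pending_tasks=400, predicted_duration_s=180, **limits)
idle = decide_workers(10, 0.1, pending_tasks=0, predicted_duration_s=180, **limits)
cooling = decide_workers(10, 0.9, pending_tasks=80, predicted_duration_s=180,
                         seconds_since_last_change=30, **limits)
print(busy, capped, idle, cooling)
```

Longer predicted durations lower tasks_per_worker and therefore raise the target, which is where a learned duration estimate would influence the decision in this sketch.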

Step 5: Hyperparameter Tuning

For optimal predictions, tune the ML model:

from scalable.ml import HyperparameterSearch

search = HyperparameterSearch(
    runs_dir="./.scalable/runs",
    model_type="gradient_boosting",
    cv_folds=5,
)

best_params = search.run()
print(f"Best parameters: {best_params}")
print(f"Cross-validation score: {search.best_score:.3f}")

# Use best parameters
advisor = LearnedAdvisor.from_history(
    "./.scalable/runs",
    model_type="gradient_boosting",
    model_params=best_params,
)
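
For readers unfamiliar with the technique, the sketch below shows a grid search scored by k-fold cross-validation in plain Python. It is a generic illustration and makes no claim about how HyperparameterSearch works internally: the model is a stand-in (one-feature ridge regression through the origin), and every function name here is hypothetical.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices); each sample is validated exactly once."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    for fold in range(n_folds):
        validation = indices[fold::n_folds]  # every n_folds-th sample, offset by fold
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation

def fit_ridge_slope(xs, ys, alpha):
    """Closed-form 1-D ridge through the origin: w = sum(x*y) / (sum(x*x) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def cross_validated_mae(xs, ys, alpha, n_folds=5):
    """Mean absolute error on held-out folds, averaged over all folds (lower is better)."""
    fold_errors = []
    for train, validation in k_fold_splits(len(xs), n_folds):
        w = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], alpha)
        errors = [abs(ys[i] - w * xs[i]) for i in validation]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)

# Synthetic data with an exact linear relationship, so no regularisation is needed.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]

grid = [0.0, 1.0, 10.0, 100.0]  # candidate hyperparameter values
scores = {alpha: cross_validated_mae(xs, ys, alpha) for alpha in grid}
best_alpha = min(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```

Each candidate is scored only on data it was not fitted to, which is why cross-validation gives a more honest comparison than training error.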

Troubleshooting

LearnedAdvisor predictions are poor: Ensure you have sufficient telemetry history (at least 10–20 completed runs with varied inputs). With fewer runs, the deterministic ResourceAdvisor is more reliable.
“ImportError: scikit-learn not installed”: Install the ML extra: pip install scalable[ml].
Confidence is consistently low: The model has not seen enough varied inputs. Continue running the workflow to grow the telemetry history, then refit the advisor.
Cross-validation scores are unstable across folds: Your dataset may be too small or imbalanced. Aim for 50+ runs with a mix of input characteristics before relying on tuned hyperparameters.
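
To judge whether fold scores are unstable, one simple check is to compare their spread with their mean. The sketch below assumes you have the per-fold scores as a list of floats; the function name and the 25% relative-spread threshold are hypothetical choices, not Scalable defaults.

```python
from statistics import mean, stdev

def cv_stability(fold_scores, max_relative_spread=0.25):
    """Flag a set of cross-validation fold scores as stable or unstable."""
    avg = mean(fold_scores)
    spread = stdev(fold_scores)  # sample standard deviation across folds
    relative = spread / abs(avg) if avg else float("inf")
    return {"mean": avg, "stdev": spread, "stable": relative <= max_relative_spread}

# Hypothetical fold scores from two different tuning runs.
consistent = cv_stability([0.80, 0.82, 0.79, 0.81, 0.80])
erratic = cv_stability([0.90, 0.20, 0.70, 0.10, 0.85])
print(consistent["stable"], erratic["stable"])
```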

Next Steps

Tutorial 10: AI-Assisted Workflow Composition — Use AI assistants to generate workflow configurations that incorporate ML-driven advising.
Tutorial 6: Monitoring and Observability with Telemetry — Track ML-driven scaling decisions and resource utilization in telemetry for cost analysis.
Tutorial 5: Cloud Integration with AWS and GCP — Deploy ML-backed adaptive workflows to cloud for maximum cost efficiency.