Tutorial 9: ML-Driven Resource Advising and Scaling

What You Will Learn

By the end of this tutorial you will:

  • Train and use the LearnedAdvisor for ML-backed resource predictions.

  • Compare ML-backed predictions against the deterministic ResourceAdvisor.

  • Choose between linear, random forest, and gradient boosting models.

  • Configure the AdaptiveScaler with ML-informed decisions.

  • Tune model hyperparameters with cross-validation.

Prerequisites

Scenario

Your pipeline has been running for weeks and has accumulated telemetry data. You want to use this history to predict resource allocations for new runs and to drive adaptive scaling decisions in real time. ML-backed advising can reduce wasted resources, improve throughput, and adapt to the characteristics of each workload.

Step 1: The ResourceAdvisor (Baseline)

Before turning to ML, start with the baseline. Scalable provides a deterministic, quantile-based advisor:

from scalable import ResourceAdvisor

advisor = ResourceAdvisor.from_history("./.scalable/runs")
recommendation = advisor.recommend(
    task="run_demeter_scenario",
    target="local",
    confidence=0.95,
)

print(f"Recommended workers: {recommendation.workers}")
print(f"Resources: {recommendation.resources}")
print(f"Confidence: {recommendation.confidence}")
print(f"Evidence: {recommendation.evidence}")

Expected output:

Recommended workers: {'demeter': 4}
Resources: {'demeter': {'cpus': 8, 'memory': '32G', 'walltime': '02:30:00'}}
Confidence: 0.95
Evidence: {'runs_analyzed': 12, 'method': 'quantile', 'percentile': 95}

The deterministic advisor uses simple quantile statistics (the P95 of historical duration and resource usage). It is reliable, but it does not adapt to input characteristics: every invocation of run_demeter_scenario receives the same recommendation, regardless of its inputs.
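To make the quantile idea concrete, here is a minimal sketch of a P95-style recommendation computed from a few made-up run records. It is not Scalable's implementation: the record fields (`duration_s`, `peak_memory_gb`) and the helpers `percentile` and `quantile_recommendation` are hypothetical names chosen for illustration.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def quantile_recommendation(runs, q=95):
    """Recommend walltime and memory as the q-th percentile of past usage."""
    return {
        "walltime_s": percentile([r["duration_s"] for r in runs], q),
        "memory_gb": percentile([r["peak_memory_gb"] for r in runs], q),
    }


# Hypothetical telemetry: all runs of the task are pooled, whatever their inputs.
history = [
    {"duration_s": 1200, "peak_memory_gb": 10},
    {"duration_s": 1500, "peak_memory_gb": 12},
    {"duration_s": 900, "peak_memory_gb": 8},
    {"duration_s": 8400, "peak_memory_gb": 30},  # a single large-input run
]

rec = quantile_recommendation(history)
print(rec)
```

In this toy history the single large run sets the recommendation, so every small-input run would be over-provisioned. That is the limitation a feature-aware model is meant to address.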

Step 2: The LearnedAdvisor (ML-Enhanced)

The LearnedAdvisor trains a machine learning model on your telemetry to predict resource requirements based on task features:

from scalable import LearnedAdvisor

# Train from telemetry history
advisor = LearnedAdvisor.from_history(
    "./.scalable/runs",
    model_type="gradient_boosting",  # or "random_forest", "linear"
)

# Predict resources for a specific task with input features
recommendation = advisor.recommend(
    task="run_demeter_scenario",
    target="hpc",
    features={
        "num_scenarios": 50,
        "input_size_mb": 2048,
        "time_horizon": 2100,
    },
)

print(f"Predicted workers: {recommendation.workers}")
print(f"Predicted resources: {recommendation.resources}")
print(f"Model confidence: {recommendation.confidence:.2f}")

Expected output:

Predicted workers: {'demeter': 8}
Predicted resources: {'demeter': {'cpus': 16, 'memory': '48G', 'walltime': '03:15:00'}}
Model confidence: 0.87

How it works:

  1. The advisor scans telemetry run directories for completed tasks.

  2. It extracts features: task name, input sizes, component resources, target type, historical duration, peak memory.

  3. A model of the selected type (linear, random forest, or gradient boosting) is trained to predict the resource allocation from the input features.

  4. Each prediction carries a confidence value; when confidence is low, the advisor falls back to the deterministic advisor.
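The fallback in the last step can be sketched as follows. This is an illustration only, not the library's internal logic: the `Recommendation` dataclass, the `with_fallback` helper, and the 0.6 cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    cpus: int
    memory_gb: int
    confidence: float
    source: str  # which advisor produced this recommendation


def with_fallback(learned, baseline, min_confidence=0.6):
    """Return the learned recommendation unless its confidence is below the cutoff."""
    if learned.confidence >= min_confidence:
        return learned
    return baseline


# Hypothetical recommendations for the same task.
baseline = Recommendation(cpus=8, memory_gb=32, confidence=0.95, source="quantile")
confident = Recommendation(cpus=16, memory_gb=48, confidence=0.87, source="learned")
uncertain = Recommendation(cpus=4, memory_gb=16, confidence=0.41, source="learned")

print(with_fallback(confident, baseline).source)
print(with_fallback(uncertain, baseline).source)
```

Keeping the deterministic result as the fallback means a poorly trained model cannot produce a recommendation worse than the baseline you already trust.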

Step 3: Model Types and Trade-Offs

Model Type         Accuracy  Training Speed     When to Use
linear             Low       Fast (<1s)         Few runs, simple patterns
random_forest      Medium    Moderate (5–30s)   Moderate history, non-linear patterns
gradient_boosting  High      Slow (30–120s)     Rich history (50+ runs), complex patterns
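The guidance in the table can be turned into a small helper that picks a model type from the number of completed runs. This is a sketch under stated assumptions: the 50-run threshold comes from the table, while the 20-run boundary between "few" and "moderate" history and the function name `pick_model_type` are hypothetical.

```python
def pick_model_type(num_runs, rich_history=50, moderate_history=20):
    """Map the size of the telemetry history to one of the three model types."""
    if num_runs >= rich_history:
        return "gradient_boosting"
    if num_runs >= moderate_history:  # assumed boundary, not from the table
        return "random_forest"
    return "linear"


for n in (12, 30, 80):
    print(n, pick_model_type(n))
```

The returned string matches the values accepted by the model_type argument shown earlier, so it could be passed straight through once you know how many runs your history contains.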

The model type can also be selected from the CLI:

# Use the ML advisor from the CLI
scalable advise --task run_demeter_scenario --model-type gradient_boosting --format json

Expected output:

{
  "task": "run_demeter_scenario",
  "workers": {"demeter": 8},
  "resources": {"demeter": {"cpus": 16, "memory": "48G", "walltime": "03:15:00"}},
  "confidence": 0.87,
  "model_type": "gradient_boosting"
}

Step 4: AdaptiveScaler with ML Predictions

Combine the LearnedAdvisor with real-time scaling:

from scalable import AdaptiveScaler, LearnedAdvisor, ScalableSession

# Train advisor
advisor = LearnedAdvisor.from_history("./.scalable/runs", model_type="gradient_boosting")

# Create adaptive scaler backed by ML predictions
scaler = AdaptiveScaler(
    advisor=advisor,
    min_workers={"demeter": 2, "postprocess": 1},
    max_workers={"demeter": 30, "postprocess": 10},
    scale_up_threshold=0.7,
    scale_down_threshold=0.3,
    cooldown_seconds=90,
)

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
client = session.start()

# Submit work in batches and let the scaler decide.
# scenario_batches and run_demeter_scenario are assumed to be defined elsewhere.
for batch in scenario_batches:
    futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in batch]

    # Use the same tag as in client.submit so the scaler applies the "demeter" limits.
    decision = scaler.evaluate(
        pending_tasks=[{"tag": "demeter", "features": {"input_size_mb": s.size}} for s in batch],
        active_workers={"demeter": 10},
        recent_completions=[{"tag": "demeter", "duration_s": 180}],
    )

    if decision.has_changes:
        print(f"ML-informed scaling: {decision.reasoning}")
        print(f"  Confidence: {decision.confidence:.2f}")
        print(f"  Predicted completion: {decision.predicted_completion_time:.0f}s")

session.close()

The ML-backed scaler considers:

  • Current queue depth and worker utilization.

  • Predicted task duration from the learned model.

  • Historical scaling patterns (what worked before).

  • Cost constraints (from the max_workers ceiling).
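To show how these inputs might combine, here is a minimal sketch of a threshold-based decision with a cooldown. It is not the AdaptiveScaler's algorithm. It assumes the thresholds are utilization fractions, that scale-up sizes the pool to drain the queue within a hypothetical `target_drain_s`, and that scale-down halves the pool; the function `decide_workers` is invented for this example.

```python
import math


def decide_workers(current, pending, predicted_task_s, utilization, seconds_since_change,
                   min_workers=2, max_workers=30, scale_up_threshold=0.7,
                   scale_down_threshold=0.3, cooldown_seconds=90, target_drain_s=600):
    """Return the worker count to use next for one component."""
    if seconds_since_change < cooldown_seconds:
        return current  # still cooling down: avoid oscillating
    if utilization >= scale_up_threshold:
        # Workers needed to finish the queue within target_drain_s,
        # given the predicted per-task duration.
        needed = math.ceil(pending * predicted_task_s / target_drain_s)
        return min(max_workers, max(current, needed))
    if utilization <= scale_down_threshold:
        return max(min_workers, current // 2)
    return current


print(decide_workers(10, pending=100, predicted_task_s=180, utilization=0.9, seconds_since_change=120))
print(decide_workers(10, pending=100, predicted_task_s=180, utilization=0.9, seconds_since_change=30))
print(decide_workers(10, pending=5, predicted_task_s=180, utilization=0.2, seconds_since_change=120))
```

The predicted task duration is where a learned model would plug in: a better duration estimate gives a better `needed` value, while min_workers and max_workers still bound the result.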

Step 5: Hyperparameter Tuning

To improve prediction quality, tune the model's hyperparameters with cross-validation:

from scalable.ml import HyperparameterSearch

search = HyperparameterSearch(
    runs_dir="./.scalable/runs",
    model_type="gradient_boosting",
    cv_folds=5,
)

best_params = search.run()
print(f"Best parameters: {best_params}")
print(f"Cross-validation score: {search.best_score:.3f}")

# Use best parameters
advisor = LearnedAdvisor.from_history(
    "./.scalable/runs",
    model_type="gradient_boosting",
    model_params=best_params,
)
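For readers unfamiliar with k-fold cross-validation, the sketch below shows the idea behind a setting like cv_folds=5 using only the standard library. It does not reproduce what HyperparameterSearch does internally: the data is synthetic, the model is a one-feature least-squares line rather than gradient boosting, and all function names are hypothetical.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i tests on every k-th sample starting at i."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def cross_val_mae(xs, ys, k=5):
    """Mean absolute error on each held-out fold."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean([abs(ys[i] - (a * xs[i] + b)) for i in test]))
    return scores


# Synthetic data: input size (MB) against duration (s), exactly linear by construction.
sizes = [100 * i for i in range(1, 21)]
durations = [0.5 * s + 60 for s in sizes]

scores = cross_val_mae(sizes, durations, k=5)
print(f"{len(scores)} folds, worst error {max(scores):.6f}s")
```

Because the synthetic data is exactly linear, every fold's error is essentially zero. On real telemetry the per-fold errors differ, and a large spread between folds is the instability described under Troubleshooting.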

Troubleshooting

LearnedAdvisor predictions are poor

Ensure you have sufficient telemetry history (at least 10–20 completed runs with varied inputs). With fewer runs, the deterministic ResourceAdvisor is more reliable.

“ImportError: scikit-learn not installed”

Install the ML extra: pip install "scalable[ml]" (the quotes keep shells such as zsh from interpreting the square brackets).

Confidence is consistently low

The model has not seen enough varied inputs. Continue running the workflow to grow the telemetry history, then refit the advisor.

Cross-validation scores are unstable across folds

Your dataset may be too small or imbalanced. Aim for 50+ runs with a mix of input characteristics before relying on tuned hyperparameters.

Next Steps