Tutorial 9: ML-Driven Resource Advising and Scaling
What You Will Learn
By the end of this tutorial you will:
Train and use the LearnedAdvisor for ML-backed resource predictions.
Compare ML-backed predictions against the deterministic ResourceAdvisor.
Choose between linear, random forest, and gradient boosting models.
Configure the AdaptiveScaler with ML-informed decisions.
Tune model hyperparameters with cross-validation.
Prerequisites
Completed Tutorial 1: Getting Started with Scalable, Tutorial 6: Monitoring and Observability with Telemetry, and Tutorial 3: Scaling Strategies with Providers.
pip install scalable[ml] (installs scikit-learn, dask-ml, joblib).
At least 5 completed telemetry runs (more history → better predictions).
Scenario
Your pipeline has been running for weeks, accumulating telemetry data. You want to use this history to predict resource allocations for new runs automatically and to drive adaptive scaling decisions in real time. Because ML-backed advising adapts to the characteristics of each workload, it can reduce wasted resources and improve throughput.
Step 1: The ResourceAdvisor (Baseline)
Before ML, Scalable provides a deterministic, quantile-based advisor:
from scalable import ResourceAdvisor

advisor = ResourceAdvisor.from_history("./.scalable/runs")
recommendation = advisor.recommend(
    task="run_demeter_scenario",
    target="local",
    confidence=0.95,
)
print(f"Recommended workers: {recommendation.workers}")
print(f"Resources: {recommendation.resources}")
print(f"Confidence: {recommendation.confidence}")
print(f"Evidence: {recommendation.evidence}")
Expected output:
Recommended workers: {'demeter': 4}
Resources: {'demeter': {'cpus': 8, 'memory': '32G', 'walltime': '02:30:00'}}
Confidence: 0.95
Evidence: {'runs_analyzed': 12, 'method': 'quantile', 'percentile': 95}
The deterministic advisor uses simple quantile statistics (P95 of historical
duration and resource usage). It’s reliable but doesn’t adapt to input
characteristics — it treats all invocations of run_demeter_scenario identically.
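To make the quantile idea concrete, here is a minimal sketch of a P95 calculation over historical task durations. It is not Scalable's implementation: the `percentile` helper and the sample durations are hypothetical, and the nearest-rank method used here may differ from whatever interpolation the library applies.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank into the sorted data
    return ordered[rank - 1]


# Hypothetical task durations (seconds) from 12 past runs.
durations_s = [410, 395, 520, 610, 455, 480, 700, 430, 505, 560, 475, 640]

p50 = percentile(durations_s, 50)
p95 = percentile(durations_s, 95)
print(f"P50 duration: {p50}s, P95 duration: {p95}s")
```

Sizing to P95 rather than the median leaves headroom for the slowest runs, which is why a quantile-based recommendation tends to be conservative: every invocation gets the same allocation regardless of its inputs.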
Step 2: The LearnedAdvisor (ML-Enhanced)
The LearnedAdvisor trains a machine learning model on your telemetry to predict resource requirements based on task features:
from scalable import LearnedAdvisor

# Train from telemetry history
advisor = LearnedAdvisor.from_history(
    "./.scalable/runs",
    model_type="gradient_boosting",  # or "random_forest", "linear"
)

# Predict resources for a specific task with input features
recommendation = advisor.recommend(
    task="run_demeter_scenario",
    target="hpc",
    features={
        "num_scenarios": 50,
        "input_size_mb": 2048,
        "time_horizon": 2100,
    },
)
print(f"Predicted workers: {recommendation.workers}")
print(f"Predicted resources: {recommendation.resources}")
print(f"Model confidence: {recommendation.confidence:.2f}")
Expected output:
Predicted workers: {'demeter': 8}
Predicted resources: {'demeter': {'cpus': 16, 'memory': '48G', 'walltime': '03:15:00'}}
Model confidence: 0.87
How it works:
The advisor scans telemetry run directories for completed tasks.
It extracts features: task name, input sizes, component resources, target type, historical duration, peak memory.
The selected model (gradient boosting, random forest, or linear) is trained to predict optimal resource allocation given input features.
Predictions include confidence intervals; low confidence triggers a fallback to the deterministic advisor.
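The train, predict, and fall-back flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not how LearnedAdvisor works internally: it fits a one-feature least-squares line, uses R² as a stand-in for model confidence, and every name here (`fit_line`, `recommend_memory_gb`, the `min_confidence` cutoff, and both sample histories) is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot


def recommend_memory_gb(history, input_size_mb, baseline_gb, min_confidence=0.8):
    """Return (memory_gb, confidence, source); use the baseline when the fit is weak."""
    xs = [run["input_size_mb"] for run in history]
    ys = [run["peak_memory_gb"] for run in history]
    slope, intercept = fit_line(xs, ys)
    confidence = r_squared(xs, ys, slope, intercept)
    if confidence < min_confidence:
        return baseline_gb, confidence, "baseline"
    return slope * input_size_mb + intercept, confidence, "model"


# Hypothetical telemetry: memory tracks input size closely in the first history
# and not at all in the second.
sizes_mb = [512, 1024, 1536, 2048, 4096]
clean_history = [
    {"input_size_mb": x, "peak_memory_gb": y} for x, y in zip(sizes_mb, [10, 18, 26, 34, 66])
]
noisy_history = [
    {"input_size_mb": x, "peak_memory_gb": y} for x, y in zip(sizes_mb, [30, 12, 28, 10, 20])
]

for name, history in [("clean", clean_history), ("noisy", noisy_history)]:
    memory_gb, confidence, source = recommend_memory_gb(history, 3072, baseline_gb=32)
    print(f"{name}: {memory_gb:.1f} GB from {source} (confidence {confidence:.2f})")
```

The design point is the last branch of `recommend_memory_gb`: a learned prediction is only used when the model explains the history well, otherwise the quantile-style baseline wins.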
Step 3: Model Types and Trade-Offs
| Model Type | Accuracy | Training Speed | When to Use |
|---|---|---|---|
| linear | Low | Fast (<1s) | Few runs, simple patterns |
| random_forest | Medium | Moderate (5–30s) | Moderate history, non-linear patterns |
| gradient_boosting | High | Slow (30–120s) | Rich history (50+ runs), complex patterns |
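If you want to encode these trade-offs in a script, one option is a small helper that picks a model type from the number of completed runs. This is a hypothetical heuristic, not part of Scalable: the 50-run cutoff comes from the guidance above, while the 20-run cutoff for "few runs" is an assumption you should adjust to your own data.

```python
def pick_model_type(num_runs, few_runs_cutoff=20, rich_history_cutoff=50):
    """Map the size of the telemetry history to a model_type string.

    few_runs_cutoff is an assumed boundary; rich_history_cutoff follows the
    "50+ runs" guidance for gradient boosting.
    """
    if num_runs < 0:
        raise ValueError("num_runs must be non-negative")
    if num_runs >= rich_history_cutoff:
        return "gradient_boosting"
    if num_runs >= few_runs_cutoff:
        return "random_forest"
    return "linear"


for runs in (8, 35, 120):
    print(f"{runs} runs -> {pick_model_type(runs)}")
```

The returned string matches the `model_type` values used elsewhere in this tutorial, so it could be passed straight through when training an advisor.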
Choose via CLI:
# Use the ML advisor from CLI
scalable advise --task run_demeter_scenario --model-type gradient_boosting --format json
{
  "task": "run_demeter_scenario",
  "workers": {"demeter": 8},
  "resources": {"demeter": {"cpus": 16, "memory": "48G", "walltime": "03:15:00"}},
  "confidence": 0.87,
  "model_type": "gradient_boosting"
}
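JSON output is convenient when another script consumes the advice. The sketch below parses the sample output shown above with the standard library; it assumes the JSON shape matches that sample and that walltime is always formatted as HH:MM:SS. The `walltime_to_seconds` helper is hypothetical.

```python
import json

# Sample output copied from the CLI example above.
cli_output = """
{
  "task": "run_demeter_scenario",
  "workers": {"demeter": 8},
  "resources": {"demeter": {"cpus": 16, "memory": "48G", "walltime": "03:15:00"}},
  "confidence": 0.87,
  "model_type": "gradient_boosting"
}
"""


def walltime_to_seconds(walltime):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


advice = json.loads(cli_output)
for component, spec in advice["resources"].items():
    workers = advice["workers"][component]
    total_cpus = workers * spec["cpus"]
    walltime_s = walltime_to_seconds(spec["walltime"])
    print(f"{component}: {workers} workers, {total_cpus} CPUs total, {walltime_s}s walltime")
```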
Step 4: AdaptiveScaler with ML Predictions
Combine the LearnedAdvisor with real-time scaling:
from scalable import AdaptiveScaler, LearnedAdvisor, ScalableSession

# Train advisor
advisor = LearnedAdvisor.from_history("./.scalable/runs", model_type="gradient_boosting")

# Create adaptive scaler backed by ML predictions
scaler = AdaptiveScaler(
    advisor=advisor,
    min_workers={"demeter": 2, "postprocess": 1},
    max_workers={"demeter": 30, "postprocess": 10},
    scale_up_threshold=0.7,
    scale_down_threshold=0.3,
    cooldown_seconds=90,
)

session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
client = session.start()

# Submit work in batches and let the scaler decide.
# `scenario_batches` and `run_demeter_scenario` come from your own workflow.
for batch in scenario_batches:
    futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in batch]
    decision = scaler.evaluate(
        # Report pending and completed work under the same tag the tasks were submitted with
        pending_tasks=[{"tag": "demeter", "features": {"input_size_mb": s.size}} for s in batch],
        active_workers={"demeter": 10},
        recent_completions=[{"tag": "demeter", "duration_s": 180}],
    )
    if decision.has_changes:
        print(f"ML-informed scaling: {decision.reasoning}")
        print(f"  Confidence: {decision.confidence:.2f}")
        print(f"  Predicted completion: {decision.predicted_completion_time:.0f}s")

session.close()
The ML-backed scaler considers:
Current queue depth and worker utilization.
Predicted task duration from the learned model.
Historical scaling patterns (what worked before).
Cost constraints (from the max_workers ceiling).
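To build intuition for how the thresholds, cooldown, and worker bounds interact, here is a simplified decision function for a single component. It is a sketch of one plausible interpretation, not the AdaptiveScaler's actual logic: `decide_workers` and the `tasks_per_worker` parameter are hypothetical, and the defaults simply mirror the configuration values used above.

```python
import math


def decide_workers(
    current,
    pending_tasks,
    utilization,
    seconds_since_last_change,
    min_workers=2,
    max_workers=30,
    scale_up_threshold=0.7,
    scale_down_threshold=0.3,
    cooldown_seconds=90,
    tasks_per_worker=4,  # assumed: queued tasks one extra worker is expected to absorb
):
    """Return the target worker count for one component."""
    if seconds_since_last_change < cooldown_seconds:
        return current  # still cooling down, so hold steady to avoid thrashing

    target = current
    if utilization > scale_up_threshold and pending_tasks > 0:
        # Busy workers and a backlog: add enough workers to absorb the queue.
        target = current + math.ceil(pending_tasks / tasks_per_worker)
    elif utilization < scale_down_threshold:
        # Mostly idle: step down one worker at a time.
        target = current - 1

    # Never leave the configured floor and ceiling.
    return max(min_workers, min(max_workers, target))


print(decide_workers(current=10, pending_tasks=40, utilization=0.9, seconds_since_last_change=120))
print(decide_workers(current=10, pending_tasks=40, utilization=0.9, seconds_since_last_change=30))
print(decide_workers(current=3, pending_tasks=0, utilization=0.1, seconds_since_last_change=120))
```

In an ML-backed scaler, the fixed `tasks_per_worker` guess would be replaced by predicted task durations from the learned model, which is where the advisor adds value over plain thresholds.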
Step 5: Hyperparameter Tuning
For optimal predictions, tune the ML model:
from scalable.ml import HyperparameterSearch

search = HyperparameterSearch(
    runs_dir="./.scalable/runs",
    model_type="gradient_boosting",
    cv_folds=5,
)
best_params = search.run()
print(f"Best parameters: {best_params}")
print(f"Cross-validation score: {search.best_score:.3f}")

# Use best parameters
advisor = LearnedAdvisor.from_history(
    "./.scalable/runs",
    model_type="gradient_boosting",
    model_params=best_params,
)
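If cross-validated search is new to you, the following self-contained sketch shows the general technique on a toy problem. It makes no claim about how HyperparameterSearch is implemented: it uses a tiny k-nearest-neighbours regressor instead of gradient boosting, the data is synthetic, and all function names are hypothetical. What it does share with the step above is the idea of scoring each candidate setting across 5 folds and keeping the best one.

```python
import random


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_mae(xs, ys, k, n_folds=5):
    """Mean absolute error of k-NN, averaged over the held-out folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(xs), n_folds):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        errors = [abs(knn_predict(train_x, train_y, xs[i], k) - ys[i]) for i in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


def grid_search(xs, ys, candidates, n_folds=5):
    """Score every candidate k and return (best_k, scores); lower error is better."""
    scores = {k: cv_mae(xs, ys, k, n_folds) for k in candidates}
    return min(scores, key=scores.get), scores


# Synthetic history: duration grows with input size, plus a small deterministic wobble.
samples = [(100 * i, 30 + 0.5 * (100 * i) + ((i * 7) % 5 - 2) * 3) for i in range(1, 21)]
random.Random(0).shuffle(samples)  # shuffle so each fold spans the full input range
xs = [x for x, _ in samples]
ys = [y for _, y in samples]

candidates = [1, 3, 5]
best_k, scores = grid_search(xs, ys, candidates)
for k in candidates:
    print(f"k={k}: cross-validated MAE {scores[k]:.2f}")
print(f"Best k: {best_k}")
```

Because every score is an average over held-out folds, a setting that merely memorises the training runs does not win, which is the main reason to tune with cross-validation rather than on the full history.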
Troubleshooting
- LearnedAdvisor predictions are poor
Ensure you have sufficient telemetry history (at least 10–20 completed runs with varied inputs). With fewer runs, the deterministic ResourceAdvisor is more reliable.
- “ImportError: scikit-learn not installed”
Install the ML extra: pip install scalable[ml].
- Confidence is consistently low
The model has not seen enough varied inputs. Continue running the workflow to grow the telemetry history, then refit the advisor.
- Cross-validation scores are unstable across folds
Your dataset may be too small or imbalanced. Aim for 50+ runs with a mix of input characteristics before relying on tuned hyperparameters.
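For the ImportError case above, you can check up front whether the ML dependencies are importable and choose an advisor accordingly. This is a minimal sketch using only the standard library; `has_module` is a hypothetical helper, and it assumes scikit-learn is the dependency that matters (its import name is `sklearn`).

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Decide which advisor to use before touching any ML code paths.
ml_ready = has_module("sklearn")
advisor_kind = "LearnedAdvisor" if ml_ready else "ResourceAdvisor"
print(f"scikit-learn available: {ml_ready}; using {advisor_kind}")
```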
Next Steps
Tutorial 10: AI-Assisted Workflow Composition: Use AI assistants to generate workflow configurations that incorporate ML-driven advising.
Tutorial 6: Monitoring and Observability with Telemetry: Track ML-driven scaling decisions and resource utilization in telemetry for cost analysis.
Tutorial 5: Cloud Integration with AWS and GCP: Deploy ML-backed adaptive workflows to the cloud for better cost efficiency.