Beginner Tutorial 9: Machine Learning for Smarter Workflows
The Big Picture
After running your workflow many times, you’ve accumulated telemetry data showing how tasks perform: which scenarios are fast, which are slow, how much memory different inputs require. What if a computer could learn these patterns and predict optimal resource allocations automatically?
This tutorial introduces machine learning concepts in the context of workflow optimization: using past experience to make smarter decisions about how many workers to start, how much memory to request, and when to scale up or down.
What You Will Learn
By the end of this tutorial you will:
Understand what machine learning is at a high level.
Know the difference between training and inference.
Understand how Scalable’s LearnedAdvisor predicts resource needs.
Understand cross-validation and why it matters.
Use the AdaptiveScaler for real-time scaling decisions backed by ML.
Prerequisites
Completed Beginner Tutorial 1: Your First Workflow, Beginner Tutorial 6: Understanding What Happened, and Beginner Tutorial 3: How Distributed Computing Works.
pip install scalable[ml] (installs scikit-learn, dask-ml).
At least 5 completed telemetry runs (more history → better predictions).
Key Concepts Explained
💡 Key Concept: What is Machine Learning?
Machine learning (ML) is teaching computers to find patterns in data and make predictions without being explicitly programmed with rules.
- Traditional programming:
Human writes rules → computer follows rules
IF memory_usage > 8GB THEN allocate 16GB
IF memory_usage > 16GB THEN allocate 32GB
- Machine learning:
Computer finds rules from data → uses them to predict
Training data: [past runs with memory usage patterns]
ML model learns: "scenarios with >1000 nodes need ~12GB"
Prediction: "scenario 47 (1200 nodes) → recommend 16GB"
Analogy: A traditional program is like a recipe (follow these steps). ML is like learning to cook from experience (after cooking 100 dishes, you develop intuition about seasoning, timing, etc.).
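To make the contrast concrete, here is a minimal sketch in plain Python. The history values are made up for illustration, and a straight-line fit stands in for the much richer models discussed later; nothing here is Scalable's own code.

```python
def rule_based_memory_gb(nodes: int) -> int:
    """Traditional programming: a human picked the cutoff and both sizes."""
    return 16 if nodes > 1000 else 8


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up history: (node count, peak memory in GB) from past runs.
history = [(200, 3.0), (400, 5.0), (800, 9.0), (1000, 11.0), (1600, 17.0)]
slope, intercept = fit_line([n for n, _ in history], [m for _, m in history])


def learned_memory_gb(nodes: int) -> float:
    """Machine learning: the 'rule' is whatever line the data produced."""
    return slope * nodes + intercept


print(rule_based_memory_gb(1200))          # 16
print(f"{learned_memory_gb(1200):.1f}")    # 13.0
```

The hand-written rule only knows two answers. The fitted line gives a different answer for every node count, and it changes by itself when the history changes.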
💡 Key Concept: Training vs. Inference
ML has two phases:
- Training (learning):
Feed historical data to an algorithm. The algorithm adjusts its internal parameters to fit the patterns in the data.
Slow (minutes to hours)
Done once (or periodically when new data is available)
Requires labeled data (inputs + known correct outputs)
- Inference (predicting):
Use the trained model to make predictions on new inputs.
Fast (milliseconds)
Done many times
Uses the patterns learned during training
In Scalable:
Training = learning from telemetry history (past run metrics)
Inference = predicting resource needs for new runs
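The two phases can be sketched with a deliberately tiny "model": the average peak memory per task. The `train` and `predict` functions and the telemetry tuples are hypothetical names for this illustration, not part of Scalable.

```python
from collections import defaultdict


def train(runs):
    """Training: scan the whole history once and keep one number per task."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for task, peak_gb in runs:
        totals[task] += peak_gb
        counts[task] += 1
    return {task: totals[task] / counts[task] for task in totals}


def predict(model, task, default_gb=4.0):
    """Inference: a dictionary lookup, cheap enough to call for every new run."""
    return model.get(task, default_gb)


# Made-up telemetry: (task name, peak memory in GB).
runs = [
    ("demeter", 10.0),
    ("demeter", 14.0),
    ("run_simulation", 2.0),
    ("run_simulation", 4.0),
]

model = train(runs)                 # done once, or again when new runs arrive
print(predict(model, "demeter"))    # 12.0
print(predict(model, "new_task"))   # 4.0 (never seen, so the default is used)
```

Real models take far longer to train than this one, but the shape is the same: an expensive step that reads all the history, followed by many cheap calls that only read the result.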
💡 Key Concept: Features
Features are the input variables that a model uses to make predictions. They’re the characteristics of your data that the model “looks at.”
For Scalable’s resource prediction:
Task name
Number of input data points
Historical average duration for this task type
Time of day
Target provider type
Feature engineering is the process of choosing and transforming raw data into useful features. Good features → good predictions.
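As a rough illustration of feature engineering, the sketch below turns one raw telemetry record into a list of numbers. The record fields, the task list, and the provider names are assumptions made for this example; Scalable's actual telemetry schema may differ.

```python
from datetime import datetime

# Hypothetical vocabularies; a real system would build these from its history.
KNOWN_TASKS = ["run_simulation", "run_demeter_scenario"]
KNOWN_PROVIDERS = ["local", "slurm"]


def one_hot(value, vocabulary):
    """Encode a category as 0/1 flags, one flag per known value."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def to_features(record):
    """Turn a raw record (strings, timestamps) into a numeric feature vector."""
    started = datetime.fromisoformat(record["started_at"])
    return (
        one_hot(record["task"], KNOWN_TASKS)
        + [float(record["num_inputs"]), record["history_avg_s"], float(started.hour)]
        + one_hot(record["provider"], KNOWN_PROVIDERS)
    )


record = {
    "task": "run_demeter_scenario",
    "num_inputs": 50,
    "history_avg_s": 45.0,
    "started_at": "2024-03-01T14:30:00",
    "provider": "local",
}
print(to_features(record))
# [0.0, 1.0, 50.0, 45.0, 14.0, 1.0, 0.0]
```

Models work on numbers, so names become 0/1 flags and the timestamp is reduced to the one part that is likely to matter here, the hour of day.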
💡 Key Concept: What is a Model?
In ML, a model is a mathematical function learned from data. It maps inputs (features) to outputs (predictions):
Model: features → prediction
Example: [task="demeter", scenarios=50, history_avg=45s] → memory=12GB
Think of a model as a function that was “written” by the training process rather than by a human programmer. The model doesn’t understand what it’s doing — it just captures statistical patterns in the training data.
Common model types:
Linear regression — simple, interpretable, assumes linear relationships
Decision tree — series of if/then rules learned from data
Random forest — many decision trees that vote on the answer
Gradient boosting — trees that correct each other’s mistakes
Scalable uses gradient boosting and random forests — they work well for tabular data (like telemetry metrics) without much tuning.
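If "trees that correct each other's mistakes" sounds abstract, the following teaching sketch may help. It implements a bare-bones gradient boosting loop for squared error, using one-split trees (stumps) on a single feature. The data is made up, and this is not how Scalable or scikit-learn implement it internally; in practice you would use a library class such as scikit-learn's GradientBoostingRegressor.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best reduces squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosting(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict_boosting(model, x):
    """Add up the base value and every stump's scaled correction."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


def mean_squared_error(model, xs, ys):
    return sum((predict_boosting(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up data: node count vs. peak memory in GB.
xs = [200, 400, 600, 800, 1000, 1200, 1400, 1600]
ys = [3.0, 4.0, 6.5, 8.0, 11.0, 12.5, 15.0, 17.5]

one_tree = fit_boosting(xs, ys, rounds=1)
many_trees = fit_boosting(xs, ys, rounds=50)
print("training error, 1 stump:  ", mean_squared_error(one_tree, xs, ys))
print("training error, 50 stumps:", mean_squared_error(many_trees, xs, ys))
```

Each round looks only at the residuals, the part of the answer the previous stumps still get wrong, which is why the training error shrinks as stumps are added. A low training error alone does not prove the model is good on new data.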
💡 Key Concept: Confidence
Confidence quantifies how sure a model is about a prediction.
Model prediction: memory = 12GB (confidence: 0.87)
High confidence → trust the prediction, use it directly
Low confidence → fall back to a safer default (the rule-based advisor)
Scalable uses confidence to decide whether to apply the ML recommendation or fall back to deterministic statistics. You never have to choose manually — the system picks whichever is more reliable for the current situation.
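The fallback decision described above can be sketched in a few lines. The `Recommendation` class, the `choose_recommendation` function, and the 0.7 threshold are all hypothetical; they illustrate the idea rather than Scalable's actual selection logic.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    memory_gb: float
    confidence: float
    source: str


def choose_recommendation(learned, baseline, min_confidence=0.7):
    """Use the learned prediction only when it clears the confidence threshold."""
    if learned is not None and learned.confidence >= min_confidence:
        return learned
    return baseline


baseline = Recommendation(memory_gb=16.0, confidence=1.0, source="rule-based")
confident = Recommendation(memory_gb=12.0, confidence=0.87, source="learned")
unsure = Recommendation(memory_gb=6.0, confidence=0.41, source="learned")

print(choose_recommendation(confident, baseline).source)  # learned
print(choose_recommendation(unsure, baseline).source)     # rule-based
print(choose_recommendation(None, baseline).source)       # rule-based (no model yet)
```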
💡 Key Concept: Cross-Validation
Cross-validation tests model quality by repeatedly splitting data into training and testing sets:
Split data into 5 parts (folds)
Train on 4 parts, test on 1
Repeat 5 times (each part is the test set once)
Average the test scores
This helps you detect overfitting, where a model memorizes the training data but fails on new data. Because every score comes from data the model did not train on, the average estimates how well the model will perform on data it hasn’t seen.
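The four steps can be written out directly. This sketch uses only the standard library, made-up memory values, and a trivial "predict the average" model so the splitting logic stays in focus. It assumes the runs are already in no particular order; with sorted data you would shuffle first. scikit-learn offers the same procedure through `cross_val_score`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is in a test set exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(xs, ys, fit, predict, k=5):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [abs(predict(model, xs[i]) - ys[i]) for i in test_idx]
        scores.append(sum(errors) / len(errors))  # mean absolute error for this fold
    return sum(scores) / len(scores), scores


def fit_mean(xs, ys):
    """Toy model: remember the average of the training targets."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    """Toy model: predict that average for every input."""
    return model


# Made-up peak memory (GB) for 10 past runs.
peak_gb = [12.0, 9.5, 14.0, 11.0, 10.5, 13.0, 12.5, 9.0, 11.5, 13.5]
inputs = list(range(len(peak_gb)))  # placeholder features; the mean model ignores them

average_error, fold_errors = cross_validate(inputs, peak_gb, fit_mean, predict_mean, k=5)
print(f"error per fold: {fold_errors}")
print(f"cross-validated error: {average_error:.2f} GB")
```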
Step 1: The ResourceAdvisor (Baseline, No ML)
Before ML, Scalable provides a deterministic, rule-based advisor:
from scalable import ResourceAdvisor
advisor = ResourceAdvisor.from_history("./.scalable/runs")
recommendation = advisor.recommend(task="run_simulation")
print(recommendation)
# {'cpus': 4, 'memory': '16G', 'basis': 'p95 of 50 historical runs'}
This uses simple statistics (percentiles) — it works but doesn’t learn complex patterns.
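To see what a percentile-based recommendation means, here is a small sketch that computes a 95th percentile with the nearest-rank method. The memory values are made up, and the exact percentile method ResourceAdvisor uses is not stated in this tutorial, so treat this as one reasonable interpretation.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest observed value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up peak memory (GB) observed in 10 past runs of one task.
peak_memory_gb = [9.5, 10.1, 10.4, 11.0, 11.2, 11.8, 12.0, 12.3, 12.9, 15.6]

print(percentile_nearest_rank(peak_memory_gb, 50))  # 11.2
print(percentile_nearest_rank(peak_memory_gb, 95))  # 15.6
```

Sizing for the 95th percentile rather than the median means almost every past run would have fit, at the cost of over-allocating for typical runs. It also treats every input the same, which is the limitation the LearnedAdvisor addresses.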
Step 2: The LearnedAdvisor (ML-Powered)
The LearnedAdvisor uses machine learning on your telemetry history:
from scalable import LearnedAdvisor

# Train on historical telemetry
advisor = LearnedAdvisor.from_history(
    "./.scalable/runs",
    model_type="gradient_boosting",  # Algorithm choice
)

# Predict resources for a new run
recommendation = advisor.recommend(
    task="run_simulation",
    input_features={"num_nodes": 1200, "scenario_type": "peak_demand"},
)
print(recommendation)
# {'cpus': 2, 'memory': '8G', 'confidence': 0.87}
What’s happening here
from_history() loads telemetry data from past runs
It extracts features (task names, durations, resource usage)
It trains a gradient boosting model to predict resource needs
recommend() uses the trained model to predict for new inputs
The confidence: 0.87 means the model is 87% confident in this
prediction. High confidence → the prediction is likely accurate.
Step 3: The AdaptiveScaler
The AdaptiveScaler uses ML predictions to decide scaling in real-time:
from scalable import AdaptiveScaler

scaler = AdaptiveScaler(
    min_workers=2,
    max_workers=20,
    scale_up_threshold=0.8,    # Scale up when 80% busy
    scale_down_threshold=0.3,  # Scale down when 30% busy
    cooldown_seconds=60,       # Wait 60s between scaling decisions
)
How adaptive scaling works with ML
Without ML: scale based on simple thresholds (queue depth > N → add workers)
With ML: predict future load based on patterns. If the model predicts a burst of heavy tasks coming, scale up BEFORE the queue fills. This reduces latency because workers are already ready when tasks arrive.
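The difference between the two modes can be sketched as a single decision function. The parameter names mirror the AdaptiveScaler arguments shown above, but the function itself, the one-worker step size, and the way a predicted load is combined with the current load are assumptions made for illustration.

```python
def next_worker_count(current, utilization, seconds_since_last_change,
                      min_workers=2, max_workers=20,
                      scale_up_threshold=0.8, scale_down_threshold=0.3,
                      cooldown_seconds=60, predicted_utilization=None):
    """Return the worker count to use next (hypothetical decision logic)."""
    # Respect the cooldown so the cluster does not flap up and down.
    if seconds_since_last_change < cooldown_seconds:
        return current

    # Without a prediction, react to the current load only.
    # With one, act on whichever is higher, so a forecast burst triggers
    # scale-up early and scale-down needs both values to be low.
    signal = utilization
    if predicted_utilization is not None:
        signal = max(utilization, predicted_utilization)

    if signal > scale_up_threshold:
        target = current + 1
    elif signal < scale_down_threshold:
        target = current - 1
    else:
        target = current

    # Hard floor and ceiling bound every decision.
    return max(min_workers, min(max_workers, target))


print(next_worker_count(4, 0.5, 120))                             # 4: load is moderate
print(next_worker_count(4, 0.5, 120, predicted_utilization=0.9))  # 5: burst expected
print(next_worker_count(4, 0.9, 10))                              # 4: still cooling down
print(next_worker_count(20, 0.95, 120))                           # 20: capped at max_workers
```

The second call is the interesting one: current load alone would not justify another worker, but the forecast does, so the worker is already running when the burst arrives.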
Step 4: Putting It All Together
A workflow using ML-informed advising and adaptive scaling:
from scalable import (
    ScalableSession, LearnedAdvisor, AdaptiveScaler
)

# 1. ML-informed resource allocation
advisor = LearnedAdvisor.from_history("./.scalable/runs")
recommendation = advisor.recommend(task="run_demeter_scenario")
print(f"Recommended: {recommendation.resources} (confidence={recommendation.confidence})")

# 2. Adaptive scaler driven by the ML advisor
scaler = AdaptiveScaler(
    advisor=advisor,
    min_workers={"demeter": 2},
    max_workers={"demeter": 20},
)

# 3. Run with ML-optimized resources
# (run_demeter_scenario is your own task function, assumed to be defined earlier)
session = ScalableSession.from_yaml("./scalable.yaml", target="local")
client = session.start()
futures = [client.submit(run_demeter_scenario, i, tag="demeter")
           for i in range(100)]
results = client.gather(futures)

# The advisor informs initial sizing; the scaler adjusts in real time
# Telemetry records every scaling decision and its reasoning
Common Questions
Q: Do I need ML expertise to use these features?
No. Scalable provides sensible defaults. You just need:
Enough telemetry history (5+ runs for the advisor)
The [ml] extra installed
The system handles model selection, training, and evaluation.
Q: How much data do I need for the LearnedAdvisor?
Rule of thumb:
5 runs → basic predictions (limited accuracy)
20+ runs → reliable predictions
100+ runs → high accuracy with confidence intervals
More data = better predictions. The system falls back to the rule-based advisor when insufficient data exists.
Q: When should I retrain the advisor?
Retrain after significant changes — new task types, new hardware, or large
shifts in input characteristics. LearnedAdvisor.from_history() simply
re-reads the telemetry directory, so retraining is just calling that function
again.
Q: What if the ML model gives a bad recommendation?
The recommendation includes a confidence score. Low confidence
automatically falls back to the deterministic ResourceAdvisor. You can
also set hard ceilings via max_workers to bound any prediction.
What You Learned
| Term | Definition |
|---|---|
| Machine Learning | Teaching computers to find patterns and make predictions from data |
| Training | Learning phase where model adjusts to fit historical data |
| Inference | Prediction phase using a trained model on new inputs |
| Features | Input variables the model uses for predictions |
| Model | Mathematical function learned from data (inputs → predictions) |
| Confidence | How sure the model is in a particular prediction |
| Cross-Validation | Testing model quality by splitting data into train/test sets |
| Gradient Boosting | ML algorithm using sequential corrective decision trees |
| Adaptive Scaling | Adjusting worker counts in real time based on workload signals |
Next Steps
You now understand how ML enhances workflow optimization through learned resource advising and adaptive scaling.
Next beginner tutorial: Beginner Tutorial 10: AI-Assisted Workflow Development — using AI assistants for workflow development
Standard tutorial: Tutorial 9: ML-Driven Resource Advising and Scaling — advanced ML patterns and hyperparameter tuning
Try it: Run your workflow 5+ times with different inputs. Then use LearnedAdvisor.from_history() to see what it recommends. Compare the ML recommendation to your current resource allocation.