Beginner Tutorial 9: Machine Learning for Smarter Workflows

The Big Picture

After running your workflow many times, you’ve accumulated telemetry data showing how tasks perform: which scenarios are fast, which are slow, how much memory different inputs require. What if a computer could learn these patterns and predict optimal resource allocations automatically?

This tutorial introduces machine learning concepts in the context of workflow optimization: using past experience to make smarter decisions about how many workers to start, how much memory to request, and when to scale up or down.

What You Will Learn

By the end of this tutorial you will:

Understand what machine learning is at a high level.
Know the difference between training and inference.
Understand how Scalable’s LearnedAdvisor predicts resource needs.
Understand cross-validation and why it matters.
Use the AdaptiveScaler for real-time scaling decisions backed by ML.

Prerequisites

Completed Beginner Tutorial 1: Your First Workflow, Beginner Tutorial 3: How Distributed Computing Works, and Beginner Tutorial 6: Understanding What Happened.
pip install "scalable[ml]" (installs scikit-learn, dask-ml; the quotes keep shells such as zsh from interpreting the brackets).
At least 5 completed telemetry runs (more history → better predictions).

Key Concepts Explained

💡 Key Concept: What is Machine Learning?

Machine learning (ML) is teaching computers to find patterns in data and make predictions without being explicitly programmed with rules.

Traditional programming:

Human writes rules → computer follows rules

IF memory_usage > 8GB THEN allocate 16GB
IF memory_usage > 16GB THEN allocate 32GB

Machine learning:

Computer finds rules from data → uses them to predict

Training data: [past runs with memory usage patterns]
ML model learns: "scenarios with >1000 nodes need ~12GB"
Prediction: "scenario 47 (1200 nodes) → recommend 16GB (~12GB expected, plus headroom)"

Analogy: A traditional program is like a recipe (follow these steps). ML is like learning to cook from experience (after cooking 100 dishes, you develop intuition about seasoning, timing, etc.).
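The contrast above can be sketched in a few lines of plain Python. This is only an illustration, not how Scalable works internally: the history values are made up, and the function names (rule_based_memory_gb, fit_line, learned_memory_gb) are hypothetical. The "learning" here is a simple least-squares straight line.

```python
from statistics import mean

# Traditional programming: a human picks the threshold and the answer.
def rule_based_memory_gb(num_nodes):
    return 16.0 if num_nodes > 1000 else 8.0

# Made-up history: (number of nodes, peak memory observed in GB).
history = [(200, 3.0), (400, 5.0), (600, 7.0), (800, 9.0), (1000, 11.0)]

def fit_line(xs, ys):
    """Least-squares fit of ys ~ slope * xs + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

# Machine learning: the rule (slope and intercept) comes from the data.
slope, intercept = fit_line([n for n, _ in history], [gb for _, gb in history])

def learned_memory_gb(num_nodes):
    return slope * num_nodes + intercept

print(rule_based_memory_gb(1200))          # 16.0
print(round(learned_memory_gb(1200), 1))   # 13.0
```

If the history changes, the learned line changes with it, while the hand-written rule stays the same until someone edits it.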

💡 Key Concept: Training vs. Inference

ML has two phases:

Training (learning):

Feed historical data to an algorithm. The algorithm adjusts its internal parameters to fit the patterns in the data.

Slow (minutes to hours)
Done once (or periodically when new data is available)
Requires labeled data (inputs + known correct outputs) for supervised learning, the kind used in this tutorial

Inference (predicting):

Use the trained model to make predictions on new inputs.

Fast (milliseconds)
Done many times
Uses the patterns learned during training

In Scalable:

Training = learning from telemetry history (past run metrics)
Inference = predicting resource needs for new runs
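To make the two phases concrete, here is a toy model with a separate fit step (training) and predict step (inference). The class PerTaskAverage and the run data are hypothetical and far simpler than anything Scalable trains; the point is only that training scans the history once, and inference reuses what was stored.

```python
from collections import defaultdict
from statistics import mean

class PerTaskAverage:
    """Toy model: predicts the average peak memory seen for each task."""

    def fit(self, runs):
        # Training: scan the whole history once and keep the learned values.
        grouped = defaultdict(list)
        for task, peak_gb in runs:
            grouped[task].append(peak_gb)
        self.averages_ = {task: mean(vals) for task, vals in grouped.items()}
        self.default_ = mean(gb for _, gb in runs)
        return self

    def predict(self, task):
        # Inference: a quick lookup; the history is no longer needed.
        return self.averages_.get(task, self.default_)

# Made-up telemetry: (task name, peak memory in GB).
runs = [("preprocess", 2.0), ("preprocess", 4.0),
        ("simulate", 10.0), ("simulate", 14.0)]

model = PerTaskAverage().fit(runs)   # done once
print(model.predict("simulate"))     # done many times
```

A task the model has never seen gets the overall average, which is one simple way to handle missing history.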

💡 Key Concept: Features

Features are the input variables that a model uses to make predictions. They’re the characteristics of your data that the model “looks at.”

For Scalable’s resource prediction:

Task name
Number of input data points
Historical average duration for this task type
Time of day
Target provider type

Feature engineering is the process of choosing and transforming raw data into useful features. Good features → good predictions.
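As a rough sketch of feature engineering, the function below turns one telemetry record into a list of numbers a model could consume. The record fields, the provider names, and the function to_features are all assumptions made for this example; Scalable's actual feature set may differ.

```python
import math

# Hypothetical provider names, used only for this illustration.
PROVIDERS = ["local", "slurm", "cloud"]

def to_features(record):
    """Convert a raw telemetry record (a dict) into a numeric feature list."""
    # Input sizes often span orders of magnitude, so use a log scale.
    log_points = math.log10(record["num_points"])
    # Encode hour of day on a circle so that 23:00 and 00:00 end up close.
    angle = 2 * math.pi * record["hour"] / 24
    hour_sin, hour_cos = math.sin(angle), math.cos(angle)
    # One-hot encode the provider: one 0/1 column per known provider.
    one_hot = [1.0 if record["provider"] == p else 0.0 for p in PROVIDERS]
    return [log_points, record["avg_duration_s"], hour_sin, hour_cos] + one_hot

record = {"num_points": 1000, "avg_duration_s": 45.0, "hour": 6, "provider": "slurm"}
features = to_features(record)
print(features)
```

Text values such as the task name or provider have to become numbers somehow; one-hot encoding is a common choice when there are only a few distinct values.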

💡 Key Concept: What is a Model?

In ML, a model is a mathematical function learned from data. It maps inputs (features) to outputs (predictions):

Model: features → prediction
Example: [task="demeter", scenarios=50, history_avg=45s] → memory=12GB

Think of a model as a function that was “written” by the training process rather than by a human programmer. The model doesn’t understand what it’s doing — it just captures statistical patterns in the training data.

Common model types:

Linear regression — simple, interpretable, assumes linear relationships
Decision tree — series of if/then rules learned from data
Random forest — many decision trees that vote on the answer
Gradient boosting — trees that correct each other’s mistakes

Scalable uses gradient boosting and random forests — they work well for tabular data (like telemetry metrics) without much tuning.
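To see what "if/then rules learned from data" means, here is a minimal sketch of a decision stump: a tree with a single split. It is a teaching toy with made-up data and hypothetical function names (fit_stump, predict_stump), not the algorithm Scalable ships. Random forests and gradient boosting combine many deeper trees built on the same idea.

```python
from statistics import mean

def fit_stump(xs, ys):
    """Find the threshold on x that splits ys into two groups with the
    lowest total squared error, and return it with each group's mean."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Made-up history: node counts and peak memory in GB.
nodes = [100, 200, 300, 1100, 1200, 1300]
peak_gb = [4.0, 4.0, 5.0, 12.0, 12.0, 13.0]

stump = fit_stump(nodes, peak_gb)
print(f"IF nodes <= {stump[0]} THEN {stump[1]:.1f} GB ELSE {stump[2]:.1f} GB")
```

The printed rule looks like one a human might write, but the threshold and both memory values were chosen by the data.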

💡 Key Concept: Confidence

Confidence quantifies how sure a model is about a prediction.

Model prediction: memory = 12GB (confidence: 0.87)

High confidence → trust the prediction, use it directly
Low confidence → fall back to a safer default (the rule-based advisor)

Scalable uses confidence to decide whether to apply the ML recommendation or fall back to deterministic statistics. You never have to choose manually — the system picks whichever is more reliable for the current situation.
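The fallback logic can be pictured roughly like this. The cutoff value of 0.7, the dictionary layout, and the function choose_recommendation are assumptions for illustration; the tutorial does not state which threshold Scalable uses internally.

```python
MIN_CONFIDENCE = 0.7  # assumed cutoff, chosen only for this example

def choose_recommendation(ml_rec, rule_rec, min_confidence=MIN_CONFIDENCE):
    """Use the ML recommendation when it is confident enough,
    otherwise fall back to the rule-based one."""
    if ml_rec is not None and ml_rec.get("confidence", 0.0) >= min_confidence:
        return {**ml_rec, "source": "ml"}
    return {**rule_rec, "source": "rules"}

rule_rec = {"cpus": 4, "memory": "16G"}
confident = {"cpus": 2, "memory": "8G", "confidence": 0.87}
unsure = {"cpus": 1, "memory": "2G", "confidence": 0.41}

print(choose_recommendation(confident, rule_rec)["source"])  # ml
print(choose_recommendation(unsure, rule_rec)["source"])     # rules
print(choose_recommendation(None, rule_rec)["source"])       # rules
```

Passing None covers the case where no model could be trained at all, for example when there is too little history.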

💡 Key Concept: Cross-Validation

Cross-validation tests model quality by repeatedly splitting data into training and testing sets:

Split data into 5 parts (folds)
Train on 4 parts, test on 1
Repeat 5 times (each part is the test set once)
Average the test scores

This helps detect overfitting: a model that memorizes the training data but fails on new data. Because every test score comes from data the model was not trained on, cross-validation estimates how well the model will perform on data it hasn’t seen.
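The four steps can be written out by hand in plain Python. This is a sketch under simplifying assumptions: the "model" just predicts the mean of its training data, the data is made up, and the helper names (k_fold_indices, cross_val_mae) are hypothetical. In practice a library such as scikit-learn handles the splitting, and the data is usually shuffled first.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n))
    for fold in range(k):
        test = indices[fold::k]          # every k-th item, starting at `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def cross_val_mae(ys, k=5):
    """Average test error (mean absolute error) over k folds for a toy
    model that always predicts the mean of its training data."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        prediction = mean(ys[i] for i in train)                      # train on k-1 parts
        scores.append(mean(abs(ys[i] - prediction) for i in test))   # test on 1 part
    return mean(scores)                                              # average the scores

# Made-up peak memory values (GB) from ten past runs.
peak_gb = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 11.0, 12.0, 13.0, 12.0]
print(f"Estimated error on unseen runs: {cross_val_mae(peak_gb):.2f} GB")
```

The key property is that the test indices never appear in the matching training set, so each score reflects data the model did not see.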

Step 1: The ResourceAdvisor (Baseline — No ML)

Before ML, Scalable provides a deterministic, rule-based advisor:

from scalable import ResourceAdvisor

advisor = ResourceAdvisor.from_history("./.scalable/runs")
recommendation = advisor.recommend(task="run_simulation")
print(recommendation)
# {'cpus': 4, 'memory': '16G', 'basis': 'p95 of 50 historical runs'}

This uses simple statistics (percentiles) — it works but doesn’t learn complex patterns.
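To show what a percentile-based recommendation involves, here is a small sketch using Python's standard statistics module. It is not ResourceAdvisor's actual implementation: the helper names (p95, recommend_memory_gb), the output fields, and the history values are assumptions for illustration.

```python
from statistics import quantiles

def p95(values):
    # quantiles(n=100) returns the 99 cut points between percentiles;
    # index 94 is the 95th percentile.
    return quantiles(values, n=100, method="inclusive")[94]

def recommend_memory_gb(peak_gb_history):
    """Recommend enough memory to cover 95% of past runs."""
    return {
        "memory_gb": p95(peak_gb_history),
        "basis": f"p95 of {len(peak_gb_history)} historical runs",
    }

# Made-up history: peak memory (GB) for 100 runs, from 1 GB to 100 GB.
peak_gb_history = [float(gb) for gb in range(1, 101)]
recommendation = recommend_memory_gb(peak_gb_history)
print(recommendation)
```

Using the 95th percentile instead of the maximum keeps one unusually heavy run from inflating every future request.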

Step 2: The LearnedAdvisor (ML-Powered)

The LearnedAdvisor uses machine learning on your telemetry history:

from scalable import LearnedAdvisor

# Train on historical telemetry
advisor = LearnedAdvisor.from_history(
    "./.scalable/runs",
    model_type="gradient_boosting",   # Algorithm choice
)

# Predict resources for a new run
recommendation = advisor.recommend(
    task="run_simulation",
    input_features={"num_nodes": 1200, "scenario_type": "peak_demand"},
)
print(recommendation)
# {'cpus': 2, 'memory': '8G', 'confidence': 0.87}

What’s happening here

from_history() loads telemetry data from past runs
It extracts features (task names, durations, resource usage)
It trains a gradient boosting model to predict resource needs
recommend() uses the trained model to predict for new inputs

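The confidence value of 0.87 is the advisor’s own score (here on a scale from 0 to 1) for how reliable it considers this prediction. Higher values indicate that the prediction is more likely to be accurate; treat the number as a guide rather than a guarantee.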

Step 3: The AdaptiveScaler

The AdaptiveScaler makes scaling decisions in real time. The example below configures its thresholds; Step 4 shows how to connect it to the ML advisor:

from scalable import AdaptiveScaler

scaler = AdaptiveScaler(
    min_workers=2,
    max_workers=20,
    scale_up_threshold=0.8,    # Scale up when workers are more than 80% busy
    scale_down_threshold=0.3,  # Scale down when workers are less than 30% busy
    cooldown_seconds=60,       # Wait 60s between scaling decisions
)

How adaptive scaling works with ML

Without ML: scale based on simple thresholds (queue depth > N → add workers)

With ML: predict future load based on patterns. If the model predicts a burst of heavy tasks coming, scale up BEFORE the queue fills. This reduces latency because workers are already ready when tasks arrive.
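The difference between the two modes can be sketched as a single decision function. This is a simplified illustration, not AdaptiveScaler's real logic: the function next_worker_count, the doubling and halving steps, and the predicted_utilization argument are assumptions, and only the threshold and cooldown names mirror the settings shown in this step.

```python
def next_worker_count(current, utilization, seconds_since_last_change,
                      min_workers=2, max_workers=20,
                      scale_up_threshold=0.8, scale_down_threshold=0.3,
                      cooldown_seconds=60, predicted_utilization=None):
    """Return the worker count to use next.

    utilization and predicted_utilization are fractions between 0 and 1.
    """
    # Respect the cooldown so the cluster does not flip back and forth.
    if seconds_since_last_change < cooldown_seconds:
        return current

    # With a forecast, react to whichever is higher: current or predicted load.
    load = utilization
    if predicted_utilization is not None:
        load = max(utilization, predicted_utilization)

    if load > scale_up_threshold:
        target = current * 2      # assumed step size: double
    elif load < scale_down_threshold:
        target = current // 2     # assumed step size: halve
    else:
        target = current

    # Never leave the configured bounds.
    return max(min_workers, min(max_workers, target))

# Thresholds only: 50% busy, so nothing changes.
print(next_worker_count(4, 0.5, 120))                               # 4
# Same moment, but a forecast of a heavy burst triggers an early scale-up.
print(next_worker_count(4, 0.5, 120, predicted_utilization=0.95))   # 8
```

Taking the higher of the observed and predicted load means a forecast can only trigger an earlier scale-up; workers are removed only when both values are low.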

Step 4: Putting It All Together

A workflow using ML-informed advising and adaptive scaling:

from scalable import (
    ScalableSession, LearnedAdvisor, AdaptiveScaler
)

# 1. ML-informed resource allocation
advisor = LearnedAdvisor.from_history("./.scalable/runs")
recommendation = advisor.recommend(task="run_demeter_scenario")
print(f"Recommended: {recommendation.resources} (confidence={recommendation.confidence})")

# 2. Adaptive scaler driven by the ML advisor (limits here are keyed by task tag)
scaler = AdaptiveScaler(
    advisor=advisor,
    min_workers={"demeter": 2},
    max_workers={"demeter": 20},
)

# 3. Run with ML-optimized resources
session = ScalableSession.from_yaml("./scalable.yaml", target="local")
client = session.start()

# run_demeter_scenario is your own task function, defined or imported elsewhere
futures = [client.submit(run_demeter_scenario, i, tag="demeter")
           for i in range(100)]
results = client.gather(futures)

# The advisor informs initial sizing; the scaler adjusts in real time
# Telemetry records every scaling decision and its reasoning

Common Questions

Q: Do I need ML expertise to use these features?

No. Scalable provides sensible defaults. You just need:

Enough telemetry history (5+ runs for the advisor)
The [ml] extra installed

The system handles model selection, training, and evaluation.

Q: How much data do I need for the LearnedAdvisor?

Rule of thumb:

5 runs → basic predictions (limited accuracy)
20+ runs → reliable predictions
100+ runs → high accuracy with confidence intervals

More data generally means better predictions. When there is too little history, the system falls back to the rule-based advisor.

Q: When should I retrain the advisor?

Retrain after significant changes — new task types, new hardware, or large shifts in input characteristics. LearnedAdvisor.from_history() simply re-reads the telemetry directory, so retraining is just calling that function again.

Q: What if the ML model gives a bad recommendation?

The recommendation includes a confidence score. When confidence is low, Scalable automatically falls back to the deterministic ResourceAdvisor. You can also set a hard ceiling via max_workers so that no prediction can scale the cluster beyond it.

What You Learned

Term	Definition
Machine Learning	Teaching computers to find patterns and make predictions from data
Training	Learning phase where model adjusts to fit historical data
Inference	Prediction phase using a trained model on new inputs
Features	Input variables the model uses for predictions
Model	Mathematical function learned from data (inputs → predictions)
Confidence	How sure the model is in a particular prediction
Cross-Validation	Testing model quality by splitting data into train/test sets
Gradient Boosting	ML algorithm using sequential corrective decision trees
Adaptive Scaling	Adjusting worker counts in real time based on workload signals

Next Steps

You now understand how ML enhances workflow optimization through learned resource advising and adaptive scaling.

Next beginner tutorial: Beginner Tutorial 10: AI-Assisted Workflow Development — using AI assistants for workflow development
Standard tutorial: Tutorial 9: ML-Driven Resource Advising and Scaling — advanced ML patterns and hyperparameter tuning
Try it: Run your workflow 5+ times with different inputs. Then use LearnedAdvisor.from_history() to see what it recommends. Compare the ML recommendation to your current resource allocation.