.. _beginner_ml_emulation:

============================================================
Beginner Tutorial 9: Machine Learning for Smarter Workflows
============================================================

The Big Picture
----------------

After running your workflow many times, you've accumulated telemetry data
showing how tasks perform: which scenarios are fast, which are slow, how much
memory different inputs require. What if a computer could learn these patterns
and predict optimal resource allocations automatically?

This tutorial introduces **machine learning** concepts in the context of
workflow optimization: using past experience to make smarter decisions about
how many workers to start, how much memory to request, and when to scale up or
down.

What You Will Learn
--------------------

By the end of this tutorial you will:

* Understand what machine learning is at a high level.
* Know the difference between training and inference.
* Understand how Scalable's LearnedAdvisor predicts resource needs.
* Understand cross-validation and why it matters.
* Use the AdaptiveScaler for real-time scaling decisions backed by ML.

Prerequisites
--------------

* Completed :ref:`beginner_getting_started`, :ref:`beginner_telemetry`, and
  :ref:`beginner_scaling_strategies`.
* ``pip install scalable[ml]`` (installs scikit-learn, dask-ml).
* At least 5 completed telemetry runs (more history → better predictions).

Key Concepts Explained
-----------------------

.. admonition:: 💡 Key Concept: What is Machine Learning?
   :class: tip

   **Machine learning (ML)** is teaching computers to find patterns in data and
   make predictions without being explicitly programmed with rules.

   **Traditional programming:** Human writes rules → computer follows rules

   .. code-block:: text

      IF memory_usage > 8GB THEN allocate 16GB
      IF memory_usage > 16GB THEN allocate 32GB

   **Machine learning:** Computer finds rules from data → uses them to predict

   .. code-block:: text

      Training data: [past runs with memory usage patterns]
      ML model learns: "scenarios with >1000 nodes need ~12GB"
      Prediction: "scenario 47 (1200 nodes) → recommend 16GB"

   **Analogy:** A traditional program is like a recipe (follow these steps).
   ML is like learning to cook from experience (after cooking 100 dishes, you
   develop intuition about seasoning, timing, etc.).

.. admonition:: 💡 Key Concept: Training vs. Inference
   :class: tip

   ML has two phases:

   **Training** (learning): Feed historical data to an algorithm. The algorithm
   adjusts its internal parameters to fit the patterns in the data.

   * Slow (minutes to hours)
   * Done once (or periodically when new data is available)
   * Requires labeled data (inputs + known correct outputs)

   **Inference** (predicting): Use the trained model to make predictions on new
   inputs.

   * Fast (milliseconds)
   * Done many times
   * Uses the patterns learned during training

   **In Scalable:**

   * **Training** = learning from telemetry history (past run metrics)
   * **Inference** = predicting resource needs for new runs

.. admonition:: 💡 Key Concept: Features
   :class: tip

   **Features** are the input variables that a model uses to make predictions.
   They're the characteristics of your data that the model "looks at."

   For Scalable's resource prediction:

   * Task name
   * Number of input data points
   * Historical average duration for this task type
   * Time of day
   * Target provider type

   **Feature engineering** is the process of choosing and transforming raw data
   into useful features. Good features → good predictions.

.. admonition:: 💡 Key Concept: What is a Model?
   :class: tip

   In ML, a **model** is a mathematical function learned from data. It maps
   inputs (features) to outputs (predictions):

   .. code-block:: text

      Model: features → prediction

      Example: [task="demeter", scenarios=50, history_avg=45s] → memory=12GB

   Think of a model as a function that was "written" by the training process
   rather than by a human programmer.
   The model doesn't understand what it's doing; it just captures statistical
   patterns in the training data.

   Common model types:

   * **Linear regression**: simple, interpretable, assumes linear relationships
   * **Decision tree**: a series of if/then rules learned from data
   * **Random forest**: many decision trees that vote on the answer
   * **Gradient boosting**: trees that correct each other's mistakes

   Scalable uses gradient boosting and random forests because they work well
   for tabular data (like telemetry metrics) without much tuning.

.. admonition:: 💡 Key Concept: Confidence
   :class: tip

   **Confidence** quantifies how sure a model is about a prediction.

   .. code-block:: text

      Model prediction: memory = 12GB (confidence: 0.87)

   * High confidence → trust the prediction, use it directly
   * Low confidence → fall back to a safer default (the rule-based advisor)

   Scalable uses confidence to decide whether to apply the ML recommendation or
   fall back to deterministic statistics. You never have to choose manually:
   the system picks whichever is more reliable for the current situation.

.. admonition:: 💡 Key Concept: Cross-Validation
   :class: tip

   **Cross-validation** tests model quality by repeatedly splitting data into
   training and testing sets:

   1. Split data into 5 parts (folds)
   2. Train on 4 parts, test on 1
   3. Repeat 5 times (each part is the test set once)
   4. Average the test scores

   This helps detect **overfitting**: a model that memorizes the training data
   but fails on new data. Cross-validation estimates how well the model will
   perform on data it hasn't seen.

Step 1: The ResourceAdvisor (Baseline, No ML)
-------------------------------------------------

Before ML, Scalable provides a deterministic, rule-based advisor:

.. code-block:: python

   from scalable import ResourceAdvisor

   advisor = ResourceAdvisor.from_history("./.scalable/runs")
   recommendation = advisor.recommend(task="run_simulation")
   print(recommendation)
   # {'cpus': 4, 'memory': '16G', 'basis': 'p95 of 50 historical runs'}

This uses simple statistics (percentiles). It works, but it doesn't learn
complex patterns.

Step 2: The LearnedAdvisor (ML-Powered)
------------------------------------------

The LearnedAdvisor uses machine learning on your telemetry history:

.. code-block:: python

   from scalable import LearnedAdvisor

   # Train on historical telemetry
   advisor = LearnedAdvisor.from_history(
       "./.scalable/runs",
       model_type="gradient_boosting",  # Algorithm choice
   )

   # Predict resources for a new run
   recommendation = advisor.recommend(
       task="run_simulation",
       input_features={"num_nodes": 1200, "scenario_type": "peak_demand"},
   )
   print(recommendation)
   # {'cpus': 2, 'memory': '8G', 'confidence': 0.87}

.. admonition:: What's happening here
   :class: note

   1. ``from_history()`` loads telemetry data from past runs
   2. It extracts features (task names, durations, resource usage)
   3. It trains a gradient boosting model to predict resource needs
   4. ``recommend()`` uses the trained model to predict for new inputs

   The ``confidence: 0.87`` means the model is 87% confident in this
   prediction. High confidence → the prediction is likely accurate.

Step 3: The AdaptiveScaler
----------------------------

The AdaptiveScaler uses ML predictions to make scaling decisions in real time:

.. code-block:: python

   from scalable import AdaptiveScaler

   scaler = AdaptiveScaler(
       min_workers=2,
       max_workers=20,
       scale_up_threshold=0.8,    # Scale up when workers are more than 80% busy
       scale_down_threshold=0.3,  # Scale down when workers are less than 30% busy
       cooldown_seconds=60,       # Wait 60s between scaling decisions
   )

.. admonition:: How adaptive scaling works with ML
   :class: note

   Without ML: scale based on simple thresholds (queue depth > N → add workers)

   With ML: predict future load based on patterns.
   If the model predicts that a burst of heavy tasks is coming, scale up BEFORE
   the queue fills. This reduces latency because workers are ready by the time
   tasks arrive.

Step 4: Putting It All Together
---------------------------------

A workflow using ML-informed advising and adaptive scaling:

.. code-block:: python

   from scalable import (
       ScalableSession, LearnedAdvisor, AdaptiveScaler
   )

   # run_demeter_scenario is your own task function, defined or imported elsewhere.

   # 1. ML-informed resource allocation
   advisor = LearnedAdvisor.from_history("./.scalable/runs")
   recommendation = advisor.recommend(task="run_demeter_scenario")
   print(f"Recommended: {recommendation.resources} (confidence={recommendation.confidence})")

   # 2. Adaptive scaler driven by the ML advisor
   scaler = AdaptiveScaler(
       advisor=advisor,
       min_workers={"demeter": 2},
       max_workers={"demeter": 20},
   )

   # 3. Run with ML-optimized resources
   session = ScalableSession.from_yaml("./scalable.yaml", target="local")
   client = session.start()

   futures = [client.submit(run_demeter_scenario, i, tag="demeter") for i in range(100)]
   results = client.gather(futures)

   # The advisor informs initial sizing; the scaler adjusts in real time
   # Telemetry records every scaling decision and its reasoning

Common Questions
-----------------

**Q: Do I need ML expertise to use these features?**

No. Scalable provides sensible defaults. You just need:

* Enough telemetry history (5+ runs for the advisor)
* The ``[ml]`` extra installed

The system handles model selection, training, and evaluation.

**Q: How much data do I need for the LearnedAdvisor?**

Rule of thumb:

* 5 runs → basic predictions (limited accuracy)
* 20+ runs → reliable predictions
* 100+ runs → high accuracy with confidence intervals

More data = better predictions. The system falls back to the rule-based advisor
when there is not enough data.

**Q: When should I retrain the advisor?**

Retrain after significant changes: new task types, new hardware, or large
shifts in input characteristics.
``LearnedAdvisor.from_history()`` simply re-reads the telemetry directory, so
retraining is just calling that function again.

**Q: What if the ML model gives a bad recommendation?**

Every recommendation includes a ``confidence`` score. A low-confidence
prediction automatically falls back to the deterministic ``ResourceAdvisor``.
You can also set hard ceilings via ``max_workers`` to bound any prediction.

What You Learned
-----------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Term
     - Definition
   * - Machine Learning
     - Teaching computers to find patterns and make predictions from data
   * - Training
     - Learning phase where the model adjusts to fit historical data
   * - Inference
     - Prediction phase using a trained model on new inputs
   * - Features
     - Input variables the model uses for predictions
   * - Model
     - Mathematical function learned from data (inputs → predictions)
   * - Confidence
     - How sure the model is about a particular prediction
   * - Cross-Validation
     - Testing model quality by repeatedly splitting data into train/test sets
   * - Gradient Boosting
     - ML algorithm using sequential corrective decision trees
   * - Adaptive Scaling
     - Adjusting worker counts in real time based on workload signals

Next Steps
-----------

You now understand how ML enhances workflow optimization through learned
resource advising and adaptive scaling.

* **Next beginner tutorial:** :ref:`beginner_ai_composition` (using AI
  assistants for workflow development)
* **Standard tutorial:** :ref:`tutorial_ml_advanced` (advanced ML patterns and
  hyperparameter tuning)
* **Try it:** Run your workflow 5+ times with different inputs. Then use
  ``LearnedAdvisor.from_history()`` to see what it recommends. Compare the ML
  recommendation to your current resource allocation.
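
To see the 5-fold cross-validation procedure from the Key Concepts section in
action before trying it on your own telemetry, here is a minimal sketch that
uses only the Python standard library. It is an illustration under stated
assumptions, not Scalable's implementation: the ``HISTORY`` values are made-up
sample data, a simple straight-line fit stands in for the real model, and
``fit_line``, ``predict``, and ``k_fold_mae`` are hypothetical helper names.

.. code-block:: python

   from statistics import mean

   # Made-up telemetry for illustration: (num_nodes, peak_memory_gb) per past run.
   HISTORY = [
       (100, 2.1), (200, 2.9), (300, 4.2), (400, 4.8), (500, 6.1),
       (600, 7.0), (700, 7.9), (800, 9.2), (900, 9.8), (1000, 11.1),
   ]

   def fit_line(points):
       """Training: least-squares fit of memory = slope * nodes + intercept."""
       xs = [x for x, _ in points]
       ys = [y for _, y in points]
       x_bar, y_bar = mean(xs), mean(ys)
       sxx = sum((x - x_bar) ** 2 for x in xs)
       sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
       slope = sxy / sxx
       return slope, y_bar - slope * x_bar

   def predict(model, x):
       """Inference: apply the fitted line to a new input."""
       slope, intercept = model
       return slope * x + intercept

   def k_fold_mae(points, k=5):
       """Average the mean absolute error over k held-out folds."""
       folds = [points[i::k] for i in range(k)]  # round-robin split into k parts
       scores = []
       for i, test in enumerate(folds):
           # Train on every fold except the one held out for testing.
           train = [p for j, fold in enumerate(folds) if j != i for p in fold]
           model = fit_line(train)
           scores.append(mean(abs(predict(model, x) - y) for x, y in test))
       return mean(scores)

   score = k_fold_mae(HISTORY, k=5)
   print(f"5-fold mean absolute error: {score:.2f} GB")

A small averaged error on the held-out folds suggests the fitted pattern
carries over to runs the model has not seen. An error much larger than the
error on the training data is the usual sign of overfitting.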