.. _tutorial_ml_advanced:

======================================================
Tutorial 9: ML-Driven Resource Advising and Scaling
======================================================

What You Will Learn
-------------------

By the end of this tutorial you will:

* Train and use the LearnedAdvisor for ML-backed resource predictions.
* Compare ML-backed predictions against the deterministic ResourceAdvisor.
* Choose between linear, random forest, and gradient boosting models.
* Configure the AdaptiveScaler with ML-informed decisions.
* Tune model hyperparameters with cross-validation.

Prerequisites
-------------

* Completed :ref:`tutorial_getting_started`, :ref:`tutorial_telemetry`, and
  :ref:`tutorial_scaling_strategies`.
* ``pip install scalable[ml]`` (installs ``scikit-learn``, ``dask-ml``, ``joblib``).
* At least 5 completed telemetry runs (more history yields better predictions).

Scenario
--------

Your pipeline has been running for weeks, accumulating telemetry data. You want
to use this history to predict resource allocations for new runs automatically
and to drive adaptive scaling decisions in real time. ML-backed advising can
reduce wasted resources and improve throughput by adapting allocations to the
characteristics of each workload.

Step 1: The ResourceAdvisor (Baseline)
---------------------------------------

Before ML, Scalable provides a deterministic, quantile-based advisor:

.. code-block:: python

   from scalable import ResourceAdvisor

   advisor = ResourceAdvisor.from_history("./.scalable/runs")

   recommendation = advisor.recommend(
       task="run_demeter_scenario",
       target="local",
       confidence=0.95,
   )

   print(f"Recommended workers: {recommendation.workers}")
   print(f"Resources: {recommendation.resources}")
   print(f"Confidence: {recommendation.confidence}")
   print(f"Evidence: {recommendation.evidence}")

Expected output:

.. code-block:: text

   Recommended workers: {'demeter': 4}
   Resources: {'demeter': {'cpus': 8, 'memory': '32G', 'walltime': '02:30:00'}}
   Confidence: 0.95
   Evidence: {'runs_analyzed': 12, 'method': 'quantile', 'percentile': 95}

The deterministic advisor uses simple quantile statistics (the 95th percentile
of historical duration and resource usage). It is reliable but does not adapt to
input characteristics: every invocation of ``run_demeter_scenario`` is treated
identically.

Step 2: The LearnedAdvisor (ML-Enhanced)
-----------------------------------------

The :class:`~scalable.ml.learned_advisor.LearnedAdvisor` trains a machine
learning model on your telemetry to predict resource requirements based on task
features:

.. code-block:: python

   from scalable import LearnedAdvisor

   # Train from telemetry history
   advisor = LearnedAdvisor.from_history(
       "./.scalable/runs",
       model_type="gradient_boosting",  # or "random_forest", "linear"
   )

   # Predict resources for a specific task with input features
   recommendation = advisor.recommend(
       task="run_demeter_scenario",
       target="hpc",
       features={
           "num_scenarios": 50,
           "input_size_mb": 2048,
           "time_horizon": 2100,
       },
   )

   print(f"Predicted workers: {recommendation.workers}")
   print(f"Predicted resources: {recommendation.resources}")
   print(f"Model confidence: {recommendation.confidence:.2f}")

Expected output:

.. code-block:: text

   Predicted workers: {'demeter': 8}
   Predicted resources: {'demeter': {'cpus': 16, 'memory': '48G', 'walltime': '03:15:00'}}
   Model confidence: 0.87

**How it works:**

1. The advisor scans telemetry run directories for completed tasks.
2. It extracts features: task name, input sizes, component resources, target
   type, historical duration, peak memory.
3. The selected model (gradient boosting, random forest, or linear) is trained
   to predict the resource allocation for a given set of input features.
4. Each prediction carries a confidence estimate; low confidence triggers a
   fallback to the deterministic advisor.

Step 3: Model Types and Trade-Offs
------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 20 30 25 25

   * - Model Type
     - Accuracy
     - Training Speed
     - When to Use
   * - ``linear``
     - Low
     - Fast (<1s)
     - Few runs, simple patterns
   * - ``random_forest``
     - Medium
     - Moderate (5–30s)
     - Moderate history, non-linear patterns
   * - ``gradient_boosting``
     - High
     - Slow (30–120s)
     - Rich history (50+ runs), complex patterns

Select the model type from the CLI:

.. code-block:: bash

   # Use the ML advisor from CLI
   scalable advise --task run_demeter_scenario --model-type gradient_boosting --format json

Example output:

.. code-block:: json

   {
     "task": "run_demeter_scenario",
     "workers": {"demeter": 8},
     "resources": {"demeter": {"cpus": 16, "memory": "48G", "walltime": "03:15:00"}},
     "confidence": 0.87,
     "model_type": "gradient_boosting"
   }

Step 4: AdaptiveScaler with ML Predictions
--------------------------------------------

Combine the LearnedAdvisor with real-time scaling:

.. code-block:: python

   from scalable import AdaptiveScaler, LearnedAdvisor, ScalableSession

   # Train advisor
   advisor = LearnedAdvisor.from_history("./.scalable/runs", model_type="gradient_boosting")

   # Create adaptive scaler backed by ML predictions
   scaler = AdaptiveScaler(
       advisor=advisor,
       min_workers={"demeter": 2, "postprocess": 1},
       max_workers={"demeter": 30, "postprocess": 10},
       scale_up_threshold=0.7,
       scale_down_threshold=0.3,
       cooldown_seconds=90,
   )

   session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
   client = session.start()

   # scenario_batches and run_demeter_scenario are assumed to be defined by
   # your workflow; each scenario is assumed to expose a ``size`` attribute.
   # Submit work in batches and let the scaler decide
   for batch in scenario_batches:
       futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in batch]

       # Tags match the "demeter" tag used for the submitted tasks and workers
       decision = scaler.evaluate(
           pending_tasks=[{"tag": "demeter", "features": {"input_size_mb": s.size}} for s in batch],
           active_workers={"demeter": 10},
           recent_completions=[{"tag": "demeter", "duration_s": 180}],
       )

       if decision.has_changes:
           print(f"ML-informed scaling: {decision.reasoning}")
           print(f"  Confidence: {decision.confidence:.2f}")
           print(f"  Predicted completion: {decision.predicted_completion_time:.0f}s")

   session.close()

The ML-backed scaler considers:

* Current queue depth and worker utilization.
* Predicted task duration from the learned model.
* Historical scaling patterns (what worked before).
* Cost constraints (from the ``max_workers`` ceiling).

Step 5: Hyperparameter Tuning
------------------------------

To improve prediction quality, tune the ML model with cross-validation:

.. code-block:: python

   from scalable import LearnedAdvisor
   from scalable.ml import HyperparameterSearch

   search = HyperparameterSearch(
       runs_dir="./.scalable/runs",
       model_type="gradient_boosting",
       cv_folds=5,
   )

   best_params = search.run()
   print(f"Best parameters: {best_params}")
   print(f"Cross-validation score: {search.best_score:.3f}")

   # Use best parameters
   advisor = LearnedAdvisor.from_history(
       "./.scalable/runs",
       model_type="gradient_boosting",
       model_params=best_params,
   )

Troubleshooting
---------------

**LearnedAdvisor predictions are poor**
   Ensure you have sufficient telemetry history (at least 10–20 completed runs
   with varied inputs). With fewer runs, the deterministic ``ResourceAdvisor``
   is more reliable.

**"ImportError: scikit-learn not installed"**
   Install the ML extra: ``pip install scalable[ml]``.

**Confidence is consistently low**
   The model has not seen enough varied inputs. Continue running the workflow
   to grow the telemetry history, then refit the advisor.

**Cross-validation scores are unstable across folds**
   Your dataset may be too small or imbalanced. Aim for 50+ runs with a mix of
   input characteristics before relying on tuned hyperparameters.

Next Steps
----------

* :ref:`tutorial_ai_composition`: Use AI assistants to generate workflow
  configurations that incorporate ML-driven advising.
* :ref:`tutorial_telemetry`: Track ML-driven scaling decisions and resource
  utilization in telemetry for cost analysis.
* :ref:`tutorial_cloud_integration`: Deploy ML-backed adaptive workflows to the
  cloud for cost efficiency.
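
Appendix: Confidence Fallback Sketch
------------------------------------

Step 2 notes that low model confidence triggers a fallback to the deterministic
advisor. The snippet below is a minimal, standalone sketch of that rule, not
Scalable's actual implementation: the ``Recommendation`` dataclass, the
``choose_recommendation`` helper, and the ``0.6`` threshold are all hypothetical
names and values chosen for illustration.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class Recommendation:
       """Hypothetical stand-in for an advisor result (not Scalable's class)."""

       workers: int
       confidence: float
       source: str


   def choose_recommendation(learned, baseline, min_confidence=0.6):
       """Return the learned result unless its confidence is below the threshold.

       ``min_confidence`` is an assumed cutoff; the real library may use a
       different value or a different rule entirely.
       """
       if learned.confidence >= min_confidence:
           return learned
       return baseline


   # Illustrative values modeled on the outputs shown in Steps 1 and 2.
   baseline = Recommendation(workers=4, confidence=0.95, source="quantile")
   confident = Recommendation(workers=8, confidence=0.87, source="gradient_boosting")
   uncertain = Recommendation(workers=12, confidence=0.41, source="gradient_boosting")

   print(choose_recommendation(confident, baseline).source)
   print(choose_recommendation(uncertain, baseline).source)

The first call keeps the ML-backed recommendation because its confidence clears
the assumed threshold; the second falls back to the quantile baseline. Keeping
the rule as a small pure function makes the cutoff easy to adjust and test in
isolation.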