.. _tutorial_ml_advanced:

======================================================
Tutorial 9: ML-Driven Resource Advising and Scaling
======================================================

What You Will Learn
-------------------

By the end of this tutorial you will:

* Train and use the LearnedAdvisor for ML-backed resource predictions.
* Compare ML-backed predictions against the deterministic ResourceAdvisor.
* Choose between linear, random forest, and gradient boosting models.
* Configure the AdaptiveScaler with ML-informed decisions.
* Tune model hyperparameters with cross-validation.

Prerequisites
-------------

* Completed :ref:`tutorial_getting_started`, :ref:`tutorial_telemetry`, and
  :ref:`tutorial_scaling_strategies`.
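* ``pip install "scalable[ml]"`` (installs ``scikit-learn``, ``dask-ml``, and
  ``joblib``). The quotes keep shells such as zsh from expanding the brackets.
* At least 5 completed telemetry runs to follow along. Predictions improve with
  more history; 10–20 runs with varied inputs give more dependable results.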

Scenario
--------

Your pipeline has been running for weeks, accumulating telemetry data. You want
to use this history to predict resource allocations for new runs and to drive
adaptive scaling decisions in real time. Because ML-backed advising conditions
its predictions on the characteristics of each workload, it can reduce wasted
resources and improve throughput compared with a single fixed allocation.

Step 1: The ResourceAdvisor (Baseline)
---------------------------------------

Even without ML, Scalable provides a deterministic, quantile-based advisor:

.. code-block:: python

   from scalable import ResourceAdvisor

   advisor = ResourceAdvisor.from_history("./.scalable/runs")
   recommendation = advisor.recommend(
       task="run_demeter_scenario",
       target="local",
       confidence=0.95,
   )

   print(f"Recommended workers: {recommendation.workers}")
   print(f"Resources: {recommendation.resources}")
   print(f"Confidence: {recommendation.confidence}")
   print(f"Evidence: {recommendation.evidence}")

Expected output:

.. code-block:: text

   Recommended workers: {'demeter': 4}
   Resources: {'demeter': {'cpus': 8, 'memory': '32G', 'walltime': '02:30:00'}}
   Confidence: 0.95
   Evidence: {'runs_analyzed': 12, 'method': 'quantile', 'percentile': 95}

The deterministic advisor uses simple quantile statistics: the 95th percentile
(P95) of historical duration and resource usage. It is reliable but does not
adapt to input characteristics, so it treats all invocations of
``run_demeter_scenario`` identically.
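
To make the quantile idea concrete, here is a minimal sketch of a nearest-rank
percentile over a list of historical durations. It is not Scalable's
implementation; the function name ``nearest_rank_percentile`` and the sample
durations are assumptions made for illustration.

.. code-block:: python

   import math

   def nearest_rank_percentile(values, percentile):
       """Smallest observed value with at least `percentile`% of samples at or below it."""
       if not values:
           raise ValueError("values must not be empty")
       if not 0 < percentile <= 100:
           raise ValueError("percentile must be in (0, 100]")
       ordered = sorted(values)
       rank = math.ceil(percentile / 100 * len(ordered))  # 1-based rank
       return ordered[rank - 1]

   # Hypothetical task durations (seconds) from 12 earlier runs.
   durations_s = [410, 395, 430, 520, 405, 415, 980, 400, 425, 440, 450, 435]

   p95 = nearest_rank_percentile(durations_s, 95)
   p50 = nearest_rank_percentile(durations_s, 50)
   print(f"P50 = {p50}s, P95 = {p95}s")

With only 12 samples the P95 is the single slowest run, so one outlier sets the
recommendation no matter how small the next invocation's inputs are.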

Step 2: The LearnedAdvisor (ML-Enhanced)
-----------------------------------------

The :class:`~scalable.ml.learned_advisor.LearnedAdvisor` trains a machine
learning model on your telemetry to predict resource requirements based on
task features:

.. code-block:: python

   from scalable import LearnedAdvisor

   # Train from telemetry history
   advisor = LearnedAdvisor.from_history(
       "./.scalable/runs",
       model_type="gradient_boosting",  # or "random_forest", "linear"
   )

   # Predict resources for a specific task with input features
   recommendation = advisor.recommend(
       task="run_demeter_scenario",
       target="hpc",
       features={
           "num_scenarios": 50,
           "input_size_mb": 2048,
           "time_horizon": 2100,
       },
   )

   print(f"Predicted workers: {recommendation.workers}")
   print(f"Predicted resources: {recommendation.resources}")
   print(f"Model confidence: {recommendation.confidence:.2f}")

Expected output:

.. code-block:: text

   Predicted workers: {'demeter': 8}
   Predicted resources: {'demeter': {'cpus': 16, 'memory': '48G', 'walltime': '03:15:00'}}
   Model confidence: 0.87

**How it works:**

1. The advisor scans telemetry run directories for completed tasks.
2. It extracts features: task name, input sizes, component resources, target
   type, historical duration, peak memory.
3. The selected model (linear, random forest, or gradient boosting) is trained
   to predict the resource allocation from those features.
4. Each prediction carries a confidence score; low confidence triggers
   fallback to the deterministic advisor.
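
The fallback in the last step can be pictured as a threshold check. The sketch
below is an assumption about the shape of that logic, not the library's code:
the ``Recommendation`` dataclass, the ``choose_recommendation`` helper, and the
``0.6`` cutoff are all hypothetical.

.. code-block:: python

   from dataclasses import dataclass

   @dataclass
   class Recommendation:
       """Hypothetical stand-in for an advisor result."""
       workers: int
       confidence: float
       source: str

   def choose_recommendation(learned, deterministic, min_confidence=0.6):
       """Prefer the learned result unless its confidence is below the cutoff."""
       if learned.confidence >= min_confidence:
           return learned
       return deterministic

   baseline = Recommendation(workers=4, confidence=0.95, source="quantile")
   confident = Recommendation(workers=8, confidence=0.87, source="gradient_boosting")
   uncertain = Recommendation(workers=8, confidence=0.41, source="gradient_boosting")

   picked_confident = choose_recommendation(confident, baseline)
   picked_uncertain = choose_recommendation(uncertain, baseline)
   print(picked_confident.source, picked_uncertain.source)

Keeping the quantile result as the fallback means a poorly trained model can
never produce a worse recommendation than the baseline from Step 1.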

Step 3: Model Types and Trade-Offs
------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 20 30 25 25

   * - Model Type
     - Accuracy
     - Training Speed
     - When to Use
   * - ``linear``
     - Low
     - Fast (<1s)
     - Few runs, simple patterns
   * - ``random_forest``
     - Medium
     - Moderate (5–30s)
     - Moderate history, non-linear patterns
   * - ``gradient_boosting``
     - High
     - Slow (30–120s)
     - Rich history (50+ runs), complex patterns
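
The "When to Use" column can be turned into a small selection rule. This is a
sketch only: the table gives 50 runs as the point where gradient boosting
becomes worthwhile, while the 20-run boundary between ``linear`` and
``random_forest`` is an assumed value, as is the helper name
``pick_model_type``.

.. code-block:: python

   def pick_model_type(num_runs, few_runs_cutoff=20, rich_history_cutoff=50):
       """Map the amount of telemetry history to a model type name."""
       if num_runs < 0:
           raise ValueError("num_runs must be non-negative")
       if num_runs >= rich_history_cutoff:
           return "gradient_boosting"  # rich history, complex patterns
       if num_runs >= few_runs_cutoff:
           return "random_forest"      # moderate history, non-linear patterns
       return "linear"                 # few runs, simple patterns

   for runs in (8, 30, 50):
       print(runs, "->", pick_model_type(runs))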

Select the model type from the CLI with ``--model-type``:

.. code-block:: bash

   # Use the ML advisor from CLI
   scalable advise --task run_demeter_scenario --model-type gradient_boosting --format json

Expected output:

.. code-block:: json

   {
     "task": "run_demeter_scenario",
     "workers": {"demeter": 8},
     "resources": {"demeter": {"cpus": 16, "memory": "48G", "walltime": "03:15:00"}},
     "confidence": 0.87,
     "model_type": "gradient_boosting"
   }
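
Because ``--format json`` emits machine-readable output, a script can consume
it directly. The sketch below parses the sample payload shown above using only
the standard library. It embeds the JSON as a string instead of invoking the
CLI, the helper names are hypothetical, and it assumes that ``cpus`` is a
per-worker figure.

.. code-block:: python

   import json

   SAMPLE_OUTPUT = """
   {
     "task": "run_demeter_scenario",
     "workers": {"demeter": 8},
     "resources": {"demeter": {"cpus": 16, "memory": "48G", "walltime": "03:15:00"}},
     "confidence": 0.87,
     "model_type": "gradient_boosting"
   }
   """

   def walltime_to_seconds(walltime):
       """Convert an HH:MM:SS string to seconds."""
       hours, minutes, seconds = (int(part) for part in walltime.split(":"))
       return hours * 3600 + minutes * 60 + seconds

   def summarize(payload):
       """Return per-component totals derived from the advice payload."""
       advice = json.loads(payload)
       summary = {}
       for component, count in advice["workers"].items():
           resources = advice["resources"][component]
           summary[component] = {
               "workers": count,
               # Assumption: the "cpus" value applies to each worker.
               "total_cpus": count * resources["cpus"],
               "walltime_s": walltime_to_seconds(resources["walltime"]),
           }
       return summary

   summary = summarize(SAMPLE_OUTPUT)
   print(summary)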

Step 4: AdaptiveScaler with ML Predictions
--------------------------------------------

Combine the LearnedAdvisor with real-time scaling:

.. code-block:: python

   from scalable import AdaptiveScaler, LearnedAdvisor, ScalableSession

   # Train advisor
   advisor = LearnedAdvisor.from_history("./.scalable/runs", model_type="gradient_boosting")

   # Create adaptive scaler backed by ML predictions
   scaler = AdaptiveScaler(
       advisor=advisor,
       min_workers={"demeter": 2, "postprocess": 1},
       max_workers={"demeter": 30, "postprocess": 10},
       scale_up_threshold=0.7,
       scale_down_threshold=0.3,
       cooldown_seconds=90,
   )

   session = ScalableSession.from_yaml("./scalable.yaml", target="aws")
   client = session.start()

   # Submit work in batches and let the scaler decide.
   # `scenario_batches` and `run_demeter_scenario` come from your own pipeline.
   for batch in scenario_batches:
       futures = [client.submit(run_demeter_scenario, s, tag="demeter") for s in batch]

       # Tags must match the component the tasks were submitted under.
       decision = scaler.evaluate(
           pending_tasks=[{"tag": "demeter", "features": {"input_size_mb": s.size}} for s in batch],
           active_workers={"demeter": 10},
           recent_completions=[{"tag": "demeter", "duration_s": 180}],
       )

       if decision.has_changes:
           print(f"ML-informed scaling: {decision.reasoning}")
           print(f"  Confidence: {decision.confidence:.2f}")
           print(f"  Predicted completion: {decision.predicted_completion_time:.0f}s")

   session.close()

The ML-backed scaler considers:

* Current queue depth and worker utilization.
* Predicted task duration from the learned model.
* Historical scaling patterns (what worked before).
* Cost constraints (from the ``max_workers`` ceiling).
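
The threshold, cooldown, and ceiling parameters interact in a way that is
easier to see in code. The following sketch covers only that part of the
decision and leaves out the learned duration predictions; it is not the
``AdaptiveScaler`` implementation, and the ``next_worker_count`` function and
its ``step`` parameter are hypothetical.

.. code-block:: python

   def next_worker_count(
       current,
       utilization,
       seconds_since_last_change,
       *,
       min_workers,
       max_workers,
       scale_up_threshold=0.7,
       scale_down_threshold=0.3,
       cooldown_seconds=90,
       step=2,
   ):
       """Return the worker count to aim for given current utilization (0 to 1)."""
       # Inside the cooldown window, hold steady to avoid oscillation.
       if seconds_since_last_change < cooldown_seconds:
           return current
       if utilization > scale_up_threshold:
           target = current + step
       elif utilization < scale_down_threshold:
           target = current - step
       else:
           target = current
       # Clamp to the configured floor and ceiling.
       return max(min_workers, min(max_workers, target))

   limits = {"min_workers": 2, "max_workers": 30}
   print(next_worker_count(10, 0.85, 120, **limits))  # busy, cooldown elapsed
   print(next_worker_count(10, 0.85, 30, **limits))   # busy, still cooling down
   print(next_worker_count(3, 0.10, 200, **limits))   # idle, clamped to the floor

The gap between the two thresholds acts as a dead band: utilization between
0.3 and 0.7 leaves the worker count unchanged.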

Step 5: Hyperparameter Tuning
------------------------------

To improve prediction accuracy, tune the model's hyperparameters with
cross-validation:

.. code-block:: python

   from scalable import LearnedAdvisor
   from scalable.ml import HyperparameterSearch

   search = HyperparameterSearch(
       runs_dir="./.scalable/runs",
       model_type="gradient_boosting",
       cv_folds=5,
   )

   best_params = search.run()
   print(f"Best parameters: {best_params}")
   print(f"Cross-validation score: {search.best_score:.3f}")

   # Use best parameters
   advisor = LearnedAdvisor.from_history(
       "./.scalable/runs",
       model_type="gradient_boosting",
       model_params=best_params,
   )
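
For readers new to cross-validated search, the sketch below shows the general
technique on synthetic data using only the standard library. It is not how
``HyperparameterSearch`` is implemented: a k-nearest-neighbours regressor
stands in for gradient boosting, the single tuned hyperparameter is the
neighbour count, and all function names and data are made up for illustration.

.. code-block:: python

   import statistics

   def k_fold_indices(n_samples, n_folds):
       """Yield (train_idx, test_idx); fold f tests samples f, f+n_folds, ..."""
       for fold in range(n_folds):
           test_idx = list(range(fold, n_samples, n_folds))
           held_out = set(test_idx)
           train_idx = [i for i in range(n_samples) if i not in held_out]
           yield train_idx, test_idx

   def knn_predict(train_x, train_y, x, k):
       """Average the targets of the k training points closest to x."""
       order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
       return statistics.fmean([train_y[i] for i in order[:k]])

   def cv_error(xs, ys, k, n_folds=5):
       """Mean absolute error of k-NN, averaged over the folds."""
       fold_errors = []
       for train_idx, test_idx in k_fold_indices(len(xs), n_folds):
           train_x = [xs[i] for i in train_idx]
           train_y = [ys[i] for i in train_idx]
           errors = [
               abs(knn_predict(train_x, train_y, xs[i], k) - ys[i])
               for i in test_idx
           ]
           fold_errors.append(statistics.fmean(errors))
       return statistics.fmean(fold_errors)

   # Synthetic history: duration grows linearly with input size.
   input_size_mb = [float(100 * i) for i in range(1, 21)]
   duration_s = [30.0 + 0.5 * size for size in input_size_mb]

   candidates = [1, 3, 5, 15]
   scores = {k: cv_error(input_size_mb, duration_s, k) for k in candidates}
   best_k = min(scores, key=scores.get)
   print(f"Best k: {best_k}, CV error: {scores[best_k]:.1f}s")

Each candidate is scored only on samples it was not trained on, which is why
the search can detect settings that would overfit or underfit the history.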

Troubleshooting
---------------

**LearnedAdvisor predictions are poor**
  Ensure you have sufficient telemetry history (at least 10–20 completed runs
  with varied inputs). With fewer runs, the deterministic ``ResourceAdvisor``
  is more reliable.

**"ImportError: scikit-learn not installed"**
  Install the ML extra: ``pip install scalable[ml]``.

**Confidence is consistently low**
  The model has not seen enough varied inputs. Continue running the workflow
  to grow the telemetry history, then refit the advisor.

**Cross-validation scores are unstable across folds**
  Your dataset may be too small or imbalanced. Aim for 50+ runs with a mix of
  input characteristics before relying on tuned hyperparameters.
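
To judge whether fold scores are unstable, it helps to put a number on the
spread. The sketch below is one simple way to do that; the
``fold_score_report`` helper, the 15% tolerance, and the example scores are
assumptions, not values produced by the library.

.. code-block:: python

   import statistics

   def fold_score_report(fold_scores, max_relative_spread=0.15):
       """Summarize per-fold scores and flag a large spread relative to the mean."""
       mean = statistics.fmean(fold_scores)
       spread = statistics.stdev(fold_scores)  # sample standard deviation
       relative = spread / abs(mean) if mean else float("inf")
       return {
           "mean": mean,
           "stdev": spread,
           "stable": relative <= max_relative_spread,
       }

   # Hypothetical per-fold scores from two different searches.
   stable = fold_score_report([0.81, 0.84, 0.79, 0.83, 0.82])
   unstable = fold_score_report([0.91, 0.35, 0.78, 0.12, 0.66])
   print(stable["stable"], unstable["stable"])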

Next Steps
----------

* :ref:`tutorial_ai_composition` — Use AI assistants to generate workflow
  configurations that incorporate ML-driven advising.
* :ref:`tutorial_telemetry` — Track ML-driven scaling decisions and resource
  utilization in telemetry for cost analysis.
* :ref:`tutorial_cloud_integration` — Deploy ML-backed adaptive workflows to
  cloud for maximum cost efficiency.