.. _beginner_caching:

======================================================
Beginner Tutorial 4: Caching — Avoiding Redundant Work
======================================================

The Big Picture
----------------

Imagine you've run a 2-hour simulation pipeline and it fails on step 47 of 50.
You fix the bug and re-run. Without caching, all 50 steps execute again —
including the 46 that already succeeded. That's hours of wasted computation.

**Caching** solves this by saving the results of completed work. On re-run,
Scalable checks: "Have I already computed this exact function with these exact
inputs?" If yes, it returns the saved result instantly. If no, it computes
normally and saves the result for next time.

This tutorial explains how caching works from first principles — hashing,
content-addressable storage, decorators, and the trade-offs involved.

What You Will Learn
--------------------

By the end of this tutorial you will:

* Understand what caching is and why it matters for scientific workflows.
* Know how hash functions create "fingerprints" of data.
* Understand content-addressable storage.
* Use the ``@cacheable`` decorator in Scalable.
* Handle file-based and directory-based inputs.
* Configure local and remote cache storage.
* Understand cache invalidation strategies.

Prerequisites
--------------

* Completed :ref:`beginner_getting_started`.
* Scalable installed (``pip install scalable``).
* No cloud account is needed: the remote cache material is conceptual, so
  you can simply follow along.


Key Concepts Explained
-----------------------

.. admonition:: 💡 Key Concept: What is Caching?
   :class: tip

   **Caching** is storing the result of an expensive operation so you can
   reuse it later without recomputing. It trades **storage space** for
   **computation time**.

   Real-world examples of caching:

   * **Web browser cache** — stores downloaded images/CSS so pages load
     faster on revisit
   * **CPU cache** — keeps frequently accessed memory close to the processor
   * **DNS cache** — remembers IP addresses so your computer doesn't ask
     "what's google.com's address?" every time

   In Scalable, caching means: "If I've already computed ``f(x)`` and saved
   the result, don't compute it again — just return the saved result."

.. admonition:: 💡 Key Concept: Hash Functions
   :class: tip

   A **hash function** takes input of any size and produces a fixed-size
   "fingerprint." Think of it as a one-way summarizer (the hash values below
   are illustrative and truncated):

   .. code-block:: text

      Input: "Hello, World!"     → Hash: 65a8e27d8879...
      Input: "Hello, World!!"    → Hash: 7f83b1657ff1...  (totally different!)
      Input: (500MB data file)   → Hash: a3b8c9d2e1f0...

   Key properties:

   * **Deterministic** — same input always produces same hash
   * **Fixed size** — output is always the same length regardless of input
   * **Avalanche effect** — tiny input change → completely different hash
   * **One-way** — you can't reconstruct the input from the hash

   **In Scalable:** When you call a cached function, Scalable hashes the
   function name + all arguments to create a unique key. If that key exists
   in the cache, the result is already known.
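
The properties listed above can be checked with Python's standard ``hashlib``
module. This is a minimal sketch that uses SHA-256 as an example algorithm;
the tutorial does not say which hash function Scalable uses internally, and
the helper name ``fingerprint`` is made up for illustration.

.. code-block:: python

   import hashlib


   def fingerprint(data: bytes) -> str:
       """Return the SHA-256 digest of ``data`` as a hex string."""
       return hashlib.sha256(data).hexdigest()


   a = fingerprint(b"Hello, World!")
   b = fingerprint(b"Hello, World!")
   c = fingerprint(b"Hello, World!!")
   big = fingerprint(b"x" * 1_000_000)

   print(a == b)             # True: deterministic, same input gives same hash
   print(a == c)             # False: one extra character changes the hash
   print(len(a), len(big))   # 64 64: fixed size regardless of input length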

.. admonition:: 💡 Key Concept: Content-Addressable Storage
   :class: tip

   **Content-addressable storage** (CAS) uses the *content's hash* as its
   address (filename/key). Instead of naming a file ``results_v3_final.json``,
   you name it ``sha256_a3b8f1c2d4e5.json``.

   **Benefits:**

   * **Deduplication** — identical content has the same hash, stored once
   * **Verification** — you can verify data hasn't been corrupted by
     re-hashing and comparing
   * **Immutability** — content at a hash never changes (any change =
     different hash = different address)

   **Used by:** Git (every commit, file, and tree is content-addressed),
   Docker (image layers), IPFS, and Scalable's cache system.
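
To make the idea concrete, here is a toy in-memory content-addressable store.
It is only a sketch: the class ``ContentStore`` and its methods are
hypothetical names, SHA-256 is an assumed choice of hash, and a real store
would write to disk or object storage rather than a dictionary.

.. code-block:: python

   import hashlib


   class ContentStore:
       """Toy in-memory content-addressable store (illustrative only)."""

       def __init__(self):
           self._blobs = {}

       def put(self, content: bytes) -> str:
           """Store content under its own SHA-256 hash and return that address."""
           address = hashlib.sha256(content).hexdigest()
           self._blobs[address] = content  # identical content maps to one entry
           return address

       def get(self, address: str) -> bytes:
           """Fetch content and verify that it still matches its address."""
           content = self._blobs[address]
           if hashlib.sha256(content).hexdigest() != address:
               raise ValueError("stored content is corrupted")
           return content


   store = ContentStore()
   addr1 = store.put(b'{"demand_mwh": 42042}')
   addr2 = store.put(b'{"demand_mwh": 42042}')   # duplicate content
   addr3 = store.put(b'{"demand_mwh": 42043}')   # changed content

   print(addr1 == addr2)   # True: deduplication, same address
   print(addr1 == addr3)   # False: a change produces a new address

Because the address is derived from the bytes themselves, ``get`` can detect
corruption simply by hashing again and comparing.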

.. admonition:: 💡 Key Concept: Memoization
   :class: tip

   **Memoization** is a specific caching technique for functions: remember
   the result of a function call based on its inputs.

   .. code-block:: python

      # Without memoization:
      result1 = expensive_function(42)   # Takes 5 minutes
      result2 = expensive_function(42)   # Takes 5 minutes again!

      # With memoization:
      result1 = expensive_function(42)   # Takes 5 minutes, saves result
      result2 = expensive_function(42)   # Instant! Returns saved result

   Memoization requires **determinism** — the same inputs must always produce
   the same output. If your function depends on the current time, random
   numbers, or external state that changes, memoization won't give correct
   results.
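
The behavior described above can be tried with Python's built-in
``functools.lru_cache``, which memoizes in memory. In this small sketch the
counter ``call_count`` and the body of ``expensive_function`` are stand-ins
chosen for illustration; the real work would be whatever slow computation
you want to avoid repeating.

.. code-block:: python

   import functools

   call_count = 0


   @functools.lru_cache(maxsize=None)
   def expensive_function(x: int) -> int:
       """Stand-in for a slow computation; counts how often the body runs."""
       global call_count
       call_count += 1
       return x * x


   result1 = expensive_function(42)   # body runs, result is stored
   result2 = expensive_function(42)   # served from the in-memory cache

   print(result1, result2)   # 1764 1764
   print(call_count)         # 1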

.. admonition:: 💡 Key Concept: Python Decorators
   :class: tip

   A **decorator** is a Python pattern that wraps a function to add behavior
   without changing the function's code. Decorators use the ``@`` syntax:

   .. code-block:: python

      @some_decorator
      def my_function(x):
          return x * 2

   This is equivalent to:

   .. code-block:: python

      def my_function(x):
          return x * 2
      my_function = some_decorator(my_function)

   The decorator receives your function and returns a new function that does
   something extra (like checking a cache before calling the original).

   **Common decorators you may have seen:**

   * ``@property`` — makes a method behave like an attribute
   * ``@staticmethod`` — marks a method that doesn't use ``self``
   * ``@functools.lru_cache`` — Python's built-in memoization

   Scalable's ``@cacheable`` is a decorator that adds persistent caching to
   any function.
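
A hand-written decorator shows how the wrapping works. This is a sketch of
the general pattern, not Scalable's implementation: the names ``remember``
and ``saved`` are invented here, and the cache is a plain dictionary that
disappears when the program exits.

.. code-block:: python

   import functools


   def remember(func):
       """Toy decorator: check an in-memory dict before calling ``func``."""
       saved = {}

       @functools.wraps(func)  # preserve the wrapped function's name and docstring
       def wrapper(*args):
           if args not in saved:          # first time we see these arguments
               saved[args] = func(*args)  # call the original and store the result
           return saved[args]

       wrapper.saved = saved  # exposed only so the example can inspect it
       return wrapper


   @remember
   def double(x):
       return x * 2


   print(double(21), double(21))   # 42 42
   print(double.__name__)          # double
   print(double.saved)             # {(21,): 42}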


Step 1: Basic Caching with @cacheable
---------------------------------------

Here's how to make a function cacheable in Scalable:

.. code-block:: python

   from scalable import cacheable


   @cacheable(return_type=dict, scenario_id=int)
   def run_simulation(scenario_id: int) -> dict:
       """Expensive computation — runs an energy demand scenario."""
       import time
       time.sleep(5)  # Simulate 5 seconds of heavy computation
       return {
           "scenario_id": scenario_id,
           "demand_mwh": scenario_id * 1000 + 42,
           "status": "complete",
       }

.. admonition:: What's happening with that decorator?
   :class: note

   ``@cacheable(return_type=dict, scenario_id=int)`` tells Scalable:

   1. **This function can be cached**: wrap it with cache logic.
   2. **Return type** is ``dict``: Scalable knows how to serialize and
      deserialize the result.
   3. **Argument** ``scenario_id`` is type ``int``: Scalable knows how to
      hash this argument deterministically.

   These type declarations help Scalable create reliable cache keys.
   Different types hash differently (the integer ``1`` vs. the string
   ``"1"`` produce different cache keys).

**First call** — cache miss (slow):

.. code-block:: python

   result = run_simulation(scenario_id=42)
   # Takes 5 seconds — computes and saves to cache
   print(result)  # {'scenario_id': 42, 'demand_mwh': 42042, 'status': 'complete'}

**Second call** — cache hit (instant):

.. code-block:: python

   result = run_simulation(scenario_id=42)
   # Instant! Returns saved result from cache
   print(result)  # {'scenario_id': 42, 'demand_mwh': 42042, 'status': 'complete'}

.. admonition:: Under the Hood: What happens on each call
   :class: hint

   **Cache miss (first call):**

   1. Scalable hashes: ``hash("run_simulation" + hash(42))`` → key ``abc123``
   2. Looks up key ``abc123`` in cache storage → not found
   3. Calls the actual function → waits 5 seconds → gets result
   4. Serializes the result and stores it at key ``abc123``
   5. Returns the result to you

   **Cache hit (second call):**

   1. Scalable hashes: ``hash("run_simulation" + hash(42))`` → key ``abc123``
   2. Looks up key ``abc123`` in cache storage → found!
   3. Deserializes the stored result
   4. Returns it immediately (no function execution)
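
The miss and hit paths above can be sketched with a small disk-backed
decorator. Everything here is an assumption made for illustration:
``toy_cacheable`` is a hypothetical stand-in (not Scalable's ``@cacheable``),
the key is a SHA-256 hash of a JSON payload, and results are stored as JSON
files in a temporary directory.

.. code-block:: python

   import functools
   import hashlib
   import json
   import tempfile
   from pathlib import Path

   CACHE_DIR = Path(tempfile.mkdtemp())  # throwaway directory for the demo
   calls = []                            # records every real execution


   def toy_cacheable(func):
       """Hypothetical stand-in for a persistent cache decorator."""

       @functools.wraps(func)
       def wrapper(**kwargs):
           # 1. Build the key from the function name and its arguments.
           payload = json.dumps([func.__name__, kwargs], sort_keys=True)
           key = hashlib.sha256(payload.encode()).hexdigest()
           entry = CACHE_DIR / f"{key}.json"
           # 2. Cache hit: deserialize and return without running the function.
           if entry.exists():
               return json.loads(entry.read_text())
           # 3. Cache miss: run the function, then serialize and store the result.
           result = func(**kwargs)
           entry.write_text(json.dumps(result))
           return result

       return wrapper


   @toy_cacheable
   def run_simulation(scenario_id):
       calls.append(scenario_id)
       return {"scenario_id": scenario_id, "demand_mwh": scenario_id * 1000 + 42}


   first = run_simulation(scenario_id=42)    # miss: function body runs
   second = run_simulation(scenario_id=42)   # hit: loaded from disk
   print(first == second, calls)             # True [42]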


Step 2: How Cache Keys Are Computed
-------------------------------------

The cache key is a hash of:

1. The function's **fully qualified name** (module + function name)
2. The function's **arguments** (each individually hashed)

.. code-block:: python

   # These produce DIFFERENT cache keys:
   run_simulation(scenario_id=1)    # key = hash(name + hash(1))
   run_simulation(scenario_id=2)    # key = hash(name + hash(2))

   # These produce the SAME cache key:
   run_simulation(scenario_id=42)   # First call
   run_simulation(scenario_id=42)   # Same key → cache hit!
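
One way to picture the key computation is the sketch below. The helper
``cache_key`` is hypothetical and only approximates the scheme described
above: it hashes the qualified name, then folds in a hash of each argument.
It relies on ``repr``, so it is only suitable for simple values such as
numbers and strings.

.. code-block:: python

   import hashlib


   def cache_key(func, **kwargs) -> str:
       """Hypothetical key: hash of the qualified name plus each argument's hash."""
       name = f"{func.__module__}.{func.__qualname__}"
       hasher = hashlib.sha256(name.encode())
       for arg_name in sorted(kwargs):  # sort so keyword order does not matter
           value = kwargs[arg_name]
           # Include the type name so the int 1 and the str "1" differ.
           arg_repr = f"{arg_name}:{type(value).__name__}:{value!r}"
           hasher.update(hashlib.sha256(arg_repr.encode()).digest())
       return hasher.hexdigest()


   def run_simulation(scenario_id):
       return scenario_id


   key_1 = cache_key(run_simulation, scenario_id=1)
   key_2 = cache_key(run_simulation, scenario_id=2)
   key_42a = cache_key(run_simulation, scenario_id=42)
   key_42b = cache_key(run_simulation, scenario_id=42)

   print(key_1 == key_2)       # False: different arguments
   print(key_42a == key_42b)   # True: same function, same arguments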

.. admonition:: 💡 Key Concept: Deterministic Hashing
   :class: tip

   For caching to work correctly, hashing must be **deterministic** — the
   same input must always produce the same hash.

   This is why Scalable asks you to declare argument types. A Python ``dict``
   preserves insertion order (guaranteed since Python 3.7), so two dicts with
   the same contents built in a different order would serialize differently.
   Scalable ensures stability by sorting keys before hashing.

   **What can be hashed reliably:**

   * Primitive types: ``int``, ``float``, ``str``, ``bool``
   * Collections: ``list``, ``tuple``, ``dict`` (with hashable contents)
   * Files: hashed by content (not filename!)

   **What can't be hashed reliably:**

   * Objects with mutable state
   * Functions/lambdas (their code might change)
   * Anything involving randomness or external state
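
The sketch below shows one common way to hash a dictionary deterministically:
convert it to a canonical JSON string with sorted keys, then hash the bytes.
The helper ``stable_hash`` is a hypothetical name and this is not necessarily
how Scalable does it. Note that Python's built-in ``hash()`` is unsuitable for
persistent keys because string hashes are salted per process by default.

.. code-block:: python

   import hashlib
   import json


   def stable_hash(value) -> str:
       """Hash a JSON-compatible value via a canonical text form."""
       canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
       return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


   config_a = {"region": "west", "year": 2030}
   config_b = {"year": 2030, "region": "west"}   # same content, different order

   print(config_a == config_b)                            # True
   print(stable_hash(config_a) == stable_hash(config_b))  # True
   print(stable_hash(1) == stable_hash("1"))              # False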


Step 3: Handling File Inputs
------------------------------

Scientific workflows often take files as input. Scalable provides special
types for file-based hashing:

.. code-block:: python

   from scalable import cacheable
   from scalable.caching import FileType, DirType


   @cacheable(return_type=dict, input_file=FileType, config=dict)
   def process_data(input_file: str, config: dict) -> dict:
       """Process a data file according to config."""
       with open(input_file) as f:
           data = f.read()
       # ... processing ...
       return {"rows": len(data.splitlines()), "config": config}

.. admonition:: 💡 Key Concept: FileType and Content Hashing
   :class: tip

   When you annotate an argument as ``FileType``, Scalable hashes the
   **file's contents** (not its path or name).

   Why? Because:

   * Same file at different paths = same computation = should cache-hit
   * Same path with different contents = different computation = should
     NOT cache-hit

   .. code-block:: text

      process_data("/data/input_v1.csv", ...)   # Hashes CSV content
      process_data("/tmp/copy_of_v1.csv", ...)  # Same content → cache hit!
      # (even though the path is different)

   ``DirType`` works similarly but hashes all files in a directory
   (recursively).
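
Content hashing for files and directories can be sketched with the standard
library. The helpers ``hash_file`` and ``hash_dir`` are hypothetical, and the
choice to mix each file's relative path into the directory hash is an
assumption of this example; the tutorial does not specify how ``DirType``
treats file names.

.. code-block:: python

   import hashlib
   import tempfile
   from pathlib import Path


   def hash_file(path: Path, chunk_size: int = 65536) -> str:
       """Hash a file's bytes in chunks so large files need little memory."""
       hasher = hashlib.sha256()
       with open(path, "rb") as f:
           while chunk := f.read(chunk_size):
               hasher.update(chunk)
       return hasher.hexdigest()


   def hash_dir(root: Path) -> str:
       """Hash every file under ``root``, visiting paths in sorted order."""
       hasher = hashlib.sha256()
       for path in sorted(p for p in root.rglob("*") if p.is_file()):
           hasher.update(path.relative_to(root).as_posix().encode())
           hasher.update(hash_file(path).encode())
       return hasher.hexdigest()


   workdir = Path(tempfile.mkdtemp())
   original = workdir / "input_v1.csv"
   copy = workdir / "copy_of_v1.csv"
   original.write_text("id,demand\n1,42\n")
   copy.write_text("id,demand\n1,42\n")

   same_content = hash_file(original) == hash_file(copy)
   print(same_content)          # True: same bytes at a different path

   before = hash_dir(workdir)
   copy.write_text("id,demand\n1,43\n")   # edit one file
   dir_changed = hash_dir(workdir) != before
   print(dir_changed)           # True: the directory hash reflects the edit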


Step 4: Cache Storage Configuration
--------------------------------------

By default, Scalable stores cache entries on local disk:

.. code-block:: yaml

   # In scalable.yaml
   project:
     name: my-project
     local_cache: ./cache    # Cache stored here

The cache directory structure looks like:

.. code-block:: text

   ./cache/
   ├── run_simulation/
   │   ├── abc123.json       # Cached result for scenario_id=42
   │   ├── def456.json       # Cached result for scenario_id=7
   │   └── ...
   └── process_data/
       ├── 789ghi.json
       └── ...

For team collaboration or cloud workflows, you can use remote storage:

.. code-block:: yaml

   project:
     name: my-project
     local_cache: s3://my-bucket/scalable-cache/

.. admonition:: 💡 Key Concept: Local vs. Remote Cache
   :class: tip

   **Local cache** (filesystem):

   * Fast (no network latency)
   * Private (only you can access)
   * Lost if machine is destroyed

   **Remote cache** (S3, GCS):

   * Shared across team members and CI/CD
   * Persistent (survives machine changes)
   * Slower (network round-trip for every lookup)
   * Costs money (storage + requests)

   **When to use remote cache:** When your team runs the same pipeline and
   you want to share cached results. If Person A has already computed
   scenarios 1–500, Person B gets cache hits for those and only has to
   compute from scenario 501 onward.


Step 5: Cache Invalidation
-----------------------------

.. admonition:: 💡 Key Concept: Cache Invalidation
   :class: tip

   There's a famous saying in computer science:

      *"There are only two hard things in Computer Science: cache
      invalidation and naming things."* — Phil Karlton

   **Cache invalidation** means deciding when cached results are no longer
   valid. A result becomes invalid when:

   * The function's logic changes (you fixed a bug)
   * An input file's content changes
   * External dependencies update (new library version)
   * You explicitly want fresh results

Scalable handles invalidation in several ways:

**Automatic invalidation** (content-based):

* File inputs are hashed by content → changed file = different key = no hit
* Function arguments change → different key = no hit

**Manual invalidation:**

.. code-block:: bash

   # Clear all cache for a project
   rm -rf ./cache/

   # Clear cache for a specific function
   rm -rf ./cache/run_simulation/

**Selective re-computation:**

.. code-block:: python

   # Force re-computation even if cached
   result = run_simulation(scenario_id=42, _cache_bypass=True)

.. admonition:: 🤔 Think About It
   :class: note

   What happens if you change the function's code but not its inputs?

   By default, Scalable hashes the function **name**, not its **code**. So
   if you fix a bug in ``run_simulation``, the cache key is the same and
   you'll get stale results!

   **Solution:** Clear the cache after code changes, or use versioning:

   .. code-block:: python

      @cacheable(return_type=dict, scenario_id=int, _version="2")
      def run_simulation(scenario_id: int) -> dict:
          # Fixed bug — _version="2" creates different cache keys
          ...
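
Why a version string helps can be seen in a small sketch. The function
``versioned_key`` is hypothetical and only illustrates the principle: if the
version is part of what gets hashed, bumping it changes every key, so old
entries are simply never matched again.

.. code-block:: python

   import hashlib
   import json


   def versioned_key(func_name: str, version: str, **kwargs) -> str:
       """Hypothetical cache key that also covers a manual version string."""
       payload = json.dumps([func_name, version, kwargs], sort_keys=True)
       return hashlib.sha256(payload.encode()).hexdigest()


   old_key = versioned_key("run_simulation", "1", scenario_id=42)
   new_key = versioned_key("run_simulation", "2", scenario_id=42)

   print(old_key == new_key)   # False: version "1" entries are not reused

The stale entries still occupy disk space until they are deleted, so
versioning complements rather than replaces periodic cleanup.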


Step 6: Monitoring Cache Performance
---------------------------------------

Scalable records cache hit/miss events in telemetry:

.. code-block:: bash

   scalable report --last

.. code-block:: text

   Cache Performance:
     Total lookups: 200
     Hits: 180 (90%)
     Misses: 20 (10%)
     Time saved: ~15 minutes (estimated from hit count × avg task duration)

A high hit rate (>80%) means caching is working well. A low hit rate might
mean:

* Inputs are always changing (cache keys never match)
* The cache was recently cleared
* Tasks aren't deterministic
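
The numbers in the sample report can be reproduced by hand. This sketch
assumes an average task duration of 5 seconds, taken from the ``time.sleep(5)``
in the ``run_simulation`` example; in a real report the duration would come
from recorded telemetry.

.. code-block:: python

   hits = 180
   misses = 20
   avg_task_seconds = 5.0   # assumed: matches the sleep in run_simulation

   lookups = hits + misses
   hit_rate = hits / lookups
   seconds_saved = hits * avg_task_seconds

   print(f"Lookups: {lookups}")                         # Lookups: 200
   print(f"Hit rate: {hit_rate:.0%}")                   # Hit rate: 90%
   print(f"Time saved: {seconds_saved / 60:.0f} min")   # Time saved: 15 min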

.. admonition:: 💡 Key Concept: Serialization
   :class: tip

   **Serialization** converts a Python object into bytes that can be stored
   on disk or sent over a network. **Deserialization** converts bytes back
   into a Python object.

   Common serialization formats:

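   * **JSON**: human-readable, limited types (no sets, dates, custom objects)
   * **Pickle**: Python-native, supports most Python objects (but not, for
     example, lambdas or open file handles), not human-readable, and unsafe
     to load from untrusted sources
   * **MessagePack**: fast binary format, limited types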

   Scalable uses JSON for simple types (dicts, lists, strings) and pickle
   for complex objects. The ``return_type`` annotation in ``@cacheable``
   helps Scalable choose the best serialization strategy.
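
The trade-off between JSON and pickle is easy to see with the standard
library. This sketch uses only ``json`` and ``pickle``; the sample ``result``
dictionary is made up, and nothing here reflects Scalable's internal storage
format.

.. code-block:: python

   import json
   import pickle

   result = {"scenario_id": 42, "tags": ("base", "high")}

   # JSON: readable text, but tuples come back as lists.
   as_json = json.dumps(result)
   from_json = json.loads(as_json)
   print(from_json["tags"])     # ['base', 'high']

   # Pickle: opaque bytes, but the tuple survives the round trip.
   as_pickle = pickle.dumps(result)
   from_pickle = pickle.loads(as_pickle)
   print(from_pickle["tags"])   # ('base', 'high')

   # JSON cannot represent a set at all.
   try:
       json.dumps({"ids": {1, 2, 3}})
       json_handles_sets = True
   except TypeError:
       json_handles_sets = False
   print(json_handles_sets)     # False

Unpickling can execute arbitrary code, so only load pickled cache entries
that come from a source you trust.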


Common Questions
-----------------

**Q: Does caching use a lot of disk space?**

It depends on your output sizes. Small results (numbers, short strings) use
negligible space. Large results (DataFrames, arrays) can grow quickly. Monitor
your cache directory size and set up periodic cleanup for old entries.
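
A minimal sketch of monitoring and cleanup for a local cache directory
follows. It assumes the cache is a plain directory of files, as in the layout
shown in Step 4; the helpers ``cache_size_bytes`` and ``remove_older_than``
are hypothetical and not part of Scalable.

.. code-block:: python

   import os
   import tempfile
   import time
   from pathlib import Path


   def cache_size_bytes(cache_dir: Path) -> int:
       """Total size of all files under ``cache_dir``."""
       return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())


   def remove_older_than(cache_dir: Path, max_age_days: float) -> int:
       """Delete files last modified more than ``max_age_days`` ago."""
       cutoff = time.time() - max_age_days * 86400
       removed = 0
       for path in list(cache_dir.rglob("*")):  # snapshot before deleting
           if path.is_file() and path.stat().st_mtime < cutoff:
               path.unlink()
               removed += 1
       return removed


   # Demo on a throwaway directory with one fresh and one stale entry.
   demo = Path(tempfile.mkdtemp())
   (demo / "fresh.json").write_text("{}")
   stale = demo / "stale.json"
   stale.write_text("{}")
   sixty_days_ago = time.time() - 60 * 86400
   os.utime(stale, (sixty_days_ago, sixty_days_ago))  # backdate the file

   size_before = cache_size_bytes(demo)
   removed_count = remove_older_than(demo, max_age_days=30)
   print(size_before)     # 4
   print(removed_count)   # 1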

**Q: What if two people compute the same thing simultaneously?**

With local cache, they each compute independently. With remote cache (S3),
the second writer overwrites the first — but since the result is deterministic,
they're writing the same value, so it's safe.

**Q: Can I cache functions that return different results each time?**

No! Caching assumes **determinism** — same inputs → same output. If your
function involves randomness, time-dependence, or external state that changes,
caching will return stale/incorrect results.

**Q: What's the difference between Scalable's cache and Python's
functools.lru_cache?**

* ``lru_cache`` stores results **in memory** (lost when program exits)
* ``@cacheable`` stores results **on disk or remote storage** (persistent
  across runs)

Scalable's caching is designed for expensive computations that span multiple
program invocations.

**Q: Can I cache only some invocations?**

Yes — the ``@cacheable`` decorator checks the cache on every call. If you
want to bypass it for specific calls, use ``_cache_bypass=True``.


What You Learned
-----------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Term
     - Definition
   * - Caching
     - Storing results for reuse to avoid recomputation
   * - Hash Function
     - Produces a fixed-size fingerprint from arbitrary input
   * - Content-Addressable Storage
     - Data addressed by its content's hash, not by name
   * - Memoization
     - Caching function results based on inputs
   * - Decorator
     - Python pattern that wraps a function to add behavior
   * - Cache Key
     - Unique identifier for a cached result (hash of function + args)
   * - Cache Hit
     - Result found in cache (fast, no recomputation)
   * - Cache Miss
     - Result NOT found, must compute and store
   * - Cache Invalidation
     - Deciding when cached results are no longer valid
   * - Serialization
     - Converting objects to bytes for storage/transmission
   * - Determinism
     - Same inputs always produce the same output
   * - FileType
     - Annotation telling Scalable to hash file contents, not path


Next Steps
-----------

You now understand how caching works and can use it to avoid redundant
computation in your workflows.

* **Next beginner tutorial:** :ref:`beginner_cloud_integration` — running
  workflows in the cloud
* **Standard tutorial:** :ref:`tutorial_caching` — advanced caching patterns,
  remote configuration, and cache management
* **Try it:** Add ``@cacheable`` to a function, run it twice, and check the
  ``./cache/`` directory to see the stored results. Modify an input and
  verify you get a cache miss.