.. _tutorial_caching: ====================================================== Tutorial 4: Performance Optimization and Caching ====================================================== What You Will Learn ------------------- By the end of this tutorial you will: * Use the ``@cacheable`` decorator to skip redundant computation. * Understand how Scalable hashes function arguments for cache keys. * Handle file-based and directory-based inputs with type-safe hashing. * Configure cache storage (local disk, remote S3/GCS). * Monitor cache hit/miss rates through telemetry. * Implement cache invalidation strategies for evolving workflows. Prerequisites ------------- * Completed :ref:`tutorial_getting_started`. * Scalable installed (``pip install scalable``). * For remote cache: ``pip install scalable[cloud]``. Scenario -------- Your pipeline executes expensive energy demand simulations that take 30+ minutes per scenario. During development you frequently restart runs after fixing downstream bugs. Without caching, every restart recomputes scenarios that already succeeded. The ``@cacheable`` decorator lets completed tasks skip execution on retry. Step 1: Basic Caching with @cacheable --------------------------------------- The :func:`~scalable.caching.cacheable` decorator intercepts function calls, computes a content-addressable cache key from the function's name and arguments, and returns cached results when available: .. code-block:: python from scalable import cacheable @cacheable(return_type=dict, scenario_id=int) def run_simulation(scenario_id: int) -> dict: """Expensive computation — runs an energy demand scenario.""" import time time.sleep(30) # Simulating expensive work return {"scenario": scenario_id, "demand_mw": scenario_id * 1.5} First call: .. code-block:: python result = run_simulation(42) # Takes 30 seconds — cache MISS print(result) # {'scenario': 42, 'demand_mw': 63.0} Second call with the same argument: .. code-block:: python result = run_simulation(42) # Returns instantly — cache HIT print(result) # {'scenario': 42, 'demand_mw': 63.0} **How it works:** 1. The decorator serializes each argument using ``dill`` and hashes the bytes with ``xxhash`` (seeded by ``SCALABLE_SEED``). 2. The function name and hash form a composite cache key. 3. On a hit, the stored result is deserialized and returned without executing the function body. 4. On a miss, the function executes normally and the result is stored. Step 2: Type Annotations for Reliable Hashing ----------------------------------------------- Scalable's cache key depends on how arguments are hashed. Without type hints, the decorator falls back to generic serialization, which may produce inconsistent keys for complex objects. Explicit type annotations are preferred: .. code-block:: python from scalable import cacheable @cacheable(return_type=str, name=str, count=int) def greet(name: str, count: int) -> str: return f"Hello {name}! (x{count})" The decorator parameters mirror the function signature: * ``return_type=str`` — declares the return type for safe deserialization. * ``name=str, count=int`` — declares argument types for deterministic hashing. **Why this matters:** Python objects hash differently depending on their runtime type. A ``numpy.int64(42)`` and Python ``int(42)`` produce different byte representations. Explicit type annotations ensure the decorator coerces inputs consistently. Step 3: Hashing Files and Directories --------------------------------------- Scientific workflows frequently operate on input files. Scalable provides specialized type wrappers that hash file *content* rather than paths: .. code-block:: python from scalable import cacheable, FileType, DirType @cacheable(return_type=dict, config=FileType, data_dir=DirType) def process_data(config: str, data_dir: str) -> dict: """Process data files. Cache key includes file contents.""" import json with open(config) as f: cfg = json.load(f) # ... process files in data_dir ... return {"records_processed": 1000, "config_version": cfg["version"]} How each type hashes: .. list-table:: :header-rows: 1 :widths: 15 85 * - Type - Hashing Strategy * - ``FileType`` - Streams file content in 1 MB chunks through xxhash. Includes the filename (basename only) in the hash. If the file doesn't exist, raises ``ValueError``. * - ``DirType`` - Walks the directory tree, hashes each file's relative path and content. Order is sorted for determinism. Missing directory raises ``ValueError``. * - ``str`` - Hashes the string bytes directly (UTF-8 encoded). * - ``int`` - Hashes the integer's byte representation. **Trade-off:** ``FileType`` hashing reads the entire file on every call to compute the key. For very large files (multi-GB), this adds I/O overhead even on cache hits. Consider whether your workflow modifies input files between runs — if inputs are immutable, a simpler path-based key might suffice. Step 4: Forcing Recomputation ------------------------------ Sometimes you need to invalidate the cache for a specific function, for example after fixing a bug in the computation logic: .. code-block:: python @cacheable(return_type=dict, recompute=True, scenario_id=int) def run_simulation(scenario_id: int) -> dict: """Always recompute — ignores cached results.""" # Fixed version of the computation return {"scenario": scenario_id, "demand_mw": scenario_id * 1.7} Setting ``recompute=True`` forces the function to execute every time. The result still gets written to the cache, so subsequent calls (once you remove ``recompute=True``) will find fresh entries. **Alternative: Change the seed.** If you want to invalidate *all* cache entries globally, change the ``SCALABLE_SEED`` environment variable: .. code-block:: bash export SCALABLE_SEED=123456789 python workflow.py # All cache keys change — full recomputation Step 5: The Minimal @cacheable Form -------------------------------------- For quick prototyping, ``@cacheable`` works without explicit types: .. code-block:: python @cacheable def quick_computation(x, y): return x + y In this form: * Arguments are serialized with ``dill`` and hashed generically. * Return type is inferred from the actual return value. * This is less reliable for complex objects but convenient during exploration. **Recommendation:** Always add explicit types for production code. The minimal form is acceptable for quick experiments where cache key stability isn't critical. Step 6: Cache Configuration ----------------------------- Configure cache storage via environment variables or the manifest: **Local disk cache (default):** .. code-block:: bash export SCALABLE_CACHE_DIR=./cache # Or in the manifest: # project: # local_cache: ./my-cache **Remote cache (S3/GCS):** .. code-block:: bash export SCALABLE_CACHE_REMOTE=s3://my-bucket/scalable-cache/ When a remote cache is configured, Scalable checks the remote store on cache miss before executing the function. This enables cache sharing across machines and CI runs: .. code-block:: text Cache lookup order: 1. Local disk (fast, per-machine) 2. Remote store (slower, shared across team) 3. Execute function (slowest, produces new cache entry) **Cache directory structure:** .. code-block:: text ./cache/ ├── cache.db # SQLite index (diskcache) ├── 00/ # Sharded data files │ ├── a3b8f1... │ └── ... └── tmp/ # Temporary write staging The cache is process-safe (uses SQLite locking) and can be shared between concurrent workflows on the same machine. Step 7: Cache-Aware Task Definitions -------------------------------------- In the manifest, marking a task with ``cache: true`` signals to the Session that functions submitted under that task should honor the cache: .. code-block:: yaml tasks: run_demeter_scenario: component: demeter cache: true postprocess: component: analysis cache: false # Always recompute (e.g., aggregation is cheap) When ``cache: true``, the session emits cache hit/miss events to telemetry, allowing you to track cache effectiveness over time. Step 8: Monitoring Cache Performance -------------------------------------- Cache events are recorded in telemetry when running through the Session API: .. code-block:: bash scalable report --latest .. code-block:: text Cache Performance: Total calls: 50 Hits: 35 (70%) Misses: 15 (30%) Estimated time saved: 17.5 minutes Programmatic access: .. code-block:: python import json from pathlib import Path run_dir = Path(".scalable/runs/run-20260520T.../") cache_events = [] with open(run_dir / "cache.jsonl") as f: for line in f: cache_events.append(json.loads(line)) hits = sum(1 for e in cache_events if e["hit"]) misses = sum(1 for e in cache_events if not e["hit"]) print(f"Hit rate: {hits}/{hits+misses} = {hits/(hits+misses)*100:.0f}%") Step 9: Cache Invalidation Strategies --------------------------------------- Effective caching requires a strategy for when to invalidate: **Strategy 1: Seed rotation** Change ``SCALABLE_SEED`` to invalidate all entries. Use this after major code changes that affect all functions: .. code-block:: bash export SCALABLE_SEED=$(date +%s) # New seed each day **Strategy 2: Per-function recompute** Set ``recompute=True`` on specific functions during development. Remove once verified: .. code-block:: python @cacheable(return_type=dict, recompute=True, params=dict) def run_demeter_scenario(params: dict) -> dict: ... **Strategy 3: Version the function name** Include a version suffix in the function name to naturally invalidate when logic changes: .. code-block:: python @cacheable(return_type=dict, params=dict) def run_demeter_scenario_v3(params: dict) -> dict: # v3: fixed fuel cost calculation ... **Strategy 4: Delete the cache directory** Nuclear option — simply remove the cache directory: .. code-block:: bash rm -rf ./cache python workflow.py # Full recomputation Step 10: Distributed Caching Pattern -------------------------------------- For team workflows where multiple developers run the same pipeline, use a shared remote cache: .. code-block:: bash # All team members set the same remote cache export SCALABLE_CACHE_REMOTE=s3://team-bucket/scalable-cache/ Workflow: 1. Developer A runs the pipeline. All 50 scenarios compute and cache remotely. 2. Developer B runs the same pipeline. All 50 scenarios hit the remote cache. 3. Developer A modifies scenario 7's parameters. Only scenario 7 recomputes. .. code-block:: python from scalable import ScalableSession session = ScalableSession.from_yaml("./scalable.yaml", target="local") client = session.start() # These will check local cache, then remote, then compute futures = [ client.submit(run_demeter_scenario, scenario, tag="demeter") for scenario in range(50) ] results = client.gather(futures) # First run: 50 misses. Subsequent runs: 50 hits. Troubleshooting --------------- **Cache never hits despite identical arguments** Check that ``SCALABLE_SEED`` hasn't changed between runs. Also verify that argument types are consistent — passing ``numpy.int64`` vs ``int`` may produce different keys. Use explicit type annotations. **"ValueError: File does not exist" from FileType** ``FileType`` validates file existence at hash time. Ensure the file path is accessible from the worker process (relevant for containerized workers where paths differ from the host). **Cache grows unboundedly** ``diskcache`` doesn't auto-evict by default. Periodically clean old entries: .. code-block:: python from diskcache import Cache cache = Cache("./cache") cache.clear() # Remove all entries # Or set a size limit: cache = Cache("./cache", size_limit=10 * 1024**3) # 10 GB **Remote cache is slow** S3/GCS lookups add latency per call (50–200ms). For workflows with thousands of small tasks, the overhead may exceed computation time. Use remote caching only for expensive tasks (>1 minute per call) or batch cache lookups. Next Steps ---------- * :ref:`tutorial_cloud_integration` — Deploy cached workflows to AWS/GCP with shared remote storage. * :ref:`tutorial_telemetry` — Analyze cache performance across historical runs. * :ref:`tutorial_error_handling` — Handle cache corruption and partial failures gracefully.