Beginner Tutorial 4: Caching — Avoiding Redundant Work

The Big Picture

Imagine you’ve run a 2-hour simulation pipeline and it fails on step 47 of 50. You fix the bug and re-run. Without caching, all 50 steps execute again — including the 46 that already succeeded. That’s hours of wasted computation.

Caching solves this by saving the results of completed work. On re-run, Scalable checks: “Have I already computed this exact function with these exact inputs?” If yes, it returns the saved result instantly. If no, it computes normally and saves the result for next time.

This tutorial explains how caching works from first principles — hashing, content-addressable storage, decorators, and the trade-offs involved.

What You Will Learn

By the end of this tutorial you will:

Understand what caching is and why it matters for scientific workflows.
Know how hash functions create “fingerprints” of data.
Understand content-addressable storage.
Use the @cacheable decorator in Scalable.
Handle file-based and directory-based inputs.
Configure local and remote cache storage.
Understand cache invalidation strategies.

Prerequisites

Completed Beginner Tutorial 1: Your First Workflow.
Scalable installed (pip install scalable).
No cloud account is needed for the remote cache concepts; you can simply follow along.

Key Concepts Explained

💡 Key Concept: What is Caching?

Caching is storing the result of an expensive operation so you can reuse it later without recomputing. It trades storage space for computation time.

Real-world examples of caching:

Web browser cache — stores downloaded images/CSS so pages load faster on revisit
CPU cache — keeps frequently accessed memory close to the processor
DNS cache — remembers IP addresses so your computer doesn’t ask “what’s google.com’s address?” every time

In Scalable, caching means: “If I’ve already computed f(x) and saved the result, don’t compute it again — just return the saved result.”

💡 Key Concept: Hash Functions

A hash function takes input of any size and produces a fixed-size “fingerprint.” Think of it as a one-way summarizer (the hash values below are truncated and illustrative):

Input: "Hello, World!"     → Hash: 65a8e27d8879...
Input: "Hello, World!!"    → Hash: 7f83b1657ff1...  (totally different!)
Input: (500MB data file)   → Hash: a3b8c9d2e1f0...

Key properties:

Deterministic — same input always produces same hash
Fixed size — output is always the same length regardless of input
Avalanche effect — tiny input change → completely different hash
One-way — you can’t reconstruct the input from the hash

In Scalable: When you call a cached function, Scalable hashes the function name + all arguments to create a unique key. If that key exists in the cache, the result is already known.
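
The properties listed above can be checked with Python's standard hashlib module. This is a minimal sketch that uses SHA-256; the tutorial does not say which hash algorithm Scalable uses internally, so both the algorithm choice and the helper name fingerprint are assumptions made for illustration.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


a = fingerprint(b"Hello, World!")
b = fingerprint(b"Hello, World!")    # the same input again
c = fingerprint(b"Hello, World!!")   # one extra character
d = fingerprint(b"x" * 1_000_000)    # a much larger input

print(a == b)          # True: identical inputs give identical digests
print(len(a), len(d))  # 64 64: the digest length does not depend on input size
print(a == c)          # False: a one-character change gives a different digest
```

The one-way property cannot be demonstrated by running code; it is a design goal of cryptographic hash functions such as SHA-256.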

💡 Key Concept: Content-Addressable Storage

Content-addressable storage (CAS) uses the content’s hash as its address (filename/key). Instead of naming a file results_v3_final.json, you name it sha256_a3b8f1c2d4e5.json.

Benefits:

Deduplication — identical content has the same hash, stored once
Verification — you can verify data hasn’t been corrupted by re-hashing and comparing
Immutability — content at a hash never changes (any change = different hash = different address)

Used by: Git (every commit, file, and tree is content-addressed), Docker (image layers), IPFS, and Scalable’s cache system.
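
To make the idea concrete, here is a toy content-addressable store built on a temporary directory. It is only a sketch of the general technique, not how Scalable, Git, or Docker lay out their storage; the function names cas_put and cas_get are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def cas_put(store: Path, content: bytes) -> str:
    """Store content under its own SHA-256 digest and return that digest."""
    digest = hashlib.sha256(content).hexdigest()
    (store / digest).write_bytes(content)  # identical content lands on the same file
    return digest


def cas_get(store: Path, digest: str) -> bytes:
    """Fetch content by digest and verify it still matches its address."""
    content = (store / digest).read_bytes()
    if hashlib.sha256(content).hexdigest() != digest:
        raise ValueError(f"content at {digest} does not match its address")
    return content


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    key1 = cas_put(store, b'{"demand_mwh": 42042}')
    key2 = cas_put(store, b'{"demand_mwh": 42042}')  # duplicate content
    key3 = cas_put(store, b'{"demand_mwh": 7042}')   # different content
    stored_files = len(list(store.iterdir()))
    roundtrip = cas_get(store, key1)

print(key1 == key2)   # True: same content, same address (deduplication)
print(stored_files)   # 2: three puts produced only two files
```

The re-hash inside cas_get is the verification benefit in action: a corrupted file would no longer match the digest it is stored under.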

💡 Key Concept: Memoization

Memoization is a specific caching technique for functions: remember the result of a function call based on its inputs.

# Without memoization:
result1 = expensive_function(42)   # Takes 5 minutes
result2 = expensive_function(42)   # Takes 5 minutes again!

# With memoization:
result1 = expensive_function(42)   # Takes 5 minutes, saves result
result2 = expensive_function(42)   # Instant! Returns saved result

Memoization requires determinism — the same inputs must always produce the same output. If your function depends on the current time, random numbers, or external state that changes, memoization won’t give correct results.
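
A hand-written version of the pattern needs nothing more than a dictionary. The sketch below uses a cheap stand-in function (slow_square, a hypothetical name) and counts real executions to show that the second call never reaches it.

```python
calls = {"count": 0}   # how many times the real function actually ran
saved_results = {}     # maps an input to its previously computed output


def slow_square(n: int) -> int:
    """Stand-in for an expensive, deterministic function."""
    calls["count"] += 1
    return n * n


def memoized_square(n: int) -> int:
    """Return a saved result when one exists; otherwise compute and save it."""
    if n not in saved_results:
        saved_results[n] = slow_square(n)
    return saved_results[n]


first = memoized_square(42)    # computed
second = memoized_square(42)   # served from saved_results
print(first, second, calls["count"])   # 1764 1764 1
```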

💡 Key Concept: Python Decorators

A decorator is a Python pattern that wraps a function to add behavior without changing the function’s code. Decorators use the @ syntax:

@some_decorator
def my_function(x):
    return x * 2

This is equivalent to:

def my_function(x):
    return x * 2
my_function = some_decorator(my_function)

The decorator receives your function and returns a new function that does something extra (like checking a cache before calling the original).

Common decorators you may have seen:

@property — makes a method behave like an attribute
@staticmethod — marks a method that doesn’t use self
@functools.lru_cache — Python’s built-in memoization

Scalable’s @cacheable is a decorator that adds persistent caching to any function.
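
To see what "wrapping a function" looks like in practice, here is a small decorator that counts calls. The name count_calls is made up for this example; functools.wraps is the standard-library helper that keeps the wrapped function's name and docstring intact.

```python
import functools


def count_calls(func):
    """Decorator that counts how many times the wrapped function runs."""
    @functools.wraps(func)  # preserve func.__name__ and func.__doc__
    def wrapper(*args, **kwargs):
        wrapper.calls += 1            # the extra behavior
        return func(*args, **kwargs)  # then delegate to the original

    wrapper.calls = 0
    return wrapper


@count_calls
def double(x):
    """Return twice the input."""
    return x * 2


double(3)
double(5)
print(double.calls, double.__name__)   # 2 double
```

A caching decorator follows the same shape: the wrapper checks for a saved result before deciding whether to call the original function at all.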

Step 1: Basic Caching with @cacheable

Here’s how to make a function cacheable in Scalable:

from scalable import cacheable


@cacheable(return_type=dict, scenario_id=int)
def run_simulation(scenario_id: int) -> dict:
    """Expensive computation — runs an energy demand scenario."""
    import time
    time.sleep(5)  # Simulate 5 seconds of heavy computation
    return {
        "scenario_id": scenario_id,
        "demand_mwh": scenario_id * 1000 + 42,
        "status": "complete",
    }

What’s happening with that decorator?

@cacheable(return_type=dict, scenario_id=int) tells Scalable:

This function can be cached — wrap it with cache logic
Return type is dict — Scalable knows how to serialize/deserialize the result
scenario_id is type int — Scalable knows how to hash this argument deterministically

The type declarations passed to the decorator help Scalable create reliable cache keys. Different types hash differently, so the integer 1 and the string "1" produce different cache keys.

First call — cache miss (slow):

result = run_simulation(scenario_id=42)
# Takes 5 seconds — computes and saves to cache
print(result)  # {'scenario_id': 42, 'demand_mwh': 42042, 'status': 'complete'}

Second call — cache hit (instant):

result = run_simulation(scenario_id=42)
# Instant! Returns saved result from cache
print(result)  # {'scenario_id': 42, 'demand_mwh': 42042, 'status': 'complete'}

Under the Hood: What happens on each call

Cache miss (first call):

Scalable hashes: hash("run_simulation" + hash(42)) → key abc123
Looks up key abc123 in cache storage → not found
Calls the actual function → waits 5 seconds → gets result
Serializes the result and stores it at key abc123
Returns the result to you

Cache hit (second call):

Scalable hashes: hash("run_simulation" + hash(42)) → key abc123
Looks up key abc123 in cache storage → found!
Deserializes the stored result
Returns it immediately (no function execution)
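
The miss and hit sequences above can be imitated in a few lines of plain Python. This is a toy decorator written for illustration and is not Scalable's implementation: the name disk_cached, the JSON-only storage, the keyword-only calling convention, and the temporary cache directory are all assumptions.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # hypothetical cache location


def disk_cached(func):
    """Toy persistent cache for functions called with keyword arguments."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        # Hash the function name plus its arguments into a key.
        payload = json.dumps([func.__qualname__, kwargs], sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        entry = CACHE_DIR / f"{key}.json"
        # Cache hit: deserialize and return without running the function.
        if entry.exists():
            return json.loads(entry.read_text())
        # Cache miss: run the function, serialize the result, store it.
        result = func(**kwargs)
        entry.write_text(json.dumps(result))
        return result
    return wrapper


runs = []  # records each real execution of the function body


@disk_cached
def toy_simulation(scenario_id):
    runs.append(scenario_id)
    return {"scenario_id": scenario_id, "demand_mwh": scenario_id * 1000 + 42}


first = toy_simulation(scenario_id=42)   # miss: the function body runs
second = toy_simulation(scenario_id=42)  # hit: loaded from disk
print(first == second, len(runs))        # True 1
```

Because the entry lives on disk, a later process pointed at the same directory would also get a hit, which is the difference from in-memory memoization.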

Step 2: How Cache Keys Are Computed

The cache key is a hash of:

The function’s fully qualified name (module + function name)
The function’s arguments (each individually hashed)

# These produce DIFFERENT cache keys:
run_simulation(scenario_id=1)    # key = hash(name + hash(1))
run_simulation(scenario_id=2)    # key = hash(name + hash(2))

# These produce the SAME cache key:
run_simulation(scenario_id=42)   # First call
run_simulation(scenario_id=42)   # Same key → cache hit!

💡 Key Concept: Deterministic Hashing

For caching to work correctly, hashing must be deterministic — the same input must always produce the same hash.

This is why Scalable asks you to declare argument types. A Python dict preserves insertion order (guaranteed since Python 3.7), so two dicts with identical contents can still iterate in different orders if they were built differently. Scalable ensures stability by sorting keys before hashing.

What can be hashed reliably:

Primitive types: int, float, str, bool
Collections: list, tuple, dict (with hashable contents)
Files: hashed by content (not filename!)

What can’t be hashed reliably:

Objects with mutable state
Functions/lambdas (their code might change)
Anything involving randomness or external state
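
The sorted-keys idea can be sketched with the standard library. Note that Python's built-in hash() is not suitable here, because string hashes are salted per process by default; a digest over a canonical serialization is stable across runs. The helper stable_key below is hypothetical and only handles JSON-compatible values.

```python
import hashlib
import json


def stable_key(value) -> str:
    """Hash a JSON-compatible value with dict keys in sorted order."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


config_a = {"region": "west", "year": 2030}
config_b = {"year": 2030, "region": "west"}  # same contents, different insertion order

print(config_a == config_b)                          # True: equal as dicts
print(stable_key(config_a) == stable_key(config_b))  # True: equal keys after sorting
print(stable_key(1) == stable_key("1"))              # False: int 1 and str "1" differ
```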

Step 3: Handling File Inputs

Scientific workflows often take files as input. Scalable provides special types for file-based hashing:

from scalable import cacheable
from scalable.caching import FileType, DirType


@cacheable(return_type=dict, input_file=FileType, config=dict)
def process_data(input_file: str, config: dict) -> dict:
    """Process a data file according to config."""
    with open(input_file) as f:
        data = f.read()
    # ... processing ...
    return {"rows": len(data.splitlines()), "config": config}

💡 Key Concept: FileType and Content Hashing

When you annotate an argument as FileType, Scalable hashes the file’s contents (not its path or name).

Why? Because:

Same file at different paths = same computation = should cache-hit
Same path with different contents = different computation = should NOT cache-hit

process_data("/data/input_v1.csv", ...)   # Hashes CSV content
process_data("/tmp/copy_of_v1.csv", ...)  # Same content → cache hit!
# (even though the path is different)

DirType works similarly but hashes all files in a directory (recursively).
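
Content hashing of files and directories can be sketched as follows. This illustrates the general technique rather than Scalable's FileType or DirType code; in particular, mixing each file's relative path into the directory hash is a design choice made for this example, and the tutorial does not say whether DirType does the same.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_dir(root: Path) -> str:
    """Hash every file under root, visiting them in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Assumption: include the relative path so renames change the result.
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(hash_file(path).encode())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    original = base / "input_v1.csv"
    copy = base / "copy_of_v1.csv"
    edited = base / "edited.csv"
    original.write_text("id,value\n1,10\n")
    copy.write_text("id,value\n1,10\n")
    edited.write_text("id,value\n1,11\n")

    same_content = hash_file(original) == hash_file(copy)
    changed_content = hash_file(original) == hash_file(edited)
    dir_before = hash_dir(base)
    edited.write_text("id,value\n1,12\n")  # change one file in the directory
    dir_after = hash_dir(base)

print(same_content, changed_content, dir_before == dir_after)   # True False False
```

Sorting the paths matters for the same reason sorting dict keys does: filesystem listing order is not guaranteed, and the hash must not depend on it.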

Step 4: Cache Storage Configuration

By default, Scalable stores cache entries on local disk:

# In scalable.yaml
project:
  name: my-project
  local_cache: ./cache    # Cache stored here

The cache directory structure looks like:

./cache/
├── run_simulation/
│   ├── abc123.json       # Cached result for scenario_id=42
│   ├── def456.json       # Cached result for scenario_id=7
│   └── ...
└── process_data/
    ├── 789ghi.json
    └── ...

For team collaboration or cloud workflows, you can use remote storage:

project:
  name: my-project
  local_cache: s3://my-bucket/scalable-cache/

💡 Key Concept: Local vs. Remote Cache

Local cache (filesystem):

Fast (no network latency)
Private (only you can access)
Lost if machine is destroyed

Remote cache (S3, GCS):

Shared across team members and CI/CD
Persistent (survives machine changes)
Slower (network round-trip for every lookup)
Costs money (storage + requests)

When to use remote cache: When your team runs the same pipeline and you want to share cached results. For example, if Person A has already computed scenarios 1–500, Person B's run of the same pipeline gets cache hits for those scenarios and only has to compute from 501 onward.

Step 5: Cache Invalidation

💡 Key Concept: Cache Invalidation

There’s a famous saying in computer science:

“There are only two hard things in Computer Science: cache invalidation and naming things.” — Phil Karlton

Cache invalidation means deciding when cached results are no longer valid. A result becomes invalid when:

The function’s logic changes (you fixed a bug)
An input file’s content changes
External dependencies update (new library version)
You explicitly want fresh results

Scalable handles invalidation in several ways:

Automatic invalidation (content-based):

File inputs are hashed by content → changed file = different key = no hit
Function arguments change → different key = no hit

Manual invalidation:

# Clear all cache for a project
rm -rf ./cache/

# Clear cache for a specific function
rm -rf ./cache/run_simulation/

Selective re-computation:

# Force re-computation even if cached
result = run_simulation(scenario_id=42, _cache_bypass=True)

🤔 Think About It

What happens if you change the function’s code but not its inputs?

By default, Scalable hashes the function name, not its code. So if you fix a bug in run_simulation, the cache key is the same and you’ll get stale results!

Solution: Clear the cache after code changes, or use versioning:

@cacheable(return_type=dict, scenario_id=int, _version="2")
def run_simulation(scenario_id: int) -> dict:
    # Fixed bug — _version="2" creates different cache keys
    ...
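
Why a version tag works can be shown with a small key-building sketch. The function make_key and the way it combines its inputs are assumptions for illustration; the point is only that any change to the hashed payload, including a version string, yields a different key.

```python
import hashlib
import json


def make_key(func_name: str, args: dict, version: str = "1") -> str:
    """Build a cache key from a function name, its arguments, and a version tag."""
    payload = json.dumps(
        {"func": func_name, "args": args, "version": version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


old_key = make_key("run_simulation", {"scenario_id": 42}, version="1")
new_key = make_key("run_simulation", {"scenario_id": 42}, version="2")
same_key = make_key("run_simulation", {"scenario_id": 42}, version="2")

print(old_key == new_key)    # False: bumping the version changes the key
print(new_key == same_key)   # True: the same version still maps to the same entry
```

Entries written under the old version are not deleted by this scheme; they are simply never looked up again, so they still occupy storage until cleared.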

Step 6: Monitoring Cache Performance

Scalable records cache hit/miss events in telemetry:

scalable report --last

Cache Performance:
  Total lookups: 200
  Hits: 180 (90%)
  Misses: 20 (10%)
  Time saved: ~15 minutes (estimated from hit count × avg task duration)

A high hit rate (>80%) means caching is working well. A low hit rate might mean:

Inputs are always changing (cache keys never match)
The cache was recently cleared
Tasks aren’t deterministic
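
The figures in the sample report follow from simple arithmetic. In the sketch below the hit and miss counts come from the report above, while the 5-second average task duration is an assumption chosen because it reproduces the reported estimate of about 15 minutes.

```python
hits = 180
misses = 20
avg_task_seconds = 5.0  # assumed average duration of one uncached task

lookups = hits + misses
hit_rate = hits / lookups
time_saved_minutes = hits * avg_task_seconds / 60

print(f"Hit rate: {hit_rate:.0%}")                                # Hit rate: 90%
print(f"Estimated time saved: {time_saved_minutes:.0f} minutes")  # Estimated time saved: 15 minutes
```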

💡 Key Concept: Serialization

Serialization converts a Python object into bytes that can be stored on disk or sent over a network. Deserialization converts bytes back into a Python object.

Common serialization formats:

JSON — human-readable, limited types (no sets, dates, custom objects)
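Pickle — Python-native, supports most Python objects (but not, for example, lambdas or open file handles), not human-readable, and unsafe to load from untrusted sources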
MessagePack — fast binary format, limited types

Scalable uses JSON for simple types (dicts, lists, strings) and pickle for complex objects. The return_type annotation in @cacheable helps Scalable choose the best serialization strategy.
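
The trade-off between the two formats shows up in a short round-trip using the standard json and pickle modules. This only demonstrates the formats themselves and makes no claim about how Scalable invokes them.

```python
import json
import pickle

result = {"scenario_id": 42, "regions": ("west", "east")}

# JSON round-trip: readable text, but tuples come back as lists.
from_json = json.loads(json.dumps(result))

# Pickle round-trip: opaque bytes, but the tuple survives unchanged.
from_pickle = pickle.loads(pickle.dumps(result))

# JSON rejects types it has no representation for, such as sets.
try:
    json.dumps({"tags": {"a", "b"}})
    json_accepts_sets = True
except TypeError:
    json_accepts_sets = False

print(from_json["regions"])    # ['west', 'east']
print(from_pickle["regions"])  # ('west', 'east')
print(json_accepts_sets)       # False
```

One caution when a cache is shared: unpickling data can execute arbitrary code, so pickled entries should only be loaded from storage you trust.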

Common Questions

Q: Does caching use a lot of disk space?

It depends on your output sizes. Small results (numbers, short strings) use negligible space. Large results (DataFrames, arrays) can grow quickly. Monitor your cache directory size and set up periodic cleanup for old entries.
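
Monitoring and cleanup can be done with a short script. The sketch below is not a Scalable feature: the helper names, the 7-day threshold, and the use of file modification time as the age criterion are assumptions, and it runs against a temporary directory rather than a real cache.

```python
import os
import tempfile
import time
from pathlib import Path


def cache_size_bytes(cache_dir: Path) -> int:
    """Total size of all files under cache_dir."""
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())


def remove_older_than(cache_dir: Path, max_age_seconds: float) -> int:
    """Delete files not modified within max_age_seconds; return the number removed."""
    cutoff = time.time() - max_age_seconds
    removed = 0
    for path in list(cache_dir.rglob("*")):  # snapshot before deleting
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    fresh = cache / "fresh.json"
    stale = cache / "stale.json"
    fresh.write_text("{}")
    stale.write_text("{}")
    # Backdate one entry by 30 days to simulate an old cache file.
    month_ago = time.time() - 30 * 24 * 3600
    os.utime(stale, (month_ago, month_ago))

    size_before = cache_size_bytes(cache)
    removed = remove_older_than(cache, max_age_seconds=7 * 24 * 3600)
    remaining = sorted(p.name for p in cache.iterdir())

print(size_before, removed, remaining)   # 4 1 ['fresh.json']
```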

Q: What if two people compute the same thing simultaneously?

With local cache, they each compute independently. With remote cache (S3), the second writer overwrites the first — but since the result is deterministic, they’re writing the same value, so it’s safe.

Q: Can I cache functions that return different results each time?

No! Caching assumes determinism — same inputs → same output. If your function involves randomness, time-dependence, or external state that changes, caching will return stale/incorrect results.

Q: What’s the difference between Scalable’s cache and Python’s functools.lru_cache?

lru_cache stores results in memory (lost when program exits)
@cacheable stores results on disk or remote storage (persistent across runs)

Scalable’s caching is designed for expensive computations that span multiple program invocations.
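
For comparison, here is functools.lru_cache on a small function. Its cache_info() method reports hits and misses, but everything it stores disappears when the process exits.

```python
import functools


@functools.lru_cache(maxsize=None)
def cube(n: int) -> int:
    """Deterministic function cached in memory for this process only."""
    return n ** 3


cube(3)   # miss: computed
cube(3)   # hit: returned from memory
cube(4)   # miss: new argument

info = cube.cache_info()
print(info.hits, info.misses, info.currsize)   # 1 2 2
```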

Q: Can I cache only some invocations?

Yes — the @cacheable decorator checks the cache on every call. If you want to bypass it for specific calls, use _cache_bypass=True.

What You Learned

Term	Definition
Caching	Storing results for reuse to avoid recomputation
Hash Function	Produces a fixed-size fingerprint from arbitrary input
Content-Addressable Storage	Data addressed by its content’s hash, not by name
Memoization	Caching function results based on inputs
Decorator	Python pattern that wraps a function to add behavior
Cache Key	Unique identifier for a cached result (hash of function + args)
Cache Hit	Result found in cache (fast, no recomputation)
Cache Miss	Result NOT found, must compute and store
Cache Invalidation	Deciding when cached results are no longer valid
Serialization	Converting objects to bytes for storage/transmission
Determinism	Same inputs always produce the same output
FileType	Annotation telling Scalable to hash file contents, not path

Next Steps

You now understand how caching works and can use it to avoid redundant computation in your workflows.

Next beginner tutorial: Beginner Tutorial 5: Cloud Computing Fundamentals — running workflows in the cloud
Standard tutorial: Tutorial 4: Performance Optimization and Caching — advanced caching patterns, remote configuration, and cache management
Try it: Add @cacheable to a function, run it twice, and check the ./cache/ directory to see the stored results. Modify an input and verify you get a cache miss.