Beginner Tutorial 4: Caching — Avoiding Redundant Work
The Big Picture
Imagine you’ve run a 2-hour simulation pipeline and it fails on step 47 of 50. You fix the bug and re-run. Without caching, all 50 steps execute again — including the 46 that already succeeded. That’s hours of wasted computation.
Caching solves this by saving the results of completed work. On re-run, Scalable checks: “Have I already computed this exact function with these exact inputs?” If yes, it returns the saved result instantly. If no, it computes normally and saves the result for next time.
This tutorial explains how caching works from first principles — hashing, content-addressable storage, decorators, and the trade-offs involved.
What You Will Learn
By the end of this tutorial you will:
Understand what caching is and why it matters for scientific workflows.
Know how hash functions create “fingerprints” of data.
Understand content-addressable storage.
Use the @cacheable decorator in Scalable.
Handle file-based and directory-based inputs.
Configure local and remote cache storage.
Understand cache invalidation strategies.
Prerequisites
Completed Beginner Tutorial 1: Your First Workflow.
Scalable installed (pip install scalable).
For remote cache concepts: no cloud account needed (follow along).
Key Concepts Explained
💡 Key Concept: What is Caching?
Caching is storing the result of an expensive operation so you can reuse it later without recomputing. It trades storage space for computation time.
Real-world examples of caching:
Web browser cache — stores downloaded images/CSS so pages load faster on revisit
CPU cache — keeps frequently accessed memory close to the processor
DNS cache — remembers IP addresses so your computer doesn’t ask “what’s google.com’s address?” every time
In Scalable, caching means: “If I’ve already computed f(x) and saved
the result, don’t compute it again — just return the saved result.”
💡 Key Concept: Hash Functions
A hash function takes input of any size and produces a fixed-size “fingerprint.” Think of it as a one-way summarizer (the hash values below are illustrative and truncated):
Input: "Hello, World!" → Hash: 65a8e27d8879...
Input: "Hello, World!!" → Hash: 7f83b1657ff1... (totally different!)
Input: (500MB data file) → Hash: a3b8c9d2e1f0...
Key properties:
Deterministic — same input always produces same hash
Fixed size — output is always the same length regardless of input
Avalanche effect — tiny input change → completely different hash
One-way — you can’t reconstruct the input from the hash
In Scalable: When you call a cached function, Scalable hashes the function name + all arguments to create a unique key. If that key exists in the cache, the result is already known.
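The properties listed above can be checked directly with Python's standard hashlib module. This is a minimal sketch that assumes SHA-256; the tutorial does not say which hash function Scalable uses internally, and `fingerprint` is a name chosen only for this illustration.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()

a = fingerprint(b"Hello, World!")
b = fingerprint(b"Hello, World!")
c = fingerprint(b"Hello, World!!")
big = fingerprint(b"x" * 1_000_000)   # one million bytes of input

print(a == b)            # deterministic: same input, same digest
print(a == c)            # one extra character gives a different digest
print(len(a), len(big))  # fixed size: 64 hex characters for any input size
```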
💡 Key Concept: Content-Addressable Storage
Content-addressable storage (CAS) uses the content’s hash as its
address (filename/key). Instead of naming a file results_v3_final.json,
you name it sha256_a3b8f1c2d4e5.json.
Benefits:
Deduplication — identical content has the same hash, stored once
Verification — you can verify data hasn’t been corrupted by re-hashing and comparing
Immutability — content at a hash never changes (any change = different hash = different address)
Used by: Git (every commit, file, and tree is content-addressed), Docker (image layers), IPFS, and Scalable’s cache system.
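To make the idea concrete, here is a toy content-addressable store. It is a sketch under stated assumptions, not Scalable's storage layer: `ContentStore` is a hypothetical class, SHA-256 is an assumed hash, and the files live in a throwaway temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path

class ContentStore:
    """Toy content-addressable store: each blob is saved under its own SHA-256."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        path = self.root / address
        if not path.exists():  # deduplication: identical content is stored once
            path.write_bytes(content)
        return address

    def get(self, address: str) -> bytes:
        content = (self.root / address).read_bytes()
        # Verification: re-hash the stored bytes and compare with the address.
        if hashlib.sha256(content).hexdigest() != address:
            raise ValueError("stored content does not match its address")
        return content

store = ContentStore(Path(tempfile.mkdtemp()) / "cas")
addr1 = store.put(b'{"demand_mwh": 42042}')
addr2 = store.put(b'{"demand_mwh": 42042}')  # duplicate content, same address
addr3 = store.put(b'{"demand_mwh": 42043}')  # changed content, new address
stored_files = sorted(p.name for p in store.root.iterdir())
```

Storing the same bytes twice leaves a single file on disk, while any change to the bytes produces a new address rather than overwriting the old entry.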
💡 Key Concept: Memoization
Memoization is a specific caching technique for functions: remember the result of a function call based on its inputs.
# Without memoization:
result1 = expensive_function(42) # Takes 5 minutes
result2 = expensive_function(42) # Takes 5 minutes again!
# With memoization:
result1 = expensive_function(42) # Takes 5 minutes, saves result
result2 = expensive_function(42) # Instant! Returns saved result
Memoization requires determinism — the same inputs must always produce the same output. If your function depends on the current time, random numbers, or external state that changes, memoization won’t give correct results.
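A runnable version of the idea, using Python's built-in functools.lru_cache (in-memory only). The function `expensive_square` and the counter `call_count` are made up for this sketch; the counter simply shows how often the function body really executes.

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=None)
def expensive_square(x: int) -> int:
    """Stand-in for slow work; counts how often the body really runs."""
    global call_count
    call_count += 1
    return x * x

first = expensive_square(42)   # body runs
second = expensive_square(42)  # served from the in-memory cache
other = expensive_square(7)    # new input, so the body runs again
info = expensive_square.cache_info()

print(call_count)              # 2
print(info.hits, info.misses)  # 1 2
```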
💡 Key Concept: Python Decorators
A decorator is a Python pattern that wraps a function to add behavior
without changing the function’s code. Decorators use the @ syntax:
@some_decorator
def my_function(x):
    return x * 2
This is equivalent to:
def my_function(x):
    return x * 2
my_function = some_decorator(my_function)
The decorator receives your function and returns a new function that does something extra (like checking a cache before calling the original).
Common decorators you may have seen:
@property: makes a method behave like an attribute
@staticmethod: marks a method that doesn’t use self
@functools.lru_cache: Python’s built-in memoization
Scalable’s @cacheable is a decorator that adds persistent caching to
any function.
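For readers who have not written a decorator before, here is a small self-contained one. `log_calls` and `double` are hypothetical names for this sketch; the pattern (an inner wrapper that does extra work and then calls the original) is the same one a caching decorator would follow.

```python
import functools

def log_calls(func):
    """Decorator that records each call's arguments before running `func`."""
    calls = []

    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        calls.append((args, kwargs))   # the "extra behavior"
        return func(*args, **kwargs)   # then run the original function

    wrapper.calls = calls
    return wrapper

@log_calls
def double(x):
    """Return twice the input."""
    return x * 2

result = double(21)
print(result)        # 42
print(double.calls)  # [((21,), {})]
```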
Step 1: Basic Caching with @cacheable
Here’s how to make a function cacheable in Scalable:
from scalable import cacheable
@cacheable(return_type=dict, scenario_id=int)
def run_simulation(scenario_id: int) -> dict:
    """Expensive computation: runs an energy demand scenario."""
    import time
    time.sleep(5)  # Simulate 5 seconds of heavy computation
    return {
        "scenario_id": scenario_id,
        "demand_mwh": scenario_id * 1000 + 42,
        "status": "complete",
    }
What’s happening with that decorator?
@cacheable(return_type=dict, scenario_id=int) tells Scalable:
This function can be cached: wrap it with cache logic
Return type is dict: Scalable knows how to serialize/deserialize the result
scenario_id is type int: Scalable knows how to hash this argument deterministically
The type annotations help Scalable create reliable cache keys. Different
types hash differently (the integer 1 vs. the string "1" produce
different cache keys).
First call — cache miss (slow):
result = run_simulation(scenario_id=42)
# Takes 5 seconds — computes and saves to cache
print(result)  # {'scenario_id': 42, 'demand_mwh': 42042, 'status': 'complete'}
Second call — cache hit (instant):
result = run_simulation(scenario_id=42)
# Instant! Returns saved result from cache
print(result)  # {'scenario_id': 42, 'demand_mwh': 42042, 'status': 'complete'}
Under the Hood: What happens on each call
Cache miss (first call):
Scalable hashes: hash("run_simulation" + hash(42)) → key abc123
Looks up key abc123 in cache storage → not found
Calls the actual function → waits 5 seconds → gets result
Serializes the result and stores it at key abc123
Returns the result to you
Cache hit (second call):
Scalable hashes: hash("run_simulation" + hash(42)) → key abc123
Looks up key abc123 in cache storage → found!
Deserializes the stored result
Returns it immediately (no function execution)
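The miss and hit flows above can be imitated in a few lines. This is a rough sketch, not Scalable's implementation: `disk_cached`, `run_simulation_demo`, and `CACHE_DIR` are hypothetical, results are limited to JSON-compatible values, SHA-256 is an assumed hash, and only keyword arguments are supported.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # throwaway directory for this demo

def disk_cached(func):
    """Hypothetical stand-in for a persistent cache decorator (JSON results only)."""

    @functools.wraps(func)
    def wrapper(**kwargs):
        # 1. Build the key from the qualified function name and the arguments.
        payload = json.dumps(
            {"function": f"{func.__module__}.{func.__qualname__}", "args": kwargs},
            sort_keys=True,
        )
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        entry = CACHE_DIR / func.__name__ / f"{key}.json"

        # 2. Cache hit: deserialize and return without calling the function.
        if entry.exists():
            return json.loads(entry.read_text())

        # 3. Cache miss: compute, serialize, store, then return.
        result = func(**kwargs)
        entry.parent.mkdir(parents=True, exist_ok=True)
        entry.write_text(json.dumps(result))
        return result

    return wrapper

executions = []  # records every real execution of the function body

@disk_cached
def run_simulation_demo(scenario_id: int) -> dict:
    executions.append(scenario_id)
    return {"scenario_id": scenario_id, "demand_mwh": scenario_id * 1000 + 42}

miss = run_simulation_demo(scenario_id=42)  # computed and written to disk
hit = run_simulation_demo(scenario_id=42)   # read back from disk
print(executions)                           # [42]
```

The function body runs once even though it is called twice; the second call only reads the stored JSON file.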
Step 2: How Cache Keys Are Computed
The cache key is a hash of:
The function’s fully qualified name (module + function name)
The function’s arguments (each individually hashed)
# These produce DIFFERENT cache keys:
run_simulation(scenario_id=1) # key = hash(name + hash(1))
run_simulation(scenario_id=2) # key = hash(name + hash(2))
# These produce the SAME cache key:
run_simulation(scenario_id=42) # First call
run_simulation(scenario_id=42) # Same key → cache hit!
💡 Key Concept: Deterministic Hashing
For caching to work correctly, hashing must be deterministic — the same input must always produce the same hash.
This is why Scalable asks you to declare argument types. A Python dict
preserves insertion order (guaranteed since Python 3.7), so two dicts with
the same contents built in a different order would serialize differently;
Scalable ensures stability by sorting keys before hashing.
What can be hashed reliably:
Primitive types: int, float, str, bool
Collections: list, tuple, dict (with hashable contents)
Files: hashed by content (not filename!)
What can’t be hashed reliably:
Objects with mutable state
Functions/lambdas (their code might change)
Anything involving randomness or external state
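A short sketch of what deterministic hashing of arguments can look like. It assumes a canonical JSON form with sorted keys fed into SHA-256; `stable_hash` is a hypothetical helper, and Scalable's real key format may differ. Python's built-in hash() is not suitable for this job because its value for strings is salted per process.

```python
import hashlib
import json

def stable_hash(value) -> str:
    """Hash a JSON-compatible value via a canonical text form."""
    # sort_keys makes the text independent of dict insertion order.
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

config_a = {"region": "west", "year": 2030}
config_b = {"year": 2030, "region": "west"}  # same content, different order

same_dict = stable_hash(config_a) == stable_hash(config_b)
int_vs_str = stable_hash(1) == stable_hash("1")

print(same_dict)   # True: key order does not affect the hash
print(int_vs_str)  # False: the integer 1 and the string "1" hash differently
```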
Step 3: Handling File Inputs
Scientific workflows often take files as input. Scalable provides special types for file-based hashing:
from scalable import cacheable
from scalable.caching import FileType, DirType
@cacheable(return_type=dict, input_file=FileType, config=dict)
def process_data(input_file: str, config: dict) -> dict:
    """Process a data file according to config."""
    with open(input_file) as f:
        data = f.read()
    # ... processing ...
    return {"rows": len(data.splitlines()), "config": config}
💡 Key Concept: FileType and Content Hashing
When you annotate an argument as FileType, Scalable hashes the
file’s contents (not its path or name).
Why? Because:
Same file at different paths = same computation = should cache-hit
Same path with different contents = different computation = should NOT cache-hit
process_data("/data/input_v1.csv", ...) # Hashes CSV content
process_data("/tmp/copy_of_v1.csv", ...) # Same content → cache hit!
# (even though the path is different)
DirType works similarly but hashes all files in a directory
(recursively).
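Content hashing of files and directories can be sketched with the standard library alone. `hash_file` and `hash_dir` are hypothetical helpers, not Scalable functions. In particular, mixing relative file names into the directory digest is an assumption of this sketch; the tutorial does not say whether DirType does the same.

```python
import hashlib
import tempfile
from pathlib import Path

def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's bytes, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hash_dir(root: Path) -> str:
    """Combine relative paths and file digests, in sorted order, into one digest."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(hash_file(path).encode("ascii"))
    return digest.hexdigest()

workdir = Path(tempfile.mkdtemp())
original = workdir / "input_v1.csv"
copy = workdir / "copy_of_v1.csv"
original.write_text("id,value\n1,10\n")
copy.write_text("id,value\n1,10\n")

same_content = hash_file(original) == hash_file(copy)  # different path, same bytes
before = hash_file(original)
original.write_text("id,value\n1,11\n")                # same path, new bytes
changed = hash_file(original) != before
dir_digest = hash_dir(workdir)

print(same_content, changed)  # True True
```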
Step 4: Cache Storage Configuration
By default, Scalable stores cache entries on local disk:
# In scalable.yaml
project:
  name: my-project
  local_cache: ./cache # Cache stored here
The cache directory structure looks like:
./cache/
├── run_simulation/
│ ├── abc123.json # Cached result for scenario_id=42
│ ├── def456.json # Cached result for scenario_id=7
│ └── ...
└── process_data/
    ├── 789ghi.json
    └── ...
For team collaboration or cloud workflows, you can use remote storage:
project:
  name: my-project
  local_cache: s3://my-bucket/scalable-cache/
💡 Key Concept: Local vs. Remote Cache
Local cache (filesystem):
Fast (no network latency)
Private (only you can access)
Lost if machine is destroyed
Remote cache (S3, GCS):
Shared across team members and CI/CD
Persistent (survives machine changes)
Slower (network round-trip for every lookup)
Costs money (storage + requests)
When to use remote cache: When your team runs the same pipeline and you want to share cached results. For example, Person A computes scenarios 1–500; Person B starts from 501 but can reuse A’s cached results for 1–500 instead of recomputing them.
Step 5: Cache Invalidation
💡 Key Concept: Cache Invalidation
There’s a famous saying in computer science:
“There are only two hard things in Computer Science: cache invalidation and naming things.” — Phil Karlton
Cache invalidation means deciding when cached results are no longer valid. A result becomes invalid when:
The function’s logic changes (you fixed a bug)
An input file’s content changes
External dependencies update (new library version)
You explicitly want fresh results
Scalable handles invalidation in several ways:
Automatic invalidation (content-based):
File inputs are hashed by content → changed file = different key = no hit
Function arguments change → different key = no hit
Manual invalidation:
# Clear all cache for a project
rm -rf ./cache/
# Clear cache for a specific function
rm -rf ./cache/run_simulation/
Selective re-computation:
# Force re-computation even if cached
result = run_simulation(scenario_id=42, _cache_bypass=True)
🤔 Think About It
What happens if you change the function’s code but not its inputs?
By default, Scalable hashes the function name, not its code. So
if you fix a bug in run_simulation, the cache key is the same and
you’ll get stale results!
Solution: Clear the cache after code changes, or use versioning:
@cacheable(return_type=dict, scenario_id=int, _version="2")
def run_simulation(scenario_id: int) -> dict:
    # Fixed bug: _version="2" creates different cache keys
    ...
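One way to picture why a version label helps: if the label is part of the hashed payload, bumping it changes every key, so old entries are simply never looked up again. The sketch below uses a hypothetical `cache_key` helper with SHA-256; how Scalable actually folds `_version` into its keys is not specified here.

```python
import hashlib
import json

def cache_key(function_name: str, arguments: dict, version: str = "1") -> str:
    """Hypothetical key builder that mixes a version label into the hash."""
    payload = json.dumps(
        {"function": function_name, "args": arguments, "version": version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_v1 = cache_key("run_simulation", {"scenario_id": 42}, version="1")
key_v2 = cache_key("run_simulation", {"scenario_id": 42}, version="2")
key_v2_again = cache_key("run_simulation", {"scenario_id": 42}, version="2")

print(key_v1 == key_v2)        # False: new version, results from v1 are ignored
print(key_v2 == key_v2_again)  # True: within one version, caching still works
```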
Step 6: Monitoring Cache Performance
Scalable records cache hit/miss events in telemetry:
scalable report --last
Cache Performance:
Total lookups: 200
Hits: 180 (90%)
Misses: 20 (10%)
Time saved: ~15 minutes (estimated from hit count × avg task duration)
A high hit rate (>80%) means caching is working well. A low hit rate might mean:
Inputs are always changing (cache keys never match)
The cache was recently cleared
Tasks aren’t deterministic
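The figures in the sample report can be reproduced with simple arithmetic. The average task duration of 5 seconds is an assumption chosen for this sketch (it matches the sleep in Step 1); the report itself only states that the estimate is hit count times average task duration.

```python
hits, misses = 180, 20
avg_task_seconds = 5.0  # assumed average duration of one task

total = hits + misses
hit_rate = hits / total
time_saved_minutes = hits * avg_task_seconds / 60

print(f"Total lookups: {total}")                      # Total lookups: 200
print(f"Hit rate: {hit_rate:.0%}")                    # Hit rate: 90%
print(f"Time saved: ~{time_saved_minutes:.0f} min")   # Time saved: ~15 min
```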
💡 Key Concept: Serialization
Serialization converts a Python object into bytes that can be stored on disk or sent over a network. Deserialization converts bytes back into a Python object.
Common serialization formats:
JSON — human-readable, limited types (no sets, dates, custom objects)
Pickle — Python-native, supports any object, not human-readable
MessagePack — fast binary format, limited types
Scalable uses JSON for simple types (dicts, lists, strings) and pickle
for complex objects. The return_type annotation in @cacheable
helps Scalable choose the best serialization strategy.
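The difference between the two formats shows up quickly in a round trip. This sketch uses only the standard json and pickle modules and makes no claim about how Scalable invokes them; the variable names are illustrative.

```python
import json
import pickle

result = {"scenario_id": 42, "demand_mwh": 42042, "status": "complete"}

# JSON: readable text that round-trips plain dicts, lists, strings, and numbers.
as_json = json.dumps(result)
from_json = json.loads(as_json)

# JSON cannot encode a set; json.dumps raises TypeError.
regions = {"west", "east"}
try:
    json.dumps(regions)
    json_handles_set = True
except TypeError:
    json_handles_set = False

# Pickle: opaque bytes, but the set comes back unchanged.
from_pickle = pickle.loads(pickle.dumps(regions))

print(from_json == result)     # True
print(json_handles_set)        # False
print(from_pickle == regions)  # True
```

Unpickling can execute arbitrary code, so pickled cache entries should only be loaded from storage you trust.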
Common Questions
Q: Does caching use a lot of disk space?
It depends on your output sizes. Small results (numbers, short strings) use negligible space. Large results (DataFrames, arrays) can grow quickly. Monitor your cache directory size and set up periodic cleanup for old entries.
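A periodic cleanup could look like the sketch below. It assumes a local cache made of ordinary files and uses file modification time as the age signal; `remove_old_entries` is a hypothetical helper, not a Scalable command, and the demo runs against a throwaway directory.

```python
import os
import tempfile
import time
from pathlib import Path

def remove_old_entries(cache_dir: Path, max_age_days: float) -> list[Path]:
    """Delete cache files whose modification time is older than the cutoff."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    # Materialize the listing first so deletions do not disturb the traversal.
    for path in list(cache_dir.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed

# Demo: one fresh entry and one entry backdated by 40 days.
cache_dir = Path(tempfile.mkdtemp())
fresh = cache_dir / "run_simulation" / "abc123.json"
stale = cache_dir / "run_simulation" / "def456.json"
fresh.parent.mkdir(parents=True)
fresh.write_text("{}")
stale.write_text("{}")
forty_days_ago = time.time() - 40 * 86400
os.utime(stale, (forty_days_ago, forty_days_ago))

removed = remove_old_entries(cache_dir, max_age_days=30)
print([p.name for p in removed])  # ['def456.json']
```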
Q: What if two people compute the same thing simultaneously?
With local cache, they each compute independently. With remote cache (S3), the second writer overwrites the first — but since the result is deterministic, they’re writing the same value, so it’s safe.
Q: Can I cache functions that return different results each time?
No! Caching assumes determinism — same inputs → same output. If your function involves randomness, time-dependence, or external state that changes, caching will return stale/incorrect results.
Q: What’s the difference between Scalable’s cache and Python’s functools.lru_cache?
lru_cache stores results in memory (lost when program exits)
@cacheable stores results on disk or remote storage (persistent across runs)
Scalable’s caching is designed for expensive computations that span multiple program invocations.
Q: Can I cache only some invocations?
Yes. By default the @cacheable decorator checks the cache on every call. To
skip the cache for a specific call, pass _cache_bypass=True.
What You Learned
| Term | Definition |
|---|---|
| Caching | Storing results for reuse to avoid recomputation |
| Hash Function | Produces a fixed-size fingerprint from arbitrary input |
| Content-Addressable Storage | Data addressed by its content’s hash, not by name |
| Memoization | Caching function results based on inputs |
| Decorator | Python pattern that wraps a function to add behavior |
| Cache Key | Unique identifier for a cached result (hash of function + args) |
| Cache Hit | Result found in cache (fast, no recomputation) |
| Cache Miss | Result NOT found, must compute and store |
| Cache Invalidation | Deciding when cached results are no longer valid |
| Serialization | Converting objects to bytes for storage/transmission |
| Determinism | Same inputs always produce the same output |
| FileType | Annotation telling Scalable to hash file contents, not path |
Next Steps
You now understand how caching works and can use it to avoid redundant computation in your workflows.
Next beginner tutorial: Beginner Tutorial 5: Cloud Computing Fundamentals — running workflows in the cloud
Standard tutorial: Tutorial 4: Performance Optimization and Caching — advanced caching patterns, remote configuration, and cache management
Try it: Add @cacheable to a function, run it twice, and check the ./cache/ directory to see the stored results. Modify an input and verify you get a cache miss.