Beginner Tutorial 6: Understanding What Happened
The Big Picture
You’ve run a workflow. It completed. But did it perform well? Were some tasks slower than expected? Did workers sit idle? How much did it cost?
Telemetry is the automated recording of everything that happens during a run — every task start, every completion, every failure, every resource measurement. It’s like a flight recorder for your workflow, letting you understand what happened after the fact and make informed decisions about optimization.
This tutorial explains observability from first principles: what telemetry is, why structured logging matters, how to read event data, and how to generate useful reports.
What You Will Learn
By the end of this tutorial you will:
Understand what telemetry and observability mean.
Know the difference between metrics, logs, and traces.
Read JSONL telemetry files and understand their structure.
Generate reports from the CLI and Python API.
Use telemetry data to identify performance bottlenecks.
Understand how historical telemetry informs future decisions.
Prerequisites
Completed Beginner Tutorial 1: Your First Workflow.
At least one completed Scalable run (to have telemetry data).
pandas installed (included with Scalable’s core dependencies).
Key Concepts Explained
💡 Key Concept: What is Telemetry?
Telemetry is the automated collection and transmission of data from remote systems. The word comes from Greek: tele (remote) + metron (measurement).
In software, telemetry means recording what your program did:
When did tasks start and finish?
How much memory did workers use?
Which tasks failed and why?
How many cache hits occurred?
Analogy: A car’s dashboard shows speed, fuel level, and engine temperature in real-time. Telemetry is like a dashcam that records everything so you can review it later.
💡 Key Concept: Observability
Observability is the ability to understand a system’s internal state by examining its outputs. A system is “observable” if you can answer “why is this slow?” or “why did this fail?” from the data it produces.
The three pillars of observability:
1. Metrics: numerical measurements over time
   “CPU utilization was 87% at 14:03:22”
   “Average task duration was 4.2 seconds”
   Good for dashboards and alerting
2. Logs: discrete events with context
   “Task run_simulation(42) started at 14:03:22 on worker-3”
   “Worker-2 failed with OutOfMemoryError at 14:05:11”
   Good for debugging specific incidents
3. Traces: the journey of a request through the system
   “Task 42: submitted → queued 0.3s → scheduled to worker-3 → executed 4.1s → completed”
   Good for understanding latency and bottlenecks
Scalable’s telemetry provides all three through structured event files.
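To make the distinction concrete, the sketch below derives a log view, a trace, and a metric from one small list of events. The event names match those used later in this tutorial; the `task_id` field, the sub-second timestamps, and the sample values are assumptions made for illustration, not Scalable's documented schema.

```python
from datetime import datetime

# Hypothetical events for a single task; field names are illustrative only.
events = [
    {"timestamp": "2026-05-20T14:03:22.000Z", "event": "task_submitted", "task_id": 42},
    {"timestamp": "2026-05-20T14:03:22.300Z", "event": "task_started", "task_id": 42},
    {"timestamp": "2026-05-20T14:03:26.400Z", "event": "task_completed", "task_id": 42},
]


def parse(ts: str) -> datetime:
    # Older Python versions reject a trailing "Z", so swap it for an explicit offset.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


# Logs: the individual events, one line each.
log_lines = [f"{e['timestamp']} {e['event']}" for e in events]

# Trace: the same events stitched into one task's journey.
times = {e["event"]: parse(e["timestamp"]) for e in events}
queued_s = (times["task_started"] - times["task_submitted"]).total_seconds()
executed_s = (times["task_completed"] - times["task_started"]).total_seconds()
trace = f"Task 42: submitted -> queued {queued_s:.1f}s -> executed {executed_s:.1f}s -> completed"

# Metric: one number summarizing many events (here there is only one task).
durations = [executed_s]
avg_duration_s = sum(durations) / len(durations)

print("\n".join(log_lines))
print(trace)
print(f"Average task duration: {avg_duration_s:.1f}s")
```

The point of the sketch is that all three views can come from the same underlying event records, which is why a single structured event file can serve several purposes.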
💡 Key Concept: Structured Logging
Structured logging means recording events as machine-parseable data (typically JSON) rather than free-form text.
Unstructured log (hard to parse programmatically):
2026-05-20 14:03:22 INFO Task run_simulation(42) completed in 4.2s on worker-3
Structured log (easy to parse, filter, aggregate):
{
  "timestamp": "2026-05-20T14:03:22Z",
  "event": "task_completed",
  "task": "run_simulation",
  "args": {"scenario_id": 42},
  "duration_s": 4.2,
  "worker": "worker-3"
}
Structured logs can be:
Filtered: “show me only failures”
Aggregated: “average duration per task type”
Queried: “which worker handled the most tasks?”
Visualized: plotted on timelines and dashboards
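As a rough illustration of the first three operations, the following sketch filters, aggregates, and queries a handful of structured records using only the standard library. The records are made up for this example and only mimic the shape of the structured log shown above.

```python
import json
from collections import Counter, defaultdict

# Hypothetical structured log lines, shaped like the example above.
raw_lines = [
    '{"event": "task_completed", "task": "run_simulation", "duration_s": 4.2, "worker": "worker-3"}',
    '{"event": "task_failed", "task": "run_simulation", "worker": "worker-2"}',
    '{"event": "task_completed", "task": "postprocess", "duration_s": 1.0, "worker": "worker-3"}',
    '{"event": "task_completed", "task": "run_simulation", "duration_s": 3.8, "worker": "worker-1"}',
]
records = [json.loads(line) for line in raw_lines]

# Filtered: show only failures.
failures = [r for r in records if r["event"] == "task_failed"]

# Aggregated: average duration per task type.
durations = defaultdict(list)
for r in records:
    if r["event"] == "task_completed":
        durations[r["task"]].append(r["duration_s"])
avg_duration = {task: sum(values) / len(values) for task, values in durations.items()}

# Queried: which worker handled the most tasks?
busiest_worker, handled = Counter(r["worker"] for r in records).most_common(1)[0]

print(f"Failures: {len(failures)}")
print(f"Average duration per task: {avg_duration}")
print(f"Busiest worker: {busiest_worker} ({handled} events)")
```

None of these steps needs a regular expression or a custom parser, which is the practical advantage over the unstructured line.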
💡 Key Concept: JSONL (JSON Lines)
JSONL (JSON Lines) is a format where each line is a complete JSON object. It’s perfect for event streams because:
Appendable — just add a new line (no need to rewrite the file)
Streamable — process one line at a time (no need to load entire file)
Parseable — each line is valid JSON
{"event": "task_started", "task": "sim", "time": "14:03:22"}
{"event": "task_completed", "task": "sim", "time": "14:03:26", "duration": 4.2}
{"event": "task_started", "task": "sim", "time": "14:03:22"}
Compare to a single large JSON array (which requires loading the entire file to append or read):
[
  {"event": "task_started", ...},
  {"event": "task_completed", ...}
]
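The appendable and streamable properties can be sketched in a few lines. The helper names `append_event` and `stream_events` are invented for this example and are not part of Scalable; the file is written to a temporary directory so no real telemetry is touched.

```python
import json
import os
import tempfile

# Use a throwaway file rather than a real run directory.
path = os.path.join(tempfile.mkdtemp(), "events.jsonl")


def append_event(path: str, event: dict) -> None:
    # Appendable: open in append mode and add exactly one line.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def stream_events(path: str):
    # Streamable: yield one parsed event at a time instead of loading the whole file.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)


append_event(path, {"event": "task_started", "task": "sim", "time": "14:03:22"})
append_event(path, {"event": "task_completed", "task": "sim", "time": "14:03:26", "duration": 4.2})

completed = [e for e in stream_events(path) if e["event"] == "task_completed"]
print(len(completed), completed[0]["duration"])
```

With a JSON array, the equivalent of `append_event` would have to read, modify, and rewrite the whole file on every event.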
💡 Key Concept: Events
An event is a discrete occurrence at a specific point in time. Events have:
Timestamp — when it happened
Type — what kind of event (task_started, worker_added, etc.)
Payload — additional context (task name, duration, error message)
Events form the foundation of Scalable’s telemetry system. Everything that happens is recorded as an event.
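One way to picture an event is as a small record with those three parts. The `Event` class below is a hypothetical model written for this tutorial, not a class exported by Scalable; note also that the sample events in this tutorial keep payload fields at the top level of the JSON object rather than nesting them.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    timestamp: str                                # when it happened
    type: str                                     # what kind of event
    payload: dict = field(default_factory=dict)   # additional context

    def to_jsonl(self) -> str:
        # Serialize to a single line suitable for a .jsonl file.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_jsonl(cls, line: str) -> "Event":
        # Rebuild an Event from one line of JSON.
        return cls(**json.loads(line))


original = Event(
    timestamp="2026-05-20T14:03:26Z",
    type="task_completed",
    payload={"task": "run_simulation", "duration_s": 4.2},
)
line = original.to_jsonl()
restored = Event.from_jsonl(line)

print(line)
print(restored == original)
```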
Step 1: Telemetry File Structure
After every run, Scalable creates a run directory with structured telemetry:
.scalable/runs/
└── run-20260520T035200Z-demeter-lulcc-a1b2c3d4/
    ├── run.json          # Run metadata (start time, target, manifest)
    ├── manifest.yaml     # Snapshot of the manifest used
    ├── plan.json         # Execution plan snapshot
    ├── tasks.jsonl       # Task lifecycle events
    ├── resources.jsonl   # Resource utilization snapshots
    ├── workers.jsonl     # Worker lifecycle events
    ├── cache.jsonl       # Cache hit/miss events
    └── failures.jsonl    # Error details (if any)
Each file serves a purpose:
run.json: High-level metadata: when the run started, which target was used, and the manifest hash for reproducibility verification.
tasks.jsonl: The most important file. Every task submission, start, completion, and failure is recorded here.
resources.jsonl: Periodic snapshots of CPU and memory usage per worker.
workers.jsonl: Worker lifecycle: when workers started, stopped, or crashed.
cache.jsonl: Every cache lookup: hit (saved time) or miss (had to compute).
failures.jsonl: Detailed error information, including tracebacks.
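Before reading individual files, it can help to check which of them a run directory actually contains. The sketch below uses a hypothetical helper, `inventory`, that is not part of Scalable; the file names come from the listing above, and the demo runs against a throwaway directory rather than a real run.

```python
import tempfile
from pathlib import Path

# File names taken from the run directory listing above.
EXPECTED_FILES = [
    "run.json", "manifest.yaml", "plan.json", "tasks.jsonl",
    "resources.jsonl", "workers.jsonl", "cache.jsonl", "failures.jsonl",
]


def inventory(run_dir) -> dict:
    """Map each expected file to an event count (JSONL), True (present), or None (missing)."""
    result = {}
    for name in EXPECTED_FILES:
        path = Path(run_dir) / name
        if not path.exists():
            result[name] = None
        elif path.suffix == ".jsonl":
            with path.open(encoding="utf-8") as f:
                result[name] = sum(1 for line in f if line.strip())
        else:
            result[name] = True
    return result


# Demo directory with just two files, standing in for a real run.
demo_dir = Path(tempfile.mkdtemp()) / "run-demo"
demo_dir.mkdir()
(demo_dir / "run.json").write_text("{}", encoding="utf-8")
(demo_dir / "tasks.jsonl").write_text(
    '{"event": "task_submitted"}\n{"event": "task_started"}\n', encoding="utf-8"
)

for name, status in inventory(demo_dir).items():
    print(f"{name:16} {status}")
```

A missing failures.jsonl is expected for a clean run, since that file is described as holding error details only if there are any.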
Step 2: Reading Telemetry Data
You can read telemetry files directly:
import json

# Read task events line by line
with open(".scalable/runs/run-.../tasks.jsonl") as f:
    for line in f:
        event = json.loads(line)
        print(f"{event['timestamp']} | {event['event']} | {event.get('task', '')}")
Output:
2026-05-20T14:03:22Z | task_submitted | run_simulation
2026-05-20T14:03:22Z | task_started | run_simulation
2026-05-20T14:03:26Z | task_completed | run_simulation
2026-05-20T14:03:22Z | task_submitted | run_simulation
...
Or use pandas for analysis:
import pandas as pd
# Load all task events into a DataFrame
tasks = pd.read_json(".scalable/runs/run-.../tasks.jsonl", lines=True)
# Filter to completions and compute statistics
completed = tasks[tasks["event"] == "task_completed"]
print(f"Total tasks: {len(completed)}")
print(f"Average duration: {completed['duration_s'].mean():.2f}s")
print(f"Slowest task: {completed['duration_s'].max():.2f}s")
print(f"Fastest task: {completed['duration_s'].min():.2f}s")
Under the Hood
Scalable records telemetry automatically; you don’t need to add logging to your functions. The ScalableSession instruments:
Every submit() → task_submitted event
When a worker picks up a task → task_started
When a task completes → task_completed (with duration)
When a task fails → task_failed (with error details)
Periodic resource snapshots → resource_sample
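Because each task moves through these lifecycle events in order, the last event seen for a task tells you where it ended up. The sketch below assumes each event carries a `task_id` field to tell tasks apart; that field and the sample events are assumptions for illustration and may not match the real schema.

```python
# Events that end a task's lifecycle, using the names listed above.
TERMINAL = {"task_completed", "task_failed"}

# Hypothetical events; "task_id" is an assumed field.
events = [
    {"event": "task_submitted", "task_id": 1},
    {"event": "task_started", "task_id": 1},
    {"event": "task_completed", "task_id": 1, "duration_s": 4.1},
    {"event": "task_submitted", "task_id": 2},
    {"event": "task_started", "task_id": 2},
    {"event": "task_failed", "task_id": 2, "error": "OutOfMemoryError"},
    {"event": "task_submitted", "task_id": 3},
]


def last_state(events) -> dict:
    """Return the most recent lifecycle event seen for each task."""
    state = {}
    for e in events:
        if "task_id" in e:
            state[e["task_id"]] = e["event"]
    return state


state = last_state(events)
unfinished = sorted(t for t, s in state.items() if s not in TERMINAL)

print(state)
print("Unfinished tasks:", unfinished)
```

A task that never reaches a terminal event is often the first thing worth investigating when a run stalls.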
Step 3: Generating Reports
The CLI provides quick summaries:
# Report on the most recent run
scalable report --last
═══════════════════════════════════════════════
Run Report: run-20260520T035200Z-demeter-lulcc-a1b2c3d4
═══════════════════════════════════════════════
Target: local (provider: local)
Duration: 45.2s
Status: completed
Tasks:
  Submitted: 100
  Completed: 100
  Failed: 0
  Avg duration: 4.2s
  Max duration: 8.7s (run_simulation, scenario_id=47)
Workers:
  Peak: 4
  Avg utilization: 87%
Cache:
  Lookups: 100
  Hits: 0 (0%) — first run, no prior cache
  Misses: 100
Estimated Cost: $0.00 (local provider)
You can also compare runs:
scalable report --compare run-abc123 run-def456
This shows performance differences between two runs — useful for verifying that optimization changes actually helped.
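If you want similar numbers inside a script, the task section of a report and a simple comparison can be approximated directly from task events. This is a minimal sketch and not how the CLI is implemented; `summarize` and `compare` are hypothetical helpers, and the two event lists are invented sample data.

```python
def summarize(events) -> dict:
    """Compute task counts and duration statistics from a list of task events."""
    durations = [e["duration_s"] for e in events if e["event"] == "task_completed"]
    return {
        "submitted": sum(e["event"] == "task_submitted" for e in events),
        "completed": len(durations),
        "failed": sum(e["event"] == "task_failed" for e in events),
        "avg_duration_s": sum(durations) / len(durations) if durations else 0.0,
        "max_duration_s": max(durations, default=0.0),
    }


def compare(before: dict, after: dict) -> dict:
    """Return after minus before for every field; negative durations mean faster."""
    return {key: after[key] - before[key] for key in before}


# Invented sample data standing in for two runs' tasks.jsonl contents.
run_a = [
    {"event": "task_submitted"}, {"event": "task_submitted"},
    {"event": "task_completed", "duration_s": 4.0},
    {"event": "task_completed", "duration_s": 6.0},
]
run_b = [
    {"event": "task_submitted"}, {"event": "task_submitted"},
    {"event": "task_completed", "duration_s": 3.0},
    {"event": "task_completed", "duration_s": 5.0},
]

summary_a, summary_b = summarize(run_a), summarize(run_b)
delta = compare(summary_a, summary_b)

print(summary_a)
print(summary_b)
print(f"Average duration change: {delta['avg_duration_s']:+.1f}s")
```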
Step 4: Using Telemetry for Optimization
Telemetry answers critical questions:
“Which tasks are slowest?”
# Find the 5 slowest tasks
slowest = completed.nlargest(5, "duration_s")[["task", "duration_s"]]
print(slowest)
“Are workers sitting idle?”
resources = pd.read_json(".scalable/runs/run-.../resources.jsonl", lines=True)
print(f"Average CPU utilization: {resources['cpu_percent'].mean():.1f}%")
# Below about 70% may mean you have more workers than the workload needs
“Is caching helping?”
cache = pd.read_json(".scalable/runs/run-.../cache.jsonl", lines=True)
hit_rate = cache[cache["result"] == "hit"].shape[0] / len(cache) * 100
print(f"Cache hit rate: {hit_rate:.1f}%")
💡 Key Concept: Utilization and Efficiency
Utilization measures how much of your allocated resources are actually being used:
100% utilization = every worker is busy all the time (ideal)
50% utilization = workers are idle half the time (wasteful)
Low utilization usually means: too many workers, or tasks are too quick (overhead dominates)
Efficiency considers the ratio of useful work to total time:
Efficiency = (total task computation time) / (uptime per worker × worker count)
If you have 4 workers running for 60 seconds each (240 worker-seconds) but only 180 seconds of actual task computation, efficiency is 75%.
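The same arithmetic as a small function, assuming every worker was up for the same length of time. The function name and its parameters are chosen for this example only.

```python
def efficiency(total_task_seconds: float, uptime_per_worker_s: float, worker_count: int) -> float:
    """Ratio of useful task time to total worker-seconds available."""
    worker_seconds = uptime_per_worker_s * worker_count
    if worker_seconds <= 0:
        raise ValueError("worker uptime and worker count must be positive")
    return total_task_seconds / worker_seconds


# 4 workers x 60 s = 240 worker-seconds, of which 180 s was task computation.
print(f"{efficiency(180, 60, 4):.0%}")  # prints 75%
```

If workers start and stop at different times, replace the denominator with the sum of each worker's individual uptime.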
Step 5: Historical Analysis
💡 Key Concept: Trend Analysis
Trend analysis looks at how metrics change over time:
Are runs getting slower? (regression detection)
Are resource needs growing? (capacity planning)
Is cache hit rate improving? (optimization validation)
Scalable stores all runs in .scalable/runs/ so you can analyze trends across your project’s history.
import os
import json

# Load metadata from all runs
runs_dir = ".scalable/runs"
runs = []
for run_name in sorted(os.listdir(runs_dir)):
    run_meta = os.path.join(runs_dir, run_name, "run.json")
    if os.path.exists(run_meta):
        with open(run_meta) as f:
            runs.append(json.load(f))

# Print each run's duration in directory-name order (names begin with a timestamp)
for r in runs:
    print(f"{r['start_time']}: {r['duration_s']:.1f}s ({r['tasks_completed']} tasks)")
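A simple form of regression detection compares the latest run against a baseline built from earlier runs. The sketch below uses the `start_time` and `duration_s` keys from the loop above; the `detect_regression` helper, the 20% tolerance, and the sample values are assumptions chosen for illustration.

```python
from statistics import median


def detect_regression(runs, tolerance: float = 1.2) -> bool:
    """Return True if the latest run is slower than tolerance x the median of earlier runs."""
    ordered = sorted(runs, key=lambda r: r["start_time"])
    if len(ordered) < 2:
        return False  # nothing to compare against yet
    baseline = median(r["duration_s"] for r in ordered[:-1])
    return ordered[-1]["duration_s"] > tolerance * baseline


# Invented run metadata standing in for the contents of several run.json files.
history = [
    {"start_time": "2026-05-17T03:52:00Z", "duration_s": 45.2},
    {"start_time": "2026-05-18T03:52:00Z", "duration_s": 44.0},
    {"start_time": "2026-05-19T03:52:00Z", "duration_s": 46.0},
    {"start_time": "2026-05-20T03:52:00Z", "duration_s": 60.0},
]

print("Regression detected:", detect_regression(history))
```

The median is used instead of the mean so that a single unusually slow run in the history does not shift the baseline much.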
Step 6: Telemetry-Driven Resource Recommendations
Scalable’s resource advisor uses telemetry history to recommend better resource allocations:
scalable advise --task run_simulation
Resource Recommendation for 'run_simulation':
Current: 4 CPUs, 16G memory
Recommended: 2 CPUs, 8G memory
Reason: 95th percentile usage is 1.8 CPUs and 6.2G memory
Potential savings: 50% compute cost reduction
🤔 Think About It
Without telemetry, resource allocation is guesswork (“let’s try 32G and see”). With telemetry, it’s data-driven (“historical usage shows 6G is the 95th percentile, so 8G gives comfortable headroom”).
This is why Scalable records telemetry by default — even if you don’t look at it now, it enables smarter decisions later.
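The reasoning behind a recommendation like this can be sketched as "take a high percentile of observed usage, then add headroom". The code below is not the advisor's actual algorithm, which this tutorial does not describe; the nearest-rank percentile, the 25% headroom factor, the 2 GB rounding step, and the sample values are all assumptions.

```python
import math


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_memory_gb(samples_gb, headroom: float = 1.25, step_gb: int = 2) -> int:
    """Add headroom to the 95th percentile and round up to the next allocation step."""
    p95 = percentile(samples_gb, 95)
    return math.ceil(p95 * headroom / step_gb) * step_gb


# Invented peak-memory samples (GB) from 20 earlier task executions.
samples_gb = [5.0] * 10 + [5.8] * 8 + [6.2] + [9.0]

print(f"95th percentile: {percentile(samples_gb, 95)} GB")
print(f"Recommended allocation: {recommend_memory_gb(samples_gb)} GB")
```

Using the 95th percentile rather than the maximum keeps one outlier (the 9.0 GB sample here) from driving the allocation, at the cost of occasionally under-provisioning a rare heavy task.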
Common Questions
Q: Does telemetry slow down my workflow?
Negligibly. Writing a JSON line to a file typically takes microseconds, so for tasks that run for seconds or minutes the overhead is too small to matter.
Q: How much disk space does telemetry use?
Typically 1–10 MB per run (for hundreds of tasks). You can periodically archive or delete old runs. For long-term storage, telemetry can be exported to Parquet format (compressed columnar storage).
Q: Can I disable telemetry?
Yes, but it’s not recommended. Telemetry is what enables caching verification, resource recommendations, and debugging. Without it, you’re flying blind.
Q: What’s the difference between telemetry and logging?
Logging = messages for developers to debug issues (often unstructured, verbose, human-oriented)
Telemetry = structured data for analysis and automation (machine-parseable, consistent schema)
Scalable provides both: Python logging for debugging, telemetry for analysis.
Q: Can I send telemetry to external systems?
Yes — telemetry files are standard JSONL that can be ingested by any log aggregation system (Elasticsearch, Splunk, CloudWatch). Export to Parquet for data warehouse analytics.
What You Learned
| Term | Definition |
|---|---|
| Telemetry | Automated collection of system behavior data |
| Observability | Ability to understand internal state from outputs |
| Metrics | Numerical measurements over time (CPU %, duration) |
| Logs | Discrete events with context (structured or unstructured) |
| Traces | Journey of a request through the system |
| Structured Logging | Recording events as machine-parseable data (JSON) |
| JSONL | JSON Lines: one JSON object per line |
| Event | Discrete occurrence with timestamp, type, and payload |
| Utilization | Percentage of allocated resources actually being used |
| Trend Analysis | Examining how metrics change over time |
| Run Directory | Folder containing all telemetry for a single execution |
Next Steps
You now understand telemetry and observability, and can use Scalable’s data to optimize your workflows.
Next beginner tutorial: Beginner Tutorial 7: When Things Go Wrong — what happens when things go wrong
Standard tutorial: Tutorial 6: Monitoring and Observability with Telemetry — custom dashboards, Parquet export, and advanced analysis
Try it: After running a workflow, explore the .scalable/runs/ directory. Open a tasks.jsonl file and look at the event structure. Can you find the slowest task?