Beginner Tutorial 6: Understanding What Happened

The Big Picture

You’ve run a workflow. It completed. But did it perform well? Were some tasks slower than expected? Did workers sit idle? How much did it cost?

Telemetry is the automated recording of everything that happens during a run — every task start, every completion, every failure, every resource measurement. It’s like a flight recorder for your workflow, letting you understand what happened after the fact and make informed decisions about optimization.

This tutorial explains observability from first principles: what telemetry is, why structured logging matters, how to read event data, and how to generate useful reports.

What You Will Learn

By the end of this tutorial you will:

Understand what telemetry and observability mean.
Know the difference between metrics, logs, and traces.
Read JSONL telemetry files and understand their structure.
Generate reports from the CLI and Python API.
Use telemetry data to identify performance bottlenecks.
Understand how historical telemetry informs future decisions.

Prerequisites

Completed Beginner Tutorial 1: Your First Workflow.
At least one completed Scalable run (to have telemetry data).
pandas installed (included with Scalable’s core dependencies).

Key Concepts Explained

💡 Key Concept: What is Telemetry?

Telemetry is the automated collection and transmission of data from remote systems. The word comes from Greek: tele (remote) + metron (measurement).

In software, telemetry means recording what your program did:

When did tasks start and finish?
How much memory did workers use?
Which tasks failed and why?
How many cache hits occurred?

Analogy: A car’s dashboard shows speed, fuel level, and engine temperature in real-time. Telemetry is like a dashcam that records everything so you can review it later.

💡 Key Concept: Observability

Observability is the ability to understand a system’s internal state by examining its outputs. A system is “observable” if you can answer “why is this slow?” or “why did this fail?” from the data it produces.

The three pillars of observability:

1. Metrics — numerical measurements over time

“CPU utilization was 87% at 14:03:22”
“Average task duration was 4.2 seconds”
Good for dashboards and alerting

2. Logs — discrete events with context

“Task run_simulation(42) started at 14:03:22 on worker-3”
“Worker-2 failed with OutOfMemoryError at 14:05:11”
Good for debugging specific incidents

3. Traces — the journey of a request through the system

“Task 42: submitted → queued 0.3s → scheduled to worker-3 → executed 4.1s → completed”
Good for understanding latency and bottlenecks

Scalable’s telemetry provides all three through structured event files.
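
As a rough illustration of how one event stream can serve all three pillars, the sketch below takes a few log-style events, reconstructs a per-task trace (queue time and execution time), and then derives a metric (average execution time). The field names `task_id` and `ts` and the sample values are assumptions made for this example, not Scalable's documented schema.

```python
from datetime import datetime
from statistics import mean

# Hypothetical log events; field names are illustrative only.
events = [
    {"event": "task_submitted", "task_id": 42, "ts": "2026-05-20T14:03:22.000"},
    {"event": "task_started", "task_id": 42, "ts": "2026-05-20T14:03:22.300"},
    {"event": "task_submitted", "task_id": 43, "ts": "2026-05-20T14:03:22.000"},
    {"event": "task_started", "task_id": 43, "ts": "2026-05-20T14:03:22.500"},
    {"event": "task_completed", "task_id": 43, "ts": "2026-05-20T14:03:24.500"},
    {"event": "task_completed", "task_id": 42, "ts": "2026-05-20T14:03:26.400"},
]

# Logs: group the raw event timestamps by task.
stamps = {}
for e in events:
    stamps.setdefault(e["task_id"], {})[e["event"]] = datetime.fromisoformat(e["ts"])


def stage_seconds(task_id):
    """Return (queued seconds, executed seconds) for one task."""
    s = stamps[task_id]
    queued = (s["task_started"] - s["task_submitted"]).total_seconds()
    executed = (s["task_completed"] - s["task_started"]).total_seconds()
    return queued, executed


# Traces: the journey of each task through the system.
traces = {task_id: stage_seconds(task_id) for task_id in stamps}
for task_id, (queued, executed) in traces.items():
    print(f"Task {task_id}: queued {queued:.1f}s, executed {executed:.1f}s")

# Metric: one number summarizing all tasks.
avg_exec = mean(executed for _, executed in traces.values())
print(f"Average execution time: {avg_exec:.2f}s")
```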

💡 Key Concept: Structured Logging

Structured logging means recording events as machine-parseable data (typically JSON) rather than free-form text.

Unstructured log (hard to parse programmatically):

2026-05-20 14:03:22 INFO Task run_simulation(42) completed in 4.2s on worker-3

Structured log (easy to parse, filter, aggregate):

{
  "timestamp": "2026-05-20T14:03:22Z",
  "event": "task_completed",
  "task": "run_simulation",
  "args": {"scenario_id": 42},
  "duration_s": 4.2,
  "worker": "worker-3"
}

Structured logs can be:

Filtered: “show me only failures”
Aggregated: “average duration per task type”
Queried: “which worker handled the most tasks?”
Visualized: plotted on timelines and dashboards
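
To make the first three operations concrete, here is a minimal sketch using only the standard library. The events are made-up samples shaped like the structured log above; the `task_failed` record and the worker names are assumptions for illustration.

```python
import json
from collections import Counter, defaultdict

# Made-up structured log lines, one JSON object per line.
lines = [
    '{"event": "task_completed", "task": "run_simulation", "duration_s": 4.2, "worker": "worker-3"}',
    '{"event": "task_failed", "task": "run_simulation", "worker": "worker-2"}',
    '{"event": "task_completed", "task": "load_data", "duration_s": 1.0, "worker": "worker-3"}',
    '{"event": "task_completed", "task": "run_simulation", "duration_s": 3.8, "worker": "worker-1"}',
]
events = [json.loads(line) for line in lines]

# Filtered: show me only failures.
failures = [e for e in events if e["event"] == "task_failed"]

# Aggregated: average duration per task type.
durations = defaultdict(list)
for e in events:
    if e["event"] == "task_completed":
        durations[e["task"]].append(e["duration_s"])
avg_duration = {task: sum(d) / len(d) for task, d in durations.items()}

# Queried: which worker handled the most tasks?
busiest_worker, handled = Counter(e["worker"] for e in events).most_common(1)[0]

print(len(failures), avg_duration, busiest_worker, handled)
```

None of these queries needs a regular expression or a custom parser, which is the practical payoff of a structured format.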

💡 Key Concept: JSONL (JSON Lines)

JSONL (JSON Lines) is a format where each line is a complete JSON object. It’s perfect for event streams because:

Appendable — just add a new line (no need to rewrite the file)
Streamable — process one line at a time (no need to load entire file)
Parseable — each line is valid JSON

{"event": "task_started", "task": "sim", "time": "14:03:22"}
{"event": "task_completed", "task": "sim", "time": "14:03:26", "duration": 4.2}
{"event": "task_started", "task": "sim", "time": "14:03:27"}

Compare this to a single large JSON array, which generally has to be parsed in full before you can read it and rewritten in order to append a record:

[
  {"event": "task_started", ...},
  {"event": "task_completed", ...}
]
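
The appendable and streamable properties can be sketched in a few lines. The helper names `append_event` and `stream_events` are hypothetical and written for this example; they are not part of Scalable's API.

```python
import json
import os
import tempfile


def append_event(path, event):
    """Append one event as a single JSON line; existing lines are untouched."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def stream_events(path):
    """Yield events one at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.jsonl")
    append_event(path, {"event": "task_started", "task": "sim", "time": "14:03:22"})
    append_event(path, {"event": "task_completed", "task": "sim", "time": "14:03:26", "duration": 4.2})
    event_types = [e["event"] for e in stream_events(path)]

print(event_types)
```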

💡 Key Concept: Events

An event is a discrete occurrence at a specific point in time. Events have:

Timestamp — when it happened
Type — what kind of event (task_started, worker_added, etc.)
Payload — additional context (task name, duration, error message)

Events form the foundation of Scalable’s telemetry system. Everything that happens is recorded as an event.
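
One way to picture the timestamp, type, and payload split is a small data class that round-trips to a JSON line. This `Event` class is a hypothetical model for illustration only; Scalable's internal representation may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Event:
    """Illustrative event model: type, timestamp, and free-form payload."""
    event: str
    timestamp: str
    payload: dict = field(default_factory=dict)

    def to_json_line(self) -> str:
        # Flatten the payload next to the two fixed fields.
        record = {"timestamp": self.timestamp, "event": self.event, **self.payload}
        return json.dumps(record)

    @classmethod
    def from_json_line(cls, line: str) -> "Event":
        record = json.loads(line)
        timestamp = record.pop("timestamp")
        event = record.pop("event")
        # Whatever remains is the payload.
        return cls(event=event, timestamp=timestamp, payload=record)


started = Event(
    "task_started",
    "2026-05-20T14:03:22Z",
    {"task": "run_simulation", "worker": "worker-3"},
)
line = started.to_json_line()
restored = Event.from_json_line(line)
print(line)
```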

Step 1: Telemetry File Structure

After every run, Scalable creates a run directory with structured telemetry:

.scalable/runs/
└── run-20260520T035200Z-demeter-lulcc-a1b2c3d4/
    ├── run.json           # Run metadata (start time, target, manifest)
    ├── manifest.yaml      # Snapshot of the manifest used
    ├── plan.json          # Execution plan snapshot
    ├── tasks.jsonl        # Task lifecycle events
    ├── resources.jsonl    # Resource utilization snapshots
    ├── workers.jsonl      # Worker lifecycle events
    ├── cache.jsonl        # Cache hit/miss events
    └── failures.jsonl     # Error details (if any)

Each file serves a purpose:

run.json: High-level metadata: when the run started, which target was used, the manifest hash for reproducibility verification.
tasks.jsonl: The most important file — every task submission, start, completion, and failure is recorded here.
resources.jsonl: Periodic snapshots of CPU and memory usage per worker.
workers.jsonl: Worker lifecycle: when workers started, stopped, or crashed.
cache.jsonl: Every cache lookup: hit (saved time) or miss (had to compute).
failures.jsonl: Detailed error information including tracebacks.
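
A quick way to get a feel for a run directory is to count the events in each JSONL file. The sketch below assumes only the layout listed above; `count_events` is a hypothetical helper, and the demo builds a throwaway directory rather than reading a real run.

```python
import tempfile
from pathlib import Path


def count_events(run_dir):
    """Return {file name: number of non-empty lines} for each .jsonl file."""
    counts = {}
    for path in sorted(Path(run_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            counts[path.name] = sum(1 for line in f if line.strip())
    return counts


# Demo: a fake run directory with two event files and one metadata file.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "tasks.jsonl").write_text(
        '{"event": "task_started"}\n{"event": "task_completed"}\n', encoding="utf-8"
    )
    (run_dir / "cache.jsonl").write_text('{"result": "miss"}\n', encoding="utf-8")
    (run_dir / "run.json").write_text("{}", encoding="utf-8")  # not counted
    counts = count_events(run_dir)

print(counts)
```

To try it on a real run, pass the path of one of the directories under .scalable/runs/ to `count_events`.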

Step 2: Reading Telemetry Data

You can read telemetry files directly:

import json

# Read task events line by line
with open(".scalable/runs/run-.../tasks.jsonl") as f:
    for line in f:
        event = json.loads(line)
        # Pad the event name so the columns line up
        print(f"{event['timestamp']} | {event['event']:<14} | {event.get('task', '')}")

Output:

2026-05-20T14:03:22Z | task_submitted | run_simulation
2026-05-20T14:03:22Z | task_started   | run_simulation
2026-05-20T14:03:26Z | task_completed | run_simulation
2026-05-20T14:03:26Z | task_submitted | run_simulation
...

Or use pandas for analysis:

import pandas as pd

# Load all task events into a DataFrame
tasks = pd.read_json(".scalable/runs/run-.../tasks.jsonl", lines=True)

# Filter to completions and compute statistics
completed = tasks[tasks["event"] == "task_completed"]
print(f"Total tasks: {len(completed)}")
print(f"Average duration: {completed['duration_s'].mean():.2f}s")
print(f"Slowest task: {completed['duration_s'].max():.2f}s")
print(f"Fastest task: {completed['duration_s'].min():.2f}s")

Under the Hood

Scalable records telemetry automatically — you don’t need to add logging to your functions. The ScalableSession instruments:

Every submit() → task_submitted event
When a worker picks up a task → task_started
When a task completes → task_completed (with duration)
When a task fails → task_failed (with error details)
Periodic resource snapshots → resource_sample
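
Because every task follows the same lifecycle, you can replay the events to spot tasks that were submitted but never finished. The sketch below assumes each event carries a unique `task_id` field, which is an assumption made for this example; check your own tasks.jsonl for the actual identifier.

```python
def unfinished_tasks(events):
    """Return IDs of tasks that were submitted but never completed or failed."""
    open_tasks = set()
    for e in events:
        if e["event"] == "task_submitted":
            open_tasks.add(e["task_id"])
        elif e["event"] in ("task_completed", "task_failed"):
            open_tasks.discard(e["task_id"])
    return sorted(open_tasks)


# Hypothetical event stream: task 2 starts but never reaches a final state.
events = [
    {"event": "task_submitted", "task_id": 1},
    {"event": "task_submitted", "task_id": 2},
    {"event": "task_started", "task_id": 1},
    {"event": "task_started", "task_id": 2},
    {"event": "task_completed", "task_id": 1},
    {"event": "task_submitted", "task_id": 3},
    {"event": "task_failed", "task_id": 3},
]

stuck = unfinished_tasks(events)
print(stuck)
```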

Step 3: Generating Reports

The CLI provides quick summaries:

# Report on the most recent run
scalable report --last

═══════════════════════════════════════════════
Run Report: run-20260520T035200Z-demeter-lulcc-a1b2c3d4
═══════════════════════════════════════════════
Target: local (provider: local)
Duration: 45.2s
Status: completed

Tasks:
  Submitted: 100
  Completed: 100
  Failed: 0
  Avg duration: 4.2s
  Max duration: 8.7s (run_simulation, scenario_id=47)

Workers:
  Peak: 4
  Avg utilization: 87%

Cache:
  Lookups: 100
  Hits: 0 (0%) — first run, no prior cache
  Misses: 100

Estimated Cost: $0.00 (local provider)

You can also compare runs:

scalable report --compare run-abc123 run-def456

This shows performance differences between two runs — useful for verifying that optimization changes actually helped.
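
If you want to see the kind of arithmetic a comparison involves, the sketch below computes the percentage change in average task duration between two runs. It is not the implementation behind the CLI; `compare_durations` and the sample durations are hypothetical.

```python
from statistics import mean


def compare_durations(baseline, candidate):
    """Compare average task duration between two runs (durations in seconds)."""
    base_avg = mean(baseline)
    cand_avg = mean(candidate)
    # Negative change means the candidate run was faster.
    change_pct = (cand_avg - base_avg) / base_avg * 100
    return {
        "baseline_avg_s": base_avg,
        "candidate_avg_s": cand_avg,
        "change_pct": change_pct,
    }


# Hypothetical durations from a run before and after an optimization.
before = [4.0, 5.0, 6.0]
after = [3.0, 4.0, 5.0]
result = compare_durations(before, after)
print(f"Average duration changed by {result['change_pct']:.1f}%")
```

In practice the two lists would come from the `duration_s` column of each run's completed tasks.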

Step 4: Using Telemetry for Optimization

Telemetry answers critical questions:

“Which tasks are slowest?”

# Find the 5 slowest tasks
slowest = completed.nlargest(5, "duration_s")[["task", "duration_s"]]
print(slowest)

“Are workers sitting idle?”

resources = pd.read_json(".scalable/runs/run-.../resources.jsonl", lines=True)
print(f"Average CPU utilization: {resources['cpu_percent'].mean():.1f}%")
# A low average (for example, below 70%) may mean too many workers for the
# workload, or tasks that spend most of their time waiting on I/O

“Is caching helping?”

cache = pd.read_json(".scalable/runs/run-.../cache.jsonl", lines=True)
# Fraction of lookups that were hits, as a percentage
hit_rate = (cache["result"] == "hit").mean() * 100
print(f"Cache hit rate: {hit_rate:.1f}%")

💡 Key Concept: Utilization and Efficiency

Utilization measures how much of your allocated resources are actually being used:

100% utilization = every worker is busy all the time (ideal)
50% utilization = workers are idle half the time (wasteful)
Low utilization usually means: too many workers, or tasks are too quick (overhead dominates)

Efficiency considers the ratio of useful work to total time:

Efficiency = (total task computation time) / (uptime per worker × worker count)

If you have 4 workers running for 60 seconds each (240 worker-seconds) but only 180 seconds of actual task computation, efficiency is 75%.
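
The worked example can be checked with a few lines of Python. The denominator is worker-seconds: the uptime of one worker multiplied by the number of workers. The function name and signature are made up for this sketch and assume all workers are up for the same length of time.

```python
def efficiency(task_seconds, uptime_per_worker_s, worker_count):
    """Ratio of useful task time to total worker-seconds available."""
    worker_seconds = uptime_per_worker_s * worker_count
    if worker_seconds <= 0:
        raise ValueError("worker uptime and count must be positive")
    return task_seconds / worker_seconds


# 4 workers up for 60 s each = 240 worker-seconds; 180 s of task computation.
eff = efficiency(task_seconds=180, uptime_per_worker_s=60, worker_count=4)
print(f"Efficiency: {eff:.0%}")
```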

Step 5: Historical Analysis

💡 Key Concept: Trend Analysis

Trend analysis looks at how metrics change over time:

Are runs getting slower? (regression detection)
Are resource needs growing? (capacity planning)
Is cache hit rate improving? (optimization validation)

Scalable stores all runs in .scalable/runs/ so you can analyze trends across your project’s history.

import os
import json

# Load metadata from all runs
runs_dir = ".scalable/runs"
runs = []
for run_name in sorted(os.listdir(runs_dir)):
    run_meta = os.path.join(runs_dir, run_name, "run.json")
    if os.path.exists(run_meta):
        with open(run_meta) as f:
            runs.append(json.load(f))

# Print duration over time, oldest run first
for r in runs:
    print(f"{r['start_time']}: {r['duration_s']:.1f}s ({r['tasks_completed']} tasks)")
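
Regression detection can start very simply: compare the latest run against the median of earlier runs. The sketch below is one possible rule, not a Scalable feature; the function name and the 20% tolerance are arbitrary assumptions, and the durations are made up.

```python
from statistics import median


def is_regression(durations, tolerance=0.2):
    """True if the latest run is more than `tolerance` slower than the
    median of all earlier runs. `durations` is ordered oldest to newest."""
    if len(durations) < 2:
        return False  # nothing to compare against
    *history, latest = durations
    baseline = median(history)
    return latest > baseline * (1 + tolerance)


# Hypothetical run durations in seconds, oldest first.
durations = [45.2, 44.8, 46.1, 58.3]
regressed = is_regression(durations)
print(f"Latest run regressed: {regressed}")
```

Using the median rather than the mean keeps one unusually slow historical run from skewing the baseline.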

Step 6: Telemetry-Driven Resource Recommendations

Scalable’s resource advisor uses telemetry history to recommend better resource allocations:

scalable advise --task run_simulation

Resource Recommendation for 'run_simulation':
  Current: 4 CPUs, 16G memory
  Recommended: 2 CPUs, 8G memory
  Reason: 95th percentile usage is 1.8 CPUs and 6.2G memory
  Potential savings: 50% compute cost reduction

🤔 Think About It

Without telemetry, resource allocation is guesswork (“let’s try 32G and see”). With telemetry, it’s data-driven (“historical usage shows 6G is the 95th percentile, so 8G gives comfortable headroom”).

This is why Scalable records telemetry by default — even if you don’t look at it now, it enables smarter decisions later.
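
The reasoning "95th percentile plus headroom" can be sketched as follows. This is not the resource advisor's actual algorithm, which this tutorial does not describe; the nearest-rank percentile, the 25% headroom factor, and the sample values are all assumptions chosen for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_memory_gb(samples_gb, headroom=1.25):
    """95th percentile plus headroom, rounded up to a whole number of GB."""
    return math.ceil(percentile(samples_gb, 95) * headroom)


# Hypothetical peak-memory samples (GB) from earlier runs of one task.
samples = [4.1, 4.8, 5.0, 5.2, 5.5, 5.7, 5.9, 6.0, 6.1, 6.2]
p95 = percentile(samples, 95)
recommended = recommend_memory_gb(samples)
print(f"p95 = {p95} GB, recommended = {recommended} GB")
```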

Common Questions

Q: Does telemetry slow down my workflow?

Negligibly. Appending a JSON line to a file typically takes microseconds, so for tasks that run for seconds or minutes the overhead is too small to matter in practice.

Q: How much disk space does telemetry use?

Typically 1–10 MB per run (for hundreds of tasks). You can periodically archive or delete old runs. For long-term storage, telemetry can be exported to Parquet format (compressed columnar storage).
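
Before archiving or deleting anything, it helps to see how much space each run actually takes. The sketch below sums file sizes per run directory; `dir_size_bytes` is a hypothetical helper, and the demo uses a throwaway directory in place of .scalable/runs/.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(path):
    """Total size in bytes of all files under `path`, including subfolders."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


# Demo: two fake run directories with known file sizes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run-a").mkdir()
    (root / "run-b").mkdir()
    (root / "run-a" / "tasks.jsonl").write_bytes(b"x" * 10)
    (root / "run-b" / "tasks.jsonl").write_bytes(b"x" * 5)
    sizes = {p.name: dir_size_bytes(p) for p in sorted(root.iterdir()) if p.is_dir()}

print(sizes)
```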

Q: Can I disable telemetry?

Yes, but it’s not recommended. Telemetry is what enables caching verification, resource recommendations, and debugging. Without it, you’re flying blind.

Q: What’s the difference between telemetry and logging?

Logging = messages for developers to debug issues (often unstructured, verbose, human-oriented)
Telemetry = structured data for analysis and automation (machine-parseable, consistent schema)

Scalable provides both: Python logging for debugging, telemetry for analysis.

Q: Can I send telemetry to external systems?

Yes — telemetry files are standard JSONL that can be ingested by any log aggregation system (Elasticsearch, Splunk, CloudWatch). Export to Parquet for data warehouse analytics.

What You Learned

Term	Definition
Telemetry	Automated collection of system behavior data
Observability	Ability to understand internal state from outputs
Metrics	Numerical measurements over time (CPU %, duration)
Logs	Discrete events with context (structured or unstructured)
Traces	Journey of a request through the system
Structured Logging	Recording events as machine-parseable data (JSON)
JSONL	JSON Lines — one JSON object per line
Event	Discrete occurrence with timestamp, type, and payload
Utilization	Percentage of allocated resources actually being used
Trend Analysis	Examining how metrics change over time
Run Directory	Folder containing all telemetry for a single execution

Next Steps

You now understand telemetry and observability, and can use Scalable’s data to optimize your workflows.

Next beginner tutorial: Beginner Tutorial 7: When Things Go Wrong — what happens when things go wrong
Standard tutorial: Tutorial 6: Monitoring and Observability with Telemetry — custom dashboards, Parquet export, and advanced analysis
Try it: After running a workflow, explore the .scalable/runs/ directory. Open a tasks.jsonl file and look at the event structure. Can you find the slowest task?