.. _beginner_telemetry:

======================================================
Beginner Tutorial 6: Understanding What Happened
======================================================

The Big Picture
----------------

You've run a workflow. It completed. But did it perform well? Were some tasks
slower than expected? Did workers sit idle? How much did it cost?

**Telemetry** is the automated recording of everything that happens during a
run: every task start, every completion, every failure, every resource
measurement. It's like a flight recorder for your workflow, letting you
understand what happened after the fact and make informed decisions about
optimization.

This tutorial explains observability from first principles: what telemetry is,
why structured logging matters, how to read event data, and how to generate
useful reports.

What You Will Learn
--------------------

By the end of this tutorial you will:

* Understand what telemetry and observability mean.
* Know the difference between metrics, logs, and traces.
* Read JSONL telemetry files and understand their structure.
* Generate reports from the CLI and Python API.
* Use telemetry data to identify performance bottlenecks.
* Understand how historical telemetry informs future decisions.

Prerequisites
--------------

* Completed :ref:`beginner_getting_started`.
* At least one completed Scalable run (to have telemetry data).
* ``pandas`` installed (included with Scalable's core dependencies).

Key Concepts Explained
-----------------------

.. admonition:: 💡 Key Concept: What is Telemetry?
   :class: tip

   **Telemetry** is the automated collection and transmission of data from
   remote systems. The word comes from Greek: *tele* (remote) + *metron*
   (measurement).

   In software, telemetry means recording what your program did:

   * When did tasks start and finish?
   * How much memory did workers use?
   * Which tasks failed and why?
   * How many cache hits occurred?

   **Analogy:** A car's dashboard shows speed, fuel level, and engine
   temperature in real time. Telemetry is like a dashcam that records
   everything so you can review it later.

.. admonition:: 💡 Key Concept: Observability
   :class: tip

   **Observability** is the ability to understand a system's internal state by
   examining its outputs. A system is "observable" if you can answer "why is
   this slow?" or "why did this fail?" from the data it produces.

   The three pillars of observability:

   **1. Metrics**: numerical measurements over time

   * "CPU utilization was 87% at 14:03:22"
   * "Average task duration was 4.2 seconds"
   * Good for dashboards and alerting

   **2. Logs**: discrete events with context

   * "Task run_simulation(42) started at 14:03:22 on worker-3"
   * "Worker-2 failed with OutOfMemoryError at 14:05:11"
   * Good for debugging specific incidents

   **3. Traces**: the journey of a request through the system

   * "Task 42: submitted → queued 0.3s → scheduled to worker-3 → executed 4.1s → completed"
   * Good for understanding latency and bottlenecks

   Scalable's telemetry provides all three through structured event files.

.. admonition:: 💡 Key Concept: Structured Logging
   :class: tip

   **Structured logging** means recording events as machine-parseable data
   (typically JSON) rather than free-form text.

   **Unstructured log** (hard to parse programmatically):

   .. code-block:: text

      2026-05-20 14:03:22 INFO Task run_simulation(42) completed in 4.2s on worker-3

   **Structured log** (easy to parse, filter, aggregate):

   .. code-block:: json

      {
        "timestamp": "2026-05-20T14:03:22Z",
        "event": "task_completed",
        "task": "run_simulation",
        "args": {"scenario_id": 42},
        "duration_s": 4.2,
        "worker": "worker-3"
      }

   Structured logs can be:

   * Filtered: "show me only failures"
   * Aggregated: "average duration per task type"
   * Queried: "which worker handled the most tasks?"
   * Visualized: plotted on timelines and dashboards

.. admonition:: 💡 Key Concept: JSONL (JSON Lines)
   :class: tip

   **JSONL** (JSON Lines) is a format where each line is a complete JSON
   object. It suits event streams well because it is:

   * **Appendable**: just add a new line (no need to rewrite the file)
   * **Streamable**: process one line at a time (no need to load the entire file)
   * **Parseable**: each line is valid JSON

   .. code-block:: text

      {"event": "task_started", "task": "sim", "time": "14:03:22"}
      {"event": "task_completed", "task": "sim", "time": "14:03:26", "duration": 4.2}
      {"event": "task_started", "task": "sim", "time": "14:03:26"}

   Compare this with a single large JSON array, which generally has to be
   parsed as a whole before it can be read or extended:

   .. code-block:: json

      [
        {"event": "task_started", ...},
        {"event": "task_completed", ...}
      ]

.. admonition:: 💡 Key Concept: Events
   :class: tip

   An **event** is a discrete occurrence at a specific point in time. Events
   have:

   * **Timestamp**: when it happened
   * **Type**: what kind of event (``task_started``, ``worker_added``, etc.)
   * **Payload**: additional context (task name, duration, error message)

   Events form the foundation of Scalable's telemetry system. Everything that
   happens is recorded as an event.

Step 1: Telemetry File Structure
----------------------------------

After every run, Scalable creates a run directory with structured telemetry:

.. code-block:: text

   .scalable/runs/
   └── run-20260520T035200Z-demeter-lulcc-a1b2c3d4/
       ├── run.json          # Run metadata (start time, target, manifest)
       ├── manifest.yaml     # Snapshot of the manifest used
       ├── plan.json         # Execution plan snapshot
       ├── tasks.jsonl       # Task lifecycle events
       ├── resources.jsonl   # Resource utilization snapshots
       ├── workers.jsonl     # Worker lifecycle events
       ├── cache.jsonl       # Cache hit/miss events
       └── failures.jsonl    # Error details (if any)

Each file serves a purpose:

``run.json``
   High-level metadata: when the run started, which target was used, the
   manifest hash for reproducibility verification.

``tasks.jsonl``
   The most important file: every task submission, start, completion, and
   failure is recorded here.

``resources.jsonl``
   Periodic snapshots of CPU and memory usage per worker.

``workers.jsonl``
   Worker lifecycle: when workers started, stopped, or crashed.

``cache.jsonl``
   Every cache lookup: hit (saved time) or miss (had to compute).

``failures.jsonl``
   Detailed error information including tracebacks.

Step 2: Reading Telemetry Data
--------------------------------

You can read telemetry files directly:

.. code-block:: python

   import json

   # Read task events line by line
   with open(".scalable/runs/run-.../tasks.jsonl") as f:
       for line in f:
           event = json.loads(line)
           print(f"{event['timestamp']} | {event['event']} | {event.get('task', '')}")

Output:

.. code-block:: text

   2026-05-20T14:03:22Z | task_submitted | run_simulation
   2026-05-20T14:03:22Z | task_started | run_simulation
   2026-05-20T14:03:26Z | task_completed | run_simulation
   2026-05-20T14:03:26Z | task_submitted | run_simulation
   ...

Or use pandas for analysis:

.. code-block:: python

   import pandas as pd

   # Load all task events into a DataFrame
   tasks = pd.read_json(".scalable/runs/run-.../tasks.jsonl", lines=True)

   # Filter to completions and compute statistics
   completed = tasks[tasks["event"] == "task_completed"]
   print(f"Total tasks: {len(completed)}")
   print(f"Average duration: {completed['duration_s'].mean():.2f}s")
   print(f"Slowest task: {completed['duration_s'].max():.2f}s")
   print(f"Fastest task: {completed['duration_s'].min():.2f}s")

.. admonition:: Under the Hood
   :class: hint

   Scalable records telemetry **automatically**; you don't need to add
   logging to your functions. The ``ScalableSession`` instruments:

   1. Every ``submit()`` → ``task_submitted`` event
   2. When a worker picks up a task → ``task_started``
   3. When a task completes → ``task_completed`` (with duration)
   4. When a task fails → ``task_failed`` (with error details)
   5. Periodic resource snapshots → ``resource_sample``

Step 3: Generating Reports
-----------------------------

The CLI provides quick summaries:

.. code-block:: bash

   # Report on the most recent run
   scalable report --last

.. code-block:: text

   ═══════════════════════════════════════════════
    Run Report: run-20260520T035200Z-demeter-lulcc-a1b2c3d4
   ═══════════════════════════════════════════════

   Target:     local (provider: local)
   Duration:   45.2s
   Status:     completed

   Tasks:
     Submitted:     100
     Completed:     100
     Failed:        0
     Avg duration:  4.2s
     Max duration:  8.7s (run_simulation, scenario_id=47)

   Workers:
     Peak:             4
     Avg utilization:  87%

   Cache:
     Lookups:  100
     Hits:     0 (0%) - first run, no prior cache
     Misses:   100

   Estimated Cost: $0.00 (local provider)

You can also compare runs:

.. code-block:: bash

   scalable report --compare run-abc123 run-def456

This shows performance differences between two runs, which is useful for
verifying that optimization changes actually helped.

Step 4: Using Telemetry for Optimization
------------------------------------------

Telemetry answers critical questions:

**"Which tasks are slowest?"**

.. code-block:: python

   # Find the 5 slowest tasks
   # (``completed`` is the DataFrame of task_completed events from Step 2)
   slowest = completed.nlargest(5, "duration_s")[["task", "duration_s"]]
   print(slowest)

**"Are workers sitting idle?"**

.. code-block:: python

   import pandas as pd

   resources = pd.read_json(".scalable/runs/run-.../resources.jsonl", lines=True)
   print(f"Average CPU utilization: {resources['cpu_percent'].mean():.1f}%")
   # Rule of thumb: sustained utilization below about 70% may mean you have
   # more workers than the workload needs

**"Is caching helping?"**

.. code-block:: python

   import pandas as pd

   cache = pd.read_json(".scalable/runs/run-.../cache.jsonl", lines=True)
   # Fraction of lookups whose result was "hit", as a percentage
   hit_rate = (cache["result"] == "hit").mean() * 100
   print(f"Cache hit rate: {hit_rate:.1f}%")

.. admonition:: 💡 Key Concept: Utilization and Efficiency
   :class: tip

   **Utilization** measures how much of your allocated resources are actually
   being used:

   * **100% utilization** = every worker is busy all the time (ideal)
   * **50% utilization** = workers are idle half the time (wasteful)
   * **Low utilization** usually means too many workers, or tasks so short
     that scheduling overhead dominates

   **Efficiency** considers the ratio of useful work to total worker time:

   .. code-block:: text

      Efficiency = (total task computation time) / (per-worker uptime × worker count)

   If you have 4 workers running for 60 seconds each (4 × 60 = 240
   worker-seconds) but only 180 seconds of actual task computation,
   efficiency is 180 / 240 = 75%.

Step 5: Historical Analysis
------------------------------

.. admonition:: 💡 Key Concept: Trend Analysis
   :class: tip

   **Trend analysis** looks at how metrics change over time:

   * Are runs getting slower? (regression detection)
   * Are resource needs growing? (capacity planning)
   * Is cache hit rate improving? (optimization validation)

Scalable stores all runs in ``.scalable/runs/`` so you can analyze trends
across your project's history.

.. code-block:: python

   import os
   import json

   # Load metadata from all runs
   runs_dir = ".scalable/runs"
   runs = []
   for run_name in sorted(os.listdir(runs_dir)):
       run_meta = os.path.join(runs_dir, run_name, "run.json")
       if os.path.exists(run_meta):
           with open(run_meta) as f:
               runs.append(json.load(f))

   # Print duration per run. Run names begin with a timestamp, so sorting
   # by name lists them oldest first; the same values could be plotted.
   for r in runs:
       print(f"{r['start_time']}: {r['duration_s']:.1f}s ({r['tasks_completed']} tasks)")

Step 6: Telemetry-Driven Resource Recommendations
----------------------------------------------------

Scalable's resource advisor uses telemetry history to recommend better
resource allocations:

.. code-block:: bash

   scalable advise --task run_simulation

.. code-block:: text

   Resource Recommendation for 'run_simulation':
     Current:     4 CPUs, 16G memory
     Recommended: 2 CPUs, 8G memory
     Reason:      95th percentile usage is 1.8 CPUs and 6.2G memory
     Potential savings: 50% compute cost reduction

.. admonition:: 🤔 Think About It
   :class: note

   Without telemetry, resource allocation is guesswork ("let's try 32G and
   see"). With telemetry, it's data-driven ("historical usage shows the 95th
   percentile is about 6G, so 8G gives comfortable headroom").

   This is why Scalable records telemetry by default: even if you don't look
   at it now, it enables smarter decisions later.

Common Questions
-----------------

**Q: Does telemetry slow down my workflow?**

Barely. Appending a JSON line to a file typically takes on the order of
microseconds, so next to tasks that run for seconds or minutes the overhead is
negligible.

**Q: How much disk space does telemetry use?**

Typically 1–10 MB per run (for hundreds of tasks). You can periodically
archive or delete old runs. For long-term storage, telemetry can be exported
to Parquet format (compressed columnar storage).

**Q: Can I disable telemetry?**

Yes, but it's not recommended. Telemetry is what enables caching
verification, resource recommendations, and debugging. Without it, you're
flying blind.

**Q: What's the difference between telemetry and logging?**

* **Logging** = messages for developers to debug issues (often unstructured,
  verbose, human-oriented)
* **Telemetry** = structured data for analysis and automation
  (machine-parseable, consistent schema)

Scalable provides both: Python logging for debugging, telemetry for analysis.

**Q: Can I send telemetry to external systems?**

Yes. Telemetry files are standard JSONL, which common log aggregation systems
(such as Elasticsearch, Splunk, or CloudWatch) can ingest. Export to Parquet
for data warehouse analytics.

What You Learned
-----------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Term
     - Definition
   * - Telemetry
     - Automated collection of system behavior data
   * - Observability
     - Ability to understand internal state from outputs
   * - Metrics
     - Numerical measurements over time (CPU %, duration)
   * - Logs
     - Discrete events with context (structured or unstructured)
   * - Traces
     - Journey of a request through the system
   * - Structured Logging
     - Recording events as machine-parseable data (JSON)
   * - JSONL
     - JSON Lines: one JSON object per line
   * - Event
     - Discrete occurrence with timestamp, type, and payload
   * - Utilization
     - Percentage of allocated resources actually being used
   * - Trend Analysis
     - Examining how metrics change over time
   * - Run Directory
     - Folder containing all telemetry for a single execution

Next Steps
-----------

You now understand telemetry and observability, and can use Scalable's data to
optimize your workflows.

* **Next beginner tutorial:** :ref:`beginner_error_handling`, which covers
  what happens when things go wrong
* **Standard tutorial:** :ref:`tutorial_telemetry`, which covers custom
  dashboards, Parquet export, and advanced analysis
* **Try it:** After running a workflow, explore the ``.scalable/runs/``
  directory. Open a ``tasks.jsonl`` file and look at the event structure. Can
  you find the slowest task?

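
As a starting point for the "Try it" exercise, here is a minimal sketch that
streams a ``tasks.jsonl`` file one line at a time and keeps the slowest
``task_completed`` event. It assumes the ``event``, ``task``, and
``duration_s`` fields used in this tutorial's examples; a real run may use a
different schema. The helper name ``find_slowest_task`` and the sample events
are hypothetical and are not part of Scalable.

.. code-block:: python

   import json
   import os
   import tempfile


   def find_slowest_task(path):
       """Return the task_completed event with the largest duration_s, or None.

       Hypothetical helper: field names are assumed from the tutorial examples.
       """
       slowest = None
       with open(path, encoding="utf-8") as f:
           for line in f:
               line = line.strip()
               if not line:
                   continue  # tolerate blank lines
               event = json.loads(line)
               if event.get("event") != "task_completed":
                   continue  # only completions carry a duration
               duration = event.get("duration_s")
               if duration is None:
                   continue
               if slowest is None or duration > slowest["duration_s"]:
                   slowest = event
       return slowest


   # Made-up events, written to a temporary file so the sketch runs
   # without a real run directory.
   sample_events = [
       {"event": "task_started", "task": "run_simulation"},
       {"event": "task_completed", "task": "run_simulation", "duration_s": 4.2},
       {"event": "task_completed", "task": "run_simulation", "duration_s": 8.7},
   ]
   with tempfile.TemporaryDirectory() as tmp:
       demo_path = os.path.join(tmp, "tasks.jsonl")
       with open(demo_path, "w", encoding="utf-8") as f:
           for sample in sample_events:
               f.write(json.dumps(sample) + "\n")
       slowest = find_slowest_task(demo_path)

   print(slowest["task"], slowest["duration_s"])

Because the file is read line by line, memory use stays flat regardless of
how many events a run produced. To try it on your own data, pass the path of a
real ``tasks.jsonl`` file to ``find_slowest_task`` instead of the temporary
demo file.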