# JEP-0013: Metrics, Tracing, and Log Observability

| Field             | Value                                                   |
| ----------------- | ------------------------------------------------------- |
| **JEP**           | 0013                                                    |
| **Title**         | Metrics, Tracing, and Log Observability                 |
| **Author(s)**     | @mangelajo (Miguel Angel Ajo Pelayo)                    |
| **Status**        | Accepted                                                |
| **Type**          | Standards Track                                         |
| **Created**       | 2026-04-23                                              |
| **Updated**       | 2026-05-04                                              |
| **Discussion**    | https://github.com/jumpstarter-dev/jumpstarter/pull/631 |
| **Requires**      | —                                                       |
| **Supersedes**    | —                                                       |
| **Superseded-By** | —                                                       |

---

## Abstract

This JEP defines an optional, cross-component observability model for Jumpstarter covering lease context metadata, structured operational events, exporter/driver metrics, and standardized logging. It targets direct integration with Prometheus (scrape), Loki (log aggregation), and Perses (dashboards) — without mandating OpenTelemetry — and introduces an optional in-cluster Jumpstarter Telemetry service that aggregates data from exporters and clients so that edge processes never need Loki or cluster-scrape credentials.

Implementation is expected to land in phases; this JEP describes the end state and compatibility rules.

### Phases

| Phase | Scope                              | Key deliverables |
| ----- | ---------------------------------- | ---------------- |
| 1     | Structured logging + lease context | `spec.context` CRD field; JSON structured logs for all long-running services; correlation fields (`lease_id`, `exporter`, `operation`, `result`) in every log line. |
| 2     | Metrics endpoints                  | `/metrics` scrape endpoints on Controller and Router; exporter-local `prometheus_client` counters/histograms/gauges with `driver_type`; Prometheus exemplars for high-cardinality context. |
| 3     | Telemetry service                  | Optional `jumpstarter-telemetry` Deployment managed by the operator; reverse-scrape of exporter metrics via `MetricsStream`; Loki push for edge-originated logs and events. |
| 4     | Exporter drivers telemetry         | Provides a clean architecture to let drivers generate their own telemetry data. |
| 5     | In-cluster log scraping            | Operator configures log shipper integration (Promtail, Grafana Alloy, Vector) for Controller/Router pod logs; `ServiceMonitor` CRDs for Prometheus autodiscovery. |
| 6     | Dashboards + alerting              | Perses CRD dashboards; starter alert rules; documentation and operator integration. |

Each phase is independently useful and builds on the previous ones. Phase 1 can ship without any later phase; operators who only need structured logs benefit immediately. Phase 2 adds scrape-ready metrics without requiring the Telemetry service.

## Motivation

Today, operators and CI maintainers need to answer questions that raw Kubernetes objects and ad hoc text logs do not always answer in one place:

- *Which pipeline or image was being tested on this lease?*
- *How often do flashes fail on this exporter?*
- *What lease or user correlates a controller line with a failure on the client?*

The `Lease` API already models scheduling and assignment; it does not yet provide a first-class, documented place for run metadata or a standard for lease-scoped operational events (beyond generic `conditions`). Exporters expose work to drivers, but there is no shared model for driver- or exporter-level metrics that a monitoring stack can scrape or receive.

### User Stories

- **As a** lab operator, **I want to** see flash success/failure rates per exporter in a Prometheus dashboard, **so that** I can spot failing hardware before CI teams notice.
- **As a** CI pipeline author, **I want to** attach my build ID and image digest to a lease, **so that** post-mortem queries in Loki can filter all logs for one pipeline run across controller, exporter, and client.
- **As a** platform engineer, **I want** exporter processes to send telemetry without holding Loki or Prometheus credentials, **so that** I do not have to distribute and rotate secrets on every lab machine.
- **As an** AI agent orchestrating CI, **I want** machine-readable structured logs and metric exemplars with lease context, **so that** I can programmatically identify failing exporters and correlate test results without parsing free-form text.

## Proposal

### Concepts

- **Lease context** — Identifiers and labels supplied by a client or CI and associated with a lease for its whole lifetime, propagated where safe so metrics, logs, and traces can be filtered and joined.
- **Lease events** (or *operations*) — Annotated, structured log entries recording significant actions (for example *flash started*, *flash failed*, *image reference*) with typed fields, queryable in **Loki** alongside regular logs and distinct from higher-frequency debug output (see **DD-2**).
- **Exporter metrics** — Counters (operations, bytes), histograms (operation duration), and gauges (active sessions) exposed from the exporter and enriched by individual drivers via the `driver_type` label. Each driver selects a category from a predefined set in jumpstarter core (e.g. `storage`, `power`, `network`, `serial`, `console`, `video`, `composite`). Composite drivers (e.g. Renode, QEMU) that bundle multiple sub-drivers do not emit a single top-level category for delegated work. Instead, each sub-driver emits its own `driver_type` when it performs an operation — a Renode storage sub-driver emits `driver_type="storage"`, its power sub-driver emits `driver_type="power"`, and so on. Any top-level methods on the composite driver itself (e.g. VM lifecycle) emit `driver_type="composite"`.
- **Jumpstarter Telemetry** (optional) — a dedicated component that reverse-scrapes connected exporters for metrics via `MetricsStream` and receives structured logs via `PushLogs`, using the same trust model (mTLS, ServiceAccount) as Controller/Router. It isolates Loki/series work from the reconciler hot path (see **DD-7**). Multi-replica HA with persistent exporter connections is covered in **DD-8**; best-effort log deduplication in **DD-9**.

### What users see

- When creating a lease, clients (or their tooling) can attach metadata via CRD fields and/or `spec.context` using documented keys and size limits. Example keys might include a build / pipeline identifier, image digest, or VCS revision.
- The controller and/or data plane write structured, annotated log events (see **DD-2**) for significant operations such as flash attempts and outcomes.
- Exporters maintain local `prometheus_client` counters and open a `MetricsStream` to the Jumpstarter Telemetry service over the existing exporter↔control-plane trust boundary. On each Prometheus scrape, the Telemetry service fans out to connected exporters and serves the merged `/metrics` output (see **DD-3**, **DD-7**). Only the Telemetry service holds cluster credentials, which avoids per-exporter Loki and metrics secrets. Exporters and clients also push structured log entries via `PushLogs` (not unbounded default chatter — see *Control-plane aggregation* below).
- The `jmp` CLI output remains human-readable, but when a Telemetry endpoint is available, `jmp` also pushes structured JSON logs to the Jumpstarter Telemetry service for Loki ingest.

### API / Protocol Changes

#### CRD (Lease)

The only change is the additive `spec.context` field. The field is empty by default, which preserves backward compatibility.

#### gRPC: Telemetry endpoint discovery (`jumpstarter.proto`)

A new RPC on the existing `ControllerService` lets both exporters and clients discover the optional Telemetry endpoint:

```protobuf
// Added to ControllerService
rpc GetServiceEndpoints(GetServiceEndpointsRequest) returns (GetServiceEndpointsResponse);

message GetServiceEndpointsRequest {}

message GetServiceEndpointsResponse {
  // Empty when telemetry is not enabled.
  repeated TelemetryEndpoint telemetry_endpoints = 1;
}

message TelemetryEndpoint {
  string endpoint = 1;      // gRPC address (host:port)
  string certificate = 2;   // Optional CA cert for the endpoint
  string min_severity = 3;  // Minimum severity to forward (e.g. "warning")
}
```

Exporters call `GetServiceEndpoints` after `Register`; clients call it after authentication. An empty `telemetry_endpoints` list means telemetry is not deployed — callers skip all telemetry RPCs. Older controllers that do not implement the method return `UNIMPLEMENTED`, which callers treat identically to an empty list.

#### gRPC: Telemetry service (`telemetry.proto` — new file)

A new `protocol/proto/jumpstarter/v1/telemetry.proto` defines the `TelemetryService` implemented by `jumpstarter-telemetry`. It has two RPCs: one for metrics (reverse scrape) and one for log push.

##### Metrics: reverse scrape via `MetricsStream`

Exporters maintain a local `prometheus_client.CollectorRegistry` with counters, histograms, and gauges. Rather than pushing increments, the exporter opens a persistent bidirectional stream to the Telemetry service; the Telemetry service sends a scrape request whenever its own `/metrics` endpoint is scraped, and the exporter responds with the output of `prometheus_client.generate_latest()` in OpenMetrics text format.

```protobuf
// Needed for the google.protobuf.Timestamp fields used below.
import "google/protobuf/timestamp.proto";

service TelemetryService {
  // Persistent bidirectional stream: telemetry sends scrape requests,
  // exporter responds with full metric snapshots.
  rpc MetricsStream(stream MetricsStreamRequest) returns (stream MetricsStreamResponse);

  // Structured log / event push (used by both exporters and clients).
  rpc PushLogs(PushLogsRequest) returns (PushLogsResponse);
}

// Exporter → Telemetry
message MetricsStreamRequest {
  oneof msg {
    MetricsRegister register = 1;               // First message: identify this exporter
    MetricsScrapeResponse scrape_response = 2;  // Subsequent: reply to a scrape
  }
}

message MetricsRegister {
  string identity = 1;  // Exporter CRD name (verified against mTLS and auth token by server)
}

message MetricsScrapeResponse {
  bytes metrics_text = 1;  // generate_latest() OpenMetrics output
  google.protobuf.Timestamp timestamp = 2;
}

// Telemetry → Exporter
message MetricsStreamResponse {
  oneof msg {
    MetricsScrapeRequest scrape_request = 1;
  }
}

message MetricsScrapeRequest {}  // "send your /metrics now"
```

The stream lifecycle:

1. The exporter opens the stream and sends `MetricsRegister`; the `jumpstarter-telemetry` service authenticates the exporter identity and derives its labels from cluster information.
2. When Prometheus (or any scraper) hits the Telemetry service's `/metrics` endpoint, Telemetry fans out `MetricsScrapeRequest` to all connected exporters.
3. Each exporter calls `generate_latest(registry)` and replies with `MetricsScrapeResponse`.
4. Telemetry merges the responses, adding or filtering labels and exemplars as needed, and serves the combined result.

This on-demand approach avoids stale data and unnecessary background traffic; it can be changed to periodic pre-fetching later if scrape latency becomes problematic.

**Client-side metrics are not collected.** All metrically interesting operations are observable from the exporter side: `DriverCall` methods run on the exporter and can be instrumented there. Client-side drivers that orchestrate complex workflows (e.g. serial-console-driven flashing) report outcomes back to the exporter via regular `DriverCall` methods, keeping the exporter as the single source of truth for metrics.
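Steps 2 to 4 of the stream lifecycle amount to a merge over per-exporter snapshots. The following is a minimal sketch of that merge, not the service's implementation: it is written in Python for illustration although this JEP describes the Telemetry service as a Go binary, the helper names `inject_label` and `merge_snapshots` are hypothetical, and adding an `exporter` label is an assumption about which labels the service injects. Samples are grouped by metric family so that families from different exporters are not interleaved in the merged text.

```python
from typing import Dict, List


def inject_label(sample_line: str, name: str, value: str) -> str:
    """Return the sample line with name="value" added as its first label.

    Assumes `value` needs no escaping (for example a Kubernetes object name).
    """
    label = f'{name}="{value}"'
    brace = sample_line.find("{")
    space = sample_line.find(" ")
    if brace != -1 and (space == -1 or brace < space):
        # The sample already has a label set: prepend the new label inside it.
        head, tail = sample_line[: brace + 1], sample_line[brace + 1 :]
        separator = "" if tail.startswith("}") else ","
        return f"{head}{label}{separator}{tail}"
    if space == -1:
        return sample_line  # Not a "name value" sample line; leave it untouched.
    # No label set yet: create one right after the metric name.
    return f"{sample_line[:space]}{{{label}}}{sample_line[space:]}"


def merge_snapshots(snapshots: Dict[str, str]) -> str:
    """Merge exposition text from several exporters into one document.

    `snapshots` maps a verified exporter name (hypothetical input shape) to the
    text it returned in MetricsScrapeResponse.metrics_text, decoded to str.
    Conflicting HELP texts between exporters are not reconciled in this sketch.
    """
    meta: Dict[str, List[str]] = {}     # family -> HELP/TYPE/UNIT lines, deduplicated
    samples: Dict[str, List[str]] = {}  # family -> labelled samples from all exporters
    for exporter in sorted(snapshots):
        family = ""  # Samples seen before any metadata line share this bucket.
        for raw in snapshots[exporter].splitlines():
            line = raw.strip()
            if not line or line == "# EOF":
                continue
            if line.startswith("#"):
                parts = line.split(" ", 3)
                if len(parts) >= 3 and parts[1] in ("HELP", "TYPE", "UNIT"):
                    family = parts[2]
                    known = meta.setdefault(family, [])
                    if line not in known:
                        known.append(line)
                continue  # Other comment lines are dropped.
            labelled = inject_label(line, "exporter", exporter)
            samples.setdefault(family, []).append(labelled)
    merged: List[str] = []
    for family, lines in samples.items():
        merged.extend(meta.get(family, []))
        merged.extend(lines)
    merged.append("# EOF")
    return "\n".join(merged) + "\n"
```

Because the full counter state stays on each exporter, a merge like this is stateless: the service holds no counters of its own between scrapes.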
##### Logs: push via `PushLogs`

Both exporters and clients push structured log entries to the Telemetry service for Loki ingest:

```protobuf
message PushLogsRequest {
  repeated LogEntry entries = 1;
}

message PushLogsResponse {
  uint32 accepted = 1;  // Entries accepted
  uint32 dropped = 2;   // Entries dropped (backpressure)
}

message LogEntry {
  google.protobuf.Timestamp timestamp = 1;
  string severity = 2;      // debug, info, warning, error, critical
  string message = 3;
  string component = 4;     // Log stream label: cli, exporter
  string exporter = 5;      // Log stream label: exporter CRD name
  string lease_id = 6;      // High-cardinality, log body only
  string client = 7;        // High-cardinality, log body only
  string operation = 8;     // flash, power, etc.
  string result = 9;        // success, failure
  string driver_type = 10;  // storage, power, network, etc.
  map<string, string> extra_fields = 11;  // Driver-specific structured data
}
```

The Telemetry service maps `component` and `exporter` to Loki stream labels and everything else into the JSON body, following the cardinality rules in *Cardinality guidelines*. The `exporter` and `client` fields are verified server-side against the authenticated identity to prevent impersonation. `spec.context` entries associated with the active lease (e.g. `build_id`, `image_digest`) are placed in `extra_fields` by the caller. Fields left empty, and other details that can be resolved from `lease_id`, are filled in before the entry is written to the log.

**`extra_fields` limits:** To prevent unbounded log payloads, `extra_fields` is capped at 16 entries per `LogEntry`. Key names are limited to 64 characters and values to 256 characters. The Telemetry service enforces these limits server-side, silently truncating or dropping entries that exceed them before forwarding to Loki.

**String fields rationale:** Fields such as `severity`, `operation`, `result`, and `driver_type` are intentionally `string` rather than protobuf enums. These values end up as JSON log body fields in Loki where string representation is required regardless. Using strings keeps the wire format forward-compatible (new categories or result codes do not require a proto regeneration), and validation of allowed values is enforced at the application layer using the operator's configuration (e.g. `driverTypeEnum` allowlist). The same reasoning applies to `extra_fields` and `structured_fields` in `LogStreamResponse` — they carry driver-specific key-value data destined for log bodies, not typed metrics.

#### gRPC: `AuditStream` removal (`jumpstarter.proto`)

The existing `AuditStream` RPC on `ControllerService` and its `AuditStreamRequest` message are removed. Analysis of the codebase shows this is dead code:

- The Go controller has no implementation — calls fall through to `UnimplementedControllerServiceServer` which returns `codes.Unimplemented`.
- No Python code (exporter or client) calls the RPC.
- No tests exercise it beyond generated stubs.

Its intended purpose (tracking exporter activity) is fully superseded by `TelemetryService.PushLogs` with a richer, properly designed message format.

#### gRPC: `LogStreamResponse` enrichment (`jumpstarter.proto`)

The existing `LogStream` RPC on `ExporterService` is kept — it serves a fundamentally different purpose (real-time session logs from exporter to connected client) from the Telemetry log push. However, the `LogStreamResponse` message is enriched with optional additive fields to support richer client-side display and optional dual-path forwarding to telemetry:

```protobuf
message LogStreamResponse {
  string uuid = 1;
  string severity = 2;
  string message = 3;
  optional LogSource source = 4;

  // New additive fields:
  optional string driver_type = 5;  // Category when source=DRIVER
  optional string operation = 6;    // When the log is part of a known operation
  optional google.protobuf.Timestamp timestamp = 7;
  map<string, string> structured_fields = 8;
}
```

These fields are optional and backward compatible — older clients ignore unknown fields; older exporters simply do not set them. The same size limits as `LogEntry.extra_fields` apply to `structured_fields` (16 entries, 64-char keys, 256-char values).

#### Tracing scope

This JEP covers *correlation only* — `lease_id`, `trace_id`, and `span_id` are propagated as log fields and Prometheus exemplar keys so that metrics, logs, and (future) traces can be joined. Full distributed tracing (span creation, sampling policies, trace storage and visualization) is deferred to a future JEP. Optional propagation of `traceparent` and lease identifiers in gRPC metadata remains backward compatible (unknown metadata ignored by older servers).

### Hardware Considerations

- No hardware considerations.

## Design Decisions

### DD-1: How lease-scoped *context* metadata is stored

**Scope:** This decision is about where to store generic metadata on a `Lease` that describes *why* a run exists or *where* it came from — for example an external build id, pipeline id, VCS revision, or other operator-defined keys (team, environment), within the cardinality and size limits defined in *Cardinality guidelines*. The same stored context is the intended source to propagate (where safe) into metric series labels and log line fields. This covers emissions that occur during the lease as well as logs produced during client access to the platform (for example `jmp`) or during exporter and control-plane handling, so Prometheus and Loki can correlate on one lease-level identity without re-typing it on every line.

**Alternatives considered:**

1. **Annotation and label only** on the `Lease` object — Kube-native, no spec change; limited size for annotations; labels for select queries only.
2. **Typed subfields under `spec`** (for example `observability` or `context`) — easier validation, clearer API, migration path in CRD.
3. **Only client-side** (environment / local config) — no cluster visibility; hard for operators to audit; no stable object-level link to per-lease metrics and server logs in the cluster.

**Decision:** **(2)** — a typed `spec.context` map under the Lease CRD for first-class, validated context. **(1)** (labels/annotations) remains allowed for integration with generic tooling that only understands Kubernetes metadata or benefits from lease label filtering.

**Rationale:** Typed fields make validation and documentation clear; labels are still useful for selection and for tools that only understand metadata.

### DD-2: Where operational events (flash, image) live

**Alternatives considered:**

1. **Kubernetes `Event` objects** — built-in, TTL-limited, good for "what happened" in `kubectl get events` but not long-term history by default.
2. **`Lease.status.conditions` only** — compact but poor for a sequence of operations with payloads (image id, size).
3. **Dedicated CRD** (for example per-event or a single stream object) — more design and RBAC, better long-term retention and querying if backed properly.
4. **Annotated log events** — a lightweight alternative that can be traced and filtered alongside regular logs.
**Decision:** **(4)**. The other alternatives put additional write pressure on the cluster's etcd, while annotated logs provide the same level of functionality and can be browsed together with regular logs.

**Rationale:** Annotated log events naturally flow through the Loki pipeline this JEP already establishes (**DD-5**, **DD-7**), so operational records (flash started, flash failed, image reference) are queryable, filterable, and correlated with surrounding exporter and controller logs using the same correlation fields (`lease_id`, `exporter`, `result`, …) without a second query domain. Kubernetes `Event` objects **(1)** have a short default TTL (~1 h) and still write to etcd on every occurrence; `status.conditions` **(2)** is a poor fit for a sequence of operations with variable payloads (image digest, byte count, duration); a dedicated CRD **(3)** adds schema versioning, RBAC surface, and per-event etcd writes that scale with flash volume — all pressure the cluster does not need for data whose primary consumers are dashboards and post-mortem queries, not reconciliation loops. Structured log events carry arbitrary fields without CRD migration, support configurable retention in Loki, and keep the etcd write budget reserved for scheduling and assignment where it matters most.

### DD-3: Metrics: Prometheus scrape of `/metrics` as the reference path

**Alternatives considered:**

1. **HTTP `GET /metrics` in Prometheus text format** (pull) — the default for in-cluster Prometheus in scrape mode; works with the Prometheus Operator (`ServiceMonitor`), `kube-prometheus`, and self-hosted jobs. The optional Jumpstarter Telemetry service exposes this for aggregated counters it holds after receiving +1 / +N from exporters.
2. **Prometheus remote write** (or a Mimir / Cortex receiver) from a Jumpstarter component — useful in advanced topologies; not part of the reference implementation in this JEP; operators can add a federation or `remote_write` from Prometheus to long-term storage without the application pushing to Prometheus.
3. **Both** — **(1)** is required for the documented path; **(2)** is optional infrastructure behind Prometheus, not a second required app protocol.
4. **Reverse scrape via gRPC** — exporters maintain a local `prometheus_client.CollectorRegistry` and connect to the Telemetry service via a persistent bidirectional gRPC stream (`MetricsStream`). When Prometheus scrapes the Telemetry service's `/metrics` endpoint, Telemetry fans out scrape requests to all connected exporters, merges the `generate_latest()` responses, and serves the combined result. Controller and Router still expose `/metrics` directly for Prometheus scrape (no change). This avoids push-increment complexity on the wire and keeps full counter state on the exporter at all times.

**Decision:** **(4)** — exporter-originated metrics are reverse-scraped through the Telemetry service via `MetricsStream`.

**Rationale:** Exporters are often behind NAT or firewalls and cannot be directly scraped by Prometheus. The reverse-scrape model **(4)** solves this: the exporter initiates an outbound gRPC stream (NAT-friendly, same direction as the existing controller connection), the Telemetry service requests metric snapshots on demand, and full counter state remains on the exporter at all times — eliminating lost-increment concerns (see **DD-9**). The exporter uses standard `prometheus_client` primitives locally, so driver authors instrument with familiar counters and histograms. The OpenMetrics exposition format natively carries exemplars, enabling high-cardinality context (`client`, `lease_id`, and `trace_id` when present) on individual samples without additional infrastructure. See **DD-6** (no OTel), **DD-7** (Telemetry Deployment), **DD-8** (HA replicas).

**Exemplar trade-offs and details:**

- **Wire format.** On the OpenMetrics `/metrics` endpoint an exemplar is appended after the sample value:

  ```text
  jumpstarter_operations_total{exporter="lab-01",operation="flash",result="success"} 42 # {client="ci-bot",lease_id="abc123",build_id="nightly-42"} 1.0 1625000000.000
  ```

  The `# {key=value,...} value timestamp` suffix is the exemplar. Grafana (≥ 7.4) renders these as clickable dots on metric panels; clicking a dot reveals the attached keys and can link to a Loki log query (filtered by `lease_id`) or a trace view (filtered by `trace_id`).
- **Size limit.** The [OpenMetrics 1.0 spec](https://prometheus.io/docs/specs/om/open_metrics_spec) imposes a **128 UTF-8 character** limit on the combined length of exemplar label names and values per exemplar. [OpenMetrics 2.0](https://github.com/prometheus/docs/blob/main/docs/specs/om/open_metrics_spec_2_0.md) (experimental, 2026) relaxes this to a soft cap measured in bytes. The exemplar key budget is discussed further in *Exemplars for high-cardinality context*.
- **Sampling.** Client libraries rate-limit exemplar updates internally; the last-seen exemplar per series is served on each scrape, not one per data point. For the Jumpstarter use case this is sufficient: the most recent `lease_id` / `trace_id` on a counter is the value operators need when investigating a spike.
- **Library support.** Go client support is mature (`prometheus/client_golang` ≥ 1.16). The Python `prometheus_client` library is used on the exporter side to maintain local registries and produce `generate_latest()` output for the reverse-scrape path (see *API / Protocol Changes*). Exemplar support in the Python library is functional but less complete than Go; if limitations arise, exemplar data can be sent as a sidecar field in `MetricsScrapeResponse` for the Telemetry service to merge server-side.
- **Infrastructure requirements.** Prometheus ≥ 2.26 with `--enable-feature=exemplar-storage` and `--storage.tsdb.max-exemplars` (e.g. 100 000). Grafana ≥ 7.4 for exemplar visualization. Perses does not yet support exemplar rendering; until it does, operators who want exemplar click-through can use Grafana alongside Perses or wait for upstream support.

These limitations are acceptable for the correlation use case this JEP targets.

### DD-4: Log format for services vs CLI

**Alternatives considered:**

1. **JSON always** for every process — best for machines; hard for humans.
2. **Human text default for `jmp`**, **JSON for long-running services**, plus a CLI push in JSON format via the Telemetry ingest endpoint (in addition to the human-friendly output).
3. **Single format** with a pretty-printer in front of developers — more moving parts.

**Decision:** **(2)**. Long-running services (`jumpstarter-controller`, `jumpstarter-router`, `jumpstarter-telemetry`, Exporter) emit structured JSON to stdout. The Controller and Router do not push logs directly to Loki; instead, a cluster-level log shipper (Promtail, Grafana Alloy, Vector, or equivalent DaemonSet) scrapes their pod logs and delivers them to Loki. Only `jumpstarter-telemetry` writes to Loki directly (push API) because the exporter/client data it aggregates does not originate as any pod's stdout.

**Rationale:** Matches the requirement that *clients* stay human-readable, and at the same time all services get parseable, joinable log lines. Writing JSON to stdout and relying on the cluster log shipper for Loki delivery decouples the Controller reconciler and Router session handling from Loki availability — a Loki outage does not affect lease operations. The Telemetry service retains a direct Loki-push because it is an isolated workload (**DD-7**) whose core job is Loki ingest.
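For a Python process such as the exporter, the JSON-to-stdout half of decision **(2)** could look like the following minimal sketch. It uses only the standard `logging` and `json` modules; the class name `JsonLineFormatter`, the helper `build_logger`, and the logger name are hypothetical, and the field names are taken from the base-field table in this section. Python level names are lower-cased, so warnings appear as `warning` (the `LogEntry.severity` spelling) rather than Zap's `warn`.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Correlation fields copied from the log record into the JSON object when set.
CORRELATION_FIELDS = ("exporter", "lease_id", "operation", "result", "driver_type", "client")


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line (JSONL)."""

    def __init__(self, component: str) -> None:
        super().__init__()
        self.component = component

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "component": self.component,
        }
        for field in CORRELATION_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        if record.exc_info:
            # json.dumps escapes the newlines, so the traceback stays on one line.
            entry["stacktrace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(stream, component: str = "exporter") -> logging.Logger:
    """Return a logger that writes JSONL to `stream` (stdout in a real service)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter(component))
    logger = logging.getLogger("jumpstarter.sketch")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    # Correlation fields travel through the standard `extra` mechanism.
    log.info(
        "flash finished",
        extra={"exporter": "lab-01", "lease_id": "abc123", "operation": "flash", "result": "success"},
    )
```

Passing correlation fields through `extra` leaves existing log call sites readable, and fields that are not set are simply omitted from the JSON object.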
**Format:** JSONL (one JSON object per line), produced by setting `--zap-encoder=json` on the existing `controller-runtime` / Zap logger (no changes to log call sites — existing `logr` structured fields become JSON keys automatically). The `ts`, `level`, and `msg` fields follow Zap's default JSON encoder output; application code adds domain fields via the standard `logr` `WithValues` / `Info` / `Error` API.

Base fields present in every log line:

| Field | Format | Loki label | Description |
| ------------- | ------------------------------------------------------------------- | :--------: | ----------------------------------------- |
| `ts` | ISO-8601 (`2026-04-28T10:15:30.123Z`) | no | Timestamp (Zap default). |
| `level` | Lower-case string (`debug`, `info`, `warn`, `error`) | no | Log severity (Zap default; Go services map `warn`→`warning` when populating `LogEntry.severity`). |
| `msg` | Free-form string | no | Human-readable message (Zap default). |
| `component` | Fixed enum (`cli`, `controller`, `router`, `telemetry`, `exporter`) | **yes** | Emitting service. |
| `exporter` | CRD name (when applicable) | **yes** | Exporter CRD name; bounded by cluster size. |
| `lease_id` | UID string (when applicable) | no | Lease UID (high cardinality). |
| `operation` | String (when applicable) | no | Operation name (flash, power, …). |
| `result` | String (when applicable) | no | Outcome (success, failure, …). |
| `driver_type` | Category from predefined set (when applicable) | no | Driver category (storage, power, …). |
| `client` | CRD name (when applicable) | no | Client CRD name (high cardinality). |
| *`spec.context` keys* | User-defined strings (during active lease) | no | All `lease.spec.context` entries (e.g. `build_id`, `image_digest`, VCS ref) added as JSON fields. High cardinality, never stream labels. |
| *`exporterLabels` keys* | Values from Exporter CRD labels (when configured) | no | Operator-defined exporter labels (e.g. `board-type`); see `spec.telemetry.exporterLabels`. |

`namespace` is emitted by the application from its own runtime context (the namespace in which the process is running). Log shippers (Promtail, Grafana Alloy, Vector) may also inject `pod` and `container` from Kubernetes pod metadata via service discovery.

Fields marked as **Loki stream labels** are extracted by the log shipper and used as indexed stream selectors. They must be low-cardinality to keep the active stream count manageable (Grafana recommends < 100 k active streams per tenant). With the labels above, a deployment with 200 exporters across 5 namespaces produces roughly 1 000 streams — well within budget. High-cardinality fields like `client` or `lease_id` must stay in the JSON body: promoting `client` to a stream label in a 1 000-client, 200-exporter cluster would create up to 1 000 000 streams (the 1 000 existing streams multiplied by 1 000 clients), overwhelming the Loki ingester. These fields are instead queried with `| json | client="value"` filter expressions after selecting the relevant streams.

Multi-line content (e.g. stack traces) is embedded as an escaped string within the JSON value (typically in a `stacktrace` or `error` field), never as bare multi-line text, so each physical line is always one complete JSON object.

### DD-5: Where Loki and Prometheus (or remote-write) credentials live

**Alternatives considered:**

1. **Each exporter and edge host** holds credentials (or a sidecar) to push directly to Loki and to Prometheus (or a metrics gateway) — maximum flexibility; maximum secret distribution and rotation burden on lab and remote sites.
2. **Jumpstarter Controller and/or Router** receive metrics and structured events from exporters and (optionally) from client traffic they already handle, and forward to the Loki push API and to Prometheus-compatible sinks (scrape registration) with in-cluster auth — one credential surface; enriched with lease, exporter, and client context in one place; must be non-blocking, bounded, and optional so the control path does not depend on Loki or Prometheus availability.
3. **Hybrid** — generic in-cluster collectors for raw pod logs and scrape; (2) for lease-scoped events and aggregated exporter metrics the platform understands.
4. **Dedicated Jumpstarter Telemetry Deployment** (see **DD-7**) instead of folding everything into the Controller — only Telemetry holds Loki-push credentials; isolated failure domain and scaling for reverse-scrape and log ingest. Router and Controller write structured JSON to stdout (see **DD-4**) and expose `/metrics` for Prometheus scrape; a cluster log shipper delivers their pod logs to Loki without Jumpstarter-specific Loki credentials.

**Decision:** **(4)**.

**Rationale:** The goal is to avoid propagating Loki- and cluster-ingest authentication to every exporter process while still attaching Jumpstarter-specific context. Among Jumpstarter components, only `jumpstarter-telemetry` holds Loki-push credentials — the Controller and Router have no Loki client dependency (see **DD-4**); their pod logs reach Loki via the cluster's existing log shipping infrastructure. Generic in-cluster collectors solve *credentials* but not *semantic* correlation unless integrated. The trust-model advantage of alternative (2), which (4) inherits, is that it reuses the existing exporter→controller relationship and can inject labels and tenant context in one place. A separate Deployment (**4** / **DD-7**) is preferable to overloading the main reconciler when load or residency of counters matters.
### DD-6: OpenTelemetry (OTLP / Collector) as a *mandated* layer

**Alternatives considered:**

1. **Adopt OpenTelemetry** — instrument Controller, Router, Exporter, and clients with the OTel SDK, export OTLP to a cluster-local OpenTelemetry Collector, and let the Collector fan out to Loki, Prometheus (remote write), and Tempo.
2. **Integrate directly** with each backend: Loki HTTP `POST /loki/api/v1/push` or gRPC; Prometheus text on `/metrics`; structured JSON (or logfmt) logs to stdout for shippers; optional W3C `traceparent` in gRPC metadata for correlation *without* shipping full distributed traces in the first iteration. If traces are ever needed, use Tempo ingest where practical, *or* a thin sender — still without a project-wide requirement on the OTel SDK in every binary.
3. **Hybrid (OTel in one language, direct in another)** — lowest common implementation cost but inconsistent contributor experience and two operational models.

**Decision:** **(2).** This JEP does not make OpenTelemetry (SDK or Collector) part of the required reference architecture. Vendors and operators who already run an OpenTelemetry Collector may scrape the same `/metrics`, receive logs shipped by existing agents, or receive the Loki body the Telemetry service would have sent — compatibility is welcome; dependency is not mandatory.

**Rationale:** The proposed Jumpstarter Telemetry service (**DD-7**) admittedly reimplements a subset of OTel Collector functionality — metric aggregation, log forwarding, backpressure, and multi-replica HA. The decision to build a purpose-built component rather than adopt the OTel Collector rests on three arguments, ordered by importance:

1. **Identity enforcement (primary)** — The Telemetry service operates inside Jumpstarter's existing authentication and trust domain (mTLS, registered client and exporter identities). It validates that every incoming `MetricsStream` or `PushLogs` call originates from the claimed exporter — preventing impersonation or label injection — using identities the platform already manages. A generic OTel Collector has no awareness of Jumpstarter identities; achieving the same guarantee would require an external auth policy layer (e.g. custom processors, mTLS-to-attribute mapping, and a sidecar or admission webhook to enforce label provenance), adding complexity that offsets the Collector's generality.
2. **Operational simplicity** — The Telemetry service is a single Go binary with a single config surface (the operator CR), no separate version matrix, and no generic pipeline DSL. An OTel Collector requires operator familiarity with its configuration model (receivers, processors, exporters, and connectors); dual OTel SDK stacks (Go + Python) add version drift and a larger test matrix; and the Collector itself is another versioned service to upgrade and monitor. This overhead is not justified when the data paths are known in advance.
3. **Narrow scope** — Jumpstarter metrics and lease events map directly to Prometheus and Loki wire protocols that operators already use. Full three-pillar OTel (unified logs and metrics via OTLP) is *optional product territory*; this JEP optimizes for low ceremony and direct integration with exactly those two backends.

**Future extension:** because the Telemetry service already aggregates metrics snapshots and structured log entries in well-defined formats, adding an OTLP push output (logs and metrics) alongside the existing Loki and `/metrics` paths would be a trivial change. This would let operators route Jumpstarter data into an OTel Collector or any OTLP-compatible backend without altering the exporter or client side. The change is additive and does not require adopting the OTel SDK as a project dependency.

### DD-7: Optional Jumpstarter Telemetry service (dedicated Deployment vs. Controller/Router only)

**Alternatives considered:**

1. **In-process** in the Controller (and Router) reconciler — few moving parts; risk of CPU / GC pressure and stronger coupling between leases and high-volume increments or Loki writes.
2. A **dedicated** in-cluster Service and Deployment (working name `jumpstarter-telemetry`, TBD) that: receives gRPC/HTTP increments from exporters and clients, applies them to counters in memory, POSTs to Loki, exposes `/metrics`, and uses the same K8s ServiceAccount / mTLS as other control-plane binaries.
3. **Split** into separate sidecars (Loki-only, metrics-only) — more images to build and version.
4. **Dedicated Deployment with reverse-scrape for metrics and push for logs** — same dedicated `jumpstarter-telemetry` Deployment as **(2)**, but instead of receiving increment RPCs the service reverse-scrapes connected exporters via `MetricsStream` (see *API / Protocol Changes*). Exporters maintain local `prometheus_client` registries; the Telemetry service requests `generate_latest()` snapshots on demand when its `/metrics` endpoint is hit, merges the results, and serves them to Prometheus. Logs and events are still pushed by exporters and clients via `PushLogs`. Client-side metrics are not collected — all metrically interesting operations are observable from the exporter side.

**Decision:** Prefer **(4)** for the optional aggregated-metrics + Loki path at scale; allow **(1)** in small or dev clusters; **(3)** only if review shows a need. In deployments without Loki, the Telemetry service's own pod logs (structured JSON to stdout) still provide a centralized, queryable event source via the cluster log shipper.

**Rationale:** A dedicated workload can scale and restart independently; Loki spikes and ingest load cannot starve lease reconciliation in the controller.
The reverse-scrape model **(4)** is preferred over the increment-push model **(2)** because full counter state stays on the exporter — no metrics are lost when the Telemetry service restarts or is temporarily unavailable, and idempotency concerns are eliminated (see **DD-9**). The service has exactly two core responsibilities: **(a)** reverse-scraping exporter metrics and aggregating them for Prometheus, and **(b)** ingesting structured logs from exporters and clients with backpressure management and forwarding them to Loki. Everything else (HA, identity enforcement, configuration) is supporting detail, not an independent responsibility. **Identity enforcement:** The Telemetry service validates the source identity of every `MetricsStream` connection and `PushLogs` RPC from the mTLS certificate or ServiceAccount token. The `exporter` and `client` labels on incoming data are enforced server-side to match the authenticated identity — a compromised or misconfigured exporter cannot submit metrics under another exporter's name or inject arbitrary labels. **Failure modes:** | Scenario | Behavior | | ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Telemetry service unavailable | Exporters keep counting locally; no metrics are lost. When the exporter reconnects, the next scrape returns the full current counter state. Log push RPCs are fire-and-forget with bounded retry; log entries may be lost but device operations are unaffected. | | Telemetry pod restart | Metric state is rebuilt on the next scrape from each connected exporter — no permanent data loss. Prometheus `rate()` and `increase()` handle the apparent counter reset transparently. | | Loki unreachable | The Telemetry service buffers log entries in a bounded queue (see *Backpressure* in the control-plane section). 
On overflow, entries are dropped and `jumpstarter_telemetry_dropped_total` incremented. | | Prometheus scrape fails | No data loss — the next successful scrape triggers a fresh fan-out to connected exporters and returns current values. | The Telemetry service exposes `/healthz` (liveness) and `/readyz` (readiness, gated on Loki reachability and at least one connected exporter) endpoints for Kubernetes probes. **Scrape fan-out:** When Prometheus hits `/metrics`, the Telemetry service fans out `MetricsScrapeRequest` to **all connected exporters in parallel** and waits up to `spec.telemetry.metrics.scrapeTimeout` (default: 7 s) for responses. **Only metrics received during the current fan-out are included in the response.** Exporters that do not respond in time are omitted entirely — no cached or stale data is ever served. This eliminates any risk of double-counting from stale connections where the exporter may have already migrated to another replica (see **DD-8**). **Memory budget:** During a scrape fan-out the Telemetry service temporarily holds metric snapshots from responding exporters until the merged response is written to Prometheus. With 200 exporters each producing ~50 series (bounded by `{operation, result, driver_type}` label combinations), the peak is ~10 000 series at ~200–300 bytes each, costing ~2–3 MB. Snapshots are discarded as soon as the `/metrics` response is flushed — no metric data is retained between scrapes. ### DD-8: Multiple Telemetry replicas (HA) and persistent exporter connections **Context:** With the reverse-scrape model (see **DD-3** alternative 4 and *API / Protocol Changes*), the Telemetry service does not hold authoritative counter state — exporters maintain their own local `prometheus_client` registries. The Telemetry service only caches the latest metric snapshot per exporter. Each exporter opens a single long-lived `MetricsStream` to one Telemetry replica. **Alternatives considered:** 1. 
**Single replica** for Telemetry — no cross-pod `sum` issue; SPOF for ingest and scrape of that `Service`. 2. **Multiple replicas** behind a load balancer; each RPC updates one pod, which only advances its partial counters for the label sets it has seen. Prometheus scrapes all pods (or separate `PodMonitor` targets). In PromQL, `sum by (exporter, operation, result, driver_type) (…)` after dropping `pod` / `instance` matches the global total, as long as each real event is applied at most once in the system (counters are additive; increments are partitioned by traffic). 3. **Strong consistency** (Raft, Redis as source of truth for counters) — higher operating cost than this JEP’s v1 scope. 4. **Multiple replicas with persistent exporter connections** — each exporter opens a single long-lived `MetricsStream` to one replica (persistent by stream). Each replica only caches metric snapshots for its connected exporters. Prometheus scrapes all replicas (via `PodMonitor`); `sum by (exporter, operation, result, driver_type) (…)` after dropping `pod` / `instance` yields the exact global total with no double-counting, because each exporter’s metrics appear on exactly one replica’s `/metrics` output. On replica failure the exporter reconnects to a survivor and the next scrape returns its full current counter state — no data is lost. **Decision:** **(4)** **Rationale:** Persistent exporter connections naturally partition metric snapshots across replicas with no overlap, so `sum` across replicas is exact and double-counting is impossible. Full counter state lives on the exporter, not on the Telemetry service, so replica restarts or failovers cause no data loss. Loki log pushes (`PushLogs`) are naturally per-replica as well and do not require deduplication. Alternative (3) adds operational complexity with no benefit given the reverse-scrape model. ### DD-9: Idempotency vs. 
best-effort **Context:** With the reverse-scrape model, metrics idempotency is a non-issue — each scrape returns the full current counter state from the exporter, so there are no increments to deduplicate or double-count. The only remaining idempotency concern is for `PushLogs` RPCs, where a retry could result in duplicate log entries in Loki. **Alternatives considered:** 1. **Idempotent** log pushes (deduplication keys per `LogEntry`) — appropriate for billing- or SLO-sensitive log pipelines; requires a dedup store or Loki-side dedup. 2. **Best effort** (at-least-once) for `PushLogs` without global deduplication — simpler; rare duplicate log entries on retries. 3. **Metrics idempotency** (dedup keys on metric increments) — no longer applicable; the reverse-scrape model returns full state, making increment deduplication moot. **Decision:** (2) for `PushLogs`; metrics idempotency is not needed. **Rationale:** Duplicate log entries from occasional retries are acceptable for informative/diagnostic logs. Loki queries are tolerant of rare duplicates. No global dedup store is needed in v1; operators treat these logs as diagnostic signals, not audit trails. ### DD-10: Perses over Grafana for dashboarding **Alternatives considered:** 1. **Grafana** — mature, widely deployed, massive plugin and datasource ecosystem; governed by Grafana Labs (commercial); AGPL v3 license; custom JSON dashboard format; external to Kubernetes architecture. 2. **Perses** — CNCF project (vendor-neutral governance); Apache 2.0 license; standardized dashboard spec (CUE/JSON) with built-in static validation and SDKs for GitOps; Kubernetes-native (CRD support for dashboards-as-code); data-source focus on Prometheus, Loki, and Tempo — exactly the backends this JEP targets. **Decision:** **(2)** **Rationale:** - **License alignment** — Jumpstarter is Apache 2.0; recommending an AGPL-licensed dashboard layer introduces license friction for downstream distributors and embedders. 
- **CNCF governance** — vendor-neutral stewardship matches the project's open-source posture; no single-vendor control over the dashboard layer.
- **Kubernetes-native CRDs** — dashboards can be managed as K8s resources, fitting the same declarative, reconciler-driven model Jumpstarter already uses for Leases, Exporters, and the optional Telemetry Deployment.
- **GitOps and validation** — CUE-based specs with static validation and SDKs enable dashboard-as-code in CI pipelines, consistent with the JEP's emphasis on automation and CI integration.
- **Backend focus** — Perses targets Prometheus, Loki, and Tempo — exactly the three backends this JEP standardizes on — without carrying the cost of a broad plugin ecosystem the project does not need.

**Perses vs Grafana — practical comparison:**

| Aspect               | Perses                                  | Grafana                                     |
| -------------------- | --------------------------------------- | ------------------------------------------- |
| License              | Apache 2.0                              | AGPL v3                                     |
| Governance           | CNCF (vendor-neutral)                   | Grafana Labs (commercial)                   |
| Dashboard-as-code    | CUE/JSON spec, static validation, SDKs  | JSON export, no built-in validation         |
| K8s-native CRDs      | Yes                                     | Via third-party operator (grafana-operator) |
| Exemplar rendering   | Not yet (upstream roadmap)              | Yes (>= 7.4)                                |
| Data-source scope    | Prometheus, Loki, Tempo                 | Broad plugin ecosystem                      |
| Maturity / ecosystem | Early (CNCF sandbox/incubating)         | Mature, widely deployed                     |

The main Perses gap today is exemplar visualization. Operators who need exemplar overlays on dashboards should use Grafana alongside Perses or wait for upstream support. Grafana remains fully compatible: operators who prefer it can point it at the same standard `/metrics` and Loki endpoints, so the choice is non-exclusive. This DD only governs the *recommended* dashboard experience.
## Design Details

### Correlation and fields

*Subject to review — names and cardinality rules should be fixed before "Implemented".*

| Field / label                    | Prom label | Prom exemplar | Loki stream | Log line | Notes                                                             |
| -------------------------------- | :--------: | :-----------: | :---------: | :------: | ----------------------------------------------------------------- |
| `exporter`                       | yes        | —             | yes         | yes      | CRD name; bounded by cluster size.                                |
| `operation`                      | yes        | —             | no          | yes      | Small fixed enum (flash, power, …).                               |
| `result`                         | yes        | —             | no          | yes      | Small fixed enum (success, failure, …).                           |
| `driver_type`                    | yes        | —             | no          | yes      | Category from a predefined set in core (storage, power, …).       |
| `error_type`                     | yes        | —             | no          | yes      | Failure class (timeout, device_error, …); on errors.              |
| `direction`                      | yes        | —             | no          | yes      | tx / rx; for byte-counter and stream metrics only.                |
| `component`                      | no         | —             | yes         | yes      | Fixed set (cli, controller, router, telemetry, exporter).         |
| `namespace`                      | no         | —             | yes         | yes      | K8s namespace; bounded.                                           |
| `lease_id`                       | **no**     | yes           | **no**      | yes      | Unbounded; exemplar for drill-down.                               |
| `client`                         | **no**     | yes           | **no**      | yes      | CRD name; exemplar for client identity.                           |
| `image_digest`, `build_id`, etc. | **no**     | yes           | **no**      | yes      | From `spec.context`; included when listed in `exemplarKeys`.      |
| `trace_id` / `span_id`           | **no**     | yes           | **no**      | yes      | W3C; links metrics to traces via exemplars.                       |
| *`exporterLabels` keys*          | **no**     | yes           | **no**      | yes      | From Exporter CRD labels; included when listed in `exemplarKeys`. |

Additional `lease.spec.context` correlation fields can be added at runtime; they appear as structured log line fields and, when listed in the operator's `exemplarKeys` allowlist, as Prometheus exemplar keys (see *Exemplars for high-cardinality context* below and *Operator configuration*).
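The split between Loki stream labels and log line fields in the table above can be illustrated with a minimal sketch. It assumes a hypothetical helper `build_log_record` and made-up field values (`lab-01`, `lease-1234`, `ci-runner-7`); none of these names come from the Jumpstarter codebase. The only point is that bounded fields become stream labels while every field stays in the JSON body.

```python
import json

# Loki stream labels per the table above: small, bounded sets only.
LOKI_STREAM_LABELS = ("component", "exporter", "namespace")


def build_log_record(fields):
    """Return (stream_labels, json_line) for one structured log entry."""
    # Only bounded fields are promoted to stream labels.
    stream = {k: fields[k] for k in LOKI_STREAM_LABELS if k in fields}
    # Every field, bounded or not, stays in the JSON body for `| json` filters.
    line = json.dumps(fields, sort_keys=True)
    return stream, line


stream, line = build_log_record({
    "component": "exporter",
    "exporter": "lab-01",        # hypothetical exporter name
    "namespace": "production",
    "operation": "flash",
    "result": "failure",
    "driver_type": "storage",
    "error_type": "timeout",
    "lease_id": "lease-1234",    # hypothetical identifier
    "client": "ci-runner-7",     # hypothetical client name
})
```

A query such as `{component="exporter"} | json | lease_id="lease-1234"` would then select the entry by its stream label and filter on the unbounded field inside the line.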
### Cardinality guidelines

Unbounded identifiers (`lease_id`, `client`, `image_digest`, `trace_id`, and any operator-defined `spec.context` keys) must not be used as Prometheus metric labels or Loki stream labels. They belong inside structured log line JSON and Prometheus exemplars (see below), where Loki filter expressions (`| json | lease_id = "…"`) and dashboard exemplar overlays can surface them without inflating the label index or TSDB series count.

Rules of thumb for this JEP:

- **Prometheus labels**: each metric label dimension should have < 100 distinct values per scrape target. The label set for Jumpstarter metrics is `{exporter, operation, result, driver_type}` — all bounded enums. `error_type` is added on failure-path metrics and `direction` on byte-counter metrics. High-cardinality context is carried via exemplars, not labels.
- **Loki**: stream labels should be a small fixed set (`{component, exporter, namespace}`) to keep active stream count per tenant manageable (Grafana's guidance: < 100 k active streams). High-cardinality fields go inside the log line body.
- **Lease context fields** from `spec.context` are propagated into log line JSON and, when listed in `exemplarKeys`, into Prometheus exemplars. They never become Prometheus labels or Loki stream labels.

#### Exemplars for high-cardinality context

Prometheus exemplars attach arbitrary key-value pairs to individual counter increments and histogram observations without creating new time series. This is the primary mechanism this JEP uses to surface per-request context (`client`, `lease_id`, and `trace_id` when present) on metrics while keeping series cardinality flat.

Default exemplar keys emitted on every counter/histogram observation:

| Key        | Source            | Purpose                                          |
| ---------- | ----------------- | ------------------------------------------------ |
| `client`   | Client CRD name   | "Which client caused this spike?"                |
| `lease_id` | Lease UID         | Correlate a metric sample with lease logs.       |
| `trace_id` | W3C `traceparent` | Included **only when present** in gRPC metadata. |

`trace_id` is not synthesized by Jumpstarter — it is included only when an external caller (CI pipeline, user code) propagates a `traceparent`. Full distributed tracing (spans, storage, visualization) is deferred to a future JEP; when it lands, `trace_id` becomes a default key. Until then, omitting it saves ~45 characters of exemplar budget.

`spec.context` keys (e.g. `build_id`, `image_digest`) are included as exemplar keys when listed in the operator's `exemplarKeys` allowlist (see *Operator configuration*). Because exemplars are per-observation metadata — not label dimensions — they have zero impact on series cardinality regardless of how many distinct values appear.

**Exemplar size budget:** The OpenMetrics 1.0 limit is 128 UTF-8 characters for the combined key-value pairs in a single exemplar. The two default keys (`client`, `lease_id`) consume roughly 30–50 characters, leaving ~80–100 characters for `spec.context` entries (or more when `trace_id` is absent). To stay within budget:

1. Keys are added in the order specified by the operator's `exemplarKeys` allowlist (see *Operator CRD fields* below). The default list starts with `client`, `lease_id`; `trace_id` is added when present in the request context.
2. Remaining `spec.context` entries are appended in allowlist order until the 128-char limit is reached; keys that do not fit are silently dropped from the exemplar (they remain available in structured log lines). This gives the operator full control over which keys are prioritized when space is tight.
3. The `Lease` CRD validates `spec.context` at admission time: key names are limited to 32 characters, values to 64 characters, and the total number of entries to 8. This prevents accidental budget exhaustion and ensures exemplar truncation is rare in practice.
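The budget rules above can be sketched as a small packing routine. This is an illustration under stated assumptions, not the project's implementation: `build_exemplar` is a hypothetical name, `trace_id` is placed explicitly in the order list for simplicity, and the sketch skips a key that does not fit and keeps trying later keys (stopping at the first overflow would be an equally plausible reading of rule 2).

```python
OPENMETRICS_EXEMPLAR_LIMIT = 128  # combined length of exemplar keys and values


def build_exemplar(candidates, exemplar_keys, limit=OPENMETRICS_EXEMPLAR_LIMIT):
    """Pick exemplar key/value pairs in allowlist order within the size budget."""
    exemplar = {}
    used = 0
    for key in exemplar_keys:
        value = candidates.get(key)
        if value is None:
            continue  # key not present on this request (e.g. no traceparent)
        cost = len(key) + len(value)  # len() counts Unicode code points
        if used + cost > limit:
            continue  # assumption: drop silently, the value stays in the logs
        exemplar[key] = value
        used += cost
    return exemplar


candidates = {
    "client": "ci-runner-7",                             # hypothetical values
    "lease_id": "0b6f2a1c-5d3e-4f7a-9c21-8e4d7b6a5f10",
    "build_id": "12345",
    "image_digest": "a" * 64,                            # max-length context value
}
order = ["client", "lease_id", "trace_id", "build_id", "image_digest"]
exemplar = build_exemplar(candidates, order)
```

In this run `image_digest` is the only entry that does not fit, so it is left out of the exemplar while the earlier keys are kept.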
**Dashboard visualization**: when exemplars are enabled on a Prometheus data source, metric panels render clickable dots on each sample that carries exemplar data. Clicking a dot reveals the attached keys and can link to Loki log queries (filtered by `lease_id`) or a Tempo trace view (filtered by `trace_id`). Per-client analysis remains available via LogQL for operators who do not use exemplars: `sum by (client) (count_over_time({component="exporter"} | json | operation="flash" [5m]))`.

### Proposed metrics

*Names are illustrative; final naming should follow [Prometheus naming conventions](https://prometheus.io/docs/practices/naming/) and be fixed before "Implemented".*

| Metric name                              | Type      | Labels                                               | Description                                                          |
| ---------------------------------------- | --------- | ---------------------------------------------------- | -------------------------------------------------------------------- |
| `jumpstarter_operations_total`           | counter   | `exporter`, `operation`, `result`, `driver_type`     | Total operations performed.                                          |
| `jumpstarter_operation_duration_seconds` | histogram | `exporter`, `operation`, `result`, `driver_type`     | Duration of each operation.                                          |
| `jumpstarter_operation_errors_total`     | counter   | `exporter`, `operation`, `driver_type`, `error_type` | Errors by class (timeout, device, …).                                |
| `jumpstarter_stream_bytes_total`         | counter   | `exporter`, `driver_type`, `direction`               | Bytes transferred (tx/rx) on streams.                                |
| `jumpstarter_active_sessions`            | gauge     | `exporter`                                           | Currently active lease sessions.                                     |
| `jumpstarter_lease_acquisitions_total`   | counter   | `result`                                             | Lease acquire attempts (controller).                                 |
| `jumpstarter_telemetry_dropped_total`    | counter   | `destination`                                        | Log entries dropped due to backpressure (e.g. `destination="loki"`). |
| `jumpstarter_scrape_timeouts_total`      | counter   | `exporter`                                           | Scrape fan-out timeouts per exporter (Telemetry-side). Each timeout also emits a `severity="warning"` log entry identifying the timed-out exporter. |

All counters and histograms carry exemplar keys from the operator's `exemplarKeys` allowlist (by default `client` and `lease_id`; `trace_id` when present; `spec.context` and `exporterLabels` entries when listed) on every observation.

### Metric usage and alerting

| Metric                                   | Primary use | Alert? | Starter threshold                                                 |
| ---------------------------------------- | ----------- | :----: | ----------------------------------------------------------------- |
| `jumpstarter_operations_total`           | Dashboard   | yes    | Failure rate > 20 % over 15 min per exporter.                     |
| `jumpstarter_operation_duration_seconds` | Dashboard   | yes    | p95 > 60 s per operation type.                                    |
| `jumpstarter_operation_errors_total`     | Dashboard   | yes    | Error rate rising; group by `error_type`.                         |
| `jumpstarter_stream_bytes_total`         | Dashboard   | no     | —                                                                 |
| `jumpstarter_active_sessions`            | Dashboard   | yes    | 0 sessions for > 30 min (possible exporter issue).                |
| `jumpstarter_lease_acquisitions_total`   | Dashboard   | yes    | Failure rate > 10 % over 15 min.                                  |
| `jumpstarter_telemetry_dropped_total`    | Alerting    | yes    | Any increment (telemetry pipeline saturated).                     |
| `jumpstarter_scrape_timeouts_total`      | Alerting    | yes    | Repeated timeouts for same exporter (connectivity or load issue). |

Thresholds are suggestions; operators should tune them to their environment. The Jumpstarter operator should ship a set of example `PrometheusRule` CRDs, based on the table above, that administrators can enable and customize. These rules are opt-in and disabled by default to avoid noise in environments with different baselines.

**High-frequency byte counters:** `jumpstarter_stream_bytes_total` can be incremented at very high rates on serial and video streams. Because metrics live in the exporter's local `prometheus_client` registry, high update rates do not generate any RPC traffic — the counter is updated in-process and only serialized when the Telemetry service sends a `MetricsScrapeRequest`.
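The "update in-process, serialize only on scrape" behaviour can be shown with a stdlib-only stand-in. The `ByteCounter` class below is a hypothetical illustration and deliberately not the `prometheus_client` API; the exporter name `lab-01` is likewise made up. It only demonstrates that increments are cheap dictionary updates and that text rendering happens once, when a scrape request arrives.

```python
import threading


class ByteCounter:
    """Stand-in for a labelled counter such as jumpstarter_stream_bytes_total."""

    def __init__(self, name):
        self.name = name
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount, **labels):
        # Hot path: an in-memory update, no RPC and no serialization.
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        # Cold path: build the text exposition only when a scrape arrives.
        with self._lock:
            items = sorted(self._values.items())
        lines = [f"# TYPE {self.name} counter"]
        for key, value in items:
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{label_str}}} {value}")
        return "\n".join(lines) + "\n"


stream_bytes = ByteCounter("jumpstarter_stream_bytes_total")
for _ in range(10_000):  # e.g. one increment per serial chunk
    stream_bytes.inc(1024, exporter="lab-01", driver_type="serial", direction="tx")

snapshot = stream_bytes.render()  # would run on a MetricsScrapeRequest
```

Ten thousand increments produce a single series line in `snapshot`, which is what the Telemetry service would receive for that scrape.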
### Example queries

#### PromQL (Prometheus)

**Flash failure rate per exporter:**

```promql
sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[5m]))
/
sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[5m]))
```

**p95 flash duration per driver type:**

```promql
histogram_quantile(0.95,
  sum by (driver_type, le) (rate(jumpstarter_operation_duration_seconds_bucket{operation="flash"}[5m]))
)
```

**Top 5 busiest exporters (all operations, 1 h window):**

```promql
topk(5, sum by (exporter) (rate(jumpstarter_operations_total[1h])))
```

**Alert: exporter flash failure rate > 20% over 15 min:**

```promql
(
  sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[15m]))
  /
  sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[15m]))
) > 0.2
```

**Error breakdown by class for a specific driver:**

```promql
sum by (error_type) (rate(jumpstarter_operation_errors_total{driver_type="storage"}[1h]))
```

**Bytes per second by exporter and direction:**

```promql
sum by (exporter, direction) (rate(jumpstarter_stream_bytes_total[5m]))
```

**Exporters with repeated scrape timeouts (last 30 min):**

```promql
topk(10, sum by (exporter) (increase(jumpstarter_scrape_timeouts_total[30m])))
```

**HA Telemetry: aggregate across replicas (drop pod/instance):**

```promql
sum by (exporter, operation, result, driver_type) (rate(jumpstarter_operations_total[5m]))
```

#### LogQL (Loki)

**All flash events for a specific lease:**

```text
{component="exporter"} | json | operation="flash" | lease_id=""
```

**Flash failures per client over 5 min (log-based, no exemplars needed):**

```text
sum by (client) (
  count_over_time({component="exporter"} | json | operation="flash" | result="failure" [5m])
)
```

**Controller logs for a specific lease (post-mortem):**

```text
{component="controller"} | json | lease_id=""
```

**Error events across all exporters in a namespace:**

```text
{component="exporter", namespace="production"} | json | result="failure"
```

**Telemetry service health (its own operational logs):**

```text
{component="telemetry"} | json | level="error"
```

### Control-plane aggregation (Controller / Router / optional Telemetry)

When this mode is enabled in a deployment:

- Exporters maintain local `prometheus_client` registries and open a `MetricsStream` to the optional `jumpstarter-telemetry` service (**DD-7**). On each Prometheus scrape the Telemetry service fans out `MetricsScrapeRequest` to all connected exporters in parallel, merges the responses, and serves the combined output on `/metrics` (**DD-3**). HA (multiple replicas with persistent exporter connections) uses `sum` in PromQL (**DD-8**). Exporter and edge processes never need Loki or cluster-scrape credentials directly (**DD-5**).
- Exporters and clients (`jmp`) push structured log entries to the Telemetry service via `PushLogs`. The Telemetry service forwards these to Loki. Best-effort duplicate tolerance applies (**DD-9**).
- Controller and Router emit structured JSON logs to stdout (see **DD-4**). They do not push logs directly to Loki; a cluster-level log shipper (Promtail, Grafana Alloy, Vector, or equivalent) scrapes their pod logs and delivers them to Loki. This decouples the reconciler and session-handling hot paths from Loki availability.
- **Backpressure:** The Telemetry service uses a bounded ring buffer for the Loki log push path with a configurable depth (default: 10 000 entries, see `spec.telemetry.backpressure.queueDepth`). On overflow, dropped entries are replaced by a single **drop marker** — a standard `LogEntry` with `severity="warning"`, `component="telemetry"`, `operation="backpressure"`, and the drop count and time window placed in `extra_fields` (`{"count":"142","window_seconds":"12"}`).
Subsequent drops while the buffer is still full accumulate into the same marker rather than adding new entries, so the queue always retains one slot for the current drop summary. Because the marker is a regular `LogEntry`, consumers do not need special-case parsing to detect or exclude it. A `jumpstarter_telemetry_dropped_total` counter (labelled with `destination="loki"`) is also incremented on `/metrics` for alerting. Metrics do not need backpressure — the reverse-scrape model is pull-based and transient (no buffering between scrapes). Because the Controller and Router do not push to Loki, their lease/session operations are inherently isolated from Loki slowdowns.
- **Multi-tenancy:** write-side tenant scoping (e.g. namespace-based separation in Loki and Prometheus) is a deployment concern handled by the log shipper and Prometheus configuration. Read-side access control (who can query which metrics or logs) is likewise a deployment concern and out of scope for this JEP.
- Metric facts originate on the exporter (local `prometheus_client` counters/histograms); the Telemetry service is a transparent scrape-aggregation proxy. Controller and Router expose their own `/metrics` for Prometheus scrape and rely on the log shipper for their stdout logs.

### High-level data flow

#### Client (`jmp`)

```{mermaid}
flowchart LR
    jmp([jmp CLI]) -->|session gRPC| exp[Exporter]
    jmp -->|structured logs| tel[jumpstarter-telemetry]
```

The CLI connects to the Exporter for device sessions and sends structured logs to the Telemetry service for Loki ingest (see **DD-4**).

#### Exporter

```{mermaid}
flowchart LR
    ctrl[jumpstarter-controller] -->|lease lifecycle| exp[Exporter]
    drv[Drivers] --> exp
    exp <-->|MetricsStream| tel[jumpstarter-telemetry]
    exp -->|PushLogs| tel
```

The Controller assigns leases; the Exporter delegates to Drivers and maintains local `prometheus_client` counters.
It opens a `MetricsStream` to Telemetry for reverse-scrape and pushes structured logs via `PushLogs` (see **DD-2**, **DD-3**, **DD-5**, **DD-7**).

#### Telemetry to backends

```{mermaid}
flowchart LR
    prom[(Prometheus)] -->|scrape /metrics| tel[jumpstarter-telemetry]
    tel <-->|MetricsStream fan-out| exp[Exporters]
    tel -->|push API| loki[(Loki)]
    tel -->|JSON stdout| shipper[Log shipper]
    shipper -->|pod logs| loki
```

On each Prometheus scrape, Telemetry fans out `MetricsScrapeRequest` to all connected exporters in parallel, merges responses, and serves the combined output. Logs received via `PushLogs` are forwarded to Loki (**DD-3**, **DD-7**, **DD-8**).

#### Controller to backends

```{mermaid}
flowchart LR
    ctrl[jumpstarter-controller] -->|JSON stdout| shipper[Log shipper]
    shipper -->|pod logs| loki[(Loki)]
    ctrl -->|/metrics| prom[(Prometheus)]
```

The Controller writes structured JSON to stdout (see **DD-4**). A cluster log shipper scrapes pod logs and delivers them to Loki. The Controller exposes `/metrics` for reconciliation and lease-level counters.

#### Router to backends

```{mermaid}
flowchart LR
    router[jumpstarter-router] -->|JSON stdout| shipper[Log shipper]
    shipper -->|pod logs| loki[(Loki)]
    router -->|/metrics| prom[(Prometheus)]
```

The Router writes structured JSON to stdout (see **DD-4**). A cluster log shipper scrapes pod logs and delivers them to Loki. The Router exposes `/metrics` for routing and session-level counters.

The diagrams above summarize the reverse-scrape hub model described in *Control-plane aggregation*. For credential isolation see **DD-5**; for the Telemetry Deployment see **DD-7**; for HA with persistent exporter connections see **DD-8**; for best-effort log semantics see **DD-9**. No OpenTelemetry Collector is *required* (see **DD-6**); operators may run one *alongside* and scrape the same targets if they choose.
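The scrape fan-out summarized above (parallel requests, a bounded wait, and no stale data for exporters that miss the deadline) can be sketched with `asyncio`. The Telemetry service is described as a Go binary, so this Python sketch is purely illustrative: `fan_out_scrape` and the two fake exporters are hypothetical, and plain concatenation stands in for a real merge step that would also deduplicate `# TYPE` metadata.

```python
import asyncio


async def fan_out_scrape(exporters, timeout):
    """Scrape all exporters in parallel; return (merged_text, omitted_names)."""
    if not exporters:
        return "", []
    tasks = {name: asyncio.create_task(fetch()) for name, fetch in exporters.items()}
    done, pending = await asyncio.wait(list(tasks.values()), timeout=timeout)
    for task in pending:
        task.cancel()  # late responders are dropped, never cached
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    snapshots, omitted = [], []
    for name, task in tasks.items():
        if task in done and task.exception() is None:
            snapshots.append(task.result())
        else:
            omitted.append(name)  # would also bump a scrape-timeout counter
    return "".join(snapshots), omitted


async def _fast():
    return 'jumpstarter_active_sessions{exporter="lab-01"} 1\n'


async def _slow():
    await asyncio.sleep(1.0)  # simulates an exporter that misses the deadline
    return 'jumpstarter_active_sessions{exporter="lab-02"} 1\n'


merged, omitted = asyncio.run(
    fan_out_scrape({"lab-01": _fast, "lab-02": _slow}, timeout=0.05)
)
```

Here `lab-02` is omitted from `merged` entirely, matching the rule that only metrics received during the current fan-out are served.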
### Common open-source backends (direct integration; no mandatory OTel)

This JEP’s target wire protocols and components are Prometheus and Loki (and, if trace export is ever added, Tempo or Jaeger with native ingest or HTTP — not OTLP as a *Jumpstarter* requirement; see **DD-6**). OpenTelemetry is a parallel ecosystem: teams can run a Collector next to Jumpstarter and still scrape `/metrics` and ship logs with Promtail-class agents; the reference design does not depend on the OTel SDK in application code.

- Prometheus for metrics (and Alertmanager for routing alerts): scrape the `/metrics` endpoint, remote-write to long-term store if needed, and drive dashboards in Perses or self-hosted UIs (see **DD-10**). `kube-state-metrics` and the Prometheus Operator are common in Kubernetes; vendors often package the same projects, but this JEP refers to the open-source components by name.
- Loki (Grafana Labs, AGPL) for log storage and querying; it pairs with Perses (see **DD-10**) for search and with Promtail, Grafana Agent, or Grafana Alloy to ship logs, or with application push to Loki’s HTTP API as already discussed in the control-plane path.
- Traces (optional, future work) — if adopted, Grafana Tempo and Jaeger are typical stores; use W3C Trace Context in RPC metadata for correlation even when full trace export is off. OTLP may be *only* a convenience for operators; it is not a JEP-0013 core dependency.
- A typical Kubernetes integration path: `ServiceMonitor` + Prometheus (or a compatible remote-write consumer), a Loki endpoint for logs — any EKS, GKE, AKS, self-managed Kubernetes, or bare-metal install that runs these same projects can be the target; the implementation plan should name tested combinations (Prometheus and Loki version pairs where relevant) in `Implementation History`, not a single product bundle.

### Operator configuration

The Jumpstarter operator CR controls telemetry behavior cluster-wide.
Observability settings live under `spec.telemetry` so that administrators can tune metrics, logging, and exemplar behavior without editing code.

**Key configurable fields:**

| Field | Type | Default | Description |
| ----- | ---- | ------- | ----------- |
| `spec.telemetry.enabled` | `bool` | `false` | Deploy the optional Telemetry service. |
| `spec.telemetry.loki.url` | `string` | (none) | Loki push endpoint; optional, since Telemetry can run metrics-only without Loki. |
| `spec.telemetry.loki.secretRef` | `string` | (none) | Secret with Loki credentials (see **DD-5**). |
| `spec.telemetry.loki.tls.caSecretRef` | `string` | (none) | Secret containing a CA bundle (`ca.crt` key) to trust for the Loki endpoint. |
| `spec.telemetry.loki.tls.insecureSkipVerify` | `bool` | `false` | Disable TLS certificate verification (development/testing only). |
| `spec.telemetry.exporterLabels` | `[]string` | `[]` | Exporter-level label keys (e.g. `board-type`) copied from Exporter CRD labels into log JSON fields and exemplar candidates. |
| `spec.telemetry.metrics.exemplarKeys` | `[]string` | `["client", "lease_id"]` | Allowlist of keys to include in exemplars (including `spec.context` and `exporterLabels` keys). Only listed keys are emitted; unlisted keys are omitted even if present. |
| `spec.telemetry.metrics.driverTypeEnum` | `[]string` | `["power", "storage", "network", "serial", …]` | Allowed `driver_type` label values. Drivers reporting an unlisted type are mapped to `other`. |
| `spec.telemetry.metrics.serviceMonitor` | `bool` | `true` | Create `ServiceMonitor` CRDs for Prometheus autodiscovery. |
| `spec.telemetry.metrics.prometheusRules` | `bool` | `false` | Deploy starter `PrometheusRule` CRDs (opt-in). |
| `spec.telemetry.metrics.scrapeTimeout` | `duration` | `7s` | Max time to wait for parallel exporter responses during a `/metrics` fan-out. Should be set lower than the Prometheus-side `scrape_timeout` to leave headroom for HTTP transport. |
| `spec.telemetry.backpressure.queueDepth` | `int` | `10000` | Ring buffer depth for the Loki log push queue. |

**Example CR snippet:**

```yaml
apiVersion: operator.jumpstarter.dev/v1alpha1
kind: Jumpstarter
metadata:
  name: jumpstarter
spec:
  telemetry:
    enabled: true
    exporterLabels:
      - board-type
    logging:
      filter:
        min_severity: warning
    loki:
      url: "https://loki-gateway.monitoring.svc:3100/loki/api/v1/push"
      secretRef: "loki-credentials"
      tls:
        caSecretRef: "loki-ca-bundle"
    metrics:
      exemplarKeys:
        - client
        - lease_id
        - build_id
        - board-type
      driverTypeEnum:
        - power
        - storage
        - network
        - serial
        - console
        - video
        - composite
      serviceMonitor: true
      prometheusRules: true
      scrapeTimeout: "7s"
    backpressure:
      queueDepth: 20000
```

The `driverTypeEnum` list acts as an allowlist: drivers must select a category from this set (or fall back to `other`). This keeps the `driver_type` Prometheus label bounded and prevents cardinality surprises from third-party drivers. Administrators can extend the list for site-specific driver categories.

The `exporterLabels` list names Exporter CRD label keys whose values are copied into every structured log line as JSON fields and made available as exemplar candidates for operations involving that exporter. For example, setting `exporterLabels: ["board-type"]` means an Exporter with the label `board-type: rpi4` will include `"board-type": "rpi4"` in its structured log lines and in the exemplar candidate pool. The list is empty by default: no exporter labels are propagated unless the administrator opts in.

The `exemplarKeys` list is an **allowlist** that controls which keys are included in Prometheus exemplars. This filters *everything*: built-in keys (`client`, `lease_id`), `spec.context` keys, and `exporterLabels` keys alike.
Only keys present in `exemplarKeys` are emitted; unlisted keys are omitted even if available. This gives administrators full control over exemplar budget usage: adding `board-type` to both `exporterLabels` and `exemplarKeys` propagates hardware type into exemplars, while removing `lease_id` frees budget for other entries.

**Loki transport:** During implementation, evaluate whether the Telemetry service should connect to Loki via the HTTP push API (`/loki/api/v1/push`) or the gRPC endpoint. gRPC may offer better throughput and streaming semantics (aligned with Jumpstarter's existing gRPC infrastructure), while the HTTP API is simpler to debug and more broadly supported by Loki-compatible backends. The `spec.telemetry.loki.url` field should accept either scheme (`http://` / `grpc://`) so the choice remains a deployment decision.

**Loki TLS:** Many deployments terminate Loki behind a TLS endpoint with an internal or self-signed CA. The `spec.telemetry.loki.tls` subsection follows the same pattern as the existing operator TLS configuration: `caSecretRef` names a Kubernetes Secret whose `ca.crt` key contains the PEM-encoded CA bundle to trust. When set, the Telemetry service adds this CA to its TLS root pool when connecting to Loki. `insecureSkipVerify` disables certificate verification entirely and should only be used in development or testing environments.

## Test Plan

### Unit Tests

- Log field builders and redaction: ensure defaults strip secrets; optional fields behind flags.
- Metric registration helpers: label validation and naming conventions.

### Integration Tests

- Operator + exporter: scrape or receive metrics; assert presence of a minimal documented set of series after a known operation.
- If the control-plane forward path is implemented: with a test Loki and a Prometheus-compatible sink (or mock), assert that records arrive with expected correlation fields (`lease_id`, `exporter`, …) and that exporter pods do not require Loki or cluster-scrape credentials in their spec.
- If Telemetry runs with >1 replica: one test verifies that `sum` by business labels (dropping `pod`/`instance`) matches expected totals with persistent exporter connections (see **DD-8**).
- Lease with metadata: objects validate; events or status updates match expected structure.

### Hardware-in-the-Loop

- Flashing and power paths: at least one driver records an event and/or metrics counter on success and failure on real hardware in a lab.
- Serial and stream paths expose tx/rx byte counts.

### Independent testability

Each component must be testable in isolation without deploying the full stack:

- **Structured logging**: unit tests validate JSON output format, base fields, and `spec.context` propagation using an in-memory logger; no Loki required.
- **Exporter metrics**: unit tests verify counter/histogram registration, label correctness, and exemplar attachment using a local Prometheus registry; no Telemetry service required.
- **Telemetry service**: integration tests use mock gRPC clients and a mock Loki endpoint to verify ingest, counter aggregation, backpressure behavior, and drop markers; no real exporters required.
- **Operator configuration**: unit tests validate CRD admission (e.g. `spec.context` size limits) and `ServiceMonitor` generation.

### End-to-end (CI)

The full telemetry pipeline should be exercised in GitHub Actions CI. Evaluate feasibility of running a minimal Prometheus + Loki stack inside the CI environment (e.g.
single-binary mode containers); if resource constraints make this impractical, at minimum:

- **Loki mock or single-binary**: a lightweight Loki instance (or a mock HTTP/gRPC endpoint that validates the Loki push API contract) receives logs from the Telemetry service and asserts expected fields, stream labels, and `spec.context` propagation across the full exporter → Telemetry → Loki path.
- **Prometheus scrape**: the existing Go/Ginkgo E2E test suite performs direct HTTP scrapes of the `/metrics` endpoints on Controller, Router, and Telemetry services; no separate Prometheus instance is required. The test parses the OpenMetrics response and asserts that documented series, labels, and exemplars appear after a known operation sequence.
- **Correlation round-trip**: an E2E test runs a lease lifecycle (create → flash → power-cycle → release) and verifies that the same `lease_id` and `exporter` values appear in both scraped metrics (label or exemplar) and ingested log entries, confirming cross-signal correlation.

Feasibility of this stack should be evaluated early (Phase 1) so that all subsequent phases have E2E coverage from the start.

### Manual

- `jmp` default output remains readable; JSON structured logs are only sent to jumpstarter-telemetry for general log ingest.

## Acceptance Criteria

- [ ] Exporter (or sidecar) exposes a documented metrics surface; drivers can contribute without reimplementing the HTTP server ad hoc in each driver.
- [ ] Controller and one data-plane service emit structured logs with a documented minimum field set.
- [ ] Operator provides a section to enable metrics, with the right details/secret references to integrate with Loki for pushing logs.
- [ ] Operator attempts to auto-configure Prometheus metric scraping on the right endpoints.
- [ ] A JSON schema (or equivalent machine-readable specification) is published for the structured log format, enabling consumers to validate log entries and detect regressions in field names or types.
- [ ] Backward compatibility: existing clients and manifests without the new fields continue to work; deployments that do not use hub forwarding behave as today.

## Graduation Criteria

### Experimental (first release behind flag or doc-only)

- JEP in Discussion; partial implementation; known gaps listed in *Unresolved Questions*.

### Stable

- Acceptance criteria met; SLOs for log volume and metric cardinality documented; upgrade notes for the operator and CLI.

## Backward Compatibility

- New CRD fields and labels must be optional; existing lease flows unchanged.
- gRPC: new metadata must be additive; servers tolerate missing trace and context fields from older clients; clients ignore unknown fields where applicable.
- **`AuditStream` removal:** The `AuditStream` RPC and `AuditStreamRequest` message on `ControllerService` are removed. This RPC was never implemented or called by any client; a search across the codebase confirms zero usage outside its protobuf definition. Removing it is a no-op for all existing deployments. The new `PushLogs` RPC on `TelemetryService` supersedes the intended use case.
- `LogStreamResponse` enrichment (new optional fields `driver_type`, `operation`, `timestamp`, `structured_fields`) is purely additive and backward-compatible; existing clients ignore unknown fields.
- No removal of current default CLI behavior; JSON logging only when selected.

## Consequences

### Positive

- **Operators** can route logs and metrics to existing Prometheus, Loki, and Perses-based stacks (self-hosted or platform-managed under the hood) without a mandatory OpenTelemetry Collector in front of Jumpstarter (see **DD-6**, **DD-10**).
- **CI** can correlate a failed run to equipment and build metadata.
- **Driver authors** get a single pattern for operation counters and event emission.
- **Security-conscious** users can run with minimal log fields and no trace.
- **Operators** can keep Loki, Prometheus, and related API tokens in-cluster only; exporters keep a single Jumpstarter trust relationship (**DD-5**).
- The optional Telemetry service isolates Loki/series work from the reconciler (**DD-7**, **DD-8**); Controller and Router carry no Loki client dependency, so a Loki outage cannot affect lease operations (**DD-4**).

### Negative

- More code paths, dependencies (for example a Prometheus client library, Loki HTTP client, and structured log helpers), and operability and documentation burden.
- Operators must run a functioning cluster log shipper (Promtail, Grafana Alloy, Vector, or equivalent) to see Controller and Router logs in Loki. This is near-universal in production Kubernetes but worth documenting for minimal or dev clusters.

### Risks

- High-cardinality metadata accidentally promoted to metric *labels* could overload the TSDB. The *Cardinality guidelines* section restricts labels to bounded enums and routes variable context through exemplars and log line fields instead.
- Exemplars require the OpenMetrics exposition format and Prometheus >= 2.26 with exemplar storage enabled (on by default since Prometheus 2.39). Operators on older Prometheus versions still get full metrics and logs; exemplar-based drill-down is unavailable until they upgrade.
- Prometheus / Loki / Perses-stack version drift in the field: document tested pairs; W3C Trace Context in gRPC remains best-effort across Python and Go (no OTel SDK requirement to propagate `traceparent` where needed).

## Rejected Alternatives

- **"All metrics and facts are *generated* only in the controller"**: would miss per-exporter and per-driver truth; rejected. *Forwarding* exporter-originated series and events *through* the control-plane (with stable labels) is not the same and remains in scope (see **DD-5**).
- **"Requiring Loki- and Prometheus-ingest credentials on every exporter and edge as the only supported model"**: rejected in favor of optional hub forwarding and of cluster-native collectors that also avoid per-host secrets, even though those collectors are not Jumpstarter-specific.
- **"Mandatory OpenTelemetry SDK and Collector"** for all metrics, logs, and traces: rejected for the reference architecture; rationale in **DD-6** (optional parallel deployment by operators is still fine).
- **"Unstructured logs everywhere; parse with regex"**: rejected as unscalable for joins with traces and multi-service incidents.
- **"Mandatory full tracing for every command"**: high overhead; rejected; prefer sampling and opt-in for heavy paths.
- **"Push metric increments from exporters to telemetry"**: exporters would send `+1`/`+N` counter increments and histogram observations to the Telemetry service, which would maintain in-memory counters and expose them on `/metrics`. Rejected because: (a) counter state would be lost on Telemetry restart, (b) retries introduce double-counting requiring idempotency logic, and (c) high-frequency counters (e.g. stream bytes) generate excessive RPC traffic. The reverse-scrape model keeps full counter state on the exporter and generates zero RPC traffic between scrapes (see **DD-3** alternative 4, **DD-7**).
- **"Reuse `AuditStream` for telemetry log push"**: `AuditStream` was an unimplemented stub on `ControllerService` with no message schema for structured telemetry data. Rather than retrofitting it, a purpose-built `PushLogs` RPC on the new `TelemetryService` provides a cleaner contract and separates telemetry from the controller's reconciliation API.
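
The restart and retry arguments against pushed increments, points (a) and (b) above, can be made concrete with a small Python sketch. It is a toy model under stated assumptions, not Jumpstarter code: `PushAggregator`, `ScrapeView`, and `simulate` are hypothetical names, one lost acknowledgement is simulated by sending the same increment twice, and a Telemetry restart is modeled by discarding the in-memory objects.

```python
class PushAggregator:
    """Rejected model: Telemetry sums increments pushed by exporters."""

    def __init__(self) -> None:
        self.total = 0

    def push(self, delta: int) -> None:
        self.total += delta


class ScrapeView:
    """Reverse-scrape model: Telemetry reads the absolute value kept on the exporter."""

    def __init__(self) -> None:
        self.last_seen = 0

    def scrape(self, absolute: int) -> None:
        # Reading the same absolute value twice changes nothing (idempotent).
        self.last_seen = absolute


def simulate() -> dict[str, int]:
    exporter_counter = 0  # authoritative state lives on the exporter
    push, scrape = PushAggregator(), ScrapeView()

    for attempt in range(3):  # three flash operations
        exporter_counter += 1
        push.push(1)
        if attempt == 1:
            push.push(1)  # ack lost: the exporter retries the same +1

    scrape.scrape(exporter_counter)
    scrape.scrape(exporter_counter)  # a repeated scrape is harmless
    result = {
        "truth": exporter_counter,
        "push_before_restart": push.total,
        "scrape_before_restart": scrape.last_seen,
    }

    # Simulated Telemetry restart: in-memory state is discarded.
    push, scrape = PushAggregator(), ScrapeView()
    scrape.scrape(exporter_counter)  # the next scrape restores the full value
    result["push_after_restart"] = push.total
    result["scrape_after_restart"] = scrape.last_seen
    return result


if __name__ == "__main__":
    print(simulate())
```

In this model the pushed total overcounts after the retry and resets to zero after the restart, while the scraped value matches the exporter's counter in both situations.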

## Prior Art

- [Prometheus](https://prometheus.io/) and [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/): time-series metrics and alerting; [Prometheus naming and labels](https://prometheus.io/docs/practices/naming/) on cardinality and naming; remote write for non-scrape topologies; [Exemplars](https://prometheus.io/docs/instrumenting/exposition_formats/#exemplars) for attaching high-cardinality context to individual samples.
- [Grafana exemplar support](https://grafana.com/docs/grafana/latest/fundamentals/exemplars/): visualizing exemplars in metric panels and linking to traces or logs.
- [Loki](https://grafana.com/oss/loki/): log aggregation, label model, and push and query APIs; often combined with [Perses](https://perses.dev/) (see **DD-10**) and Grafana Agent / Alloy or [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) for log shipping.
- [Grafana Tempo](https://grafana.com/oss/tempo/) or [Jaeger](https://www.jaegertracing.io/): common trace backends (native or HTTP ingest; OTLP where the operator uses it, but not a Jumpstarter code dependency; see **DD-6**).
- [Perses](https://perses.dev/): CNCF dashboard project; Apache 2.0; Kubernetes-native CRDs; CUE/JSON spec with GitOps SDKs; focused on Prometheus, Loki, and Tempo data sources (see **DD-10**).
- [OpenTelemetry](https://opentelemetry.io/) and the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/): relevant as ecosystem and operator-side *optional* plumbing; this JEP intentionally does not adopt them in-process by default (**DD-6**).
- Other HiL / test systems often separate "run metadata" (like a Jenkins build ID) from device state; a similar separation maps well to this JEP’s lease context + events.

## Unresolved Questions

- Event retention: Loki retention policy (per-tenant, per-stream retention classes) for annotated log events (**DD-2**); whether Jumpstarter should document recommended retention defaults or leave this to operators.

## Future Possibilities

- SLOs and error budgets on lease acquisition time, flash success rate, and mean time to recovery of exporters.
- Per-tenant or per-namespace dashboards as samples in the docs.
- *Not* part of this JEP: billing usage metering (could reuse metrics later).

## Implementation History

- JEP-0013 proposed: 2026-04-23
- JEP-0013 updated based on feedback: 2026-04-29

## References

- [JEP-0000: JEP Process](JEP-0000-jep-process.md)
- [Kubernetes Events](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/)
- [W3C Trace Context](https://www.w3.org/TR/trace-context/) (`traceparent`)
- Upstream project docs for the Prometheus, Loki, and Perses versions (and optional Tempo / Jaeger if used) in a given deployment; pin versions in release notes and integration tests.

---

*This JEP is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)*